Workshops*: 26th or 28th of February – date of your choice upon registration
Time:
9 am – 5 pm
Place:
Golden Floor Conference & Workshops Center, Aleje Jerozolimskie 123A, 02-017 Warsaw.
*We will work in a group of no more than 20 people.
Developing a production-ready Spark application
26 Feb. – 15th floor, room no. 2
28 Feb. – 15th floor, room no. 5
DESCRIPTION
During this workshop we will create a fully functioning, production-ready Spark application using day-to-day tools such as Scala, sbt and IntelliJ.
The workshop is aimed at practitioners with at least some programming background. We will provide the necessary project setup, an introduction to the Scala language, and the tools required for building the application. Previous Scala knowledge is not mandatory; general IT skills are enough.
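To give a feel for that project setup, a minimal build.sbt for such an application might look like the sketch below. The project name and library versions are illustrative assumptions, not necessarily those used in the workshop repository:

```scala
// build.sbt -- minimal, illustrative sbt definition for a Spark application.
// Name and versions are examples only; the workshop repository may differ.
name := "spark-workshop-app"
version := "0.1.0"
scalaVersion := "2.11.12"   // Spark 2.x is built against Scala 2.11

libraryDependencies ++= Seq(
  // "provided": the cluster supplies Spark at runtime, so it is not bundled
  "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided",
  // test-only dependency for exercising the processing logic
  "org.scalatest" %% "scalatest" % "3.0.5" % "test"
)
```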
REQUIREMENTS
- We use Scala as the main programming language for this course. A basic understanding of Scala or of another programming language (e.g. Python or Java) is recommended.
- It would be beneficial to have some knowledge of Spark SQL, Datasets, and DataFrames – this is not an introduction to Apache Spark.
- A laptop with JDK 8 and IntelliJ IDEA (https://www.jetbrains.com/idea/) with the Scala plugin pre-installed.
- We will provide a git repository 1–2 weeks before the scheduled training.
AGENDA
Session #1 Introduction to Scala and Spark; presentation of workshop goals
– brief introduction to Scala programming,
– discuss the workshop's project structure,
– present the end-to-end setup for testing processing logic
Session #2 Write application code to process JSON data from HDFS to Hive with Spark (see the sketch after this agenda)
– implement input data processing and formatting,
– apply custom transformations to the data,
– tune processing logic and performance
Session #3 Implement testing logic to validate processing
– run and test application code,
– exercise testing skills
Session #4 Wrap-up
– quick overview,
– discuss deployment and maintenance of Spark jobs
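As a taste of what we will build in Session #2, a minimal sketch of such a job is shown below. The HDFS path, database, table and column names are hypothetical, chosen only for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Illustrative sketch: read JSON from HDFS, apply a transformation,
// and write the result to a Hive table. Names and paths are hypothetical.
object JsonToHiveJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-to-hive")
      .enableHiveSupport()   // needed to write managed Hive tables
      .getOrCreate()

    val events = spark.read.json("hdfs:///data/raw/events/")

    val cleaned = events
      .filter(col("userId").isNotNull)                    // drop malformed rows
      .withColumn("eventDate", to_date(col("timestamp"))) // derive a partition column

    cleaned.write
      .mode("overwrite")
      .partitionBy("eventDate")
      .saveAsTable("analytics.events")   // Hive table in the "analytics" database

    spark.stop()
  }
}
```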
TIME BOX
This is a one-day event; there will be coffee breaks and a one-hour lunch break (included in the price).
We will work in a group of no more than 20 people.
Workshop trainers:
Paweł Kubit
Data Engineer, GetInData
Patrycjusz Sienkiewicz
Data Engineer, GetInData
Real-time stream processing
26 Feb. – 15th floor, room no. 4
28 Feb. – 15th floor, room no. 6
DESCRIPTION
In this one-day workshop you will learn how to process unbounded streams of data in real time using popular open-source frameworks. We focus mostly on Apache Flink and Apache Kafka; Flink is among the most promising open-source stream processing frameworks and is used in production more and more often.
During the course we simulate a real-world end-to-end scenario – processing, in real time, logs generated by users interacting with a mobile application. The technologies we use include Kafka, Flink, HDFS and YARN. All exercises will be done on remote multi-node clusters.
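To give a flavour of the APIs used during the day, here is a minimal, illustrative sketch of a Flink job counting log events per user, assuming the Flink 1.x Scala DataStream API and the universal Kafka connector. The broker address, topic name, and log format are hypothetical:

```scala
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

// Illustrative sketch: count mobile-app log lines per user in 1-minute windows.
object MobileLogCounter {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "kafka:9092") // hypothetical broker
    props.setProperty("group.id", "log-counter")

    val logs = env.addSource(
      new FlinkKafkaConsumer[String]("mobile-logs", new SimpleStringSchema(), props))

    logs
      .map(line => (line.split(",")(0), 1L)) // assume the user id is the first CSV field
      .keyBy(_._1)                           // partition the stream by user
      .timeWindow(Time.minutes(1))           // tumbling 1-minute windows
      .sum(1)                                // count events per user per window
      .print()

    env.execute("mobile-log-counter")
  }
}
```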
TARGET AUDIENCE
Data engineers who are interested in leveraging large-scale and distributed tools to process streams of data in real-time.
REQUIREMENTS
- Experience with programming in Java or Scala
- Basic familiarity with Big Data tools (HDFS, YARN)
- A working laptop (preferably personal rather than company-issued)
- The ability to log in to remote machines over SSH (corporate policies can get in the way here)
- Installed on your machine:
- Java JDK >= 1.8
- IDE – preferably IntelliJ, but Eclipse is also fine
- Maven
- SSH client (e.g. PuTTY for Windows)
- git
- SwitchyOmega plugin in your web browser
PARTICIPANT'S ROI
- Concise and practical knowledge of applying stream processing to solve business problems.
- Hands-on coding experience under the supervision of experienced Flink engineers.
- Tips on real-world applications and best practices.
TRAINING MATERIALS
All participants will receive training materials as PDF files: slides covering the theory and an exercise manual with detailed descriptions of all the exercises. During the workshop, exercises will be done on a remote Hadoop cluster. If you want to redo the exercises later on your own, you can use a virtual machine (e.g. Hortonworks Sandbox or Cloudera QuickStart, which can be downloaded from each vendor's site).
TIME BOX
The workshop lasts a full 8 hours, so you should reserve the whole day. There will, of course, be coffee and lunch breaks during the training.
We will work in a group of no more than 20 people.
AGENDA
8.45 - 9.15
Coffee and socializing
9.15 - 10.15
Session #1 - Introduction to Apache Kafka + hands-on exercises
10.15 - 10.30
Coffee break
10.30 - 12.30
Session #2 - Apache Flink
- Introduction and key concepts
- Basic Flink API
- Hands-on exercises
12.30 - 13.30
Lunch
13.30 - 15.00
Session #3 - Flink cont.
- Time & Windows
- Integration with Kafka
- Hands-on exercises
15.00 - 15.15
Coffee break
15.15 - 16.45
Session #4 - Flink cont.
- Stateful operations (see the sketch after this agenda)
- Best practices
- Daemons and cluster infrastructure
- Hands-on exercises
16.45 - 17.00
Coffee break
17.00 - 17.30
Session #5 - Summary and comparison with other stream processing engines
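For the stateful operations covered in Session #4, a minimal sketch of Flink's managed keyed state might look as follows; the event type and its fields are hypothetical, invented for illustration:

```scala
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Hypothetical event type for illustration.
case class LogEvent(userId: String, message: String)

// Illustrative sketch: keep a per-user counter in Flink keyed state and emit
// the running total with every incoming event.
class RunningCountPerUser
    extends KeyedProcessFunction[String, LogEvent, (String, Long)] {

  // Managed keyed state: Flink checkpoints it and restores it after failures.
  @transient private var count: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    count = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Long]("count", classOf[java.lang.Long]))
  }

  override def processElement(
      event: LogEvent,
      ctx: KeyedProcessFunction[String, LogEvent, (String, Long)]#Context,
      out: Collector[(String, Long)]): Unit = {
    // The state is scoped to the current key (the user id),
    // so no manual per-user lookup map is needed.
    val updated = Option(count.value()).map(_.longValue()).getOrElse(0L) + 1L
    count.update(updated)
    out.collect((event.userId, updated))
  }
}
```

In a job, it would be attached to a keyed stream, e.g. `logs.keyBy(_.userId).process(new RunningCountPerUser)`.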
Keywords: Kafka, Flink, Real-Time Processing, Low-Latency Stream Processing
Workshop trainers:
Grzegorz Kołakowski
Data Engineer, GetInData
Krzysztof Zarzycki
Big Data Architect, CTO and Co-founder, GetInData
Big Data on Kubernetes
26 Feb. – 15th floor, room no. 5
28 Feb. – 15th floor, room no. 8
DESCRIPTION
This one-day workshop teaches participants how to use Kubernetes on AWS and how to run different Big Data tools on top of it.
During the course we simulate a real-world architecture – a real-time data processing pipeline: reading data from web applications, processing it, and storing the results in distributed storage.
The technologies that we will be using include Kafka, Spark and S3.
All exercises will be done on remote Kubernetes clusters.
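As a rough sketch of what the final stage of such a pipeline could look like in code, here is a minimal Spark Structured Streaming job reading from Kafka and persisting to S3. It assumes the spark-sql-kafka connector and an S3 (s3a) filesystem are configured on the cluster; the broker, topic, and bucket names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch: stream events from a Kafka topic and persist them
// to S3 as Parquet. Broker, topic, and bucket names are hypothetical.
object KafkaToS3Pipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-s3")
      .getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "web-events")
      .load()
      .selectExpr("CAST(value AS STRING) AS json") // Kafka values arrive as bytes

    events.writeStream
      .format("parquet")
      .option("path", "s3a://example-bucket/web-events/")
      // the checkpoint lets the query resume exactly where it left off
      .option("checkpointLocation", "s3a://example-bucket/checkpoints/web-events/")
      .start()
      .awaitTermination()
  }
}
```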
TARGET AUDIENCE
Engineers who are interested in Big Data and Kubernetes.
REQUIREMENTS
- Some experience with Docker and programming
- A working laptop (preferably personal rather than company-issued)
- A working SSH client (on Windows, PuTTY will do)
- The ability to log in to remote machines over SSH (corporate policies can get in the way here)
PARTICIPANT'S ROI
- Concise and practical knowledge of using Kubernetes
- Hands-on experience with simulated real-life use cases
- Tips on real-world applications and best practices from experienced professionals.
TRAINING MATERIALS
All participants will receive training materials as PDF files: slides covering the theory and an exercise manual with detailed descriptions of all the exercises. During the workshop, exercises will be done on a remote Kubernetes cluster. If you want to redo the exercises later on your own, you can use minikube.
TIME BOX
This is a one-day event; there will be coffee breaks and a one-hour lunch break (included in the price).
We will work in a group of no more than 20 people.
AGENDA
Session 1 – Introduction to Kubernetes
- Docker recap
- Basic Kubernetes concepts and architecture
- Hands-on exercise: connecting to a Kubernetes cluster
Session 2 – Helm
- Introduction to Helm
- Hands-on exercise: deploying an app with Helm
Session 3 – Apache Kafka
- Running Apache Kafka on Kubernetes
- Using Kafka Connect to migrate data from Kafka to S3
- Leveraging Kafka REST in your web application
- Hands-on exercise: deploying data pipeline on Kubernetes
Session 4 – Apache Spark
- Spark as a stream processing engine
- Deploying Spark on Kubernetes
- Hands-on exercise: Real-time data aggregation using Spark Streaming
Keywords: Kubernetes, Docker, Helm, Kafka, Spark
Workshop trainer:
Maciej Bryński
Big Data Architect, DXC Technology