The conference (February 22) and workshops* (February 20 and 21) will take place at the Airport Hotel Okęcie (ul. Komitetu Obrony Robotników 24, Warsaw).
*Because of the high interest in the workshops, we are launching a new date – February 20th.
VENUE
STRATUS, 9th floor at Airport Hotel Okęcie
DESCRIPTION
Big Data Workshop is a one-day event dedicated to everyone who wants to understand Big Data and get a hands-on taste of working with the Hadoop ecosystem. We will be talking about technologies such as Hadoop, Hive, Spark and Kafka.
During the workshop you’ll act as a Big Data specialist working for a fictional company called StreamRock that creates a music streaming application (similar to Spotify). The main goal of your work is to take advantage of Big Data technologies such as Hadoop, Spark or Hive to analyze data about users and the songs they played. You will process the data to discover answers to many business questions and power product features that StreamRock is building. Every exercise will be executed on a remote multi-node Hadoop cluster.
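The kind of business question involved can be illustrated with a plain-Python sketch (the field names and events below are hypothetical; in the workshop the same aggregation would be expressed in Hive or Spark against data on the cluster):

```python
from collections import Counter

# Hypothetical play events: (user_id, song) pairs, a toy stand-in for the
# StreamRock listening data analysed in the workshop (there it lives on HDFS).
plays = [
    ("alice", "song_a"), ("alice", "song_b"), ("alice", "song_a"),
    ("bob", "song_b"), ("carol", "song_a"),
]

# A typical business question: which songs are played most often?
play_counts = Counter(song for _user, song in plays)
top_songs = play_counts.most_common(1)
print(top_songs)  # [('song_a', 3)]
```

On the cluster the same question becomes a one-line `GROUP BY` in Hive or a DataFrame aggregation in Spark, scaled over far more data than fits in memory.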
The workshop is highly focused on practical experience. The instructor will share interesting and practical insights gained over several years of working with Big Data technologies.
TARGET AUDIENCE
The workshop is dedicated to everyone interested in Big Data and analytics – engineers, managers and others.
REQUIREMENTS
All you need to fully participate in the workshop is a laptop with a web browser, a terminal client (e.g. PuTTY) and a Wi-Fi connection. No prior knowledge of Big Data technologies is assumed.
PARTICIPANT'S ROI
- Carefully curated knowledge of the most popular Big Data technologies
- Intuition about when and why to use different Big Data tools
- Hands-on experience with simulated real-life use cases
- Tips about real-world applications and best practices from experienced professionals
TRAINING MATERIALS
All participants will receive training materials in the form of PDF files containing slides with theory and an exercise manual with detailed descriptions of all exercises. During the workshop, exercises will be done on a remote Hadoop cluster. If you want to redo the exercises later on your own, you can use a virtual machine (e.g. Hortonworks Sandbox or Cloudera QuickStart, which can be downloaded from each vendor’s site).
TIME BOX
This is a one-day event; there will be coffee breaks and a one-hour lunch break (lunch included in the price).
AGENDA
8.45 - 9.15
Coffee and socializing
9.15 - 10.45
Session #1 - Introduction to Big Data and Apache Hadoop
- Description of the StreamRock company, along with the opportunities and challenges that come with Big Data technologies
- Introduction to core Hadoop technologies such as HDFS or YARN
- Hands-on exercise: Accessing a remote multi-node Hadoop cluster
10.45 - 11.00
Coffee break
11.00 - 12.30
Session #2 - Providing data-driven answers to business questions using a SQL-like solution
- Introduction to Apache Hive
- Hands-on exercise: Importing structured data into the cluster using HUE
- Hands-on exercise: Ad-hoc analysis of the structured data with Hive
- Hands-on exercise: Visualisation of results using HUE
12.30 - 13.30
Lunch
13.30 - 15.30
Session #3 - Implementing scalable ETL processes on the Hadoop cluster
- Introduction to Apache Spark, Spark SQL and Spark DataFrames
- Hands-on exercise: Implementation of an ETL job to clean and massage input data using Spark
- Quick explanation of the Avro and Parquet binary data formats
- Practical tips for implementing ETL processes, such as job scheduling, schema management and integration with existing systems
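The clean-and-massage step from this session can be sketched in plain Python (the log format below is hypothetical; in the workshop the same logic runs as a Spark job over DataFrames on the cluster):

```python
# Hypothetical raw log lines in "timestamp,user_id,song" form;
# real-world input always contains some malformed rows.
raw_lines = [
    "2019-02-20T10:00:00,alice,song_a",
    "not-a-valid-row",
    "2019-02-20T10:01:00,BOB,song_b",
]

def parse(line):
    """Parse one log line; return None for malformed input."""
    parts = line.split(",")
    if len(parts) != 3:
        return None
    ts, user, song = parts
    # "Massage": normalise user ids to lowercase for consistent joins.
    return {"ts": ts, "user": user.lower(), "song": song}

# Drop rows that failed to parse -- the "clean" part of ETL.
clean = [r for r in (parse(line) for line in raw_lines) if r is not None]
```

With Spark the same parse-filter-normalise pipeline is expressed as DataFrame transformations and the output is typically written back to HDFS in a binary format such as Avro or Parquet.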
15.30 - 15.45
Coffee break
15.45 - 16.45
Session #4 - Other essential tools from Hadoop Ecosystem
- Scheduling and orchestration of tasks with Oozie
- Data collection with Apache Kafka
- Real-time random reads and writes with Apache HBase
16.45 - 17.00
Coffee break
17.00 - 17.30
Session #5 - Summary and Q&A
- Big Data Jeopardy game
Keywords: Hadoop Ecosystem, Hive, Spark, Big Data Analytics, Big Data ETL
Workshop speaker, GetInData instructor:
Piotr Krewski
Big Data Consultant and Co-founder, GetInData
VENUE
CUMULUS, 9th floor at Airport Hotel Okęcie
DESCRIPTION
This one-day workshop teaches data engineers how to process unbounded streams of data in real time using popular open-source frameworks. We focus mostly on Apache Flink – the most promising open-source stream-processing framework, which is increasingly used in production.
During the course we simulate a real-world end-to-end scenario – processing, in real time, logs generated by users interacting with a mobile application. The technologies we use include Kafka, Flink, HDFS and YARN. All exercises will be done on remote multi-node clusters.
TARGET AUDIENCE
Data engineers who are interested in leveraging large-scale and distributed tools to process streams of data in real-time.
REQUIREMENTS
Some experience coding in Java or Scala and basic familiarity with Big Data tools (HDFS, YARN).
PARTICIPANT'S ROI
- Concise and practical knowledge of applying stream processing to solve business problems.
- Hands-on coding experience under the supervision of experienced Flink engineers.
- Tips about real world applications and best practices.
TRAINING MATERIALS
All participants will receive training materials in the form of PDF files containing slides with theory and an exercise manual with detailed descriptions of all exercises. During the workshop, exercises will be done on a remote Hadoop cluster. If you want to redo the exercises later on your own, you can use a virtual machine (e.g. Hortonworks Sandbox or Cloudera QuickStart, which can be downloaded from each vendor’s site).
TIME BOX
The workshop will last for 8 full hours, so you should reserve a full day for it. Of course, there will be coffee and lunch breaks during the training.
AGENDA
8.45 - 9.15
Coffee and socializing
9.15 - 10.15
Session #1 - Introduction to Apache Kafka + hands-on exercises
10.15 - 10.30
Coffee break
10.30 - 12.30
Session #2 - Apache Flink
- Introduction and key concepts
- Basic Flink API
- Hands-on exercises
12.30 - 13.30
Lunch
13.30 - 15.00
Session #3 - Flink cont.
- Time & Windows
- Integration with Kafka
- Hands-on exercises
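The windowing concept covered in this session can be illustrated with a plain-Python sketch of tumbling windows (event times and events are hypothetical; Flink's DataStream API expresses this declaratively and also handles out-of-order and late events, which this toy version ignores):

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds: one-minute tumbling windows

# Hypothetical (event_time_seconds, user) events from a mobile app.
events = [(5, "alice"), (42, "bob"), (61, "alice"), (119, "carol"), (130, "bob")]

# Assign each event to the tumbling window containing it and count per window.
counts = defaultdict(int)
for ts, _user in events:
    window_start = (ts // WINDOW_SIZE) * WINDOW_SIZE
    counts[window_start] += 1
# Windows: [0,60) -> 2 events, [60,120) -> 2 events, [120,180) -> 1 event
```

In Flink the equivalent is a keyed stream with a tumbling event-time window and a count aggregate; the hard parts the framework adds are watermarks, state and fault tolerance.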
15.00 - 15.15
Coffee break
15.15 - 16.45
Session #4 - Flink cont.
- Stateful operations
- Best practices
- Daemons and cluster infrastructure
- Hands-on exercises
16.45 - 17.00
Coffee break
17.00 - 17.30
Session #5 - Summary and comparison with other stream processing engines (Spark Streaming and Storm)
Keywords: Kafka, Flink, Real Time Processing, Low Latency Stream Processing
Workshop speaker, GetInData instructor:
Krzysztof Zarzycki
Big Data Architect, CTO and Co-founder, GetInData
Dawid Wysakowicz
Software Engineer, Ververica (formerly data Artisans)
VENUE
ATLANTIS, 1st floor at Airport Hotel Okęcie
DESCRIPTION
This one-day workshop teaches participants how to apply data-science methods to large amounts of data. We focus mostly on Apache Spark, Spark ML (the Spark library dedicated to machine learning) and machine-learning model tuning using Apache Spark.
During the course we simulate a real-world end-to-end scenario – creating a working, production-ready model for text categorisation. The technologies we use include Python, Apache Spark, Spark ML and Zeppelin. Exercises will be done on remote multi-node clusters.
TARGET AUDIENCE
The workshop is dedicated to everyone interested in Big Data, analytics, text mining and data science.
REQUIREMENTS
All you need to fully participate in the workshop is a laptop with a web browser, a terminal client (e.g. PuTTY) and a Wi-Fi connection. No prior knowledge of Big Data technologies or data science techniques is assumed.
PARTICIPANT'S ROI
- Knowledge of how to approach data analysis and data science with Apache Spark
- Knowledge of how to work with text data
- Knowledge of two approaches to machine learning with Spark – using Spark ML and sklearn
- Hands-on experience with a simulated real-life use case
- Ability to tackle business problems requiring text mining techniques
TRAINING MATERIALS
All participants will receive training materials in the form of PDF files containing slides with theory and an exercise manual with detailed descriptions of all exercises. During the workshop, exercises will be done on a remote Hadoop cluster. If you want to redo the exercises later on your own, you can use a virtual machine (e.g. Hortonworks Sandbox or Cloudera QuickStart, which can be downloaded from each vendor’s site).
TIME BOX
This is a one-day event; there will be coffee breaks and a one-hour lunch break (lunch included in the price).
AGENDA
8.45 - 9.15
Coffee and socializing
9.15 - 11.15
Session #1 - Working with text data
- What text data is / how we can use it / how to store it
- Popular methods for text embeddings
- Popular classification methods for text
- Hands-on exercise: A first small model for text categorization
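A "first small model" for text categorization can be sketched in plain Python with a toy keyword-overlap classifier (the categories and keyword lists are hypothetical; the workshop builds real bag-of-words models with sklearn and Spark ML):

```python
# Hypothetical keyword sets per category -- a toy stand-in for a trained model.
CATEGORY_KEYWORDS = {
    "cars": {"engine", "wheel", "diesel"},
    "music": {"guitar", "album", "concert"},
}

def categorize(text):
    """Pick the category whose keywords overlap the text's words the most."""
    words = set(text.lower().split())
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    return max(scores, key=scores.get)

print(categorize("Used diesel engine for sale"))  # cars
```

A real model replaces the hand-written keyword sets with features learned from labelled examples (e.g. term counts or TF-IDF weights fed into a classifier), but the shape of the problem – map words to category scores, pick the best – is the same.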
11.15 - 11.30
Coffee break
11.30 - 12.30
Session #2 - General Apache Spark
- Create and transform DataFrames
- ETL process with Spark
- Perform exploratory data analysis (EDA)
- Hands-on exercise: Loading and exploring data for ads categorization
12.30 - 13.30
Lunch
13.30 - 15.30
Session #3 - Basic Machine Learning on Spark
- Introduction to the sklearn library
- Hands-on exercise: Using Spark to search for the best model parameters for sklearn
- Overview of Spark ML and the differences between sklearn and Spark ML
- Hands-on exercise: Building a first model with Spark ML
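The parameter-search idea behind the sklearn exercise can be sketched in plain Python: evaluate each candidate independently and keep the best (the grid and scoring function below are hypothetical). It is exactly this independence that makes the search easy to distribute with Spark:

```python
from itertools import product

# Hypothetical hyper-parameter grid for some model.
grid = {"max_depth": [2, 4, 8], "learning_rate": [0.1, 0.01]}

def evaluate(params):
    """Stand-in for training + cross-validating a model with these params."""
    # Toy score that peaks at max_depth=4 with the larger learning rate.
    return -abs(params["max_depth"] - 4) + params["learning_rate"]

# Expand the grid into a list of candidate parameter dicts.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]

# Each evaluate() call is independent, so with Spark this loop becomes
# roughly sc.parallelize(candidates).map(evaluate) across the cluster.
best = max(candidates, key=evaluate)
```

In the workshop the `evaluate` stand-in is a real sklearn fit-and-score step, and Spark farms those independent evaluations out to the cluster's workers.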
15.30 - 15.45
Coffee break
15.45 - 16.45
Session #4 - Spark ML pipeline
- Spark ML pipeline for text analysis
- Hands-on exercise: Build Spark ML pipeline for ads categorisation
16.45 - 17.00
Coffee break
17.00 - 17.30
Session #5 - Summary and Q&A
Keywords: Spark, Machine Learning, Text Mining, MLlib, Data Science
Workshop speakers:
Rafał Prońko
Machine Learning Developer, YND
Tomasz Żukowski
Data Analyst, GetInData