The conference (February 9) and workshops* (February 8) will take place at the Sheraton Warsaw Hotel.

*Because of the high interest in the “Introduction to Big Data Technology” and “Data Science” workshops, we are launching an additional date – February 7, 2017

EVENING PARTY

At the end of the workshops (February 8) we would like to invite all attendees
to an informal evening meeting at Browarmia. The party starts at 6:00 PM.

DESCRIPTION

Big Data Workshop is a one-day event prepared by GetInData and dedicated to everyone who wants to get to know the Big Data and Hadoop ecosystem. Participants will discover technologies such as Hadoop, Hive, Spark and Kafka through a highly practical, hands-on approach.

During the workshop you’ll act as a Big Data engineer and analyst working for a fictional company, StreamRock™, that creates an application for music streaming (similar to Spotify). The main goal of your work is to take advantage of Big Data technologies such as Hadoop, Spark and Hive to analyze various datasets about the users and the songs they play. We will process our data in both batch and streaming fashion to get data-driven answers to many business questions and to power product features that StreamRock™ builds. Every exercise will be executed on a remote multi-node Hadoop cluster.

The workshop is highly focused on practical experience. The instructor will also share insights from several years of hands-on work with Big Data technologies.

TARGET AUDIENCE

Our workshop is dedicated to everyone who is interested in Big Data and analytics: engineers, analysts, managers and others.

REQUIREMENTS

All you need to fully participate in our training program is a laptop with a web browser, a shell terminal (e.g. PuTTY) and a Wi-Fi connection. Our workshops are mostly technical (with some business context); however, you do not need any previous experience with Big Data technologies.

TRAINING MATERIALS

All participants will receive training materials in the form of PDF files containing the theory slides and an exercise manual with a detailed description of all exercises. During the workshop, exercises will be done on a remote Hadoop cluster. If you want to redo the exercises later on your own, you can use a virtual machine (e.g. Hortonworks Sandbox or Cloudera QuickStart, which can be downloaded from each vendor’s site).

TIME BOX

The workshop will last for 8 full hours, so you should reserve a whole day. Of course, there will be coffee and lunch breaks during the training.

AGENDA

8.45 - 9.15

Registration and coffee

9.15 - 10.45

Session #1: Introduction to Big Data and Apache Hadoop

  • Description of the StreamRock company, along with the opportunities and challenges that Big Data technologies bring
  • Introduction to core Hadoop technologies such as HDFS and YARN
  • Hands-on exercise: Accessing a remote multi-node Hadoop cluster

10.45 - 11.00

Coffee break

11.00 - 12.30

Session #2: Providing data-driven answers to business questions using a SQL-like solution

  • Introduction to Apache Hive
  • Hands-on exercise: Importing structured data into the cluster using HUE
  • Hands-on exercise: Ad-hoc analysis of the structured data with Hive (see the query sketch after this list)
  • Hands-on exercise: The visualisation of results using HUE
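
To give a flavour of the ad-hoc analysis exercise, here is a minimal sketch of a Hive query issued from Python with the PyHive client. The host, the table name songs_played and its columns are hypothetical; during the workshop the queries are run against the remote cluster through HUE.

    # A minimal sketch, assuming a HiveServer2 endpoint and a hypothetical
    # table songs_played(user_id, song_id, country) loaded via HUE.
    from pyhive import hive  # pip install pyhive

    conn = hive.Connection(host="hadoop-master.example.com", port=10000,
                           username="workshop")
    cursor = conn.cursor()

    # Top 10 most-played songs in the (hypothetical) StreamRock dataset
    cursor.execute("""
        SELECT song_id, COUNT(*) AS plays
        FROM songs_played
        GROUP BY song_id
        ORDER BY plays DESC
        LIMIT 10
    """)
    for song_id, plays in cursor.fetchall():
        print(song_id, plays)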

12.30 - 13.30

Lunch

13.30 - 15.30

Session #3: Implementing scalable ETL processes on the Hadoop cluster

  • Introduction to Apache Spark, Spark SQL and Spark DataFrames
  • Hands-on exercise: Implementation of an ETL job to clean and massage input data using Spark (sketched after this list)
  • Quick explanation of the Avro and Parquet binary data formats
  • Practical tips for implementing ETL processes, such as process scheduling, schema management and integration with existing systems
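
As a taste of the ETL exercise, here is a minimal PySpark sketch that reads raw events, cleans them and stores the result in Parquet. The input path, the column names and the event schema are hypothetical.

    # A minimal ETL sketch: JSON play-logs in, cleaned Parquet out
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("streamrock-etl").getOrCreate()

    # Hypothetical raw StreamRock play events stored as JSON on HDFS
    raw = spark.read.json("hdfs:///data/streamrock/raw/plays/")

    # Clean and massage: drop corrupt rows, normalise the country code,
    # keep only the columns that downstream analyses need
    cleaned = (raw
               .dropna(subset=["user_id", "song_id"])
               .withColumn("country", F.upper(F.col("country")))
               .select("user_id", "song_id", "country", "played_at"))

    # Store in a compact, splittable binary format for later queries
    cleaned.write.mode("overwrite").parquet("hdfs:///data/streamrock/clean/plays/")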

15.30 - 15.45

Coffee break

15.45 - 16.45

Session #4: Other essential tools from the Hadoop ecosystem

  • Scheduling and orchestration of tasks with Oozie
  • Data collection with Apache Kafka (a minimal producer sketch follows this list)
  • Real-time random reads and writes with Apache HBase
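
To illustrate the Kafka data-collection topic, here is a minimal sketch of publishing an event from Python using the kafka-python client. The broker address and the topic name streamrock-events are hypothetical.

    # A minimal sketch of sending one event to Kafka (kafka-python client)
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="kafka-broker.example.com:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Publish a single play event; a real collector would stream these
    producer.send("streamrock-events", {"user_id": 42, "song_id": 1337})
    producer.flush()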

16.45 - 17.00

Coffee break

17.00 - 17.30

Session #5: Summary and Q&A

  • Big Data Jeopardy game

Workshop speakers, GetInData instructors:

Maciej Arciuch

Senior Data Engineer, Grupa Allegro, GetInData

Piotr Krewski

Big Data Consultant and Co-founder, GetInData

DESCRIPTION

This one-day workshop prepared by GetInData teaches data engineers how to process unbounded streams of data in real time using popular open-source frameworks. We focus mostly on Apache Flink – the most promising open-source stream processing framework, which is more and more frequently used in production.

During the course we simulate a real-world end-to-end scenario: processing, in real time, the logs generated by users interacting with a mobile application. The technologies we use include Kafka, Flink, HDFS, YARN and Elasticsearch. All exercises will be done on remote multi-node clusters.

TARGET AUDIENCE

Data engineers who are interested in leveraging large-scale, distributed tools to process streams of data in real time.

REQUIREMENTS

Some experience coding in Java or Scala and basic familiarity with Big Data tools (HDFS, YARN).

TRAINING MATERIALS

All participants will receive training materials in the form of PDF files containing the theory slides and an exercise manual with a detailed description of all exercises. During the workshop, exercises will be done on a remote Hadoop cluster. If you want to redo the exercises later on your own, you can use a virtual machine (e.g. Hortonworks Sandbox or Cloudera QuickStart, which can be downloaded from each vendor’s site).

TIME BOX

The workshop will last for 8 full hours, so you should reserve a whole day. Of course, there will be coffee and lunch breaks during the training.

TOOLS

Apache Kafka, Apache Flink, Elasticsearch, Java, Scala

AGENDA

8.45 - 9.15

Registration and coffee

9.15 - 10.15

Session #1 - Introduction to Apache Kafka + hands-on exercises

10.15 - 10.30

Coffee break

10.30 - 11.30

Session #2 - Introduction to Elasticsearch + hands-on exercises
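
As a flavour of the hands-on part, here is a minimal sketch of indexing and searching a document with the official Elasticsearch Python client. The host, the index name app-logs and the document fields are hypothetical.

    # A minimal sketch: index one app-log event, then search for it
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://elastic.example.com:9200"])

    # Index a single (hypothetical) mobile-app log event
    es.index(index="app-logs",
             body={"user_id": 42, "event": "play", "level": "INFO"})

    # Structured full-text search over the indexed events
    hits = es.search(index="app-logs",
                     body={"query": {"match": {"event": "play"}}})
    print(hits["hits"]["total"])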

11.30 - 11.45

Coffee break

11.45 - 12.30

Session #3 - Apache Flink

  • Introduction and key concepts
  • Basic Flink API

12.30 - 13.30

Lunch

13.30 - 15.00

Session #3 - Flink cont.

  • Time & Windows
  • Integration with Kafka and Elasticsearch
  • Hands-on exercises

15.00 - 15.15

Coffee break

15.15 - 16.45

Session #4 - Flink cont.

  • Stateful operations
  • Best practices
  • Daemons and cluster infrastructure
  • Hands-on exercises

16.45 - 17.00

Coffee break

17.00 - 17.30

Session #5 - Summary and comparison with other stream processing engines (Spark Streaming and Storm)

Workshop speakers, GetInData instructors:

Tomasz Sosiński

Scala Developer, GetInData

Krzysztof Zarzycki

Big Data Architect, CTO and Co-founder, GetInData

DESCRIPTION

The workshop will be based on a real business case and real data (from a Kaggle competition). Imagine that your role is data scientist: you have data as input, and your goal is to deliver business value based on that data. If you ask what “business value” means, that is the right question, and you will learn why it is so important to define a success metric clearly.

The main focus will be on a pragmatic approach: showing how to deliver valuable things and explaining, on an intuitive level, how they work. Throughout the workshop, complex formulas and definitions will be skipped. Only practical things.

TARGET AUDIENCE

Maybe you have heard a lot about data science and machine learning, and perhaps about deep learning, with plenty of analogies for how good they are, but you have no idea how to start using them. Maybe you have tried a course on Coursera or other resources, but it is still a challenge to map that knowledge onto a real business case. Maybe you do some machine learning in your job, but feel you are missing something that could improve what you deliver (in business terms). Maybe you want to understand things better on an intuitive level, instead of a mathematical one. If the answer to at least one of these is yes, this workshop might be for you 🙂

What will you learn?

  1. Understand better how to think in order to deliver value for the business
  2. Understand better what a machine learning pipeline is and how to scale a solution
  3. Understand better, on an intuitive level, why to do this or that (without mathematical formulas or other complex definitions)

REQUIREMENTS

  1. Basic knowledge of Python
  2. Nice to know: R and/or the Python packages numpy, pandas and sklearn
  3. A lot of enthusiasm to learn new things

TIME BOX

The workshop will last for 8 full hours, so you should reserve a whole day. Of course, there will be coffee and lunch breaks during the training.

TOOLS

Spark, Python (pandas, numpy, sklearn, scipy, matplotlib, seaborn, d3js, hyperopt), xgboost, R, Vowpal Wabbit

AGENDA

8.45 - 9.15

Registration and coffee

9.15 - 10.15

Session #1: Business story, Success metric, Data story

  • Understand the business case better
  • Understand how to measure success and how important this metric is
  • Understand the data better

10.15 - 10.30

Coffee break

10.30 - 11.30

Session #2: Visualize data, Sampling, Baseline model

  • Understand how to work with bigger data sets (how to optimize reading, storing and so on)
  • Understand sampling in practice, in other words how to work with only a small part of the data and still feel confident about the final result
  • Build a simple model – a baseline (a minimal sketch follows this list)
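
As a flavour of this session, here is a minimal sketch of sampling a data set and building a trivial baseline model with pandas and sklearn. The file name train.csv and the column target are hypothetical.

    # A minimal sketch: sample the data, then build a mean-predicting baseline
    import pandas as pd
    from sklearn.dummy import DummyRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("train.csv")             # hypothetical Kaggle-style data
    df = df.sample(frac=0.1, random_state=0)  # iterate fast on a small sample

    X, y = df.drop(columns=["target"]), df["target"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The simplest baseline possible: always predict the training mean;
    # every later model has to beat this score
    baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
    print(mean_absolute_error(y_test, baseline.predict(X_test)))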

11.30 - 11.45

Coffee break

11.45 - 12.30

Session #3: Feature engineering, Power of mean and median, Custom mean/median

  • Understand what the secret sauce of machine learning is
  • Understand why the mean and median can be useful
  • Improve on the plain mean/median by computing custom means/medians for your data as features (a sketch follows this list)
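
A minimal pandas sketch of the idea: a mean or median computed per category is often a much stronger feature than the global statistic. All column names are hypothetical.

    # Global vs. per-group mean/median as engineered features
    import pandas as pd

    df = pd.DataFrame({
        "city":  ["Warsaw", "Warsaw", "Krakow", "Krakow", "Krakow"],
        "price": [10.0, 14.0, 8.0, 9.0, 13.0],
    })

    # A global statistic is a crude signal...
    df["global_mean"] = df["price"].mean()

    # ...while a custom mean/median per category is usually more informative
    df["city_mean"] = df.groupby("city")["price"].transform("mean")
    df["city_median"] = df.groupby("city")["price"].transform("median")
    print(df)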

12.30 - 13.30

Lunch

13.30 - 14.45

Session #4: Model evaluation, Strategy for training and prediction

  • Understand how to build a model that is valuable on new data (not only on the old training data)
  • Understand better how to train a model to get the best possible predictions
  • Understand how to prevent overfitting (an evaluation sketch follows this list)
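
A minimal sklearn sketch of the evaluation strategy: score the model with cross-validation on held-out folds, so the number reflects performance on new data rather than on memorised training data. The synthetic data set stands in for the real one.

    # k-fold cross-validation: train on k-1 folds, score on the held-out fold
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0)

    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_absolute_error")
    print(scores.mean())  # a training-set score alone would look deceptively good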

14.45 - 15.00

Coffee break

15.00 - 15.45

Session #5: Advanced models, Gradient Boosting, Random Forest, FTRL, Vowpal Wabbit

  • Understand, at a high level, how to build more advanced models (a short gradient-boosting sketch follows this list)
  • Understand that a tool is only a tool that lets you deliver value for the business, and that debates like “Python vs R” are pointless
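
As one example of a more advanced model, here is a minimal gradient-boosting sketch with xgboost. The synthetic data set and the parameter values are illustrative only.

    # A minimal gradient-boosting model (xgboost, sklearn-style interface)
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))  # accuracy on held-out data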

15.45 - 16.00

Coffee break

16.00 - 16.45

Session #6: Hyperparameter tuning, The power of ensembles and stacking

  • Understand why hyperparameter tuning can help you
  • Understand both naive and more advanced hyperparameter tuning processes
  • Understand how to use the output of one model as an input to another model (a tuning and stacking sketch follows this list)
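
A minimal sklearn sketch of both ideas: a naive grid search over hyperparameters (hyperopt, listed in TOOLS, covers the more advanced approach), followed by stacking, where out-of-fold predictions of one model become an input feature for the next. The data set and parameter grid are illustrative only.

    # Naive hyperparameter tuning: exhaustive grid search with cross-validation
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_predict

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 200], "max_depth": [4, 8, None]},
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)

    # Stacking: out-of-fold predictions become a feature for the next model
    meta = cross_val_predict(grid.best_estimator_, X, y, cv=5,
                             method="predict_proba")[:, 1]
    X_stacked = np.column_stack([X, meta])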

16.45 - 17.00

Session #7: Summary

  • Final words and encouragement to go forward…

Workshop speaker:

Vladimir Alekseichenko

Architect, GE Healthcare