Agenda 2015


Day 1, February 25, 2015

Workshop – Introduction to Big Data technology

This optional meeting is targeted at participants who want to learn the ropes of the Hadoop ecosystem. During the workshop we will get to know this friendly elephant and its true friends (e.g. Hive, Spark, Kafka) better. Specifically, we will see how the “B Team” (B for Big Data) can help store and process data in the Amazon public cloud. Every participant will play the role of a data analyst who implements various data analyses in order to unlock the knowledge hidden in large data sets and help their company make important decisions based on that data. The workshop consists mostly of practical exercises. Each exercise is an opportunity to learn the most important technologies of the Apache Hadoop ecosystem. The workshop will be conducted in groups of at most 25 participants.

1. Introduction to Apache Hadoop ecosystem

  • Main conceptual assumptions
  • The most important technologies of the Hadoop ecosystem (incl. HDFS, YARN, Pig, Hive, Spark)

2. Using Hadoop ecosystem to solve a business problem

  • Problem definition
  • Hadoop cluster in Amazon public cloud
  • Feeding data to a cluster using Kafka, Sqoop and HDFS CLI
  • Cleaning data using Spark
  • Ad hoc data analysis using Hive
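
The flavor of such an ad hoc Hive analysis can be sketched with Python's sqlite3 module as a lightweight stand-in for SQL-on-Hadoop (the table and column names below are invented for illustration; a real HiveQL query would run over data stored in HDFS):

```python
import sqlite3

# In-memory table standing in for a Hive table of page-view events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, url TEXT, visits INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [("u1", "/home", 3), ("u2", "/home", 5), ("u1", "/docs", 2)],
)

# An ad hoc aggregation, much like a HiveQL GROUP BY over a large data set.
rows = conn.execute(
    "SELECT url, SUM(visits) FROM page_views GROUP BY url ORDER BY url"
).fetchall()
print(rows)  # [('/docs', 2), ('/home', 8)]
```

The point of Hive is exactly this: analysts keep writing familiar SQL while the engine translates it into distributed jobs on the cluster.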

3. Functionality of other Hadoop ecosystem tools and related software

  • Task scheduling using Oozie
  • The Parquet columnar data format
  • Real-time stream processing using Spark Streaming
  • Fast reading and writing of data using a NoSQL database – HBase
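
The window-based computation at the heart of Spark Streaming can be sketched in plain Python (the events, timestamps and 10-second tumbling window below are invented for illustration):

```python
from collections import Counter

# (timestamp_in_seconds, event_type) pairs standing in for a live stream.
events = [(1, "click"), (4, "view"), (11, "click"), (13, "click"), (22, "view")]

WINDOW = 10  # tumbling window length in seconds

# Assign each event to its window and count event types per window,
# roughly what a windowed countByValue() does in a streaming engine.
counts = Counter()
for ts, kind in events:
    window_start = (ts // WINDOW) * WINDOW
    counts[(window_start, kind)] += 1

print(sorted(counts.items()))
```

A real streaming engine does the same grouping continuously and in parallel, emitting one result per window as time advances.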

Moderators: Adam Kawa and Piotr Krewski, GetInData

Day 2, February 26, 2015

Big Data Infrastructure

  • Apache Hadoop Technology
    – Current Hadoop ecosystem and its tools
    – Technology trends and advanced functionalities of Hadoop (incl. security, data mining, multitenancy)
    – Hadoop – its place in the strategy and offerings of established database and BI vendors
  • Not only Hadoop
    – Scalable log collection (e.g. Apache Flume or Apache Kafka)
    – Real-time stream processing (e.g. Apache Storm, Spark Streaming)
    – NoSQL databases (e.g. Apache HBase, Apache Cassandra)
    – Lambda architecture
  • Implementation, installation and configuration
    – Planning and development of Hadoop clusters (selection of hardware and software, implementation methods, practical experience)
    – Hadoop cluster administration (toolsets, biggest challenges, practical advice and best practices)
    – Tuning – increasing stability and performance of a cluster, resolving technical problems
    – Integration with data warehouse and BI tools


Big Data applications and their implementation

  • Data engineering
    – Real examples of applications solving actual business problems
    – Frameworks used to compute big data (incl. Hive, Pig, Cascading, Scalding, Crunch, Scrunch)
    – Creating computation flows (incl. Oozie, Azkaban, Luigi)
    – Data formats and compression algorithms (incl. ORC, Parquet, Avro, TSV)
    – Best practices related to implementation, testing and debugging of big data applications
  • Data science
    – Roles and tasks of a data scientist in an organization
    – Technologies and tools available to data scientists (data mining libraries, incl. Spark MLlib, Apache Mahout, H2O, and visualization tools)
    – Data Science in practice (interesting case studies, practical advice and conclusions, Data Science on a large scale)


Professional development

  • Possible career paths in Big Data and available specializations
  • Global and local labour market for Big Data specialists


The program includes:

How to build a Hadoop cluster

Workflow in a team – getting rid of silos. Solution architecture: things to consider (hardware, network). Physical servers – automatic configuration exemplified by Cisco UCS. Virtual servers – automatic installation of RHEL servers. Automatic configuration of systems exemplified by BMC BladeLogic. Hadoop installation with Ambari – Hortonworks HDP. Monitoring: integration with external systems – from Nagios to OP. Security: implementation of Kerberos and LDAP using IdM/IPA. Development plans: integration with Active Directory and a comment on Ranger (the importance of privilege management).

The Evolution of Hadoop at Spotify: Through Failures and Pain

The story of Spotify's Hadoop cluster – from a few machines in an office closet aggregating events for financial reports, to our current 900-node cluster playing a significant role in many of the features you see in our application today.
Two members of Spotify's Hadoop squad will reminisce about failures, frustrations and lessons learned to describe the Hadoop/Big Data architecture at Spotify and its evolution over time. We will explain how and why we use a number of tools to build features and provide analytics (including Apache Falcon and Apache Bigtop to test changes, and Apache Crunch, Scalding and Hive with Tez); also worth mentioning are Snakebite and Luigi – two in-house tools created to overcome common frustrations.

What is the value of data? It depends on how we can use it.

In order to make better decisions based on data, we need to collect the data and understand it. How can we make it easier to unlock the story hidden in your data? By presenting it properly. Good data visualization can lead to important discoveries, while bad visualization can lead you astray. The talk covers rules for the proper presentation of data, with practical examples.

Questions, Big Data, and a Quest for Answers

How to plan and manage a big data analysis project? The de facto standard is the Cross-Industry Standard Process for Data Mining (CRISP-DM). During the presentation I will introduce the CRISP-DM methodology and Ada Lab's experience with it.
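
CRISP-DM organizes such a project into six iterative phases; a minimal sketch (the helper function is hypothetical – only the phase names and their order come from the CRISP-DM model):

```python
# The six CRISP-DM phases, in their canonical order.
CRISP_DM_PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

def next_phase(current: str) -> str:
    """Return the phase that follows `current`. CRISP-DM is iterative,
    so after Deployment a project typically loops back to the start."""
    i = CRISP_DM_PHASES.index(current)
    return CRISP_DM_PHASES[(i + 1) % len(CRISP_DM_PHASES)]

print(next_phase("Modeling"))  # Evaluation
```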

Usage of Big Data in the gaming industry

Processing large quantities of data that identify the devices, traits and behavior of mobile/online gamers; analysis of the collected data; and why we host our solutions on Amazon Web Services.

Apache Flink: Fast and reliable large-scale data processing

Apache Flink (incubated) is one of the latest additions to the Apache family of data processing engines, complementing the ecosystem of MapReduce-like and Spark-like engines. It is as fast as in-memory engines while providing the reliability of Hadoop, and it is compatible with the Hadoop ecosystem (it runs on YARN, reads data from HDFS and HBase, and supports mixing existing Hadoop Map and Reduce functions into Flink programs). The talk presents Flink from a user perspective – its APIs and the most interesting design points behind Flink – and closes with an outlook on Flink's development roadmap.

Flink contains (1) APIs in Java and Scala for both batch-processing and data streaming applications, (2) a translation stack for transforming these programs to parallel data flows and (3) a runtime that supports both proper streaming and batch processing for executing these data flows in large compute clusters.

Flink’s batch APIs build on functional primitives (map, reduce, join, cogroup, etc.), augmenting them with dedicated operators for iterative algorithms and with support for logical, SQL-like key attribute referencing (e.g., groupBy(“…”)). The Flink streaming API extends the primitives of the batch API with flexible window semantics.
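
The semantics of these batch primitives – grouping on a key attribute and aggregating within each group – can be sketched with Python's itertools (the records and field positions below are invented for illustration):

```python
from itertools import groupby
from operator import itemgetter

# Records as (word, count) pairs, the classic batch-processing example.
records = [("big", 1), ("data", 1), ("big", 1), ("flink", 1)]

# The equivalent of groupBy(0).sum(1): group on field 0,
# then sum field 1 within each group.
records.sort(key=itemgetter(0))
result = [
    (key, sum(count for _, count in group))
    for key, group in groupby(records, key=itemgetter(0))
]
print(result)  # [('big', 2), ('data', 1), ('flink', 1)]
```

In a real engine the same logical operation is partitioned by key and executed in parallel across the cluster.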

Internally, Flink transforms the user programs into distributed data stream programs. In the course of the transformation, Flink analyzes functions and data types (using Scala macros and reflection), and picks physical execution strategies using a cost-based optimizer. Flink’s runtime is a true streaming engine, supporting both batching and streaming. Flink operates on a serialized data representation with memory-adaptive out-of-core algorithms for sorting and hashing. This makes Flink match the performance of in-memory engines on memory-resident datasets, while scaling robustly to larger disk-resident datasets.
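
The out-of-core sorting mentioned above follows the classic external merge-sort pattern: sort memory-sized runs independently, then k-way merge them. A stdlib sketch (the chunk size here stands in for available memory):

```python
import heapq

def external_sort(values, chunk_size=4):
    """Sort `values` as if memory held only `chunk_size` items at once:
    sort each chunk independently, then k-way merge the sorted runs,
    the same pattern a spilling runtime uses with runs on disk."""
    runs, chunk = [], []
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            runs.append(sorted(chunk))
            chunk = []
    if chunk:
        runs.append(sorted(chunk))
    # heapq.merge streams through the runs, never holding them all in memory.
    return list(heapq.merge(*runs))

print(external_sort([9, 1, 7, 3, 8, 2, 6, 4, 5]))
```

This is why such an engine matches in-memory performance when the data fits (one run, no merge) yet degrades gracefully when it does not.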


Dive into the wilderness of pitfalls and dark corners of storing and processing data in Hadoop

There is a labyrinth of decisions to make before data processing runs smoothly and easily. As with any puzzle, it was a long and not quite self-tuning process, with many dead ends and very long shortcuts.
The shape of stored data, including:
– Directory structure and its implications for Hadoop cluster performance,
– The importance of declaring a record schema,
– Versioning and updating of records, a.k.a. managing historical data.
The calculation process:
– Make it faster by using proper data structures at every single step,
– Bugs are inevitable, so easy recalculation is a must,
– Cluster malfunction/maintenance should trigger a recalculation process without human intervention,
– Design for an arbitrary time frame, but be prepared for streaming data as well.
Being a good citizen:
– Small files and empty directories are an issue,
– Make data as small as possible with compression,
– Request YARN resources based on real needs,
– CSV files should be considered a last resort, but they are not forbidden.
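
The compression point is easy to demonstrate with the stdlib: repetitive, log-like text (the TSV-style event records below are invented for illustration) shrinks dramatically under gzip:

```python
import gzip

# Repetitive TSV-style event records, the kind of data a cluster ingests.
raw = ("2015-02-26\tuser_1\tpage_view\n" * 1000).encode("utf-8")
compressed = gzip.compress(raw)

ratio = len(compressed) / len(raw)
print(len(raw), len(compressed), round(ratio, 3))
```

On a cluster the same effect cuts storage, network transfer and scan time at once, which is why compressed columnar formats are preferred over plain CSV.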

Hadoop ecosystem surprises, and where to expect them, based on experience with classic RDBMSs and single-server apps (such intuition is often misleading in the context of Hadoop!):
– Killing a Hive process with a simple SELECT query,
– Retries that overload a cluster,
– Non-deterministic measurements of MapReduce performance on the same data,
– Object reuse in the MapReduce reduce phase,
– The client machine and its capacity may sometimes turn out to be a bottleneck.

For each troublesome aspect mentioned, I will provide a sample solution that was applied, or at least evaluated, in real life. The problems above may paint a pessimistic and overwhelming vision of applying Hadoop, but that is far from the truth. We really enjoy working with the framework, but I want to talk about the dark side of Big Data implementation, because behind each success story there are hours of problem fixing. I will be more than happy if at least some of you avoid the presented pitfalls and use the Hadoop ecosystem successfully.


TEZ means fast

An introduction to flexible processing on Hadoop. This talk provides an introduction to the Apache Tez processing framework – its architecture and API, and a comparison with the MapReduce approach. It will also include a case study of a non-obvious use of the session mechanism in Tez.

Simplified Data Management and Process Scheduling in Hadoop

Proper data management and process scheduling are challenges that many data-driven companies underprioritize. Although they might not cause trouble in the short run, they become a nightmare when your cluster grows. However, even when you realize the problem, you might not see that possible solutions are right under your nose.

How we simplified our data management and process scheduling in Hadoop with useful (but less widely adopted) open-source tools, and how Falcon, HCatalog, Avro, the HDFS FsImage, and various CLI tools and tricks helped us address typical problems related to the orchestration of data pipelines and the discovery, retention and lineage of datasets.


Tales from Dataland, or do I really know what I am doing?

One of the most fascinating aspects of a Data Scientist's everyday work is discovering the secrets hidden in data. But might the results we get be leading us astray? Przemek Maciołek will show how to model common business processes effectively using R and how to avoid some of the pitfalls.

Graph analytics in the cloud

Modelling and analysing relationships through graphs can bring added value to your business. However, you need the right tool for the job. Traditional tools for general-purpose analytics, like Hive, Pig or MapReduce in general, are not suitable for graph analytics. In recent years, a number of graph-specific processing systems have been released. In this talk, we will look at the general programming paradigm they share and how it can be leveraged to express graph analytics.