Agenda 2015
Workshop – Introduction to Big Data technology
This optional session is targeted at participants who want to learn the ropes of the Hadoop ecosystem. During the workshop we will get to know this nice elephant and his true friends (e.g. Hive, Spark, Kafka) a little better. Specifically, we will see how the “B Team” (B for Big Data) can help store and process data in the Amazon public cloud. Every participant will play the role of a data analyst who implements various data analyses in order to retrieve the knowledge locked in large data sets and help their company make important decisions based on that data. The workshop will consist mostly of practical exercises. All exercises will be an opportunity to learn the most important technologies that make up the Apache Hadoop ecosystem. The workshop will be conducted in groups of max. 25 participants.
1. Introduction to Apache Hadoop ecosystem
- Main conceptual assumptions
- The most important technologies of the Hadoop ecosystem (incl. HDFS, YARN, Pig, Hive, Spark)
2. Using Hadoop ecosystem to solve a business problem
- Problem definition
- Hadoop cluster in Amazon public cloud
- Feeding data to a cluster using Kafka, Sqoop and HDFS CLI
- Cleaning data using Spark (a minimal sketch follows this outline)
- Ad hoc data analysis using Hive
3. Functionality of other Hadoop ecosystem tools and related software
- Task scheduling using Oozie
- Parquet columnar data format
- Near-real-time stream processing using Spark Streaming
- Fast reading and writing of data using a NoSQL database – HBase
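To give a flavour of the hands-on part, here is a minimal sketch of the "Cleaning data using Spark" step. The paths, tab-separated layout and validation rules are hypothetical stand-ins for the workshop data that is ingested via Kafka, Sqoop and the HDFS CLI and later queried ad hoc with Hive.

  import org.apache.spark.{SparkConf, SparkContext}

  // A minimal sketch with hypothetical paths and record layout.
  object CleanEvents {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("clean-events"))

      // Raw events previously ingested into HDFS (Kafka / Sqoop / HDFS CLI).
      val raw = sc.textFile("hdfs:///data/raw/events")

      val cleaned = raw
        .map(_.split("\t"))
        .filter(_.length == 3)                 // drop malformed lines
        .filter(f => f(2).forall(_.isDigit))   // keep rows whose metric is numeric
        .map(_.mkString("\t"))

      // Written back to HDFS, ready for ad hoc analysis in Hive.
      cleaned.saveAsTextFile("hdfs:///data/clean/events")
      sc.stop()
    }
  }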
Moderators: Adam Kawa and Piotr Krewski, GetInData
Big Data Infrastructure
- Apache Hadoop Technology
– Current Hadoop ecosystem and its tools
– Technology trends and advanced functionalities of Hadoop (incl. security, data mining, multitenancy)
– Hadoop – its place in the strategy and offerings of recognized database and BI vendors
- Not only Hadoop
– Scalable logging (e.g. Apache Flume or Apache Kafka)
– Real-time stream processing (e.g. Apache Storm, Spark Streaming)
– NoSQL databases (e.g. Apache HBase, Apache Cassandra)
– Lambda architecture
- Implementation, installation and configuration
– Planning and development of Hadoop clusters (selection of hardware and software, implementation methods, practical experience)
– Hadoop cluster administration (toolsets, biggest challenges, practical advice and best practices)
– Tuning – increasing stability and performance of a cluster, resolving technical problems
– Integration with data warehouse and BI tools
Big Data applications and their implementation
- Data engineering
– Real examples of applications solving actual business problems
– Frameworks used to process big data (incl. Hive, Pig, Cascading, Scalding, Crunch, Scrunch)
– Creating computation flows (incl. Oozie, Azkaban, Luigi)
– Data formats and compression algorithms (incl. ORC, Parquet, Avro, TSV)
– Best practices related to implementation, testing and debugging of big data applications
- Data science
– Roles and tasks of a data scientist in an organization
– Technologies and tools available to data scientists (data mining libraries, incl. Spark MLlib, Apache Mahout, H2O, and visualization tools)
– Data Science in practice (interesting case studies, practical advice and conclusions, Data Science on a large scale)
Professional development
- Possible career paths in Big Data and available specializations
- Global and local labour market for Big Data specialists
The program includes:
Workflow in a team – getting rid of silos. Solution architecture: things to consider (hardware, network). Physical servers – automated configuration, with examples.
The story of Spotify's Hadoop cluster – from a few machines in an office closet aggregating events for financial reports to our current 900-node cluster, which now plays a significant role.
Two members of Spotify’s Hadoop squad will reminisce about failures, frustrations and lessons learned to describe the Hadoop/Big Data architecture at Spotify and its evolution over time. We are going to explain how and why we use a number of tools (including Apache Falcon and Apache Bigtop to test changes; Apache Crunch, Scalding and Hive with Tez) to build features and provide analytics (also worth mentioning are Snakebite and Luigi – two in-house tools created to overcome common frustrations).
In order to make better decisions based on data, we need to collect the data and understand it. How to make it easier to unlock the story hidden in your data? We need to show
How to plan and manage a big data analysis project? A de facto standard: the Cross-Industry Standard Process for Data Mining (CRISP-DM). In this presentation I will introduce the CRISP-DM methodology and Ada Lab's experience with it.
Processing large quantities of data to identify devices, traits and behavior of mobile/online gamers; analysis of the collected data; and why we host our solutions on Amazon Web Services.
Apache Flink (incubated) as one of the latest additions to the Apache family of data processing engines, complementing the ecosystem of MapReduce-like or
Comment:
Flink contains (1) APIs in Java and Scala for both batch-processing and data streaming applications, (2) a translation stack for transforming these programs to parallel data flows and (3) a runtime that supports both proper streaming and batch processing for executing these data flows in large compute clusters.
Flink's batch APIs build on functional primitives (map, reduce, join, cogroup, etc.) and augment them with dedicated operators for iterative algorithms and with support for logical, SQL-like key attribute referencing (e.g., groupBy("user.country")). The Flink streaming API extends the primitives of the batch API with flexible window semantics.
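As an illustration of the batch API and its key expressions, the snippet below is a minimal sketch; the User/Event case classes and the per-country aggregation are invented for the example and are not taken from the talk.

  import org.apache.flink.api.scala._

  // Invented example types; nested fields can be referenced by name.
  case class User(name: String, country: String)
  case class Event(user: User, bytes: Long)

  object FlinkBatchSketch {
    def main(args: Array[String]): Unit = {
      val env = ExecutionEnvironment.getExecutionEnvironment

      val events = env.fromElements(
        Event(User("alice", "PL"), 100L),
        Event(User("bob", "SE"), 250L),
        Event(User("carol", "PL"), 50L))

      // Group by the logical, SQL-like key expression "user.country"
      // and sum the bytes per country.
      val bytesPerCountry = events
        .groupBy("user.country")
        .reduceGroup { (it: Iterator[Event]) =>
          val group = it.toList
          (group.head.user.country, group.map(_.bytes).sum)
        }

      bytesPerCountry.print()
    }
  }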
Internally, Flink transforms the user programs into distributed data stream programs. In the course of the transformation, Flink analyzes functions and data types (using Scala macros and reflection), and picks physical execution strategies using a cost-based optimizer. Flink’s runtime is a true streaming engine, supporting both batching and streaming. Flink operates on a serialized data representation with memory-adaptive out-of-core algorithms for sorting and hashing. This makes Flink match the performance of in-memory engines on memory-resident datasets, while scaling robustly to larger disk-resident datasets.
There is a labyrinth of decisions that we have to make to run data processing smoothly and easily. As with any puzzle, it was a long and not quite self-tuning process.
The shape of stored data, including:
– Directory structure and its implications for Hadoop cluster performance,
– The importance of declaring a record schema,
– Versioning and updating of records, a.k.a. managing historical data.
Calculation process:
– Make it faster by using proper data structures in every single step,
– Bugs are inevitable so easy recalculation is a must,
– Cluster malfunction/maintenance triggers a recalculation process without human intervention,
– Design for an arbitrary time frame but be prepared for stream data as well.
Be a good citizen:
– Small files and empty directories are an issue,
– Make it as small as possible with data compression,
– YARN resource demand based on real needs,
– CSV files should be considered only as a last resort, but they are not forbidden (see the sketch after this list).
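To make the points about directory layout, record schemas, compression and CSV concrete, here is a minimal sketch, assuming Spark and a hypothetical page-view dataset: a declared record schema, a date-partitioned directory layout, and Snappy-compressed Parquet files instead of plain CSV.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  // Hypothetical dataset and paths; a sketch of the ideas above, not production code.
  case class PageView(date: String, userId: String, url: String, bytes: Long)

  object WriteCompactEvents {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("write-compact-events"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      // Parse raw TSV lines into typed records (an explicitly declared schema).
      val views = sc.textFile("hdfs:///data/raw/pageviews")
        .map(_.split("\t"))
        .filter(_.length == 4)
        .map(f => PageView(f(0), f(1), f(2), f(3).toLong))
        .toDF()

      // Snappy-compressed, columnar Parquet partitioned by date: fewer and smaller
      // files, and a directory layout that query engines such as Hive can prune.
      sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
      views.write
        .partitionBy("date")
        .parquet("hdfs:///data/events/pageviews")

      sc.stop()
    }
  }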
Hadoop ecosystem surprises, and where to expect them. Based on experience with classic RDBMSs and single-server apps (such intuition is often misleading in the context of Hadoop!):
– Killing a Hive process with a simple SELECT query,
– Retries that overload a cluster,
– Non-deterministic measures of MR performance on the same data,
– Object reuse in the MR reduce phase (sketched after the comment below),
– A client machine and its capacity may sometimes turn out to be a bottleneck.
Comment:
For each troublemaking aspect mentioned I will provide a sample solution that was applied, or at least evaluated, in real life. The problems mentioned above may seem like a pessimistic and overwhelming vision of applying Hadoop, but that is far from the truth. We really enjoy working with the framework, but I want to talk about the dark side of Big Data implementations, because behind each success story there are hours of problem fixing. I will be more than happy if at least some of you avoid the presented pitfalls and successfully use the Hadoop ecosystem.
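As an example of the kind of pitfall and sample solution meant here, the sketch below shows the reduce-phase object-reuse problem: Hadoop passes the same value instance to the reducer on every iteration, so keeping references without copying silently corrupts results. The reducer and its record types are invented for illustration.

  import org.apache.hadoop.io.Text
  import org.apache.hadoop.mapreduce.Reducer
  import scala.collection.JavaConverters._

  // Invented reducer: concatenates all values per key into one output line.
  class ConcatValuesReducer extends Reducer[Text, Text, Text, Text] {
    override def reduce(key: Text, values: java.lang.Iterable[Text],
                        context: Reducer[Text, Text, Text, Text]#Context): Unit = {
      // WRONG: buffering the value objects themselves would store many references
      // to the SAME reused instance, so every element would equal the last value seen.
      // RIGHT: copy each value before keeping it around.
      val copies = values.asScala.map(v => new Text(v)).toList
      context.write(key, new Text(copies.map(_.toString).mkString(",")))
    }
  }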
An introduction to flexible processing on Hadoop. This talk provides an introduction to the Apache Tez processing framework – its architecture and API, and a comparison with the MapReduce approach. It also includes a case study of a non-obvious use of the session mechanism in Tez.
Proper data management and process scheduling are challenges that many data-driven companies underprioritize. Although they might not cause trouble in the short run,
How we simplified our data management and process scheduling in Hadoop with useful (but less widely adopted) open-source tools: how Falcon, HCatalog, Avro, the HDFS FsImage, CLI tools and tricks helped us address typical problems related to orchestration of data pipelines and to discovery, retention and lineage of datasets.
One of the most fascinating aspects of a data scientist's everyday work is discovering the secrets hidden in data. But maybe the results we get lead us astray? Przemek Maciołek will tell us how to model common business processes effectively using R and how to avoid some of the pitfalls.
Graph analytics in the cloud: modelling and analysing relationships through graphs can bring added value to your business. However, you need the right tool for the job.