AGENDA 2018
Changes in the order of presentations might occur.
8.00 - 9.00
Registration, coffee and networking session
During registration, we cordially invite you to participate in a networking session, which aims to help attendees get to know each other and exchange experiences.
9.00 - 9.15
Conference opening
Przemysław Gamdzyk
CEO & Meeting Designer, Evention
Adam Kawa
CEO and Co-founder, GetInData
9.15 – 10.45 Plenary session
9.15 - 9.45
Transforming our relationship with clients by AI
Bolke de Bruin
Head of Advanced Analytics Technology, ING
9.45 - 10.15
Never Underestimate the Power of a Single Node
Recent developments in GPU hardware and storage technology have changed how we do data analysis and machine learning. The capabilities of these technologies on a single node have grown manyfold in the last five years, while the growth in network speed has lagged behind. I will talk about the overall ML lifecycle and the challenges we face in doing ML at scale, from protecting your Uber account to making self-driving cars a reality. Then I want to focus on an important part of the ML lifecycle: data/ML exploration and experimentation. In large companies like Uber, data scientists are inclined to use shared Hadoop infrastructure for all their needs. For data exploration, this is inefficient for the user and also makes the cluster run slow. I will talk about our new solution to this problem: a high-powered node that lets us work with hundreds of gigabytes to a few terabytes of data interactively, without paying the overhead of a distributed system. I will also talk about some of the interesting machine learning and infrastructure problems that I face in my new role on Uber’s self-driving team.
Karthik Ramasamy
Machine Learning Engineer, Google
10.15 - 10.45
Assisting millions of active users in real-time
Nowadays many companies are becoming data-rich and data-intensive: they have millions of users generating billions of interactions and events per day.
These massive streams of complex events can be processed and reacted upon to, e.g., offer new products and next best actions, communicate with users, or detect fraud – and the quicker we can do it, the more value we can generate.
In this talk we will present how, in a joint development with our client and in just a few months of effort, we built a complex event processing platform for their intensive data streams from the ground up. We will share how the system runs marketing campaigns and detects fraud by following the behavior of millions of users in real time and reacting to it instantly. The platform, designed and built with Big Data technologies to scale infinitely and cost-effectively, already ingests and processes billions of messages, or terabytes of data, per day on a still small cluster. We will share how we leveraged the current best-of-breed open-source projects, including Apache Flink, Apache NiFi and Apache Kafka, but also what interesting problems we needed to solve. Finally, we will share where we’re heading next, what use cases we’re going to implement, and how.
Alexey Brodovshuk
Software Development Supervisor, Kcell
Krzysztof Zarzycki
Big Data Architect, GetInData
Krzysztof Zarzycki
Big Data Architect, CTO and Co-founder, GetInData
10.45 - 11.15
Coffee break
11.15 – 15.30 Simultaneous sessions
GALAXY I, 1st floor
GALAXY II, 1st floor
GALAXY III, 1st floor
CARAVELE, ground floor
Architecture, Operations & Deployment
This track is dedicated to system architects, administrators and people with DevOps skills who are interested in technologies and best practices for planning, building, installing, managing and securing their Big Data infrastructure in enterprise environments – both on-premises and in the cloud.
Data Engineering
This track is the place for developers to learn about tools, techniques and innovative solutions to collect, store and process large volumes of data. It covers topics like data ingestion, ETL, distributed engines, process scheduling, metadata and schema management, distributed datastores and more.
Analytics & Data Science
This track includes real case studies demonstrating how Big Data is used to address a wide range of business problems. You can find here talks about large-scale Machine Learning, A/B tests and visualizing data, as well as various analyses that enable making data-driven decisions and feed the personalized features of data-driven products.
Real-Time Analytics
This track covers technologies, strategies and use-cases for real-time data ingestion and deriving real-time actionable insights from the flow of events coming from sensors, devices, users, and front-end systems.
Host:
Piotr Bednarek, GetInData
Host:
Łukasz Suchenek, Evention
Host:
Klaudia Zduńczyk, GetInData
Host:
Dawid Wysakowicz, GetInData
11.15 - 11.45
Edge to Enterprise analytics platform – a case study
During the presentation you will learn about real-life usage scenarios of the Edge to Enterprise analytics platform, and how it simplifies the implementation and maintenance of all the needed components.
The platform provides a cohesive guidance layer that aligns the expertise of partners and providers of sensors, applications, data analysis, security and services oversight, and gets them and their assets all operating effectively together, to promote timely decisions supporting revenue generation and cost control. This architecture enables multiple use cases, such as energy and utilities, connected automobiles, smart manufacturing and many more. Some of them will be covered during this presentation.
Ernst Kratky
Big Data Analytics & AI Sales Lead – Datacenter EMEA, Cisco
Michał Kudelski
Senior Business Solutions Manager, SAS Institute
11.15 - 11.45
Building a Modern Data Pipeline: Lessons Learned
Adform is one of the biggest European ad-tech companies – for example, our RTB engine at peak handles ~1m requests per second, each in under 100 ms, producing ~20TB of data daily.
Keywords: stream processing, kafka, event sourcing, big data
Saulius Valatka
Technical Lead, Adform
11.15 - 11.45
Executing the Data 180, moving from explaining surprises to predicting the future
The explosion in data and data technologies in the last decade has opened an opportunity for traditional enterprises to exploit their legacy. Nordea bank can trace its history back over 100 years.
Alasdair Anderson
Executive Vice President, Nordea
11.15 - 11.45
Apache Flink: Better, Faster & Uncut
This talk will start with a brief introduction to stream processing and Flink itself. Next, we will take a look at some of the most interesting recent improvements in Flink, such as incremental checkpointing…
Keywords: Apache Flink, streaming, data processing engine
Piotr Nowojski
Software Engineer, data Artisans
11.45 - 11.50
Technical break
11.50 - 12.20
Data Fabric Bridging On-Premise and Cloud
Ab Initio’s approach is to provide a consistent set of capabilities and applications which can span diverse systems hosted either on-premises or off-premises. In other words, Ab Initio acts as the data fabric which simplifies and tightly integrates data movement across systems.
… and all of this presented on the basis of our real-world customers’ cases from the banking and media sectors.
Firat Tekiner
Data Scientist and Big Data Architect, AB Initio
11.50 - 12.20
Time Series Jobs Scheduling at Criteo With Cuttle
At Criteo we run something like 300k jobs, processing around 4PB of logs to produce trillions of new records each day. We do that using several frameworks such as Hive, raw Map/Reduce, Scalding or Spark.
Keywords: workflow, scheduling, hadoop, scala
Guillaume Bort
Technical Lead, Data Reliability Engineering, Criteo
11.50 - 12.20
7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with Analytics
The next time you find yourself thinking there isn’t enough time in a week, consider what Drinker Biddle did for their client in 7 days.
Keywords: machine learning, analytics, workflow
Elise Tropiano
Senior Technical Product Manager, Relativity
11.50 - 12.20
Thinking in Data Flows
In this presentation we’ll look at how far one can push the notion of batch = streaming, and how processor-oriented architectures like Apache NiFi and Apache Streams work.
Keywords: streaming, data flow, NiFi, Streams
Joey Frazee
Solutions Engineer, Hortonworks
Steve Blackmon
VP Technology, People Pattern
12.20 - 12.25
Technical break
12.25 - 12.55
Elephants in the cloud or how to become cloud ready
The way you operate your Big Data environment is not going to be the same anymore. This session is based on our experience managing on-premises environments.
Keywords: hadoop, private cloud, google cloud platform, migration, hybrid platforms
Krzysztof Adamski
Data Infrastructure Architect, ING
12.25 - 12.55
Privacy by Design
Privacy and personal integrity have become a focus topic due to the upcoming GDPR deadline in May 2018 and its requirements for data storage, retention, and access. This talk provides an engineering perspective on privacy and highlights pitfalls and topics that require early attention.
Keywords: Privacy, GDPR, data pipelines, data engineering
Lars Albertsson
Founder & data engineering consultant, Mapflat
12.25 - 12.55
The Factorization Machines algorithm for building recommendation system
Recommendation systems are one of the most successful examples of data science applications in the Big Data domain. The goal of my talk is to present the Factorization Machines algorithm, available in the SAS Viya platform.
Keywords: SAS Viya, Factorization Machines, recommendation system, sparse data
Paweł Łagodziński
Sr Business Solutions Manager, SAS Institute
12.25 - 12.55
Deriving Actionable Insights from High Volume Media Streams
In this talk we describe how to analyze high volumes of real-time streams of news feeds, social media and blogs in a scalable and distributed way using Apache Flink.
Keywords: nlp, streaming, news, machine learning
Jörn Kottmann
Senior Software Developer, Sandstone SA
Peter Thygesen
Partner & Senior Software Engineer, Paqle A/S
12.55 - 13.50
Lunch
13.50 - 14.20
Bringing Druid to production; the possibilities and pitfalls
Druid is a high-performance, column-oriented, distributed data store. This database allows you to query petabytes of columnar data in a real-time fashion.
First, an introduction to Druid’s architecture, the many components within the database system and their roles. Second, the two ways (batch/real-time) of ingesting data into Druid and their pros and cons. Finally, a case of bringing Druid into production will be presented. The focus is a cost-effective implementation that allows Druid to scale using an OpenStack private cloud. The take-aways of the session are insights into when to use Druid and help in identifying common pitfalls when running Druid in production.
Keywords: Druid, Databases, Scale
Fokko Driesprong
Data Engineer, GoDataDriven
13.50 - 14.20
Software Engineer in the world of Machine Learning
Using the example of one of Ocado’s ML projects, called Order Forecasting, I will explain how good old software engineering enables the success of ML projects.
Keywords: machine learning, software engineering, google cloud platform, user story
Przemysław Pastuszka
Machine Learning Engineer, Ocado Technology
13.50 - 14.20
Machine learning security
Despite the rapid progress of tools and methods, security has been almost entirely overlooked in mainstream machine learning. Unfortunately, even the most sophisticated and carefully crafted models can fall victim to the so-called adversarial examples.
Keywords: machine learning, security, adversarial examples
13.50 - 14.20
Near Real-Time Fraud Detection in Telecommunication Industry
In general, fraud is a common pain point in the telecom sector, and detecting fraud is like finding a needle in a haystack due to the volume and velocity of the data. There are 2 key factors in detecting fraud:
(1) Speed: if you can’t detect fraud in time, you’re doomed to lose, because the fraudsters have already got what they need. Simbox detection is one use case for this situation: fraudsters use simboxes to bypass interconnection fees. For this use case, we will talk about our real-time architecture, which uses Spark SQL to detect simboxes within 5 minutes.
(2) Accuracy: fraudsters change their methods all the time, but our job is to accurately identify their behaviour using machine learning algorithms. Anomaly detection is one use case for this situation. Here we will talk about our data mining architecture, which builds fraud models using Spark ML within 1 hour. We also discuss the performance of some ML algorithms on Spark, such as K-means, the three-sigma rule, t-digest and so on. To deliver on these two factors, we process 8-10 billion records, 4-5 TB in size, every day. Our solution combines end-to-end ingestion, processing and mining of high-volume data to detect several fraud use cases in near real time using CDR and IPTDR, saving millions and improving the user experience.
Keywords: fraud detection, realtime processing, Spark SQL, Spark ML, Machine Learning Algorithms
Burak Işıklı
Software Engineer, Turkcell
14.20 - 14.25
Technical break
14.25 - 14.55
Cloud operations with streaming analytics using Apache NiFi and Apache Flink
The amount of information coming from a Cloud deployment that can be used to gain better situational awareness and operate it efficiently is huge.
Keywords: Apache Flink, Apache NiFi, Cloud monitoring, Apache Kafka
Suneel Marthi
Principal Technologist - AI/ML, Amazon Web Services
14.25 - 14.55
Big data serving with Vespa
Offline processing with big data sets can be done with tools such as Hadoop or Spark and streams of data processed with Storm. But what do you do when you need to process data at the time a user is making a request?
Keywords: Vespa, recommendations, targeting, search
Jon Bratseth
Distinguished Architect, Oath (formerly Yahoo)
14.25 - 14.55
A/B testing powered by Big data
At Booking.com we have more than a million properties selling their rooms to our customers. We receive approximately 1000 events per minute from them, leading to a total of 500 GB of data for partner events alone.
In my talk I’ll cover A/B testing at Booking.com, the different technologies – like Hadoop, HBase, Cassandra and Kafka – that we use to store and process large volumes of data, and how we build up metrics to measure the success of our experiments.
Saurabh Goyal
Backend Developer, Booking.com
14.25 - 14.55
Enhancing Spark - increase streaming capabilities of your applications
During this session we’ll discuss the pros and cons of a new structured streaming data processing model in Spark and a nifty way of enhancing Spark with SnappyData, an open-source framework providing great features for both persistent and in-motion data analysis.
Based on a real-life use case, where we designed and implemented a streaming application filtering, consuming and aggregating tons of events, we will talk about the role of the persistent back-end and stream processing integration in real-time applications, in terms of the performance, robustness and scalability of the solution.
Keywords: Spark, structured streaming, snappy, in-memory
Kamil Folkert
CTO, Member of the Board, 3Soft
Tomasz Mirowski
IT Architect, 3Soft
14.55 - 15.00
Technical break
15.00 - 15.30
Big Data Journey at a Big Corp
We will present the journey of Orange Polska evolving from a proprietary ecosystem towards a largely open-source ecosystem based on Hadoop and friends.
Keywords: Enterprise Adoption, Hadoop integration in BI ecosystem, scaling solutions in the enterprise, data teams organization
Tomasz Burzyński
Business Insights Director, Orange
Maciej Czyżowicz
Technical Leader for Analytics Stream, Orange
15.00 - 15.30
Airflow as a Service
Oozie is still a popular workflow scheduler for Hadoop. It is a good choice if you like programming within an XML file. Engineers at Allegro don’t.
Keywords: Workflow, Automation, Orchestration, Docker
Robert Mroczkowski
Data Platform Engineer and Technical Owner of Hadoop Cluster, Grupa Allegro
15.00 - 15.30
Data Science Lessons I have learned in 5 years
Since 2013 I have been working as a Data Scientist – one of today’s hottest jobs in the IT industry. During this time, I got the opportunity to experience the evolution of the data science landscape – to see what worked and what didn’t.
Keywords: Data Science, Data Scientist, teamwork, work skills
Boxun Zhang
Sr. Data Scientist, GoEuro
15.00 - 15.30
Design Patterns for Calculating User Profiles in Real Time
At mobile.de, Germany’s largest online vehicle marketplace, we calculate user profiles in real time to optimize the user journey on the e-marketplace platform by presenting relevant products to the user…
Keywords: Big Data, Stateful Stream Processing
Igor Mazor
Senior Data Engineer, mobile.de
15.30 - 16.00
Coffee break
16.00 – 17.25 Roundtable sessions
16.00 - 16.05
Intro
Parallel roundtable discussions are the part of the conference that engages all participants. They serve a few purposes. First of all, participants have the opportunity to exchange their opinions and experiences about a specific issue that is important to the group. Secondly, participants can meet and talk with the leaders/hosts of the roundtable discussions – selected professionals with vast knowledge and experience.
There will be 2 rounds of discussion, so every conference participant can take part in 2 discussions.
16.05 – 16.45 1st round
16.50 – 17.25 2nd round
16.05 - 16.45
1st ROUND
Paweł Leszczyński
Hadoop Product Owner, Grupa Allegro
A data lake is like a snowball. Most of us have started with proofs of concept that filled the data lake with stream data and batch imports from external data sources: Camus, Gobblin, Spark ingestion, Sqoop, NiFi and more. They all start as shiny snowflakes which double within the blink of an eye. How to survive the flood on a data lake and successfully solve problems like small files on HDFS, data retention, auditing and monitoring imports, (near) real-time ingestion, and late and out-of-order events?
Grzegorz Łyczba
Lead software engineer, OpenX
Adam Karwan
Senior Data Scientist, Groupon
During this panel we are going to discuss the best techniques of powerful data visualization. Classifying data visualization tools according to their strengths and weaknesses will be another stage of the discussion.
We will also talk about storytelling and targeting presentations at the audience, e.g. customers, stakeholders, students, etc. https://www.youtube.com/watch?v=AdSZJzb-aX8
Plan of discussion:
- Present yourself: name, current position, experience in data visualization
- Describe tools for data manipulation and visualization: your favorite ones and the most painful.
- What are the features of good data visualization?
- Has it ever happened that a customer incorrectly interpreted your visualization?
- What should the correct presentation look like from the perspective of storytelling?
- Dirty and missing data: how to deal with that issue?
- Reality, Complexity, Simplicity – what is the best strategy for visualizations?
Radosław Kita
Team Lead, Adform
Becoming a data scientist seems temptingly easy: finish a specialisation at Coursera and wait for job offers. I would like to share my observations on which skills are worth having and what challenges reality will pose in the practical implementation of large data science projects.
Marek Wiewiórka
Big Data Architect, GetInData
Barbara Rychalska
Senior Data Scientist and Data Science Section Leader, Findwise
In today’s world, whose functioning is practically based on insights drawn from data, the ability to understand data in depth and to communicate findings is an increasingly welcome skill set. Scientific communication is a science in itself, so what does it take to be both a good data scientist/big data engineer AND a good communicator? Should we strain to be both, anyway?
During the discussion we will try to answer this question and touch upon others such as:
– Data visualisation: is it an extra perk or an integral part of a data science project? What makes a good visualisation?
– How to report your data science findings so as to fully convey the result in a persuasive manner
– What is exploratory data analysis (EDA), and how important is it?
With this talk I’d like to increase interest in scientific communication techniques and help bridge the gap between the scientist and the audience. We will explore how to fit data storytelling into a data science project development cycle to make it all more practical. Also, we will talk about how rational humans are – and how this (presumed) lack of total rationality influences the techniques of communication.
Firat Tekiner
Data Scientist and Big Data Architect, AB Initio
Sujatha Subramanian
Data Scientist, Lingaro
In this roundtable, we dive into various topics in the realm of Artificial Intelligence and Big Data: understanding the driving forces and the technologies powering the AI journey – especially the current growth of data, both structured and unstructured, which creates a strong synergy between AI and Big Data. How are AI and Big Data instrumental in the transformation of businesses? We will also discuss applications of deep learning in Big Data analytics and the proliferation of conversational AI, like chatbots and voice assistants, in enterprises.
Theofilos Kakantousis
Co-founder, Logical Clocks AB
Hadoop has evolved into a vast Big Data ecosystem of different frameworks and services, which means selecting a distribution that matches one’s needs has become a tricky task. In this session, we discuss the services and features that users should consider when opting for a modern Hadoop distribution. We focus on the main distributions, namely HDP, CDH, MapR, Hops, AWS EMR and Google Dataproc, and we discuss how each one would fit their needs based on the following all-important aspects:
- Performance
- Security
- Platform Installation & Administration
- Application Monitoring & Control
- Big Data Processing frameworks
- Data Governance
- SQL & Business Intelligence
- Deep Learning
Vera Matei
Data Engineer, ING
Tal Sliwowicz
Director R&D - Scale, Performance & Data, Taboola
Taboola provides 500 billion fully personalized content recommendations per month to 1.5 billion unique visitors of the most prominent publishers across the world. To do that, we are processing 40TB+ a day in real time using Hadoop, Cassandra, Kafka and Spark. A significant part of the system is running SQL queries in Spark. At this table, we want to talk about people’s experience with running SQL on top of Spark, Presto, Drill, etc. and share our own experience.
Lars Albertsson
Founder & data engineering consultant, Mapflat
16.45 - 16.50
Technical break
16.50 - 17.25
2nd ROUND
Paweł Leszczyński
Hadoop Product Owner, Grupa Allegro
A data lake is like a snowball. Most of us have started with proofs of concept that filled the data lake with stream data and batch imports from external data sources: Camus, Gobblin, Spark ingestion, Sqoop, NiFi and more. They all start as shiny snowflakes which double within the blink of an eye. How to survive the flood on a data lake and successfully solve problems like small files on HDFS, data retention, auditing and monitoring imports, (near) real-time ingestion, and late and out-of-order events?
Grzegorz Łyczba
Lead software engineer, OpenX
Adam Karwan
Senior Data Scientist, Groupon
During this panel we are going to discuss the best techniques of powerful data visualization. Classifying data visualization tools according to their strengths and weaknesses will be another stage of the discussion.
We will also talk about storytelling and targeting presentations at the audience, e.g. customers, stakeholders, students, etc. https://www.youtube.com/watch?v=AdSZJzb-aX8
Plan of discussion:
- Present yourself: name, current position, experience in data visualization
- Describe tools for data manipulation and visualization: your favorite ones and the most painful.
- What are the features of good data visualization?
- Has it ever happened that a customer incorrectly interpreted your visualization?
- What should the correct presentation look like from the perspective of storytelling?
- Dirty and missing data: how to deal with that issue?
- Reality, Complexity, Simplicity – what is the best strategy for visualizations?
Radosław Kita
Team Lead, Adform
Becoming a data scientist seems temptingly easy: finish a specialisation at Coursera and wait for job offers. I would like to share my observations on which skills are worth having and what challenges reality will pose in the practical implementation of large data science projects.
Marek Wiewiórka
Big Data Architect, GetInData
Mateusz Buśkiewicz
Tech Lead, Data Products Team, Base CRM
When Data Science meets large datasets, it can create a very effective mix and allow us to build more powerful data products. However, this is not always easy or effective.
How to be pragmatic about this topic in order to accelerate Data Science while avoiding pitfalls?
At this roundtable, we will discuss a wide range of topics on what makes a Data Scientist effective, from exploratory data analysis to the deployment of finished models in production. The questions we will ask ourselves include: Which Big Data tools are the most Data Scientist-friendly? When should we use Big Data, and when is it more practical to stay with a single machine? When does it make sense to use distributed machine learning algorithms? How do you visualize large datasets? How do you switch from prototyping to deploying scalable models in production?
Let’s share and learn from each other!
Marcin Pękalski
Data Scientist, Kambi Sports Solutions, Kaggler
Many organisations assume that Business Intelligence will be able to answer all their data-related questions. But for that, working with the data must not be a bottleneck, and that requires a proper BI platform – but what is that, and what requirements should it satisfy to provide the most benefit?
During the discussion we will try to answer a couple of questions:
– what is a BI platform?
– who should it serve?
– what are typical requirements on the platform?
– where can we expect the bottlenecks?
Artur Fejklowicz
Data Architect / Data Engineering and Science Team Leader, TVN
I would like to talk with you about your experiences with security issues. Is it possible to implement an AAA Hadoop security stack without commercial software? How to secure rows of data in Hive? Do you think Java Reflections are safe? Where should personal data be processed? How to provide access to logs where the only personal data is a cookie? Who should have access to data hashing? Kerberize it all – can we enable security in several steps, or do we have to start with everything kerberized from the beginning?
Christophe Salperwyck
ABB Ability Platform Engineer, ABB
When dealing with huge amounts of data coming as streams, you might not have the possibility to see the data again; in that case you need to use one-pass incremental algorithms.
These algorithms usually trade accuracy for performance. Real-life examples are “filtering” using Bloom filters (as in Chrome, HBase…), “count distinct” using HyperLogLog (as in Spark, Redis, AtScale…), quantile estimation…
We can have the same trade-off in Machine Learning too: algorithms exist for both supervised and unsupervised learning that can learn incrementally on data streams.
The idea for this round table is to discuss real use cases of streaming algorithms/structures and stream mining.
Tomasz Szczechura
Team Leader of Data Systems Team, Grupa Wirtualna Polska
Business requirements for large data systems are growing, and time costs more and more. We will discuss the tools we can use to minimize loading time and to query this data with sub-second OLAP queries, and how to scale this architecture to petabytes of data. We will discuss tools such as Druid, Kylin and ClickHouse, compare them and exchange experiences.
Arunabh Singh
Lead Data Scientist, HiQ International AB
IoT data often comes with its own unique challenges: lack of structure and the extreme diversity of source devices, from light sensors to cars; unreliability around hardware integration with the “internet”; and a lack of established best practices and an ecosystem of tools to process and analyze the data. However, with many of the consumer-side “big data” problems cracked, processing and harnessing the value of IoT data is the next logical progression for the “big data” discipline to figure out on a large scale. At Springworks, a connected-cars platform based in Stockholm, we work with IoT data from cars using a telematics unit and face many of these challenges. Key questions of this session include:
- What are the best design choices for IoT data, especially to overcome the unreliability of the hardware integration component?
- What are similarities/differences with “regular” big data processing?
- Which organizations are the leaders in leveraging IoT data, and what are their learnings?
Jacek Laskowski
Apache Spark™ is a fast and general engine for distributed in-memory computations at massive scale. Let’s talk about what’s coming in Apache Spark 2.3 and how to use it for large data processing in batch or streaming modes. Bring all your questions about Apache Spark in general, and Spark SQL, Spark MLlib and Spark Structured Streaming in particular. The roundtable is to help you fine-tune existing Spark workloads as well as prepare for future ones.
17.25 - 17.45
Coffee break
17.45 - 18.15
Panel discussion - Getting more out of your data in 2018
Building an efficient Big Data platform and mining large volumes of data seems to be a never-ending story for data-driven companies. It’s an ongoing journey with many pitfalls, twists and an unclear future. Each year, there is something that changes the game, brings new value, promises rewards or wastes our time. During this panel, our experts will talk about their plans and hopes for 2018 – small and big improvements to their big data strategy that will help them get more out of data in 2018. This includes data monetization, new use cases that become mainstream, new technologies that get significant adoption, and new challenges that more and more companies face. The discussion won’t be about the distant future, but about actions that you can take in 2018.
Host:
Adam Kawa
CEO and Co-founder, GetInData
Tomasz Burzyński
Business Insights Director, Orange
Karthik Ramasamy
Machine Learning Engineer, Google
Boxun Zhang
Sr. Data Scientist, GoEuro
18.15 - 18.30
Closing & Summary
Przemysław Gamdzyk
CEO & Meeting Designer, Evention
Adam Kawa
CEO and Co-founder, GetInData
19.00 - 22.00
Networking party for all participants and speakers
At the end of the conference, we would like to invite all attendees to an informal evening meeting in the BOLEK Pub.