Agenda 2017 | BigData Technology

AGENDA IN 2017 EDITION:

8.00 - 9.00

Registration and coffee

9.00 - 9.15

Conference opening

Przemysław Gamdzyk

CEO & Meeting Designer, Evention

Przemysław Gamdzyk

CEO & Meeting Designer, Evention

Adam Kawa

Data Engineer and Founder, GetInData

Adam Kawa

CEO and Co-founder, GetInData

9.15 - 9.45

The data horizon 2017: Vision of Google Team

How cloud can address big data tasks: practical application of big data analytics and machine learning inside and outside Google. Ours vision of Cloud.

Magdalena Dziewguć

, Google

Michał Sapiński

Software engineer, Google

9.45 - 10.15

Meta-Experimentation at Etsy

Experimentation abounds, but how do we test our tests? I’ll share some ways we at Etsy proved our experimentation methods broken, and the approach we took to fixing them. I’ll discuss multiple ways of running A/A tests (as opposed to A/B tests), and a statistical method called bootstrapping, which we used to remedy our experiment analysis.

Emily Sommer

Software Engineer, Etsy

10.15 - 10.45

Managing the Margins: Big Data case study - Prescriptive Analysis for Semiconductor Manufacturing

The semiconductor industry is the backbone of the digital age. Sector innovations drive the ability to do more on ever smaller machines, but perhaps equally important is the ability to optimize the manufacturing processes. For example, in the digital printing of semiconductor components, 1 in a billion failure rate for droplets may sound like an acceptable rate. This is less so when you consider that up to 50 million droplets can be pushed per second, leading to an unacceptable defect rate of one every 20 seconds. Pre-emptive analytics on streaming sensor and image data play a key role in finding indications of where and when defects are looming. This presentation will focus on an industry use case for combining SAS and open source analytics to tackle these essential big data challenges, and will also provide some insights on applications in other sectors.

Sascha Schubert

Advisory Business Solutions Manager, Global Technology Practice, SAS Institute

10.45 - 11.15

Coffee break

Simultaneous sessions

Operations & Deployment

This track is dedicated to system administrators and people with DevOps skills who are interested in technologies and best practices for planning, installing, managing and securing their Big Data infrastructure in enterprise environments – both on-premise and the cloud.

Data Application Development

This track is the place for developers to learn about tools, techniques and innovative solutions to collect and process large volumes of data. It covers topics like data ingestion, ETL, process scheduling, metadata and schema management, distributed datastores and more.

Analytics & Data Science

This track includes real case-studies demonstrating how Big Data is used to address a wide range of business problems. You can find here talks about large-scale Machine Learning, A/B tests, visualizing data as well as various analysis that enable making data-driven decisions and feed personalized features of data-driven products.

Real-Time Processing

This track covers technologies, strategies and use-cases for real-time data ingestion and deriving real-time actionable insights from the flow of events coming from sensors, devices, users, and front-end systems.

Session chairs

Piotr Bednarek

Administrator Hadoop, GetInData

Piotr Bednarek

Administrator Hadoop, GetInData

Piotr Krewski

Data Consultant and Co-founder, GetInData

Piotr Krewski

Big Data Consultant and Co-founder, GetInData

Przemysław Gamdzyk

CEO & Meeting Designer, Evention

Przemysław Gamdzyk

CEO & Meeting Designer, Evention

Klaudia Zduńczyk

Business Development Specialist, GetInData

11.15 - 11.45

That won’t fit into RAM

SentiOne is one of the leading solutions in Europe for social media listening and analysis. We monitor over 26 European markets including CEE, Scandinavia,

DACH, and the Balkans. The amount of data that is processed every day and is ready to be queried by our users is enormous. Over the years we have tested many technologies and approaches in big data from which many have failed. The presentation includes our experiences and lessons learned on setting up big data company from scratch. I will give details on configuring robust ElasticSearch cluster with over 26TB of data and describe key challenges in efficient web crawling and data extraction.

(More...)

Michał Brzezicki

Wiceprezes Zarządu, SentiOne

11.15 - 11.45

Enabling 'Log Everything' at Skyscanner

Skyscanner is a leading global travel search site offering a comprehensive and free flight search service as well as online

comparisons for hotels and car hire. We believe that data should be at the heart of every decision at Skyscanner, so it’s important that our engineers have the tools to seamlessly log the data that will help them with those decisions. In this talk, we discuss the approach we’ve taken to enable this and reflect on some of the challenges and lessons learnt. Technologies used include kafka, logstash, elasticsearch, secor, aws (S3, lambda), samza, protocal buffers and others.

(More...)

Robin Tweedie

Senior Software Engineer, SkyScanner

Arthur Vivian

Software Engineer, SkyScanner

11.15 - 11.45

Alchemists 2.0: Turning data into gold

How to bring money to the table with Data Science. Practical examples of

Data Science “in-action” from recent projects. When to use Linear Regression vs XG Boost in business applications. What is the money impact of using Data Science.

(More...)

Paweł Godula

Senior data Scientist, BCG Gamma

11.15 - 11.45

Real-Time Data Processing at RTB House – Architecture & Lessons Learned

Our platform, which purchases and runs advertisements in the Real-Time Bidding model, processes 250K bid

requests and generates 20K events per every second which gives 3TB data every day. Because of machine learning, system monitoring and financial settlements we need to filter, store, aggregate and join these events together. As a result processed events and aggregated statistics are available in Hadoop, Google BigQuery and Postgres. The most demanding are business requirements such as: events that should be joined together can appear 30 days after each other, we are not allowed to create any duplicates, we have to minimalize possible data losses as well as there could not be any differences between generated data outputs. We have designed and implemented the solution which has reduced delay of availability of this data from 1 day to 15 seconds.

We will preent: Our first approach to the problem (end-of-day batch jobs) and final solution (real-time stream processing) 2. detailed description of the current architecture 3. how we had tested new data flow before it was deployed and in which way it is being monitored now 4. our one-click deployment process 5. decisions which we made with its advantages and disadvantages and our future plans to improve our current solution.

We would like to share our experience connected with scaling solution over clusters of computers in several data centers. We will focus on the current architecture but also on testing and monitoring issues with our deployment process. Finally, we would like to provide an overview of engaged projects like Kafka, Mirrormaker, Storm, Aerospike, Flume, Docker etc. We will describe what we have achieved from given open source and some problems we have come across.

(More...)

Bartosz Łoś

Tech Lead, RTB House

11.45 - 11.50

Technical break

11.50 - 12.20

Scalable Analytics for Microservices Architecture

Avito is the third biggest classified site in the world after Craigslit and 58.com from China. Avito nowadays

is not a monolite project, but comprises dozens specialized vertical sites and applications.

The introduction of microservice architecture in Avito spawned hundreds of new services. In this situation is is critical to implement common BI infrastructure, able to collect, process, combine and analyse data from all those microservices and persistent to constant changes.

Avito Analytics is based on HP Vertica MPP database, highly normalized data lake and an asynchronous event bus. Those tools give Avito the ability to use all types of Machine Learning and Reporting tools, manage sites, applications and microservices.

Avito is the Russian OLX. Moreover, nowadays, Avito and OLX are both part of the Naspers group, we do the same business in different counties and share experience.

(More...)

Nikolay Golov

Chief Data Warehousing Architect, Avito

11.50 - 12.20

DataOps or how I learned to love production

A plethora of data processing tools, most of them open source, is available to us. But who actually runs data

pipelines? What about dynamically allocating resources to data pipeline components? In this talk we will discuss options to operate elastic data pipelines with modern, cloud native platforms such as DC/OS with Apache Mesos, Kubernetes and Docker Swarm. We will review good practices, from containerizing workloads to making things resilient and show elastic data pipelines in action.

(More...)

Michael Hausenblas

Developer Advocate, Mesosphere

11.50 - 12.20

SAS Viya – the fundamentals of analytics architecture of the future

Since the inception of modern analytical platforms, companies have been trying to out-smart each other to perform analytics faster than ever.

SAS Institute, has been leading the Analytics industry for over 40 years in the area of advanced analytics with new innovations including MVA, In-Database and In-Memory computing. SAS has recently released its 3rd Generation In-Memory platform, SAS Viya, that has been designed from ground-up for scalable analytics to solve the problems of the future, powered by CAS (Cloud Analytics Services) server. This session will give you an overview of the new and exciting features of SAS Viya and CAS, and how it differs from some of the other in-memory platforms in the market. We will discuss scalability, memory management, Hadoop infrastructure integration, integration with open source tools like Python, R and others. (Web) interfaces.

(More...)

Muhammad Asif Abbasi

Principal Business Solutions Manager, SAS Institute

11.50 - 12.20

Streaming analytics better than batch - when and why

While a lot of problems can be solved in batch, the stream processing approach currently gives you more benefits. And it’s not only sub-second latency at scale. But mainly possibility…

to express accurate analytics with little effort – something that is hard or usually ignored with older batch technologies like Pig, Scalding, Spark or even established stream processors like Storm or Spark Streaming. In this talk we’ll use a real-world example of user session analytics to give you a use-case driven overview of business and technical problems that modern stream processing technologies like Flink help you solve, and benefits you can get by using them today for processing your data as a stream.

(More...)

Adam Kawa

Data Engineer and Founder, GetInData

Adam Kawa

CEO and Co-founder, GetInData

Krzysztof Zarzycki

Big Data Architect, GetInData

Krzysztof Zarzycki

Big Data Architect, CTO and Co-founder, GetInData

Dawid Wysakowicz

Software Engineer, Ververica (former data Artisans)

12.20 - 12.25

Technical break

12.25 - 12.55

Creating effective, scalable and easily manageable environment for real-time big data processing and analytics

Creating effective, scalable and easily manageable environment for big data processing is a challenge which touches multiple domains. Business ideas, data science, analytic algorithms and analytic

software tools as well as scalable infrastructure which has to fit specific use case and be open for dynamic changes. Cisco and Alterdata understands together all stages of this process and are able to guide companies through this journey.
During the session we will describe use case of real-time big data analytics related to location tracking and how it leverages automated and scalable Cisco platform.
How to effectively use C-store DBMS analytics platform as well as Cisco Validated Design for Big Data architecture which combines tools as Cisco UCS (Unified Computing System), Cisco ACI (Application Centric Infrastructure) and UCS Director for Big Data which provides a single-touch solution that automates Hadoop deployment and provides a single management pane across both physical infrastructure and Hadoop software.

(More...)

many database systems and query processors feature advanced query optimizers and highly efficient execution engines, and last but not least it is the standard that everybody knows and uses. With stream processing technology becoming mainstream a question arises: “Why isn’t SQL widely supported by open source stream processors?”. One answer is that SQL’s semantics and syntax have not been designed with the characteristics of streaming data in mind. Consequently, systems that want to provide support for SQL on data streams have to overcome a conceptual gap. One approach is to support standard SQL which is known by users and tools but comes at the cost of cumbersome workarounds for many common streaming computations. Other approaches are to design custom SQL-inspired stream analytics languages or to extend SQL with streaming-specific keywords. While such solutions tend to result in more intuitive syntax, they suffer from not being established standards and thereby exclude many users and tools.

Apache Flink is a distributed stream processing system with very good support for streaming analytics. Flink features two relational APIs, the Table API and SQL. The Table API is a language-integrated relational API with stream-specific features. Flink’s SQL interface implements the plain SQL standard. Both APIs are semantically compatible and share the same optimization and execution path based on Apache Calcite.

In this talk we present the future of Apache Flink’s relational APIs for stream analytics, discuss their conceptual model, and showcase their usage. The central concept of these APIs are dynamic tables. We explain how streams are converted into dynamic tables and vice versa without losing information due to the stream-table duality. Relational queries on dynamic tables behave similar to materialized view definitions and produce new dynamic tables. We show how dynamic tables are converted back into changelog streams or are written as materialized views to external systems, such as Apache Kafka or Apache Cassandra, and are updated in place with low latency. We conclude our talk demonstrating the power and expressiveness of Flink’s relational APIs by presenting how common stream analytics use cases can be realized.

(More...)