Agenda 2020
February 27, 2020
8.00 - 9.00
Registration and welcome coffee
9.00 - 9.10
Conference opening
Przemysław Gamdzyk
CEO & Meeting Designer, Evention
9.10 - 10.40
Plenary Session
9.10 - 9.40
Challenges of modern analytics
Data is the new oil. More and more companies understand the value of data for optimising their core business or entering new business fields. They want to analyse data to enhance their internal processes, the way they work with customers, and how they collaborate with external parties such as suppliers and partners. None of this is trivial: it requires the right skill set and appropriate technology. The cloud promises scalability, elasticity and resources on demand, but a cloud-native architecture is mandatory to leverage these features. Snowflake is built for the cloud, separates storage from compute and offers everything people expect from the cloud, including a data marketplace for collaboration and monetisation that makes sharing and exchanging data globally easy. All of this is independent of any single cloud vendor and capable of multi-cloud setups.
Keywords: #Snowflake #cloud #clouddataplatform #multicloud #cloudanalytics #datawarehousecloud #data #SQL #database #datawarehouse
Thomas Scholz
Sales Engineering Manager for EMEA, Snowflake
9.40 - 10.10
Credit Risk in practice on a global scale. New technology platform and methodologies in practice
Open-source technologies and new machine learning methods have been changing regulatory credit risk in recent years. We will talk about how we handle data globally across the whole of ING, which technologies we use to improve the modellers' experience, and which machine learning tools we incorporate – all while staying compliant with stricter regulatory frameworks than ever. Handling risk on a balance sheet 1.5 times the size of Poland's GDP is a genuinely interesting task for modellers, data scientists and data engineers.
Keywords: #CreditRisk, #Python, #BigData, #UseCase #MachineLearning, #StatisticalModeling, #DataScience
Marcin Brożek
Credit Risk Modelling Expert, ING Tech Poland
Konrad Wypchło
Senior Chapter Lead, ING Tech Poland
10.10 - 10.40
Leveraging hybrid cloud for real-time insights with the new Cloudera Data Platform
The new Cloudera solutions for hybrid cloud environments. Adding Apache Flink integration to CDP. Solving real-life challenges based on use cases from the Polish market. The Apache Flink and CSP roadmap.
Keywords: #hybrid_cloud, #data_in_motion, #Cloudera_Data_Platform, #stream_processing
10.40 - 11.10
Coffee break
11.10 – 15.30 Simultaneous sessions
Architecture, Operations and Cloud
Data Engineering
Streaming and Real-Time Analytics
Artificial Intelligence and Data Science
Data Strategy and ROI
Host:
Arkadiusz Gąsior
Data Engineer, GetInData
Host:
Łukasz Suchenek
Conferences Editor, Evention
Host:
Paweł Jurkiewicz
Data Engineer, GetInData
Host:
Adrian Bednarz
Big Data Engineer, GetInData
Host:
Stefan Rautszko
Team Manager, Data Design, Roche
11.15 - 11.45
From Containers to Kubernetes Operators for a Datastore
Keywords: #docker #container #kubernetes #operator #orchestration
Philipp Krenn
Developer, Elastic
11.15 - 11.45
Will we see driverless cars in the '20s?
Keywords: #autonomousdriving #dataingestion #petabytescale #hardwareintheloop #mapr #spark #openshift
Sławomir Folwarski
Senior Architect, DXC Analytics Platform, DXC Technology
Piotr Frejowski
System Architect, DXC Robotic Drive Program, DXC Technology
11.15 - 11.45
Creating an extensible Big Data Platform for advanced analytics - 100s of PetaBytes with Realtime access
Keywords: #bigdata #scalability #hadoop #spark #analytics #datascience #dataplatform
Reza Shiftehfar
Engineering Management & Leadership, Uber
11.15 - 11.45
Building Recommendation Platform for ESPN+ and Disney+. Lessons Learned
Keywords: #recommendersystems #ML #cloud #experimentation
Grzegorz Puchawski
Data Science and Recommendation, Disney Streaming Services
11.15 - 11.45
From bioreactors to kibana dashboards
Keywords: #googleCloud #Streaming #DataFlow #DataOps
Fabian Wiktorowski
IT Expert, Roche
11.45 - 11.50
Technical break
11.50 - 12.20
Replication Is Not Enough for 450 PB: Try an Extra DC and a Cold Store
Keywords: #Hadoop #datasecurity #resilience #in-house #storage
Stuart Pook
Senior Site Reliability Engineer, Criteo
11.50 - 12.20
Data Platform at Bolt: challenges of scaling data infrastructure in a hyper growth startup
Keywords: #aws #datalake #datawarehouse #preprocessing #machinelearning
Łukasz Grądzki
Engineering Manager, Bolt
11.50 - 12.20
Building a Factory for Machine Learning at Spotify
Keywords: #ml #kubeflow #tensorflow #ml-infra
Josh Baer
Product Lead, Machine Learning Platform, Spotify
11.50 - 12.20
Abstraction matters
Keywords: #lowcode, #executionabstraction, #datavirtualization
Anthony Ibrahim
Head of Ab Initio DACH/CEE, Ab Initio
12.20 - 12.25
Technical break
12.25 - 12.55
How to make your Data Scientists like you and save a few bucks while migrating to cloud - Truecaller case study
Keywords: #cloudmigration #bigquery #airflow #kafka
Fouad Alsayadi
Senior Data Engineer, Truecaller
Juliana Araujo
Data Product Manager, Truecaller
Tomasz Żukowski
Data Analyst, GetInData
12.25 - 12.55
Kafka-workers, Parallelism First
Keywords: #kafka, #data processing, #high-performance
Tomasz Uliński
Software Developer, RTB House
12.25 - 12.55
Adventure in Complex Event Processing at telco
Keywords:
Jakub Błachucki
Big Data Engineer, Orange
Maciej Czyżowicz
Technical Leader for Analytics Stream, Orange
Paweł Pinkos
Big Data Engineer, Orange
12.25 - 12.55
Neural Machine Translation: achievements, challenges and the way forward
Keywords: #machinetranslation #deeplearning #adversarialexamples #datascience
Katarzyna Pakulska
Data Science Technology Leader, Findwise
Barbara Rychalska
Senior Data Scientist and Data Science Section Leader, Findwise
12.25 - 12.55
It's 2020. Why are we still using 1980s tech?
Keywords: #Analytics #SQL #DWH #CaseStudy #BigData
Arnon Shimoni
Product Manager and Solutions Architect, SQream
12.55 - 13.50
Lunch
13.50 - 14.20
DevOps best practices in AWS cloud
Keywords: #aws_cloud #devops #best_practices #infrastructure_as_a_code
Adam Kurowski
Senior DevOps, StepStone Services
Kamil Szkoda
DevOps Team Leader and Product Owner, StepStone Services
13.50 - 14.20
Presto @ Zalando: A cloud journey for Europe’s leading online retailer
Keywords: #CloudAnalytics #Presto #DataVirtualization #SQL-on-Hadoop #DWH
Wojciech Biela
Co-founder & Senior Director of Engineering, Starburst
Piotr Findeisen
Software Engineer, Starburst
Max Schultze
Data Engineer, Zalando SE
13.50 - 14.20
Network monitoring, data processing, forecasting, fraud and anomaly detection – using Spark, Elasticsearch, Machine Learning and Hadoop
Keywords: #spark #elasticsearch #machinelearning #hadoop #dataprocessing
Kamil Szpakowski
Big Data Main Specialist, T-Mobile
13.50 - 14.20
Feature store: Solving anti-patterns in ML-systems
Keywords: #ml #recommendersystem #mlops #automl
Andrzej Michałowski
Head of AI Research & Development, Synerise
13.50 - 14.20
Omnichannel Personalization as example of creating data ROI - from separate use cases to operational complete data ecosystem
Keywords: #ROI #real-timeomnichannelpersonalization #scalingdataecosystem #businessengagement #harvesting
Tomasz Burzyński
Business Insights Director, Orange
Mateusz Krawczyk
Personalization Solutions Product Owner, Orange
14.20 - 14.25
Technical break
14.25 - 14.55
The Big Data Bento: Diversified yet Unified
Keywords: #bigdatabento #cloud #unifiedanalyticsplatform #unifieddataanalyticsplatform #spark
Michael Shtelma
Solutions Architect, Databricks
14.25 - 14.55
Towards enterprise-grade data discovery and data lineage at ING with Apache Atlas and Amundsen
Keywords: #BigData, #DataDiscovery, #DataIngestion, #Lineage, #MetadataGovernance, #Data-Driven
Verdan Mahmood
Software Engineer, ING
Marek Wiewiórka
Big Data Architect, GetInData
14.25 - 14.55
Monitoring & Analysing Communication and Trade Events as Graphs
Keywords: #graphAnalytics #transactionProcessing #FlinkGelly #Elasticsearch #Kibana
Christos Hadjinikolis
Senior Consultant, Lead ML Engineer, Data Reply UK
14.25 - 14.55
Utilizing Machine Learning To Optimize Marketing Spend Through Attribution Modelling
Keywords: #attribution #datascience #statisticalmodeling #marketingmix #interdisciplinary
Arunabh Singh
Lead Data Scientist, HiQ International AB
14.25 - 14.55
Data Science @ PMI – Journey from business problem to the data product industrialization
Keywords: #UseCase #CI/CD #BestPracticesForDataScience #DataProduct #ReproducibleResearch
Michał Dyrda
Senior Enterprise Data Scientist, Philip Morris International
Maciej Marek
Enterprise Data Scientist, Philip Morris International
14.55 - 15.00
Technical break
15.00 - 15.30
How to send 16,000 servers to the cloud in 8 months?
Keywords: #Openx #gcp #scale #adtech #migration
Marcin Mierzejewski
Engineering Director, OpenX
Radek Stankiewicz
Strategic Cloud Engineer, Google Cloud
15.00 - 15.30
Optimize your Data Pipeline without Rewriting it
Keywords: #data-driven #optimize #data-pipeline #operation #improvement
Magnus Runesson
Senior Data Engineer, Tink
15.00 - 15.30
Flink on a trip - a real-time car insurance system in a nut(shell)
Wojciech Indyk
Streaming Analytics and All Things Data Black Belt Ninja, Humn.ai
15.00 - 15.30
Reliability in ML - how to manage changes in data science projects?
Keywords: #datascience #datamanagement #revisioncontrol #datapipeline
Kornel Skałkowski
Senior AI Engineer, Consonance Solutions
15.00 - 15.30
Using data to build Products
Keywords: #NewProducts #MachineLearning #DataFueledGrowth #DataGuidedProductDevelopment #ScalingNewProduct
Ketan Gupta
Product Leader, Booking.com
15.30 - 16.10
Coffee break
16.10 – 17.35 Roundtables sessions
16.10 - 16.15
Intro
Parallel roundtable discussions are the part of the conference that engages all participants. They serve several purposes. First, participants have the opportunity to exchange opinions and experiences about a specific issue that is important to that group. Secondly, participants can meet and talk with the leader/host of each roundtable discussion – selected professionals with vast knowledge and experience.
There will be two rounds of discussion, so every conference participant can take part in two discussions.
16.15 – 16.55 1st round
17.00 – 17.40 2nd round
16.15 - 16.55
1st round
1. Managing a Big Data project – how to make it all work well?
Data Scientists, Data Engineers, DevOps Specialists, Business Stakeholders – all these people come from different worlds, but they need to work closely together to make a Big Data project a success.
Let's discuss our achievements as well as… spectacular failures when it comes to communication, cooperation and meeting one another's expectations.
We'll be talking about methodologies, tools, best practices and the so-called human element.
Michał Rudko
Big Data Analyst / Architect, GetInData
2. Analytics and Customer Experience Management on top of Big Data
How do you ensure successful adoption of Big Data and analytics systems? It is a challenge for most organizations. Let's discuss how to promote a user-centric approach, leverage experience design and manage user expectations on Big Data projects. I would be happy to hear your opinions and answer your questions, based on my practical experience applying Design Thinking and architecture design methodologies. I believe this conversation will be interesting for Architects, Tech Leaders, Product Managers and C-level folks.
Taras Bachynskyy
Director, Big Data & Analytics, SoftServe
3. Data visualization, how to visualize large, complex and dirty data and what tools to use
Data visualisation is the best tool we currently have for explaining data. But our data volumes grow every day, and we often hit the hard limits of current visualisation systems. How can we approach this problem so that we can still analyse and explore data? From interactive interfaces that link many visualisations to machine learning algorithms that pick the best chart type and parameters – it is a truly interdisciplinary issue, so let's share our knowledge!
Adrian Mróź
Frontend Developer, Allegro
4. Practical application of AI
Industry 4.0 and AI – are we ready for the fourth industrial transformation? Who should benefit from Industry 4.0? What are the key barriers to implementing AI projects in organizations? Real cases of AI in industry.
Natalia Szóstak
Head of R&D, TIDK
5. The need for explainable AI
With the spread of AI-based solutions, more and more organizations would like to understand the reasons behind system decisions. This is especially important in regulated industries. The session will cover so-called white-box methods, as well as modern approaches to AI explainability that allow us to understand more complex models.
Kacper Łukawski
Data Science Lead, Codete
6. Real-life machine learning at scale using Kubernetes and Kubeflow
How do you build a machine learning pipeline that processes 1500 TB of data daily in a fast and cost-effective way on Google Cloud Platform using Kubeflow? How do you serve a TensorFlow model with almost 1M requests per second and latency < 10 ms on Kubernetes? Are Kubernetes and Kubeflow ready to serve data scientists?
Michał Bryś
Data scientist, OpenX
Michał Żyliński
Customer Engineer, Google
7. Big Data on Kubernetes
Kubernetes has found its place in the microservices world. More and more teams are betting on Kubernetes as their go-to platform for deploying business applications. What about Big Data? Can our ETLs also make the move? During the roundtable we'll discuss how Kubernetes can be utilised as a runtime for Big Data jobs. Is the current tooling ready to be deployed in Kubernetes containers? What does the potential shift mean for the storage technologies in use? Finally, will Kubernetes democratize Big Data work and move us from central data lakes to distributed data meshes?
Tomasz Kogut
Lead Software Engineer, Adform
8. Best tools for alerting and monitoring of the data platforms
Let's discuss what should be monitored in data platforms. What are the best tools for particular use cases? What is not recommended?
Piotr Kalański
Development Manager, StepStone Services
9. What to do with my HDP/CDH cluster with new Cloudera licensing model
After the merger with Hortonworks, Cloudera became the single vendor building a distribution that consists of the major components of the so-called Hadoop ecosystem (e.g. Hadoop, Spark, Hive, Ranger). While these components themselves are open source, access to the binaries that are critical for installing and upgrading clusters will be limited to customers who purchase a paid subscription. This means that thousands of companies that currently use Hadoop for free will need to decide what to do next. Should I pay for a subscription, or compile my own binaries to build my own distribution? Should I stop using on-premise Hadoop and go to the public cloud instead? During this panel we will explore this topic and try to answer these questions based on our vendor-neutral experience working with customers who have large production installations of HDP/CDH clusters.
Krzysztof Zarzycki
Big Data Architect, GetInData
10. Addressing challenges of modern analytics with Snowflake
Data is the new oil. More and more companies understand the value of data for optimising their core business or entering new business fields. They want to analyse data to enhance their internal processes, the way they work with customers, and how they collaborate with external parties such as suppliers and partners. None of this is trivial: it requires the right skill set and appropriate technology. The cloud promises scalability, elasticity and resources on demand, but a cloud-native architecture is mandatory to leverage these features. Snowflake is built for the cloud, separates storage from compute and offers everything people expect from the cloud, including a data marketplace for collaboration and monetisation that makes sharing and exchanging data globally easy. All of this is independent of any single cloud vendor and capable of multi-cloud setups.
Tomasz Mazurek
Sales Director for Eastern Europe, Snowflake
11. Being an efficient data engineer. Tools, ecosystem, skills and ways of learning
What does it mean to be a productive (data) engineer? Is it about the tools we use? Is it the mindset we have? Is it the environment we are surrounded by? Let's share and discuss war stories, learning resources, methodologies and libraries that help us escape the gumption traps in the daily life of an engineer. The discussion will be divided into four areas: debugging, implementation, communication and learning.
Rafał Wojdyła
Data Engineer
12. Data discovery – building trust around your data
The worldwide growth of data has changed the business landscape forever. Many organizations are undergoing transformations triggered by the data revolution. While the benefits of collecting bigger data volumes are easy to understand, doing so has revealed additional challenges when trying to use the data effectively. The ability to explore data, together with increasing compliance demands, forces us to think about solutions that leverage the power of metadata. Data descriptions are evolving from simple schema definitions to capturing application context, behaviour and how it changes over time.
Let’s discuss data discovery in context of use cases, technologies and possible challenges.
Damian Warszawski
Software Engineer, ING Tech Poland
13. SQL on Big Data for batch, ad-hoc & streaming processing
Data analysis is the key factor in a data-driven decision culture, and SQL is the omnipresent language for deriving information from data. Today even small companies have huge data sets, while huge organisations have enormous ones. With the advent of technologies aiming to replace and unify the ones we used previously, we have ever more complex and heterogeneous data landscapes. How do we query the data to fuel key business decisions? How do we handle data ingestion? When do we need it?
Piotr Findeisen
Software Engineer, Starburst
14. The Latest and Greatest of Apache Spark
Apache Spark is a fast, general engine for distributed in-memory computation at massive scale. Spark 3 is in preview and expected to be released in the first quarter of 2020. What features are you waiting for, and what problems do you hope to solve with Spark 3? This roundtable is for sharing and discussing the problems we want to solve with the new features coming in Spark 3.
Magnus Runesson
Senior Data Engineer, Tink
15. Serverless data warehousing – big data, the cloud way
A definition of the serverless data warehouse. A list of solutions considered serverless. Data ingestion, storage and processing; pricing and cost efficiency. Advantages and disadvantages of both serverless and on-premise approaches.
Arkadiusz Gąsior
Data Engineer, GetInData
16. Stream processing engines – features, performance, comparison
Streaming systems are gaining more and more attention, and we don't expect this trend to slow down. Currently there are a few engines on the market. At this roundtable we will share our knowledge about the similarities and differences among them across various areas. What are the strengths, weaknesses and constraints of each? Is there a niche for each of them, or will a final winner emerge?
Marek Maj
Big Data Engineer, GetInData
17. Snorkel Beambell – Real-time Weak Supervision on Apache Beam
Deep learning models have led to massive growth in real-world machine learning, allowing practitioners to achieve state-of-the-art scores on benchmarks without any hand-engineered features. The challenge with continuous retraining is that one needs to maintain prior state (e.g. the labelling functions in the case of Weak Supervision, or a pretrained model like BERT or Word2Vec for Transfer Learning) that is shared across multiple streams. Apache Beam's stateful stream processing capabilities are a perfect match for scalable Weak Supervision.
Suneel Marthi
Principal Technologist - AI/ML, Amazon Web Services
17.00 - 17.40
2nd round
1. Managing a Big Data project – how to make it all work well?
Data Scientists, Data Engineers, DevOps Specialists, Business Stakeholders – all these people come from different worlds, but they need to work closely together to make a Big Data project a success.
Let's discuss our achievements as well as… spectacular failures when it comes to communication, cooperation and meeting one another's expectations.
We'll be talking about methodologies, tools, best practices and the so-called human element.
Michał Rudko
Big Data Analyst / Architect, GetInData
2. Bring Data as Products to consumers
How do you define data products? What mindset and approach should we have to make a product approach possible with data? Let's brainstorm about implementing the approach and the opportunities and value it brings.
Łukasz Pakuła
RGITSC Team Manager - DataOps , Roche
3. Data visualization, how to visualize large, complex and dirty data and what tools to use
Data visualisation is the best tool we currently have for explaining data. But our data volumes grow every day, and we often hit the hard limits of current visualisation systems. How can we approach this problem so that we can still analyse and explore data? From interactive interfaces that link many visualisations to machine learning algorithms that pick the best chart type and parameters – it is a truly interdisciplinary issue, so let's share our knowledge!
Adrian Mróź
Frontend Developer, Allegro
4. BI platform: a choice or a circumstance?
We put a lot of effort into designing data lakes, pipelines and data warehouses. But let's not forget the original aim at the very beginning of this road: to gain insight. To utilize petabytes in an organized and governed manner, we need a solid BI platform that suits our needs. A well-chosen one is not just the icing on the cake – it makes all the data efforts worthwhile and drives our business head-to-head with the competition.
During this roundtable session we will discuss the importance of choosing the right BI platform, exchange best practices (and some bad ones too) and focus on current trends in the BI area.
Emil Ruchała
Data Analyst & BI Developer, GetInData
5. Real-life machine learning at scale using Kubernetes and Kubeflow
How do you build a machine learning pipeline that processes 1500 TB of data daily in a fast and cost-effective way on Google Cloud Platform using Kubeflow? How do you serve a TensorFlow model with almost 1M requests per second and latency < 10 ms on Kubernetes? Are Kubernetes and Kubeflow ready to serve data scientists?
Michał Bryś
Data scientist, OpenX
Michał Żyliński
Customer Engineer, Google
6. Databases in Kubernetes: from bare metal to cloud native
Initial cloud-native conquests started with stateless services, but gradually turned towards data management systems. DBMSs have relied on centralized, bare-metal servers for decades, so cloud-native architectures are a big technology shift for such systems. Many commercial and open-source databases already provide cloud-native adaptations of their products. Among the most interesting cloud-native converts are analytic databases, which impose additional requirements on storage and clustering techniques in order to run fast analytic queries over billions, or even trillions, of rows. Examples include the MySQL clustering project Vitess, which has recently reached CNCF graduation level, and ClickHouse, an extremely fast and scalable analytical database that is being converted to cloud-native operation by Altinity.
Join me to discuss various aspects of running databases in Kubernetes. This is a new technology that has a lot of caveats, such as storage. At the same time databases in Kubernetes promise substantial benefits to the users of such applications as well as companies that operate them. We will explore these issues as well as the path to maturity.
Alexander Zaitsev
Co-founder & CTO, Altinity
7. Scale Your Logs, Metrics, and Traces with the Elastic Stack: from traditional applications to microservices and Kubernetes
How do you tackle your monitoring and observability problems? There is a high chance you are using the Elastic or ELK Stack, and this session is all about making it scale: from easier collection of data, to scalable multi-tier architectures, to the lifecycle of your data, including deletion.
Philipp Krenn
Developer, Elastic
8. From on-premise to the cloud: an end to end cloud migration journey
The goal of the discussion is to share various experiences and ideas around tackling the migration challenge. Together we will try to nail down the benefits, obstacles and solutions that can help on our journey from on-prem to the cloud.
The plan is to identify the important areas of the migration and share the lessons we've learned from them.
Mateusz Pytel
Google Certified Professional - Cloud Architect, GetInData
9. Challenges of building a modern & future-proof data processing platform
The speed of change in IT and in our companies never seems to stop increasing, especially in the field of Big Data. To keep up we need to move fast and be smart about it – but how do we actually achieve that? How do we predict future needs for processing and tools? How do we prepare for them? What kinds of trade-offs can we make? We'll try to answer these questions together and share good practices and experience during this session.
Monika Puchalska
Engineering Manager, Zendesk
10. Hadoop is dying, long live HDFS – what are your options and plans for sustainable advanced analytics and machine learning?
Many enterprise organizations, particularly in financial services and telco, have built massive data lakes on Hadoop, specifically HDFS. However, Hadoop never lived up to its hype as a data warehouse replacement and languished as a storage option in on-premises data centers. On top of that, the top two HDFS vendors – Hortonworks and Cloudera – have merged, and the third, MapR, has been sold off to HPE in a fire sale.
What is your escape plan, and what are your alternatives for applying advanced analytics and machine learning to your growing data volumes? Will you adopt object storage as a cost-effective data storage repository? Is the separation of compute and storage the future database architecture for managing variable workloads?
Join this roundtable discussion to get to the bottom of these top-of-mind questions and learn about the emerging trends and options for building modern data pipelines for advanced analytics and machine learning over vast amounts of data.
Maciej Paliwoda
Solution Engineer, VERTICA
11. Being an efficient data engineer. Tools, ecosystem, skills and ways of learning
What does it mean to be a productive (data) engineer? Is it about the tools we use? Is it the mindset we have? Is it the environment we are surrounded by? Let's share and discuss war stories, learning resources, methodologies and libraries that help us escape the gumption traps in the daily life of an engineer. The discussion will be divided into four areas: debugging, implementation, communication and learning.
Rafał Wojdyła
Data Engineer
12. Data discovery – building trust around your data
The worldwide growth of data has changed the business landscape forever. Many organizations are undergoing transformations triggered by the data revolution. While the benefits of collecting bigger data volumes are easy to understand, doing so has revealed additional challenges when trying to use the data effectively. The ability to explore data, together with increasing compliance demands, forces us to think about solutions that leverage the power of metadata. Data descriptions are evolving from simple schema definitions to capturing application context, behaviour and how it changes over time.
Let’s discuss data discovery in context of use cases, technologies and possible challenges.
Damian Warszawski
Software Engineer, ING Tech Poland
13. SQL on Big Data for batch, ad-hoc & streaming processing
Data analysis is the key factor in a data-driven decision culture, and SQL is the omnipresent language for deriving information from data. Today even small companies have huge data sets, while huge organisations have enormous ones. With the advent of technologies aiming to replace and unify the ones we used previously, we have ever more complex and heterogeneous data landscapes. How do we query the data to fuel key business decisions? How do we handle data ingestion? When do we need it?
Piotr Findeisen
Software Engineer, Starburst
14. The Latest and Greatest of Apache Spark
Apache Spark is a fast, general engine for distributed in-memory computation at massive scale. Spark 3 is in preview and expected to be released in the first quarter of 2020. What features are you waiting for, and what problems do you hope to solve with Spark 3? This roundtable is for sharing and discussing the problems we want to solve with the new features coming in Spark 3.
Magnus Runesson
Senior Data Engineer, Tink
15. Managing workflows at scale
How to build and maintain thousands of pipelines in the organisation? What are the biggest pain points in orchestrating hundreds of ETLs? What open source and managed solutions are available?
Paweł Kupidura
Data Engineer, Bolt
16. Stream processing engines – features, performance, comparison
Streaming systems are gaining more and more attention, and we don't expect this trend to slow down. Currently there are a few engines on the market. At this roundtable we will share our knowledge about the similarities and differences among them across various areas. What are the strengths, weaknesses and constraints of each? Is there a niche for each of them, or will a final winner emerge?
Let's share our experiences in using streaming engines, as well as our predictions about their future.
Marek Maj
Big Data Engineer, GetInData
17. Data Auditing
Have you ever encountered a situation where your pipeline (or system) produced less (or more) data than expected? Has it lost your data? Or maybe the data never reached the source you read from (or reached it, but far too late)? Have you ever seen too many empty (or NULL) fields? Or a field called `age` with negative values? Or maybe you encrypted some data so thoroughly that no one could decrypt it anymore? Last but not least, what do you do when you detect a problem (and do you even monitor your data? Do you have any alerts set up?)? Can you backfill or mutate a wrong dataset, or should you leave it? Let's talk about the different ways to ensure that the data you produce makes sense.
Bartosz Janota
Senior Data Engineer, Bolt
17.40 - 17.55
Coffee break
17.55 - 18.25
Panel discussion: Ways to make large-scale ML actually work
Despite the spread of dedicated AI platforms, ready-to-use ML libraries and the tons of data available, running successful large-scale AI/ML projects still faces technical and organizational challenges. According to some studies, 8 out of 10 such projects fail. This panel will explore the technical prerequisites a company should put in place to build ML-based solutions efficiently. These include, for example, organizing the data (e.g. data discovery, data lineage, data quality), experimenting with models (e.g. notebooks, libraries, collaboration), one-click deployment of a model (e.g. AI/ML platforms, infrastructure) and more. While many of these challenges are not that hard when working with small data, everything gets more complex and time-consuming as the data and scale grow.
Host:
Marcin Choiński
Head of Big Data & Analytics Ecosystem, TVN
Panelists:
Josh Baer
Product Lead, Machine Learning Platform, Spotify
Marek Wiewiórka
Big Data Architect, GetInData
Paweł Zawistowski
Lead Data Scientist, Adform, Assistant Professor, Warsaw University of Technology
18.25 - 18.40
Closing & Summary
Przemysław Gamdzyk
CEO & Meeting Designer, Evention
19.00 - 22.00
Networking party for all participants and speakers
At the end of the conference we would like to invite all attendees to an informal evening meeting at the "Dekada" Club, located at Grójecka 19/25, 02-021 Warszawa.