Agenda 2020
February 27, 2020
8.00 - 9.00
Registration and welcome coffee
9.00 - 9.10
Conference opening
Przemysław Gamdzyk
CEO & Meeting Designer, Evention
9.10 - 10.40
Plenary Session
9.10 - 9.40
Challenges of modern analytics
Data is the new oil. More and more companies understand the value of data for optimising their core business or entering new business fields. They want to analyse data to enhance their internal processes, the way they work with customers, and how they collaborate with external parties such as suppliers and partners. None of this is trivial: it requires the right skill set and appropriate technology. The cloud promises scalability, elasticity and resources on demand, but a cloud-native architecture is mandatory to leverage these features. Snowflake is built for the cloud, separates storage from compute and offers everything people expect from the cloud, including a data marketplace for collaboration and monetisation that makes sharing and exchanging data globally easy. All of this is independent of any single cloud vendor and capable of multi-cloud setups.
Keywords: #Snowflake #cloud #clouddataplatform #multicloud #cloudanalytics #datawarehousecloud #data #SQL #database #datawarehouse
Thomas Scholz
Sales Engineering Manager for EMEA, Snowflake
9.40 - 10.10
Credit Risk in practice on a global scale. New technology platform and methodologies in practice
Open-source technologies and new machine learning methods have been changing regulatory credit risk in recent years. We will talk about how we handle data globally across the whole of ING, which technologies we use to improve the modellers' experience, and which machine learning tools we incorporate – all while staying compliant with stricter regulatory frameworks than ever. Handling risk on a balance sheet 1.5 times the size of Poland's GDP is a genuinely interesting task for modellers, data scientists and data engineers.
Keywords: #CreditRisk, #Python, #BigData, #UseCase #MachineLearning, #StatisticalModeling, #DataScience
Marcin Brożek
Credit Risk Modelling Expert, ING Tech Poland
Konrad Wypchło
Senior Chapter Lead, ING Tech Poland
10.10 - 10.40
Leveraging hybrid cloud for real-time insights with the new Cloudera Data Platform
The new Cloudera solutions for hybrid cloud environments. Adding Apache Flink integration to CDP. Solving real-life challenges based on use cases from the Polish market. The Apache Flink and CSP roadmap.
Keywords: #hybrid_cloud, #data_in_motion, #Cloudera_Data_Platform, #stream_processing
10.40 - 11.10
Coffee break
11.10 – 15.30 Simultaneous sessions
Architecture, Operations and Cloud
Data Engineering
Streaming and Real-Time Analytics
Artificial Intelligence and Data Science
Data Strategy and ROI
Host:
Arkadiusz Gąsior
Data Engineer, GetInData
Host:
Łukasz Suchenek
Conferences Editor, Evention
Host:
Paweł Jurkiewicz
Data Engineer, GetInData
Host:
Adrian Bednarz
Big Data Engineer, GetInData
Host:
Stefan Rautszko
Team Manager, Data Design, Roche
11.15 - 11.45
From Containers to Kubernetes Operators for a Datastore
Keywords: #docker #container #kubernetes #operator #orchestration
Philipp Krenn
Developer, Elastic
11.15 - 11.45
Will we see driverless cars in the '20s?
Keywords: #autonomousdriving #dataingestion #petabytescale #hardwareintheloop #mapr #spark #openshift
Sławomir Folwarski
Senior Architect, DXC Analytics Platform, DXC Technology
Piotr Frejowski
System Architect, DXC Robotic Drive Program, DXC Technology
11.15 - 11.45
Creating an extensible Big Data Platform for advanced analytics - 100s of PetaBytes with Realtime access
Keywords: #bigdata #scalability #hadoop #spark #analytics #datascience #dataplatform
Reza Shiftehfar
Engineering Management & Leadership, Uber
11.15 - 11.45
Building Recommendation Platform for ESPN+ and Disney+. Lessons Learned
Keywords: #recommendersystems #ML #cloud #experimentation
Grzegorz Puchawski
Data Science and Recommendation, Disney Streaming Services
11.15 - 11.45
From bioreactors to kibana dashboards
Keywords: #googleCloud #Streaming #DataFlow #DataOps
Fabian Wiktorowski
IT Expert, Roche
11.45 - 11.50
Technical break
11.50 - 12.20
Replication Is Not Enough for 450 PB: Try an Extra DC and a Cold Store
Keywords: #Hadoop #datasecurity #resilience #in-house #storage
Stuart Pook
Senior Site Reliability Engineer, Criteo
11.50 - 12.20
Data Platform at Bolt: challenges of scaling data infrastructure in a hyper growth startup
Keywords: #aws #datalake #datawarehouse #preprocessing #machinelearning
Łukasz Grądzki
Engineering Manager, Bolt
11.50 - 12.20
Building a Factory for Machine Learning at Spotify
Keywords: #ml #kubeflow #tensorflow #ml-infra
Josh Baer
Product Lead, Machine Learning Platform, Spotify
11.50 - 12.20
Abstraction matters
Keywords: #lowcode, #executionabstraction, #datavirtualization
Anthony Ibrahim
Head of Ab Initio DACH/CEE, Ab Initio
12.20 - 12.25
Technical break
12.25 - 12.55
How to make your Data Scientists like you and save a few bucks while migrating to cloud - Truecaller case study
Keywords: #cloudmigration #bigquery #airflow #kafka
Fouad Alsayadi
Senior Data Engineer, Truecaller
Juliana Araujo
Data Product Manager, Truecaller
Tomasz Żukowski
Data Analyst, GetInData
12.25 - 12.55
Kafka-workers, Parallelism First
Keywords: #kafka, #data processing, #high-performance
Tomasz Uliński
Software Developer, RTB House
12.25 - 12.55
Adventure in Complex Event Processing at telco
Keywords:
Jakub Błachucki
Big Data Engineer, Orange
Maciej Czyżowicz
Technical Leader for Analytics Stream, Orange
Paweł Pinkos
Big Data Engineer, Orange
12.25 - 12.55
Neural Machine Translation: achievements, challenges and the way forward
Keywords: #machinetranslation #deeplearning #adversarialexamples #datascience
Katarzyna Pakulska
Data Science Technology Leader, Findwise
Barbara Rychalska
Senior Data Scientist and Data Science Section Leader, Findwise
12.25 - 12.55
It's 2020. Why are we still using 1980s tech?
Keywords: #Analytics #SQL #DWH #CaseStudy #BigData
Arnon Shimoni
Product Manager and Solutions Architect, SQream
12.55 - 13.50
Lunch
13.50 - 14.20
DevOps best practices in AWS cloud
Keywords: #aws_cloud #devops #best_practices #infrastructure_as_a_code
Adam Kurowski
Senior DevOps, StepStone Services
Kamil Szkoda
DevOps Team Leader and Product Owner, StepStone Services
13.50 - 14.20
Presto @ Zalando: A cloud journey for Europe’s leading online retailer
Keywords: #CloudAnalytics #Presto #DataVirtualization #SQL-on-Hadoop #DWH
Wojciech Biela
Co-founder & Senior Director of Engineering, Starburst
Piotr Findeisen
Software Engineer, Starburst
Max Schultze
Data Engineer, Zalando SE
13.50 - 14.20
Network monitoring, data processing, forecasting, fraud and anomaly detection – using Spark, Elasticsearch, Machine Learning and Hadoop
Keywords: #spark #elasticsearch #machinelearning #hadoop #dataprocessing
Kamil Szpakowski
Big Data Main Specialist, T-Mobile
13.50 - 14.20
Feature store: Solving anti-patterns in ML-systems
Keywords: #ml #recommendersystem #mlops #automl
Andrzej Michałowski
Head of AI Research & Development, Synerise
13.50 - 14.20
Omnichannel Personalization as example of creating data ROI - from separate use cases to operational complete data ecosystem
Keywords: #ROI #real-timeomnichannelpersonalization #scalingdataecosystem #businessengagement #harvesting
Tomasz Burzyński
Business Insights Director, Orange
Mateusz Krawczyk
Personalization Solutions Product Owner, Orange
14.20 - 14.25
Technical break
14.25 - 14.55
The Big Data Bento: Diversified yet Unified
Keywords: #bigdatabento #cloud #unifiedanalyticsplatform #unifieddataanalyticsplatform #spark
Michael Shtelma
Solutions Architect, Databricks
14.25 - 14.55
Towards enterprise-grade data discovery and data lineage at ING with Apache Atlas and Amundsen
Keywords: #BigData, #DataDiscovery, #DataIngestion, #Lineage, #MetadataGovernance, #Data-Driven
Verdan Mahmood
Software Engineer, ING
Marek Wiewiórka
Big Data Architect, GetInData
14.25 - 14.55
Monitoring & Analysing Communication and Trade Events as Graphs
Keywords: #graphAnalytics #transactionProcessing #FlinkGelly #Elasticsearch #Kibana
Christos Hadjinikolis
Senior Consultant, Lead ML Engineer, Data Reply UK
14.25 - 14.55
Utilizing Machine Learning To Optimize Marketing Spend Through Attribution Modelling
Keywords: #attribution #datascience #statisticalmodeling #marketingmix #interdisciplinary
Arunabh Singh
Lead Data Scientist, HiQ International AB
14.25 - 14.55
Data Science @ PMI – Journey from business problem to the data product industrialization
Keywords: #UseCase #CI/CD #BestPracticesForDataScience #DataProduct #ReproducibleResearch
Michał Dyrda
Senior Enterprise Data Scientist, Philip Morris International
Maciej Marek
Enterprise Data Scientist, Philip Morris International
14.55 - 15.00
Technical break
15.00 - 15.30
How to send 16,000 servers to the cloud in 8 months?
Keywords: #Openx #gcp #scale #adtech #migration
Marcin Mierzejewski
Engineering Director, OpenX
Radek Stankiewicz
Strategic Cloud Engineer, Google Cloud
15.00 - 15.30
Optimize your Data Pipeline without Rewriting it
Keywords: #data-driven #optimize #data-pipeline #operation #improvement
Magnus Runesson
Senior Data Engineer, Tink
15.00 - 15.30
Flink on a trip - a real-time car insurance system in a nut(shell)
Wojciech Indyk
Streaming Analytics and All Things Data Black Belt Ninja, Humn.ai
15.00 - 15.30
Reliability in ML - how to manage changes in data science projects?
Keywords: #datascience #datamanagement #revisioncontrol #datapipeline
Kornel Skałkowski
Senior AI Engineer, Consonance Solutions
15.00 - 15.30
Using data to build Products
Keywords: #NewProducts #MachineLearning #DataFueledGrowth #DataGuidedProductDevelopment #ScalingNewProduct
Ketan Gupta
Product Leader, Booking.com
15.30 - 16.10
Coffee break
16.10 – 17.35 Roundtables sessions
16.10 - 16.15
Intro
Parallel roundtable discussions are the part of the conference that engages all participants. They serve several purposes. First, participants have the opportunity to exchange opinions and experiences about a specific issue that is important to that group. Secondly, participants can meet and talk with the leader/host of each roundtable discussion – selected professionals with vast knowledge and experience.
There will be two rounds of discussion, so every conference participant can take part in two discussions.
16.15 – 16.55 1st round
17.00 – 17.40 2nd round
16.15 - 16.55
1st round
1. Managing a Big Data project – how to make it all work well?
Data Scientists, Data Engineers, DevOps Specialists, Business Stakeholders – all these people come from different worlds, but they need to work closely together to make a Big Data project a success.
Let's discuss our achievements as well as… spectacular failures when it comes to communication, cooperation and meeting one another's expectations.
We'll be talking about methodologies, tools, best practices and the so-called human element.
Michał Rudko
Big Data Analyst / Architect, GetInData
2. Analytics and Customer Experience Management on top of Big Data
How do you ensure successful adoption of Big Data and analytics systems? It is a challenge for most organizations. Let's discuss how to promote a user-centric approach, leverage experience design and manage user expectations on Big Data projects. I would be happy to hear your opinions and answer your questions, based on my practical experience applying Design Thinking and architecture design methodologies. I believe this conversation will be interesting for Architects, Tech Leaders, Product Managers and C-level folks.
Taras Bachynskyy
Director, Big Data & Analytics, SoftServe
3. Data visualization, how to visualize large, complex and dirty data and what tools to use
Data visualisation is the best tool we currently have for explaining data. But our data volumes grow every day, and we often hit the hard limits of current visualisation systems. How can we approach this problem so that we can still analyse and explore data? From interactive interfaces that link many visualisations to machine learning algorithms that pick the best chart type and parameters – it is a truly interdisciplinary issue, so let's share our knowledge!
Adrian Mróź
Frontend Developer, Allegro
4. Practical application of AI
Industry 4.0 and AI – are we ready for the fourth industrial transformation? Who should benefit from Industry 4.0? What are the key barriers to implementing AI projects in organizations? Real cases of AI in industry.
Natalia Szóstak
Head of R&D, TIDK
5. The need for explainable AI
With the spread of AI-based solutions, more and more organizations would like to understand the reasons behind system decisions. This is especially important in regulated industries. The session will cover so-called white-box methods, as well as modern approaches to AI explainability that allow us to understand more complex models.
Kacper Łukawski
Data Science Lead, Codete
6. Real-life machine learning at scale using Kubernetes and Kubeflow
How do you build a machine learning pipeline that processes 1500 TB of data daily in a fast and cost-effective way on Google Cloud Platform using Kubeflow? How do you serve a TensorFlow model with almost 1M requests per second and latency < 10 ms on Kubernetes? Are Kubernetes and Kubeflow ready to serve data scientists?
Michał Bryś
Data scientist, OpenX
Michał Żyliński
Customer Engineer, Google
7. Big Data on Kubernetes
Kubernetes has found its place in the microservices world. More and more teams are betting on Kubernetes as their go-to platform for deploying business applications. What about Big Data? Can our ETLs also make the move? During the roundtable we'll discuss how Kubernetes can be utilised as a runtime for Big Data jobs. Is the current tooling ready to be deployed in Kubernetes containers? What does the potential shift mean for the storage technologies in use? Finally, will Kubernetes democratize Big Data work and move us from central data lakes to distributed data meshes?
Tomasz Kogut
Lead Software Engineer, Adform
8. Best tools for alerting and monitoring of the data platforms
Let's discuss what should be monitored in data platforms. What are the best tools for particular use cases? What is not recommended?
Piotr Kalański
Development Manager, StepStone Services
9. What to do with my HDP/CDH cluster with new Cloudera licensing model
After the merger with Hortonworks, Cloudera became the single vendor building a distribution that consists of the major components of the so-called Hadoop ecosystem (e.g. Hadoop, Spark, Hive, Ranger). While these components themselves are open source, access to the binaries that are critical for installing and upgrading clusters will be limited to customers who purchase a paid subscription. This means that thousands of companies that currently use Hadoop for free will need to decide what to do next. Should I pay for a subscription, or compile my own binaries to build my own distribution? Should I stop using on-premise Hadoop and go to the public cloud instead? During this panel we will explore this topic and try to answer these questions based on our vendor-neutral experience working with customers who have large production installations of HDP/CDH clusters.
Krzysztof Zarzycki
Big Data Architect, GetInData
10. Addressing challenges of modern analytics with Snowflake
Data is the new oil. More and more companies understand the value of data for optimising their core business or entering new business fields. They want to analyse data to enhance their internal processes, the way they work with customers, and how they collaborate with external parties such as suppliers and partners. None of this is trivial: it requires the right skill set and appropriate technology. The cloud promises scalability, elasticity and resources on demand, but a cloud-native architecture is mandatory to leverage these features. Snowflake is built for the cloud, separates storage from compute and offers everything people expect from the cloud, including a data marketplace for collaboration and monetisation that makes sharing and exchanging data globally easy. All of this is independent of any single cloud vendor and capable of multi-cloud setups.
Tomasz Mazurek
Sales Director for Eastern Europe, Snowflake
11. Being an efficient data engineer. Tools, ecosystem, skills and ways of learning
What does it mean to be a productive (data) engineer? Is it about the tools we use? Is it the mindset we have? Is it the environment we are surrounded by? Let's share and discuss war stories, learning resources, methodologies and libraries that help us escape the gumption traps in the daily life of an engineer. The discussion will be divided into four areas: debugging, implementation, communication and learning.
Rafał Wojdyła
Data Engineer
12. Data discovery – building trust around your data
The worldwide growth of data has changed the business landscape forever. Many organizations are undergoing transformations triggered by the data revolution. While the benefits of collecting bigger data volumes are easy to understand, doing so has revealed additional challenges when trying to use the data effectively. The ability to explore data, together with increasing compliance demands, forces us to think about solutions that leverage the power of metadata. Data descriptions are evolving from simple schema definitions to capturing application context, behaviour and how it changes over time.
Let’s discuss data discovery in context of use cases, technologies and possible challenges.
Damian Warszawski
Software Engineer, ING Tech Poland
13. SQL on Big Data for batch, ad-hoc & streaming processing
Data analysis is the key factor in a data-driven decision culture, and SQL is the omnipresent language for deriving information from data. Today even small companies have huge data sets, while huge organisations have enormous ones. With the advent of technologies aiming to replace and unify the ones we used previously, we have ever more complex and heterogeneous data landscapes. How do we query the data to fuel key business decisions? How do we handle data ingestion? When do we need it?
Piotr Findeisen
Software Engineer, Starburst
14. The Latest and Greatest of Apache Spark
Apache Spark is a fast, general engine for distributed in-memory computation at massive scale. Spark 3 is in preview and expected to be released in the first quarter of 2020. What features are you waiting for, and what problems do you hope to solve with Spark 3? This roundtable is for sharing and discussing the problems we want to solve with the new features coming in Spark 3.
Magnus Runesson
Senior Data Engineer, Tink
15. Serverless data warehousing – big data, the cloud way
A definition of the serverless data warehouse. A list of solutions considered serverless. Data ingestion, storage and processing; pricing and cost efficiency. Advantages and disadvantages of both serverless and on-premise approaches.
Arkadiusz Gąsior
Data Engineer, GetInData
16. Stream processing engines – features, performance, comparison
Streaming systems are gaining more and more attention, and we don't expect this trend to slow down. Currently there are a few engines on the market. At this roundtable we will share our knowledge about the similarities and differences among them across various areas. What are the strengths, weaknesses and constraints of each? Is there a niche for each of them, or will a final winner emerge?
Marek Maj
Big Data Engineer, GetInData
17. Snorkel Beambell – Real-time Weak Supervision on Apache Beam
Deep learning models have led to massive growth in real-world machine learning, allowing practitioners to achieve state-of-the-art scores on benchmarks without any hand-engineered features. The challenge with continuous retraining is that one needs to maintain prior state (e.g. the labelling functions in the case of Weak Supervision, or a pretrained model like BERT or Word2Vec for Transfer Learning) that is shared across multiple streams. Apache Beam's stateful stream processing capabilities are a perfect match for scalable Weak Supervision.
Suneel Marthi
Principal Technologist - AI/ML, Amazon Web Services
17.00 - 17.40
2nd round
1. Managing a Big Data project – how to make it all work well?
Data Scientists, Data Engineers, DevOps Specialists, Business Stakeholders – all these people come from different worlds, but they need to work closely together to make a Big Data project a success.
Let's discuss our achievements as well as… spectacular failures when it comes to communication, cooperation and meeting one another's expectations.
We'll be talking about methodologies, tools, best practices and the so-called human element.
Michał Rudko
Big Data Analyst / Architect, GetInData
2. Bring Data as Products to consumers
How do you define data products? What mindset and approach should we have to make a product approach possible with data? Let's brainstorm about implementing the approach and the opportunities and value it brings.
Łukasz Pakuła
RGITSC Team Manager - DataOps , Roche
3. Data visualization, how to visualize large, complex and dirty data and what tools to use
Data visualisation is the best tool we currently have for explaining data. But our data volumes grow every day, and we often hit the hard limits of current visualisation systems. How can we approach this problem so that we can still analyse and explore data? From interactive interfaces that link many visualisations to machine learning algorithms that pick the best chart type and parameters – it is a truly interdisciplinary issue, so let's share our knowledge!
Adrian Mróź
Frontend Developer, Allegro
4. BI platform: a choice or a circumstance?
We put a lot of effort into designing data lakes, pipelines and data warehouses. But let's not forget the original aim at the very beginning of this road: to gain insight. To utilize petabytes in an organized and governed manner, we need a solid BI platform that suits our needs. A well-chosen one is not just the icing on the cake – it makes all the data efforts worthwhile and drives our business head-to-head with the competition.
During this roundtable session we will discuss the importance of choosing the right BI platform, exchange best practices (and some bad ones too) and focus on current trends in the BI area.
Emil Ruchała
Data Analyst & BI Developer, GetInData
5. Real-life machine learning at scale using Kubernetes and Kubeflow
How do you build a machine learning pipeline that processes 1500 TB of data daily in a fast and cost-effective way on Google Cloud Platform using Kubeflow? How do you serve a TensorFlow model with almost 1M requests per second and latency < 10 ms on Kubernetes? Are Kubernetes and Kubeflow ready to serve data scientists?
Michał Bryś
Data scientist, OpenX
Michał Żyliński
Customer Engineer, Google
6. Databases in Kubernetes: from bare metal to cloud native
Initial cloud-native conquests started with stateless services, but gradually turned towards data management systems. DBMSs have relied on centralized, bare-metal servers for decades, so cloud-native architectures are a big technology shift for such systems. Many commercial and open-source databases already provide cloud-native adaptations of their products. Among the most interesting cloud-native converts are analytic databases, which impose additional requirements on storage and clustering techniques in order to run fast analytic queries over billions, or even trillions, of rows. Examples include the MySQL clustering project Vitess, which has recently reached CNCF graduation level, and ClickHouse, an extremely fast and scalable analytical database that is being converted to cloud-native operation by Altinity.
Join me to discuss various aspects of running databases in Kubernetes. This is a new technology that has a lot of caveats, such as storage. At the same time databases in Kubernetes promise substantial benefits to the users of such applications as well as companies that operate them. We will explore these issues as well as the path to maturity.
Alexander Zaitsev
Co-founder & CTO, Altinity
7. Scale Your Logs, Metrics, and Traces with the Elastic Stack: from traditional applications to microservices and Kubernetes
How do you tackle your monitoring and observability problems? There is a high chance you are using the Elastic or ELK Stack, and this session is all about making it scale: from easier collection of data, to scalable multi-tier architectures, to the lifecycle of your data, including deletion.
Philipp Krenn
Developer, Elastic
8. From on-premise to the cloud: an end to end cloud migration journey
The goal of the discussion is to share various experiences and ideas around tackling the migration challenge. Together we will try to nail down the benefits, obstacles and solutions that can help on our journey from on-prem to the cloud.
The plan is to identify the important areas of the migration and share the lessons we've learned from them.
Mateusz Pytel
Google Certified Professional - Cloud Architect, GetInData
9. Challenges of building a modern & future-proof data processing platform
The speed of change in IT and in our companies never seems to stop increasing, especially in the field of Big Data. To keep up we need to move fast and be smart about it – but how do we actually achieve that? How do we predict future needs for processing and tools? How do we prepare for them? What kinds of trade-offs can we make? We'll try to answer these questions together and share good practices and experience during this session.
Monika Puchalska
Engineering Manager, Zendesk
10. Hadoop is dying, long live HDFS – what are your options and plans for sustainable advanced analytics and machine learning?
Many enterprise organizations, particularly in financial services and telco, have built massive data lakes on Hadoop, specifically HDFS. However, Hadoop never lived up to its hype as a data warehouse replacement and languished as a storage option in on-premises data centers. On top of that, the top two HDFS vendors – Hortonworks and Cloudera – have merged, and the third, MapR, has been sold off to HPE in a fire sale.
What is your escape plan, and what are your alternatives for applying advanced analytics and machine learning to your growing data volumes? Will you adopt object storage as a cost-effective data storage repository? Is the separation of compute and storage the future database architecture for managing variable workloads?
Join this roundtable discussion to get to the bottom of these top-of-mind questions and learn about the emerging trends and options for building modern data pipelines for advanced analytics and machine learning over vast amounts of data.
Maciej Paliwoda
Solution Engineer, VERTICA
11. Being an efficient data engineer. Tools, ecosystem, skills and ways of learning
What does it mean to be a productive (data) engineer? Is it about the tools we use? Is it the mindset we have? Is it the environment we are surrounded by? Let's share and discuss war stories, learning resources, methodologies and libraries that help us escape the gumption traps in the daily life of an engineer. The discussion will be divided into four areas: debugging, implementation, communication and learning.
Rafał Wojdyła
Data Engineer
12. Data discovery – building trust around your data
The worldwide growth of data has changed the business landscape forever. Many organizations are undergoing transformations triggered by the data revolution. While the benefits of collecting bigger data volumes are easy to understand, doing so has revealed additional challenges when trying to use the data effectively. The ability to explore data, together with increasing compliance demands, forces us to think about solutions that leverage the power of metadata. Data descriptions are evolving from simple schema definitions to capturing application context, behaviour and how it changes over time.
Let’s discuss data discovery in context of use cases, technologies and possible challenges.
Damian Warszawski
Software Engineer, ING Tech Poland
13. SQL on Big Data for batch, ad-hoc & streaming processing
Data analysis is the key factor in a data-driven decision culture, and SQL is the omnipresent language for deriving information from data. Today even small companies have huge data sets, while huge organisations have enormous ones. With the advent of technologies aiming to replace and unify the ones we used previously, we have ever more complex and heterogeneous data landscapes. How do we query the data to fuel key business decisions? How do we handle data ingestion? When do we need it?
Piotr Findeisen
Software Engineer, Starburst
14. The Latest and Greatest of Apache Spark
Apache Spark is a fast, general engine for distributed in-memory computation at massive scale. Spark 3 is in preview and expected to be released in the first quarter of 2020. What features are you waiting for, and what problems do you hope to solve with Spark 3? This roundtable is for sharing and discussing the problems we want to solve with the new features coming in Spark 3.
Magnus Runesson
Senior Data Engineer, Tink
15. Managing workflows at scale
How to build and maintain thousands of pipelines in the organisation? What are the biggest pain points in orchestrating hundreds of ETLs? What open source and managed solutions are available?
Paweł Kupidura
Data Engineer, Bolt
16. Stream processing engines – features, performance, comparison
Streaming systems are gaining more and more attention, and we don't expect this trend to slow down. Currently there are a few engines on the market. At this roundtable we will share our knowledge about the similarities and differences among them across various areas. What are the strengths, weaknesses and constraints of each? Is there a niche for each of them, or will a final winner emerge?
Let's share our experiences in using streaming engines, as well as our predictions about their future.
Marek Maj
Big Data Engineer, GetInData
17. Data Auditing
Have you ever encountered a situation where your pipeline (or system) produced less (or more) data than expected? Has it lost your data? Or maybe the data never reached the source you read from (or reached it, but far too late)? Have you ever seen too many empty (or NULL) fields? Or a field called `age` with negative values? Or maybe you encrypted some data so thoroughly that no one could decrypt it anymore? Last but not least, what do you do when you detect a problem (and do you even monitor your data? Do you have any alerts set up?)? Can you backfill or mutate a wrong dataset, or should you leave it? Let's talk about the different ways to ensure that the data you produce makes sense.
Bartosz Janota
Senior Data Engineer, Bolt
17.40 - 17.55
Coffee break
17.55 - 18.25
Panel discussion: Ways to make large-scale ML actually work
Despite the spread of dedicated AI platforms, ready-to-use ML libraries and the tons of data available, running successful large-scale AI/ML projects still faces technical and organizational challenges. According to some studies, 8 out of 10 such projects fail. This panel will explore the technical prerequisites a company should put in place to build ML-based solutions efficiently. These include, for example, organizing the data (e.g. data discovery, data lineage, data quality), experimenting with models (e.g. notebooks, libraries, collaboration), one-click deployment of a model (e.g. AI/ML platforms, infrastructure) and more. While many of these challenges are not that hard when working with small data, everything gets more complex and time-consuming as the data and scale grow.
Host:
Marcin Choiński
Head of Big Data & Analytics Ecosystem, TVN
Panelists:
Josh Baer
Product Lead, Machine Learning Platform, Spotify
Marek Wiewiórka
Big Data Architect, GetInData
Paweł Zawistowski
Lead Data Scientist, Adform, Assistant Professor, Warsaw University of Technology
18.25 - 18.40
Closing & Summary
Przemysław Gamdzyk
CEO & Meeting Designer, Evention
19.00 - 22.00
Networking party for all participants and speakers
At the end of the conference we would like to invite all attendees to an informal evening meeting at the "Dekada" Club, located at Grójecka 19/25, 02-021 Warszawa.