Real-time interactive analytics at mega-scale

Yuan Jiang is a senior staff engineer leading the storage engine team for the Interactive Analytics product at Alibaba. At Big Data Tech Warsaw Summit 2020, he told the story of a large-scale real-time data warehouse product developed in-house. He talked about solutions for a real-time data warehouse, its architecture, and typical scenarios. What is the Interactive Analytics service, and is it available to the general public? It is a sub-second real-time data warehouse. It offers the ability to analyze massive amounts of data interactively, and it is fully compatible with PostgreSQL. We combined large-scale Computational Storage – low cost,
Read More
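Since the service is described as fully compatible with PostgreSQL, a minimal sketch of what that compatibility means in practice is shown below: any standard PostgreSQL driver should be able to connect and run ordinary SQL. The endpoint, credentials, and table used here are placeholders, not details from the talk.

```python
# Minimal sketch: a PostgreSQL-compatible warehouse can be queried with a
# standard driver such as psycopg2. Host, port, credentials, and the table
# below are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(
    host="your-instance-endpoint.example.com",  # placeholder endpoint
    port=5432,                                   # assumed PostgreSQL-compatible port
    dbname="analytics",
    user="your_user",
    password="your_password",
)

with conn.cursor() as cur:
    # Any ordinary SQL a PostgreSQL client would send.
    cur.execute(
        "SELECT shop_id, count(*) AS orders "
        "FROM orders WHERE order_date = current_date "
        "GROUP BY shop_id ORDER BY orders DESC LIMIT 10"
    )
    for shop_id, orders in cur.fetchall():
        print(shop_id, orders)

conn.close()
```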

Real-time access to hundreds of petabytes of data

How to create an extensible Big Data platform for advanced analytics? How to scale a data platform from a few terabytes to over 100 petabytes while reducing data latency from 24+ hours to minutes? At Big Data Tech Warsaw Summit 2020, Reza Shiftehfar, leader of the Hadoop Platform teams at Uber, told the story of building and growing Uber's reliable and scalable Big Data platform. It serves petabytes of data in a real-time fashion using technologies such as Apache Hadoop, Apache Hive, Apache Kafka, Apache Spark, and Presto. What is your role at Uber? I manage the new platform team at Uber.
Read More
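The ingestion pattern mentioned above – streaming data through Apache Kafka into a data lake with Apache Spark – can be sketched roughly as follows. This is a generic illustration, not Uber's pipeline code; broker addresses, topic names, and storage paths are hypothetical, and the job requires the spark-sql-kafka connector package.

```python
# Generic Kafka -> data lake ingestion sketch with Spark Structured Streaming.
# All names and paths are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-datalake-sketch").getOrCreate()

# Read a continuous stream of events from a Kafka topic.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092,broker-2:9092")
    .option("subscribe", "trip-events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; decode the payload before landing it.
events = raw.select(col("value").cast("string").alias("payload"), col("timestamp"))

# Land micro-batches as Parquet files in the lake, checkpointing progress.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://example-datalake/raw/trip_events/")
    .option("checkpointLocation", "s3a://example-datalake/checkpoints/trip_events/")
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()
```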

Big data, even bigger challenges

We all know it: data is the new oil, and we live in a data age. We all understand the value of data in optimizing the business and pursuing new opportunities. The question is how to use it efficiently to become more successful and more agile, and how to become a data-driven organization. The best place to look for answers is the Big Data Tech Warsaw Summit, an independent annual conference focusing on data science and engineering, AI and machine learning, data strategies and monetization, DevOps, and the cloud. Data truly is today's most valuable resource. Every company in the world
Read More

Data Engineer at Zalando on data integration at petabyte scale, best practices, and technology tools. An interview with Max Schultze.

What are the main challenges of building an end-to-end data integration platform at petabyte scale? Max Schultze [MS]: At Zalando, building a Data Lake was not the very first thing the company had in mind. Throughout the company's growth from a startup to the thousands of employees it has now, the technical landscape grew organically, and so did the data landscape. Realizing that classical analytical processing was no longer sufficient to sustain the company's growth – not even speaking of future-oriented use cases like machine learning – the first team of data engineers was
Read More

Data Scientist at Roche on generating business value, biggest challenges and great opportunities. An interview with Dr Mohammadjavad Faraji.

Can data science generate significant medical and business value at a non-IT company like Roche? Mohammadjavad Faraji [MF]: Definitely yes! The combined strengths of our pharmaceutical and diagnostics businesses under one roof have already made Roche the leader in personalised healthcare (PHC), offering comprehensive diagnostics and targeted therapies for people with cancer and other severe diseases. Digitalisation in healthcare now also brings the ability to understand and interpret unprecedented volumes of data, allowing a higher-resolution view of each individual patient than ever before. We are committed to delivering on this opportunity and are drawing on our unique
Read More

People, processes and tools – an interview with Mateusz Fedoryszak

What is the secret of successful collaboration between a data scientist and a data engineer? Respect and humility. When you work with people with complementary skills, it is easy to think: we solve the real problems, their tasks could be done by a high school student. You often do not realise why deploying a small service or drawing a simple chart can be a challenge. On the other hand, even people who do not fully understand your field can offer valuable suggestions and feedback. Is a strict division of responsibilities the solution? Sometimes part of the problem is forcing data scientists to do engineering tasks. Does it happen less often the other way around? We had the opposite problem – it seemed
Read More

Big Data Technology Warsaw 2019 Recap: from technology to people

The rise of Kubernetes, open source in the cloud, market consolidation, and a shortage of data science and data engineering skills top the Big Data Technology Warsaw Summit 2019 takeaways. Big Data has always been evolving fast. Not so long ago, the Hadoop and open source revolution reshaped the data analytics landscape, yet the big data and AI technology landscape is still changing quite rapidly. Today we see new megatrends that might completely change the Big Data landscape: containerisation, hybrid and public cloud, and ML/AI adoption.
Read More

Flink committer on the new-generation big data framework and processing engine, project development plans, and why it is great to contribute to open source projects and the community – an interview with Dawid Wysakowicz, Software Engineer at Ververica

Why can Apache Flink be considered the best choice for processing streaming data? Dawid Wysakowicz [DW]: One reason is that it addresses all streaming use cases: bounded streams (aka batch), streaming analytics, event-driven applications, and so on. It also has best-in-class support for state management and event time. It is also industry-proven, as it runs at Netflix scale. In contrast to other open-source stream processors, Flink provides not only true stream processing at event granularity, i.e. no micro-batching, but also handles many batch use cases very well. What are the main directions of Flink development? DW: Right now
Read More
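To illustrate the event-time handling that the interview mentions, here is a minimal PyFlink sketch of an event-time tumbling window over a keyed stream with a bounded-out-of-orderness watermark. It is an illustrative example rather than code from the interview; the element names and timestamps are made up, and a real job would read from a source such as Kafka instead of an in-memory collection.

```python
# Minimal event-time windowing sketch with the PyFlink DataStream API.
# Elements are (user, event_time_millis, clicks) tuples; all values are made up.
from pyflink.common.time import Duration, Time
from pyflink.common.typeinfo import Types
from pyflink.common.watermark_strategy import TimestampAssigner, WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import TumblingEventTimeWindows


class ClickTimestampAssigner(TimestampAssigner):
    def extract_timestamp(self, value, record_timestamp):
        # The event time lives in the second field of each tuple (millis).
        return value[1]


env = StreamExecutionEnvironment.get_execution_environment()

# Small in-memory stream standing in for a real source such as Kafka.
clicks = env.from_collection(
    [("alice", 1000, 1), ("bob", 2000, 1), ("alice", 9000, 1), ("alice", 15000, 1)],
    type_info=Types.TUPLE([Types.STRING(), Types.LONG(), Types.INT()]),
)

# Tolerate events arriving up to 5 seconds out of order.
watermarks = (
    WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(5))
    .with_timestamp_assigner(ClickTimestampAssigner())
)

# Count clicks per user in 10-second event-time windows.
result = (
    clicks.assign_timestamps_and_watermarks(watermarks)
    .key_by(lambda e: e[0])
    .window(TumblingEventTimeWindows.of(Time.seconds(10)))
    .reduce(lambda a, b: (a[0], max(a[1], b[1]), a[2] + b[2]))
)

result.print()
env.execute("event_time_window_sketch")
```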