The Best Chance to Get Firsthand Information About Big Data / The Biggest Big Data Tech Event in Central Europe
Big Data Technology Warsaw Summit 2017 was the third edition of the first truly international conference in Poland. The event was a huge success. Over 400 specialists – data scientists, data engineers, IT managers, application developers, and system administrators – registered to participate in the main event, and over 100 took part in the technical workshops earlier the same week. For the participants, it was a unique chance to get to know how data analysis is handled in the world’s leading tech companies, like Google, Facebook, H2O.ai, Cisco, Skyscanner, Fandom, or Avito.
– There are not many big data tech events in Poland, or in Europe for that matter. We wanted to create a conference similar to the events in Berlin, Seville, and London – said Przemysław Gamdzyk, CEO & Meeting Designer at Evention, one of the conference organizers.
About 60 speakers invited by an international conference committee shared their practical knowledge during four days of workshops, thematic sessions, and roundtable debates. – Our most important goal in creating this event is knowledge sharing – said Adam Kawa, CEO and Co-founder of GetInData.
Getting inside knowledge from the world’s leading companies
The conference was a unique chance to learn directly from the experiences of data scientists, system administrators, developers, and data engineers from the world’s biggest companies: Google, Facebook, Cisco, H2O.ai, SAS Institute, Skyscanner, Fandom, Avito, Ab Initio, Criteo, PubMatic, Findwise, GetInData, and Allegro.
During the main conference day, the participants could choose from four dedicated tracks.
Operations & Deployment was dedicated to system administrators and people with DevOps skills who are interested in technologies and best practices for planning, installing, managing and securing their Big Data infrastructure in enterprise environments – both on-premise and the cloud.
Data Application Development was a place for developers to learn about tools, techniques, and innovative solutions to collect and process large volumes of data. It covered topics like data ingestion, ETL, process scheduling, metadata and schema management, distributed data stores, and more.
Analytics & Data Science included real case studies demonstrating how Big Data is used to address a wide range of business problems. Talks covered large-scale machine learning, A/B testing, and data visualization, as well as various analyses that enable data-driven decisions and feed personalized features of data-driven products.
Finally, Real-Time Processing covered technologies, strategies, and use cases for real-time data ingestion and deriving real-time actionable insights from the flow of events coming from sensors, devices, users, and front-end systems.
Making a profit on big data
The participants could learn the business aspect of big data and machine learning from industry insiders. Sascha Schubert, Advisory Business Solutions Manager at SAS Institute, an analytics software company from North Carolina, shared detailed case studies on how machine learning can be used in different industries.
The Big Data conference participants could learn the details of an impressive case study in which a combination of SAS and open source analytics helped a semiconductor manufacturer optimize their production process. – The semiconductor industry is the backbone of the digital age. The sector’s innovations drive the ability to do more on ever smaller machines, but perhaps equally important is the ability to optimize the manufacturing processes – said Sascha Schubert. – In the digital printing of semiconductor components, a one-in-a-billion failure rate for droplets may sound acceptable. It is less so when you consider that up to 50 million droplets can be pushed per second, leading to an unacceptable defect rate of one failure every 20 seconds – he explained. Automatic quality control using efficient in-memory processing helped the semiconductor manufacturer reduce the time needed to run weekly quality control checks.
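The arithmetic behind that defect rate is easy to verify; the droplet rate and failure probability below are the figures from the quote:

```python
# Check the defect-rate arithmetic from the quote: at 50 million
# droplets per second and a 1-in-a-billion failure rate, how often
# does a failed droplet occur on average?
droplets_per_second = 50_000_000
failure_probability = 1e-9  # 1 in a billion

failures_per_second = droplets_per_second * failure_probability
seconds_per_failure = 1 / failures_per_second  # average gap between failures

print(round(seconds_per_failure))  # → 20, matching the talk
```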
Paweł Goduła, senior data scientist at BCG Gamma, the advanced analytics and big data branch of Boston Consulting Group, talked about his cooperation with his consultant colleagues, and how analytics can be used to increase the profits of his clients. He shared practical examples of data science “in action” from recent BCG Gamma projects. He also explained when to use linear regression versus XGBoost in business applications.
According to Paweł Goduła, working as a data scientist for a consulting company is complex. – You work with consultants, who care about earning money. They ask me “Paweł, is this model going to bring us any money?” or “When will the model be ready?” – said Paweł Goduła. – On today’s job market you want to be at the intersection: matching statistical knowledge with business savvy. There are few people available who can do both, so you can imagine the implications it has for you – he added.
Getting to know the newest technology
Jo-fai Chow, Data Scientist at H2O.ai, talked about the motivation behind and benefits of Deep Water. He showed how to build and deploy deep learning models, with or without programming experience, using H2O’s R/Python/Flow (Web) interfaces. Deep Water is H2O’s integration with multiple open source deep learning libraries such as TensorFlow, MXNet, and Caffe. On top of the performance gains from GPU backends, Deep Water naturally inherits all of H2O’s properties in scalability, ease of use, and deployment.
System administrators could get some inside information on big data architecture from Nikolay Golov, chief data warehousing architect at Avito, the world’s second-largest classified advertisements website, with 35 million unique monthly visitors.
Krzysztof Baczyński, Cisco Big Data Lead for Poland, and Kamil Ciuszko, CEO of Alterdata, explained how to create an effective, scalable, and easily manageable environment for big data processing. They presented a case of real-time big data analytics related to location tracking using an automated and scalable Cisco platform.
Participants could learn how to effectively use the C-Store DBMS analytics platform, as well as the Cisco Validated Design for Big Data architecture, which combines tools such as Cisco UCS (Unified Computing System), Cisco ACI (Application Centric Infrastructure), and UCS Director for Big Data – a single-touch solution that automates Hadoop deployment and provides a single management pane across both the physical infrastructure and the Hadoop software.
Firat Tekiner, a data scientist and big data architect at Ab Initio, told the conference participants about a general-purpose data processing and metadata management platform his company offers. Ab Initio has a single architecture for processing data in HDFS, regular files, database tables, message queues, web services, and metadata. This architecture enables virtually any technical or business rule to be graphically defined, shared, and executed in a timely manner.
Learning from the experience of people running huge projects
How does Facebook process its data? What does a data engineer’s job look like in one of the world’s largest corporations? At Big Data Technology Warsaw Summit 2017 you could talk about it with Paweł Koperek, Data Engineer at Facebook. Facebook serves ads from over 4 million advertisers to more than a billion people each day. – Every day we face the challenge of building the best products for such a large user base. In order to focus on the right ones, we have to make well-informed decisions, which we can back with data – Paweł Koperek told the conference participants.
Krystian Mistrzak and Thejas Murthy, data engineers at Fandom Powered by Wikia, the largest entertainment fan site in the world, with more than 360,000 fan communities and a global audience of over 190 million monthly unique users, gave conference participants an overview comparing existing tools, explained why they chose Airflow, and showed how Airflow is used to build a stable and reliable orchestration platform that democratizes data access for non-data engineers. They also shared some tricks and best practices for developing workflows with Airflow.
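Airflow’s core idea – that tasks form a directed acyclic graph and each task runs only after its upstream dependencies succeed – can be sketched with the standard library’s `graphlib`. The task names below are hypothetical, not taken from the Fandom talk:

```python
from graphlib import TopologicalSorter

# A hypothetical ETL workflow expressed as task -> upstream dependencies,
# the same shape an Airflow DAG encodes.
dag = {
    "extract_events": set(),
    "extract_users": set(),
    "transform": {"extract_events", "extract_users"},
    "load_warehouse": {"transform"},
    "notify_analysts": {"load_warehouse"},
}

# static_order() yields a valid execution order: every task appears
# only after all of its dependencies, just as a scheduler would run them.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

A real Airflow DAG adds scheduling, retries, and monitoring on top, but the dependency-ordering contract is the same.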
Stuart Pook, Senior DevOps Engineer at Criteo, a NASDAQ-traded French personalized retargeting company, talked about his company’s experience with a Hadoop cluster with 39 PB of raw storage, 13,404 CPUs, 105 TB of RAM, 40 TB of data imported per day, and over 100,000 jobs per day. This cluster was critical for both storage and compute, yet it had no backups. After many efforts to increase redundancy, Criteo moved to two clusters that combined have more than 2,000 nodes, 130 PB of storage, two different versions of Hadoop, and 200,000 jobs per day – but these clusters do not yet provide a redundant solution to all of the company’s storage and compute needs.
In his presentation, he disclosed the choices Criteo made and the issues it solved in creating a 1,200-node cluster with new hardware in a new data center, as well as some of the challenges involved in running two different clusters in parallel.
Software developers had a chance to listen to Robin Tweedie and Arthur Vivian, software engineers at Skyscanner, one of the world’s leading travel metasearch engines, with 60 million monthly users. They talked about the approach Skyscanner took to enable every decision in the company to be based on data. They shared the lessons learned when using technologies like Kafka, Logstash, Elasticsearch, Secor, AWS Lambda with Amazon S3, Samza, Protocol Buffers, and others.
A look inside data analytics processes
Michael Hausenblas, Developer Advocate at Mesosphere, a San Francisco-based company building an operating system for data centers based on Apache Mesos, discussed options for operating elastic data pipelines with modern, cloud-native platforms such as DC/OS with Apache Mesos, Kubernetes, and Docker Swarm.
Mark Pybus, Head of Data Engineering at Sky Bet, one of the largest UK online bookmakers, explained how their Hadoop platform addresses two common problems in the gambling industry – knowing your current liability position and helping potentially irresponsible gamblers before they identify themselves.
Participants could learn from Sky Bet’s experiences replacing a traditional data warehouse with Hadoop: how the architecture met the needs of sportsbook traders managing liabilities in a competitive, high-frequency environment, and how that led to decommissioning the legacy data warehouse.
Polish companies were represented by Allegro Group, among others. Piotr Guzik, Software Engineer at Allegro, explained how his company managed to detect anomalies, such as heavy web traffic after a successful commercial event, using their own simple model. He also disclosed why Allegro moved from the R language to a working solution in Scala.
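The talk did not publish Allegro’s model, but a “simple model” for traffic anomalies is often just a threshold on deviation from recent behavior. A minimal z-score sketch under that assumption, with purely hypothetical numbers:

```python
from statistics import mean, stdev

def detect_anomalies(series, window=5, threshold=3.0):
    """Flag indices whose value deviates more than `threshold` standard
    deviations from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical hourly request counts with a traffic spike at index 7.
traffic = [100, 102, 98, 101, 99, 100, 103, 400, 101, 100]
print(detect_anomalies(traffic))  # → [7]
```

A production system would need a longer window, seasonality handling, and robust statistics, but the shape of the approach is the same.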
Mariusz Strzelecki, Senior Data Engineer at Allegro, shared his experience with Hadoop application development using tools like MapReduce, Scala, and a number of APIs for submitting, scheduling, and monitoring jobs. He is also an expert in Kerberos.
Scalable big data solutions in science
Big Data Technology Warsaw Summit 2017 also featured some interesting presentations about the use of data engineering in science. Marek Wiewiórka, Solution Architect at GetInData, talked about genomic population studies, whose central issue is storing, analyzing, and interpreting various kinds of genomic variants. When the exomes and genomes of thousands of patients are being sequenced, there is a growing need for efficient database storage systems, querying engines, and powerful tools for statistical analyses. Scalable big data solutions such as Apache Impala, Apache Kudu, Apache Phoenix, or Apache Kylin can address many of the challenges in large-scale genomic analyses.
The presentation covered some of the lessons learned from a project aiming to create a data warehousing solution for storing and analyzing genomic variant information at the Department of Medical Genetics of the Medical University of Warsaw.
Ashish Tadose, Senior Data Architect at PubMatic, an online advertising software company from India, explained how AdTech companies need to cope with data growing at breakneck speed along with customer demands for insights and analytical reports. PubMatic receives billions of events and several TB of data per day from various geographic regions. This high-volume data needs to be processed in real time to derive actionable insights such as campaign decisions and audience targeting, and to provide a feedback loop to the ad server for making efficient ad-serving decisions. Ashish Tadose shared how PubMatic designed and implemented scalable, low-latency, real-time data processing solutions for these use cases using Apache Apex.
These are just some of the speakers at Big Data Technology Warsaw Summit 2017. The participants could talk to them during roundtable sessions that took place at 11 tables in two rounds. It was a unique chance to get to know some of the best big data specialists working on the world’s leading projects, and to learn from their experience.