Real-time access to hundreds of petabytes of data

How do you create an extensible Big Data platform for advanced analytics? How do you scale a data platform from a few terabytes to over 100 petabytes while reducing data latency from 24+ hours to minutes? At Big Data Tech Warsaw Summit 2020, Reza Shiftehfar, leader of the Hadoop Platform teams at Uber, told the story of building and growing Uber’s reliable and scalable Big Data platform. It serves petabytes of data in a real-time fashion using technologies such as Apache Hadoop, Apache Hive, Apache Kafka, Apache Spark, and Presto.

What is your role at Uber?
I manage the Hadoop platform team at Uber. That’s the team that provides Big Data as a service to the rest of the company. I’ve been with Uber since 2014. That was the time when we were a small start-up; our data was small enough to fit in a single traditional Postgres database. Over the past seven years, we have grown our data from a few terabytes to hundreds of petabytes. We had to build a horizontally scalable infrastructure.

What was the growth of the company at that time, and how did it affect the data infrastructure?
It took us almost six years to complete 1 billion trips on the platform. But once we did that, we did the second billion in six months. We did another billion in five months, and another in four months. If you look at this from an infrastructure perspective, you basically need to start planning for your next architecture, because this growth is exponential and scaling the infrastructure is not that easy.
Let’s look at the range of products that we provide. We started in 2010 by providing just trips. It’s a pretty straightforward product: you have one driver, one rider, and one trip happening at a time. Then we launched UberPool. It’s a much better experience for the user. It’s cheaper, but it’s much more complicated: you have one driver, two riders, and two trips that overlap with each other.
Then we had Uber Eats. It’s even more complicated. We have three parties involved: the person who placed the order, a restaurant partner, and someone to pick up and deliver the food. It requires two trips: one to get to the restaurant and pick up the food, and another to deliver the order. Then we had Uber Freight, with even more parties involved. So the maturity and complexity of the products keep increasing, and the amount of data generated increases as well. You have to have a pretty reliable data infrastructure to support an even better experience for the users.

How is data used at Uber?
Data is at the heart of our company. We use previous data, the experience of previous users, to create a much better and much smoother experience for the next person taking a ride. Looking internally at the data users in the company, from an infrastructure perspective, we have three main categories of users.
We have thousands of city operators. These are on-the-ground crews in every city that manage and scale our transportation network in that market. Each of these is like a small start-up within Uber, and their goal is to increase the number of trips in that city. They are operational teams. They do not have a technical background and they do not write code, only some simple SQL statements. Their technical skills are low, but their dependency on the data is high.
We also have hundreds of data scientists and analysts. They look back at the previous data and try out different ideas and different products. They do have technical skills: they can write queries, they can write code, and they understand the data content. They use a large amount of historical data.

We also have hundreds of engineers across the company who rely on data. They build services that provide, for example, estimated delivery times for food. They have high coding skills, and their dependence on data is pretty high.

What is Uber’s scale in terms of numbers?
In terms of some numbers, we have over 2 trillion unique events generated every day, over 20,000 physical hosts in each of our data centers, and hundreds of petabytes of data in our HDFS clusters.

We generate over 200 terabytes of new data per day, and we have several hundred thousand queries running on this data every day. At any given time, there are over 2,000 unique dashboards used by different parts of the company to see how the business is operating.

What was your philosophy when building an infrastructure platform at petabyte scale?
We knew that no single company could build everything from scratch. There are very few companies in the world operating at this scale with this amount of data. The best way to build a reliable platform is to cooperate with them and rely on the open-source community. We are heavy users of the most popular open-source Big Data products: Spark, Hive, Presto, HDFS, and Parquet.
At the same time, many of these products do not work out of the box at such a scale. When we start using them, we start to see their limitations. So we try to re-architect and redesign those components so that they can scale, and again, we try our best to contribute back to the open-source community. Good examples are the observer nodes in HDFS or the geospatial support in Presto.
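To give a flavour of what that Presto geospatial support enables, here is a minimal, purely illustrative sketch rather than anything from Uber's codebase. It submits a query using Presto's geospatial functions (ST_Point, ST_GeometryFromText, ST_Contains) through the open-source presto-python-client; the host, catalog, table and column names, and the polygon coordinates are assumptions made up for this example.

```python
# Illustrative only: count pickups that fall inside a rough bounding polygon.
# Connection details, the "trips" table, and its columns are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.internal",  # hypothetical Presto coordinator
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)

sql = """
SELECT count(*) AS pickups_in_zone
FROM trips
WHERE ST_Contains(
        ST_GeometryFromText(
          'POLYGON ((20.85 52.10, 21.25 52.10, 21.25 52.35, 20.85 52.35, 20.85 52.10))'),
        ST_Point(pickup_lon, pickup_lat))
"""

cur = conn.cursor()
cur.execute(sql)
print(cur.fetchall())  # e.g. [[123456]]
```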
Finally, from time to time, we see that there is a gap in the open-source community: there is no single product that addresses our need, or there are way too many products but none built for our scale. That is when we start our own projects from scratch, and we try our best to open-source them as well. A few examples from my team are Apache Hudi and Marmaray.
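As a rough illustration of what Hudi adds on top of plain Parquet tables, a minimal PySpark sketch might look like the following. This is not Uber's production pipeline: the table name, field names, and paths are invented for the example, and it assumes the Hudi Spark bundle is already on the Spark classpath. It upserts a batch of trip records and then reads back only the rows committed after a given instant, which is the kind of incremental processing that keeps data latency in minutes rather than hours.

```python
# Minimal sketch, assuming Apache Hudi's Spark bundle is on the classpath
# (e.g. launched with --packages org.apache.hudi:hudi-spark-bundle_2.12:<version>).
# Table, columns, and paths below are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  # recommended by Hudi docs
    .getOrCreate()
)

base_path = "hdfs:///warehouse/trips"  # hypothetical table location

updates = spark.createDataFrame(
    [("t-1001", "warsaw", 12.5, "2020-02-27 10:00:00")],
    ["trip_id", "city", "fare", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",      # dedup/update key
    "hoodie.datasource.write.partitionpath.field": "city",     # partition layout
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest record wins
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert: existing trip_ids are updated in place, new ones are inserted,
# instead of rewriting whole partitions as a plain Parquet pipeline would.
updates.write.format("hudi").options(**hudi_options).mode("append").save(base_path)

# Incremental read: pull only records committed after a known instant,
# so downstream jobs can run every few minutes instead of once a day.
incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20200227000000")
    .load(base_path)
)
incremental.show()
```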
