Data serves little purpose if we cannot find it; looking up individual records fast across 100+ petabytes of data is what enables us to gather useful insights.

Uber’s Big Data Platform: 100+ Petabytes with Minute Latency

On a daily basis, tens of terabytes of new data were added to our data lake, and our Big Data platform grew to over 10,000 vcores with over 100,000 batch jobs running on any given day. This resulted in our Hadoop data lake becoming the centralized source-of-truth for all analytical Uber data.

"This resulted in our Hadoop data lake becoming the centralized source-of-truth for all analytical Uber data." - Reza Shiftehfar @Uber

"Uber is committed to delivering safer and more reliable transportation across our global markets. To accomplish this, Uber relies heavily on making data-driven decisions at every level, from forecasting rider demand during high traffic events to identifying and addressing bottlenecks in our driver-partner sign-up process. Over time, the need for more insights has resulted in over 100 petabytes of analytical data that needs to be cleaned, stored, and served with minimum latency through our Hadoop-based Big Data platform."

"(Apache) HBase and (Apache) Cassandra are two key-value stores widely used at Uber. For our global indexing solution, we chose to use (Apache) HBase for the following reasons:

HBase permits only consistent reads and writes, so there is no need to tweak consistency parameters.

HBase provides automatic rebalancing of HBase tables within a cluster. The master-slave architecture provides a global view of how each dataset is spread across the cluster, which we utilize in customizing dataset-specific throughputs to our HBase cluster.
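The global view the master maintains is essentially a sorted list of non-overlapping region boundaries, so routing any row key to its region server is a binary search over region start keys. A minimal Python sketch (not Uber's code; the region layout and server names below are hypothetical):

```python
import bisect

# Hypothetical region layout: sorted, non-overlapping start keys.
# The first region's start key is the empty string (it owns all keys
# smaller than the next region's start key). In HBase, the master
# tracks this assignment and can rebalance regions across servers.
REGION_START_KEYS = ["", "driver:4000", "rider:2000", "trip:8000"]
REGION_SERVERS = ["rs-1", "rs-2", "rs-3", "rs-4"]

def locate_region(row_key: str) -> str:
    """Return the server owning the region whose range contains row_key."""
    # bisect_right finds the first start key strictly greater than
    # row_key; the region just before that position owns the key.
    idx = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_SERVERS[idx]
```

Because ranges are non-overlapping and cover the whole key space, every lookup resolves to exactly one region, which is what makes dataset-wide balancing decisions tractable.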

Generating and uploading indexes with HFiles

We generate indexes in (Apache) HBase’s internal storage file format, referred to as HFile, and upload them to our (Apache) HBase cluster. (Apache) HBase partitions data into sorted, non-overlapping key ranges across region servers, stored in the HFile format. Within each HFile, data is sorted based on the key value and the column name. To generate HFiles in the format expected by (Apache) HBase, we use Apache Spark to execute large, distributed operations across a cluster of machines."
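The HFile-generation step described above boils down to a distributed sort-and-partition: order records by (row key, column name), then split them at region boundaries so each output file covers exactly one region's range. A toy single-process sketch of that layout, assuming nothing about Uber's actual pipeline (in practice Spark performs the sort and shuffle at scale):

```python
import bisect

def build_hfile_partitions(records, region_start_keys):
    """Sort (row_key, column, value) records and split them into
    non-overlapping, sorted partitions -- one per region -- mimicking
    the layout HBase expects of bulk-loaded HFiles.

    region_start_keys: sorted list of region start keys; the first
    entry is "" so the first region owns the lowest keys.
    """
    # Within each HFile, data is sorted by key value and column name.
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    partitions = [[] for _ in region_start_keys]
    for rec in ordered:
        # Route each record to the region whose range contains its key.
        idx = bisect.bisect_right(region_start_keys, rec[0]) - 1
        partitions[idx].append(rec)
    return partitions
```

Because the input to each partition is already globally sorted and range-aligned, the resulting files can be handed directly to the region servers without any further splitting or re-sorting, which is what makes bulk loading cheap compared to row-by-row writes.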