Powering big data at Pinterest with Apache Hadoop
We currently log 20 terabytes of new data each day, and have around 10 petabytes of data. We use Hadoop to process this data, which enables us to put the most relevant and recent content in front of Pinners through features such as Related Pins, Guided Search, and image processing. It also powers thousands of daily metrics and allows us to put every user-facing change through rigorous experimentation and analysis. In order to build big data applications quickly, we’ve evolved our single-cluster Hadoop infrastructure into a ubiquitous self-service platform.
Before choosing among the available solutions, we mapped out our requirements for a Hadoop setup:
Isolated multitenancy: We run many MapReduce applications with very different software requirements and configurations. Developers should be able to customize their jobs without impacting other users’ jobs.
Elasticity: Batch processing often requires burst capacity to support experimental development and backfills. In an ideal setup, you could ramp up to multi-thousand node clusters and scale back down without any interruptions or data loss.
Multi-cluster support: While it’s possible to scale a single Hadoop cluster horizontally, we’ve found that a) getting perfect isolation/elasticity can be difficult to achieve and b) business requirements such as privacy, security and cost allocation make it more practical to support multiple clusters.
Support for ephemeral clusters: Users should be able to spawn clusters and leave them up for as long as they need. Clusters should spawn in a reasonable amount of time and come with full-blown support for all Hadoop jobs without manual configuration.
Easy software package deployment: We need to provide developers simple interfaces to several layers of customization from the OS and Hadoop layers to job specific scripts.
Shared data store: Regardless of which cluster a job runs on, it should be possible to access data produced by other clusters.
Access control layer: Just like any other service-oriented system, you need to be able to add and modify access quickly (i.e., not via SSH keys). Ideally, you could integrate with an existing identity provider (e.g., via OAuth).
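To make the shared data store requirement concrete, one common pattern (a sketch on our part, not necessarily the exact setup described here) is to back every cluster with a shared object store such as S3 instead of cluster-local HDFS, so any cluster can read what another cluster produced:

```xml
<!-- Hypothetical core-site.xml fragment; the bucket name is made up. -->
<property>
  <name>fs.defaultFS</name>
  <value>s3a://shared-data-warehouse</value>
</property>
```

With this in place, ephemeral clusters can come and go without data loss, since no data of record lives on the cluster itself.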
Where we are today

With our current setup, Hadoop is a flexible service that’s adopted across the organization with minimal operational overhead. We have over 100 regular MapReduce users running over 2,000 jobs each day through Qubole’s web interface, ad-hoc jobs, and scheduled workflows.
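For readers unfamiliar with the MapReduce model these jobs follow, here is a minimal, illustrative word-count sketch in pure Python (the function names and sample data are ours, not an actual Pinterest job); the map phase emits key-value pairs and the reduce phase aggregates them per key, mirroring what Hadoop does at scale:

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map phase: emit one (word, 1) pair per token.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: sum counts per word. groupby requires sorted input,
    # which mirrors Hadoop's shuffle-and-sort guarantee between phases.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

# Tiny in-memory demo of the two phases chained together.
lines = ["hadoop powers pinterest", "pinterest logs data", "hadoop logs"]
counts = dict(reducer(mapper(lines)))
```

In a real cluster the mapper and reducer would run as separate distributed tasks (for example via Hadoop Streaming), with the framework handling partitioning and sorting between them.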