How the World's Largest Streaming Service Builds Its Recommendation ML Pipeline with Apache Spark

Netflix Uses Apache Spark to Serve 80+ Million Members in Over 250 Countries and to Inform Nearly Every Aspect of Its Products

Netflix uses machine learning on Apache Spark to inform nearly every aspect of its products, from recommendations to decisions about which TV shows and movies get made.

"Apache Spark enables Netflix to use a single, unified framework/API – for ETL, feature generation, model training, and validation.

With pipeline framework in Spark ML, each step within the Netflix recommendation pipeline (e.g. label generation, feature encoding, model training, model evaluation) is encapsulated as Transformers, Estimators and Evaluators – enabling modularity, composability and testability.

Thus, Netflix engineers can build our own feature engineering logics as Transformers, learning algorithms as Estimators, and customized metrics as Evaluators, and with these building blocks, we can more easily experiment with new pipelines and rapidly deploy them to production." - DB Tsai, Senior Research Scientist @Netflix, 2017
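
To make the Transformer / Estimator / Evaluator split in the quote concrete, here is a minimal, dependency-free Python sketch of the pattern. It is not Spark ML's actual API and not Netflix's code: the base classes only mimic the contracts described above (a Transformer maps a dataset to a dataset, an Estimator is fit on a dataset and returns a Transformer, an Evaluator scores predictions), and the stage names (WatchFractionEncoder, MidpointThreshold, Accuracy), the column names, and the toy data are all hypothetical.

```python
from typing import Dict, List

Row = Dict[str, float]


class Transformer:
    """Maps a dataset to a new dataset (e.g. feature encoding)."""
    def transform(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError


class Estimator:
    """Learns from a dataset and returns a fitted Transformer."""
    def fit(self, rows: List[Row]) -> Transformer:
        raise NotImplementedError


class Evaluator:
    """Scores a dataset that already carries predictions."""
    def evaluate(self, rows: List[Row]) -> float:
        raise NotImplementedError


class WatchFractionEncoder(Transformer):
    # Hypothetical feature: fraction of a title's runtime that was watched.
    def transform(self, rows):
        return [{**r, "feature": r["minutes_watched"] / r["runtime"]} for r in rows]


class ThresholdModel(Transformer):
    # Fitted model: predicts 1.0 when the feature reaches the learned threshold.
    def __init__(self, threshold: float):
        self.threshold = threshold

    def transform(self, rows):
        return [{**r, "prediction": float(r["feature"] >= self.threshold)} for r in rows]


class MidpointThreshold(Estimator):
    # Toy learner: threshold halfway between the mean feature of each class.
    def fit(self, rows):
        pos = [r["feature"] for r in rows if r["label"] == 1.0]
        neg = [r["feature"] for r in rows if r["label"] == 0.0]
        return ThresholdModel((sum(pos) / len(pos) + sum(neg) / len(neg)) / 2)


class Accuracy(Evaluator):
    # Custom metric: share of rows whose prediction matches the label.
    def evaluate(self, rows):
        return sum(r["prediction"] == r["label"] for r in rows) / len(rows)


class PipelineModel(Transformer):
    # A fitted pipeline: applies every fitted stage in order.
    def __init__(self, stages: List[Transformer]):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    # Fits Estimator stages in order; Transformer stages are used as-is.
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if isinstance(stage, Estimator) else stage
            rows = model.transform(rows)  # feed this stage's output to the next
            fitted.append(model)
        return PipelineModel(fitted)


# Toy training data (made up for illustration).
train = [
    {"minutes_watched": 90.0, "runtime": 100.0, "label": 1.0},
    {"minutes_watched": 80.0, "runtime": 100.0, "label": 1.0},
    {"minutes_watched": 10.0, "runtime": 100.0, "label": 0.0},
    {"minutes_watched": 20.0, "runtime": 100.0, "label": 0.0},
]

model = Pipeline([WatchFractionEncoder(), MidpointThreshold()]).fit(train)
accuracy = Accuracy().evaluate(model.transform(train))
print(accuracy)
```

Because every stage shares the same small interface, a different encoder, learner, or metric can be swapped in without touching the rest of the pipeline, and each stage can be unit-tested on its own. That is the modularity and testability the quote refers to.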