The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox

To support complex data-intensive applications such as personalized recommendations, targeted advertising, and intelligent services, the data management community has focused heavily on the design of systems to support training complex models on large datasets. Unfortunately, the design of these systems largely ignores a critical component of the overall analytics process: the deployment and serving of models at scale. In this work, we present Velox, a new component of the Berkeley Data Analytics Stack. Velox is a data management system for facilitating the next steps in real-world, large-scale analytics pipelines: online model management, maintenance, and serving. Velox provides end-user applications and services with a low-latency, intuitive interface to models, transforming the raw statistical models currently trained using existing offline large-scale compute frameworks into full-blown, end-to-end data products capable of recommending products, targeting advertisements, and personalizing web content. To provide up-to-date results for these complex models, Velox also facilitates lightweight online model maintenance and selection (i.e., dynamic weighting). In this paper, we describe the challenges and architectural considerations required to achieve this functionality, including the abilities to span online and offline systems, to adaptively adjust model materialization strategies, and to exploit inherent statistical properties such as model error tolerance, all while operating at "Big Data" scale.
