Real-Time Machine Learning: The Missing Pieces

Machine learning applications are increasingly deployed not only to serve predictions using static models, but also as tightly-integrated components of feedback loops involving dynamic, real-time decision making. These applications pose a new set of requirements, none of which are difficult to achieve in isolation, but the combination of which creates a challenge for existing distributed execution frameworks: computation with millisecond latency at high throughput, adaptive construction of arbitrary task graphs, and execution of heterogeneous kernels over diverse sets of resources. We assert that a new distributed execution framework is needed for such ML applications and propose a candidate approach with a proof-of-concept architecture that achieves a 63x performance improvement over a state-of-the-art execution framework for a representative application.

[1]  Carl Hewitt,et al.  The incremental garbage collection of processes , 1977, Artificial Intelligence and Programming Languages.

[2]  Joe Armstrong,et al.  Concurrent programming in ERLANG , 1993 .

[3]  Claes Wikström,et al.  Concurrent programming in ERLANG (2nd ed.) , 1996 .

[4]  George Bosilca,et al.  Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation , 2004, PVM/MPI.

[5]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[6]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[7]  Anant Agarwal,et al.  Factored operating systems (fos): the case for a scalable operating system for multicores , 2009, OPSR.

[8]  James R. Larus,et al.  Orleans: cloud computing for everyone , 2011, SoCC.

[9]  Steven Hand,et al.  CIEL: A Universal Execution Engine for Distributed Data-Flow Computing , 2011, NSDI.

[10]  M. Abadi,et al.  Naiad: a timely dataflow system , 2013, SOSP.

[11]  Tao Wang,et al.  Deep learning with COTS HPC systems , 2013, ICML.

[12]  Fernando M. V. Ramos,et al.  Software-Defined Networking: A Comprehensive Survey , 2014, Proceedings of the IEEE.

[13]  Zheng Zhang,et al.  MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems , 2015, ArXiv.

[14]  Matthew Rocklin,et al.  Dask: Parallel Computation with Blocked algorithms and Task Scheduling , 2015, SciPy.

[15]  Michael I. Jordan,et al.  The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox , 2014, CIDR.

[16]  Shane Legg,et al.  Massively Parallel Methods for Deep Reinforcement Learning , 2015, ArXiv.

[17]  Pieter Abbeel,et al.  Benchmarking Deep Reinforcement Learning for Continuous Control , 2016, ICML.

[18]  Reynold Xin,et al.  Apache Spark , 2016 .

[19]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[20]  Demis Hassabis,et al.  Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[21]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.