PipeDream: generalized pipeline parallelism for DNN training
Deepak Narayanan | Aaron Harlap | Amar Phanishayee | Vivek Seshadri | Nikhil R. Devanur | Gregory R. Ganger | Phillip B. Gibbons | Matei Zaharia
[1] Kunle Olukotun, et al. Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark, 2018, ACM SIGOPS Oper. Syst. Rev.
[2] Alexander J. Smola, et al. Scaling Distributed Machine Learning with the Parameter Server, 2014, OSDI.
[3] Charles R. Qi, et al. Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks, 2018, ICML.
[4] Kunle Olukotun, et al. DAWNBench: An End-to-End Deep Learning Benchmark and Competition, 2017.
[5] Trevor Darrell, et al. Sequence to Sequence -- Video to Text, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[6] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[7] Amit Agarwal, et al. CNTK: Microsoft's Open-Source Deep-Learning Toolkit, 2016, KDD.
[8] Dong Yu, et al. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs, 2014, INTERSPEECH.
[9] Seunghak Lee, et al. Exploiting Bounded Staleness to Speed Up Big Data Analytics, 2014, USENIX Annual Technical Conference.
[10] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Tim Kraska, et al. Automating model search for large scale machine learning, 2015, SoCC.
[12] Trishul M. Chilimbi, et al. Project Adam: Building an Efficient and Scalable Deep Learning Training System, 2014, OSDI.
[13] Leslie G. Valiant, et al. A bridging model for parallel computation, 1990, CACM.
[14] Richard Socher, et al. Regularizing and Optimizing LSTM Language Models, 2017, ICLR.
[15] Eric P. Xing, et al. GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server, 2016, EuroSys.
[16] Amar Phanishayee, et al. Gist: Efficient Data Encoding for Deep Neural Network Training, 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
[17] Pengtao Xie, et al. Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters, 2017, USENIX Annual Technical Conference.
[18] Léon Bottou, et al. The Tradeoffs of Large Scale Learning, 2007, NIPS.
[19] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.
[20] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[21] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[22] Alexander Aiken, et al. Beyond Data and Model Parallelism for Deep Neural Networks, 2018, MLSys.
[23] Samy Bengio, et al. Revisiting Distributed Synchronous SGD, 2016, ArXiv.
[24] Lukás Burget, et al. Recurrent neural network based language model, 2010, INTERSPEECH.
[25] Carlo Luschi, et al. Revisiting Small Batch Training for Deep Neural Networks, 2018, ArXiv.
[26] Zheng Zhang, et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, 2015, ArXiv.
[27] Yuanzhou Yang, et al. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes, 2018, ArXiv.
[28] Alon Lavie, et al. Meteor Universal: Language Specific Translation Evaluation for Any Target Language, 2014, WMT@ACL.
[29] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.
[30] Tianqi Chen, et al. Training Deep Nets with Sublinear Memory Cost, 2016, ArXiv.
[31] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, ArXiv.
[32] William B. Dolan, et al. Collecting Highly Parallel Data for Paraphrase Evaluation, 2011, ACL.
[33] Nikhil R. Devanur, et al. PipeDream: Fast and Efficient Pipeline Parallel DNN Training, 2018, ArXiv.
[34] Dong Yu, et al. Pipelined Back-Propagation for Context-Dependent Deep Neural Networks, 2012, INTERSPEECH.
[35] Stephen J. Wright, et al. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, 2011, NIPS.
[36] Dong Yu, et al. On parallelizability of stochastic gradient descent for speech DNNs, 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[37] Quoc V. Le, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, 2018, ArXiv.
[38] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.
[39] George Kurian, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016, ArXiv.
[40] Seunghak Lee, et al. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server, 2013, NIPS.
[41] Samy Bengio, et al. Device Placement Optimization with Reinforcement Learning, 2017, ICML.
[42] Rajeev Thakur, et al. Optimization of Collective Communication Operations in MPICH, 2005, Int. J. High Perform. Comput. Appl.
[43] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.
[44] Bin Gu, et al. Decoupled Parallel Backpropagation with Convergence Guarantee, 2018, ICML.
[45] Trevor Darrell, et al. Caffe: Convolutional Architecture for Fast Feature Embedding, 2014, ACM Multimedia.