Paper information - Enabling collaborative data science development with the Ballet framework

While the open-source model for software development has led to successful large-scale collaborations in building software systems, data science projects are frequently developed by individuals or small groups. We describe challenges to scaling data science collaborations and present a novel ML programming model to address them. We instantiate these ideas in Ballet, a lightweight software framework for collaborative open-source data science and a cloud-based development environment, with a plugin for collaborative feature engineering. Using our framework, collaborators incrementally propose feature definitions to a repository; each proposed definition is subjected to an ML evaluation and can be automatically merged into an executable feature engineering pipeline. We leverage Ballet to conduct an extensive case study analysis of a real-world income prediction problem, and discuss implications for collaborative projects.

Kalyan Veeramachaneni | Micah J. Smith | Jurgen Cito | Kelvin Lu
