Meeting in the notebook: a notebook-based environment for micro-submissions in data science collaborations

Developers in data science and other domains frequently use computational notebooks to create exploratory analyses and prototype models. However, they often struggle to incorporate existing software engineering tooling into these notebook-based workflows, leading to fragile development processes. We introduce Assemblé, a new development environment for collaborative data science projects, in which promising code fragments of data science pipelines can be contributed as pull requests to an upstream repository entirely from within JupyterLab, abstracting away the use of low-level version control tools. We describe the design and implementation of Assemblé and report on a user study of 23 data scientists.
