FeatureHub: Towards Collaborative Data Science

Feature engineering is a critical step in a successful data science pipeline. This step, in which raw variables are transformed into features ready for inclusion in a machine learning model, can be one of the most challenging aspects of a data science effort. We propose a new paradigm for feature engineering in a collaborative framework and instantiate this idea in a platform, FeatureHub. In our approach, independent data scientists collaborate on a feature engineering task, viewing and discussing each other's features in real time. Feature engineering source code created by independent data scientists is then integrated into a single predictive machine learning model. Our platform includes an automated machine learning backend that abstracts away model training, selection, and tuning, allowing users to focus on feature engineering while still receiving immediate feedback on the performance of their features. We use a tightly integrated forum, native feature discovery APIs, and targeted compensation mechanisms to facilitate and incentivize collaboration among data scientists. This approach can reduce the redundant work done by independent or competing data scientists while decreasing time to task completion. In experimental results, automatically generated models using crowdsourced features show performance within 0.03 or 0.05 points of winning submissions, with minimal human oversight.

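To make the described workflow concrete, the sketch below shows how a contributed feature might be written and immediately scored against a shared feature matrix. It is only an illustration of the idea under stated assumptions: the function names (extract_feature, evaluate_feature), the toy customer/orders schema, and the use of scikit-learn cross-validation as a stand-in for FeatureHub's automated machine learning backend are all hypothetical, not the platform's actual API.

# Illustrative sketch only: names, schema, and the scikit-learn stand-in for
# the automated machine learning backend are assumptions, not FeatureHub's API.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def extract_feature(dataset: dict) -> pd.Series:
    """A contributed feature: total order amount per customer (hypothetical schema)."""
    orders = dataset["orders"]
    return orders.groupby("customer_id")["amount"].sum().rename("total_spend")


def evaluate_feature(feature: pd.Series, feature_matrix: pd.DataFrame,
                     target: pd.Series) -> float:
    """Join the candidate feature onto the shared feature matrix and return a
    cross-validated score, mimicking the immediate feedback the backend provides."""
    candidate = feature_matrix.join(feature).fillna(0.0)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(model, candidate, target, cv=3, scoring="roc_auc").mean()


if __name__ == "__main__":
    # Synthetic stand-in data for a two-table prediction problem.
    rng = np.random.default_rng(0)
    customers = pd.DataFrame({"age": rng.integers(18, 80, 200)},
                             index=pd.Index(range(200), name="customer_id"))
    orders = pd.DataFrame({
        "customer_id": rng.integers(0, 200, 1000),
        "amount": rng.gamma(2.0, 50.0, 1000),
    })
    target = pd.Series(rng.integers(0, 2, 200), index=customers.index)

    feature = extract_feature({"orders": orders})
    print("CV AUC with contributed feature:",
          evaluate_feature(feature, customers, target))

In the real platform, many such contributed functions would be collected, their outputs merged into a single feature matrix, and the combined model trained, selected, and tuned automatically; the sketch compresses that loop into a single local evaluation call.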