Paper information - What Would a Data Scientist Ask? Automatically Formulating and Solving Predictive Problems

In this paper, we designed a formal language, called Trane, for describing prediction problems over relational datasets, and implemented a system that allows data scientists to specify problems in that language. We show that this language can describe a variety of prediction problems, including those posed on Kaggle, a data science competition website; we express 29 different Kaggle problems in it. We designed an interpreter that translates user input, specified in this language, into a series of transformation and aggregation operations that are applied to a dataset to generate labels for training a supervised machine learning classifier. Using a smaller subset of this language, we developed a system to automatically enumerate, interpret, and solve prediction problems. We tested this system on the Walmart Store Sales Forecasting dataset found on Kaggle, enumerated 1077 prediction problems, and built models that attempted to solve them, for which we produced 235 AUC scores. Considering that only one of those 1077 problems was the focus of a 2.5-month-long competition on Kaggle, we expect this system to deliver a thousandfold increase in data scientists' productivity.
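The abstract describes turning a problem specification into transformation and aggregation operations that produce labels, and enumerating many such problems automatically. The following is a minimal sketch of that idea in plain Python, not the Trane language or its interpreter. The toy table, the operation names (`identity`, `exceeds_100`, `any`, `sum`, `max`), and the functions `generate_labels` and `enumerate_problems` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy table: one row per (store, week) with weekly sales.
ROWS = [
    {"store": 1, "week": 1, "sales": 100.0},
    {"store": 1, "week": 2, "sales": 120.0},
    {"store": 1, "week": 3, "sales": 90.0},
    {"store": 2, "week": 1, "sales": 50.0},
    {"store": 2, "week": 2, "sales": 40.0},
    {"store": 2, "week": 3, "sales": 70.0},
]

# Illustrative operation vocabularies (assumed names, not Trane's own).
TRANSFORMS = {
    "identity": lambda value: value,
    "exceeds_100": lambda value: value > 100,
}
AGGREGATIONS = {
    "any": any,
    "sum": sum,
    "max": max,
}


def generate_labels(rows, entity, time_col, column, cutoff, transform, agg):
    """Label each entity using only rows strictly after the cutoff time.

    Each value is passed through the row-level transform, then the
    transformed values of one entity are reduced by the aggregation.
    """
    future = defaultdict(list)
    for row in sorted(rows, key=lambda r: r[time_col]):
        if row[time_col] > cutoff:
            future[row[entity]].append(TRANSFORMS[transform](row[column]))
    return {key: AGGREGATIONS[agg](values) for key, values in future.items()}


def enumerate_problems(columns):
    """List every (column, transform, aggregation) combination."""
    return list(product(columns, TRANSFORMS, AGGREGATIONS))


# "Will this store's weekly sales exceed 100 in any week after week 1?"
labels = generate_labels(ROWS, "store", "week", "sales", cutoff=1,
                         transform="exceeds_100", agg="any")
problems = enumerate_problems(["sales"])
```

Here `labels` maps each store to a binary target that a supervised classifier could be trained on, using features computed from data at or before the cutoff. Not every enumerated combination is a meaningful problem; a real system would need rules to filter out invalid or uninteresting ones.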

Kalyan Veeramachaneni | Benjamin Schreck
