What Would a Data Scientist Ask? Automatically Formulating and Solving Predictive Problems

In this paper, we designed a formal language, called Trane, for describing prediction problems over relational datasets, and implemented a system that allows data scientists to specify problems in that language. We show that this language can describe a range of prediction problems, including those posed on KAGGLE, a data science competition website; we express 29 different KAGGLE problems in it. We designed an interpreter that translates a problem specified by the user in this language into a series of transformation and aggregation operations, which are applied to a dataset to generate labels for training a supervised machine learning classifier. Using a subset of this language, we developed a system that automatically enumerates, interprets, and solves prediction problems. We tested this system on the Walmart Store Sales Forecasting dataset from KAGGLE: we enumerated 1077 prediction problems and built models that attempted to solve them, for which we produced 235 AUC scores. Considering that only one of those 1077 problems was the focus of a 2.5-month-long competition on KAGGLE, we expect this system to deliver a thousandfold increase in data scientists' productivity.
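
To make the idea of generating labels through transformation and aggregation operations concrete, the following is a minimal Python sketch. It is not Trane's syntax or implementation, neither of which is given in this abstract. The operation names (`identity`, `diff`, `sum`, `last`), the toy rows, the threshold step, and the functions `generate_labels` and `enumerate_problems` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import product

# Toy relational data (made up): one row per (store, week) observation.
ROWS = [
    {"store": 1, "week": 1, "sales": 100.0},
    {"store": 1, "week": 2, "sales": 140.0},
    {"store": 2, "week": 1, "sales": 80.0},
    {"store": 2, "week": 2, "sales": 60.0},
]

# Hypothetical transformation operations: map a time-ordered series to a new series.
TRANSFORMS = {
    "identity": lambda xs: list(xs),
    "diff": lambda xs: [b - a for a, b in zip(xs, xs[1:])],
}

# Hypothetical aggregation operations: collapse a series to a single value.
AGGREGATIONS = {
    "sum": sum,
    "last": lambda xs: xs[-1],
}


def generate_labels(rows, entity, time, column, transform, aggregation, threshold):
    """Apply one transformation and one aggregation per entity, then
    compare the result to a threshold to obtain a binary label."""
    series = defaultdict(list)
    # Group the chosen column by entity, ordered by time within each entity.
    for row in sorted(rows, key=lambda r: (r[entity], r[time])):
        series[row[entity]].append(row[column])

    labels = {}
    for key, values in series.items():
        transformed = TRANSFORMS[transform](values)
        if not transformed:
            # Too few observations for this transformation; no label.
            continue
        labels[key] = AGGREGATIONS[aggregation](transformed) > threshold
    return labels


def enumerate_problems():
    """List every (transformation, aggregation) pair as a candidate problem."""
    return list(product(TRANSFORMS, AGGREGATIONS))
```

With the toy data above, `generate_labels(ROWS, "store", "week", "sales", "diff", "sum", 0.0)` labels store 1 as `True` and store 2 as `False`, which corresponds to the question "did sales increase over the observed weeks?". `enumerate_problems()` yields the four operation pairs available in this sketch. Each pair defines a different labeling, and it is this kind of combinatorial enumeration that a system of the sort described could then interpret and attempt to solve.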