Deep feature synthesis: Towards automating data science endeavors

In this paper, we develop the Data Science Machine, which is able to derive predictive models from raw data automatically. To achieve this automation, we first propose and develop the Deep Feature Synthesis algorithm for automatically generating features for relational datasets. The algorithm follows relationships in the data to a base field, and then sequentially applies mathematical functions along that path to create the final feature. Second, we implement a generalizable machine learning pipeline and tune it using a novel Gaussian Copula process based approach. We entered the Data Science Machine in 3 data science competitions that featured 906 other data science teams. Our approach beats 615 teams in these data science competitions. In 2 of the 3 competitions we beat a majority of competitors, and in the third, we achieved 94% of the best competitor's score. In the best case, with an ongoing competition, we beat 85.6% of the teams and achieved 95.7% of the top submissions score.

[1]  David G. Lowe,et al.  Object recognition from local scale-invariant features , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[2]  Karl Rihaczek,et al.  1. WHAT IS DATA MINING? , 2019, Data Mining for the Social Sciences.

[3]  Ian Witten,et al.  Data Mining , 2000 .

[4]  Matthew A. Brown,et al.  Recognising panoramas , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[5]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[6]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[7]  Gerhard Weikum,et al.  WWW 2007 / Track: Semantic Web Session: Ontologies ABSTRACT YAGO: A Core of Semantic Knowledge , 2022 .

[8]  Jácint Szabó,et al.  Latent dirichlet allocation in web spam filtering , 2008, AIRWeb '08.

[9]  Carl E. Rasmussen,et al.  Gaussian processes for machine learning , 2005, Adaptive computation and machine learning.

[10]  Tim Berners-Lee,et al.  Linked Data - The Story So Far , 2009, Int. J. Semantic Web Inf. Syst..

[11]  Andrew Gordon Wilson,et al.  Copula Processes , 2010, NIPS.

[12]  Nando de Freitas,et al.  A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning , 2010, ArXiv.

[13]  Gjergji Kasneci,et al.  Automated feature generation from structured knowledge , 2011, CIKM '11.

[14]  Chong Wang,et al.  Collaborative topic modeling for recommending scientific articles , 2011, KDD.

[15]  Jeffrey D. Ullman,et al.  Mining of Massive Datasets: Data Mining , 2011 .

[16]  Pedro M. Domingos A few useful things to know about machine learning , 2012, Commun. ACM.

[17]  Jasper Snoek,et al.  Practical Bayesian Optimization of Machine Learning Algorithms , 2012, NIPS.

[18]  Johannes Fürnkranz,et al.  Unsupervised generation of data mining features from linked open data , 2012, WIMS '12.

[19]  Joshua B. Tenenbaum,et al.  Automatic Construction and Natural-Language Description of Nonparametric Regression Models , 2014, AAAI.