Differentially- and non-differentially-private random decision trees

We consider supervised learning with random decision trees, where the tree construction is completely random. The method is widely used and works well in practice despite the simplicity of the setting, but its statistical mechanism is not yet well understood. In this paper we provide strong theoretical guarantees for learning with random decision trees. We analyze and compare three variants of the algorithm that have minimal memory requirements: majority voting, threshold averaging, and probabilistic averaging. Because the structure of the tree is random and thus independent of the data, these methods adapt naturally to a differentially-private setting, so we also propose differentially-private versions of all three schemes. We give upper bounds on the generalization error and mathematically explain how the accuracy depends on the number of random decision trees. Furthermore, we prove that a number of independently selected random decision trees that is only logarithmic in the size of the dataset suffices to correctly classify most of the data, even when differential-privacy guarantees must be maintained. We show empirically that majority voting and threshold averaging give the best accuracy, even for conservative users requiring strong privacy guarantees. Finally, we demonstrate that the simple majority voting rule is an especially good candidate for the differentially-private classifier, since it is much less sensitive to the choice of forest parameters than the other methods.
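To make the three aggregation schemes and their private counterparts concrete, here is a minimal sketch in Python. It is our own illustration under stated assumptions, not the paper's pseudocode: the function names, the [0, 1]-valued features, and the per-tree privacy budget `epsilon` are all hypothetical, and the noise is calibrated in the standard Laplace-mechanism fashion (scale 2/epsilon for two unit-sensitivity count vectors per tree). Splitting an overall budget across the trees of the forest (by sequential composition) is left to the caller.

```python
import numpy as np

# Minimal sketch (our own illustration, not the paper's exact pseudocode):
# completely random binary trees over features in [0, 1]^d, binary labels.
rng = np.random.default_rng(0)

def build_tree(n_features, depth):
    """Random structure: every internal node picks a feature and a split
    threshold uniformly at random, independently of the training data."""
    n_internal = 2 ** depth - 1
    features = rng.integers(0, n_features, size=n_internal)
    thresholds = rng.random(n_internal)
    return features, thresholds

def leaf_index(tree, x):
    """Route a sample down the tree; returns a leaf id in [0, 2**depth)."""
    features, thresholds = tree
    n_internal = len(features)
    node = 0
    while node < n_internal:
        node = 2 * node + 1 + int(x[features[node]] > thresholds[node])
    return node - n_internal

def leaf_stats(tree, X, y, depth, epsilon=None):
    """Per-leaf counts of positive and of all training examples. With a
    finite per-tree budget epsilon, Laplace noise of scale 2/epsilon is
    added to every count (sensitivity 1, budget split over the two count
    vectors) -- the standard Laplace mechanism."""
    n_leaves = 2 ** depth
    pos, tot = np.zeros(n_leaves), np.zeros(n_leaves)
    for xi, yi in zip(X, y):
        leaf = leaf_index(tree, xi)
        pos[leaf] += yi
        tot[leaf] += 1
    if epsilon is not None:
        pos += rng.laplace(scale=2.0 / epsilon, size=n_leaves)
        tot += rng.laplace(scale=2.0 / epsilon, size=n_leaves)
    return pos, tot

def forest_predict(trees, stats, x, rule="majority"):
    """Combine the per-tree estimates p_t = (#positive / #total) in the
    leaf containing x, using one of the three aggregation schemes."""
    probs = []
    for tree, (pos, tot) in zip(trees, stats):
        leaf = leaf_index(tree, x)
        if tot[leaf] > 0:  # guard against empty (or noised-away) leaves
            probs.append(float(np.clip(pos[leaf] / tot[leaf], 0.0, 1.0)))
    if not probs:
        return int(rng.integers(0, 2))  # no information: random guess
    if rule == "majority":   # each tree votes its leaf's majority label
        return int(np.mean([p > 0.5 for p in probs]) > 0.5)
    avg = float(np.mean(probs))
    if rule == "threshold":  # threshold the averaged estimate at 1/2
        return int(avg > 0.5)
    return int(rng.random() < avg)  # probabilistic averaging
```

A short usage example under the same assumptions:

```python
# Toy run: 40 trees of depth 6 on synthetic data; epsilon=None gives the
# non-private variants, a finite epsilon the differentially-private ones.
X = rng.random((2000, 10))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
trees = [build_tree(n_features=10, depth=6) for _ in range(40)]
stats = [leaf_stats(t, X, y, depth=6, epsilon=1.0) for t in trees]
pred = forest_predict(trees, stats, X[0], rule="majority")
```

The point the sketch makes explicit is the one the abstract relies on: the tree structure is drawn without looking at the data, so only the leaf counts touch private records, and only they need noise.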
