A scalable bootstrap for massive data

The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large data sets—which are increasingly prevalent—the calculation of bootstrap‐based quantities can be prohibitively demanding computationally. Although variants such as subsampling and the m out of n bootstrap can in principle reduce the cost of bootstrap computations, these methods are generally not robust to the specification of tuning parameters (such as the number of subsampled data points), and they often require knowledge of the estimator's convergence rate, in contrast with the bootstrap. As an alternative, we introduce the ‘bag of little bootstraps’ (BLB), a new procedure that incorporates features of both the bootstrap and subsampling to yield a robust, computationally efficient means of assessing the quality of estimators. The BLB is well suited to modern parallel and distributed computing architectures and furthermore retains the generic applicability and statistical efficiency of the bootstrap. We demonstrate the BLB's favourable statistical performance via a theoretical analysis elucidating the procedure's properties, as well as a simulation study comparing the BLB with the bootstrap, the m out of n bootstrap and subsampling. In addition, we present results from a large‐scale distributed implementation of the BLB demonstrating its computational superiority on massive data, a method for adaptively selecting the BLB's tuning parameters, an empirical study applying the BLB to several real data sets and an extension of the BLB to time series data.
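The core idea described above—drawing small subsamples, then resampling each back up to the full data size via multinomial weights and averaging the resulting quality assessments—can be sketched as follows. This is a minimal illustration, not the authors' implementation: the estimator is assumed to be the sample mean, the quality measure its standard error, and the parameter names `gamma`, `s` and `r` (subsample-size exponent, number of subsamples, resamples per subsample) follow common expository defaults.

```python
import numpy as np

def blb_stderr(data, gamma=0.7, s=10, r=50, rng=None):
    """Bag-of-little-bootstraps estimate of the standard error of the mean.

    Illustrative sketch: each of s subsamples of size b = n**gamma is
    resampled r times to full size n via multinomial counts, the standard
    error is computed within each subsample, and the results are averaged.
    """
    rng = np.random.default_rng(rng)
    n = len(data)
    b = int(n ** gamma)  # 'little bootstrap' subsample size, b << n
    per_subsample_se = []
    for _ in range(s):
        # Draw a subsample of b distinct points
        sub = rng.choice(data, size=b, replace=False)
        resampled_means = []
        for _ in range(r):
            # Resample n points from the b subsampled points; only the
            # multinomial counts are needed, so cost scales with b, not n
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            resampled_means.append(np.average(sub, weights=counts))
        # Quality assessment (here: std error) within this subsample
        per_subsample_se.append(np.std(resampled_means, ddof=1))
    # Average the assessments across subsamples
    return float(np.mean(per_subsample_se))
```

Because each subsample's inner loop touches only b points (with integer resampling weights), the s subsamples can be processed independently on separate workers, which is the property that makes the procedure attractive on distributed architectures.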
