Learning in decentralized systems: a nonparametric approach

Rapid advances in information technology have led to increased deployment of decentralized decision-making systems embedded within large-scale infrastructure consisting of data collection and processing devices. In such a system, each statistical decision is made on the basis of a limited amount of data because of constraints imposed by the decentralized system. For instance, the constraints may be imposed by limits on energy, communication bandwidth, computation, or time budget. A fundamental research problem arising in decentralized systems is the development of methods that take into account not only the statistical accuracy of decision-making procedures but also the constraints imposed by system limits. It is this general problem that drives this thesis. In particular, we focus on the development and analysis of statistical learning methods for decentralized decision-making using a nonparametric approach. The nonparametric approach imposes very few a priori assumptions on the data; this flexibility makes it applicable to a wide range of applications. Combining this approach with tools from convex analysis and empirical process theory, we develop computationally efficient algorithms and analyze their statistical behavior both theoretically and empirically. Our specific contributions include the following. We develop a novel kernel-based algorithm for centralized detection and estimation in ad hoc sensor networks through the challenging task of sensor mote localization. Next, we develop and analyze a nonparametric decentralized detection algorithm using the methodology of convex surrogate loss functions and marginalized kernels. The analysis of this algorithm leads to an in-depth study of the correspondence between the class of surrogate loss functions widely used in statistical machine learning and the class of divergence functionals widely used in information theory.
This correspondence allows us to provide an interesting decision-theoretic justification for a given choice of divergence functionals, which often arise from asymptotic analysis. In addition, this correspondence motivates the development and analysis of a novel M-estimation procedure for estimating divergence functionals and the likelihood ratio. Finally, we investigate a sequential setting of the decentralized detection algorithm and settle an open question regarding the characterization of optimal decision rules in such a setting.
