Probabilistic noise identification and data cleaning

Real-world data is never as perfect as we would like it to be and often suffers from corruptions that may affect interpretations of the data, models created from the data, and decisions made based on the data. One approach to this problem is to identify and remove records that contain corruptions. Unfortunately, if only certain fields in a record have been corrupted, removing the whole record discards usable, uncorrupted data. We present LENS, an approach for identifying corrupted fields and using the remaining uncorrupted fields for subsequent modeling and analysis. Our approach uses the data to learn a probabilistic model containing three components: a generative model of the clean records, a generative model of the noise values, and a probabilistic model of the corruption process. We provide an algorithm for the unsupervised discovery of such models and empirically evaluate both its performance at detecting corrupted fields and, as one example application, the resulting improvement in the performance of a classifier.
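
To make the three-component idea concrete, the following is a minimal sketch and not the LENS algorithm itself. It assumes, purely for illustration, that each field is modeled independently, that clean values follow a Gaussian, that noise values follow a uniform distribution over a wide range, and that the corruption process is a fixed per-field prior probability. All parameters are supplied by hand here, whereas the abstract describes learning them from the data without supervision. The function names, field names, and parameter values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of the assumed clean-value model (Gaussian)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def uniform_pdf(x, low, high):
    """Density of the assumed noise-value model (uniform on [low, high])."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def corruption_posterior(x, clean_mean, clean_std, noise_low, noise_high,
                         prior_corrupt):
    """Posterior probability that a single field value was corrupted.

    Applies Bayes' rule to the two hypotheses "value came from the noise
    model" and "value came from the clean model", weighted by the prior
    probability of corruption (the assumed corruption process).
    """
    noise = prior_corrupt * uniform_pdf(x, noise_low, noise_high)
    clean = (1.0 - prior_corrupt) * gaussian_pdf(x, clean_mean, clean_std)
    total = noise + clean
    if total == 0.0:
        # Value is implausible under both models; this sketch treats it
        # as corrupted rather than dividing by zero.
        return 1.0
    return noise / total


def flag_corrupted_fields(record, field_models, prior_corrupt=0.05,
                          threshold=0.5):
    """Return names of fields whose corruption posterior exceeds threshold.

    field_models maps a field name to (clean_mean, clean_std, noise_low,
    noise_high); these hand-set parameters stand in for a learned model.
    """
    flagged = []
    for name, value in record.items():
        mean, std, low, high = field_models[name]
        p = corruption_posterior(value, mean, std, low, high, prior_corrupt)
        if p > threshold:
            flagged.append(name)
    return flagged


# Hypothetical record: the age looks ordinary, the height does not.
field_models = {
    "age": (40.0, 12.0, 0.0, 1000.0),
    "height_cm": (170.0, 10.0, 0.0, 10000.0),
}
record = {"age": 35.0, "height_cm": 9999.0}
flagged = flag_corrupted_fields(record, field_models)
```

In this sketch only the flagged field would be set aside, so the remaining fields of the record stay available for downstream modeling, which is the motivation given above for working at the field level rather than discarding whole records.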
