Interactive Information Extraction with Constrained Conditional Random Fields

Information extraction methods can be used to automatically fill in database forms from unstructured data such as Web documents or email. State-of-the-art methods achieve low error rates but still invariably make some errors. The goal of an interactive information extraction system is to assist the user in filling in database fields while giving the user confidence in the integrity of the data. The user is presented with an interactive interface that allows both the rapid verification of automatic field assignments and the correction of errors. In cases where there are multiple errors, our system takes user corrections into account and immediately propagates them as constraints, so that other fields are often corrected automatically. Linear-chain conditional random fields (CRFs) have been shown to perform well for information extraction and other language modelling tasks due to their ability to capture arbitrary, overlapping features of the input in a Markov model. We apply this framework with two extensions: a constrained Viterbi decoding, which finds the optimal field assignments consistent with the fields explicitly specified or corrected by the user; and a mechanism for estimating the confidence of each extracted field, so that low-confidence extractions can be highlighted. Both of these mechanisms are incorporated in a novel user interface for form filling that is intuitive and speeds the entry of data, providing a 23% reduction in error due to automated corrections.
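The idea behind constrained Viterbi decoding can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `constrained_viterbi`, the two-label setup, and all scores below are hypothetical toy values chosen for illustration. In a real linear-chain CRF the per-position and transition scores would come from learned feature weights. The sketch runs the usual Viterbi recursion but assigns a score of negative infinity to any label that contradicts a user correction, so the best path must pass through the corrected labels.

```python
NEG_INF = float("-inf")

def constrained_viterbi(emit, trans, constraints=None):
    """Return the highest-scoring label path that respects the constraints.

    emit[t][s]   : log-score of label s at position t (toy values here)
    trans[r][s]  : log-score of moving from label r to label s
    constraints  : dict mapping position -> label fixed by the user
    """
    constraints = constraints or {}
    n_states = len(trans)

    def allowed(t, s):
        # A label is allowed unless the user fixed a different one at t.
        return t not in constraints or constraints[t] == s

    # Initialise with the first position, masking disallowed labels.
    delta = [emit[0][s] if allowed(0, s) else NEG_INF for s in range(n_states)]
    back = []
    for t in range(1, len(emit)):
        new_delta, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda r: delta[r] + trans[r][s])
            score = delta[best_prev] + trans[best_prev][s] + emit[t][s]
            new_delta.append(score if allowed(t, s) else NEG_INF)
            ptr.append(best_prev)
        delta = new_delta
        back.append(ptr)

    # Trace back from the best final label.
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Toy example: label 0 = OTHER, label 1 = NAME; transitions favour staying.
emit = [[-0.5, -0.6], [-0.5, -0.6], [-0.5, -0.6]]
trans = [[0.0, -2.0], [-2.0, 0.0]]

print(constrained_viterbi(emit, trans))          # [0, 0, 0]
print(constrained_viterbi(emit, trans, {1: 1}))  # [1, 1, 1]
```

In this toy case, fixing only the middle token to NAME also flips its two neighbours, because the transition scores make a consistent segment cheaper than switching labels twice. This is the propagation effect described above, shown on made-up numbers.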
