Paper information - Entity Annotation based on Inverse Index Operations

Entity Annotation based on Inverse Index Operations

Entity annotation involves attaching a label such as 'name' or 'organization' to a sequence of tokens in a document. All current rule-based and machine-learning-based approaches to this task operate at the document level. We present a new, generic approach to entity annotation that uses the inverse index typically created for rapid keyword-based searching of a document collection. We define a set of operations on the inverse index that allows us to create annotations defined by cascading regular expressions. The entity annotations for an entire document corpus can be created purely from the index, with no need to access the original documents. Experiments on two publicly available data sets show very significant performance improvements over document-based annotators.
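The idea of building annotations from index operations alone can be illustrated with a minimal sketch. It assumes a positional index that maps each token to spans of the form (document id, start, end), with whitespace tokenization. The helper names `build_index`, `match_terms`, and `concat` are hypothetical, chosen for illustration; they are not the operation set defined in the paper. Once the index is built, the annotations are derived from postings only, and the cascade shows up as one annotation ("person") defined in terms of two earlier ones.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each token to a set of (doc_id, start, end) spans; end is exclusive."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.split()):
            index[token].add((doc_id, pos, pos + 1))
    return index


def match_terms(index, pattern):
    """Union the postings of every indexed term that fully matches the regex."""
    regex = re.compile(pattern)
    spans = set()
    for term, postings in index.items():
        if regex.fullmatch(term):
            spans |= postings
    return spans


def concat(left, right):
    """Return spans formed by a left span immediately followed by a right span."""
    # Group right-hand spans by (doc_id, start) so adjacency is a dictionary lookup.
    starts = defaultdict(list)
    for doc_id, start, end in right:
        starts[(doc_id, start)].append(end)
    return {
        (doc_id, start, right_end)
        for doc_id, start, end in left
        for right_end in starts.get((doc_id, end), [])
    }


# Toy corpus (an assumption for illustration); it is read only while indexing.
docs = ["Dr. Smith met Mr. Jones in Paris", "Paris welcomed Dr. Lee"]
index = build_index(docs)

# Base annotations come from regular expressions over the index vocabulary.
title = match_terms(index, r"(Dr|Mr|Ms)\.")
cap_word = match_terms(index, r"[A-Z][a-z]+")

# Cascaded annotation: a title immediately followed by a capitalized word.
person = concat(title, cap_word)
```

Here `sorted(person)` gives the token spans covering "Dr. Smith", "Mr. Jones", and "Dr. Lee". Representing an annotation as a set of spans means the result of one operation can be fed straight into the next, which is what makes the cascade possible without returning to the text.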

Ganesh Ramakrishnan | Sachindra Joshi | Sreeram Balakrishnan
