Event Extraction from Unstructured Text Data

We extend a bootstrapping method, originally developed for extracting relations from webpages, to the problem of extracting events from large collections of short unstructured text. Such data appear as field notes in enterprise applications and as messages in social media services. The method iteratively learns sentence patterns that match a set of representative event mentions and then uses the learnt patterns to extract additional mentions. At every step, the semantic similarity between a text and the set of patterns determines whether a pattern is matched. Semantic similarity is calculated using the WordNet lexical database. Local structure features such as bigrams are extracted from the data where possible to improve the accuracy of pattern matching. We rank and filter the learnt patterns to balance the precision and recall of the extracted events. We demonstrate the approach on two datasets: a collection of field notes from an enterprise dataset and a collection of "tweets" gathered from the Twitter social network. We evaluate how the accuracy of the extracted events changes as the method parameters are varied.
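
The iterative loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `toy_word_sim`, `sentence_sim`, and `bootstrap`, the threshold value, and the averaging scheme are all assumptions made for illustration. In particular, a small hand-written synonym table stands in for a WordNet-based similarity measure, every matched text is simply promoted to a new pattern, and the bigram features and the pattern ranking and filtering step are omitted.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical synonym groups; a toy stand-in for a WordNet-based similarity.
SYNONYM_GROUPS = [
    {"replaced", "swapped", "changed"},
    {"technician", "engineer"},
    {"faulty", "broken"},
]


def toy_word_sim(a: str, b: str) -> float:
    """Return 1.0 for identical words, 0.8 for listed synonyms, else 0.0."""
    if a == b:
        return 1.0
    if any(a in group and b in group for group in SYNONYM_GROUPS):
        return 0.8
    return 0.0


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def sentence_sim(pattern: Sequence[str], tokens: Sequence[str],
                 word_sim: Callable[[str, str], float]) -> float:
    """Average, over pattern words, of the best similarity to any text word."""
    if not pattern or not tokens:
        return 0.0
    best = [max(word_sim(p, t) for t in tokens) for p in pattern]
    return sum(best) / len(pattern)


def bootstrap(seeds: Sequence[str], corpus: Sequence[str],
              threshold: float = 0.7, rounds: int = 3,
              word_sim: Callable[[str, str], float] = toy_word_sim) -> List[str]:
    """Grow the pattern set from seed mentions and return newly extracted texts."""
    patterns: List[Tuple[str, ...]] = [tuple(tokenize(s)) for s in seeds]
    seen = set(seeds)
    extracted: List[str] = []
    for _ in range(rounds):
        # Collect texts that match any current pattern closely enough.
        matched = [
            text for text in corpus
            if text not in seen
            and any(sentence_sim(p, tokenize(text), word_sim) >= threshold
                    for p in patterns)
        ]
        if not matched:
            break  # no new mentions, so further rounds cannot change anything
        for text in matched:
            seen.add(text)
            extracted.append(text)
            # Assumption: each new mention becomes a pattern without filtering.
            patterns.append(tuple(tokenize(text)))
    return extracted


EXAMPLE_SEEDS = ["technician replaced faulty pump"]
EXAMPLE_CORPUS = [
    "engineer swapped broken pump",
    "engineer swapped broken valve",
    "lunch was great",
]

if __name__ == "__main__":
    print(bootstrap(EXAMPLE_SEEDS, EXAMPLE_CORPUS))
```

In this toy run the second corpus entry is too dissimilar to the seed to match in the first round, and is only picked up in the second round through the pattern learnt from the first entry. That is the behaviour bootstrapping is meant to provide, and it is also why unfiltered pattern growth can hurt precision on real data.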
