Paper information - Latent-Descriptor Clustering for Unsupervised POS Induction

We present a novel approach to distributional-only, fully unsupervised POS tagging, based on an adaptation of the EM algorithm for the estimation of a Gaussian mixture. In this approach, which we call Latent-Descriptor Clustering (LDC), word types are clustered using a series of progressively more informative descriptor vectors. These descriptors, which are computed from the immediate left and right context of each word in the corpus, are updated based on the previous state of the cluster assignments. The LDC algorithm is simple and intuitive. Using standard evaluation criteria for unsupervised POS tagging, LDC shows a substantial improvement in performance over state-of-the-art methods, along with a several-fold reduction in computational cost.
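
The loop the abstract describes (build descriptors from left and right neighbours, cluster, then rebuild the descriptors from the new clusters) can be illustrated with a small sketch. This is not the authors' implementation: the abstract specifies an adaptation of EM for a Gaussian mixture, whereas the code below swaps in hard, k-means-style assignments with squared Euclidean distance to keep the example short. The function name `ldc_sketch`, its parameters, the random initialisation, and the descriptor normalisation are all assumptions made for illustration.

```python
import numpy as np


def ldc_sketch(tokens, k, n_iter=10, seed=0):
    """Toy descriptor-update clustering (hypothetical; hard assignments, not EM)."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    ids = np.array([index[t] for t in tokens])
    n = len(vocab)

    # left[w, v]: how often word type v occurs immediately before w.
    # right[w, v]: how often word type v occurs immediately after w.
    left = np.zeros((n, n))
    right = np.zeros((n, n))
    np.add.at(left, (ids[1:], ids[:-1]), 1.0)
    np.add.at(right, (ids[:-1], ids[1:]), 1.0)

    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)  # assumed: random initial clusters

    for _ in range(n_iter):
        onehot = np.eye(k)[labels]  # shape (n, k)

        # Descriptor of a word type: how its left and right neighbours are
        # distributed over the *current* clusters (assumed normalisation).
        desc = np.hstack([left @ onehot, right @ onehot])  # shape (n, 2k)
        desc /= np.maximum(desc.sum(axis=1, keepdims=True), 1.0)

        # Cluster centroids; an empty cluster is reseeded from a random word.
        centroids = np.zeros((k, 2 * k))
        for c in range(k):
            members = desc[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
            else:
                centroids[c] = desc[rng.integers(0, n)]

        # Reassign each word type to its nearest centroid.
        dists = ((desc[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

    return dict(zip(vocab, labels.tolist()))
```

Calling `ldc_sketch("the cat sat on the mat".split(), k=2)` returns a dictionary mapping each word type to a cluster index. Because the descriptors are recomputed from the cluster assignments on every pass, they change as the clustering changes, which is the sense in which the abstract calls them progressively more informative.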

Elie Bienenstock | Michael Lamar | Yariv Maron
