Paper Information - Automatic Highlighter of Lengthy Legal Documents

Automatic Highlighter of Lengthy Legal Documents

Legal documents are known for being lengthy. To our knowledge, some categories of legal documents contain duplicated information that does not require our attention. However, manually extracting non-duplicate information from documents requires a considerable amount of effort. Thus, we want to use machine learning algorithms to pick out unusual sentences for us. In this paper, we propose a set of algorithms that filters out duplicate information and returns useful information to the user. We are able to train a learner that can mark unusual parts of a legal document for manual scrutiny. Our learning process contains two phases. In the first phase, we pick some legal documents that contain common patterns, e.g. software user agreements, to form a knowledge base for the trainer. We then run an LDA [4] model on these documents. The LDA model returns a set of common topics across the knowledge base. In the second phase, we take a new legal document as the test sample. We first remove common topic words from the test document to increase the differences between sentences. We then use Word2Vec [7], [11] to convert sentences into vectors. After generating the feature space, we run Agglomerative Clustering and Local Outlier Factor (LOF) [2] algorithms on the feature vectors to detect special sentences in the given document. Finally, we use PCA and t-SNE to visualize our result.
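The second phase described above can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation: the topic words, the 2-D word embeddings, the example sentences, and the choice of k = 2 are all hypothetical toy values, and averaging word vectors to obtain a sentence vector is an assumption, since the abstract does not say how sentences are pooled. The LOF routine is a plain re-implementation of the standard definition for small inputs (it assumes k is smaller than the number of sentences); a real pipeline would more likely rely on a library such as scikit-learn.

```python
import math

def remove_topic_words(tokens, topic_words):
    """Drop words that the topic model marked as common across the corpus."""
    return [t for t in tokens if t not in topic_words]

def sentence_vector(tokens, embeddings, dim):
    """Average the word vectors of a sentence (an assumed pooling choice)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def lof_scores(points, k):
    """Local Outlier Factor per point; values well above 1 suggest outliers."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    k_dist, neighbors = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]          # distance to the k-th nearest point
        k_dist.append(kd)
        neighbors.append([j for j in order if dist[i][j] <= kd])
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        mean_reach = sum(reach) / len(reach)
        lrd.append(1.0 / mean_reach if mean_reach > 0 else math.inf)
    scores = []
    for i in range(n):
        if math.isinf(lrd[i]):              # exact duplicates: treat as inliers
            scores.append(1.0)
        else:
            total = sum(lrd[j] for j in neighbors[i])
            scores.append(total / (len(neighbors[i]) * lrd[i]))
    return scores

# Hypothetical stand-ins for the LDA topic words and Word2Vec embeddings.
TOPIC_WORDS = {"software", "license"}
EMBEDDINGS = {
    "grant": [1.0, 0.0], "use": [0.9, 0.1], "copy": [1.0, 0.2],
    "install": [0.8, 0.0], "arbitration": [0.0, 3.0], "waive": [0.2, 3.2],
}
SENTENCES = [
    "software license grant use",
    "license copy install",
    "software use copy",
    "license grant install",
    "software waive arbitration",
]

vectors = [
    sentence_vector(remove_topic_words(s.split(), TOPIC_WORDS), EMBEDDINGS, 2)
    for s in SENTENCES
]
scores = lof_scores(vectors, k=2)
flagged = max(range(len(scores)), key=scores.__getitem__)
print(SENTENCES[flagged], round(scores[flagged], 2))
```

In this toy setup the four routine licensing sentences form a dense cluster with scores close to 1, while the arbitration sentence sits far from them and receives a much larger score, which is the kind of sentence the highlighter is meant to surface.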

[1] Robert R. Sokal, et al. A statistical method for evaluating systematic relationships, 1958.

[2] Hans-Peter Kriegel, et al. LOF: identifying density-based local outliers, 2000, SIGMOD '00.

[3] Geoffrey E. Hinton, et al. Visualizing Data using t-SNE, 2008.

[4] Michael I. Jordan, et al. Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.

[5] P. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, 1987.

[6] Manabu Okumura, et al. Towards Multi-paper Summarization Using Reference Information, 1999, IJCAI.

[7] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[8] Wenjie Li, et al. Simultaneous Ranking and Clustering of Sentences: A Reinforcement Approach to Multi-Document Summarization, 2010, COLING.

[9] Jaideep Srivastava, et al. Contextual Anomaly Detection in Text Data, 2012, Algorithms.

[10] Dragomir R. Radev, et al. Scientific Paper Summarization Using Citation Summary Networks, 2008, COLING.

[11] Jeffrey Dean, et al. Efficient Estimation of Word Representations in Vector Space, 2013, ICLR.