Automatic Highlighter of Lengthy Legal Documents

Legal documents are known for being lengthy. Some categories of legal documents contain duplicated information that does not require the reader's attention, yet manually extracting the non-duplicate information requires a considerable amount of effort. We therefore want machine learning algorithms to pick out the unusual sentences for us. In this paper, we propose a set of algorithms that filters out duplicate information and returns useful information to the user. We are able to train a learner that marks the unusual parts of a legal document for manual scrutiny. Our learning process has two phases. In the first phase, we pick legal documents that share common patterns, e.g. software user agreements, to form a knowledge base for the trainer. We then run a latent Dirichlet allocation (LDA) [1] model on these documents, which returns a set of topics that are common across the knowledge base. In the second phase, we take a new legal document as the test sample. We first remove the common topic words from the test document to increase the differences between its sentences. We then use Word2Vec [2], [3] to convert the sentences into vectors. After generating this feature space, we run Agglomerative Clustering and Local Outlier Factor (LOF) [4] algorithms on the feature vectors to detect special sentences in the given document. Finally, we use PCA and t-SNE to visualize our results.
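To make the outlier-detection step of the second phase concrete, the following is a minimal sketch of LOF scoring written in plain Python. It is an illustration under stated assumptions, not our implementation: the function names (`euclidean`, `local_outlier_factor`), the choice `k=2`, the threshold of 1.5, and the two-dimensional toy vectors are all hypothetical. In the actual pipeline the inputs would be Word2Vec-derived sentence vectors.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_outlier_factor(points, k):
    """Return one LOF score per point; scores well above 1 suggest outliers."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")

    # Pairwise distances between all points.
    dist = [[euclidean(p, q) for q in points] for p in points]

    k_distance = []  # distance from each point to its k-th nearest neighbor
    neighbors = []   # indices of points within that distance (ties included)
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]
        k_distance.append(kd)
        neighbors.append([j for d, j in others if d <= kd])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        mean_reach = sum(reach) / len(reach)
        lrd.append(1.0 / max(mean_reach, 1e-12))  # guard against duplicate points

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Hypothetical toy "sentence vectors": four similar sentences and one unusual one.
sentence_vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = local_outlier_factor(sentence_vectors, k=2)

# Assumed threshold: flag sentences whose LOF score exceeds 1.5.
flagged = [i for i, s in enumerate(scores) if s > 1.5]
```

On this toy input the four clustered vectors receive a score of 1 and only the last vector is flagged. The threshold is a tuning choice; a score near 1 means a sentence is about as densely surrounded as its neighbors, while a much larger score marks it as a candidate for manual review.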