A Clustering Algorithm for Short Documents Based on Concept Similarity

In recent years, there has been increasing interest in clustering short documents. Existing work seldom considers the concept similarity between words, so clustering quality is often low. This paper proposes a new document-clustering algorithm based on concept similarity for Chinese text processing. Unlike traditional methods, the algorithm first converts each text into a word vector space model; second, it splits each word into a set of concepts; third, it obtains the similarity between words by computing the inner products between their concepts; fourth, it computes the similarity between texts from the similarity between their words. Finally, a two-phase procedure completes the clustering of the given set of documents. Extensive experiments demonstrate the validity and performance of the algorithm.
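The similarity steps above can be illustrated with a minimal sketch. It is not the paper's implementation: the concept lexicon `CONCEPTS`, its weights, the cosine normalization of the inner product, and the best-match averaging in `text_similarity` are all assumptions made for illustration, and English words stand in for segmented Chinese words. The two-phase clustering step is left out because the abstract does not specify it.

```python
import math

# Hypothetical concept lexicon (word -> {concept: weight}); illustrative only.
CONCEPTS = {
    "car": {"vehicle": 1.0, "machine": 1.0},
    "truck": {"vehicle": 1.0, "machine": 1.0, "cargo": 1.0},
    "apple": {"fruit": 1.0, "food": 1.0},
    "pear": {"fruit": 1.0, "food": 1.0},
}


def word_similarity(w1, w2, lexicon=CONCEPTS):
    """Inner product of two words' concept vectors, normalized by their lengths.

    The normalization is an assumption; the abstract only mentions inner products.
    """
    if w1 == w2:
        return 1.0
    c1, c2 = lexicon.get(w1, {}), lexicon.get(w2, {})
    if not c1 or not c2:
        return 0.0  # a word with no known concepts matches nothing
    dot = sum(weight * c2.get(concept, 0.0) for concept, weight in c1.items())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)


def text_similarity(doc1, doc2, lexicon=CONCEPTS):
    """Symmetric average of each word's best match in the other document.

    This aggregation rule is an assumed stand-in for the paper's formula.
    """
    if not doc1 or not doc2:
        return 0.0

    def directed(a, b):
        # For every word in a, take its most similar word in b, then average.
        return sum(max(word_similarity(x, y, lexicon) for y in b) for x in a) / len(a)

    return 0.5 * (directed(doc1, doc2) + directed(doc2, doc1))
```

Under these assumptions, two short texts that share no surface words, such as `["car", "apple"]` and `["truck", "pear"]`, still receive a high similarity score because their words share concepts, which is the effect the abstract attributes to concept similarity.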