A generalization-based hybrid algorithm for clustering semi-structured data

Various clustering algorithms have been developed to group data into classes in diverse domains. These algorithms work effectively on structured data but perform poorly on semi-structured data, which is usually high-dimensional and has a less rigid structure. Additionally, traditional clustering algorithms assume that there are no relationships among attributes and treat each attribute as an independent entity when calculating the similarity among objects. In this work, a generalization-based methodology that combines attribute hierarchy construction, object generalization, and data clustering is presented. The algorithm works well on semi-structured data and requires only minimal domain knowledge. Because the algorithm reduces the dimensionality of the semi-structured data, clustering the resulting generalized data often requires less execution time and memory. Experimental results show that the proposed methodology can significantly improve clustering quality in some cases. Moreover, when the number of data points is substantially larger than the number of attributes, the new approach is more efficient, producing results in less execution time.
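
The three stages named above (attribute hierarchy, object generalization, clustering) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy hierarchy `HIERARCHY`, the helper names `generalize`, `cosine`, and `leader_cluster`, the use of cosine similarity, and the simple leader-style clustering step, which stands in for whichever clustering algorithm the methodology actually uses. The hierarchy here is also written by hand, whereas the methodology constructs it.

```python
from collections import Counter
from math import sqrt

# Hypothetical attribute hierarchy: each leaf attribute maps to a parent concept.
HIERARCHY = {
    "apple": "fruit",
    "banana": "fruit",
    "carrot": "vegetable",
    "spinach": "vegetable",
}


def generalize(obj, hierarchy):
    """Replace each attribute by its parent concept and sum the weights."""
    generalized = Counter()
    for attribute, weight in obj.items():
        # Attributes missing from the hierarchy are kept unchanged.
        generalized[hierarchy.get(attribute, attribute)] += weight
    return dict(generalized)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(key, 0) for key, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def leader_cluster(vectors, threshold):
    """Assign each vector to the first cluster leader it is similar enough to."""
    leaders = []
    labels = []
    for vector in vectors:
        for label, leader in enumerate(leaders):
            if cosine(vector, leader) >= threshold:
                labels.append(label)
                break
        else:
            # No existing leader is close enough, so start a new cluster.
            leaders.append(vector)
            labels.append(len(leaders) - 1)
    return labels


# Four toy objects, each described by a different leaf attribute.
raw = [{"apple": 2}, {"banana": 1}, {"carrot": 1}, {"spinach": 3}]
generalized = [generalize(obj, HIERARCHY) for obj in raw]

raw_dims = len({key for obj in raw for key in obj})
generalized_dims = len({key for obj in generalized for key in obj})

raw_labels = leader_cluster(raw, 0.5)
generalized_labels = leader_cluster(generalized, 0.5)

print(raw_dims, generalized_dims)
print(raw_labels, generalized_labels)
```

In this toy setting, treating attributes as independent leaves every object in its own cluster, because no two objects share an attribute. After generalization the objects fall into two clusters over two dimensions instead of four, which mirrors the dimensionality reduction and the use of attribute relationships described in the abstract.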