Research on Deep Web Query Interface Clustering Based on Hadoop

Clustering query interfaces effectively is one of the core problems in generating an integrated query interface for Deep Web data integration. With the rapid development of Internet technology, however, the number of Deep Web query interfaces is growing explosively, and traditional stand-alone clustering approaches run into bottlenecks in both time complexity and space complexity. Building on a study of the Hadoop distributed platform and the MapReduce programming model, this paper designs and implements a Deep Web query interface clustering algorithm on Hadoop, in which the Vector Space Model (VSM) and Latent Semantic Analysis (LSA) are used to represent the "Query Interfaces-Attributes" relationships. Experimental results show that, by running on the Hadoop architecture, the proposed algorithm achieves better scalability and speedup ratio.
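As a rough illustration of the "Query Interfaces-Attributes" representation, the following minimal sketch builds VSM vectors with a map step and a reduce step and compares interfaces by cosine similarity. It is not the paper's algorithm: the interface names, attribute labels, and function names are hypothetical, plain Python stands in for Hadoop, and the LSA step and the clustering itself are omitted.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy data: interface id -> attribute labels found on its form.
INTERFACES = {
    "books_a": ["title", "author", "isbn"],
    "books_b": ["title", "author", "publisher"],
    "flights_a": ["origin", "destination", "date"],
}


def map_phase(interface_id, attributes):
    """Emit ((interface, attribute), 1) pairs, as a mapper might."""
    for attr in attributes:
        yield (interface_id, attr), 1


def reduce_phase(pairs):
    """Sum the counts per key into one sparse attribute vector per interface."""
    vectors = defaultdict(lambda: defaultdict(int))
    for (interface_id, attr), count in pairs:
        vectors[interface_id][attr] += count
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[attr] for attr, weight in u.items() if attr in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Run the two phases in-process; on a cluster each would be distributed.
pairs = [p for iid, attrs in INTERFACES.items() for p in map_phase(iid, attrs)]
vectors = reduce_phase(pairs)

print(cosine(vectors["books_a"], vectors["books_b"]))
print(cosine(vectors["books_a"], vectors["flights_a"]))
```

In this toy data the two book interfaces share two of their three attributes, so they score higher than the book and flight pair, which share none. A clustering step would then group interfaces using such similarities.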
