A Genre-based Clustering Approach to Content Extraction

The content of a webpage is usually contained within a small body of text and images, or perhaps several articles on the same page; however, the content may be lost in the clutter (defined as cosmetic features such as animations, menus, sidebars, and obtrusive banners). Automatic content extraction has many applications, including browsing on small cell phone and PDA screens, speech rendering for the visually impaired, and reducing noise for information retrieval systems. We have developed a framework, Crunch, which employs various heuristics for content extraction in the form of filters applied to the webpage’s DOM tree; the filters aim to prune or transform the clutter, leaving only the content. Because the HTML components that characterize clutter can vary significantly from website to website, Crunch allows users to tune what we call “settings”: thresholds for applying a particular filter and/or switches for toggling a filter on or off. However, we have found that the same settings tend to work well across different websites of the same genre, e.g., news or shopping, since the designers often employ similar page layouts. In particular, Crunch could obtain the settings for a previously unknown website by automatically classifying it as sufficiently similar to a cluster of known websites with previously adjusted settings. We present our approach to clustering a large corpus of websites into genres, using their pre-extraction textual material augmented by the snippets generated by searching for the website’s domain name in web search engines. Including these snippets increases the frequency of function words needed for clustering. We use the existing Manhattan distance measure and hierarchical clustering techniques, with some modifications, to pre-classify the corpus into genres offline.
Our method does not require prior knowledge of the set of genres that websites fit into, but to be useful, a priori settings must be available for some member of each cluster or of a nearby cluster (otherwise defaults are used). Crunch classifies newly encountered websites online in linear time and then applies the corresponding filter settings, with no noticeable delay added by our content-extracting web proxy.
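As a rough illustration of the online step described above, the following Python sketch assigns a new website to the nearest known cluster by Manhattan distance over word-frequency vectors and falls back to defaults when no cluster is close enough. It is a minimal sketch under stated assumptions, not Crunch's implementation: the vocabulary, the use of cluster centroids, the threshold value, and the settings names are all hypothetical, and the modifications to the distance measure and clustering mentioned in the abstract are not reproduced here.

```python
from collections import Counter

# Hypothetical vocabulary; the actual feature set used for clustering is an assumption.
VOCAB = ["the", "of", "and", "to", "buy", "price", "said", "news"]


def frequency_vector(text, vocab=VOCAB):
    """Return the relative frequency of each vocabulary word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in vocab]


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def classify(vector, centroids, threshold):
    """Scan the cluster centroids once and return the nearest label.

    Returns None when even the nearest centroid is farther than the
    threshold, i.e. the site is not sufficiently similar to any cluster.
    The cost is one distance computation per cluster.
    """
    best_label, best_dist = None, float("inf")
    for label, centroid in centroids.items():
        dist = manhattan(vector, centroid)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= threshold else None


def settings_for(label, cluster_settings, defaults):
    """Look up the filter settings for a cluster, or use the defaults."""
    return cluster_settings.get(label, defaults)


if __name__ == "__main__":
    # Hypothetical centroids, as if produced by offline clustering.
    centroids = {
        "news": frequency_vector("the news said the of and to said news"),
        "shopping": frequency_vector("buy the price of buy price to and"),
    }
    # Hypothetical filter settings per cluster.
    cluster_settings = {"news": {"remove_ads": True, "link_ratio": 0.35}}
    defaults = {"remove_ads": True, "link_ratio": 0.5}

    site_text = "the minister said the news of the day"
    label = classify(frequency_vector(site_text), centroids, threshold=1.0)
    print(label, settings_for(label, cluster_settings, defaults))
```

Comparing against one representative vector per cluster keeps the online cost proportional to the number of clusters, which is one plausible reading of the linear-time claim; the abstract does not say what the running time is linear in.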
