A Tale of Two (Similar) Cities - Inferring City Similarity through Geo-spatial Query Log Analysis

Understanding the backgrounds and interests of the people who consume a piece of content, such as a news story, video, or music, is vital for the content producer as well as for the advertisers who rely on that content as a channel on which to advertise. We extend traditional search-engine query log analysis, which has primarily concentrated on analyzing single queries or users, or small groups of them, to examine the complete query stream of very large groups of users – the inhabitants of 13,377 cities across the United States. Query logs can be a good representation of the interests of a city's inhabitants and a useful characterization of the city itself. Further, we demonstrate how query logs can be used effectively to gather city-level statistics sufficient for providing insights into the similarities and differences between cities. Cities found to be similar through query analysis correspond well to the similar cities determined through other large-scale and time-consuming direct measurement studies, such as those undertaken by the Census Bureau.

1. CENSUS & QUERY LOGS

Understanding the backgrounds and interests of the people who consume a piece of content, such as a news story, video, or music, is vital for the content producer as well as for the advertisers who rely on that content as a channel on which to advertise. A variety of sources for demographic and behavioral information exist today. One of the largest-scale efforts to understand people across the United States is conducted every 10 years by the US Census Bureau. This massive operation, which gathers statistics about population, ethnicity, and race, is supplemented by smaller surveys, such as the American Community Survey, that gather a variety of more in-depth information about households. Advertisers often use the high-level information gathered by these surveys to help target their ad campaigns to the most appropriate regions and cities in the US.
In contrast to the Census studies, passive studies of search engine query logs have become common since the introduction of search engines and the massive adoption of the Internet to quickly find information (Jansen and Spink, 2006; Silverstein et al., 1999). These studies provide the quantitative data not only to improve the search engine's results, but also to provide a deeper understanding of the user and the user's interests than the data collected by the Census and similar surveys. The goal of our work is to extend techniques and data sources that have commonly been used for understanding single online users (or small groups) to extremely large groups (up to millions of users), a scale usually tackled only by large studies such as the Census. We want to determine whether the query stream emanating from groups of users – the inhabitants of 13,377 cities across the United States – is a good representation of the interests of each city's inhabitants, and therefore a useful characterization of the city itself. Figure 1 shows the geographic distribution of the queries analyzed in this study.

Figure 1: Geographic distribution of query samples used

[1] Torsten Suel, et al. Efficient query processing in geographic web search engines, 2006, SIGMOD Conference.

[2] Michael McGill, et al. Introduction to Modern Information Retrieval, 1983.

[3] Jon M. Kleinberg, et al. Spatial variation in search engine queries, 2008, WWW.

[4] M. Sanderson, et al. Analyzing geographic queries, 2004.

[5] Wei Vivian Zhang, et al. Geographic intention and modification in web search, 2008, Int. J. Geogr. Inf. Sci.

[6] Monika Henzinger, et al. Analysis of a very large web search engine query log, 1999, SIGIR Forum.

[7] Torsten Suel, et al. Analysis of geographic queries in a search engine log, 2008, LocWeb.

[8] Amanda Spink, et al. How are we searching the World Wide Web? A comparison of nine search engine transaction logs, 2006, Inf. Process. Manag.

[9] Mário J. Silva, et al. Relevance Ranking for Geographic IR, 2006, GIR.

[10] Hema Raghavan, et al. Discovering users' specific geo intention in web search, 2009, WWW '09.

[11] C. Lee Giles, et al. Modeling and visualizing geo-sensitive queries based on user clicks, 2008, LocWeb.

[12] Fernando Diaz, et al. A case study of using geographic cues to predict query news intent, 2009, GIS.

[13] Monika Henzinger, et al. Analysis of a very large web search engine query log, 1999.

[14] Fernando Diaz, et al. Geographic features in web search retrieval, 2008, GIR '08.

[15] C. Lee Giles, et al. Towards Click-Based Models of Geographic Interests in Web Search, 2008, 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology.