Georeferencing Web Ngrams

Motivation and Goal: Ngrams (sequences of n words, paired with their frequencies or probabilities) are used to index large bodies of written text. In recent years, both Google and Bing have provided access to their Ngram collections, which represent all one- to five-word combinations on the Internet (i.e. hundreds of billions of web pages). This information has been used widely across scientific domains (e.g. computational linguistics, artificial intelligence, and genetics). Geography, however, has made comparatively little use of Ngrams for spatial analysis or for learning how geographic concepts are used. One important reason for this gap is that Ngrams are particularly difficult to georeference, and georeferencing is a precondition for any follow-up analysis. In this project, our aim is to find ways of associating arbitrary words or word combinations with spatial footprints, which in turn opens the door to in-depth spatial analysis of a broad set of geographic research questions.