Corpus in GIScience: Going Beyond Butterfly Collecting

Workshop "Corpus in GIScience: Going Beyond Butterfly Collecting"

GIScience Conference, Melbourne, Australia, August 28, 2018.

Workshop Description and Scope

With the increasing availability of unstructured and semi-structured text data in the form of user-generated content and digitized corpora, GIScience is actively exploring the potential of these data to answer questions ranging from traditional to cutting-edge. Interestingly, the word “corpus” is seldom used in this research; “collection” is the preferred term. As the range of questions addressed through various corpora expands, it is high time for the GIScience community to gain an overview of emerging work and to reflect upon a number of important corpus-related questions, such as: Aren’t we “butterfly collecting”[1] at best and being opportunistic at worst? Do concepts such as “language in use” and “representativeness” matter, or should we take a pragmatic approach and only ask, “is corpus A good for task B”?

The aim of this workshop is to provide an overview of existing (types of) corpora, to outline key methods and research questions addressed through text corpora in GIScience, and to discuss the importance of aspects such as corpus characteristics or representativeness.

The scope of the workshop includes the following:

  • characteristics of existing geospatial and general corpora (including the Web as a corpus, digitized corpora, etc.)
  • corpus-building strategies and frameworks
  • making a corpus publicly available: tools and pitfalls
  • methods for spatial and thematic exploration of a corpus (geographic information retrieval and beyond)
  • development of spatial markup languages
  • approaches to annotation of large corpora (including crowdsourcing)
  • areas of application of corpus-based and corpus-driven research in GIScience (e.g. from environmental monitoring to the investigation of variation in the use of spatial language)

[1] Chomsky’s famous critique of corpus linguistics (Chomsky, N. 1979. Language and Responsibility: Based on Conversations with Mitsou Ronat. New York: Pantheon. Translated by John Viertel, p.57)



9:15 Introduction

Parisa Kordjamshidi “Corpus-based Spatial Information Extraction from Natural Language”

10:20  Research speed dating
10:50 Coffee break
11:20 Jingyi Xiao and Werner Kuhn “Thoughts on Geospatial Corpus”

Panel session I
with Alice Gaby, Parisa Kordjamshidi and Alan MacEachren

12:45 Lunch

Krzysztof Janowicz “Ontological Considerations in Creating and Using Corpora in GIScience”


Clematide et al. “Crowdsourcing Toponym Annotation for Natural Features: How Hard Is It?”

15:15 Coffee break

Panel session II
with Ben Adams, Christopher Jones, Maria Vasardani


Summary and take home message

17:00 End



Speaker: Parisa Kordjamshidi

Affiliation: Tulane University/Florida Institute for Human and Machine Cognition

Title: "Corpus-based Spatial Information Extraction from Natural Language"

Abstract: Natural language text is a rich resource of spatial information including geographical data. It is becoming increasingly important for real-world applications to extract this information automatically, for example, for the early detection of the location of events such as natural hazards. In this talk, I will discuss recent research efforts on the extraction of spatial information from natural language from a machine learning perspective. I will discuss a) recent annotation schemes such as SpatialML, Spatial Role Labeling, and ISO-Space; b) the types of textual corpora that we have annotated; c) the aspects of spatial information that are expressed in the currently annotated data; and d) the types of concepts that we are able to extract automatically from text using corpus-based techniques. I will point to the state-of-the-art machine learning models that we have developed for spatial language understanding, and to current research results and challenges.


Speaker: Krzysztof Janowicz

Affiliation: University of California, Santa Barbara

Title: "Ontological Considerations in Creating and Using Corpora in GIScience"

Abstract: Semantic signatures are an analogy to spectral signatures in remote sensing applied to social sensing in urban environments. They can be extracted from various sources including text corpora, e.g., to understand how people talk about places of particular types. In this talk, I will briefly introduce semantic signatures, how they were created, and their application areas. Next, I will discuss the ontological decisions that went into each step and how they impact the resulting signatures and their application. I will close by generalizing the findings to the creation and usage of corpora in GIScience more broadly.


Programme Committee:

Ben Adams (University of Canterbury)

Tim Baldwin (University of Melbourne)

Christophe Claramunt (French Naval Academy Research Institute)

Mauro Gaio (University of Pau and Pays de l’Adour)

Morteza Karimzadeh (Ohio State University)

Parisa Kordjamshidi (Tulane University)

Bruno Martins (University of Lisbon)

Ludovic Moncla (French Naval Academy Research Institute)

Ross Purves (University of Zurich)

James Pustejovsky (Brandeis University)

Tanja Samardžić (University of Zurich)

Thora Tenbrink (Bangor University)

Jan Oliver Wallgrün (Pennsylvania State University)


For further questions, please contact the organizers: Ekaterina Egorova, Kristin Stock, Lesley Stirling.