Language corpora (collections of machine-readable texts) are now measured in billions of tokens. These vast records of language use are a rich potential source of data for linguistic research. This opportunity, however, comes with a challenge: how do we turn hundreds of thousands of observations into linguistic evidence?
In the corpus-linguistic laboratory (CorpusLab), computers are used as lab instruments. We extract data from language corpora automatically using natural language processing. We measure linguistic phenomena based on corpus counts. We apply statistical modelling and inference to understand the structures and the rules behind the observed language use.
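As a minimal illustration of measuring with corpus counts, the sketch below computes how often a word occurs per million tokens, which makes counts from corpora of different sizes comparable. It is not CorpusLab's actual tooling: the regex tokenizer, the function names and the two toy "corpora" are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (naive regex tokenizer, an assumption)."""
    return re.findall(r"\w+", text.lower())


def frequency_per_million(word, text):
    """Return the frequency of `word` in `text`, normalised per million tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return counts[word.lower()] / len(tokens) * 1_000_000


# Toy stand-ins for two corpora (hypothetical data, far smaller than real corpora).
corpus_a = "the cat sat on the mat"
corpus_b = "a dog and a cat ran"

for name, corpus in [("A", corpus_a), ("B", corpus_b)]:
    print(name, round(frequency_per_million("cat", corpus)))
```

Normalised frequencies like these are only the measurement step. Deciding whether a difference between two corpora is meaningful is left to the statistical modelling and inference described above.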
We are especially interested in studying linguistic variation in space. We develop methods and tools for comparing languages and linguistic structures using corpora.
Visualisation by Phillip Ströbel