Morphological typology through massive parallel corpora

Abstract

This PhD project addresses the notion of morphological richness in a large-scale typological analysis of morphology based on massively parallel corpora. Morphologically rich languages express multiple levels of information at the word level, so they are expected to show greater variation in word types and, in turn, a lower frequency for each word type. Measures based on the distribution of word types can therefore differentiate between morphologically rich and morphologically poor languages. However, the distribution of word types is only a partial indicator, since it does not distinguish between morphological and lexical diversity. A comparison based on word alignments, i.e. how many words in one language correspond to a word type in another language, is expected to distinguish between these two kinds of diversity. Because word boundaries are themselves uncertain, monolingual tests using different definitions of the word will be performed for a subset of languages.
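The two kinds of measure described above can be illustrated with a minimal sketch. It assumes whitespace-tokenised text and a ready-made list of aligned word pairs; the function names `type_token_ratio` and `mean_aligned_types` and the toy data are hypothetical illustrations, not the project's actual measures or tools.

```python
from collections import defaultdict


def type_token_ratio(tokens):
    """Distinct word types divided by total tokens (0.0 for empty input).

    A higher ratio means more word-type variation, which may reflect
    morphological richness but also lexical diversity.
    """
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def mean_aligned_types(aligned_pairs):
    """Average number of distinct target word types per source word type.

    `aligned_pairs` is an iterable of (source_word, target_word) tuples,
    assumed to come from some word aligner run on a parallel corpus.
    """
    targets = defaultdict(set)
    for source, target in aligned_pairs:
        targets[source].add(target)
    if not targets:
        return 0.0
    return sum(len(t) for t in targets.values()) / len(targets)


# Toy data, for illustration only.
tokens = "the dog sees the dogs".split()
pairs = [
    ("house", "talo"),
    ("house", "talossa"),
    ("house", "talon"),
    ("in", "talossa"),
]

print(type_token_ratio(tokens))
print(mean_aligned_types(pairs))
```

In this sketch a single source word type that aligns to many target word types suggests that the target language marks morphologically what the source language leaves to one form, which the type-token ratio alone cannot show.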

The project will further examine how the resulting morphological richness measures vary across geographical space. Mapping the geographical distribution of similarities and differences between languages is one of the objectives of contemporary typology. The proposed research will thus provide tools and materials for addressing language contact effects and for potential further investigations of language evolution. The use of corpora will be a valuable contribution to this research field, since most of the work is currently based on grammars.

The main research questions can therefore be expressed as:

  1. How are languages distributed on a corpus-based scale of morphological richness?
  2. How is morphological richness distributed in geographical space?

PhD candidate

Tatyana Ruzsics

Supervision

Tanja Samardžić, Balthasar Bickel, Martin Volk

Funding

URPP Language and Space