Tatyana Ruzsics (Soldatova) joined the URPP Language and Space in May 2016
Morphological typology through massive parallel corpora
This PhD project addresses the notion of morphological richness of languages in a large-scale morphological typological analysis using massively parallel corpora. Morphologically rich languages express multiple levels of information at the word level, so they are expected to show greater variation in word types and, in turn, a lower frequency per word type. Measures based on the distribution of word types can therefore differentiate between morphologically rich and morphologically poor languages. However, the distribution of word types is only a partial indicator, since it does not distinguish between morphological and lexical diversity. A comparison based on word alignments, i.e. how many words in one language correspond to a word type in another language, is expected to distinguish between these two kinds of diversity. Given that word boundaries are an uncertain phenomenon, monolingual tests concerning different definitions of words will be performed for a subset of languages.
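As a rough illustration of the two kinds of measure described above, the following Python sketch computes a type-token ratio and an alignment-based count. It is not the project's actual method: the function names `type_token_ratio` and `mean_aligned_types` are hypothetical, the sentences are invented toy data (Russian is transliterated), and the alignment pairs are written by hand rather than produced by an aligner.

```python
from collections import defaultdict


def type_token_ratio(tokens):
    """Distinct word types divided by total tokens (hypothetical helper)."""
    if not tokens:
        raise ValueError("token list is empty")
    return len(set(tokens)) / len(tokens)


def mean_aligned_types(alignments):
    """Average number of distinct target word types aligned to each
    source word type. `alignments` is a list of (source, target) pairs.
    This is one assumed way to operationalize the alignment comparison."""
    targets = defaultdict(set)
    for source, target in alignments:
        targets[source].add(target)
    if not targets:
        raise ValueError("alignment list is empty")
    return sum(len(t) for t in targets.values()) / len(targets)


# Toy parallel text, invented for illustration only.
english = "the dog sees the cat the cat sees the dog".split()
russian = "sobaka vidit koshku koshka vidit sobaku".split()

# Hand-written word alignments (English content words -> Russian forms).
alignments = [
    ("dog", "sobaka"), ("sees", "vidit"), ("cat", "koshku"),
    ("cat", "koshka"), ("sees", "vidit"), ("dog", "sobaku"),
]

print("English TTR:", type_token_ratio(english))
print("Russian TTR:", type_token_ratio(russian))
print("Mean aligned types:", mean_aligned_types(alignments))
```

In this toy data the Russian text has more distinct word types per token than the English text because case marking yields several forms of the same noun, and the alignment count shows each English noun corresponding to two Russian forms, which points to morphological rather than lexical diversity.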
The project will further focus on the variation in space of the obtained morphological richness structure. Mapping the geographical distribution of measures of similarity and difference between languages is one of the objectives of contemporary typology. The proposed research will thus provide tools and materials for addressing language contact effects and for potential further investigations of language evolution. The use of corpora will be a valuable contribution to this research field, since most of the work is currently based on grammars.
The main research questions can therefore be expressed as:
- How are languages distributed on a morphological richness scale based on corpora?
- How is morphological richness distributed in geographical space?
Supervisors: Tanja Samardžić, Balthasar Bickel, Martin Volk
Funding source: URPP Language and Space
Lusetti, M., T. Ruzsics, A. Göhring, T. Samardžić and E. Stark (2018). "Encoder-Decoder Methods for Text Normalization". In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (COLING 2018). Santa Fe, New Mexico, USA, 18-28. Association for Computational Linguistics.
Ruzsics, T. and T. Samardžić (2017). "Neural Sequence-to-sequence Learning of Internal Word Structure". In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Vancouver, Canada, 184-194. Association for Computational Linguistics.
Makarov, P., T. Ruzsics, and S. Clematide (2017). "Align and copy: UZH at SIGMORPHON 2017 shared task for morphological reinflection". In Proceedings of the CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection. Vancouver, Canada, 49-57. Association for Computational Linguistics. Overall winner of task 1.
Bentz, C., T. Ruzsics, A. Koplenig, and T. Samardžić (2016). "A comparison between morphological complexity measures: Typological data vs. language corpora". In Proceedings of the Workshop Computational Linguistics for Linguistic Complexity (COLING 2016). Osaka, Japan, 142-153. Association for Computational Linguistics.
"Encoder-Decoder Methods for Text Normalization", SwissText 2018, ZHAW, Winterthur
"Morphological segmentation", March 2017, Institute of Computational Linguistics Colloquium, University of Zurich
"Morphological richness through massive parallel corpora" with T. Samardžić, September 2016, URPP Language and Space, Second Meeting with Scientific Advisory Board, University of Zurich
2016 - present
University of Zurich, Corpus Lab, URPP “Language and Space”
PhD in General Linguistics
Research topic: "Morphological typology through massive parallel corpora"
CAS in Computer Science with a focus on Information Systems
2012 - 2015
ETH Zurich / University of Zurich
MSc in Quantitative Finance
2003 - 2008
Moscow State University
MSc in Mathematics