The project applies information theory, statistical modelling and machine learning to the study of language adaptation using linguistic data extracted from multilingual corpora. In addition to the theoretical findings, the project will provide a data set of text samples from 100 languages, facilitating future use of corpus-based computational methods in scientific approaches to linguistic diversity and change.
Project members: Olga Sozinova (PhD student), Ximena Gutierrez-Vasques (PostDoc), Christian Bentz (PostDoc, external collaborator), Steven Moran (PostDoc, external collaborator) and Tanja Samardžić (PI).
Funding: SNF grant #176305, 2018–2022.
Computational methods to describe languages
TeDDi Sample: Text Data Diversity Sample for Language Comparison and Multilingual NLP
Steven Moran, Christian Bentz, Ximena Gutierrez-Vasques, Olga Sozinova and Tanja Samardžić. 2022. "TeDDi Sample: Text Data Diversity Sample for Language Comparison and Multilingual NLP". In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Marseille, France, 20–25 June 2022.
The turning point of BPE merges
Interpretability for morphological inflection
The data for the analysis consist of texts in 100 languages, which will be published as a multilingual corpus. The 100-language sample is the one proposed by the World Atlas of Language Structures (WALS); the text collection for this sample will be our original contribution.
Current number of tokens collected per genre: fiction, non-fiction, conversation, professional, technical, grammar examples.