Darja Fišer (Ljubljana)
In this talk we will present an overview of the current results and ongoing activities of the JANES project (http://nl.ijs.si/janes). The project started in 2014 and focuses on nonstandard Slovene. Its objective is to build a large corpus of user-generated Slovene as found on the Internet, which will serve as the basis for linguistic analyses and help improve language-technology tools for processing texts written in nonstandard Slovene. We have compiled the first version of the JANES corpus, which comprises just over 160 million tokens and contains four types of text: tweets, forums, news comments and blogs. The corpus has been tokenised, normalised, part-of-speech tagged and lemmatised. The word-normalisation method is currently only a prototype, which we plan to improve as the project continues. We have also developed a method to predict the level of nonstandardness of texts in the corpus. We propose that nonstandardness comes in two basic varieties, technical and linguistic, and have developed a machine-learning method to discriminate between standard and nonstandard text along these two dimensions. The subcorpus of tweets has also been automatically annotated for sentiment and geographic region, and manually annotated for gender (female/male/neutral) and type of user (corporate/private). The corpus has already been used for linguistic analyses by researchers and students, and their results have been presented at two national and three international conferences.
Room SOD 1 105, Schönberggasse 9, 8001 Zürich
Room KOL G 203, Rämistrasse 71, 8006 Zürich
Talk by Leelo Keevallik: Vocal practices of synchronizing the bodies in a dance class
Talk by Mathias Broth: The accountability of braking in driving instruction
Room SOD 1 105, Schönberggasse 9, 8001 Zürich
Room SOD 1 105, Schönberggasse 9, 8001 Zürich