Basic Natural Language Processing for Swiss German Texts

Abstract

The goal of the project was to improve systems and tools for processing Swiss German texts. The work comprised four activities:

a) improving the normalisation of orthographic and regional variation in spelling,

b) developing a system for semi-automatic part-of-speech tagging, 

c) developing a system for semi-automatic transcription,

d) promoting the methods and tools used to process Swiss German.

 

a) Most of the normalisation of orthographic and regional spelling variation was carried out in collaboration with international experts on character-level machine translation, Yves Scherrer (University of Helsinki, previously Geneva) and Nikola Ljubešić (J. Stefan Institute and University of Zagreb).

b) The system for semi-automatic part-of-speech tagging was developed in collaboration with TakeLab at the University of Zagreb, a computer science lab specialising in machine learning for natural language processing. To reduce the amount of manual annotation of new instances, we rely on active learning, a general framework for selecting the training examples to be annotated for machine learning algorithms. The active learning interface is especially useful for processing texts with strong variation, as it can identify the specific segments that need more attention. Using this interface, we can pool the data from all the variants, annotate the overlapping examples automatically, and manually annotate only the examples that are characteristic of just one or two variants and are not covered by the tagger's generalisations. The interface is installed on the CorpusLab virtual server, running on the UZH ScienceCloud, and is used by our collaborators to enlarge the training set.

c) The task of semi-automatic transcription turned out to be the most difficult and required the most effort. It was performed in collaboration with the private company Spitch, with which we had been collaborating since June 2015, and with the S3IT computing service at the University of Zurich. To gain more control over automatic speech-to-text processing for the purposes of our own research, we engaged S3IT to install and train the open-source system Kaldi.

d) The annotated ArchiMob corpus is now accessible on our web page. A summary of the results of this project was presented at the conference Swiss Text on 9 June 2017. The project resulted in a number of tools and methods that helped us improve the automatic analysis of texts in Swiss German and enlarge the training data set. Its outcomes allow for further improvements that rely on up-to-date computational technology while making optimal use of the manual input from language experts.
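The active learning idea mentioned under b) can be illustrated with a minimal sketch. It is not the project's actual system: it assumes that a tagger returns a probability distribution over tags for each token and uses entropy-based uncertainty sampling, one common selection strategy, to decide which tokens go to a human annotator. The function names, the example tokens, the tag labels and the probabilities are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Assumed format of the tagger output: tag -> probability.
TagDistribution = Dict[str, float]


def entropy(dist: TagDistribution) -> float:
    """Shannon entropy (in bits) of a tag probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0.0)


def split_for_annotation(
    predictions: List[Tuple[str, TagDistribution]], budget: int
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Send the `budget` most uncertain tokens to manual annotation.

    All remaining tokens are labelled automatically with their most
    probable tag.
    """
    # Rank tokens from most to least uncertain.
    ranked = sorted(predictions, key=lambda item: entropy(item[1]), reverse=True)
    manual = [token for token, _ in ranked[:budget]]
    automatic = [(token, max(dist, key=dist.get)) for token, dist in ranked[budget:]]
    return manual, automatic


# Hypothetical tagger output, invented for illustration only.
predictions = [
    ("ich", {"PPER": 0.98, "NN": 0.02}),
    ("gsi", {"VAPP": 0.55, "VVPP": 0.45}),
    ("Huus", {"NN": 0.97, "NE": 0.03}),
    ("scho", {"ADV": 0.40, "PTKANT": 0.35, "ADJD": 0.25}),
]

manual, automatic = split_for_annotation(predictions, budget=2)
print("Annotate manually:", manual)
print("Annotated automatically:", automatic)
```

In this sketch the tokens with the flattest tag distributions are routed to the annotator, while tokens the tagger is confident about are labelled automatically. This mirrors the division between examples covered by the tagger's generalisations and variant-specific ones.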

Since the end of the project, we have continued to use the developed tools for a) improving speech-to-text conversion, b) improving the spelling normalisation, and c) increasing the size of the data sets to be used for machine learning. Based on these improvements, we will prepare a new project proposal with the goal of setting up our resources and tools as web services.

Project leadership

Tanja Samardžić

In collaboration with

Fatima Stadler, Noëmi Aepli

Funding

Hasler Foundation