The ArchiMob Corpus

The ArchiMob corpus represents German linguistic varieties spoken within the territory of Switzerland. This corpus is the first electronic resource containing long samples of transcribed text in Swiss German, intended for studying the spatial distribution of morphosyntactic features and for natural language processing.

This corpus is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Release 2 (2019)

The second release of the ArchiMob corpus is now available. It features:

Newly transcribed documents (9 more than in the first release)
Speech-to-text alignment at the utterance level (segments of 4-10 seconds)
Improved normalisation
Improved part-of-speech tagging

You can find more information on new features of the corpus in the Release 2 notes.

Access

Download XML files, metadata and annotation guidelines via SWISSUbase
Download audio samples and the full set of audio files via SWISSUbase
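
Once the XML files are downloaded, the transcription, normalisation and part-of-speech layers can be read with a standard XML parser. The following is a minimal sketch, not an official loader. It assumes a TEI-like layout in which utterances are `u` elements and tokens are `w` elements carrying `normalised` and `tag` attributes. These element and attribute names, the inline sample and the function names are assumptions made for illustration; check them against the annotation guidelines before use.

```python
import xml.etree.ElementTree as ET

# Invented fragment for illustration only. The element names (u, w) and the
# attribute names (normalised, tag) are assumptions about the corpus format.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <u xml:id="u1" start="media#T1">
      <w normalised="ja" tag="ITJ">jaa</w>
      <w normalised="das" tag="PDS">das</w>
      <w normalised="ist" tag="VAFIN">isch</w>
    </u>
  </body></text>
</TEI>"""

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local_name(tag):
    """Reduce a qualified tag such as '{http://...}w' to 'w'."""
    return tag.rsplit("}", 1)[-1]


def read_tokens(xml_text):
    """Return (utterance_id, transcription, normalisation, pos_tag) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for utt in root.iter():
        if local_name(utt.tag) != "u":
            continue
        utt_id = utt.get(XML_ID)
        for tok in utt.iter():
            if local_name(tok.tag) == "w":
                rows.append((
                    utt_id,
                    (tok.text or "").strip(),  # dialectal transcription
                    tok.get("normalised"),     # normalised form
                    tok.get("tag"),            # part-of-speech tag
                ))
    return rows


tokens = read_tokens(SAMPLE)
for row in tokens:
    print(row)
```

To process a downloaded file rather than the inline sample, pass the file contents to `read_tokens`. Matching on local names keeps the sketch independent of whichever namespace the files actually declare.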

Publications

Scherrer, Y., T. Samardžić, E. Glaser (2019). "Digitising Swiss German -- How to process and study a polycentric spoken language". Language Resources and Evaluation. (First online)

Scherrer, Y., T. Samardžić, E. Glaser (2019). "ArchiMob: Ein multidialektales Korpus schweizerdeutscher Spontansprache". Linguistik Online, 98(5), 425-454. https://doi.org/10.13092/lo.98.5947

Release 1 (2016)

Details of the corpus composition, formatting, and annotation can be found in the ArchiMob Release 1 Documentation.

Access

Download XML files and documentation via SWISSUbase

Publications

Samardžić, T., Y. Scherrer, E. Glaser (2016). "ArchiMob - A Corpus of Spoken Swiss German". In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia.

Samardžić, T., Y. Scherrer, E. Glaser (2015). "Normalising orthographic and dialectal variants for the automatic processing of Swiss German". In Proceedings of the 7th Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics. Poznań, Poland.

DOI https://doi.org/10.5281/zenodo.1158572

Map by Yves Scherrer