The ArchiMob corpus represents German linguistic varieties spoken within the territory of Switzerland. This corpus is the first electronic resource containing long samples of transcribed text in Swiss German, intended for studying the spatial distribution of morphosyntactic features and for natural language processing. The size of the current version of the corpus is 528 381 tokens.
Details of the corpus composition, formatting, and annotation can be found in the ArchiMob Release 1 Documentation (PDF, 317 KB).
Release 2 coming soon!
- For the XML archive, download the ZIP file (5411 KB).
- For online searches, go to Sketch Engine or ANNIS.
- For the audio sources, contact us.
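Once the XML archive is downloaded, a token count like the one quoted above can be checked with a short script. The following is only a minimal sketch: it assumes that each token is marked up as an element named `w`, which is an assumption about the markup and should be checked against the Release 1 documentation. The helper names `count_elements` and `make_demo_archive` are hypothetical, and the demo archive is made-up data, not corpus content.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{http://...}w' down to 'w'."""
    return tag.rsplit("}", 1)[-1]


def count_elements(archive, element="w"):
    """Count elements named `element` in every .xml member of a ZIP archive.

    `archive` is a file path or a binary file-like object.
    The default element name "w" is an ASSUMPTION about the token markup.
    """
    total = 0
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip documentation or other non-XML members
            with zf.open(name) as fh:
                # Stream the file so large transcripts need not fit in memory.
                for _, node in ET.iterparse(fh, events=("end",)):
                    if local_name(node.tag) == element:
                        total += 1
                    node.clear()  # release content that was already counted
    return total


def make_demo_archive():
    """Build a tiny in-memory ZIP with invented content, for illustration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "a.xml",
            '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
            "<u><w>es</w><w>isch</w></u></TEI>",
        )
        zf.writestr("b.xml", "<doc><w>ja</w></doc>")
        zf.writestr("readme.txt", "<w>not xml</w>")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(count_elements(make_demo_archive()))
```

To run it on the real download, pass the path of the ZIP file to `count_elements` instead of the demo archive, and change `element` if the documentation names a different token element.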
This corpus is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. If you wish to use the corpus for commercial purposes, please contact us.
Scherrer, Y., T. Samardžić, E. Glaser (2019). "Digitising Swiss German: How to process and study a polycentric spoken language". Language Resources and Evaluation. (First online)
Samardžić, T., Y. Scherrer, E. Glaser (2016). "ArchiMob - A Corpus of Spoken Swiss German". In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia.
Samardžić, T., Y. Scherrer, E. Glaser (2015). "Normalising orthographic and dialectal variants for the automatic processing of Swiss German". In Proceedings of the 7th Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics. Poznań, Poland.
Map by Yves Scherrer