The ArchiMob Corpus


The ArchiMob corpus represents German linguistic varieties spoken within the territory of Switzerland. This corpus is the first electronic resource containing long samples of transcribed text in Swiss German, intended for studying the spatial distribution of morphosyntactic features and for natural language processing.

This corpus is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Release 2 (2019)

The new version of the ArchiMob corpus is now out, featuring:

  • Newly transcribed documents (9 more than in the first release)
  • Speech-to-text alignment at the level of utterance (4-10 seconds)
  • Improved normalisation
  • Improved part-of-speech tagging


You can find more information on new features of the corpus in the Release 2 notes.


XML Download 


Transcription guidelines (in German): main, further specifications

Normalisation guidelines (in German): latest version

Online query with NoSketch

Samples (.wav) of audio sources; contact us for the full data set.


Scherrer, Y., T. Samardžić, E. Glaser (2019). "Digitising Swiss German -- How to process and study a polycentric spoken language". Language Resources and Evaluation. (First online) 

Scherrer, Y., T. Samardžić, E. Glaser (2019). "ArchiMob: Ein multidialektales Korpus schweizerdeutscher Spontansprache". Linguistik Online 98(5), 425-454.


Release 1 (2016)

Details of the corpus composition, formatting, and annotation can be found in the ArchiMob Release 1 Documentation (PDF, 317 KB).


 XML download (ZIP, 5 MB) 

Online query with NoSketch or ANNIS.


Samardžić, T., Y. Scherrer, E. Glaser (2016). "ArchiMob - A Corpus of Spoken Swiss German". In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia.


Samardžić, T., Y. Scherrer, E. Glaser (2015). "Normalising orthographic and dialectal variants for the automatic processing of Swiss German". In Proceedings of the 7th Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics. Poznan, Poland.


Map by Yves Scherrer