Release 2 notes
A detailed documentation of the corpus is shared with the first release (download here (PDF, 317 KB)). What follows is an overview of the main features of Release 2 and corrections performed since the previous release.
- 9 newly transcribed recordings (new total 43 documents). Compared to the first release, one additional transcription phase of nine recordings has been carried out with EXMARaLDA (1044, 1053, 1163, 1203, 1224, 1235, 1263, 1240, 1255).
POS tagging. This task was carried out as follows:
- The BTagger has been replaced by a CRF tagger.
- The tagging model has been enhanced by adding normalized forms as features.
- The training data set has been increased by adding data corrected through an active learning procedure, developed at the TakeLab (University of Zagreb). Human annotators are presented with automatically-tagged low-confidence utterances and their task is to correct the wrong tags. The corrected tags are then added to the training set for the next iteration and the procedure is repeated as long as it yields improvements on the test set.
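The active learning procedure described in the last item can be sketched roughly as below. This is a minimal illustration, not the TakeLab implementation: the function `active_learning` and the callables `train_fn`, `confidence_fn`, `annotate_fn` and `evaluate_fn` are hypothetical names introduced here, and the batch size is an arbitrary assumption.

```python
from typing import Callable, List, Tuple

Utterance = List[str]              # a list of tokens
Tagged = List[Tuple[str, str]]     # a list of (token, tag) pairs


def active_learning(
    train: List[Tagged],
    pool: List[Utterance],
    train_fn: Callable[[List[Tagged]], object],
    confidence_fn: Callable[[object, Utterance], float],
    annotate_fn: Callable[[Utterance], Tagged],
    evaluate_fn: Callable[[object], float],
    batch_size: int = 1,
):
    """Hypothetical loop: correct low-confidence utterances until the test score stops improving."""
    model = train_fn(train)
    best = evaluate_fn(model)
    pool = list(pool)
    while pool:
        # Present the utterances the current model is least confident about.
        pool.sort(key=lambda utt: confidence_fn(model, utt))
        batch, rest = pool[:batch_size], pool[batch_size:]
        # A human annotator corrects the tags (simulated here by annotate_fn).
        candidate_train = train + [annotate_fn(utt) for utt in batch]
        candidate = train_fn(candidate_train)
        score = evaluate_fn(candidate)
        if score <= best:
            break  # no improvement on the test set: stop iterating
        train, pool, model, best = candidate_train, rest, candidate, score
    return model, train, best
```

With a real CRF tagger, `confidence_fn` could for instance be derived from the marginal probabilities of the predicted tags; that choice is left open here.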
Some corrections of glitches and inconsistencies have been carried out in two phases, as described in the next sections.
First correction phase
The first correction phase ran until October 2018. Some transcription mistakes and inconsistencies were corrected (for example, schsch → sch, qu → kw, sp → schp). Further corrections concerned the spelling of schwa (initially ä, corrected to e) in files 1008, 1055, 1138, 1188, 1189 and 1205, and the change from ò to o in 1008.
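The example replacements quoted above can be illustrated with a short sketch. It assumes that each pair is applied to every match in a transcribed form, in the order listed; the actual corrections may have been context-dependent or carried out manually, and the names `CORRECTIONS` and `apply_corrections` are hypothetical.

```python
# Replacement pairs quoted in the release notes. Applying them blindly to
# every match, in this order, is an assumption of this sketch.
CORRECTIONS = [("schsch", "sch"), ("qu", "kw"), ("sp", "schp")]


def apply_corrections(form: str) -> str:
    """Apply each (old, new) replacement to a transcribed form."""
    for old, new in CORRECTIONS:
        form = form.replace(old, new)
    return form
```

The schwa correction (ä → e) is deliberately left out, since not every ä in a transcription represents a schwa and the change was limited to specific files.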
Second correction phase
The second correction phase took place in the second half of 2019. Below is an overview of the various issues that were addressed in this phase.
Errors in transcribed forms
A thorough analysis revealed some remaining errors in the transcribed forms. Most of them are strings that contain a parenthesis followed by the $ sign. The cause of the error can be explained as follows: in the .exb file that represents the transcription performed with the EXMARaLDA tool, parentheses are used by the transcriber as implicit annotation to delimit an unclear sequence (tag <unclear> in the .xml file), whereas the $ is used as a hesitation marker after a sequence such as ää (tag <vocal> in the .xml file). The Python script used to convert the .exb format into .xml is unable to process an utterance in which a vocalized unit is contained in an unclear sequence. An example of such an occurrence is the erroneous output (aso_rot_ä$ in ID 1044, which results from the following .exb annotation:
<event start="T486" end="T487">si wüssed (aso rot ä$)
khulturbolschewismus hed s uf e ganz e wiite
khulturberiich erschtreggt oder so </event>
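A check for the pattern in the annotation above (a hesitation marker inside a parenthesised unclear sequence) might look like the following sketch. It operates on the plain text of an event and is not the conversion script itself; the names `UNCLEAR_WITH_VOCAL` and `find_problem_spans` are hypothetical.

```python
import re
from typing import List

# A parenthesised span (unclear sequence) that contains a "$" hesitation
# marker. Nested parentheses are not handled in this sketch.
UNCLEAR_WITH_VOCAL = re.compile(r"\([^()]*\$[^()]*\)")


def find_problem_spans(event_text: str) -> List[str]:
    """Return the unclear sequences that contain a vocalized unit."""
    return UNCLEAR_WITH_VOCAL.findall(event_text)
```

Applied to the text of the event shown above, the function returns the single span `(aso rot ä$)`.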
This type of error affects the structure of the .xml file. The actual output is as follows:
However, the output should be as follows:
<w normalised="also" tag="ADV" xml:id="d1044-u535-w3">aso</w>
<w normalised="rot" tag="ADJD" xml:id="d1044-u535-w4">rot</w>
Adding elements to the utterance would mean that the numbering of the words should be changed: since the words w5 have been added, the following word should have the index 6 instead of the original 4. The final decision was that these instances should be corrected manually by adding the opening and closing tags <unclear>. This will treat the entire unclear sequence as a single vocalized element, thus not compromising the numbering. The output of the example illustrated above would then be as follows:
The remaining errors are:
- Five occurrences of a question mark after a word (ID 1082_2, 1082_3, 1121 and 1147). These have been corrected by means of a JSON file, using the same procedure as the Archimob correction page (see the note at the end).
- One occurrence of a w element that consists of a period sign (ID 1147), whereby the element is the only one in an utterance by a person with the ID
- Two empty elements, which were originally hyphens that were later removed: ID 1228, as a result of the expression fliegerbeobachtungs- und meldedienst; and ID 1248, where the element is the only one in an utterance pronounced by the interviewed person. Such empty elements are problematic for the KALDI tool, which cannot process them and crashes. Re-introducing the removed hyphens would not prevent the KALDI tool from crashing. The solution adopted is therefore to delete the empty elements, since this does not compromise the order of the sound files.
- Some elements in the XML files had an empty normalised attribute. The missing values have been added manually by means of a JSON file.
- In the entire file 1163.xml, the normalised attribute was “xxx”, due to a misalignment during the import of normalized forms into the XML file. This has been corrected.
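The checks for empty elements and empty normalised attributes could be sketched as follows with the Python standard library. This is an illustration under assumptions, not the script used for the release: the helper names are hypothetical, and elements are matched by local name so that the sketch works with or without an XML namespace.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag) -> str:
    """Strip a '{namespace}' prefix from a tag name, if present."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def drop_empty_words(root: ET.Element) -> int:
    """Delete w elements that have no text and no children; return the count."""
    removed = 0
    # Materialise the traversal first so that removals do not disturb it.
    for parent in list(root.iter()):
        for child in list(parent):
            is_empty = len(child) == 0 and not (child.text or "").strip()
            if local_name(child.tag) == "w" and is_empty:
                parent.remove(child)  # note: any tail text is dropped as well
                removed += 1
    return removed


def words_missing_normalised(root: ET.Element) -> List[ET.Element]:
    """Return w elements whose normalised attribute is absent or empty."""
    return [
        el for el in root.iter()
        if local_name(el.tag) == "w" and not el.get("normalised", "").strip()
    ]
```

A file would be loaded with `ET.parse(path).getroot()` before calling these helpers; the list returned by `words_missing_normalised` can then serve as input for manual correction.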
Erroneously deleted sound tracks
For the purpose of anonymization, some sound segments have been deleted. However, entire blocks of segments in doc 1188 seem to have been deleted even if they do not contain instances of anonymization. Though the cause of this glitch could not be determined, the missing sound files have been reinstated.
In the file person_file.xml, part of the XML archive available on the Archimob website, the following errors have been found and corrected:
- <person xml:id=PRos sex=f>, the recording number is missing (1073)
- <person xml:id=SErh sex=m>, the recording number is missing (1075)
- <person xml:id=DHan1963 sex=m>, the recording number is incorrect (it should be 1163)
Note: Archimob correction page, accessible only through the UZH net: http://linguistik-web.uzh.ch:4000/correct_archimob