Upstream text processing

Abstract

Many practical applications in Natural Language Processing (NLP), such as machine translation and speech recognition, benefit from text preprocessing steps which reduce data sparsity. For example, morphological text processing can help reduce sparsity through segmenting words into morphemes (morphological segmentation) or mapping inflected forms of words to their lemmas (lemmatization). Another example is normalization of writing: mapping surface word forms to their canonical forms through reducing dialectological variation or correcting spelling errors. In many cases, such upstream tasks can be formulated as sequence transformation tasks and solved with the same neural sequence-to-sequence technology that is used in neural machine translation (NMT) and speech processing. In this project, we develop systems for a range of upstream tasks by enriching character-level sequence-to-sequence models with structural signal derived from multiple text organization layers: characters, morphemes, words and sentences.

Project members: Tatiana Ruzsics (PhD student) and Tanja Samardžić (PI).

Funding: URPP "Language and Space" (UZH internal)

NMT System with target context encoding via Higher-Level Language Model: Synchronized decoding

Ruzsics, T. and T. Samardžić (2017). "Neural Sequence-to-sequence Learning of Internal Word Structure"

NMT System with source context encoding via Hierarchical biLSTM and PoS tags

Ruzsics, T. and T. Samardžić (Draft). "Multilevel text normalization with sequence-to-sequence networks and multisource learning" . ArXiv

NMT System with Hard Attention and Copy Mechanism

Makarov P., T. Ruzsics, and S. Clematide (2017). "Align and copy: UZH at SIGMORPHON 2017 shared task for morphological reinflection"

Weiterführende Informationen

Ruzsics, T., O. Sozinova, X. Gutierrez-Vasques and T. Samardžić (2021). "Interpretability for Morphological Inflection: from Character-level Predictions to Subword-level Rules". Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online.

Ruzsics, T. and T. Samardžić (Draft).

"Multilevel text normalization with sequence-to-sequence networks and multisource learning" . ArXiv

Ruzsics, T., Lusetti, M., A. Göhring, T. Samardžić and E. Stark (2019). "Neural text normalization with adapted decoding and PoS features". Natural Language Engineering. 585 - 605. Cambridge University Press. Pre-print

Lusetti, M., T. Ruzsics, A. Göhring, T. Samardžić and E. Stark (2018). "Encoder-Decoder Methods for Text Normalization". In Proceedings of the Workshop Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (COLING 2018). Santa Fe, New Mexico, USA.

Makarov P., T. Ruzsics, and S. Clematide (2017). "Align and copy: UZH at SIGMORPHON 2017 shared task for morphological reinflection". In Proceedings of the CoNLL- SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, Vancouver, Canada. Overall winner of Task 1.

Ruzsics, T. and T. Samardžić (2017). "Neural Sequence-to-sequence Learning of Internal Word Structure". In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Vancouver, Canada.

Neural Text Normalization with Adapted Decoding and POS Feature at SwissText 2019

Encoder-Decoder Methods for Text Normalization at SwissText 2018

URPP Language and Space Language and Space Lab

Quicklinks und Sprachwechsel

Main navigation