Upstream text processing


Many practical applications in Natural Language Processing (NLP), such as machine translation and speech recognition, benefit from text preprocessing steps that reduce data sparsity. For example, morphological processing can reduce sparsity by segmenting words into morphemes (morphological segmentation) or by mapping inflected word forms to their lemmas (lemmatization). Another example is normalization of writing: mapping surface word forms to their canonical forms by reducing dialectal variation or correcting spelling errors. In many cases, such upstream tasks can be formulated as sequence transformation tasks and solved with the same neural sequence-to-sequence technology that is used in neural machine translation (NMT) and speech processing. In this project, we develop systems for a range of upstream tasks by enriching character-level sequence-to-sequence models with structural signals derived from multiple layers of text organization: characters, morphemes, words and sentences.
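To illustrate how the upstream tasks named above can share one sequence transformation format, here is a minimal Python sketch that turns word pairs into character-level source and target sequences. It is not the project's code: the function names, the `|` morpheme-boundary symbol, the `_` space symbol and the English word pairs are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Union

# Hypothetical symbols chosen for this sketch, not taken from the project.
BOUNDARY = "|"      # marks a morpheme boundary in segmentation targets
SPACE_SYMBOL = "_"  # stands in for a space so it survives as a token

Example = Dict[str, Union[str, List[str]]]


def to_char_sequence(text: str) -> List[str]:
    """Split a string into character tokens, making spaces explicit."""
    return [SPACE_SYMBOL if ch == " " else ch for ch in text]


def make_example(task: str, source: str, target: str) -> Example:
    """Represent one training pair as character-level sequences."""
    return {
        "task": task,
        "source": to_char_sequence(source),
        "target": to_char_sequence(target),
    }


# Illustrative pairs only; real data would come from annotated corpora.
examples = [
    make_example("segmentation", "unhappiness", "un" + BOUNDARY + "happi" + BOUNDARY + "ness"),
    make_example("lemmatization", "running", "run"),
    make_example("normalization", "recieve", "receive"),
]

for ex in examples:
    print(ex["task"], " ".join(ex["source"]), "->", " ".join(ex["target"]))
```

Because all three tasks reduce to the same source and target representation, one character-level encoder-decoder architecture can in principle be trained on any of them. Only the data differ.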

Project members: Tatiana Ruzsics (PhD student) and Tanja Samardžić (PI).

Funding: URPP "Language and Space" (UZH internal)

NMT System with target context encoding via Higher-Level Language Model: Synchronized decoding

NMT System with source context encoding via Hierarchical biLSTM and PoS tags

NMT System with Hard Attention and Copy Mechanism