Unmasking the Myth of Effortless Big Data - Making an Open Source Multilingual Infrastructure and Building Language Resources from Scratch

Cited by: 0
Authors
Wiechetek, Linda [1]
Hiovain-Asikainen, Katri [1]
Mikkelsen, Inga Lill Sigga [1]
Moshagen, Sjur N. [1]
Pirinen, Flammie A. [1]
Trosterud, Trond [1]
Gaup, Borre [1]
Affiliations
[1] UiT Arctic Univ Norway, Dept Language & Culture, Tromso, Norway
Keywords
infrastructure; corpus; text processing; minority languages; finite state technology; knowledge-based NLP; grammar checking; TTS; ASR; speech technology; spellchecking
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Machine learning (ML) approaches have dominated Natural Language Processing (NLP) during the last two decades. Beyond machine translation and speech technology, machine learning tools are now also in use for spellchecking and grammar checking, with a blurry distinction between the two. We unmask the myth of effortless big data by illuminating the effort and time that lie behind building a multi-purpose corpus, with regard to collecting, marking up and building from scratch. We also discuss what kind of language technology tools minority language communities actually need, and to what extent the dominant paradigm has been able to deliver these tools. In this context we present our alternative to corpus-based language technology - knowledge-based language technology - and we show how this approach can provide language technology solutions for languages that are outside the reach of machine learning procedures. We present a stable and mature infrastructure (GiellaLT) that contains more than a hundred languages and builds a number of language technology tools that are useful for language communities.
Pages: 1167-1177
Page count: 11