Unmasking the Myth of Effortless Big Data - Making an Open Source Multilingual Infrastructure and Building Language Resources from Scratch

Cited by: 0

Authors:

Wiechetek, Linda ^{[1]}

Hiovain-Asikainen, Katri ^{[1]}

Mikkelsen, Inga Lill Sigga ^{[1]}

Moshagen, Sjur N. ^{[1]}

Pirinen, Flammie A. ^{[1]}

Trosterud, Trond ^{[1]}

Gaup, Borre ^{[1]}

Affiliations:

[1] UiT Arctic Univ Norway, Dept Language & Culture, Tromso, Norway

Source:

LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2022

Keywords:

infrastructure; corpus; text processing; minority languages; finite state technology; knowledge-based nlp; grammar checking; TTS; ASR; speech technology; spellchecking;

DOI:

Not available

Chinese Library Classification number:

TP39 [Computer applications];

Subject classification code:

081203 ; 0835 ;

Abstract:

Machine learning (ML) approaches have dominated Natural Language Processing (NLP) during the last two decades. Beyond machine translation and speech technology, machine learning tools are now also in use for spellchecking and grammar checking, with a blurry distinction between the two. We unmask the myth of effortless big data by illuminating the effort and time that lie behind building a multi-purpose corpus with regard to collecting, marking up and building from scratch. We also discuss what kind of language technology tools minority language communities actually need, and to what extent the dominating paradigm has been able to deliver these tools. In this context we present our alternative to corpus-based language technology - knowledge-based language technology - and we show how this approach can provide language technology solutions for languages that lie outside the reach of machine learning procedures. We present a stable and mature infrastructure (GiellaLT) that contains more than a hundred languages and builds a number of language technology tools that are useful for language communities.

Pages: 1167-1177

Number of pages: 11