Creating language resources for under-resourced languages: methodologies, and experiments with Arabic

Authors
Mahmoud El-Haj
Udo Kruschwitz
Chris Fox
Affiliations
[1] Lancaster University, School of Computing and Communications
[2] University of Essex, CSEE
Keywords
Resources; Summarisation; Arabic; Under-resourced languages
Abstract
Language resources are important for those working on computational methods to analyse and study languages. These resources are needed to advance research in fields such as natural language processing, machine learning, information retrieval and text analysis in general. We describe the creation of useful resources for languages that currently lack them, taking resources for Arabic summarisation as a case study. We illustrate three different paradigms for creating language resources: (1) using crowdsourcing to produce a small resource rapidly and relatively cheaply; (2) translating an existing gold-standard dataset, which is relatively easy but potentially yields a lower-quality resource; and (3) using manual effort with appropriately skilled human participants to create a resource that is more expensive but of high quality. The last of these was used as a test collection for TAC-2011. An evaluation of the resources is also presented.
Pages: 549-580
Number of pages: 31