Creating language resources for under-resourced languages: methodologies, and experiments with Arabic

Authors
Mahmoud El-Haj
Udo Kruschwitz
Chris Fox
Affiliations
[1] Lancaster University, School of Computing and Communications
[2] University of Essex, CSEE
Keywords
Resources; Summarisation; Arabic; Under-resourced languages
DOI
Not available
Abstract
Language resources are important for those working on computational methods to analyse and study languages. These resources are needed to help advance research in fields such as natural language processing, machine learning, information retrieval and text analysis in general. We describe the creation of useful resources for languages that currently lack them, taking resources for Arabic summarisation as a case study. We illustrate three different paradigms for creating language resources: (1) using crowdsourcing to produce a small resource rapidly and relatively cheaply; (2) translating an existing gold-standard dataset, which is relatively easy but potentially of lower quality; and (3) using manual effort with appropriately skilled human participants to create a resource that is more expensive but of high quality. The last of these was used as a test collection for TAC-2011. An evaluation of the resources is also presented.
Pages: 549-580
Page count: 31