SHARING HIGH-QUALITY LANGUAGE RESOURCES IN THE LEGAL DOMAIN TO DEVELOP NEURAL MACHINE TRANSLATION FOR UNDER-RESOURCED EUROPEAN LANGUAGES

Cited by: 4
Authors
Bago, Petra [1]
Castilho, Sheila [2]
Celeste, Edoardo [3, 4]
Dunne, Jane [2]
Gaspari, Federico [2]
Gislason, Niels Runar [5]
Kasen, Andre
Klubicka, Filip [1]
Kristmannsson, Gauti [5]
McHugh, Helen [2]
Moran, Roisin [7]
Ni Loinsigh, Orla [2]
Olsen, Jon Arild [6]
Parra Escartin, Carla [7]
Ramesh, Akshai [7]
Resende, Natalia [2]
Sheridan, Paraic [7]
Way, Andy [2]
Affiliations
[1] Univ Zagreb, Fac Humanities & Social Sci, Zagreb, Croatia
[2] Dublin City Univ, ADAPT Ctr, Dublin, Ireland
[3] Dublin City Univ, Sch Law & Govt, Dublin, Ireland
[4] ADAPT Ctr, Limerick, Ireland
[5] Univ Iceland, Reykjavik, Iceland
[6] Natl Lib Norway, Oslo, Norway
[7] Icon Translat Machines Ltd, Dublin, Ireland
Funding
Science Foundation Ireland
Keywords
language resources; under-resourced languages; legal translation; neural machine translation; evaluation
DOI
10.2436/rld.i78.2022.3741
Chinese Library Classification (CLC) number
D9 [Law]; DF [Law]
Discipline classification code
0301
Abstract
This article reports some of the main achievements of the European Union-funded PRINCIPLE project in collecting high-quality language resources (LRs) in the legal domain for four under-resourced European languages: Croatian, Irish, Norwegian, and Icelandic. After illustrating the significance of this work for developing translation technologies in the context of the European Union and the European Economic Area, the article outlines the main steps of data collection, curation, and sharing of the LRs gathered with the support of public and private data contributors. This is followed by a description of the development pipeline and key features of the state-of-the-art, bespoke neural machine translation (MT) engines for the legal domain that were built using this data. The MT systems were evaluated with a combination of automatic and human methods to validate the quality of the LRs collected in the project, and the high-quality LRs were subsequently shared with the wider community via the ELRC-SHARE repository. The main challenges encountered in this work are discussed, emphasising the importance and the key benefits of sharing high-quality digital LRs.
Pages: 9-34
Number of pages: 26