Building Comparable Corpora for Assessing Multi-Word Term Alignment

Authors
Adjali, Omar [1]
Morin, Emmanuel [2]
Zweigenbaum, Pierre [1]
Affiliations
[1] Univ Paris Saclay, CNRS, Lab Interdisciplinaire Sci Numer, Orsay, France
[2] Nantes Univ, CNRS, Laboratoire Sci Numer Nantes, Nantes, France
Keywords
EXTRACTION;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification codes
081203; 0835;
Abstract
Recent work has demonstrated the importance of dealing with Multi-Word Terms (MWTs) in several Natural Language Processing (NLP) applications. In particular, MWTs pose serious challenges for alignment and machine translation systems because of their syntactic and semantic properties. Thus, developing algorithms that handle MWTs is becoming essential for many NLP tasks. However, the availability of bilingual and, more generally, multilingual resources is limited, especially for low-resource languages and in specialized domains. In this paper, we propose an approach for building comparable corpora and bilingual term dictionaries that help evaluate bilingual term alignment in comparable corpora. To this end, we exploit parallel corpora to perform automatic bilingual MWT extraction and comparable corpus construction. Parallel information helps to align bilingual MWTs and makes it easier to build comparable specialized sub-corpora. Experimental validation on an existing dataset and on manually annotated data demonstrates the value of the proposed methodology.
Pages: 3103-3112
Page count: 10