Computational Phraseology light: automatic translation of multiword expressions without translation resources

Cited by: 1
Authors
Mitkov, Ruslan [1]
Affiliations
[1] Wolverhampton Univ, Res Inst Informat & Language Proc, Wolverhampton, W Midlands, England
Keywords
multiword expressions (MWEs); extraction of MWEs; translation of MWEs; comparable corpora; association measures;
DOI
10.1515/phras-2016-0008
Chinese Library Classification (CLC) number
H [Language, Writing];
Discipline classification code
05;
Abstract
This paper describes the first phase of a project whose ultimate goal is the implementation of a practical tool to support the work of language learners and translators by automatically identifying multiword expressions (MWEs) and retrieving their translations for any pair of languages. The task of translating multiword expressions is viewed as a two-stage process. The first stage is the extraction of MWEs in each of the languages; the second stage is a matching procedure for the extracted MWEs in each language which proposes the translation equivalents. This project pursues the development of a knowledge-poor approach for any pair of languages which does not depend on translation resources such as dictionaries, translation memories or parallel corpora, which can be time-consuming to develop or difficult to acquire because they are expensive or proprietary. In line with this philosophy, the methodology developed does not rely on any dictionaries or parallel corpora, nor does it use any (bilingual) grammars. The only information comes from comparable corpora, inexpensively compiled. The first proof-of-concept stage of this project covers English and Spanish and focuses on a particular subclass of MWEs: verb-noun expressions (collocations) such as take advantage, make sense, prestar atención and tener derecho. The choice of genre was determined by the fact that newswire is a widespread genre and available in different languages. An additional motivation was the fact that the methodology was developed as language independent with the objective of applying it to and testing it for different languages. The ACCURAT toolkit (Pinnis et al. 2012; Skadina et al. 2012; Su and Babych 2012a) was employed to compile the comparable corpora automatically, and only documents above a specific comparability threshold were considered for inclusion. More specifically, only pairs of English and Spanish documents with a comparability score (cosine similarity) higher than 0.45 were extracted.
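The document-pair filtering step described above can be illustrated with a minimal sketch. It is not the ACCURAT toolkit's implementation: it assumes, purely for illustration, that the English and Spanish documents have already been mapped into a shared feature space of term counts, and the names `cosine`, `select_pairs`, `en_docs` and `es_docs` are hypothetical. Only the 0.45 cut-off comes from the abstract.

```python
import math
from collections import Counter

THRESHOLD = 0.45  # comparability cut-off stated in the abstract


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def select_pairs(en_docs, es_docs, threshold=THRESHOLD):
    """Return (en_id, es_id, score) for pairs scoring strictly above threshold."""
    kept = []
    for en_id, en_vec in en_docs.items():
        for es_id, es_vec in es_docs.items():
            score = cosine(en_vec, es_vec)
            if score > threshold:
                kept.append((en_id, es_id, score))
    return kept


# Toy data (hypothetical): features are assumed to be shared across languages.
en_docs = {"en1": Counter({"election": 3, "vote": 2, "party": 1})}
es_docs = {
    "es1": Counter({"election": 2, "vote": 2, "party": 2}),
    "es2": Counter({"football": 4, "goal": 3}),
}
pairs = select_pairs(en_docs, es_docs)
```

In this toy run only the pair that shares vocabulary survives the filter; how the cross-lingual feature space is actually built is left open here, since the abstract does not specify it.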
Statistical association measures were employed to quantify the strength of the relationship between two words and to propose that a combination of a verb and a noun scoring above a specific threshold is a candidate multiword expression. This study focused on and compared four popular and established measures along with frequency: Log-likelihood ratio, T-Score, Log Dice and Salience. This project follows the distributional similarity premise, which stipulates that translation equivalents share common words in their contexts, and that this also applies to multiword expressions. The Vector Space Model is traditionally used to represent words with their co-occurrences and to measure similarity. The vector representation for any word is constructed from the statistics of the occurrences of that word with other specific/context words in a corpus of texts. In this study, the word2vec method (Mikolov et al. 2013) was employed. Mikolov et al.'s method utilises patterns of word co-occurrences within a small window to predict similarities among words. Evaluation results are reported for both the extraction of MWEs and their automatic translation. One notable finding of the evaluation is that the size of the comparable corpora is more important for the performance of automatic translation of MWEs than the similarity between them, as long as the comparable corpora used meet a minimal level of similarity.
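Three of the association measures named above can be sketched from raw co-occurrence counts. This is a sketch under stated assumptions and not the paper's code: it uses the commonly cited textbook definitions (T-Score as (O - E) / sqrt(O), Log Dice as 14 + log2(2 f(v,n) / (f(v) + f(n))), and the log-likelihood ratio over a 2x2 contingency table), which may differ in detail from the variants the study used. Salience is omitted because its exact formulation is not given in the abstract. The function names and the counts are hypothetical.

```python
import math


def t_score(o11: float, f1: float, f2: float, n: float) -> float:
    """T-Score: (observed - expected) / sqrt(observed)."""
    expected = f1 * f2 / n
    return (o11 - expected) / math.sqrt(o11)


def log_dice(o11: float, f1: float, f2: float) -> float:
    """Log Dice: 14 + log2(2 * f(v, n) / (f(v) + f(n)))."""
    return 14 + math.log2(2 * o11 / (f1 + f2))


def log_likelihood(o11: float, f1: float, f2: float, n: float) -> float:
    """Log-likelihood ratio (G2) over the 2x2 contingency table."""
    observed = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]
    expected = [
        f1 * f2 / n,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    # Cells with zero observations contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# Hypothetical counts for a verb-noun pair such as "take advantage":
# pair frequency, verb frequency, noun frequency, corpus size.
pair_f, verb_f, noun_f, corpus_n = 30, 1000, 50, 1_000_000
scores = {
    "t_score": t_score(pair_f, verb_f, noun_f, corpus_n),
    "log_dice": log_dice(pair_f, verb_f, noun_f),
    "log_likelihood": log_likelihood(pair_f, verb_f, noun_f, corpus_n),
}
```

A verb-noun pair would then be proposed as a candidate MWE when its score under the chosen measure exceeds a threshold; the threshold values are not given in the abstract and are left unspecified here.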
Pages: 149-166
Number of pages: 18