A Simple Yet Robust Algorithm for Automatic Extraction of Parallel Sentences: A Case Study on Arabic-English Wikipedia Articles

Cited by: 2
Author
Althobaiti, Maha Jarallah [1]
Affiliation
[1] Taif Univ, Dept Comp Sci, At Taif 21944, Saudi Arabia
Keywords
Semantics; Online services; Internet; Encyclopedias; Dictionaries; Machine translation; Computational modeling; Automatic creation of parallel corpus; automatic sentence alignment; deep learning; neural machine translation; transformer model; word embedding; LINKAGE;
DOI
10.1109/ACCESS.2021.3137830
Chinese Library Classification
TP [Automation technology; computer technology]
Discipline code
0812
Abstract
Parallel corpora are vital components in several applications of Natural Language Processing (NLP), particularly in machine translation. In this paper, we present a novel method for automatically creating parallel sentences from comparable corpora. The method requires a bilingual dictionary as well as an adequate word-vectorisation method. We use Arabic and English Wikipedia as a comparable corpus to apply our proposed method and construct a parallel corpus between Arabic and English. The created Arabic-English corpus consists of 105,010 parallel sentences with a total of 4.6M words. During our study, we compared two methods of word vectorisation, word embedding and term frequency-inverse document frequency (TF-IDF), in terms of their usefulness in computing similarities between well-formed and syntactically ill-formed sentences. We also quantitatively and qualitatively examined the parallel corpus produced by our proposed method and compared it with other available Arabic-English parallel corpora: GlobalVoices, TED, and Wiki-OPUS. We explored the main advantages and shortcomings of these corpora when used for NLP applications, such as word semantic similarity identification and Neural Machine Translation (NMT). The word semantic similarity models trained on our parallel corpus outperformed models trained on other corpora in the task of English non-similar word identification. Our parallel corpus also proved competitive when building Arabic-English NMT systems, yielding results comparable to those of the automatically created Wiki-OPUS corpus and of the manually created TED corpus, while achieving results superior to the smaller GlobalVoices corpus.
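The abstract compares TF-IDF and word-embedding vectorisation for scoring sentence similarity. A minimal, self-contained sketch of the TF-IDF side is shown below, using plain whitespace tokenisation and cosine similarity over sparse vectors. This is an illustrative assumption, not the paper's actual pipeline, which additionally maps words across languages with a bilingual dictionary before comparing sentences.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Return one sparse TF-IDF vector (dict: term -> weight) per sentence."""
    # Naive whitespace tokenisation (illustrative assumption only).
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each term appears.
    df = Counter()
    for doc in docs:
        for term in set(doc):
            df[term] += 1
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Classic weighting: term frequency times log inverse document frequency.
        vec = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

if __name__ == "__main__":
    sents = [
        "the cat sat on the mat",
        "the cat slept on the mat",
        "stock markets fell sharply",
    ]
    vecs = tfidf_vectors(sents)
    # Near-paraphrases score higher than an unrelated sentence.
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Note that with TF-IDF, two sentences sharing no surface tokens score exactly zero, which is one reason the paper also evaluates word embeddings, where semantically related but non-identical words can still contribute to similarity.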
Pages: 401-420
Page count: 20