Noisy Parallel Corpus Filtering through Projected Word Embeddings

Cited by: 0
Authors
Kurfali, Murathan [1]
Ostling, Robert [1]
Affiliation
[1] Stockholm Univ, Dept Linguist, Stockholm, Sweden
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present a very simple method for parallel text cleaning of low-resource languages, based on projection of word embeddings trained on large monolingual corpora in high-resource languages. Despite its simplicity, the method approaches the strong baseline system in the downstream machine translation evaluation.
Pages: 277-281
Number of pages: 5