Recovering document annotations for sentence-level bitext

Cited by: 0
Authors
Wicks, Rachel [1 ,2 ]
Post, Matt [1 ,2 ,3 ]
Koehn, Philipp [1 ,2 ]
Affiliations
[1] Johns Hopkins Univ, Human Language Technol Ctr Excellence, Baltimore, MD 21218 USA
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[3] Microsoft, Redmond, WA USA
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
Data availability limits the scope of any given task. In machine translation, historical models were incapable of handling longer contexts, so the lack of document-level datasets was less noticeable. Now, despite the emergence of long-sequence methods, we remain within a sentence-level paradigm and without data to adequately approach context-aware machine translation. Most large-scale datasets have been processed through a pipeline that discards document-level metadata. In this work, we reconstruct document-level information for three large datasets (ParaCrawl, News Commentary, and Europarl) in German, French, Spanish, Italian, Polish, and Portuguese (paired with English). We then introduce a document-level filtering technique as an alternative to traditional bitext filtering. We present this filtering with analysis to show that it prefers context-consistent translations over those that may have been machine translated at the sentence level. Finally, we train models on these longer contexts and demonstrate improvement in document-level translation without degradation of sentence-level translation. We release our dataset, ParaDocs, and the resulting models as a resource to the community.
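The abstract describes regrouping sentence-level bitext into documents and then filtering at the document level instead of per sentence pair. As a rough illustration only, not the authors' pipeline, the Python sketch below groups sentence pairs by a hypothetical recovered document ID and keeps whole documents whose mean sentence-pair quality score clears a threshold; the tuple layout, the scoring field, and the thresholds are all assumptions made for the example.

    # Minimal sketch (illustrative, not the ParaDocs pipeline): regroup
    # sentence-level bitext into documents via a recovered document ID,
    # then filter whole documents rather than individual sentence pairs.
    from collections import defaultdict
    from statistics import mean

    def regroup_by_document(sentence_pairs):
        """Group (doc_id, src, tgt, score) tuples into per-document lists,
        preserving the original sentence order within each document."""
        docs = defaultdict(list)
        for doc_id, src, tgt, score in sentence_pairs:
            docs[doc_id].append((src, tgt, score))
        return docs

    def filter_documents(docs, min_mean_score=0.7, min_sentences=3):
        """Keep documents whose mean sentence-pair quality score (an assumed
        metric) passes a threshold and that contain enough sentences."""
        kept = {}
        for doc_id, pairs in docs.items():
            scores = [s for _, _, s in pairs]
            if len(pairs) >= min_sentences and mean(scores) >= min_mean_score:
                kept[doc_id] = pairs
        return kept

    # Toy usage with placeholder scores: only doc1 survives the filter.
    pairs = [
        ("doc1", "Hallo Welt.", "Hello world.", 0.90),
        ("doc1", "Wie geht's?", "How are you?", 0.80),
        ("doc1", "Bis bald.", "See you soon.", 0.85),
        ("doc2", "Guten Tag.", "Random noise.", 0.20),
    ]
    print(filter_documents(regroup_by_document(pairs)).keys())  # dict_keys(['doc1'])

Filtering at the document level, as sketched here, keeps each surviving document's context intact, which is the property the abstract highlights when it contrasts context-consistent translations with sentence-level machine-translated text.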
Pages: 9876-9890
Page count: 15