Recovering document annotations for sentence-level bitext

Authors
Wicks, Rachel [1, 2]
Post, Matt [1, 2, 3]
Koehn, Philipp [1, 2]
Affiliations
[1] Johns Hopkins Univ, Human Language Technol Ctr Excellence, Baltimore, MD 21218 USA
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[3] Microsoft, Redmond, WA USA
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data availability limits the scope of any given task. In machine translation, earlier models could not handle longer contexts, so the lack of document-level datasets was less noticeable. Now, despite the emergence of long-sequence methods, we remain within a sentence-level paradigm and lack the data needed to adequately approach context-aware machine translation. Most large-scale datasets have been processed through a pipeline that discards document-level metadata. In this work, we reconstruct document-level information for three large datasets (ParaCrawl, News Commentary, and Europarl) in German, French, Spanish, Italian, Polish, and Portuguese, each paired with English. We then introduce a document-level filtering technique as an alternative to traditional bitext filtering. We present this filtering alongside an analysis showing that the method prefers context-consistent translations over those that may have been machine translated at the sentence level. Finally, we train models on these longer contexts and demonstrate improvement in document-level translation without degradation of sentence-level translation. We release our dataset, PARADOCS, and the resulting models as a resource to the community.
Pages: 9876-9890
Page count: 15