Recovering document annotations for sentence-level bitext

Cited by: 0
Authors
Wicks, Rachel [1, 2]
Post, Matt [1, 2, 3]
Koehn, Philipp [1, 2]
Affiliations
[1] Johns Hopkins Univ, Human Language Technol Ctr Excellence, Baltimore, MD 21218 USA
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[3] Microsoft, Redmond, WA USA
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Data availability limits the scope of any given task. In machine translation, historical models were incapable of handling longer contexts, so the lack of document-level datasets was less noticeable. Now, despite the emergence of long-sequence methods, we remain within a sentence-level paradigm and without data to adequately approach context-aware machine translation. Most large-scale datasets have been processed through a pipeline that discards document-level metadata. In this work, we reconstruct document-level information for three large datasets (ParaCrawl, News Commentary, and Europarl) in German, French, Spanish, Italian, Polish, and Portuguese (paired with English). We then introduce a document-level filtering technique as an alternative to traditional bitext filtering. We present this filtering with analysis showing that the method prefers context-consistent translations over those that may have been machine translated at the sentence level. Last, we train models on these longer contexts and demonstrate improvement in document-level translation without degradation of sentence-level translation. We release our dataset, PARADOCS, and the resulting models as a resource to the community.
Pages: 9876-9890
Page count: 15