Recovering document annotations for sentence-level bitext

Authors
Wicks, Rachel [1, 2]
Post, Matt [1, 2, 3]
Koehn, Philipp [1, 2]
Affiliations
[1] Johns Hopkins Univ, Human Language Technol Ctr Excellence, Baltimore, MD 21218 USA
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[3] Microsoft, Redmond, WA USA
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data availability limits the scope of any given task. In machine translation, historical models were incapable of handling longer contexts, so the lack of document-level datasets was less noticeable. Now, despite the emergence of long-sequence methods, we remain within a sentence-level paradigm and lack the data to adequately approach context-aware machine translation. Most large-scale datasets have been processed through a pipeline that discards document-level metadata. In this work, we reconstruct document-level information for three large datasets (ParaCrawl, News Commentary, and Europarl) in German, French, Spanish, Italian, Polish, and Portuguese, each paired with English. We then introduce a document-level filtering technique as an alternative to traditional bitext filtering. We present this filtering with an analysis showing that the method prefers context-consistent translations over those that may have been machine translated at the sentence level. Finally, we train models on these longer contexts and demonstrate improvement in document-level translation without degradation of sentence-level translation. We release our dataset, PARADOCS, and the resulting models as a resource to the community.
Pages: 9876-9890
Page count: 15