Cross-lingual Text Reuse Detection at Document Level for English-Urdu Language Pair

Cited by: 0
Authors
Sharjeel, Muhammad [1]
Muneer, Iqra [2]
Nosheen, Sumaira [3]
Nawab, Rao Muhammad Adeel [1]
Rayson, Paul [4]
Affiliations
[1] COMSATS Univ Islamabad, Dept Comp Sci, Lahore Campus, Lahore 54000, Pakistan
[2] Univ Engn & Technol Lahore, Narowal Campus, Narowal 54000, Pakistan
[3] Bahria Univ, Lahore Campus, Lahore 54000, Pakistan
[4] Univ Lancaster, Lancaster, England
Keywords
Cross-lingual text reuse; cross-lingual text reuse detection; English-Urdu language pair; cross-lingual sentence embedding; Translation Plus Mono-lingual Analysis
DOI
10.1145/3592761
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, the problem of Cross-Lingual Text Reuse Detection (CLTRD) has gained the interest of the research community due to the availability of large digital repositories and automatic Machine Translation (MT) systems. These systems are readily available and openly accessible, which makes text easier to reuse across languages but makes such reuse harder to detect. In previous studies, different corpora and methods have been developed for CLTRD at the sentence/passage level for the English-Urdu language pair. However, there is a lack of large standard corpora and methods for CLTRD for the English-Urdu language pair at the document level. To address this gap, the main contribution of this study is the development of a large benchmark cross-lingual (English-Urdu) text reuse corpus, called the TREU (Text Reuse for English-Urdu) corpus. It contains real cases of English-to-Urdu text reuse at the document level. The corpus is manually labelled into three categories (Wholly Derived = 672, Partially Derived = 888, and Non Derived = 697), with the source text in English and the derived text in Urdu. Another contribution of this study is the evaluation of the TREU corpus using a diverse range of methods to show its usefulness and how it can be utilized in the development of automatic methods for measuring cross-lingual (English-Urdu) text reuse at the document level. The best evaluation results, for both the binary (F1 = 0.78) and ternary (F1 = 0.66) classification tasks, are obtained using a combination of all Translation plus Mono-lingual Analysis (T+MA) based methods. The TREU corpus is publicly available to promote CLTRD research in an under-resourced language, i.e., Urdu.
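The general idea behind Translation plus Mono-lingual Analysis, as named in the abstract, is to bring both documents into one language and then apply a monolingual similarity measure. The sketch below is only an illustration of that idea, not the authors' method: it assumes the Urdu document has already been machine-translated into English by some external system, uses word n-gram containment as a stand-in similarity measure, and maps the score to the three corpus labels with hypothetical thresholds (`partial_threshold`, `wholly_threshold`) that would need tuning on real data.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int = 1) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in a whitespace-tokenised text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(source_en: str, derived_en: str, n: int = 1) -> float:
    """Fraction of the derived text's n-grams that also occur in the source.

    `derived_en` is assumed to be an English translation of the Urdu document,
    produced beforehand by an MT system (not shown here).
    """
    derived = word_ngrams(derived_en, n)
    if not derived:
        return 0.0
    return len(derived & word_ngrams(source_en, n)) / len(derived)


def classify(score: float,
             partial_threshold: float = 0.3,
             wholly_threshold: float = 0.7) -> str:
    """Map a similarity score to a label; both thresholds are hypothetical."""
    if score >= wholly_threshold:
        return "Wholly Derived"
    if score >= partial_threshold:
        return "Partially Derived"
    return "Non Derived"


source = "the minister announced a new education policy on monday"
candidates = [
    "the minister announced a new education policy on monday",
    "the minister announced a policy while visiting lahore",
    "heavy rain flooded several streets in the city",
]
for text in candidates:
    print(classify(containment(source, text)), "<-", text)
```

A binary variant of the task can be obtained from the same score by merging the two derived labels into a single "Derived" class.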
Pages: 22