Cross-lingual Text Reuse Detection at Document Level for English-Urdu Language Pair

Authors
Sharjeel, Muhammad [1]
Muneer, Iqra [2]
Nosheen, Sumaira [3]
Nawab, Rao Muhammad Adeel [1]
Rayson, Paul [4]
Affiliations
[1] COMSATS Univ Islamabad, Dept Comp Sci, Lahore Campus, Lahore 54000, Pakistan
[2] Univ Engn & Technol Lahore, Narowal Campus, Narowal 54000, Pakistan
[3] Bahria Univ, Lahore Campus, Lahore 54000, Pakistan
[4] Univ Lancaster, Lancaster, England
Keywords
Cross-lingual text reuse; cross-lingual text reuse detection; English-Urdu language pair; cross-lingual sentence embedding; Translation Plus Mono-lingual Analysis
DOI
10.1145/3592761
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, the problem of Cross-Lingual Text Reuse Detection (CLTRD) has gained the interest of the research community due to the availability of large digital repositories and automatic Machine Translation (MT) systems. These systems are readily available and openly accessible, which makes text easier to reuse across languages but harder to detect. Previous studies have developed corpora and methods for CLTRD at the sentence/passage level for the English-Urdu language pair. However, large standard corpora and methods for CLTRD at the document level are lacking for this language pair. To overcome this limitation, the main contribution of this study is a large benchmark cross-lingual (English-Urdu) text reuse corpus, called the TREU (Text Reuse for English-Urdu) corpus. It contains real cases of English-to-Urdu text reuse at the document level. The corpus is manually labelled into three categories (Wholly Derived = 672, Partially Derived = 888, and Non Derived = 697), with the source text in English and the derived text in Urdu. A second contribution is the evaluation of the TREU corpus using a diverse range of methods, to show its usefulness and how it can be used to develop automatic methods for measuring cross-lingual (English-Urdu) text reuse at the document level. The best evaluation results, for both the binary (F1 = 0.78) and ternary (F1 = 0.66) classification tasks, are obtained using a combination of all Translation plus Mono-lingual Analysis (T+MA) based methods. The TREU corpus is publicly available to promote CLTRD research in an under-resourced language, i.e., Urdu.
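The general idea behind T+MA, as named in the abstract, is to translate one side of the pair into the other language and then apply a mono-lingual similarity measure. The sketch below is only an illustration under stated assumptions, not the method evaluated in the paper: it assumes the Urdu document has already been machine-translated into English by some external system, it uses word n-gram containment as a stand-in mono-lingual measure, and the function names and the thresholds 0.6 and 0.3 are hypothetical.

```python
import re


def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(source_en, derived_translated_en, n=1):
    """Fraction of the derived document's n-grams that also occur in the source.

    derived_translated_en is assumed to be the Urdu document after machine
    translation into English (the translation step is not shown here).
    """
    derived = word_ngrams(derived_translated_en, n)
    if not derived:
        return 0.0
    source = word_ngrams(source_en, n)
    return len(derived & source) / len(derived)


def classify(score, wd_threshold=0.6, pd_threshold=0.3):
    """Map a similarity score to one of the three corpus labels.

    The thresholds are illustrative placeholders, not values from the paper.
    """
    if score >= wd_threshold:
        return "Wholly Derived"
    if score >= pd_threshold:
        return "Partially Derived"
    return "Non Derived"


if __name__ == "__main__":
    source = "the minister announced a new education policy on monday"
    translated = "a new education policy was announced by the minister"
    score = containment(source, translated, n=1)
    print(round(score, 2), classify(score))
```

In practice the thresholds would be replaced by a classifier trained on the labelled corpus, and several similarity measures could be combined as features, which is closer in spirit to the combination of T+MA methods the abstract reports.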
Pages: 22