Urdu Text Reuse Detection at Phrasal level using Sentence Transformer-based approach

Cited by: 1
Authors
Mehak, Gull [1]
Muneer, Iqra [2]
Nawab, Rao Muhammad Adeel [1]
Affiliations
[1] Comsats Univ Islamabad, Lahore Campus, Lahore, Pakistan
[2] Univ Engn & Technol Lahore, Narowal Campus, Narowal 54000, Pakistan
Keywords
Text Reuse Detection; Urdu; Sentence Transformer; Corpus
DOI
10.1016/j.eswa.2023.121063
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text reuse is the process of creating new text(s) from pre-existing text(s). In recent years, Urdu Text Reuse Detection (UTRD) has gained the attention of researchers because text is readily available in digital format all over the internet and can be copied or paraphrased from another source without proper attribution, which makes it easy to reuse but hard to detect. Text Reuse Detection (TRD) has many potential applications, including plagiarism detection, paraphrase detection, paraphrase generation, and the analysis of text reuse in web content. In previous studies, UTRD has been explored at the sentence level (Hafeez, 2022; Hafeez et al., 2023), sentence/passage level (Sameen et al., 2017), and document level (Sharjeel et al., 2017), along with benchmark corpora and approaches. However, UTRD has not been explored at the phrasal level with respect to corpora and approaches. To fill this research gap, this study contributes a large, manually annotated benchmark corpus of 25,001 text pairs at two levels of rewrite: (1) Derived = 15,105 and (2) Non-Derived = 9,896. In addition, we have developed, applied, evaluated, and compared baseline approaches (N-gram Overlap and Word Embedding-based approaches) with the proposed Sentence Transformer-based approaches on the proposed UTRD-Phr-23 Corpus.
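To make the N-gram Overlap baseline concrete, the following is a minimal sketch of one common overlap measure (containment) with a threshold classifier and an F1 computation. The abstract does not state which overlap measure, n-gram length, tokenizer, threshold, or F1 averaging the authors used, so all of these choices, the function names, and the toy pairs are assumptions for illustration only.

```python
from typing import List, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(source: str, candidate: str, n: int = 1) -> float:
    """Fraction of the candidate's n-grams that also occur in the source."""
    cand = word_ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & word_ngrams(source, n)) / len(cand)


def f1_score(gold: List[int], predicted: List[int]) -> float:
    """F1 for the positive class (assumed here: Derived = 1)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy placeholder pairs, NOT taken from the UTRD-Phr-23 Corpus:
# (source phrase, candidate phrase, label) with 1 = Derived, 0 = Non-Derived.
TOY_PAIRS = [
    ("a b c d", "a b c", 1),
    ("a b c d", "x y z", 0),
    ("e f g", "e f h", 1),
    ("e f g", "g q r", 0),
]
THRESHOLD = 0.5  # assumed cut-off; in practice it would be tuned on held-out data

predictions = [int(containment(s, c, n=1) >= THRESHOLD) for s, c, _ in TOY_PAIRS]
gold_labels = [label for _, _, label in TOY_PAIRS]
```

Containment is asymmetric by design: it asks how much of the candidate phrase is covered by the source, which suits short phrases where a symmetric measure would penalize length differences.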
As another contribution, we propose a novel Sentence Transformer-based model that combines eight different Sentence Transformers (ST): paraphrase-multilingual-mpnet-base-v2, distiluse-base-multilingual-cased-v, paraphrase-multilingual-MiniLM-L12-v2, LaBSE, xlm-r-distilroberta-base-paraphrase-v1, xlm-r-100langs-bert-base-nli-mean-tokens, xlm-r-bert-base-nli-stsb-mean-tokens, and xlm-r-100langs-bert-base-nli-stsb-mean-tokens. The proposed model achieves an F1 score of 0.63, outperforming the best baseline result, which was obtained with the N-gram Overlap approach (F1 = 0.53).
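The abstract does not say how the eight Sentence Transformers are combined. As one possible reading, the sketch below averages the cosine similarity of a phrase pair across several encoders and applies a threshold. The averaging rule, the threshold, the function names, and the stub encoders are all assumptions; with the sentence-transformers package installed, an encoder could instead be something like `SentenceTransformer("paraphrase-multilingual-mpnet-base-v2").encode`.

```python
import math
from typing import Callable, Sequence

# An encoder maps a text to a fixed-length vector (hypothetical type alias).
Encoder = Callable[[str], Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_similarity(source: str, candidate: str,
                    encoders: Sequence[Encoder]) -> float:
    """Average the per-encoder cosine similarity of a text pair."""
    if not encoders:
        raise ValueError("at least one encoder is required")
    scores = [cosine(enc(source), enc(candidate)) for enc in encoders]
    return sum(scores) / len(scores)


def is_derived(source: str, candidate: str, encoders: Sequence[Encoder],
               threshold: float = 0.5) -> bool:
    """Label a pair as Derived when the mean similarity reaches the threshold."""
    return mean_similarity(source, candidate, encoders) >= threshold


# Stub encoders standing in for real sentence-embedding models (illustration only).
def stub_encoder_a(text: str) -> Sequence[float]:
    return [float(text.count("a")), float(text.count("b")), 1.0]


def stub_encoder_b(text: str) -> Sequence[float]:
    return [float(len(text)), 1.0]
```

Keeping the encoders as plain callables means the combination logic can be checked without downloading any model; other combination strategies, such as concatenating embeddings or feeding per-model scores to a classifier, would slot into `mean_similarity` in the same way.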
Pages: 9
Related papers
50 in total
  • [1] Cross-Lingual Text Reuse Detection at sentence level for English-Urdu language pair
    Muneer, Iqra
    Nawab, Rao Muhammad Adeel
    [J]. COMPUTER SPEECH AND LANGUAGE, 2022, 75
  • [2] Transformer-based Text Detection in the Wild
    Raisi, Zobeir
    Naiel, Mohamed A.
    Younes, Georges
    Wardell, Steven
    Zelek, John S.
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021, : 3156 - 3165
  • [3] Mono-lingual text reuse detection for the Urdu language at lexical level
    Noreen, Ayesha
    Muneer, Iqra
    Nawab, Rao Muhammad Adeel
    [J]. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 136
  • [4] Transformer-Based Approach to Melanoma Detection
    Cirrincione, Giansalvo
    Cannata, Sergio
    Cicceri, Giovanni
    Prinzi, Francesco
    Currieri, Tiziana
    Lovino, Marta
    Militello, Carmelo
    Pasero, Eros
    Vitabile, Salvatore
    [J]. SENSORS, 2023, 23 (12)
  • [5] Roman Urdu Hate Speech Detection Using Transformer-Based Model for Cyber Security Applications
    Bilal, Muhammad
    Khan, Atif
    Jan, Salman
    Musa, Shahrulniza
    Ali, Shaukat
    [J]. SENSORS, 2023, 23 (08)
  • [6] A transformer-based Urdu image caption generation
    Muhammad Hadi
    Iqra Safder
    Hajra Waheed
    Farooq Zaman
    Naif Radi Aljohani
    Raheel Nawaz
    Saeed Ul Hassan
    Raheem Sarwar
    [J]. Journal of Ambient Intelligence and Humanized Computing, 2024, 15 (9) : 3441 - 3457
  • [7] A transformer-based approach to Nigerian Pidgin text generation
    Garba, Kabir
    Kolajo, Taiwo
    Agbogun, Joshua B.
    [J]. International Journal of Speech Technology, 2024, 27 (04) : 1027 - 1037
  • [8] Urdu Short Paraphrase Detection at Sentence Level
    Hafeez, Hamza
    Muneer, Iqra
    Sharjeel, Muhammad
    Ashraf, Muhammad Adnan
    Nawab, Rao Muhammad Adeel
    [J]. ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (04)
  • [9] A transformer-based approach to irony and sarcasm detection
    Rolandos Alexandros Potamias
    Georgios Siolas
    Andreas - Georgios Stafylopatis
    [J]. Neural Computing and Applications, 2020, 32 : 17309 - 17320