Urdu Text Reuse Detection at Phrasal level using Sentence Transformer-based approach

Cited by: 1
Authors
Mehak, Gull [1]
Muneer, Iqra [2]
Nawab, Rao Muhammad Adeel [1]
Affiliations
[1] Comsats Univ Islamabad, Lahore Campus, Lahore, Pakistan
[2] Univ Engn & Technol Lahore, Narowal Campus, Narowal 54000, Pakistan
Keywords
Text Reuse Detection; Urdu; Sentence Transformer; Corpus
DOI
10.1016/j.eswa.2023.121063
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Subject classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text reuse is the process of creating new text(s) from pre-existing text(s). In recent years, Urdu Text Reuse Detection (UTRD) has gained the attention of researchers because text is readily available in digital form all over the internet and can be copied or paraphrased from another source without proper attribution, which makes it easy to reuse but hard to detect. Text Reuse Detection (TRD) has many potential applications in plagiarism detection, paraphrase detection, paraphrase generation, and analysis of text reuse in web content. In previous studies, the problem of UTRD has been explored at the sentence level (Hafeez, 2022; Hafeez et al., 2023), sentence/passage level (Sameen et al., 2017), and document level (Sharjeel et al., 2017), along with benchmark corpora and approaches. However, UTRD has not been explored at the phrasal level with respect to either corpora or approaches. To fill this research gap, this study makes a major contribution by developing a large, manually annotated benchmark corpus of 25,001 text pairs at two levels of rewrite: (1) Derived = 15,105 and (2) Non-Derived = 9,896. In addition, we have developed, applied, evaluated, and compared baseline approaches (N-gram Overlap and Word Embedding-based approaches) with the proposed Sentence Transformer-based approaches on the proposed UTRD-Phr-23 Corpus.
As another contribution, we propose a novel Sentence Transformers-based model (using a combination of eight different Sentence Transformers (ST) including paraphrase-multilingual-mpnet-base-v2, distiluse-base-multilingual-cased-v, paraphrase-multilingual-MiniLM-L12-v2, LaBSE, xlm-r-distilroberta-base-paraphrase-v1, xlm-r-100langs-bert-base-nli-mean-tokens, xlm-r-bert-base-nli-stsb-mean-tokens, and xlm-r-100langs-bert-base-nli-stsb-mean-tokens). The proposed model achieves an F1 score of 0.63, outperforming the best result obtained with the N-gram Overlap (baseline) approach (F1 = 0.53).
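The abstract names N-gram Overlap as a baseline but does not say which overlap measure was used. As a rough illustration only, the sketch below assumes a word n-gram containment score with whitespace tokenization and a fixed decision threshold. The function names, the choice of containment, and the threshold of 0.5 are all hypothetical and are not taken from the paper.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams in text (whitespace tokenization, an assumption)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(suspicious, source, n=1):
    """Fraction of the suspicious text's n-grams that also occur in the source text."""
    suspicious_ngrams = word_ngrams(suspicious, n)
    if not suspicious_ngrams:
        return 0.0  # text shorter than n tokens: no n-grams to compare
    source_ngrams = word_ngrams(source, n)
    return len(suspicious_ngrams & source_ngrams) / len(suspicious_ngrams)


def classify_pair(suspicious, source, n=1, threshold=0.5):
    """Label a text pair using a hypothetical fixed threshold on the containment score."""
    score = ngram_containment(suspicious, source, n)
    return "Derived" if score >= threshold else "Non-Derived"
```

In practice the threshold (or a classifier over several n-gram sizes) would be tuned on annotated pairs, and Urdu text would likely need a language-aware tokenizer rather than a plain whitespace split.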
Pages: 9
Related papers
50 in total
  • [21] EchoBERT: A Transformer-Based Approach for Behavior Detection in Echograms
    Maloy, Hakon
    IEEE ACCESS, 2020, 8 : 218372 - 218385
  • [22] Automatic text summarization using transformer-based language models
    Rao, Ritika
    Sharma, Sourabh
    Malik, Nitin
    INTERNATIONAL JOURNAL OF SYSTEM ASSURANCE ENGINEERING AND MANAGEMENT, 2024, 15 (06) : 2599 - 2605
  • [23] Development of a Text Classification Framework using Transformer-based Embeddings
    Yeasmin, Sumona
    Afrin, Nazia
    Saif, Kashfia
    Huq, Mohammad Rezwanul
    PROCEEDINGS OF THE 11TH INTERNATIONAL CONFERENCE ON DATA SCIENCE, TECHNOLOGY AND APPLICATIONS (DATA), 2022, : 74 - 82
  • [24] Transformer-Based Bidirectional Encoder Representations for Emotion Detection from Text
    Kumar, Ashok J.
    Cambria, Erik
    Trueman, Tina Esther
    2021 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (IEEE SSCI 2021), 2021,
  • [25] Psychological disorder detection: A multimodal approach using a transformer-based hybrid model
    Ghosh, Debadrita
    Karande, Hema
    Gite, Shilpa
    Pradhan, Biswajeet
    METHODSX, 2024, 13
  • [26] Cross-lingual Text Reuse Detection at Document Level for English-Urdu Language Pair
    Sharjeel, Muhammad
    Muneer, Iqra
    Nosheen, Sumaira
    Nawab, Rao Muhammad Adeel
    Rayson, Paul
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (06)
  • [27] Transformer-based image captioning by leveraging sentence information
    Chahkandi, Vahid
    Fadaeieslam, Mohammad Javad
    Yaghmaee, Farzin
    JOURNAL OF ELECTRONIC IMAGING, 2022, 31 (04)
  • [28] TIRec: Transformer-based Invoice Text Recognition
    Chen, Yanlan
    2023 2ND ASIA CONFERENCE ON ALGORITHMS, COMPUTING AND MACHINE LEARNING, CACML 2023, 2023, : 175 - 180
  • [29] Transformer-Based Flood Detection Using Multiclass Segmentation
    Park, Joo-Chan
    Kim, Dong-Geon
    Yang, Ji-Ro
    Kang, Kyo-Seok
    2023 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING, BIGCOMP, 2023, : 291 - 292
  • [30] Practical Transformer-based Multilingual Text Classification
    Wang, Cindy
    Banko, Michele
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, NAACL-HLT 2021, 2021, : 121 - 129