Urdu Text Reuse Detection at Phrasal level using Sentence Transformer-based approach

Cited by: 1
Authors
Mehak, Gull [1]
Muneer, Iqra [2]
Nawab, Rao Muhammad Adeel [1]
Affiliations
[1] Comsats Univ Islamabad, Lahore Campus, Lahore, Pakistan
[2] Univ Engn & Technol Lahore, Narowal Campus, Narowal 54000, Pakistan
Keywords
Text Reuse Detection; Urdu; Sentence Transformer; Corpus
DOI
10.1016/j.eswa.2023.121063
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text reuse is the process of creating new text(s) from pre-existing text(s). Text Reuse Detection (TRD) has many potential applications, including plagiarism detection, paraphrase detection, paraphrase generation, and analysis of text reuse in web content. In recent years, Urdu Text Reuse Detection (UTRD) has gained the attention of researchers because text is readily available in digital format all over the internet and can be copied or paraphrased from another source without proper attribution, which makes it easy to reuse but hard to detect. In previous studies, the problem of UTRD has been explored at the sentence level (Hafeez, 2022; Hafeez et al., 2023), sentence/passage level (Sameen et al., 2017), and document level (Sharjeel et al., 2017), along with benchmark corpora and approaches. However, UTRD has not been explored at the phrasal level with respect to corpora and approaches. To fill this research gap, this study develops a large, manually annotated benchmark corpus of 25,001 text pairs at two levels of rewrite: (1) Derived = 15,105 and (2) Non-Derived = 9,896. In addition, we developed, applied, evaluated, and compared baseline approaches (N-gram Overlap and Word Embedding-based approaches) with the proposed Sentence Transformer-based approaches on the proposed UTRD-Phr-23 Corpus.
As another contribution, we propose a novel Sentence Transformer-based model that combines eight different Sentence Transformers (ST): paraphrase-multilingual-mpnet-base-v2, distiluse-base-multilingual-cased-v, paraphrase-multilingual-MiniLM-L12-v2, LaBSE, xlm-r-distilroberta-base-paraphrase-v1, xlm-r-100langs-bert-base-nli-mean-tokens, xlm-r-bert-base-nli-stsb-mean-tokens, and xlm-r-100langs-bert-base-nli-stsb-mean-tokens. The proposed model achieves an F1 score of 0.63, outperforming the best result obtained using the N-gram Overlap (baseline) approach (F1 = 0.53).
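As an illustration of the kind of N-gram Overlap baseline the abstract mentions, the following is a minimal sketch only. The abstract does not specify the tokenization, the overlap measure, the value of n, or any decision threshold, so whitespace tokenization, the containment-style measure, and the names `word_ngrams`, `ngram_overlap`, `classify`, and `threshold` are all assumptions made for this example and should not be read as the authors' implementation.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams in a whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(source, candidate, n=1):
    """Shared n-grams divided by the size of the smaller n-gram set.

    This containment-style measure is an assumption; the abstract does
    not say which overlap coefficient was used.
    """
    src = word_ngrams(source, n)
    cand = word_ngrams(candidate, n)
    if not src or not cand:
        return 0.0  # one of the texts is shorter than n tokens
    return len(src & cand) / min(len(src), len(cand))


def classify(source, candidate, n=1, threshold=0.5):
    """Label a text pair using a hypothetical fixed threshold."""
    score = ngram_overlap(source, candidate, n)
    return "Derived" if score >= threshold else "Non-Derived"
```

In practice, a threshold (or a classifier over several n-gram scores) would presumably be tuned on annotated pairs such as those in the corpus; the fixed value here is purely illustrative.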
Pages: 9