CORPURES: Benchmark corpus for Urdu extractive summaries and experiments using supervised learning

Citations: 0
Authors
Humayoun, Muhammad [1]
Akhtar, Naheed [2]
Affiliations
[1] Higher Coll Technol, Comp Informat Sci Div, Abu Dhabi, U Arab Emirates
[2] Univ Educ, Dept Comp Sci, Lahore, Pakistan
Source
Keywords
Natural language processing; Automatic text summarization; Single document summarization; Extraction based summarization; Extracts; Urdu summary corpus; Supervised learning; Urdu language; Resource poor language; CROSS-VALIDATION; TEXT; MODEL;
DOI
10.1016/j.iswa.2022.200129
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text summarization is the process of shortening a text so that it still conveys its key points. Several text summarization methods and benchmark corpora are available for languages like English. A significant hurdle in developing and evaluating existing or new text summarization methods is the unavailability of standardized benchmark corpora, especially for South Asian languages. Among other things, a reference corpus enables researchers to compare existing state-of-the-art methods. Our study addresses this gap by developing a benchmark corpus for Urdu, a widely spoken yet under-resourced language. The reported corpus contains 161 documents from the newswire domain with manually written extractive summaries. We also perform several experiments on the corpus to show how it can be used to develop, evaluate, and compare text summarization systems for Urdu using a supervised learning approach. Our results show that state-of-the-art classifiers are good candidates for Urdu text summarization when supervised learning techniques are employed. Also, a radical word segmentation technique such as fixed-length segmentation outperforms all other settings (Sentence Match F1 = 57%, ROUGE-2 F1 = 64.4%). On the basic preprocessing of Urdu texts, we observe that tokenizing words on spaces is a reliable approach until proper word segmentation tools for Urdu are mature enough. On the word similarity features needed for supervised learning, we observe that radical stemming such as Ultra stemming with lengths 1 and 2 works better than the existing stemming and lemmatization tools for Urdu. Finally, the artificially generated datasets do not significantly improve results compared to the original data.
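The preprocessing and scoring ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that Ultra stemming means truncating each token to its first n characters, that Sentence Match F1 is an F1 score over the sets of selected and reference sentence indices, and that ROUGE-2 can be approximated by bigram overlap on space-separated tokens. All function names are hypothetical.

```python
from collections import Counter


def tokenize_on_space(text):
    """Split a text into tokens on whitespace only."""
    return text.split()


def ultra_stem(token, length=1):
    """Assumed form of ultra stemming: keep the first `length` characters."""
    return token[:length]


def f1(overlap, n_candidate, n_reference):
    """Harmonic mean of precision and recall, 0.0 when undefined."""
    if overlap == 0 or n_candidate == 0 or n_reference == 0:
        return 0.0
    precision = overlap / n_candidate
    recall = overlap / n_reference
    return 2 * precision * recall / (precision + recall)


def bigram_counts(text, stem_length=None):
    """Count adjacent token pairs, optionally after ultra stemming."""
    tokens = tokenize_on_space(text)
    if stem_length is not None:
        tokens = [ultra_stem(t, stem_length) for t in tokens]
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(candidate, reference, stem_length=None):
    """Simplified ROUGE-2 F1: clipped bigram overlap between two texts."""
    cand = bigram_counts(candidate, stem_length)
    ref = bigram_counts(reference, stem_length)
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def sentence_match_f1(selected, gold):
    """Assumed Sentence Match F1: F1 over sets of sentence indices."""
    selected, gold = set(selected), set(gold)
    return f1(len(selected & gold), len(selected), len(gold))
```

For example, `sentence_match_f1([0, 2, 5], [0, 2, 3, 4])` compares a system that picked sentences 0, 2, and 5 against a reference extract made of sentences 0, 2, 3, and 4. Published ROUGE toolkits apply further normalization, so scores from this sketch are not comparable with the figures reported in the abstract.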
Pages: 19
Related papers
18 in total
  • [1] Does Supervised Learning of Sentence Candidates Produce the Best Extractive Summaries?
    Gutierrez Hinojosa, Sandra J.
    Calvo, Hiram
    Moreno-Armendariz, Marco A.
    Duchanoy, Carlos
    [J]. ADVANCES IN COMPUTATIONAL INTELLIGENCE, MICAI 2020, PT II, 2020, 12469 : 293 - 296
  • [2] Contextual Urdu Text Emotion Detection Corpus and Experiments using Deep Learning Approaches
    Vardag, Muhammad Hamayon Khan
    Saeed, Ali
    Hayat, Umer
    Ullah, Muhammad Farhat
    Hussain, Naveed
    [J]. ADCAIJ-ADVANCES IN DISTRIBUTED COMPUTING AND ARTIFICIAL INTELLIGENCE JOURNAL, 2022, 11 (04): : 489 - 505
  • [3] CLEU - A Cross-language English-Urdu corpus and benchmark for text reuse experiments
    Muneer, Iqra
    Sharjeel, Muhammad
    Iqbal, Muntaha
    Nawab, Rao Muhammad Adeel
    Rayson, Paul
    [J]. JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 2019, 70 (07) : 729 - 741
  • [4] Extractive summarization using supervised and unsupervised learning
    Mao, Xiangke
    Yang, Hui
    Huang, Shaobin
    Liu, Ye
    Li, Rongsheng
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2019, 133 : 173 - 181
  • [5] Urdu Sentiment Analysis Using Supervised Machine Learning Approach
    Mukhtar, Neelam
    Khan, Mohammad Abid
    [J]. INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2018, 32 (02)
  • [6] Extractive Document Summarization Using a Supervised Learning Approach
    Charitha, Sangaraju
    Chittaragi, Nagaratna B.
    Koolagudi, Shashidhar G.
    [J]. PROCEEDINGS OF 2018 IEEE DISTRIBUTED COMPUTING, VLSI, ELECTRICAL CIRCUITS AND ROBOTICS (DISCOVER), 2018, : 7 - 12
  • [7] An Analysis of Sindhi Annotated Corpus using Supervised Machine Learning Methods
    Ali, Mazhar
    Wagan, Asim Imdad
    [J]. MEHRAN UNIVERSITY RESEARCH JOURNAL OF ENGINEERING AND TECHNOLOGY, 2019, 38 (01) : 185 - 196
  • [8] Multilingual emotion classification using supervised learning: Comparative experiments
    Becker, Karin
    Moreira, Viviane P.
    dos Santos, Aline G. L.
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2017, 53 (03) : 684 - 704
  • [9] Optical Character Recognition System for Nastalique Urdu-Like Script Languages Using Supervised Learning
    Rizvi, S. S. R.
    Sagheer, A.
    Adnan, K.
    Muhammad, A.
    [J]. INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2019, 33 (10)
  • [10] A Corpus-based Approach for Keyword Identification using Supervised Learning Techniques
    TeCho, Jakkrit
    Nattee, Cholwich
    Theeramunkong, Thanaruk
    [J]. ECTI-CON 2008: PROCEEDINGS OF THE 2008 5TH INTERNATIONAL CONFERENCE ON ELECTRICAL ENGINEERING/ELECTRONICS, COMPUTER, TELECOMMUNICATIONS AND INFORMATION TECHNOLOGY, VOLS 1 AND 2, 2008, : 33 - 36