Abstractive Text Summarization for the Urdu Language: Data and Methods

Cited by: 0
Authors
Awais, Muhammad [1 ]
Muhammad Adeel Nawab, Rao [1 ]
Affiliations
[1] COMSATS Univ Islamabad, Dept Comp Sci, Lahore Campus, Lahore 54000, Pakistan
Keywords
Task analysis; Long short term memory; Deep learning; Benchmark testing; Social networking (online); Convolutional neural networks; Natural language processing; Abstracts; Text detection; Artificial intelligence; Publishing; Unsupervised learning; Machine learning; Text analysis; Text summarization; Abstractive text summarization; BART; corpus; deep learning models; GPT-3.5; large language models; Urdu
DOI
10.1109/ACCESS.2024.3378300
Chinese Library Classification
TP [Automation technology; computer technology]
Discipline code
0812
Abstract
The task of abstractive text summarization aims to automatically generate a concise summary of a given source article. In recent years, automatic abstractive text summarization has attracted the attention of researchers because large volumes of digital text are readily available in multiple languages on a wide range of topics. Automatically generating precise summaries from large texts has potential applications in news headline generation, research article summarization, extracting the moral of a story, media marketing, search engine optimization, financial research, social media marketing, question-answering systems, and chatbots. In the literature, the problem of abstractive text summarization has mainly been investigated for English and a few other languages. However, it has not been thoroughly explored for the Urdu language, despite the huge amount of Urdu data available in digital format. To fill this gap, this paper presents a large benchmark corpus of 2,067,784 Urdu news articles for the Urdu abstractive text summarization task. As a secondary contribution, we applied a range of deep learning models (LSTM, Bi-LSTM, LSTM with attention, GRU, Bi-GRU, and GRU with attention) and large language models (BART and GPT-3.5) to our proposed corpus. Our extensive evaluation on 20,000 test instances showed that the GRU with attention model outperforms the other models, with ROUGE-1 = 46.7, ROUGE-2 = 24.1, and ROUGE-L = 48.7. To foster research in Urdu, our proposed corpus is publicly and freely available for research purposes under a Creative Commons license.
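The scores above are ROUGE metrics, which measure n-gram overlap between a generated summary and a reference summary. As a minimal illustration (not the paper's evaluation code, which presumably uses a standard ROUGE package with proper Urdu tokenization), ROUGE-1 F1 over whitespace-tokenized unigrams can be sketched as:

```python
from collections import Counter

def rouge_1_f(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum((cand & ref).values())  # each unigram counted at most min(cand, ref) times
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy example: 3 of 3 candidate unigrams match, 3 of 4 reference unigrams are covered.
print(round(rouge_1_f("the cat sat", "the cat sat down"), 3))  # → 0.857
```

ROUGE-2 applies the same precision/recall computation to bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.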
Pages: 61198-61210
Page count: 13