A Mixed Malay-English Language COVID-19 Twitter Dataset: A Sentiment Analysis

Cited by: 2
Authors
Kong, Jeffery T. H. [1]
Juwono, Filbert H. H. [2]
Ngu, Ik Ying [3]
Nugraha, I. Gde Dharma [4]
Maraden, Yan [4]
Wong, W. K. [2]
Affiliations
[1] Curtin Univ Malaysia, Dept Elect & Comp Engn, Miri 98009, Malaysia
[2] Univ Southampton Malaysia, Comp Sci Program, Iskandar Puteri 79100, Malaysia
[3] Curtin Univ Malaysia, Dept Media & Commun, Miri 98009, Malaysia
[4] Univ Indonesia, Dept Elect Engn, Depok 16424, Indonesia
Keywords
BPE; CNN; COVID-19; fake news; M-BERT; Malaysia; sentiment analysis
DOI
10.3390/bdcc7020061
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Social media has become a major channel for disseminating information, including fake news. A great deal of false information circulates about the ongoing Coronavirus Disease 2019 (COVID-19) pandemic, for example about vaccination. In this paper, we focus on sentiment analysis of Malaysian COVID-19-related news on social media, specifically Twitter. Tweets in Malaysia often mix Malay, English, and Chinese and, within the maximum length of a tweet, contain many short forms, symbols, emojis, and emoticons. The contributions of this paper are twofold. First, we built a multilingual COVID-19 Twitter dataset comprising tweets written from 1 September 2021 to 12 December 2021. In particular, we collected 108,246 tweets, with over 67% in Malay, 27% in English, 2% in Chinese, and 4% in other languages. Second, we manually annotated 11,568 of these tweets with three-class sentiment labels (positive, negative, and neutral) to develop a Malay-language sentiment analysis tool. For this purpose, we applied a data compression method, Byte-Pair Encoding (BPE), to the texts and used two deep learning approaches: Multilingual Bidirectional Encoder Representations from Transformers (M-BERT) and a convolutional neural network (CNN). BPE tokenization encodes rare and unknown words as smaller, meaningful subwords. For the CNN, we converted the labeled tweets into image files. Our experiments explored different BPE vocabulary sizes with our BPE-Text-to-Image-CNN and BPE-M-BERT models. The results show that the optimal BPE vocabulary size is 12,000; larger values do not contribute much to the F1-score. Overall, BPE-M-BERT slightly outperforms the CNN model, which suggests that the pre-trained M-BERT network has an advantage on our multilingual dataset.
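To illustrate how BPE can break rare or unknown words into smaller subwords, here is a minimal sketch of the standard merge-learning procedure in plain Python. It is not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `encode`), the end-of-word marker `</w>`, and the toy word list are all assumptions made for illustration. In this formulation the vocabulary size is roughly the number of base symbols plus the number of learned merges.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Start from characters, plus an end-of-word marker so suffixes stay distinct.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every word in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (an assumption, not data from the paper).
toy_words = ["vaksin"] * 5 + ["vaksinasi"] * 2
toy_merges = learn_bpe(toy_words, num_merges=50)
print(encode("vaksin", toy_merges))     # a word seen in training
print(encode("vaksinnya", toy_merges))  # an unseen word falls back to smaller pieces
```

A word that was frequent in the training text ends up as a single token, while an unseen word is split into known subwords and single characters, so no input is ever mapped to an "unknown" token.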
Pages: 15