Comparison of pretrained transformer-based models for influenza and COVID-19 detection using social media text data in Saskatchewan, Canada

Cited: 0
Authors
Tian, Yuan [1 ]
Zhang, Wenjing [1 ]
Duan, Lujie [1 ]
McDonald, Wade [1 ]
Osgood, Nathaniel [1 ]
Affiliations
[1] Univ Saskatchewan, Dept Comp Sci, Saskatoon, SK, Canada
Source
Frontiers in Digital Health, 2023
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC);
Keywords
influenza; COVID-19; social media; transformer-based language models; digital surveillance;
DOI
10.3389/fdgth.2023.1203874
Chinese Library Classification
R19 [Health care organization and services (health services administration)]
Discipline classification code
Abstract
Background: The use of social media data provides an opportunity to complement traditional influenza and COVID-19 surveillance methods for the detection and control of outbreaks and for informing public health interventions.

Objective: The first aim of this study is to investigate the degree to which Twitter users disclose health experiences related to influenza and COVID-19 that could be indicative of recent plausible influenza cases or symptomatic COVID-19 infections. Second, we seek to use the Twitter datasets to train and evaluate the classification performance of Bidirectional Encoder Representations from Transformers (BERT) and variant language models in the context of influenza and COVID-19 infection detection.

Methods: We constructed two Twitter datasets using a keyword-based filtering approach on English-language tweets collected from December 2016 to December 2022 in Saskatchewan, Canada. The influenza-related dataset comprised tweets filtered with influenza-related keywords from December 13, 2016, to March 17, 2018, while the COVID-19 dataset comprised tweets filtered with COVID-19 symptom-related keywords from January 1, 2020, to June 22, 2021. The Twitter datasets were cleaned, and each tweet was annotated by at least two annotators as to whether it suggested a recent plausible influenza case or symptomatic COVID-19 case. We then assessed the classification performance of pre-trained transformer-based language models, including BERT-base, BERT-large, RoBERTa-base, RoBERTa-large, BERTweet-base, BERTweet-covid-base, BERTweet-large, and COVID-Twitter-BERT (CT-BERT), on each dataset. To address the notable class imbalance, we experimented with both oversampling and undersampling methods.

Results: In the influenza dataset, 1129 of 6444 tweets (17.5%) were annotated as suggesting recent plausible influenza cases. In the COVID-19 dataset, 924 of 11939 tweets (7.7%) were annotated as suggesting recent plausible COVID-19 cases. On the COVID-19 dataset, CT-BERT outperformed the other language models, achieving the highest recall (94.8%), F1 score (94.4%), and accuracy (94.6%). On the influenza dataset, the BERTweet models exhibited better performance. Our results also showed that applying data balancing techniques such as oversampling or undersampling did not improve model performance.

Conclusions: Using domain-specific language models to monitor users' health experiences related to influenza and COVID-19 on social media improves classification performance and has the potential to supplement real-time disease surveillance.
Pages: 11
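
To make the fine-tuning setup described in the abstract concrete, the sketch below shows how a pre-trained CT-BERT checkpoint could be fine-tuned as a binary classifier over annotated tweets using the Hugging Face Transformers Trainer API. This is a minimal illustration under assumed settings, not the authors' released code: the checkpoint ID (digitalepidemiologylab/covid-twitter-bert-v2), the two example tweets, and the hyperparameters (3 epochs, learning rate 2e-5, 128-token truncation) are placeholders rather than values reported in the paper.

```python
# Minimal sketch (assumed setup, not the authors' released code): fine-tune a
# pre-trained transformer as a binary classifier that flags tweets suggesting
# a recent plausible symptomatic COVID-19 case.
import numpy as np
from datasets import Dataset
from sklearn.metrics import accuracy_score, f1_score, recall_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed Hugging Face hub ID for COVID-Twitter-BERT (CT-BERT).
MODEL_ID = "digitalepidemiologylab/covid-twitter-bert-v2"

# Hypothetical annotated tweets; label 1 = suggests a recent plausible case.
train_ds = Dataset.from_dict({
    "text": [
        "Day three of fever and this awful dry cough, pretty sure it's covid",
        "Relieved the covid test came back negative, just seasonal allergies",
    ],
    "label": [1, 0],
})

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

def tokenize(batch):
    # Tweets are short; 128 tokens is an assumed truncation length.
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_ds = train_ds.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    # Report the metrics highlighted in the abstract: recall, F1, and accuracy.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "recall": recall_score(labels, preds),
        "f1": f1_score(labels, preds),
    }

args = TrainingArguments(
    output_dir="ct-bert-covid-classifier",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
    compute_metrics=compute_metrics,
)
trainer.train()

# In practice, metrics would be computed on a held-out annotated test split.
print(trainer.predict(train_ds).metrics)
```

Swapping MODEL_ID for a BERTweet or RoBERTa checkpoint reproduces the kind of model comparison the abstract describes, and any oversampling or undersampling would be applied to the training split before fine-tuning.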