Sentiment analysis on big sparse data streams with limited labels

Authors
Iosifidis, Vasileios [1]
Ntoutsi, Eirini [1]
Affiliations
[1] Leibniz Univ Hannover, L3S Res Ctr, Hannover, Germany
Keywords
Sentiment analysis; Semi-supervised learning; Class imbalance; Data augmentation; Classification
DOI
10.1007/s10115-019-01392-9
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Sentiment analysis is an important task for gaining insights from the huge amounts of opinionated text generated daily on social media such as Twitter. Despite the volume of such data, standard supervised learning methods do not work on it, owing to the lack of labels and the impracticality of (human) labeling at this scale. In this work, we leverage distant supervision and semi-supervised learning to annotate a big stream of tweets from 2015, which consists of 228 million tweets without retweets (and 275 million with retweets). We present insights from our annotation process regarding the effect of different semi-supervised learning approaches, namely Self-Learning, Co-Training and Expectation-Maximization. Moreover, we propose two annotation modes: a batch mode, in which all labeled and unlabeled data are available to the algorithms from the beginning, and a lightweight streaming mode, which processes the data in batches based on their arrival time in the stream. Our experiments show that stream processing with a sliding window of three months achieves results comparable to batch processing while being more efficient. Finally, our dataset is imbalanced toward the positive sentiment class, and the semi-supervised learning methods aggravate this imbalance; to tackle the problem, we employ data augmentation in the semi-supervised learning process in order to equalize the class distribution. Our results show that semi-supervised learning coupled with data augmentation significantly outperforms the default semi-supervised annotation process. We make the resulting sentiment-annotated dataset, TSentiment15, available to the community for evaluation purposes and for developing new methods.
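To make the Self-Learning idea named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy word-count Naive Bayes classifier, the labels "pos" and "neg", and a confidence threshold of 0.7; the function names `train`, `predict` and `self_learning` are hypothetical and chosen only for illustration. In each round, the model is retrained on the labeled set, and unlabeled texts whose predicted label is confident enough are moved into the labeled set.

```python
import math
from collections import Counter

LABELS = ("pos", "neg")  # assumed label names, not taken from the paper


def train(labeled):
    """Count words and documents per class (toy multinomial Naive Bayes)."""
    counts = {label: Counter() for label in LABELS}
    docs = Counter()
    for text, label in labeled:
        counts[label].update(text.split())
        docs[label] += 1
    return counts, docs


def predict(model, text):
    """Return (label, confidence) from Laplace-smoothed class probabilities."""
    counts, docs = model
    vocab = set().union(*(counts[label] for label in LABELS))
    scores = {}
    for label in LABELS:
        total = sum(counts[label].values())
        score = math.log((docs[label] + 1) / (sum(docs.values()) + len(LABELS)))
        for word in text.split():
            # The extra +1 in the denominator reserves mass for unseen words.
            score += math.log((counts[label][word] + 1) / (total + len(vocab) + 1))
        scores[label] = score
    # Turn log-scores into normalized probabilities.
    top = max(scores.values())
    probs = {label: math.exp(s - top) for label, s in scores.items()}
    norm = sum(probs.values())
    best = max(probs, key=probs.get)
    return best, probs[best] / norm


def self_learning(labeled, unlabeled, threshold=0.7, max_rounds=5):
    """Iteratively move confidently predicted texts into the labeled set."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        confident, remaining = [], []
        for text in pool:
            label, confidence = predict(model, text)
            if confidence >= threshold:
                confident.append((text, label))
            else:
                remaining.append(text)
        if not confident:
            break  # nothing new to learn from; stop early
        labeled.extend(confident)
        pool = remaining
    return labeled, pool


if __name__ == "__main__":
    seed = [("good great love", "pos"), ("bad awful hate", "neg")]
    stream = ["love great day", "hate awful day"]
    annotated, leftover = self_learning(seed, stream)
    print(annotated)
    print(leftover)
```

Because each round trusts the model's own predictions, any bias in the seed labels can be reinforced, which is consistent with the abstract's remark that semi-supervised learning aggravates class imbalance.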
Pages: 1393-1432
Page count: 40
Related papers
  • [1] Sentiment analysis on big sparse data streams with limited labels
    Iosifidis, Vasileios
    Ntoutsi, Eirini
    [J]. Knowledge and Information Systems, 2020, 62 (04): 1393-1432
  • [2] Distributed Real-Time Sentiment Analysis for Big Data Social Streams
    Rahnama, Amir Hossein Akhavan
    [J]. 2014 International Conference on Control, Decision and Information Technologies (CoDIT), 2014: 789-794
  • [3] Sample size determination for biomedical big data with limited labels
    Richter, Aaron N.
    Khoshgoftaar, Taghi M.
    [J]. Network Modeling Analysis in Health Informatics and Bioinformatics, 2020, 9 (01)
  • [4] Approximating Learning Curves for Imbalanced Big Data with Limited Labels
    Richter, Aaron N.
    Khoshgoftaar, Taghi M.
    [J]. 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI 2019), 2019: 237-242
  • [5] Learning High-Dimensional Evolving Data Streams With Limited Labels
    Din, Salah Ud
    Kumar, Jay
    Shao, Junming
    Mawuli, Cobbinah Bernard
    Ndiaye, Waldiodio David
    [J]. IEEE Transactions on Cybernetics, 2022, 52 (11): 11373-11384
  • [6] Sentiment Analysis Using Big Data
    Ramanujam, R. Suresh
    Nancyamala, R.
    Nivedha, J.
    Kokila, J.
    [J]. 2015 International Conference on Computation of Power, Energy, Information and Communication (ICCPEIC), 2015: 480-484
  • [7] Large Scale Sentiment Learning with Limited Labels
    Iosifidis, Vasileios
    Ntoutsi, Eirini
    [J]. KDD'17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017: 1823-1832
  • [8] Sentiment Analysis in Tourism: Capitalizing on Big Data
    Alaei, Ali Reza
    Becken, Susanne
    Stantic, Bela
    [J]. Journal of Travel Research, 2019, 58 (02): 175-191