The PolitiFact-Oslo Corpus: A New Dataset for Fake News Analysis and Detection

Cited by: 4
Authors
Poldvere, Nele [1]
Uddin, Zia [2]
Thomas, Aleena [2]
Affiliations
[1] Univ Oslo, Dept Literature Area Studies & European Languages, N-0315 Oslo, Norway
[2] Sintef Digital, N-0373 Oslo, Norway
Keywords
corpus development; text type; sentiment; part-of-speech; Bi-LSTM; transformers
DOI
10.3390/info14120627
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
This study presents a new dataset for fake news analysis and detection, namely, the PolitiFact-Oslo Corpus. The corpus contains samples of both fake and real news in English, collected from the fact-checking website PolitiFact.com. It grew out of a need for a more controlled and effective dataset, based on recent events, for fake news analysis and the development of detection models. Three features make it uniquely placed for this: (i) the texts have been individually labelled for veracity by experts, (ii) they are complete texts that strictly correspond to the claims in question, and (iii) they are accompanied by important metadata such as text type (e.g., social media, news and blog). In relation to this, we present a pipeline for collecting quality data from major fact-checking websites, a procedure that can be replicated in future corpus-building efforts. An exploratory analysis based on sentiment and part-of-speech information reveals differences between fake and real news as well as between text types, thus highlighting the importance of adding contextual information to fake news corpora. Since the main application of the PolitiFact-Oslo Corpus is in automatic fake news detection, we critically examine the applicability of the corpus, and of another PolitiFact dataset built on less strict criteria, for various efficient deep learning-based approaches, such as Bidirectional Long Short-Term Memory (Bi-LSTM), LSTM fine-tuned transformers such as Bidirectional Encoder Representations from Transformers (BERT) and RoBERTa, and XLNet.
Pages: 32