Human detection of political speech deepfakes across transcripts, audio, and video

Cited by: 4

Authors:

Groh, Matthew [1]

Sankaranarayanan, Aruna [2,3]

Singh, Nikhil [2]

Kim, Dong Young [2]

Lippman, Andrew [2]

Picard, Rosalind [2]

Affiliations:

[1] Northwestern Univ, Kellogg Sch Management, Evanston, IL 60208 USA

[2] MIT, Media Lab, Cambridge, MA USA

[3] MIT, CSAIL, Cambridge, MA USA

Source:

NATURE COMMUNICATIONS | 2024, Vol. 15, No. 1

Keywords:

SOCIAL MEDIA; NEWS; MISINFORMATION; DISINFORMATION; ATTENTION; KNOWLEDGE; SCIENCE; PHOTOS; IMPACT

DOI:

10.1038/s41467-024-51998-z

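Chinese Library Classification:

O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]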

Discipline classification codes:

07; 0710; 09

Abstract:

Recent advances in technology for hyper-realistic visual and audio effects provoke the concern that deepfake videos of political speeches will soon be indistinguishable from authentic video. We conduct 5 pre-registered randomized experiments with N = 2215 participants to evaluate how accurately humans distinguish real political speeches from fabrications across base rates of misinformation, audio sources, question framings with and without priming, and media modalities. We do not find base rates of misinformation have statistically significant effects on discernment. We find deepfakes with audio produced by the state-of-the-art text-to-speech algorithms are harder to discern than the same deepfakes with voice actor audio. Moreover, across all experiments and question framings, we find audio and visual information enables more accurate discernment than text alone: human discernment relies more on how something is said, the audio-visual cues, than what is said, the speech content. With advances in generative AI, political speech deepfakes are becoming more realistic. Here, the authors show that people's ability to distinguish between real and fake speeches relies on audio and visual information more than the speech content.

Pages: 16

Related records: 50 in total

[21] Speech and crosstalk detection in multichannel audio
Wrigley, SN
Brown, GJ
Wan, V
Renals, S
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, 2005, 13 (01): 84 - 91
[22] Audio to audio-video speech conversion with the help of phonetic knowledge integration
Bothe, HH
SMC '97 CONFERENCE PROCEEDINGS - 1997 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS, VOLS 1-5: CONFERENCE THEME: COMPUTATIONAL CYBERNETICS AND SIMULATION, 1997, : 1632 - 1637
[23] Fusion of audio and video information for detecting speech events
Asano, F
Motomura, Y
Asoh, H
Yoshimura, T
Ichimura, N
Nakamura, S
FUSION 2003: PROCEEDINGS OF THE SIXTH INTERNATIONAL CONFERENCE OF INFORMATION FUSION, VOLS 1 AND 2, 2003, : 386 - 393
[24] Audio-video feature correlation: Faces and speech
Durand, G
Montacié, C
Caraty, MJ
Faudemay, P
MULTIMEDIA STORAGE AND ARCHIVING SYSTEMS IV, 1999, 3846 : 102 - 112
[25] Cleft Palate Speech and Resonance: An Audio and Video Resource
VanLue, Michael
PLASTIC AND RECONSTRUCTIVE SURGERY, 2021, 147 (04) : 1029 - 1030
[26] Deepfakes and Disinformation: Exploring the Impact of Synthetic Political Video on Deception, Uncertainty, and Trust in News
Vaccari, Cristian
Chadwick, Andrew
SOCIAL MEDIA + SOCIETY, 2020, 6 (01)
[27] Constructing a speech audio–video corpus by aligning long segments of speech and text
Karpukhin I.A.
Konushin A.S.
Moscow University Computational Mathematics and Cybernetics, 2017, 41 (2) : 97 - 103
[28] Providing detection strategies to improve human detection of deepfakes: An experimental study
Somoray, Klaire
Miller, Dan J.
COMPUTERS IN HUMAN BEHAVIOR, 2023, 149
[29] NON-SPEECH AUDIO EVENT DETECTION
Portelo, Jose
Bugalho, Miguel
Trancoso, Isabel
Neto, Joao
Abad, Alberto
Serralheiro, Antonio
2009 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1- 8, PROCEEDINGS, 2009, : 1973 - 1976
[30] Unmasking Deepfakes: Masked Autoencoding Spatiotemporal Transformers for Enhanced Video Forgery Detection
Das, Sayantan
Kolahdouzi, Mojtaba
Ozparlak, Levent
Hickie, Will
Etemad, Ali
2023 IEEE INTERNATIONAL JOINT CONFERENCE ON BIOMETRICS, IJCB, 2023