Audio-Visual Speech Synchronization Detection Using a Bimodal Linear Prediction Model

Cited by: 0

Authors:

Kumar, Kshitiz [1]

Navratil, Jiri [2]

Marcheret, Etienne [2]

Libal, Vit [2]

Ramaswamy, Ganesh [2]

Potamianos, Gerasimos [3]

Affiliations:

[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

[2] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA

[3] NCSR Demokritos, Inst Informat & Telecommun, GR-15310 Athens, Greece

Source:

2009 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPR WORKSHOPS 2009), VOLS 1 AND 2 | 2009

Keywords:

Audio-Visual Synchronization; Mutual Information; Linear Prediction; Visual Features;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this work, we study the problem of detecting audiovisual (AV) synchronization in video segments containing a speaker in frontal head pose. The problem holds important applications in biometrics, for example spoofing detection, and it constitutes an important step in AV segmentation necessary for deriving AV fingerprints in multimodal speaker recognition. To attack the problem, we propose a time-evolution model for AV features and derive an analytical approach to capture the notion of synchronization between them. We report results on an appropriate AV database, using two types of visual features extracted from the speaker's facial area: geometric ones and features based on the discrete cosine image transform. Our results demonstrate that the proposed approach provides substantially better AV synchrony detection than a baseline method that employs mutual information, with the geometric visual features outperforming the image transform ones.


Pages: 670 / +

Page count: 2

