Audio-Visual Speech Synchronization Detection Using a Bimodal Linear Prediction Model

Authors
Kumar, Kshitiz [1]
Navratil, Jiri [2]
Marcheret, Etienne [2]
Libal, Vit [2]
Ramaswamy, Ganesh [2]
Potamianos, Gerasimos [3]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA
[3] NCSR Demokritos, Inst Informat & Telecommun, GR-15310 Athens, Greece
Keywords
Audio-Visual Synchronization; Mutual Information; Linear Prediction; Visual Features
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this work, we study the problem of detecting audio-visual (AV) synchronization in video segments containing a speaker in frontal head pose. The problem has important applications in biometrics, for example spoofing detection, and it constitutes an important step in the AV segmentation necessary for deriving AV fingerprints in multimodal speaker recognition. To attack the problem, we propose a time-evolution model for AV features and derive an analytical approach to capture the notion of synchronization between them. We report results on an appropriate AV database, using two types of visual features extracted from the speaker's facial area: geometric ones and features based on the discrete cosine image transform. Our results demonstrate that the proposed approach provides substantially better AV synchrony detection than a baseline method that employs mutual information, with the geometric visual features outperforming the image transform ones.
Pages: 670 / +
Page count: 2