Audio-Visual Speech Synchronization Detection Using a Bimodal Linear Prediction Model

被引:0
|
作者
Kumar, Kshitiz [1 ]
Navratil, Jiri [2 ]
Marcheret, Etienne [2 ]
Libal, Vit [2 ]
Ramaswamy, Ganesh [2 ]
Potamianos, Gerasimos [3 ]
机构
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA
[3] NCSR Demokritos, Inst Informat & Telecommun, GR-15310 Athens, Greece
关键词
Audio-Visual Synchronization; Mutual Information; Linear Prediction; Visual Features;
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
In this work, we study the problem of detecting audiovisual (AV) synchronization in video segments containing a speaker in frontal head pose. The problem holds important applications in biometrics, for example spoofing detection, and it constitutes an important step in AV segmentation necessary for deriving AV fingerprints in multimodal speaker recognition. To attack the problem, we propose a time-evolution model for AV features and derive an analytical approach to capture the notion of synchronization between them. We report results on an appropriate AV database, using two hypes of visual features extracted from the speaker's facial area: geometric ones and features based on the discrete cosine image transform. Our results demonstrate that the proposed approach provides substantially better AV synchrony detection over a baseline method that employs mutual information, with the geometric visual features outperforming the image transform ones.
引用
收藏
页码:670 / +
页数:2
相关论文
共 50 条
  • [41] Audio-visual speech perception is special
    Tuomainen, J
    Andersen, TS
    Tiippana, K
    Sams, M
    [J]. COGNITION, 2005, 96 (01) : B13 - B22
  • [42] Deep Audio-Visual Speech Recognition
    Afouras, Triantafyllos
    Chung, Joon Son
    Senior, Andrew
    Vinyals, Oriol
    Zisserman, Andrew
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (12) : 8717 - 8727
  • [43] MULTIPOSE AUDIO-VISUAL SPEECH RECOGNITION
    Estellers, Virginia
    Thiran, Jean-Philippe
    [J]. 19TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO-2011), 2011, : 1065 - 1069
  • [44] Using Twin-HMM-Based Audio-Visual Speech Enhancement as a Front-End for Robust Audio-Visual Speech Recognition
    Abdelaziz, Ahmed Hussen
    Zeiler, Steffen
    Kolossa, Dorothea
    [J]. 14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 867 - 871
  • [45] Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model
    Ahmad, Rehan
    Zubair, Syed
    Alquhayz, Hani
    Ditta, Allah
    [J]. SENSORS, 2019, 19 (23)
  • [46] Audio-Visual Speech Cue Combination
    Arnold, Derek H.
    Tear, Morgan
    Schindel, Ryan
    Roseboom, Warrick
    [J]. PLOS ONE, 2010, 5 (04):
  • [47] Audio-visual integration for speech recognition
    Kober, R
    Harz, U
    [J]. NEUROLOGY PSYCHIATRY AND BRAIN RESEARCH, 1996, 4 (04) : 179 - 184
  • [48] Audio-visual speech recognition by speechreading
    Zhang, XZ
    Mersereau, RM
    Clements, MA
    [J]. DSP 2002: 14TH INTERNATIONAL CONFERENCE ON DIGITAL SIGNAL PROCESSING PROCEEDINGS, VOLS 1 AND 2, 2002, : 1069 - 1072
  • [49] Audio-visual speech enhancement with AVCDCN (audio-visual codebook dependent cepstral normalization)
    Deligne, S
    Potamianos, G
    Neti, C
    [J]. SAM2002: IEEE SENSOR ARRAY AND MULTICHANNEL SIGNAL PROCESSING WORKSHOP PROCEEDINGS, 2002, : 68 - 71
  • [50] Audio-Visual Speech Recognition Using Bimodal-Trained Bottleneck Features for a Person with Severe Hearing Loss
    Takashima, Yuki
    Aihara, Ryo
    Takiguchi, Tetsuya
    Ariki, Yasuo
    Mitani, Nobuyuki
    Omori, Kiyohiro
    Nakazono, Kaoru
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 277 - 281