Audio-Visual Speech Synchronization Detection Using a Bimodal Linear Prediction Model

Cited by: 0

Authors:

Kumar, Kshitiz [1]

Navratil, Jiri [2]

Marcheret, Etienne [2]

Libal, Vit [2]

Ramaswamy, Ganesh [2]

Potamianos, Gerasimos [3]

Affiliations:

[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

[2] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA

[3] NCSR Demokritos, Inst Informat & Telecommun, GR-15310 Athens, Greece

Source:

2009 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPR WORKSHOPS 2009), VOLS 1 AND 2 | 2009

Keywords:

Audio-Visual Synchronization; Mutual Information; Linear Prediction; Visual Features;

DOI:

Not available

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this work, we study the problem of detecting audiovisual (AV) synchronization in video segments containing a speaker in frontal head pose. The problem has important applications in biometrics, for example spoofing detection, and it constitutes an important step in the AV segmentation necessary for deriving AV fingerprints in multimodal speaker recognition. To attack the problem, we propose a time-evolution model for AV features and derive an analytical approach to capture the notion of synchronization between them. We report results on an appropriate AV database, using two types of visual features extracted from the speaker's facial area: geometric ones and features based on the discrete cosine image transform. Our results demonstrate that the proposed approach provides substantially better AV synchrony detection than a baseline method that employs mutual information, with the geometric visual features outperforming the image transform ones.

Citation

Pages: 670 / +

Page count: 2
