Audio-Visual Speaker Recognition for Video Broadcast News

Authors
Benoît Maison
Chalapathy Neti
Andrew Senior
Affiliation
[1] IBM Thomas J. Watson Research Center
Keywords
speaker identification; face recognition; multimodal; fusion; broadcast news
DOI
Not available
Abstract
Audio-based speaker identification degrades severely when training and test conditions are mismatched, whether because of channel differences or noise. In this paper, we explore techniques for combining video-based speaker identification with audio-based speaker identification to improve performance under mismatched conditions. Specifically, we explore techniques for determining the optimal relative weights of the independent audio-based and video-based decisions. Experiments on video broadcast news data show that fusion yields significant improvements in acoustically degraded conditions.
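The abstract does not state the fusion rule, so the following is only a minimal sketch of one common way to weight two independent decisions: a linear combination of per-speaker scores, with the audio weight chosen by grid search on held-out data. The function names (`fuse_scores`, `pick_audio_weight`), the score matrices, and the grid are illustrative assumptions, not the authors' method.

```python
import numpy as np


def fuse_scores(audio_scores, video_scores, alpha):
    """Weighted sum of per-speaker scores (assumed fusion rule).

    audio_scores, video_scores: arrays of shape (n_trials, n_speakers).
    alpha: weight given to audio; video receives 1 - alpha.
    """
    return alpha * audio_scores + (1.0 - alpha) * video_scores


def pick_audio_weight(audio_scores, video_scores, labels, grid=None):
    """Grid-search the audio weight that maximizes held-out accuracy.

    labels: true speaker index for each trial.
    Returns (best_alpha, best_accuracy); ties keep the smallest alpha.
    """
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)  # hypothetical search grid
    best_alpha, best_acc = 0.0, -1.0
    for alpha in grid:
        fused = fuse_scores(audio_scores, video_scores, alpha)
        acc = float(np.mean(fused.argmax(axis=1) == labels))
        if acc > best_acc:
            best_alpha, best_acc = float(alpha), acc
    return best_alpha, best_acc


# Toy held-out set: 3 trials, 2 enrolled speakers (made-up scores).
labels = np.array([0, 1, 0])
audio = np.array([[2.0, 0.0], [1.0, 0.0], [3.0, 0.0]])  # wrong on trial 1
video = np.array([[0.0, 1.0], [0.0, 3.0], [1.0, 0.0]])  # wrong on trial 0

best_alpha, best_acc = pick_audio_weight(audio, video, labels)
```

In this toy set each modality alone misidentifies one trial, while an intermediate weight identifies all three. In practice the two score streams would usually be normalized to a comparable range before weighting, and the weight would be re-estimated when acoustic conditions change.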
Pages: 71-79
Page count: 8