AUDIO-VISUAL PERSON RECOGNITION IN MULTIMEDIA DATA FROM THE IARPA JANUS PROGRAM

Authors
Sell, Gregory [1]
Duh, Kevin [1]
Snyder, David [1]
Etter, Dave [1]
Garcia-Romero, Daniel [1]
Affiliations
[1] Johns Hopkins University, Human Language Technology Center of Excellence, Baltimore, MD 21218 USA
Keywords
multimodal; audio-visual; speaker recognition; face recognition; multimedia
DOI
Not available
Chinese Library Classification number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Datasets that support audio-visual recognition of people in videos are currently scarce and limited. In this paper, we introduce an expansion of video data from the IARPA Janus program to support this research area. We refer to the expanded set, which adds voice labels to the existing face labels, as the Janus Multimedia dataset. We first describe the speaker labeling process, which combined automatic and manual criteria. We then discuss two evaluation settings for this data: in the core condition, the voice and face of the labeled individual are present in every video; in the full condition, no such guarantee is made. We then demonstrate the power of audio-visual fusion using these publicly available videos and labels, showing significant improvement over recognizing voice or face alone. Finally, we discuss several other possible paths for future research with this dataset.
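The abstract does not say how the voice and face systems are fused, so the following is only a minimal sketch of one common option, weighted score-level fusion, and not the authors' method. The function name `fuse_scores`, the `voice_weight` parameter, and the assumption that both scores are already calibrated to a comparable scale are all hypothetical. The fallback branches illustrate the full condition, where a video may lack the person's voice or face.

```python
from typing import Optional


def fuse_scores(
    voice: Optional[float],
    face: Optional[float],
    voice_weight: float = 0.5,  # hypothetical weight, not taken from the paper
) -> Optional[float]:
    """Combine per-modality match scores for one (video, enrolled person) trial.

    Assumes both scores are on a comparable scale. A score of None means the
    modality produced no evidence (for example, no detected face or no speech).
    """
    if voice is None and face is None:
        return None  # no evidence from either modality
    if voice is None:
        return face  # face-only fallback
    if face is None:
        return voice  # voice-only fallback
    # Both modalities available: weighted sum of the two scores.
    return voice_weight * voice + (1.0 - voice_weight) * face
```

In the core condition both scores would always be present, so only the weighted-sum branch would apply; in the full condition the fallbacks keep every trial scorable whenever at least one modality is available.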
Pages: 3031-3035
Page count: 5