Visually Supervised Speaker Detection and Localization via Microphone Array

Cited by: 4
Authors
Berghi, Davide [1]
Hilton, Adrian [1]
Jackson, Philip J. B. [1]
Affiliations
[1] Univ Surrey, CVSSP, Guildford, Surrey, England
Funding
Innovate UK; Academy of Finland
Keywords
speaker localization; self-supervised learning; voice activity detection; microphone array beamforming;
DOI
10.1109/MMSP53017.2021.9733678
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Active speaker detection (ASD) is a multi-modal task that aims to identify who, if anyone, is speaking from a set of candidates. Current audio-visual approaches for ASD typically rely on visually pre-extracted face tracks (sequences of consecutive face crops) and the respective monaural audio. However, their recall rate is often low as only the visible faces are included in the set of candidates. Monaural audio may successfully detect the presence of speech activity but fails in localizing the speaker due to the lack of spatial cues. Our solution extends the audio front-end using a microphone array. We train an audio convolutional neural network (CNN) in combination with beamforming techniques to regress the speaker's horizontal position directly in the video frames. We propose to generate weak labels using a pre-trained active speaker detector on pre-extracted face tracks. Our pipeline embraces the "student-teacher" paradigm, where a trained "teacher" network is used to produce pseudo-labels visually. The "student" network is an audio network trained to generate the same results. At inference, the student network can independently localize the speaker in the visual frames directly from the audio input. Experimental results on newly collected data prove that our approach significantly outperforms a variety of other baselines as well as the teacher network itself. It results in an excellent speech activity detector too.
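The weak-labelling step described in the abstract can be illustrated with a minimal sketch. It assumes the teacher network outputs a per-face speaking score for each frame and that the student's regression target is the horizontal centre of the highest-scoring face, normalised by the frame width. The names `FaceDetection` and `pseudo_label`, and the 0.5 threshold, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FaceDetection:
    """One face crop in a video frame, scored by a (hypothetical) teacher ASD model."""
    x_min: float           # left edge of the face box, in pixels
    x_max: float           # right edge of the face box, in pixels
    speaking_score: float  # teacher confidence that this face is speaking, in [0, 1]


def pseudo_label(
    faces: List[FaceDetection],
    frame_width: float,
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> Tuple[int, Optional[float]]:
    """Return (speech_active, x_position) for one frame.

    x_position is the centre of the highest-scoring face, normalised to [0, 1].
    It is None when no face reaches the threshold.
    """
    candidates = [f for f in faces if f.speaking_score >= threshold]
    if not candidates:
        return 0, None
    best = max(candidates, key=lambda f: f.speaking_score)
    centre = 0.5 * (best.x_min + best.x_max)
    return 1, centre / frame_width
```

Under these assumptions, frames labelled `(0, None)` could supply negative examples for speech activity detection, while frames with a position could supervise the audio network's horizontal regression.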
Pages: 6