A lightweight approach to real-time speaker diarization: from audio toward audio-visual data streams

Times cited: 0
Authors
Kynych, Frantisek [1 ]
Cerva, Petr [1 ]
Zdansky, Jindrich [1 ]
Svendsen, Torbjorn [2 ]
Salvi, Giampiero [2 ,3 ]
Affiliations
[1] Tech Univ Liberec, Fac Mechatron Informat & Interdisciplinary Studies, Studentska 2, Liberec 46117, Czech Republic
[2] Norwegian Univ Sci & Technol, Dept Elect Syst, NO-7491 Trondheim, Norway
[3] KTH Royal Inst Technol, Sch Elect Engn & Comp Sci, Brinellvagen 8, SE-10044 Stockholm, Sweden
Source
EURASIP Journal on Audio, Speech, and Music Processing, 2024
Keywords
Speaker diarization; Streamed data processing; Multi-modal; Audio-visual; Deep learning; Source separation; Meetings
DOI
10.1186/s13636-024-00382-2
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This manuscript addresses real-time speaker diarization (SD) for stream-wise data processing and therefore, in contrast to most existing papers, considers not only the accuracy but also the computational demands of the investigated methods. We first propose a new lightweight scheme for speaker diarization of streamed audio data. Our approach uses a modified residual network with squeeze-and-excitation blocks (SE-ResNet-34) to extract speaker embeddings efficiently by means of cached buffers. These embeddings are subsequently used for voice activity detection (VAD) and block-online k-means clustering with a look-ahead mechanism. The described scheme yields results similar to the reference offline system while running solely on a CPU with a real-time factor (RTF) below 0.1 and a constant latency of around 5.5 s. The second part of the work moves toward the much more demanding real-time processing of audio-visual data streams. For this purpose, we extend the audio scheme with an audio-video module that combines SyncNet with visual embeddings for identity tracking. The resulting multi-modal SD framework combines the outputs of the audio and audio-video modules using a new overlap-based fusion strategy. It yields diarization error rates competitive with state-of-the-art offline audio-visual methods while processing various audio-video streams, e.g., from Internet or TV broadcasts, in real time on a GPU with the same latency as for audio-only processing.
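To make the block-online clustering step concrete, the following Python fragment is a minimal sketch of block-online k-means over streamed speaker embeddings with a look-ahead buffer. It is not the authors' implementation: the class name, the cosine-distance threshold, the running-mean centroid update, and the rule that a new speaker must be corroborated by the look-ahead buffer are all assumptions made for illustration; the paper's VAD, SE-ResNet-34 embedding extraction with cached buffers, and audio-visual fusion are not modeled here.

```python
import numpy as np


class BlockOnlineKMeans:
    """Illustrative block-online k-means with a look-ahead buffer (not the authors' code)."""

    def __init__(self, new_cluster_dist=0.6, look_ahead_blocks=2, min_support=2):
        self.new_cluster_dist = new_cluster_dist    # cosine distance above which a new speaker may be opened (assumed)
        self.look_ahead_blocks = look_ahead_blocks  # number of future blocks buffered before labels are emitted
        self.min_support = min_support              # look-ahead embeddings needed to confirm a new speaker (assumed)
        self.centroids = []                         # running speaker centroids
        self.counts = []                            # embeddings folded into each centroid
        self.buffer = []                            # blocks awaiting labels (the look-ahead window)

    @staticmethod
    def _cosine_dist(a, b):
        return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    def _new_cluster(self, emb):
        self.centroids.append(np.array(emb, dtype=np.float64))
        self.counts.append(1)
        return len(self.centroids) - 1

    def _assign(self, emb):
        """Label one embedding; new speakers are opened only when the look-ahead buffer corroborates them."""
        if not self.centroids:
            return self._new_cluster(emb)
        dists = [self._cosine_dist(emb, c) for c in self.centroids]
        k = int(np.argmin(dists))
        if dists[k] < self.new_cluster_dist:
            # running-mean update of the matched centroid
            self.counts[k] += 1
            self.centroids[k] += (emb - self.centroids[k]) / self.counts[k]
            return k
        # far from every centroid: require support from buffered future blocks
        # before opening a new speaker (guards against transient outliers)
        support = sum(
            1
            for blk in self.buffer
            for e in blk
            if self._cosine_dist(emb, e) < self.new_cluster_dist
        )
        return self._new_cluster(emb) if support >= self.min_support else k

    def push_block(self, block):
        """Buffer a new block of embeddings; once the look-ahead window is full,
        label the oldest buffered block. Returns its labels, or None."""
        self.buffer.append(np.asarray(block, dtype=np.float64))
        if len(self.buffer) <= self.look_ahead_blocks:
            return None  # still filling the look-ahead window -> fixed latency
        return [self._assign(e) for e in self.buffer.pop(0)]

    def flush(self):
        """Label the blocks still buffered when the stream ends."""
        out = [[self._assign(e) for e in blk] for blk in self.buffer]
        self.buffer = []
        return out


# Toy usage: two synthetic "speakers" around orthogonal directions, streamed in blocks of 4.
rng = np.random.default_rng(0)
dirs = np.linalg.qr(rng.normal(size=(16, 2)))[0].T
stream = np.concatenate([d + 0.05 * rng.normal(size=(20, 16)) for d in dirs])
clusterer = BlockOnlineKMeans()
for start in range(0, len(stream), 4):
    labels = clusterer.push_block(stream[start:start + 4])
    if labels is not None:
        print(labels)
print(clusterer.flush())
```

Delaying each block's labels by `look_ahead_blocks` blocks is what produces the constant latency mentioned in the abstract; the real system additionally amortizes the embedding extraction itself via cached buffers, which this sketch does not cover.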
Pages: 16