A lightweight approach to real-time speaker diarization: from audio toward audio-visual data streams

Cited: 0
Authors
Kynych, Frantisek [1]
Cerva, Petr [1]
Zdansky, Jindrich [1]
Svendsen, Torbjorn [2]
Salvi, Giampiero [2,3]
Affiliations
[1] Tech Univ Liberec, Fac Mechatron Informat & Interdisciplinary Studies, Studentska 2, Liberec 46117, Czech Republic
[2] Norwegian Univ Sci & Technol, Dept Elect Syst, NO-7491 Trondheim, Norway
[3] KTH Royal Inst Technol, Sch Elect Engn & Comp Sci, Brinellvagen 8, SE-10044 Stockholm, Sweden
Keywords
Speaker diarization; Streamed data processing; Multi-modal; Audio-visual; Deep learning; Source separation; Meetings
DOI
10.1186/s13636-024-00382-2
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
This manuscript deals with the task of real-time speaker diarization (SD) for stream-wise data processing. In contrast to most existing papers, it therefore considers not only the accuracy but also the computational demands of the individual investigated methods. We first propose a new lightweight scheme that allows us to perform speaker diarization of streamed audio data. Our approach utilizes a modified residual network with squeeze-and-excitation blocks (SE-ResNet-34) to extract speaker embeddings in an optimized way using cached buffers. These embeddings are subsequently used for voice activity detection (VAD) and block-online k-means clustering with a look-ahead mechanism. The described scheme yields results similar to the reference offline system while operating solely on a CPU, with a low real-time factor (RTF) below 0.1 and a constant latency of around 5.5 s. In the next part of the work, our research moves toward the much more demanding and complex real-time processing of audio-visual data streams. For this purpose, we extend the above-mentioned scheme for audio data processing by adding an audio-video module. This module utilizes SyncNet combined with visual embeddings for identity tracking. Our resulting multi-modal SD framework then combines the outputs of the audio and audio-video modules using a new overlap-based fusion strategy. It yields diarization error rates that are competitive with existing state-of-the-art offline audio-visual methods while allowing us to process various audio-video streams, e.g., from Internet or TV broadcasts, in real time on a GPU and with the same latency as for audio data processing.
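Two mechanisms in the abstract are concrete enough to illustrate with a short sketch. The first is the block-online k-means clustering with a look-ahead mechanism used to label the streamed speaker embeddings. The Python sketch below is only an illustration of that general idea, not the authors' implementation: the class name, the cosine-distance threshold, the speaker cap, and the look-ahead depth are assumptions introduced for the example.

# Hypothetical sketch of block-online k-means clustering with a look-ahead
# buffer over streamed speaker embeddings. Names and default values are
# illustrative assumptions, not taken from the paper.
import numpy as np


class BlockOnlineKMeans:
    """Cluster speaker embeddings block by block, delaying label decisions
    by a fixed number of look-ahead blocks so each emitted label can use
    a little future context while latency stays bounded and constant."""

    def __init__(self, max_speakers=4, new_cluster_threshold=0.6, lookahead_blocks=2):
        self.max_speakers = max_speakers        # upper bound on cluster count (assumed)
        self.threshold = new_cluster_threshold  # cosine distance above which a new cluster opens
        self.lookahead = lookahead_blocks       # number of blocks withheld before labeling
        self.centroids = []                     # running centroids, one per speaker cluster
        self.counts = []                        # embeddings merged into each centroid
        self.buffer = []                        # (block_id, embedding) look-ahead queue

    @staticmethod
    def _cosine_distance(a, b):
        return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def _update_clusters(self, emb):
        """Fold one embedding into the nearest centroid, or open a new cluster."""
        if self.centroids:
            dists = [self._cosine_distance(emb, c) for c in self.centroids]
            k = int(np.argmin(dists))
            if dists[k] < self.threshold or len(self.centroids) >= self.max_speakers:
                self.counts[k] += 1
                self.centroids[k] += (emb - self.centroids[k]) / self.counts[k]
                return
        self.centroids.append(np.array(emb, dtype=float))
        self.counts.append(1)

    def push(self, block_id, emb):
        """Accept one block-level embedding and return the labels that can be
        finalized, i.e., the blocks older than the look-ahead horizon."""
        emb = np.asarray(emb, dtype=float)
        self.buffer.append((block_id, emb))
        self._update_clusters(emb)              # centroids see the new block immediately
        finalized = []
        while len(self.buffer) > self.lookahead:
            bid, e = self.buffer.pop(0)
            dists = [self._cosine_distance(e, c) for c in self.centroids]
            finalized.append((bid, int(np.argmin(dists))))
        return finalized

Under this sketch, the labeling delay is simply the block duration multiplied by the look-ahead depth, which is one way to obtain the constant, bounded latency the abstract reports.

The second mechanism is the overlap-based fusion of the audio and audio-video outputs. The abstract does not specify the fusion rule, so the following function shows one plausible reading, relabeling each audio speaker with the visual identity whose segments overlap it the most; the segment format, identity names, and the fallback to the audio label are all assumptions for the example.

# Hypothetical illustration of an overlap-based fusion of audio and
# audio-visual diarization outputs; not the paper's actual strategy.
from collections import defaultdict


def fuse_by_overlap(audio_segments, av_segments):
    """audio_segments: list of (start, end, audio_speaker_label)
    av_segments: list of (start, end, visual_identity) from the audio-video module
    Returns the audio segments relabeled with the visual identity that overlaps
    each audio speaker the most; speakers with no overlap keep their audio label."""
    overlap = defaultdict(float)                # (speaker, identity) -> overlapping seconds
    for a_start, a_end, spk in audio_segments:
        for v_start, v_end, ident in av_segments:
            dur = min(a_end, v_end) - max(a_start, v_start)
            if dur > 0:
                overlap[(spk, ident)] += dur

    best = {}                                   # speaker -> (identity, overlap) with maximal overlap
    for (spk, ident), dur in overlap.items():
        if dur > best.get(spk, (None, 0.0))[1]:
            best[spk] = (ident, dur)

    return [(start, end, best.get(spk, (spk, 0.0))[0]) for start, end, spk in audio_segments]


# Toy usage with hypothetical segments (times in seconds):
audio = [(0.0, 4.0, "spk0"), (4.0, 9.0, "spk1")]
video = [(0.5, 3.5, "anchor"), (4.2, 8.8, "guest")]
print(fuse_by_overlap(audio, video))            # [(0.0, 4.0, 'anchor'), (4.0, 9.0, 'guest')]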
Pages: 16
Related Papers
50 records in total
  • [31] Real-time monitoring of participants' interaction in a meeting using audio-visual sensors
    Busso, Carlos
    Georgiou, Panayiotis G.
    Narayanan, Shrikanth S.
    2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL II, PTS 1-3, 2007, : 685 - +
  • [32] Speaker Diarization System for Autism Children's Real-Life Audio Data
    Zhou, Tianyan
    Cai, Weicheng
    Chen, Xiaoyan
    Zou, Xiaobing
    Zhang, Shiki
    Li, Ming
    2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2016,
  • [33] Who said that?: Audio-visual speaker diarisation of real-world meetings
    Chung, Joon Son
    Lee, Bong-Jin
    Han, Icksang
    INTERSPEECH 2019, 2019, : 371 - 375
  • [34] A Novel Real-Time, Lightweight Chaotic-Encryption Scheme for Next-Generation Audio-Visual Hearing Aids
    Adeel, Ahsan
    Ahmad, Jawad
    Larijani, Hadi
    Hussain, Amir
    COGNITIVE COMPUTATION, 2020, 12 (03) : 589 - 601
  • [36] Adaptive recovery techniques for real-time audio streams
    Liao, WT
    Chen, JC
    Chen, MS
    IEEE INFOCOM 2001: THE CONFERENCE ON COMPUTER COMMUNICATIONS, VOLS 1-3, PROCEEDINGS: TWENTY YEARS INTO THE COMMUNICATIONS ODYSSEY, 2001, : 815 - 823
  • [37] Uncertainty-Guided End-to-End Audio-Visual Speaker Diarization for Far-Field Recordings
    Yang, Chenyu
    Chen, Mengxi
    Wang, Yanfeng
    Wang, Yu
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4031 - 4041
  • [38] Multi-Speaker Tracking From an Audio-Visual Sensing Device
    Qian, Xinyuan
    Brutti, Alessio
    Lanz, Oswald
    Omologo, Maurizio
    Cavallaro, Andrea
    IEEE TRANSACTIONS ON MULTIMEDIA, 2019, 21 (10) : 2576 - 2588
  • [39] Real-Time Idling Vehicles Detection Using Combined Audio-Visual Deep Learning
    Li, Xiwen
    Mangin, Tristalee
    Saha, Surojit
    Mohammed, Rehman
    Blanchard, Evan
    Tang, Dillon
    Poppe, Henry
    Choi, Ouk
    Kelly, Kerry
    Whitaker, Ross
    EMERGING CUTTING-EDGE DEVELOPMENTS IN INTELLIGENT TRAFFIC AND TRANSPORTATION SYSTEMS, ICITT 2023/ICCNT, 2024, 50 : 142 - 158
  • [40] Real-time audio-visual localization of user using microphone array and vision camera
    Choi, C
    Kong, DG
    Lee, S
    Park, K
    Hong, SG
    Lee, HK
    Bang, S
    Lee, Y
    Kim, S
    2005 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, VOLS 1-4, 2005, : 497 - 502