A lightweight approach to real-time speaker diarization: from audio toward audio-visual data streams

Authors
Kynych, Frantisek [1]
Cerva, Petr [1]
Zdansky, Jindrich [1]
Svendsen, Torbjorn [2]
Salvi, Giampiero [2, 3]
Affiliations
[1] Tech Univ Liberec, Fac Mechatron Informat & Interdisciplinary Studies, Studentska 2, Liberec 46117, Czech Republic
[2] Norwegian Univ Sci & Technol, Dept Elect Syst, NO-7491 Trondheim, Norway
[3] KTH Royal Inst Technol, Sch Elect Engn & Comp Sci, Brinellvagen 8, SE-10044 Stockholm, Sweden
Keywords
Speaker diarization; Streamed data processing; Multi-modal; Audio-visual; Deep learning; Source separation; Meetings
DOI
10.1186/s13636-024-00382-2
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
This manuscript addresses real-time speaker diarization (SD) for stream-wise data processing. In contrast to most existing work, it therefore considers not only the accuracy but also the computational demands of each investigated method. We first propose a new lightweight scheme for speaker diarization of streamed audio data. Our approach uses a modified residual network with squeeze-and-excitation blocks (SE-ResNet-34) to extract speaker embeddings efficiently using cached buffers. These embeddings are subsequently used for voice activity detection (VAD) and block-online k-means clustering with a look-ahead mechanism. The described scheme yields results similar to those of the reference offline system while operating solely on a CPU with a real-time factor (RTF) below 0.1 and a constant latency of around 5.5 s. In the next part of the work, our research moves toward the much more demanding and complex real-time processing of audio-visual data streams. For this purpose, we extend the above-mentioned audio processing scheme with an audio-video module. This module utilizes SyncNet combined with visual embeddings for identity tracking. Our resulting multi-modal SD framework then combines the outputs of the audio and audio-video modules using a new overlap-based fusion strategy. It yields diarization error rates that are competitive with existing state-of-the-art offline audio-visual methods while allowing us to process various audio-video streams, e.g., from Internet or TV broadcasts, in real time using a GPU and with the same latency as for audio data processing.
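To make two terms from the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows how a real-time factor is computed (processing time divided by audio duration) and one simple way to cluster speaker embeddings block by block using cosine similarity. The class name `BlockOnlineKMeans`, the `new_cluster_threshold` and `max_speakers` parameters, and the toy 2-D embeddings are assumptions made for illustration only; the look-ahead mechanism, VAD, and SE-ResNet-34 embedding extraction described in the abstract are deliberately left out.

```python
import math


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def _normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def _cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


class BlockOnlineKMeans:
    """Hypothetical sketch of block-wise online clustering of embeddings.

    Each embedding joins the most similar existing centroid, or opens a new
    cluster when no centroid is similar enough (threshold is an assumption).
    """

    def __init__(self, new_cluster_threshold=0.5, max_speakers=8):
        self.new_cluster_threshold = new_cluster_threshold
        self.max_speakers = max_speakers
        self.sums = []  # running sum of unit-length member embeddings per cluster

    def process_block(self, embeddings):
        """Assign a cluster label to every embedding in one block."""
        labels = []
        for emb in embeddings:
            unit = _normalize(emb)
            sims = [_cosine(unit, _normalize(s)) for s in self.sums]
            best = max(range(len(sims)), key=sims.__getitem__) if sims else None
            can_grow = len(self.sums) < self.max_speakers
            if best is None or (sims[best] < self.new_cluster_threshold and can_grow):
                # No sufficiently similar centroid: start a new speaker cluster.
                self.sums.append(list(unit))
                labels.append(len(self.sums) - 1)
            else:
                # Update the running sum, which moves the centroid toward the sample.
                self.sums[best] = [s + x for s, x in zip(self.sums[best], unit)]
                labels.append(best)
        return labels


if __name__ == "__main__":
    clusterer = BlockOnlineKMeans(new_cluster_threshold=0.5)
    print(clusterer.process_block([[1.0, 0.0], [0.9, 0.1]]))
    print(clusterer.process_block([[0.0, 1.0], [1.0, 0.05]]))
    print(real_time_factor(processing_seconds=0.5, audio_seconds=10.0))
```

In this sketch, labels are final as soon as a block is processed. A look-ahead mechanism such as the one named in the abstract would instead delay the decision for a block until some later audio has been seen, which is one plausible source of a fixed latency.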
Pages: 16