A lightweight approach to real-time speaker diarization: from audio toward audio-visual data streams

Times cited: 0
Authors
Kynych, Frantisek [1 ]
Cerva, Petr [1 ]
Zdansky, Jindrich [1 ]
Svendsen, Torbjorn [2 ]
Salvi, Giampiero [2 ,3 ]
Affiliations
[1] Tech Univ Liberec, Fac Mechatron Informat & Interdisciplinary Studies, Studentska 2, Liberec 46117, Czech Republic
[2] Norwegian Univ Sci & Technol, Dept Elect Syst, NO-7491 Trondheim, Norway
[3] KTH Royal Inst Technol, Sch Elect Engn & Comp Sci, Brinellvagen 8, SE-10044 Stockholm, Sweden
Source
Keywords
Speaker diarization; Streamed data processing; Multi-modal; Audio-visual; Deep learning; SOURCE SEPARATION; MEETINGS;
DOI
10.1186/s13636-024-00382-2
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject classification codes
070206; 082403;
Abstract
This manuscript deals with the task of real-time speaker diarization (SD) for stream-wise data processing. Therefore, in contrast to most existing papers, it considers not only the accuracy but also the computational demands of the individual investigated methods. We first propose a new lightweight scheme that allows us to perform speaker diarization of streamed audio data. Our approach utilizes a modified residual network with squeeze-and-excitation blocks (SE-ResNet-34) to extract speaker embeddings in an optimized way using cached buffers. These embeddings are subsequently used for voice activity detection (VAD) and block-online k-means clustering with a look-ahead mechanism. The described scheme yields results similar to the reference offline system while operating solely on a CPU with a low real-time factor (RTF) below 0.1 and a constant latency of around 5.5 s. In the next part of the work, our research moves toward the much more demanding and complex real-time processing of audio-visual data streams. For this purpose, we extend the above-mentioned scheme for audio data processing by adding an audio-video module. This module utilizes SyncNet combined with visual embeddings for identity tracking. Our resulting multi-modal SD framework then combines the outputs of the audio and audio-video modules using a new overlap-based fusion strategy. It yields diarization error rates that are competitive with existing state-of-the-art offline audio-visual methods while allowing us to process various audio-video streams, e.g., from Internet or TV broadcasts, in real time on a GPU and with the same latency as for audio data processing.
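To make the block-online clustering step of the abstract more concrete, the snippet below is a minimal, illustrative Python sketch of block-online k-means with a look-ahead buffer over streamed speaker embeddings. It is not the authors' code: the class name BlockOnlineKMeans, the cosine-distance threshold, the block size, and the look-ahead length are assumptions chosen only for demonstration, and the real system additionally couples this step with VAD decisions and SE-ResNet-34 embeddings.

```python
# Minimal sketch of block-online k-means clustering with a look-ahead buffer.
# NOTE: this is NOT the paper's implementation; names and parameter values are
# illustrative assumptions only.
import numpy as np


class BlockOnlineKMeans:
    """Cluster streamed speaker embeddings block by block, emitting labels
    only after `lookahead_blocks` further blocks have arrived."""

    def __init__(self, max_speakers=10, new_cluster_dist=0.6, lookahead_blocks=2):
        self.max_speakers = max_speakers          # upper bound on cluster count
        self.new_cluster_dist = new_cluster_dist  # cosine distance above which a new cluster opens
        self.lookahead_blocks = lookahead_blocks  # blocks held back before labels are emitted
        self.centroids = []                       # running cluster centroids
        self.counts = []                          # number of embeddings per centroid
        self.buffer = []                          # (block_id, embeddings) awaiting emission

    @staticmethod
    def _cosine_dist(a, b):
        return 1.0 - float(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    def _assign(self, emb):
        """Assign one embedding to the nearest centroid or open a new cluster."""
        if self.centroids:
            dists = [self._cosine_dist(emb, c) for c in self.centroids]
            k = int(np.argmin(dists))
            if dists[k] < self.new_cluster_dist or len(self.centroids) >= self.max_speakers:
                self.counts[k] += 1
                # incremental (running-mean) centroid update
                self.centroids[k] += (emb - self.centroids[k]) / self.counts[k]
                return k
        self.centroids.append(np.array(emb, dtype=float))
        self.counts.append(1)
        return len(self.centroids) - 1

    def push_block(self, block_id, embeddings):
        """Add one block of embeddings; return labels for every block that has
        left the look-ahead window (this delay produces the constant latency)."""
        self.buffer.append((block_id, np.asarray(embeddings, dtype=float)))
        emitted = []
        while len(self.buffer) > self.lookahead_blocks:
            bid, embs = self.buffer.pop(0)
            emitted.append((bid, [self._assign(e) for e in embs]))
        return emitted


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clusterer = BlockOnlineKMeans()
    for t in range(6):
        # toy 192-dim "speaker embeddings": two well-separated speakers alternate per block
        block = rng.normal(loc=1.0 if t % 2 == 0 else -1.0, scale=0.05, size=(4, 192))
        for bid, labels in clusterer.push_block(t, block):
            print(f"block {bid}: speaker labels {labels}")
```

In this sketch, the look-ahead buffer is what creates the fixed output delay: labels for a block are emitted only once the configured number of subsequent blocks has arrived, which is analogous to the constant latency of around 5.5 s reported in the abstract.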
Pages: 16