A lightweight approach to real-time speaker diarization: from audio toward audio-visual data streams

Cited by: 0

Authors:

Kynych, Frantisek ^{[1]}

Cerva, Petr ^{[1]}

Zdansky, Jindrich ^{[1]}

Svendsen, Torbjorn ^{[2]}

Salvi, Giampiero ^{[2,3]}

Affiliations:

[1] Tech Univ Liberec, Fac Mechatron Informat & Interdisciplinary Studies, Studentska 2, Liberec 46117, Czech Republic

[2] Norwegian Univ Sci & Technol, Dept Elect Syst, NO-7491 Trondheim, Norway

[3] KTH Royal Inst Technol, Sch Elect Engn & Comp Sci, Brinellvagen 8, SE-10044 Stockholm, Sweden

Source:

EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING | Year 2024 / Volume 2024 / Issue 01

Keywords:

Speaker diarization; Streamed data processing; Multi-modal; Audio-visual; Deep learning; SOURCE SEPARATION; MEETINGS;

DOI:

10.1186/s13636-024-00382-2

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

This manuscript deals with the task of real-time speaker diarization (SD) for stream-wise data processing. Therefore, in contrast to most of the existing papers, it considers not only the accuracy but also the computational demands of individual investigated methods. We first propose a new lightweight scheme allowing us to perform speaker diarization of streamed audio data. Our approach utilizes a modified residual network with squeeze-and-excitation blocks (SE-ResNet-34) to extract speaker embeddings in an optimized way using cached buffers. These embeddings are subsequently used for voice activity detection (VAD) and block-online k-means clustering with a look-ahead mechanism. The described scheme yields results similar to the reference offline system while operating solely on a CPU with a low real-time factor (RTF) below 0.1 and a constant latency of around 5.5 s. In the next part of the work, our research moves toward much more demanding and complex real-time processing of audio-visual data streams. For this purpose, we extend the above-mentioned scheme for audio data processing by adding an audio-video module. This module utilizes SyncNet combined with visual embeddings for identity tracking. Our resulting multi-modal SD framework then combines the outputs from audio and audio-video modules by using a new overlap-based fusion strategy. It yields diarization error rates that are competitive with the existing state-of-the-art offline audio-visual methods while allowing us to process various audio-video streams, e.g., from Internet or TV broadcasts, in real-time using GPU and with the same latency as for audio data processing.
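
The real-time factor (RTF) quoted in the abstract is conventionally the ratio of processing time to the duration of the processed audio, so a value below 1 means the system keeps up with the incoming stream. The minimal sketch below illustrates only that ratio; the function names and the numbers are hypothetical and are not taken from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return RTF = processing time / duration of the processed audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def keeps_up_with_stream(rtf: float) -> bool:
    """A stream can be processed in real time only if RTF is below 1."""
    return rtf < 1.0


# Hypothetical measurement (an assumption for illustration):
# 60 s of audio processed in 4.8 s of compute time.
rtf = real_time_factor(4.8, 60.0)
print(f"RTF = {rtf:.2f}")
print("real-time capable:", keeps_up_with_stream(rtf))
```

Note that RTF measures throughput only. The constant latency of around 5.5 s reported in the abstract is a separate quantity: it describes how long after the audio arrives a label is emitted.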


Pages: 16
