A lightweight approach to real-time speaker diarization: from audio toward audio-visual data streams

Cited by: 0

Authors:

Kynych, Frantisek [1]

Cerva, Petr [1]

Zdansky, Jindrich [1]

Svendsen, Torbjorn [2]

Salvi, Giampiero [2,3]

Affiliations:

[1] Tech Univ Liberec, Fac Mechatron Informat & Interdisciplinary Studies, Studentska 2, Liberec 46117, Czech Republic

[2] Norwegian Univ Sci & Technol, Dept Elect Syst, NO-7491 Trondheim, Norway

[3] KTH Royal Inst Technol, Sch Elect Engn & Comp Sci, Brinellvagen 8, SE-10044 Stockholm, Sweden

Source:

EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING | 2024 / Vol. 2024 / Issue 01

Keywords:

Speaker diarization; Streamed data processing; Multi-modal; Audio-visual; Deep learning; SOURCE SEPARATION; MEETINGS;

DOI:

10.1186/s13636-024-00382-2

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

This manuscript deals with the task of real-time speaker diarization (SD) for stream-wise data processing. In contrast to most existing papers, it therefore considers not only the accuracy but also the computational demands of the individual methods investigated. We first propose a new lightweight scheme for speaker diarization of streamed audio data. Our approach utilizes a modified residual network with squeeze-and-excitation blocks (SE-ResNet-34) to extract speaker embeddings in an optimized way using cached buffers. These embeddings are subsequently used for voice activity detection (VAD) and block-online k-means clustering with a look-ahead mechanism. The described scheme yields results similar to the reference offline system while operating solely on a CPU with a low real-time factor (RTF) below 0.1 and a constant latency of around 5.5 s. In the next part of the work, our research moves toward the much more demanding and complex real-time processing of audio-visual data streams. For this purpose, we extend the above-mentioned audio processing scheme by adding an audio-video module. This module utilizes SyncNet combined with visual embeddings for identity tracking. Our resulting multi-modal SD framework then combines the outputs of the audio and audio-video modules using a new overlap-based fusion strategy. It yields diarization error rates that are competitive with existing state-of-the-art offline audio-visual methods while allowing us to process various audio-video streams, e.g., from Internet or TV broadcasts, in real time using a GPU and with the same latency as for audio data processing.
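
The real-time factor quoted in the abstract is conventionally defined as processing time divided by the duration of the audio processed, so an RTF below 0.1 means one second of audio takes under 0.1 s to process. The following is a minimal sketch of how such a figure could be measured for a block-wise pipeline. It is not the authors' code: `real_time_factor`, `measure_rtf`, and the `process_block` callback are hypothetical names introduced here for illustration, and the blocks are assumed to be sequences of samples at a known sample rate.

```python
import time
from typing import Callable, Iterable, Sequence


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return RTF = processing time / duration of the processed audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(
    process_block: Callable[[Sequence[float]], object],  # hypothetical pipeline step
    blocks: Iterable[Sequence[float]],
    sample_rate: int,
) -> float:
    """Time a block-wise processing function and return its overall RTF."""
    total_audio = 0.0  # seconds of audio seen so far
    total_proc = 0.0   # seconds spent inside process_block
    for block in blocks:
        start = time.perf_counter()
        process_block(block)
        total_proc += time.perf_counter() - start
        total_audio += len(block) / sample_rate
    return real_time_factor(total_proc, total_audio)
```

Note that RTF measures throughput only. The constant latency of around 5.5 s reported in the abstract is a separate quantity, and this sketch does not model it.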


Pages: 16
