Real-time multilingual speech recognition and speaker diarization system based on Whisper segmentation

Authors
Lyu, Ke-Ming [1]
Lyu, Ren-yuan [1]
Chang, Hsien-Tsung [1,2,3]
Affiliations
[1] Chang Gung Univ, Comp Sci & Informat Engn, Taoyuan, Taiwan
[2] Chang Gung Mem Hosp, Phys Med & Rehabil, Taoyuan, Taiwan
[3] Chang Gung Univ, Bachelor Program Artificial Intelligence, Taoyuan, Taiwan
Keywords
Automatic speech recognition; Speaker diarization; Real-time system; Incremental clustering
DOI
10.7717/peerj-cs.1973
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This research presents a real-time multilingual speech recognition and speaker diarization system built on OpenAI's Whisper model. The system addresses the challenges of automatic speech recognition (ASR) and speaker diarization (SD) in dynamic, multispeaker environments, with a focus on accurately processing Mandarin speech with Taiwanese accents and handling frequent speaker switches. Traditional speech recognition systems often fall short in such complex multilingual and multispeaker contexts, particularly in SD. This study therefore integrates advanced speech recognition with speaker diarization techniques optimized for real-time applications. These optimizations include handling model outputs efficiently and incorporating speaker embedding technology. The system was evaluated on data from Taiwanese talk shows and political commentary programs featuring 46 diverse speakers. It achieved a word diarization error rate (WDER) of 2.68% in two-speaker scenarios and 11.65% in three-speaker scenarios, with an overall WDER of 6.96%. This performance is comparable to that of non-real-time baseline models and indicates that the system can adapt to varied, complex conversational dynamics, which the authors describe as a significant advancement in real-time multilingual speech processing.
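For readers unfamiliar with the metric, WDER is commonly defined as (S_IS + C_IS) / (S + C): among the words the recognizer got right (C) or substituted (S), the fraction assigned to the wrong speaker. The sketch below illustrates that common definition only; it is not the authors' evaluation code. The `AlignedWord` type and `wder` function are hypothetical names, and the sketch assumes that the reference and hypothesis words are already aligned, that insertions and deletions are already excluded, and that hypothesis speaker labels are already mapped onto reference labels.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlignedWord:
    """One aligned reference/hypothesis word pair (hypothetical helper type)."""
    ref_word: str
    hyp_word: str
    ref_speaker: str
    hyp_speaker: str


def wder(aligned: Sequence[AlignedWord]) -> float:
    """Word diarization error rate over aligned word pairs.

    Every pair counts toward the denominator (S + C), whether the word was
    recognized correctly or substituted. Pairs whose speaker labels differ
    count toward the numerator (S_IS + C_IS).
    """
    if not aligned:
        raise ValueError("WDER is undefined for an empty alignment")
    wrong_speaker = sum(1 for w in aligned if w.ref_speaker != w.hyp_speaker)
    return wrong_speaker / len(aligned)


# Illustrative data only: four aligned words, one attributed to the wrong speaker.
example = [
    AlignedWord("hello", "hello", "A", "A"),
    AlignedWord("there", "their", "A", "A"),  # ASR substitution, speaker correct
    AlignedWord("good", "good", "B", "A"),    # word correct, speaker wrong
    AlignedWord("morning", "morning", "B", "B"),
]
print(f"WDER = {wder(example):.2%}")
```

Because the denominator excludes inserted and deleted words, WDER measures speaker attribution on the words the recognizer actually aligned, and is usually reported alongside a word or character error rate.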
Pages: 19