Automatic detection of multi-speaker fragments with high time resolution

Cited by: 4
Authors
Kazimirova, E. [1 ]
Belyaev, A. [1 ,2 ]
Affiliations
[1] Neurodatalab, Miami, FL 33137 USA
[2] Lomonosov MSU, Moscow, Russia
Keywords
multi-speaker detection; convolutional neural network; harmonics analysis; audio segmentation; overlapped speech; interruption; conversational analysis; histogram equalization; speaker diarization
DOI
10.21437/Interspeech.2018-1878
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Interruptions and simultaneous talking are important patterns of speech behavior, yet approaches to their automatic detection in continuous audio data are scarce. We have developed a solution for automatic labeling of multi-speaker fragments based on harmonic trace analysis. Because harmonic traces in multi-speaker intervals form an irregular pattern, in contrast to the structured pattern typical of a single speaker, we apply computer vision methods to detect multi-speaker fragments. A convolutional neural network was trained on synthetic material to differentiate between single-speaker and multi-speaker fragments. The proposed method was evaluated on the SSPNet Conflict Corpus using its manual diarization, and we also examined factors affecting algorithm performance. The main advantages of the method are computational simplicity and high time resolution: segments as short as 0.5 seconds can be detected. The method demonstrates highly accurate results and may be used for speech segmentation, speaker tracking, content analysis such as conflict detection, and other practical purposes.
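The abstract describes classifying short spectrogram windows as single-speaker or multi-speaker with a CNN. The sketch below is only an illustration of that idea under stated assumptions: the paper's harmonic-trace extraction, window and STFT settings, and network architecture are not given in the abstract, so a plain log-magnitude spectrogram stands in for the harmonic-trace representation, and the names (OverlapCNN, spectrogram_patch, "recording.wav"), the 16 kHz sampling rate, and all layer sizes are hypothetical. Only the 0.5 s window length is taken from the abstract.

# Illustrative sketch, not the authors' implementation (see caveats above).
import numpy as np
import librosa
import torch
import torch.nn as nn

SR = 16000          # assumed sampling rate
WIN_SEC = 0.5       # minimum detectable segment length reported in the abstract

def spectrogram_patch(wave: np.ndarray) -> torch.Tensor:
    """Log-magnitude STFT of a short window, treated as a single-channel image."""
    spec = np.abs(librosa.stft(wave, n_fft=512, hop_length=128))
    logspec = librosa.amplitude_to_db(spec, ref=np.max)
    return torch.from_numpy(logspec).float().unsqueeze(0)  # shape (1, freq, time)

class OverlapCNN(nn.Module):
    """Small binary classifier: single-speaker vs. multi-speaker patch (hypothetical architecture)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

# Usage sketch: slide a 0.5 s window over a recording and label each patch.
# wave, _ = librosa.load("recording.wav", sr=SR)   # hypothetical input file
# model = OverlapCNN()                             # would be trained on synthetic mixtures
# step = int(0.25 * SR)
# win = int(WIN_SEC * SR)
# for start in range(0, len(wave) - win, step):
#     patch = spectrogram_patch(wave[start:start + win])
#     label = model(patch.unsqueeze(0)).argmax(dim=1).item()  # 0 = single, 1 = overlap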
Pages: 1388 - 1392
Page count: 5