A Transfer Learning Based Feature Extractor for Polyphonic Sound Event Detection Using Connectionist Temporal Classification

被引:49
|
作者
Wang, Yun [1 ]
Metze, Florian [1 ]
机构
[1] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
关键词
sound event detection (SED); connectionist temporal classification (CTC); transfer learning; convolutional neural networks (CNN);
D O I
10.21437/Interspeech.2017-1469
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Sound event detection is the task of detecting the type, onset time, and offset time of sound events in audio streams. The mainstream solution is recurrent neural networks (RNNs). which usually predict the probability of each sound event at every time step. Connectionist temporal classification (CTC) has been applied in order to relax the need for exact annotations of onset and offset times: the CTC output layer is expected to generate a peak for each event boundary where the acoustic signal is most salient. However, with limited training data, the CTC network has been found to train slowly, and generalize poorly to new data. In this paper, we try to introduce knowledge learned from a much larger corpus into the CTC network. We train two variants of SoundNet, a deep convolutional network that takes the audio tracks of videos as input, and tries to approximate the visual information extracted by an image recognition network. A lower part of SoundNet or its variants is then used as a feature extractor for the CTC network to perform sound event detection. We show that the new feature extractor greatly accelerates the convergence of the CTC network, and slightly improves the generalization.
引用
收藏
页码:3097 / 3101
页数:5
相关论文
共 50 条
  • [1] A FIRST ATTEMPT AT POLYPHONIC SOUND EVENT DETECTION USING CONNECTIONIST TEMPORAL CLASSIFICATION
    Wang, Yun
    Metze, Florian
    2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 2986 - 2990
  • [2] Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention
    Jin, Ye
    Wang, Mei
    Luo, Liyan
    Zhao, Dinghao
    Liu, Zhanqi
    SENSORS, 2022, 22 (18)
  • [3] SOUND EVENT DETECTION WITH SEQUENTIALLY LABELLED DATA BASED ON CONNECTIONIST TEMPORAL CLASSIFICATION AND UNSUPERVISED CLUSTERING
    Hou, Yuanbo
    Kong, Qiuqiang
    Li, Shengchen
    Plumbley, Mark D.
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 46 - 50
  • [4] Two-stage Polyphonic Sound Event Detection Based on Faster R-CNN-LSTM with Multi-token Connectionist Temporal Classification
    Park, Inyoung
    Kim, Hong Kook
    INTERSPEECH 2020, 2020, : 856 - 860
  • [5] POLYPHONIC SOUND EVENT DETECTION USING CONVOLUTIONAL BIDIRECTIONAL LSTM AND SYNTHETIC DATA-BASED TRANSFER LEARNING
    Jung, Seokwon
    Park, Jungbae
    Lee, Sangwan
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 885 - 889
  • [6] A survey of Deep Learning for Polyphonic Sound event detection
    Dang, An
    Vu, Toan H.
    Wang, Jia-Ching
    PROCEEDINGS OF THE 2017 INTERNATIONAL CONFERENCE ON ORANGE TECHNOLOGIES (ICOT), 2017, : 75 - 78
  • [7] PEER COLLABORATIVE LEARNING FOR POLYPHONIC SOUND EVENT DETECTION
    Endo, Hayato
    Nishizaki, Hiromitsu
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 826 - 830
  • [8] CONNECTIONIST TEMPORAL LOCALIZATION FOR SOUND EVENT DETECTION WITH SEQUENTIAL LABELING
    Wang, Yun
    Metze, Florian
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 745 - 749
  • [9] Connectionist Temporal Classification-based Sound Event Encoder for Converting Sound Events into Onomatopoeic Representations
    Miyazaki, Koichi
    Hayashi, Tomoki
    Toda, Tomoki
    Takeda, Kazuya
    2018 26TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2018, : 852 - 856
  • [10] Filterbank Learning for Deep Neural Network Based Polyphonic Sound Event Detection
    Cakir, Emre
    Ozan, Ezgi Can
    Virtanen, Tuomas
    2016 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2016, : 3399 - 3406