Speech-music discrimination using deep visual feature extractors

Cited by: 28
Authors:
Papakostas, Michalis [1,2]
Giannakopoulos, Theodoros [1]
Affiliations:
[1] Natl Ctr Sci Res Demokritos, Inst Informat & Telecommun, Athens 15341, Greece
[2] Univ Texas Arlington, Dept Comp Sci, Arlington, TX 76019 USA
Keywords:
CNNs; Speech-music discrimination; Transfer learning; Audio analysis
DOI:
10.1016/j.eswa.2018.05.016
Chinese Library Classification (CLC):
TP18 [Theory of artificial intelligence]
Discipline codes:
081104; 0812; 0835; 1405
Abstract:
Speech-music discrimination is a traditional task in audio analytics, useful for a wide range of applications such as automatic speech recognition and radio broadcast monitoring, that focuses on segmenting audio streams and classifying each segment as either speech or music. In this paper we investigate the capabilities of Convolutional Neural Networks (CNNs) with regard to the speech-music discrimination task. Instead of representing the audio content using handcrafted audio features, as traditional methods do, we use deep structures to learn visual feature dependencies as they appear in the spectrogram domain (i.e. we train a CNN using audio spectrograms as input images). The main contribution of our work focuses on the potential of using pre-trained deep architectures along with transfer learning to train robust audio classifiers for the particular task of speech-music discrimination. We highlight the superiority of the proposed methods, compared both to typical audio-based methods and to deep-learning methods that adopt handcrafted features, and we evaluate our system in terms of classification performance and run-time execution. To our knowledge, this is the first work that investigates CNNs for the task of speech-music discrimination and, more generally, the first that exploits transfer learning across very different domains for audio modeling using deep learning. In particular, we fine-tune a deep architecture originally trained for the ImageNet classification task, using a relatively small amount of data (almost 80 min of training audio samples) along with data augmentation. We evaluate our system through extensive experimentation against three different datasets: first on a real-world dataset of more than 10 h of uninterrupted radio broadcasts and then, for comparison purposes, we evaluate our best method on two publicly available datasets that were designed specifically for the task of speech-music discrimination. Our results indicate that CNNs can significantly outperform the current state of the art on all three test datasets, especially when transfer learning is applied. All the discussed methods, along with the whole experimental setup and the respective datasets, are openly provided for reproduction and further experimentation. Published by Elsevier Ltd.
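A minimal sketch of the pipeline the abstract describes (spectrograms treated as images plus ImageNet transfer learning), assuming Python with librosa, PyTorch, and torchvision; the ResNet-18 backbone, mel-spectrogram parameters, and helper names below are illustrative assumptions, not the authors' released implementation:

    # Sketch: treat log-mel spectrograms as images and fine-tune an
    # ImageNet-pretrained CNN for two-class speech/music discrimination.
    # Backbone, parameters, and names here are assumptions; consult the
    # paper's openly provided code for the actual setup.
    import numpy as np
    import librosa
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision import models

    NUM_CLASSES = 2  # speech vs. music

    def spectrogram_image(path, sr=16000, duration=2.0):
        """Render a fixed-length audio segment as a 3-channel log-mel
        'image' tensor so that an image CNN can consume it."""
        y, _ = librosa.load(path, sr=sr, duration=duration)
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
        S_db = librosa.power_to_db(S, ref=np.max)      # log-scaled power
        S_norm = (S_db - S_db.min()) / (S_db.max() - S_db.min() + 1e-8)
        return torch.tensor(np.stack([S_norm] * 3), dtype=torch.float32)

    # Transfer learning: load ImageNet weights, swap the 1000-way head
    # for a fresh 2-way classifier, then fine-tune on labelled audio.
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()

    def train_step(batch_imgs, batch_labels):
        """One fine-tuning step; batch_imgs has shape (N, 3, n_mels, frames)."""
        model.train()
        optimizer.zero_grad()
        x = F.interpolate(batch_imgs, size=(224, 224))  # backbone input size
        loss = criterion(model(x), batch_labels)
        loss.backward()
        optimizer.step()
        return loss.item()

In a complete system the audio stream would first be segmented into short windows, each classified as above; the data augmentation the abstract mentions would be applied to the spectrogram "images" during fine-tuning.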
Pages: 334-344
Page count: 11
Related papers
50 records in total (10 shown)
  • [1] Speech-music discrimination: A deep learning perspective
    Pikrakis, Aggelos
    Theodoridis, Sergios
    2014 PROCEEDINGS OF THE 22ND EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2014, : 616 - 620
  • [2] Improvement to speech-music discrimination using sinusoidal model based features
    Shirazi, Jalil
    Ghaemmaghami, Shahrokh
    MULTIMEDIA TOOLS AND APPLICATIONS, 2010, 50 (02) : 415 - 435
  • [3] Rhythm detection for speech-music discrimination in MPEG compressed domain
    Jarina, R
    O'Connor, N
    Marlow, S
    Murphy, N
    DSP 2002: 14TH INTERNATIONAL CONFERENCE ON DIGITAL SIGNAL PROCESSING PROCEEDINGS, VOLS 1 AND 2, 2002, : 129 - 132
  • [4] Speech-Music Segmentation System for Speech Recognition
    Demir, Cemil
    Dogan, Mehmet Ugur
    2009 IEEE 17TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, VOLS 1 AND 2, 2009, : 846 - 849
  • [5] A New Feature for Speech/Music Discrimination
    Huang, Houjun
    Xu, Yunfei
    Zhou, Ruohua
    INTERNATIONAL ACADEMIC CONFERENCE ON THE INFORMATION SCIENCE AND COMMUNICATION ENGINEERING (ISCE 2014), 2014, : 133 - 137
  • [6] Feature extraction for speech and music discrimination
    Hou, Huiyu
    Sadka, Abdul
    Jiang, Richard M.
    2008 INTERNATIONAL WORKSHOP ON CONTENT-BASED MULTIMEDIA INDEXING, 2008, : 154 - 157
  • [7] An RNN-Based Speech-Music Discrimination Used for Hybrid Audio Coder
    Yang, Wanzhao
    Tu, Weiping
    Zheng, Jiaxi
    Zhang, Xiong
    Yang, Yuhong
    Song, Yucheng
    MULTIMEDIA MODELING, MMM 2018, PT I, 2018, 10704 : 81 - 92
  • [8] From close listening to distant listening: Developing tools for Speech-Music discrimination of Danish music radio
    Have, Iben
    Enevoldsen, Kenneth
    DIGITAL HUMANITIES QUARTERLY, 2021, 15 (01)
  • [9] A speech-music discriminator using HILN model based features
    Thoshkahna, Balaji
    Sudha, V
    Ramakrishnan, K. R.
    2006 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-13, 2006, : 5283 - 5286