Context-based environmental audio event recognition for scene understanding

Cited by: 1
Authors
Lu, Tong [1]
Wang, Gongyou [1]
Su, Feng [1]
Affiliations
[1] Nanjing Univ, Dept Comp Sci & Technol, Natl Key Lab Novel Software Technol, Nanjing 210008, Jiangsu, Peoples R China
Keywords
Acoustic scene; Audio event; Context modeling; Recognition; 10-CASE; CLASSIFICATION;
DOI
10.1007/s00530-014-0424-7
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Automatic audio content recognition has attracted increasing attention in the development of multimedia systems, and the most popular approaches combine frame-based features with statistical models or discriminative classifiers. Existing methods are effective for clean single-source event detection but may not perform well on unstructured environmental sounds, which have a broad, noise-like flat spectrum and widely varying compositions. We present an automatic acoustic scene understanding framework that detects audio events through two hierarchies, acoustic scene recognition and audio event recognition: the former proceeds by following dominant audio sources and in turn helps infer non-dominant audio events within the same scene by modeling their occurrence correlations. In the scene recognition hierarchy, we perform adaptive segmentation and feature extraction on every input acoustic scene stream through Eigen-audiospace and an optimized feature subspace, respectively. After background filtering, scene streams are recognized by modeling the observation density of dominant features with a two-level hidden Markov model. In the audio event recognition hierarchy, scene knowledge is characterized by an audio context model that describes the occurrence correlations of dominant and non-dominant audio events within the scene. Monte Carlo integration and gradient descent are employed to maximize the likelihood and correctly tag each audio event. To the best of our knowledge, this is the first work that models event correlations as scene context for robust audio event detection in complex and noisy environments. Note that, according to a recent report, the mean accuracy of human listeners on the acoustic scene classification task is only around 71% on the data collected in office environments from the DCASE dataset.
None of the existing methods performs well on all scene categories, and the average accuracy of the best performances of 11 recent methods is 53.8%. The proposed method achieves an average accuracy of 62.3% on the same dataset. Additionally, we create a 10-CASE dataset by manually collecting 5,250 audio clips covering 10 scene types and 21 event categories. Our experimental results on 10-CASE show that the proposed method achieves an improved average accuracy of 78.3%, and that the average accuracy of audio event recognition can be effectively improved by capturing dominant audio sources and inferring non-dominant events from the dominant ones through acoustic context modeling. Future work should explore the interactions between acoustic scene recognition and audio event detection and incorporate other modalities to further improve the accuracy of the proposed framework.
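The idea of inferring non-dominant events from a dominant one through occurrence correlations can be illustrated with a small sketch. This is not the authors' context model (the abstract describes likelihood maximization with Monte Carlo integration and gradient descent); it is a simplified stand-in that assumes smoothed co-occurrence counts as the context prior. The function names, event labels, and acoustic scores are all hypothetical.

```python
from collections import Counter


def fit_context(clips, smoothing=1.0):
    """Estimate P(event present | dominant present) from labeled clips.

    `clips` is a list of event-label lists, one per training clip of a
    single scene type (hypothetical data layout, not from the paper).
    """
    single = Counter()  # clips containing each event
    pair = Counter()    # clips containing an ordered pair of distinct events
    for clip in clips:
        present = set(clip)
        for a in present:
            single[a] += 1
            for b in present:
                if a != b:
                    pair[(a, b)] += 1

    def cond(event, dominant):
        # Laplace-smoothed conditional occurrence probability.
        return (pair[(dominant, event)] + smoothing) / (
            single[dominant] + 2.0 * smoothing
        )

    return cond


def rescore(acoustic_scores, dominant, cond):
    """Weight per-event acoustic scores by the context prior, then renormalize."""
    weighted = {e: s * cond(e, dominant) for e, s in acoustic_scores.items()}
    total = sum(weighted.values())
    return {e: w / total for e, w in weighted.items()}


# Hypothetical street-scene training clips; "traffic" is the dominant source.
clips = [
    ["traffic", "horn"],
    ["traffic", "horn"],
    ["traffic", "birds"],
    ["traffic", "horn", "footsteps"],
]
cond = fit_context(clips)

# An ambiguous segment: the acoustic classifier alone cannot decide.
rescored = rescore({"horn": 0.5, "birds": 0.5}, "traffic", cond)
```

With these toy counts, the context prior favors "horn" over "birds" for the ambiguous segment because "horn" co-occurs with "traffic" more often in the training clips.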
Pages: 507-524
Page count: 18
Related papers
50 in total
  • [41] Audio Event and Scene Recognition: A Unified Approach using Strongly and Weakly Labeled Data
    Kumar, Anurag
    Raj, Bhiksha
    2017 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2017, : 3475 - 3482
  • [42] Context-Based Intent Understanding Using an Activation Spreading Architecture
    Saffar, Mohammad Taghi
    Nicolescu, Mircea
    Nicolescu, Monica
    Rekabdar, Banafsheh
    2015 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2015, : 3002 - 3009
  • [43] Towards a context-based Bayesian recognition of transitions in locomotion activities
    Martinez-Hernandez, Uriel
    Meng, Lin
    Zhang, Dingguo
    Rubio-Solis, Adrian
    2020 29TH IEEE INTERNATIONAL CONFERENCE ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION (RO-MAN), 2020, : 677 - 682
  • [44] Context-based Data Augmentation for Improved Ballet Pose Recognition
    Bowditch, Margaux
    Van der Haar, Dustin
    5TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, BIG DATA, COMPUTING AND DATA COMMUNICATION SYSTEMS (ICABCD2022), 2022,
  • [45] ROBUST VISUAL TRACKING WITH CONTEXT-BASED ACTIVE OCCLUSION RECOGNITION
    Gu, Yueyang
    Qiao, Yu
    Xu, Kuan
    Xu, Hang
    Fang, Xingqi
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 1878 - 1882
  • [46] Conditional sequence model for context-based recognition of gaze aversion
    Morency, Louis-Philippe
    Darrell, Trevor
    MACHINE LEARNING FOR MULTIMODAL INTERACTION, 2008, 4892 : 11 - 23
  • [47] Context-Based Object Recognition: Indoor Versus Outdoor Environments
    Alameer, Ali
    Degenaar, Patrick
    Nazarpour, Kianoush
    ADVANCES IN COMPUTER VISION, VOL 2, 2020, 944 : 473 - 490
  • [48] ENTANGLEMENT LOSS FOR CONTEXT-BASED STILL IMAGE ACTION RECOGNITION
    Xin, Miao
    Wang, Shuhang
    Cheng, Jian
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1042 - 1047
  • [49] Context-based Face Recognition for Smart Web Tasking Applications
    Taherimakhsousi, Nina
    Mueller, Hausi A.
    2014 IEEE WORLD CONGRESS ON SERVICES (SERVICES), 2014, : 21 - 23
  • [50] Context-based recognition of process states using neural networks
    Srinivasan, R
    Wang, C
    Ho, WK
    Lim, KW
    CHEMICAL ENGINEERING SCIENCE, 2005, 60 (04) : 935 - 949