Detecting paralinguistic events in audio stream using context in features and probabilistic decisions

Cited by: 8
Authors
Gupta, Rahul [1]
Audhkhasi, Kartik [2]
Lee, Sungbok [1]
Narayanan, Shrikanth [1]
Affiliations
[1] Univ So Calif, Signal Anal & Interpretat Lab, Los Angeles, CA 90089 USA
[2] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA
Source
Funding
National Science Foundation (USA);
Keywords
Paralinguistic event; Laughter; Filler; Probability smoothing; Probability masking; NONVERBAL-COMMUNICATION; SPEECH RECOGNITION; AUTISM; CUES; CLASSIFICATION; PERSONALITY; ATTITUDES; LAUGHTER; SIGHS;
DOI
10.1016/j.csl.2015.08.003
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Non-verbal communication involves the encoding, transmission and decoding of non-lexical cues and is realized through vocal (e.g. prosody) or visual (e.g. gaze, body language) channels during conversation. These cues maintain conversational flow, express emotions, and mark personality and interpersonal attitude. In particular, non-verbal cues in speech such as paralanguage and non-verbal vocal events (e.g. laughter, sighs, cries) are used to nuance meaning and convey emotions, mood and attitude. For instance, laughter is associated with affective expression, while fillers (e.g. um, ah) are used to hold the floor during a conversation. In this paper we present an automatic non-verbal vocal event detection system focusing on the detection of laughter and fillers. We extend our system presented at the Interspeech 2013 Social Signals Sub-challenge (the winning entry in the challenge) for frame-wise event detection and test several schemes for incorporating local context during detection. Specifically, we incorporate context at two separate levels in our system: (i) the raw frame-wise features and (ii) the output decisions. Furthermore, our system processes the output probabilities based on a few heuristic rules in order to reduce erroneous frame-based predictions. Our overall system achieves an Area Under the Receiver Operating Characteristics curve of 95.3% for detecting laughter and 90.4% for fillers on the test set drawn from the data specifications of the Interspeech 2013 Social Signals Sub-challenge. We perform further analysis to understand the interrelation between the features and the obtained results. Specifically, we conduct a feature sensitivity analysis and correlate it with each feature's standalone performance. The observations suggest that the trained system is more sensitive to features carrying higher discriminability, which has implications for better system design. © 2015 Elsevier Ltd. All rights reserved.
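
The two levels of context named in the abstract can be illustrated with a minimal sketch. The abstract does not specify window sizes, the smoothing filter, or the heuristic rules, so everything below is an assumption for illustration only: the function names `stack_context` and `smooth_probabilities` are hypothetical, feature-level context is approximated by concatenating neighbouring frames, and decision-level context by a plain moving average over frame-wise probabilities.

```python
def stack_context(frames, half_window):
    """Feature-level context (assumed scheme): append neighbouring frames to each frame.

    frames is a list of per-frame feature lists. Edge frames are padded
    by repeating the first or last frame.
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        row = []
        for offset in range(-half_window, half_window + 1):
            # Clamp the index so boundary frames reuse the first/last frame.
            neighbour = frames[min(max(t + offset, 0), n - 1)]
            row.extend(neighbour)
        stacked.append(row)
    return stacked


def smooth_probabilities(probs, half_window):
    """Decision-level context (assumed scheme): moving average of frame probabilities."""
    n = len(probs)
    width = 2 * half_window + 1
    smoothed = []
    for t in range(n):
        window = [probs[min(max(t + o, 0), n - 1)]
                  for o in range(-half_window, half_window + 1)]
        smoothed.append(sum(window) / width)
    return smoothed


if __name__ == "__main__":
    # Toy input: three frames with two features each, and five frame probabilities.
    print(stack_context([[1, 2], [3, 4], [5, 6]], half_window=1))
    print(smooth_probabilities([0.0, 0.0, 1.0, 0.0, 0.0], half_window=1))
```

In this sketch an isolated one-frame spike is spread over its neighbours and reduced in height, which is the general way smoothing suppresses spurious single-frame detections; the system described in the paper may use a different filter and additional masking rules.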
Pages: 72-92
Number of pages: 21
Related papers
9 in total
  • [1] Speaker Detection in Audio Stream via Probabilistic Prediction Using Generalized GEBI
    Sakata, Koki
    Sakashita, Shota
    Matsuo, Kazuya
    Kurogi, Shuichi
    NEURAL INFORMATION PROCESSING, ICONIP 2016, PT IV, 2016, 9950 : 302 - 311
  • [2] Efficient Image Retrieval using Image and Audio Features in Video Stream
    Shin, In-Kyoung
    Ahn, Hyochang
    Lee, Yong-Hwan
    2016 10TH INTERNATIONAL CONFERENCE ON INNOVATIVE MOBILE AND INTERNET SERVICES IN UBIQUITOUS COMPUTING (IMIS), 2016, : 422 - 424
  • [3] Probabilistic Trainable Segmenter for Call Center Audio Using Multiple Features
    Zinovieva, Nina
    Zhuang, Xiaodan
    Peterson, Pat
    Alwan, Joe
    Prasad, Rohit
    14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 2053 - 2057
  • [4] Detecting semantic concepts using context and audiovisual features
    Naphade, MR
    Huang, TS
    IEEE WORKSHOP ON DETECTION AND RECOGNITION OF EVENTS IN VIDEO, PROCEEDINGS, 2001, : 92 - 98
  • [5] Multiple Events Detection Using Context-Intelligence Features
    Ghadi, Yazeed Yasin
    Akhter, Israr
    Alsuhibany, Suliman A.
    al Shloul, Tamara
    Jalal, Ahmad
    Kim, Kibum
    INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2022, 34 (03): : 1455 - 1471
  • [6] DETECTING THE DURATION OF INCOMPLETE OBSTRUCTIVE SLEEP APNEA EVENTS USING INTERHEMISPHERIC FEATURES OF ELECTROENCEPHALOGRAPHY
    Hsu, Chien-Chang
    Cai, Zhen-Gjia
    Mei, Hsing
    Chiu, Hou-Chang
    Lin, Chia-Mo
    INTERNATIONAL JOURNAL OF INNOVATIVE COMPUTING INFORMATION AND CONTROL, 2013, 9 (02): : 705 - 725
  • [7] FALCoN: Detecting and classifying abusive language in social networks using context features and unlabeled data
    Tuarob, Suppawong
    Satravisut, Manisa
    Sangtunchai, Pochara
    Nunthavanich, Sakunrat
    Noraset, Thanapon
    INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (04)
  • [8] Detecting Forged Audio Files Using "Mixed Paste" Command: A Deep Learning Approach Based on Korean Phonemic Features
    Son, Yeongmin
    Park, Jae Wan
    SENSORS, 2024, 24 (06)
  • [9] Recognition of Urban Sound Events Using Deep Context-Aware Feature Extractors and Handcrafted Features
    Giannakopoulos, Theodore
    Spyrou, Evaggelos
    Perantonis, Stavros J.
    ARTIFICIAL INTELLIGENCE APPLICATIONS AND INNOVATIONS (AIAI 2019), 2019, 560 : 184 - 195