Detecting paralinguistic events in audio stream using context in features and probabilistic decisions

Cited by: 8

Authors:

Gupta, Rahul [1]

Audhkhasi, Kartik [2]

Lee, Sungbok [1]

Narayanan, Shrikanth [1]

Affiliations:

[1] Univ So Calif, Signal Anal & Interpretat Lab, Los Angeles, CA 90089 USA

[2] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA

Source:

COMPUTER SPEECH AND LANGUAGE | 2016 / Vol. 36

Funding:

US National Science Foundation;

Keywords:

Paralinguistic event; Laughter; Filler; Probability smoothing; Probability masking; NONVERBAL-COMMUNICATION; SPEECH RECOGNITION; AUTISM; CUES; CLASSIFICATION; PERSONALITY; ATTITUDES; LAUGHTER; SIGHS;

DOI:

10.1016/j.csl.2015.08.003

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Non-verbal communication involves encoding, transmission and decoding of non-lexical cues and is realized using vocal (e.g. prosody) or visual (e.g. gaze, body language) channels during conversation. These cues perform the function of maintaining conversational flow, expressing emotions, and marking personality and interpersonal attitude. In particular, non-verbal cues in speech such as paralanguage and non-verbal vocal events (e.g. laughters, sighs, cries) are used to nuance meaning and convey emotions, mood and attitude. For instance, laughters are associated with affective expressions while fillers (e.g. um, ah) are used to hold the floor during a conversation. In this paper we present an automatic non-verbal vocal event detection system focusing on the detection of laughter and fillers. We extend our system presented during the Interspeech 2013 Social Signals Sub-challenge (which was the winning entry in the challenge) for frame-wise event detection and test several schemes for incorporating local context during detection. Specifically, we incorporate context at two separate levels in our system: (i) the raw frame-wise features and (ii) the output decisions. Furthermore, our system processes the output probabilities based on a few heuristic rules in order to reduce erroneous frame-based predictions. Our overall system achieves an Area Under the Receiver Operating Characteristics curve of 95.3% for detecting laughters and 90.4% for fillers on the test set drawn from the data specifications of the Interspeech 2013 Social Signals Sub-challenge. We perform further analysis to understand the interrelation between the features and the obtained results. Specifically, we conduct a feature sensitivity analysis and correlate it with each feature's stand-alone performance. The observations suggest that the trained system is more sensitive to a feature carrying higher discriminability, with implications towards a better system design. (C) 2015 Elsevier Ltd. All rights reserved.
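The two levels of context named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the smoothing scheme, window sizes, or heuristic rules, so the function names `stack_context` and `smooth_probabilities`, the half-window parameter `k`, the edge handling, and the plain moving average are all assumptions made for illustration.

```python
def stack_context(frames, k):
    """Feature-level context (assumed scheme): concatenate each frame's
    features with those of its k neighbours on each side. Indices past
    either end are clamped, so boundary frames are repeated."""
    n = len(frames)
    stacked = []
    for t in range(n):
        row = []
        for j in range(t - k, t + k + 1):
            row.extend(frames[min(max(j, 0), n - 1)])
        stacked.append(row)
    return stacked


def smooth_probabilities(probs, k):
    """Decision-level context (assumed scheme): replace each frame-wise
    event probability with the mean over a window of up to 2k+1 frames,
    truncated at the sequence edges."""
    n = len(probs)
    smoothed = []
    for t in range(n):
        window = probs[max(0, t - k):min(n, t + k + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical frame-wise laughter probabilities with one isolated spike.
frame_probs = [0.1, 0.1, 0.9, 0.1, 0.1]
print(smooth_probabilities(frame_probs, 1))
```

In this toy input the isolated high-probability frame is pulled towards its neighbours, which is the kind of erroneous single-frame prediction that output-level context is meant to suppress.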


Pages: 72-92

Page count: 21

9 items in total

[1] Speaker Detection in Audio Stream via Probabilistic Prediction Using Generalized GEBI
Sakata, Koki
Sakashita, Shota
Matsuo, Kazuya
Kurogi, Shuichi
NEURAL INFORMATION PROCESSING, ICONIP 2016, PT IV, 2016, 9950 : 302 - 311
[2] Efficient Image Retrieval using Image and Audio Features in Video Stream
Shin, In-Kyoung
Ahn, Hyochang
Lee, Yong-Hwan
2016 10TH INTERNATIONAL CONFERENCE ON INNOVATIVE MOBILE AND INTERNET SERVICES IN UBIQUITOUS COMPUTING (IMIS), 2016, : 422 - 424
[3] Probabilistic Trainable Segmenter for Call Center Audio Using Multiple Features
Zinovieva, Nina
Zhuang, Xiaodan
Peterson, Pat
Alwan, Joe
Prasad, Rohit
14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 2053 - 2057
[4] Detecting semantic concepts using context and audiovisual features
Naphade, MR
Huang, TS
IEEE WORKSHOP ON DETECTION AND RECOGNITION OF EVENTS IN VIDEO, PROCEEDINGS, 2001, : 92 - 98
[5] Multiple Events Detection Using Context-Intelligence Features
Ghadi, Yazeed Yasin
Akhter, Israr
Alsuhibany, Suliman A.
al Shloul, Tamara
Jalal, Ahmad
Kim, Kibum
INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2022, 34 (03): : 1455 - 1471
[6] DETECTING THE DURATION OF INCOMPLETE OBSTRUCTIVE SLEEP APNEA EVENTS USING INTERHEMISPHERIC FEATURES OF ELECTROENCEPHALOGRAPHY
Hsu, Chien-Chang
Cai, Zhen-Gjia
Mei, Hsing
Chiu, Hou-Chang
Lin, Chia-Mo
INTERNATIONAL JOURNAL OF INNOVATIVE COMPUTING INFORMATION AND CONTROL, 2013, 9 (02): : 705 - 725
[7] FALCoN: Detecting and classifying abusive language in social networks using context features and unlabeled data
Tuarob, Suppawong
Satravisut, Manisa
Sangtunchai, Pochara
Nunthavanich, Sakunrat
Noraset, Thanapon
INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (04)
[8] Detecting Forged Audio Files Using "Mixed Paste" Command: A Deep Learning Approach Based on Korean Phonemic Features
Son, Yeongmin
Park, Jae Wan
SENSORS, 2024, 24 (06)
[9] Recognition of Urban Sound Events Using Deep Context-Aware Feature Extractors and Handcrafted Features
Giannakopoulos, Theodore
Spyrou, Evaggelos
Perantonis, Stavros J.
ARTIFICIAL INTELLIGENCE APPLICATIONS AND INNOVATIONS (AIAI 2019), 2019, 560 : 184 - 195
