AUDIO CAPTION: LISTEN AND TELL

Cited: 0
Authors:
Wu, Mengyue [1]
Dinkel, Heinrich [1]
Yu, Kai [1]
Affiliations:
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, SpeechLab, MoE Key Lab Artificial Intelligence, Shanghai, Peoples R China
Keywords:
Audio Caption; Audio Databases; Natural Language Generation; Recurrent Neural Networks
DOI:
None listed
Chinese Library Classification:
O42 [Acoustics]
Discipline codes:
070206; 082403
Abstract:
An increasing amount of research has shed light on machine perception of audio events, most of it concerning detection and classification tasks. However, human-like perception of audio scenes involves not only detecting and classifying sounds but also summarizing the relationships between different audio events. Comparable research exists for image captioning, yet the audio field remains largely unexplored. This paper introduces a manually annotated dataset for audio captioning. The aim is to automatically generate natural sentences describing audio scenes and to narrow the gap between machine perception of audio and of images. The whole dataset is labelled in Mandarin, with translated English annotations also included. A baseline encoder-decoder model is provided for both English and Mandarin; the two languages yield similar BLEU scores, and the model can generate understandable, data-related captions from the dataset.
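The abstract reports caption quality in BLEU. As a rough illustration only (not the paper's actual evaluation code, which presumably uses a standard multi-n-gram BLEU implementation), unigram BLEU with a brevity penalty can be sketched as:

```python
import math
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Simplified BLEU-1: clipped unigram precision times a brevity penalty.

    A hypothetical sketch for illustration; real BLEU combines precisions
    over n-grams up to length 4 and supports multiple references.
    """
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clip each candidate word's count by its count in the reference,
    # so repeating a correct word cannot inflate precision.
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = clipped / len(cand) if cand else 0.0
    # Brevity penalty discourages overly short candidate captions.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * precision

print(round(bleu1("a dog barks loudly", "a dog barks in the yard"), 3))  # → 0.455
```

Three of the four candidate words match the reference (precision 0.75), and the short candidate is further penalized by the brevity factor exp(1 - 6/4).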
Pages: 830-834
Page count: 5
Related papers (50 in total):
  • [1] Tell and listen
    Selge, Edgar
    [J]. THEATER HEUTE, 2018, : 8 - 10
  • [2] Show and Tell: A Neural Image Caption Generator
    Vinyals, Oriol
    Toshev, Alexander
    Bengio, Samy
    Erhan, Dumitru
    [J]. 2015 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2015, : 3156 - 3164
  • [3] A novel framework for automatic caption and audio generation
    Kulkarni, Chaitanya
    Monika, P.
    Preeti, B.
    Shruthi, S.
    [J]. MATERIALS TODAY-PROCEEDINGS, 2022, 65 : 3248 - 3252
  • [4] Fast Caption Alignment for Automatic Indexing of Audio
    Knight, Allan
    Almeroth, Kevin
    [J]. INTERNATIONAL JOURNAL OF MULTIMEDIA DATA ENGINEERING & MANAGEMENT, 2010, 1 (02): : 1 - 17
  • [5] Listen to the data - They have a story to tell
    Haggard, WH
    [J]. SYMPOSIUM ON ENVIRONMENTAL APPLICATIONS, 1996, : 15 - 21
  • [6] Learn and Tell: Learning Priors for Image Caption Generation
    Liu, Pei
    Peng, Dezhong
    Zhang, Ming
    [J]. APPLIED SCIENCES-BASEL, 2020, 10 (19): : 1 - 17
  • [7] Speech Evaluation Based on Deep Learning Audio Caption
    Zhang, Liu
    Zhang, Hanyi
    Guo, Jin
    Ji, Detao
    Liu, Qing
    Xie, Cheng
    [J]. ADVANCES IN E-BUSINESS ENGINEERING FOR UBIQUITOUS COMPUTING, 2020, 41 : 51 - 66
  • [8] CAN AUDIO CAPTIONS BE EVALUATED WITH IMAGE CAPTION METRICS?
    Zhou, Zelin
    Zhang, Zhiling
    Xu, Xuenan
    Xie, Zeyu
    Wu, Mengyue
    Zhu, Kenny Q.
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 981 - 985
  • [9] ACTUAL: Audio Captioning With Caption Feature Space Regularization
    Zhang, Yiming
    Yu, Hong
    Du, Ruoyi
    Tan, Zheng-Hua
    Wang, Wenwu
    Ma, Zhanyu
    Dong, Yuan
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 2643 - 2657
  • [10] LISTEN AND LEARN FROM NARRATIVES THAT TELL A STORY
    WEBBMITCHELL, B
    [J]. RELIGIOUS EDUCATION, 1990, 85 (04) : 617 - 630