AUDIO CAPTION: LISTEN AND TELL

Cited: 0
Authors:
Wu, Mengyue [1]
Dinkel, Heinrich [1]
Yu, Kai [1]
Affiliations:
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, SpeechLab, MoE Key Lab Artificial Intelligence, Shanghai, Peoples R China
Keywords:
Audio Caption; Audio Databases; Natural Language Generation; Recurrent Neural Networks
DOI:
None listed
Chinese Library Classification:
O42 [Acoustics]
Discipline codes:
070206; 082403
Abstract:
An increasing amount of research has shed light on machine perception of audio events, most of it concerning detection and classification tasks. However, human-like perception of audio scenes involves not only detecting and classifying sounds but also summarizing the relationships between different audio events. Comparable research exists for image captioning, yet the audio field remains largely unexplored. This paper introduces a manually annotated dataset for audio captioning. The aim is to automatically generate natural sentences describing audio scenes and to narrow the gap between machine perception of audio and of images. The whole dataset is labelled in Mandarin, with translated English annotations also included. A baseline encoder-decoder model is provided for both English and Mandarin; the two languages yield similar BLEU scores, and the model can generate understandable, data-related captions from the dataset.
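The abstract reports caption quality in BLEU. As a rough illustration only (not the paper's actual evaluation code, which presumably uses a standard multi-n-gram BLEU implementation), unigram BLEU with a brevity penalty can be sketched as:

```python
import math
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Simplified BLEU-1: clipped unigram precision times a brevity penalty.

    A hypothetical sketch for illustration; real BLEU combines precisions
    over n-grams up to length 4 and supports multiple references.
    """
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clip each candidate word's count by its count in the reference,
    # so repeating a correct word cannot inflate precision.
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = clipped / len(cand) if cand else 0.0
    # Brevity penalty discourages overly short candidate captions.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * precision

print(round(bleu1("a dog barks loudly", "a dog barks in the yard"), 3))  # → 0.455
```

Three of the four candidate words match the reference (precision 0.75), and the short candidate is further penalized by the brevity factor exp(1 - 6/4).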
Pages: 830-834
Page count: 5
Related papers (50 in total):
  • [1] Tell and listen
    Selge, Edgar
    [J]. THEATER HEUTE, 2018, : 8 - 10
  • [2] Show and Tell: A Neural Image Caption Generator
    Vinyals, Oriol
    Toshev, Alexander
    Bengio, Samy
    Erhan, Dumitru
    [J]. 2015 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2015, : 3156 - 3164
  • [3] A novel framework for automatic caption and audio generation
    Kulkarni, Chaitanya
    Monika, P.
    Preeti, B.
    Shruthi, S.
    [J]. MATERIALS TODAY-PROCEEDINGS, 2022, 65 : 3248 - 3252
  • [4] Fast Caption Alignment for Automatic Indexing of Audio
    Knight, Allan
    Almeroth, Kevin
    [J]. INTERNATIONAL JOURNAL OF MULTIMEDIA DATA ENGINEERING & MANAGEMENT, 2010, 1 (02): : 1 - 17
  • [5] Listen to the data - They have a story to tell
    Haggard, WH
    [J]. SYMPOSIUM ON ENVIRONMENTAL APPLICATIONS, 1996, : 15 - 21
  • [6] Learn and Tell: Learning Priors for Image Caption Generation
    Liu, Pei
    Peng, Dezhong
    Zhang, Ming
    [J]. APPLIED SCIENCES-BASEL, 2020, 10 (19): : 1 - 17
  • [7] Speech Evaluation Based on Deep Learning Audio Caption
    Zhang, Liu
    Zhang, Hanyi
    Guo, Jin
    Ji, Detao
    Liu, Qing
    Xie, Cheng
    [J]. ADVANCES IN E-BUSINESS ENGINEERING FOR UBIQUITOUS COMPUTING, 2020, 41 : 51 - 66
  • [8] CAN AUDIO CAPTIONS BE EVALUATED WITH IMAGE CAPTION METRICS?
    Zhou, Zelin
    Zhang, Zhiling
    Xu, Xuenan
    Xie, Zeyu
    Wu, Mengyue
    Zhu, Kenny Q.
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 981 - 985
  • [9] ACTUAL: Audio Captioning With Caption Feature Space Regularization
    Zhang, Yiming
    Yu, Hong
    Du, Ruoyi
    Tan, Zheng-Hua
    Wang, Wenwu
    Ma, Zhanyu
    Dong, Yuan
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 2643 - 2657
  • [10] LISTEN AND LEARN FROM NARRATIVES THAT TELL A STORY
    WEBBMITCHELL, B
    [J]. RELIGIOUS EDUCATION, 1990, 85 (04) : 617 - 630