Towards Generating Diverse Audio Captions via Adversarial Training

Cited by: 0
Authors
Mei, Xinhao [1 ]
Liu, Xubo [1 ]
Sun, Jianyuan [1 ]
Plumbley, Mark D. [1 ]
Wang, Wenwu [1 ]
Affiliations
[1] Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford GU2 7XH, England
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
Generators; Training; Measurement; Task analysis; Hybrid power systems; Semantics; Maximum likelihood estimation; Audio captioning; GANs; deep learning; cross-modal task; reinforcement learning;
DOI
10.1109/TASLP.2024.3416686
Chinese Library Classification
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Automated audio captioning is a cross-modal translation task that describes the content of audio clips with natural-language sentences. The task has attracted increasing attention, and substantial progress has been made in recent years. Captions generated by existing models are generally faithful to the content of audio clips; however, these machine-generated captions are often deterministic (e.g., producing a fixed caption for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., producing the same caption for similar audio clips). When people are asked to describe the content of an audio clip, different listeners tend to focus on different sound events and describe the clip from various aspects, using distinct words and grammar. We believe that an audio captioning system should be able to generate diverse captions, either for a fixed audio clip or across similar audio clips. To this end, we propose an adversarial training framework based on a conditional generative adversarial network (C-GAN) to improve the diversity of audio captioning systems. A caption generator and two hybrid discriminators compete and are learned jointly: the caption generator can be any standard encoder-decoder captioning model, and the hybrid discriminators assess the generated captions against different criteria, such as their naturalness and semantics. Experiments on the Clotho dataset show that the proposed model generates captions with better diversity than state-of-the-art methods.
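The generator/two-discriminator setup described in the abstract can be sketched schematically. Everything below is a hypothetical stand-in: the stub functions replace real neural networks, and the scalar "reward" stands in for the policy-gradient update the paper's framework would need, since sampled caption tokens are discrete and not directly differentiable.

```python
# Schematic sketch of a C-GAN training step for audio captioning.
# All components are hypothetical stubs, not the authors' implementation.
import random

random.seed(0)

def generator(audio_feat):
    """Stub caption generator; in practice any encoder-decoder captioning model."""
    vocab = ["a", "dog", "barks", "loudly", "while", "birds", "sing"]
    return [random.choice(vocab) for _ in range(4)]

def naturalness_disc(caption):
    """Stub discriminator scoring how natural a caption reads, in [0, 1)."""
    return random.random()

def semantic_disc(audio_feat, caption):
    """Stub discriminator scoring audio-caption semantic alignment, in [0, 1)."""
    return random.random()

def training_step(audio_feat, human_caption):
    fake = generator(audio_feat)
    # Discriminators are pushed to score human captions high and generated ones low.
    d_loss = ((1 - naturalness_disc(human_caption)) + naturalness_disc(fake)
              + (1 - semantic_disc(audio_feat, human_caption))
              + semantic_disc(audio_feat, fake))
    # The generator's reward combines both criteria; with discrete tokens this
    # would drive a reinforcement-learning (policy-gradient) update.
    g_reward = naturalness_disc(fake) + semantic_disc(audio_feat, fake)
    return d_loss, g_reward

d_loss, g_reward = training_step(audio_feat=[0.1, 0.2],
                                 human_caption=["a", "dog", "barks"])
```

Because each stub returns a fresh random score, the sketch only illustrates the data flow: two separate critics feed one generator objective, which is the hybrid-discriminator idea the abstract describes.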
Pages: 3311-3323
Page count: 13
Related Papers
50 records in total
  • [41] Reliably fast adversarial training via latent adversarial perturbation
    Park, Geon Yeong
    Lee, Sang Wan
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 7738-7747
  • [42] To be Robust or to be Fair: Towards Fairness in Adversarial Training
    Xu, Han
    Liu, Xiaorui
    Li, Yaxin
    Jain, Anil K.
    Tang, Jiliang
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021
  • [43] Towards Generating Structurally Realistic Models by Generative Adversarial Networks
    Rahimi, Abbas
    Tisi, Massimo
    Rahimi, Shekoufeh Kolahdouz
    Berardinelli, Luca
    2023 ACM/IEEE INTERNATIONAL CONFERENCE ON MODEL DRIVEN ENGINEERING LANGUAGES AND SYSTEMS COMPANION, MODELS-C, 2023: 597-604
  • [44] Bilateral Adversarial Training: Towards Fast Training of More Robust Models Against Adversarial Attacks
    Wang, Jianyu
    Zhang, Haichao
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 6628-6637
  • [45] Answer-based Adversarial Training for Generating Clarification Questions
    Rao, Sudha
    Daume, Hal, III
    2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019: 143-155
  • [46] MGVC: A Mask Voice Conversion Using Generating Adversarial Training
    Lin, Pingyuan
    Lian, Jie
    Dai, Yuxing
    INTELLIGENT COMPUTING THEORIES AND APPLICATION, ICIC 2022, PT II, 2022, 13394: 579-587
  • [47] Adversarial Defense via Learning to Generate Diverse Attacks
    Jang, Yunseok
    Zhao, Tianchen
    Hong, Seunghoon
    Lee, Honglak
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 2740-2749
  • [48] Unsupervised Diverse Colorization via Generative Adversarial Networks
    Cao, Yun
    Zhou, Zhiming
    Zhang, Weinan
    Yu, Yong
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2017, PT I, 2017, 10534: 151-166
  • [49] Towards Visualizing and Detecting Audio Adversarial Examples for Automatic Speech Recognition
    Zong, Wei
    Chow, Yang-Wai
    Susilo, Willy
    INFORMATION SECURITY AND PRIVACY, ACISP 2021, 2021, 13083: 531-549
  • [50] TOWARDS AUDIO TO SCENE IMAGE SYNTHESIS USING GENERATIVE ADVERSARIAL NETWORK
    Wan, Chia-Hung
    Chuang, Shun-Po
    Lee, Hung-Yi
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019: 496-500