Towards Generating Diverse Audio Captions via Adversarial Training

Cited: 0
Authors
Mei, Xinhao [1 ]
Liu, Xubo [1 ]
Sun, Jianyuan [1 ]
Plumbley, Mark D. [1 ]
Wang, Wenwu [1 ]
Affiliations
[1] Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford GU2 7XH, England
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
Generators; Training; Measurement; Task analysis; Hybrid power systems; Semantics; Maximum likelihood estimation; Audio captioning; GANs; deep learning; cross-modal task; reinforcement learning;
DOI
10.1109/TASLP.2024.3416686
Chinese Library Classification
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences. This task has attracted increasing attention and substantial progress has been made in recent years. Captions generated by existing models are generally faithful to the content of audio clips; however, these machine-generated captions are often deterministic (e.g., generating a fixed caption for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., generating the same caption for similar audio clips). When people are asked to describe the content of an audio clip, different people tend to focus on different sound events and describe the clip diversely from various aspects, using distinct words and grammar. We believe that an audio captioning system should have the ability to generate diverse captions, either for a fixed audio clip or across similar audio clips. To this end, we propose an adversarial training framework based on a conditional generative adversarial network (C-GAN) to improve the diversity of audio captioning systems. A caption generator and two hybrid discriminators compete and are learned jointly, where the caption generator can be any standard encoder-decoder captioning model used to generate captions, and the hybrid discriminators assess the generated captions against different criteria, such as their naturalness and semantics. We conduct experiments on the Clotho dataset. The results show that our proposed model can generate captions with better diversity compared to state-of-the-art methods.
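The abstract describes a generator and discriminator trained jointly, with the discriminator's judgement feeding back into the generator (text generation is discrete, so GAN-based text models typically use a reinforcement-style reward rather than direct gradients). The toy sketch below illustrates only that alternating-feedback idea; the classes, the word-preference "generator", and the frequency-based "discriminator" are hypothetical simplifications, not the paper's actual architecture.

```python
import random

random.seed(0)

VOCAB = ["rain", "falls", "wind", "blows", "birds", "sing"]

class ToyGenerator:
    """Samples captions from per-word preference scores
    (a stand-in for an encoder-decoder captioning model)."""
    def __init__(self):
        self.scores = {w: 1.0 for w in VOCAB}

    def sample(self, length=3):
        total = sum(self.scores.values())
        return [self._draw(total) for _ in range(length)]

    def _draw(self, total):
        # Roulette-wheel sampling proportional to each word's score.
        r = random.uniform(0, total)
        acc = 0.0
        for w, s in self.scores.items():
            acc += s
            if acc >= r:
                return w
        return VOCAB[-1]

class ToyDiscriminator:
    """Scores a caption by how often its words occur in real captions
    (a stand-in for the paper's learned hybrid discriminators)."""
    def __init__(self, real_captions):
        self.counts = {}
        for cap in real_captions:
            for w in cap:
                self.counts[w] = self.counts.get(w, 0) + 1
        self.total = sum(self.counts.values())

    def score(self, caption):
        # Returns a value in [0, 1]: higher means more "realistic".
        return sum(self.counts.get(w, 0) for w in caption) / (
            self.total * len(caption))

real = [["rain", "falls"], ["wind", "blows"]]
gen, disc = ToyGenerator(), ToyDiscriminator(real)

# Alternating loop: the generator samples a caption, the discriminator
# scores it, and the score is used as a reward to reinforce the sampled
# words (a crude analogue of the policy-gradient update used when
# training text GANs).
for step in range(200):
    caption = gen.sample()
    reward = disc.score(caption)
    for w in caption:
        gen.scores[w] += reward  # reinforce words the discriminator likes
```

After training, words appearing in the "real" captions accumulate higher preference scores than the others, mirroring how discriminator feedback steers the generator toward realistic output.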
Pages: 3311-3323
Page count: 13
Related Papers
50 records
  • [31] Towards Robust Adversarial Training via Dual-label Supervised and Geometry Constraint
    Cao L.-J.
    Kuang H.-F.
    Liu H.
    Wang Y.
    Zhang B.-C.
    Huang F.-Y.
    Wu Y.-J.
    Ji R.-R.
    Ruan Jian Xue Bao/Journal of Software, 2022, 33 (04): : 1218 - 1230
  • [32] Towards Group Fairness via Semi-Centralized Adversarial Training in Federated Learning
    Yang, Yurui
    Jiang, Bo
    2022 23RD IEEE INTERNATIONAL CONFERENCE ON MOBILE DATA MANAGEMENT (MDM 2022), 2022, : 482 - 487
  • [33] Towards training noise-robust anomaly detection via collaborative adversarial flows
    Cheng, Hao
    Luo, Jiaxiang
    Zhang, Xianyong
    Liu, Haiming
    Wu, Fan
    MEASUREMENT, 2025, 242
  • [34] Towards Privacy-Preserving Visual Recognition via Adversarial Training: A Pilot Study
    Wu, Zhenyu
    Wang, Zhangyang
    Wang, Zhaowen
    Jin, Hailin
    COMPUTER VISION - ECCV 2018, PT XVI, 2018, 11220 : 627 - 645
  • [35] Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders
    Li, Jing
    Kang, Di
    Pei, Wenjie
    Zhe, Xuefei
    Zhang, Ying
    He, Zhenyu
    Bao, Linchao
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 11273 - 11282
  • [36] Generating unrestricted adversarial examples via three parameters
    Hanieh Naderi
    Leili Goli
    Shohreh Kasaei
    Multimedia Tools and Applications, 2022, 81 : 21919 - 21938
  • [37] Generating unrestricted adversarial examples via three parameters
    Naderi, Hanieh
    Goli, Leili
    Kasaei, Shohreh
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (15) : 21919 - 21938
  • [38] Towards Improving Adversarial Training of NLP Models
    Yoo, Jin Yong
    Qi, Yanjun
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 945 - 956
  • [39] Towards Efficient Adversarial Training on Vision Transformers
    Wu, Boxi
    Gu, Jindong
    Li, Zhifeng
    Cai, Deng
    He, Xiaofei
    Liu, Wei
    COMPUTER VISION, ECCV 2022, PT XIII, 2022, 13673 : 307 - 325
  • [40] VeCLIP: Improving CLIP Training via Visual-Enriched Captions
    Lai, Zhengfeng
    Zhang, Haotian
    Zhang, Bowen
    Wu, Wentao
    Bai, Haoping
    Timofeev, Aleksei
    Du, Xianzhi
    Gan, Zhe
    Shan, Jiulong
    Chuah, Chen-Nee
    Yang, Yinfei
    Cao, Meng
    COMPUTER VISION - ECCV 2024, PT XLII, 2025, 15100 : 111 - 127