AUDIO-JOURNEY: OPEN DOMAIN LATENT DIFFUSION BASED TEXT-TO-AUDIO GENERATION

Cited by: 0
Authors
Michaels, Jackson [1 ]
Li, Juncheng B. [2 ]
Yao, Laura [2 ]
Yu, Lijun [2 ]
Wood-Doughty, Zach [1 ]
Metze, Florian [2 ]
Affiliations
[1] Northwestern Univ, Chicago, IL 60611 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA USA
Keywords
Deep Learning; Open Domain Audio Generation; Audio-Visual Training; Large Language Models
DOI
10.1109/ICASSP48485.2024.10448220
Abstract
Despite recent progress, machine learning (ML) models for open-domain audio generation lag behind generative models for image, text, speech, and music. The lack of massive open-domain audio datasets is the main reason for this performance gap; we overcome this challenge through a novel data augmentation approach. We leverage state-of-the-art (SOTA) Large Language Models (LLMs) to enrich captions in a weakly-labeled audio dataset. We then use a SOTA video-captioning model to generate captions for the videos from which the audio data originated, and we again use LLMs to merge the audio and video captions to form a rich, large-scale dataset. We experimentally evaluate the quality of our audio-visual captions, showing a 12.5% gain in semantic score over baselines. Using our augmented dataset, we train a Latent Diffusion Model that generates in an EnCodec latent space. Our model is novel in the current SOTA audio generation landscape due to our generation space, text encoder, noise schedule, and attention mechanism. Together, these innovations provide competitive open-domain audio generation. Samples, models, and the implementation will be available at https://audiojourney.github.io.
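A minimal, hypothetical sketch of the caption-merging step the abstract describes: an audio caption from the weakly-labeled dataset and a video caption from a video-captioning model are composed into a single LLM instruction whose completion becomes the merged training caption. The function name and prompt wording below are illustrative assumptions, not taken from the paper.

```python
def build_merge_prompt(audio_caption: str, video_caption: str) -> str:
    """Compose an instruction asking an LLM to fuse an audio caption and a
    video caption into one rich description of the sounds in the clip.
    (Illustrative sketch; not the paper's actual prompt.)"""
    return (
        "Merge the following two descriptions of the same clip into one "
        "detailed caption of the sounds it contains.\n"
        f"Audio caption: {audio_caption}\n"
        f"Video caption: {video_caption}\n"
        "Merged caption:"
    )

# Example: the LLM's completion of this prompt would serve as the
# enriched caption paired with the audio clip during training.
prompt = build_merge_prompt(
    "dog barking",
    "a dog runs across a park chasing a ball",
)
print(prompt)
```

In this scheme the visual caption supplies context (scene, objects, actions) that the weak audio label lacks, which is where the reported 12.5% semantic-score gain would come from.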
Pages: 6960-6964
Page count: 5