AUDIO-JOURNEY: OPEN DOMAIN LATENT DIFFUSION BASED TEXT-TO-AUDIO GENERATION

Authors
Michaels, Jackson [1]
Li, Juncheng B. [2]
Yao, Laura [2]
Yu, Lijun [2]
Wood-Doughty, Zach [1]
Metze, Florian [2]
Affiliations
[1] Northwestern Univ, Chicago, IL 60611 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA USA
Keywords
Deep Learning; Open Domain Audio Generation; Audio-Visual Training; Large Language Models
DOI
10.1109/ICASSP48485.2024.10448220
Abstract
Despite recent progress, machine learning (ML) models for open-domain audio generation need to catch up to generative models for image, text, speech, and music. The lack of massive open-domain audio datasets is the main reason for this performance gap; we overcome this challenge through a novel data augmentation approach. We leverage state-of-the-art (SOTA) Large Language Models (LLMs) to enrich captions in the weakly-labeled audio dataset. We then use a SOTA video-captioning model to generate captions for the videos from which the audio data originated, and we again use LLMs to merge the audio and video captions to form a rich, large-scale dataset. We experimentally evaluate the quality of our audio-visual captions, showing a 12.5% gain in semantic score over baselines. Using our augmented dataset, we train a Latent Diffusion Model to generate in an EnCodec encoding latent space. Our model is novel in the current SOTA audio generation landscape due to our generation space, text encoder, noise schedule, and attention mechanism. Together, these innovations provide competitive open-domain audio generation. The samples, models, and implementation will be available at https://audiojourney.github.io.
Pages: 6960 - 6964
Page count: 5