AUDIO-JOURNEY: OPEN DOMAIN LATENT DIFFUSION BASED TEXT-TO-AUDIO GENERATION

Cited by: 0
Authors
Michaels, Jackson [1 ]
Li, Juncheng B. [2 ]
Yao, Laura [2 ]
Yu, Lijun [2 ]
Wood-Doughty, Zach [1 ]
Metze, Florian [2 ]
Affiliations
[1] Northwestern Univ, Chicago, IL 60611 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA USA
Keywords
Deep Learning; Open Domain Audio Generation; Audio-Visual Training; Large Language Models
DOI
10.1109/ICASSP48485.2024.10448220
Abstract
Despite recent progress, machine learning (ML) models for open-domain audio generation still lag behind generative models for image, text, speech, and music. The lack of massive open-domain audio datasets is the main reason for this performance gap; we overcome this challenge through a novel data augmentation approach. We leverage state-of-the-art (SOTA) Large Language Models (LLMs) to enrich the captions of a weakly-labeled audio dataset. We then use a SOTA video-captioning model to caption the videos from which the audio originated, and we again use LLMs to merge the audio and video captions into a rich, large-scale dataset. We experimentally evaluate the quality of our audio-visual captions, showing a 12.5% gain in semantic score over baselines. Using our augmented dataset, we train a Latent Diffusion Model that generates in an EnCodec latent space. Our model is novel in the current SOTA audio generation landscape in its generation space, text encoder, noise schedule, and attention mechanism; together, these innovations yield competitive open-domain audio generation. Samples, models, and the implementation will be available at https://audiojourney.github.io.
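The abstract's central design choice is that diffusion operates in an EnCodec latent space rather than on mel-spectrograms or raw waveforms. The following is a minimal sketch of how audio could be mapped into that space with the publicly released encodec package; the file path, the 6 kbps bandwidth setting, and the choice between discrete codes and continuous encoder outputs are illustrative assumptions on my part, not details confirmed by the paper.

    import torch
    import torchaudio
    from encodec import EncodecModel
    from encodec.utils import convert_audio

    # Load the pretrained 24 kHz EnCodec model; bandwidth choice is an assumption.
    model = EncodecModel.encodec_model_24khz()
    model.set_target_bandwidth(6.0)

    # Load a waveform and convert it to the model's sample rate and channel count.
    wav, sr = torchaudio.load("example.wav")  # placeholder path
    wav = convert_audio(wav, sr, model.sample_rate, model.channels)
    wav = wav.unsqueeze(0)  # shape: (batch, channels, samples)

    with torch.no_grad():
        # Option A: discrete codes from the residual vector quantizer.
        frames = model.encode(wav)  # list of (codes, scale) tuples
        codes = torch.cat([c for c, _ in frames], dim=-1)  # (batch, n_q, T)

        # Option B: continuous pre-quantization encoder outputs, one plausible
        # target for latent diffusion (assumption; the paper may use another
        # intermediate representation).
        latents = model.encoder(wav)  # (batch, 128, T)

    print(codes.shape, latents.shape)

Under this reading, the diffusion model would be trained to denoise tensors shaped like the latents above, and the matching EnCodec decoder would map generated latents back to a waveform.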
Pages: 6960-6964
Number of pages: 5
Related Papers
50 items in total
  • [31] Enhancing text summarization and audio generation using hybrid model
    Koreddi, Venkatesh
    Chandini, Shaik
    Challa, B. V. T. Kalyan
    Teja, M. Sai Ram
    ENGINEERING RESEARCH EXPRESS, 2025, 7 (01):
  • [32] Mining Audio, Text and Visual Information for Talking Face Generation
    Yu, Lingyun
    Yu, Jun
    Ling, Qiang
    2019 19TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2019), 2019, : 787 - 795
  • [33] Open Domain Event Text Generation
    Fu, Zihao
    Bing, Lidong
    Lam, Wai
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 7748 - 7755
  • [34] AUDIO-BASED NONLINEAR VIDEO DIFFUSION
    Casanovas, Anna Llagostera
    Vandergheynst, Pierre
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 2486 - 2489
  • [35] Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment
    Sung-Bin, Kim
    Senocak, Arda
    Ha, Hyunwoo
    Owens, Andrew
    Oh, Tae-Hyun
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6430 - 6440
  • [36] Audio Watermarking Based on Quantization in Wavelet Domain
    Bhat, Vivekananda
    Sengupta, Indranil
    Das, Abhijit
    INFORMATION SYSTEMS SECURITY, PROCEEDINGS, 2008, 5352 : 235 - 242
  • [37] Audio and text density in computer-based instruction
    Koroghlanian, CM
    Sullivan, HJ
    JOURNAL OF EDUCATIONAL COMPUTING RESEARCH, 2000, 22 (02) : 217 - 230
  • [38] Audio Dequantization for High Fidelity Audio Generation in Flow-based Neural Vocoder
    Yoon, Hyun-Wook
    Lee, Sang-Hoon
    Noh, Hyeong-Rae
    Lee, Seong-Whan
    INTERSPEECH 2020, 2020, : 3545 - 3549
  • [39] MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration
    Hayes, Thomas
    Zhang, Songyang
    Yin, Xi
    Pang, Guan
    Sheng, Sasha
    Yang, Harry
    Ge, Songwei
    Hu, Qiyuan
    Parikh, Devi
    COMPUTER VISION, ECCV 2022, PT VIII, 2022, 13668 : 431 - 449
  • [40] HMM-based audio keyword generation
    Xu, M
    Duan, LY
    Cai, J
    Chia, LT
    Xu, CS
    Tian, Q
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2004, PT 3, PROCEEDINGS, 2004, 3333 : 566 - 574