AUDIO-JOURNEY: OPEN DOMAIN LATENT DIFFUSION BASED TEXT-TO-AUDIO GENERATION

Cited by: 0

Authors:

Michaels, Jackson [1]

Li, Juncheng B. [2]

Yao, Laura [2]

Yu, Lijun [2]

Wood-Doughty, Zach [1]

Metze, Florian [2]

Affiliations:

[1] Northwestern Univ, Chicago, IL 60611 USA

[2] Carnegie Mellon Univ, Pittsburgh, PA USA

Source:

2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024 | 2024

Keywords:

Deep Learning; Open Domain Audio Generation; Audio-Visual Training; Large Language Models

DOI:

10.1109/ICASSP48485.2024.10448220

Abstract:

Despite recent progress, machine learning (ML) models for open-domain audio generation need to catch up to generative models for image, text, speech, and music. The lack of massive open-domain audio datasets is the main reason for this performance gap; we overcome this challenge through a novel data augmentation approach. We leverage state-of-the-art (SOTA) Large Language Models (LLMs) to enrich captions in the weakly-labeled audio dataset. We then use a SOTA video-captioning model to generate captions for the videos from which the audio data originated, and we again use LLMs to merge the audio and video captions to form a rich, large-scale dataset. We experimentally evaluate the quality of our audio-visual captions, showing a 12.5% gain in semantic score over baselines. Using our augmented dataset, we train a Latent Diffusion Model to generate audio in an EnCodec latent space. Our model is novel in the current SOTA audio generation landscape due to our generation space, text encoder, noise schedule, and attention mechanism. Together, these innovations provide competitive open-domain audio generation. The samples, models, and implementation will be available at https://audiojourney.github.io.

Pages: 6960-6964

Page count: 5
