AUDIO-JOURNEY: OPEN DOMAIN LATENT DIFFUSION BASED TEXT-TO-AUDIO GENERATION

Cited by: 0

Authors:

Michaels, Jackson [1]

Li, Juncheng B. [2]

Yao, Laura [2]

Yu, Lijun [2]

Wood-Doughty, Zach [1]

Metze, Florian [2]

Affiliations:

[1] Northwestern Univ, Chicago, IL 60611 USA

[2] Carnegie Mellon Univ, Pittsburgh, PA USA

Source:

2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024 | 2024

Keywords:

Deep Learning; Open Domain Audio Generation; Audio-Visual Training; Large Language Models

DOI:

10.1109/ICASSP48485.2024.10448220

Abstract:

Despite recent progress, machine learning (ML) models for open-domain audio generation need to catch up to generative models for image, text, speech, and music. The lack of massive open-domain audio datasets is the main reason for this performance gap; we overcome this challenge through a novel data augmentation approach. We leverage state-of-the-art (SOTA) Large Language Models (LLMs) to enrich captions in the weakly-labeled audio dataset. We then use a SOTA video-captioning model to generate captions for the videos from which the audio data originated, and we again use LLMs to merge the audio and video captions to form a rich, large-scale dataset. We experimentally evaluate the quality of our audio-visual captions, showing a 12.5% gain in semantic score over baselines. Using our augmented dataset, we train a Latent Diffusion Model to generate in an EnCodec encoding latent space. Our model is novel in the current SOTA audio generation landscape due to our generation space, text encoder, noise schedule, and attention mechanism. Together, these innovations provide competitive open-domain audio generation. The samples, models, and implementation will be available at https://audiojourney.github.io.

Pages: 6960-6964

Number of pages: 5
