Cross-modal Embeddings for Video and Audio Retrieval

Cited by: 11
Authors
Suris, Didac [1]
Duarte, Amanda [1,2]
Salvador, Amaia [1]
Torres, Jordi [1,2]
Giro-i-Nieto, Xavier [1,2]
Affiliations
[1] Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
[2] Barcelona Supercomputing Center (BSC), Barcelona, Spain
Keywords
Cross-modal; Retrieval; YouTube-8M
DOI
10.1007/978-3-030-11018-5_62
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this work, we explore the multi-modal information provided by the YouTube-8M dataset by projecting the audio and visual features into a common feature space to obtain joint audio-visual embeddings. These embeddings are used to retrieve audio samples that fit a given silent video, and also to retrieve images that match a given query audio. The results in terms of Recall@K, obtained over a subset of YouTube-8M videos, show the potential of this unsupervised approach for cross-modal feature learning.
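The Recall@K metric mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the audio and visual features have already been projected into a common space, that row i of each matrix belongs to the same video, and that cosine similarity is used for ranking. The function names `l2_normalize` and `recall_at_k` and the toy data are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def recall_at_k(query_emb, gallery_emb, k):
    """Fraction of queries whose paired item (same row index) ranks in the top k.

    Assumption: query_emb[i] and gallery_emb[i] come from the same video and
    both already live in the shared embedding space.
    """
    q = l2_normalize(np.asarray(query_emb, dtype=float))
    g = l2_normalize(np.asarray(gallery_emb, dtype=float))
    sims = q @ g.T                        # (num_queries, num_gallery)
    true_sim = np.diag(sims)[:, None]     # similarity to the correct match
    # Rank of the correct match = number of items scored strictly higher.
    ranks = (sims > true_sim).sum(axis=1)
    return float((ranks < k).mean())


# Toy data (hypothetical): four "video" embeddings and four "audio"
# embeddings in which the first two items have been swapped.
video_emb = np.eye(4)
audio_emb = np.eye(4)[[1, 0, 2, 3]]

r_at_1 = recall_at_k(video_emb, audio_emb, k=1)
r_at_2 = recall_at_k(video_emb, audio_emb, k=2)
print(r_at_1, r_at_2)
```

In this toy case two of the four queries retrieve their true match first, and all four find it within the top two. The same function covers both retrieval directions described in the abstract by swapping which modality is passed as the query.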
Pages: 711-716
Number of pages: 6