Probabilistic Embeddings for Cross-Modal Retrieval

Cited by: 100
Authors
Chun, Sanghyuk [1]
Oh, Seong Joon [1]
de Rezende, Rafael Sampaio [2]
Kalantidis, Yannis [2]
Larlus, Diane [2]
Affiliations
[1] NAVER AI Lab, Seongnam, South Korea
[2] NAVER Labs Europe, Meylan, France
DOI
10.1109/CVPR46437.2021.00831
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Cross-modal retrieval methods build a common representation space for samples from multiple modalities, typically from the vision and the language domains. For images and their captions, the multiplicity of the correspondences makes the task particularly challenging. Given an image (respectively a caption), there are multiple captions (respectively images) that equally make sense. In this paper, we argue that deterministic functions are not sufficiently powerful to capture such one-to-many correspondences. Instead, we propose to use Probabilistic Cross-Modal Embedding (PCME), where samples from the different modalities are represented as probabilistic distributions in the common embedding space. Since common benchmarks such as COCO suffer from non-exhaustive annotations for cross-modal matches, we propose to additionally evaluate retrieval on the CUB dataset, a smaller yet clean database where all possible image-caption pairs are annotated. We extensively ablate PCME and demonstrate that it not only improves the retrieval performance over its deterministic counterpart but also provides uncertainty estimates that render the embeddings more interpretable.
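The abstract describes representing each image or caption as a distribution in the shared space rather than as a single point. The snippet below is a minimal sketch of that idea, not the authors' implementation: it assumes diagonal Gaussian embeddings and scores a pair by a Monte Carlo estimate of a match probability, computed as a sigmoid of the Euclidean distance averaged over sampled pairs. The function names and the parameters `a`, `b` and `n_samples` are hypothetical choices made for illustration and are not stated in this record.

```python
import numpy as np


def sample_embeddings(mu, log_sigma, n_samples, rng):
    """Draw n_samples points from a diagonal Gaussian N(mu, diag(exp(log_sigma)^2))."""
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    return mu + np.exp(log_sigma) * eps


def match_probability(mu_a, log_sigma_a, mu_b, log_sigma_b,
                      a=1.0, b=0.0, n_samples=7, seed=0):
    """Monte Carlo estimate of how likely two probabilistic embeddings match.

    Assumption (illustrative only): a pair of sampled points matches with
    probability sigmoid(-a * distance + b); the estimate averages this over
    all n_samples x n_samples sample pairs.
    """
    rng = np.random.default_rng(seed)
    z_a = sample_embeddings(mu_a, log_sigma_a, n_samples, rng)
    z_b = sample_embeddings(mu_b, log_sigma_b, n_samples, rng)
    # Pairwise Euclidean distances between every sample of a and of b: shape (n, n).
    dists = np.linalg.norm(z_a[:, None, :] - z_b[None, :, :], axis=-1)
    # sigmoid(-a * d + b) written as 1 / (1 + exp(a * d - b)).
    probs = 1.0 / (1.0 + np.exp(a * dists - b))
    return float(probs.mean())


if __name__ == "__main__":
    dim = 4
    tiny = np.full(dim, -10.0)  # near-zero variance: behaves like a point embedding
    wide = np.full(dim, 0.5)    # larger variance: samples spread out more
    image_mu = np.zeros(dim)
    caption_mu = np.full(dim, 0.3)
    print("near-deterministic:", match_probability(image_mu, tiny, caption_mu, tiny))
    print("uncertain caption: ", match_probability(image_mu, tiny, caption_mu, wide))
```

With the variance driven towards zero, every sample collapses onto its mean and the score reduces to a deterministic function of the distance between the two means, which is one way to read the "deterministic counterpart" mentioned in the abstract. A larger `log_sigma` spreads the samples, and in this sketch that spread can be read as a rough per-sample uncertainty.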
Pages: 8411-8420
Page count: 10
Related papers
  • [1] Token Embeddings Alignment for Cross-Modal Retrieval
    Xie, Chen-Wei
    Wu, Jianmin
    Zheng, Yun
    Pan, Pan
    Hua, Xian-Sheng
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, pp. 4555-4563
  • [2] Cross-modal Embeddings for Video and Audio Retrieval
    Suris, Didac
    Duarte, Amanda
    Salvador, Amaia
    Torres, Jordi
    Giro-i-Nieto, Xavier
    COMPUTER VISION - ECCV 2018 WORKSHOPS, PT IV, 2019, vol. 11132, pp. 711-716
  • [3] Improving Cross-Modal Retrieval with Set of Diverse Embeddings
    Kim, Dongwon
    Kim, Namyup
    Kwak, Suha
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, pp. 23422-23431
  • [4] CHEF: Cross-Modal Hierarchical Embeddings for Food Domain Retrieval
    Pham, Hai X.
    Guerrero, Ricardo
    Li, Jiatong
    Pavlovic, Vladimir
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, vol. 35, pp. 2423-2430
  • [5] Diachronic Cross-modal Embeddings
    Semedo, David
    Magalhaes, Joao
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, pp. 2061-2069
  • [6] Perfect Match: Self-Supervised Embeddings for Cross-Modal Retrieval
    Chung, Soo-Whan
    Chung, Joon Son
    Kang, Hong-Goo
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2020, vol. 14, no. 3, pp. 568-576
  • [7] Learning Semantic Structure-preserved Embeddings for Cross-modal Retrieval
    Wu, Yiling
    Wang, Shuhui
    Huang, Qingming
    PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018, pp. 825-833
  • [8] Cross-modal Retrieval Using Contrastive Learning of Visual-Semantic Embeddings
    Jain, Anurag
    Verma, Yashaswi
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, pp. 4693-4699
  • [9] A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval
    Li, Hao
    Song, Jingkuan
    Gao, Lianli
    Zeng, Pengpeng
    Zhang, Haonan
    Li, Gongfu
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [10] Adversarial Cross-Modal Retrieval
    Wang, Bokun
    Yang, Yang
    Xu, Xing
    Hanjalic, Alan
    Shen, Heng Tao
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, pp. 154-162