Cross-Modal Retrieval Augmentation for Multi-Modal Classification

Cited by: 0
Authors
Gur, Shir [1, 3]
Neverova, Natalia [2]
Stauffer, Chris [2]
Lim, Ser-Nam [2]
Kiela, Douwe [2]
Reiter, Austin [2]
Affiliations
[1] Tel Aviv Univ, Tel Aviv, Israel
[2] Facebook AI, Menlo Pk, CA USA
[3] FAIR, Menlo Pk, CA USA
Keywords
KNOWLEDGE; LANGUAGE
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering (VQA). First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvements in performance on image-caption retrieval w.r.t. similar methods. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines. We further conduct extensive experiments to establish the promise of this approach, and examine novel applications for inference time such as hot-swapping indices.
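The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' alignment model or code: it assumes images and captions have already been embedded in a shared space, uses made-up 3-dimensional vectors, and ranks captions by plain cosine similarity. The names `CaptionIndex`, `retrieve_context`, `index_a`, and `index_b` are hypothetical. Passing a different index object at query time stands in for the idea of hot-swapping indices at inference.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


class CaptionIndex:
    """Toy in-memory index of (caption, embedding) pairs (hypothetical helper)."""

    def __init__(self, entries):
        # entries: list of (caption, embedding) tuples; the embeddings are made up.
        self.captions = [caption for caption, _ in entries]
        self.vectors = [normalize(vec) for _, vec in entries]

    def search(self, query, k=1):
        """Return the k captions whose embeddings are most similar to the query."""
        q = normalize(query)
        scores = [sum(a * b for a, b in zip(q, v)) for v in self.vectors]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [self.captions[i] for i in ranked[:k]]


def retrieve_context(image_embedding, index, k=1):
    """Fetch captions that a downstream model could consume alongside the image."""
    return index.search(image_embedding, k=k)


# Two hypothetical indices built from different caption collections.
index_a = CaptionIndex([
    ("a dog running on a beach", [0.9, 0.1, 0.0]),
    ("a plate of pasta", [0.0, 0.2, 0.9]),
])
index_b = CaptionIndex([
    ("a puppy playing in sand", [0.8, 0.3, 0.1]),
    ("a bowl of noodles", [0.1, 0.1, 0.9]),
])

image_embedding = [1.0, 0.2, 0.0]  # stand-in for an encoded query image

from_a = retrieve_context(image_embedding, index_a)
from_b = retrieve_context(image_embedding, index_b)  # same query, swapped index
print(from_a, from_b)
```

Because the index is an argument rather than part of the model, the same query embedding can be run against either collection without retraining anything, which is the property the abstract refers to as hot-swapping.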
Pages: 111-123
Number of pages: 13