AMC: Adaptive Multi-expert Collaborative Network for Text-guided Image Retrieval

Cited by: 8
Authors
Zhu, Hongguang [1]
Wei, Yunchao [1]
Zhao, Yao [1]
Zhang, Chunjie [2, 3]
Huang, Shujuan [2, 3]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing Key Lab Adv Informat Sci & Network Techno, Beijing 100044, Peoples R China
[2] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China
[3] Beijing Key Lab Adv Informat Sci & Network Techno, Beijing 100044, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China
Keywords
Text-guided image retrieval; multimodal fusion; mixture-of-experts
DOI
10.1145/3584703
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text-guided image retrieval combines a reference image and text feedback into a multimodal query to search for the image that matches the user's intention. Recent approaches employ multi-level matching, multiple accesses, or multiple subnetworks to improve performance, at the cost of a heavy storage and computation burden in deployment. Additionally, these models not only rely on expert knowledge to handcraft image-text composing modules but also perform inference with a static computational graph. This limits the representation capability and generalization ability of the networks when they face complex and varied combinations of reference image and text feedback. To break the shackles of the static network concept, we introduce a dynamic router mechanism that achieves data-dependent expert activation and flexible collaboration among multiple experts, so as to explore more implicit multimodal fusion patterns. Specifically, we construct AMC, our Adaptive Multi-expert Collaborative network, by using the proposed router to activate different experts with different levels of image-text interaction. Since the routers dynamically adjust the activation of experts for the current samples, AMC can adapt its fusion mode to different reference image and text combinations and generate dynamic computational graphs according to varied multimodal queries. Extensive experiments on two benchmark datasets demonstrate that, benefiting from the image-text composing representation produced by the adaptive multi-expert collaboration mechanism, AMC achieves better retrieval performance and zero-shot generalization ability than the state-of-the-art methods while remaining lightweight and fast at retrieval. Moreover, we visualize path activation, attention maps, and retrieval results to further understand the routing decisions and semantic localization ability of AMC.
The code and pretrained models are available at https://github.com/KevinLight831/AMC.
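As a rough illustration of the data-dependent expert activation described in the abstract, the following minimal sketch shows a soft router that weights several fusion experts per query. It is not the authors' implementation: the class name `RoutedFusion`, the linear experts, the softmax gating, and the random weights are all assumptions made for illustration, and the actual AMC experts and router are defined in the repository linked above.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def make_matrix(rows, cols, rng):
    """Random matrix standing in for learned parameters (assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class RoutedFusion:
    """Hypothetical soft mixture of fusion experts, gated per query."""

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # The router maps the concatenated query to one score per expert.
        self.router = make_matrix(num_experts, 2 * dim, rng)
        # Each expert is a plain linear fusion layer in this sketch.
        self.experts = [make_matrix(dim, 2 * dim, rng) for _ in range(num_experts)]

    def forward(self, image_feat, text_feat):
        query = image_feat + text_feat  # list concatenation
        gates = softmax(linear(self.router, query))
        outputs = [linear(expert, query) for expert in self.experts]
        # Gate-weighted sum of expert outputs: the mix depends on the query.
        fused = [
            sum(g * out[i] for g, out in zip(gates, outputs))
            for i in range(self.dim)
        ]
        return fused, gates


model = RoutedFusion(dim=4, num_experts=3)
fused_a, gates_a = model.forward([1.0, 0.0, 0.5, -0.5], [0.2, 0.1, 0.0, 0.3])
fused_b, gates_b = model.forward([-1.0, 0.7, 0.0, 0.4], [0.9, -0.2, 0.6, 0.0])
```

Because the gate values are computed from each query, two different image and text combinations generally receive different expert weightings, which is the property the abstract refers to as a dynamic computational graph.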
Pages: 22
Related papers
50 in total
  • [11] Perceptual Image Compression with Text-Guided Multi-level Fusion
    Hu, Jiaqi
    Zhuang, Jiedong
    Liang, Xiaoyu
    Wang, Dayong
    Yu, Lu
    Hu, Haoji
    PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035 : 84 - 97
  • [12] Bimodal text-guided image inpainting algorithm
    Li H.
    Chen J.
    Yu P.
    Li H.
    Zhang Y.
    Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics, 2023, 49 (10) : 2547 - 2557
  • [13] Compound Text-Guided Prompt Tuning via Image-Adaptive Cues
    Tan, Hao
    Li, Jun
    Zhou, Yizhuang
    Wan, Jun
    Lei, Zhen
    Zhang, Xiangyu
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 5, 2024, : 5061 - 5069
  • [14] MISL: Multi-grained image-text semantic learning for text-guided image inpainting
    Wu, Xingcai
    Zhao, Kejun
    Huang, Qianding
    Wang, Qi
    Yang, Zhenguo
    Hao, Gefei
    PATTERN RECOGNITION, 2024, 145
  • [15] Text-Guided Customizable Image Synthesis and Manipulation
    Zhang, Zhiqiang
    Fu, Chen
    Weng, Wei
    Zhou, Jinjia
    APPLIED SCIENCES-BASEL, 2022, 12 (20):
  • [16] GENERATIVE ADVERSARIAL NETWORK INCLUDING REFERRING IMAGE SEGMENTATION FOR TEXT-GUIDED IMAGE MANIPULATION
    Watanabe, Yuto
    Togo, Ren
    Maeda, Keisuke
    Ogawa, Takahiro
    Haseyama, Miki
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4818 - 4822
  • [17] Text-guided Unsupervised Latent Transformation for Multi-attribute Image Manipulation
    Wei, Xiwen
    Xu, Zhen
    Liu, Cheng
    Wu, Si
    Yu, Zhiwen
    Wong, Hau San
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 19285 - 19294
  • [18] Text-Guided Attention Model for Image Captioning
    Mun, Jonghwan
    Cho, Minsu
    Han, Bohyung
    THIRTY-FIRST AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 4233 - 4239
  • [19] Text-Guided Neural Network Training for Image Recognition in Natural Scenes and Medicine
    Zhang, Zizhao
    Chen, Pingjun
    Shi, Xiaoshuang
    Yang, Lin
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2021, 43 (05) : 1733 - 1745
  • [20] Text-Guided Sketch-to-Photo Image Synthesis
    Osahor, Uche
    Nasrabadi, Nasser M.
    IEEE ACCESS, 2022, 10 : 98278 - 98289