Modal Complementarity Based on Multimodal Large Language Model for Text-Based Person Retrieval

Cited by: 0
Authors
Bao, Tong [1 ,2 ,4 ]
Xu, Tong [1 ,2 ,4 ]
Xu, Derong [1 ,3 ,4 ]
Zheng, Zhi [1 ,3 ,4 ]
Affiliations
[1] State Key Lab Cognit Intelligence, Hefei, Peoples R China
[2] Sch Comp Sci & Technol, Hefei, Peoples R China
[3] Sch Data Sci, Hefei, Peoples R China
[4] Univ Sci & Technol China, Hefei, Peoples R China
Source
Funding
National Natural Science Foundation of China
Keywords
Text-to-image Retrieval; Person Re-identification; Cross-Modal Retrieval
DOI
10.1007/978-981-97-7232-2_18
CLC Classification Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Text-based person retrieval aims to retrieve images of a person of interest given a textual description. The primary challenge stems from the semantic gap caused by the difference in feature granularity between text (coarse-grained) and images (fine-grained). Previous works have used attention mechanisms to align the modalities or to learn a uniform representation, aiming to bridge the semantic gap between text and images. However, these methods suffer from two limitations: 1) attention-based methods overlook subtle yet valuable information; and 2) the significant granularity gap between modalities makes learning a uniform representation time-consuming. To address these issues, we propose a Modal Complementarity framework based on a Multimodal Large Language Model (MLLM-MC), which designs prompts tailored to the task and leverages the multimodal abilities of a Multimodal Large Language Model (MLLM) to produce detailed textual descriptions of images. These generated descriptions serve as a complement to the visual modality, thereby expanding text-to-image retrieval into text-to-composite-image retrieval. To extract more comprehensive feature information, MLLM-MC adopts a dual-stream structure with separate feature extractors for the visual and textual modalities; each stream is further divided into a basic and a detailed extractor, enabling the capture of information at different levels of granularity. Furthermore, to narrow the modal gap, we introduce an uncertainty modeling technique in the visual branch that extends the model's matching pattern from one-to-one to one-to-many. The fused multimodal features are aligned through a transformer-based fusion module and low-order multimodal alignment.
We conducted extensive experiments on three public datasets, where MLLM-MC achieves competitive Rank-1 accuracies of 68.58%, 62.66%, and 52.50% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively.
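The one-to-many matching idea described in the abstract can be sketched as follows. This is a minimal NumPy illustration assuming a Gaussian parameterization of the visual feature; every name here (`sample_visual_embeddings`, `match_score`, the toy dimensions and variances) is a hypothetical stand-in for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    # Project vectors onto the unit sphere so dot products become cosines.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def sample_visual_embeddings(mean, log_var, k, rng):
    # Uncertainty modeling (assumed form): treat the visual feature as a
    # Gaussian N(mean, diag(exp(log_var))) and draw k samples, so a single
    # image yields several candidate embeddings (one-to-many matching).
    std = np.exp(0.5 * log_var)
    eps = rng.standard_normal((k, mean.shape[-1]))
    return l2_normalize(mean + eps * std)

def match_score(text_emb, visual_samples):
    # Score a text query against the k sampled visual embeddings and
    # keep the best cosine similarity.
    sims = visual_samples @ l2_normalize(text_emb)
    return float(sims.max())

# Toy stand-ins for encoder outputs (not real features from the paper).
d = 32
img_mean = rng.standard_normal(d)
img_log_var = np.full(d, -2.0)                     # small "learned" variance
matching_text = img_mean + 0.1 * rng.standard_normal(d)
distractor_text = rng.standard_normal(d)

samples = sample_visual_embeddings(img_mean, img_log_var, k=16, rng=rng)
s_match = match_score(matching_text, samples)
s_neg = match_score(distractor_text, samples)
```

Sampling turns a single deterministic image embedding into a small cloud of candidates, which is one plausible way to realize the one-to-one to one-to-many extension the abstract describes.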
Pages: 264-279 (16 pages)