Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification

Cited by: 0
Authors
Zhao, Zhiwei [1 ,2 ]
Liu, Bin [1 ,2 ]
Lu, Yan [3 ]
Chu, Qi [1 ,2 ]
Yu, Nenghai [1 ,2 ]
Affiliations
[1] Univ Sci & Technol China, Sch Cyber Sci & Technol, Hefei, Peoples R China
[2] CAS Key Lab Electromagnet Space Informat, Beijing, Peoples R China
[3] Shanghai AI Lab, Shanghai, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Text-to-Image person re-identification (TI-ReID) aims to retrieve images of a target identity according to a given textual description. Existing TI-ReID methods focus on aligning the visual and textual modalities through contrastive feature alignment or reconstructive masked language modeling (MLM). However, these methods parameterize image/text instances as deterministic embeddings and do not explicitly consider the inherent uncertainty in pedestrian images and their textual descriptions, leading to limited image-text relationship expression and semantic alignment. To address this problem, we propose a novel method that unifies multi-modal uncertainty modeling and semantic alignment for TI-ReID. Specifically, we model the image and textual feature vectors of pedestrians as Gaussian distributions, where the multi-granularity uncertainty of each distribution is estimated by incorporating batch-level and identity-level feature variances for each modality. This multi-modal uncertainty modeling acts as a feature augmentation and provides richer image-text semantic relationships. We then present a bi-directional cross-modal circle loss to more effectively align the probabilistic features between image and text in a self-paced manner. To further promote comprehensive image-text semantic alignment, we design a task that complements masked language modeling, focusing on the cross-modality semantic recovery of a global masked token after cross-modal interaction. Extensive experiments on three TI-ReID datasets highlight the effectiveness and superiority of our method over state-of-the-art approaches.
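As an illustration of the two core components described in the abstract, the sketch below gives a minimal PyTorch rendering of (i) Gaussian uncertainty modeling with batch-level and identity-level variances and (ii) a bi-directional cross-modal circle loss. The function names, the equal mixing of the two variance granularities, and the margin/scale defaults are assumptions for illustration only, not the authors' exact formulation; the MLM-complementary global-token recovery task is omitted.

```python
# Minimal, self-contained sketch (PyTorch). Names, the 50/50 mixing of
# batch-level and identity-level variances, and the margin/scale values
# are illustrative assumptions, not the paper's exact formulation.
import torch
import torch.nn.functional as F


def gaussian_uncertainty(features: torch.Tensor, labels: torch.Tensor):
    """Treat each embedding as the mean of a Gaussian whose variance mixes
    batch-level and identity-level statistics (multi-granularity uncertainty)."""
    mu = features                                     # means = original embeddings
    batch_var = features.var(dim=0, keepdim=True)     # batch-level variance, shared by all samples
    id_var = torch.zeros_like(features)
    for pid in labels.unique():                       # identity-level variance per person ID
        mask = labels == pid
        id_var[mask] = features[mask].var(dim=0, unbiased=False)
    sigma2 = 0.5 * batch_var + 0.5 * id_var           # assumed equal mixing of the two granularities
    sampled = mu + torch.randn_like(mu) * sigma2.sqrt()  # re-parameterised sample = feature augmentation
    return mu, sigma2, sampled


def circle_loss(img: torch.Tensor, txt: torch.Tensor, labels: torch.Tensor,
                m: float = 0.25, gamma: float = 64.0) -> torch.Tensor:
    """Bi-directional cross-modal circle loss: the standard circle-loss form
    applied image->text and text->image, so harder pairs receive larger
    gradients (self-paced weighting)."""
    img, txt = F.normalize(img, dim=1), F.normalize(txt, dim=1)
    pos = labels.unsqueeze(1) == labels.unsqueeze(0)  # same-identity mask

    def one_direction(q, g):
        sim = q @ g.t()                                       # cosine similarities
        ap = torch.clamp_min(1.0 + m - sim.detach(), 0.0)     # adaptive positive weights
        an = torch.clamp_min(sim.detach() + m, 0.0)           # adaptive negative weights
        logit_p = -gamma * ap * (sim - (1.0 - m))
        logit_n = gamma * an * (sim - m)
        return F.softplus(
            torch.logsumexp(logit_n.masked_fill(pos, float("-inf")), dim=1)
            + torch.logsumexp(logit_p.masked_fill(~pos, float("-inf")), dim=1)
        ).mean()

    return 0.5 * (one_direction(img, txt) + one_direction(txt, img))
```

Under these assumptions, a training step would draw augmented features for both modalities with `gaussian_uncertainty` and feed the paired embeddings into the loss, e.g. `loss = circle_loss(img_sampled, txt_sampled, pids)`.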
页码:7534 / 7542
页数:9