Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification

Cited: 0
Authors
Zhao, Zhiwei [1 ,2 ]
Liu, Bin [1 ,2 ]
Lu, Yan [3 ]
Chu, Qi [1 ,2 ]
Yu, Nenghai [1 ,2 ]
Affiliations
[1] Univ Sci & Technol China, Sch Cyber Sci & Technol, Hefei, Peoples R China
[2] CAS Key Lab Electromagnet Space Informat, Beijing, Peoples R China
[3] Shanghai AI Lab, Shanghai, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DOI
N/A
CLC Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Text-to-image person re-identification (TI-ReID) aims to retrieve the images of a target identity according to a given textual description. Existing TI-ReID methods focus on aligning the visual and textual modalities through contrastive feature alignment or reconstructive masked language modeling (MLM). However, these methods parameterize image/text instances as deterministic embeddings and do not explicitly consider the inherent uncertainty in pedestrian images and their textual descriptions, leading to limited image-text relationship expression and semantic alignment. To address this problem, in this paper we propose a novel method that unifies multi-modal uncertainty modeling and semantic alignment for TI-ReID. Specifically, we model the image and textual feature vectors of pedestrians as Gaussian distributions, where the multi-granularity uncertainty of each distribution is estimated by incorporating batch-level and identity-level feature variances for each modality. The multi-modal uncertainty modeling acts as a feature augmentation and provides richer image-text semantic relationships. We then present a bi-directional cross-modal circle loss to more effectively align the probabilistic features between image and text in a self-paced manner. To further promote comprehensive image-text semantic alignment, we design a task that complements masked language modeling, focusing on the cross-modality semantic recovery of the global masked token after cross-modal interaction. Extensive experiments conducted on three TI-ReID datasets highlight the effectiveness and superiority of our method over state-of-the-art approaches.
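The uncertainty-modeling idea in the abstract — treating each feature vector as the mean of a Gaussian whose variance mixes batch-level and identity-level statistics, then sampling from it as an augmentation — can be sketched as follows. This is an illustrative sketch only, not the authors' implementation; the mixing weight `alpha` and the NumPy-based formulation are assumptions for exposition.

```python
import numpy as np

def gaussian_uncertainty_augment(feats, ids, alpha=0.5, rng=None):
    """Treat each feature as the mean of a Gaussian and draw one
    reparameterized sample from it (illustrative sketch).

    feats : (N, D) array of image or text features in a batch
    ids   : (N,) array of identity labels
    alpha : assumed weight mixing batch-level and identity-level variance
    """
    rng = np.random.default_rng() if rng is None else rng
    # Batch-level variance: one (D,) vector shared by the whole batch.
    batch_var = feats.var(axis=0)
    # Identity-level variance: per-dimension variance over each identity's samples.
    id_var = np.zeros_like(feats)
    for pid in np.unique(ids):
        mask = ids == pid
        id_var[mask] = feats[mask].var(axis=0)
    # Multi-granularity variance estimate combining both levels.
    sigma2 = alpha * batch_var + (1.0 - alpha) * id_var
    # Reparameterization trick: mean + std * noise, so the sample stays
    # differentiable w.r.t. the feature in a gradient-based framework.
    eps = rng.standard_normal(feats.shape)
    return feats + np.sqrt(sigma2) * eps
```

In the paper's setting such samples would then feed the alignment losses (e.g. the bi-directional cross-modal circle loss); here the function only shows the distribution construction and sampling step.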
Pages: 7534-7542 (9 pages)
Related Papers (50 total)
  • [21] Multi-modal uniform deep learning for RGB-D person re-identification
    Ren, Liangliang
    Lu, Jiwen
    Feng, Jianjiang
    Zhou, Jie
    PATTERN RECOGNITION, 2017, 72 : 446 - 457
  • [22] Graph based Spatial-temporal Fusion for Multi-modal Person Re-identification
    Zhang, Yaobin
    Lv, Jianming
    Liu, Chen
    Cai, Hongmin
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3736 - 3744
  • [23] TriReID: Towards Multi-Modal Person Re-Identification via Descriptive Fusion Model
    Zhai, Yajing
    Zeng, Yawen
    Cao, Da
    Lu, Shaofei
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 63 - 71
  • [24] PROMPTCHARM: Text-to-Image Generation through Multi-modal Prompting and Refinement
    Wang, Zhijie
    Huang, Yuheng
    Song, Da
    Ma, Lei
    Zhang, Tianyi
PROCEEDINGS OF THE 2024 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS (CHI 2024), 2024,
  • [25] Progressively Hybrid Transformer for Multi-Modal Vehicle Re-Identification
    Pan, Wenjie
    Huang, Linhan
    Liang, Jianbao
    Hong, Lan
    Zhu, Jianqing
    SENSORS, 2023, 23 (09)
  • [26] Spatial enhanced multi-level alignment learning for text-image person re-identification with coupled noisy labels
    Zhao, Jiacheng
    Che, Haojie
    Li, Yongxi
    MULTIMEDIA SYSTEMS, 2025, 31 (02)
  • [27] Graph-based Consistent Reconstruction and Alignment for imbalanced text-image person re-identification
    Du, Guodong
    Gong, Tiantian
    Zhang, Liyan
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 260
  • [28] Interact, Embed, and EnlargE: Boosting Modality-Specific Representations for Multi-Modal Person Re-identification
    Wang, Zi
    Li, Chenglong
    Zheng, Aihua
    He, Ran
    Tang, Jin
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 2633 - 2641
  • [29] Image-to-video person re-identification using three-dimensional semantic appearance alignment and cross-modal interactive learning
    Shi, Wei
    Liu, Hong
    Liu, Mengyuan
    PATTERN RECOGNITION, 2022, 122
  • [30] Enhancing Cross-modal Completion and Alignment for Unsupervised Incomplete Text-to-Image Person Retrieval
    Gong, Tiantian
    Wang, Junsheng
    Zhang, Liyan
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 794 - 802