Learning From Noisy Correspondence With Tri-Partition for Cross-Modal Matching

Times Cited: 0
Authors
Feng, Zerun [1 ]
Zeng, Zhimin [1 ]
Guo, Caili [2 ]
Li, Zheng [1 ]
Hu, Lin [3 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Informat & Commun Engn, Beijing Key Lab Network Syst Architecture & Convergence, Beijing 100876, Peoples R China
[2] Beijing Univ Posts & Telecommun, Sch Informat & Commun Engn, Beijing Lab Adv Informat Networks, Beijing 100876, Peoples R China
[3] China Telecom Digital Intelligence Technol Co Ltd, Beijing 100035, Peoples R China
Keywords
Noise measurement; Semantics; Training; Semisupervised learning; Data models; Costs; Visualization; Cross-modal matching; noisy correspondence; image-text matching; video-text matching; transformer
DOI
10.1109/TMM.2023.3318002
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Due to the high cost of labeling, a certain proportion of noisy correspondence is inevitably introduced into visual-text datasets, which degrades model robustness for cross-modal matching. Although recent methods achieve promising results by dividing datasets into clean and noisy pair subsets, they still suffer from deep neural networks over-fitting to noisy correspondence. In particular, without careful selection, similar positive pairs with partially relevant semantic correspondence are easily mis-partitioned into the noisy pair subset, which harms robust learning. Meanwhile, similar negative pairs with partially relevant semantic correspondence lead to ambiguous distance relations in common-space learning, which also destabilizes performance. To solve this coarse-grained dataset division problem, we propose the Correspondence Tri-Partition Rectifier (CTPR), which partitions the training set into clean, hard, and noisy pair subsets based on the memorization effect of neural networks and prediction inconsistency. We then refine the correspondence labels of each subset to indicate the real semantic correspondence between visual-text pairs. The differences between the rectified labels of anchors and hard negatives are recast as an adaptive margin in an improved triplet loss for robust training in a co-teaching manner. To verify the effectiveness and robustness of our method, we implement image-text and video-text matching as two showcases. Extensive experiments on the Flickr30K, MS-COCO, MSR-VTT, and LSMDC datasets verify that our method successfully partitions visual-text pairs according to their semantic correspondence and improves performance under noisy training data.
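The tri-partition step described in the abstract, splitting pairs by the memorization effect and by prediction inconsistency between two co-trained networks, can be illustrated with a short sketch. The following is a minimal, hypothetical implementation assuming per-pair matching losses are available from both networks; the function name partition_pairs, the two-component GMM-over-losses recipe, and the threshold tau are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a loss-based tri-partition, assuming per-pair losses from
# two co-trained networks (co-teaching). The two-component GMM over losses is
# the common "memorization effect" recipe; the agreement rule below is an
# illustrative assumption, not the paper's exact procedure.
import numpy as np
from sklearn.mixture import GaussianMixture

def partition_pairs(loss_a, loss_b, tau=0.5):
    """Return (clean, hard, noisy) index arrays for the training pairs."""
    clean_prob = []
    for loss in (loss_a, loss_b):
        gmm = GaussianMixture(n_components=2).fit(loss.reshape(-1, 1))
        low = int(np.argmin(gmm.means_.ravel()))  # low-loss component ~ clean
        clean_prob.append(gmm.predict_proba(loss.reshape(-1, 1))[:, low])
    is_clean_a, is_clean_b = (p > tau for p in clean_prob)

    clean = np.where(is_clean_a & is_clean_b)[0]    # both networks agree: clean
    noisy = np.where(~is_clean_a & ~is_clean_b)[0]  # both networks agree: noisy
    hard = np.where(is_clean_a ^ is_clean_b)[0]     # inconsistent predictions: hard
    return clean, hard, noisy
```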
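Likewise, the adaptive-margin triplet loss, where the gap between the rectified labels of an anchor and its hard negative scales the margin, might look as follows. The tensor names (sim_pos, sim_neg, y_anchor, y_neg), the [0, 1] label range, and base_margin are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a triplet loss whose margin adapts to the rectified-label
# gap between an anchor pair and its hard negative. All names and the default
# base_margin are illustrative assumptions.
import torch

def adaptive_margin_triplet(sim_pos, sim_neg, y_anchor, y_neg, base_margin=0.2):
    """sim_pos / sim_neg: similarities of positive / hard-negative pairs;
    y_anchor / y_neg: rectified correspondence labels in [0, 1]."""
    margin = base_margin * (y_anchor - y_neg).clamp(min=0.0)  # label gap scales margin
    return torch.relu(margin + sim_neg - sim_pos).mean()
```

A larger label gap thus demands a larger similarity separation, while pairs whose hard negative is nearly as relevant as the positive incur little penalty, matching the abstract's motivation for handling partially relevant pairs.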
Pages: 3884-3896
Number of Pages: 13