Learning From Noisy Correspondence With Tri-Partition for Cross-Modal Matching

Times Cited: 0
Authors
Feng, Zerun [1 ]
Zeng, Zhimin [1 ]
Guo, Caili [2 ]
Li, Zheng [1 ]
Hu, Lin [3 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Informat & Commun Engn, Beijing Key Lab Network Syst Architecture & Convergence, Beijing 100876, Peoples R China
[2] Beijing Univ Posts & Telecommun, Sch Informat & Commun Engn, Beijing Lab Adv Informat Networks, Beijing 100876, Peoples R China
[3] China Telecom Digital Intelligence Technol Co Ltd, Beijing 100035, Peoples R China
Keywords
Noise measurement; Semantics; Training; Semisupervised learning; Data models; Costs; Visualization; Cross-modal matching; noisy correspondence; image-text matching; video-text matching; transformer
DOI
10.1109/TMM.2023.3318002
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Due to the high cost of labeling, a certain proportion of noisy correspondence is inevitably introduced into visual-text datasets, which degrades model robustness for cross-modal matching. Although recent methods achieve promising results by dividing datasets into clean and noisy pair subsets, they still suffer from deep neural networks over-fitting to noisy correspondence. In particular, without careful selection, similar positive pairs with partially relevant semantic correspondence are easily mis-partitioned into the noisy pair subset, which harms robust learning. Meanwhile, similar negative pairs with partially relevant semantic correspondence lead to ambiguous distance relations in common-space learning, which also destabilizes performance. To solve this coarse-grained dataset division problem, we propose the Correspondence Tri-Partition Rectifier (CTPR), which partitions the training set into clean, hard, and noisy pair subsets based on the memorization effect of neural networks and prediction inconsistency. We then refine the correspondence labels of each subset to indicate the real semantic correspondence between visual-text pairs. The differences between the rectified labels of anchors and hard negatives are recast as an adaptive margin in an improved triplet loss for robust training in a co-teaching manner. To verify the effectiveness and robustness of our method, we implement image-text and video-text matching as two showcases. Extensive experiments on the Flickr30K, MS-COCO, MSR-VTT, and LSMDC datasets verify that our method successfully partitions visual-text pairs according to their semantic correspondence and improves performance under noisy training data.
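The tri-partition step described in the abstract, splitting pairs by the memorization effect and by prediction inconsistency between two co-trained networks, can be illustrated with a short sketch. The following is a minimal, hypothetical implementation assuming per-pair matching losses are available from both networks; the function name partition_pairs, the two-component GMM-over-losses recipe, and the threshold tau are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a loss-based tri-partition, assuming per-pair losses from
# two co-trained networks (co-teaching). The two-component GMM over losses is
# the common "memorization effect" recipe; the agreement rule below is an
# illustrative assumption, not the paper's exact procedure.
import numpy as np
from sklearn.mixture import GaussianMixture

def partition_pairs(loss_a, loss_b, tau=0.5):
    """Return (clean, hard, noisy) index arrays for the training pairs."""
    clean_prob = []
    for loss in (loss_a, loss_b):
        gmm = GaussianMixture(n_components=2).fit(loss.reshape(-1, 1))
        low = int(np.argmin(gmm.means_.ravel()))  # low-loss component ~ clean
        clean_prob.append(gmm.predict_proba(loss.reshape(-1, 1))[:, low])
    is_clean_a, is_clean_b = (p > tau for p in clean_prob)

    clean = np.where(is_clean_a & is_clean_b)[0]    # both networks agree: clean
    noisy = np.where(~is_clean_a & ~is_clean_b)[0]  # both networks agree: noisy
    hard = np.where(is_clean_a ^ is_clean_b)[0]     # inconsistent predictions: hard
    return clean, hard, noisy
```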
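Likewise, the adaptive-margin triplet loss, where the gap between the rectified labels of an anchor and its hard negative scales the margin, might look as follows. The tensor names (sim_pos, sim_neg, y_anchor, y_neg), the [0, 1] label range, and base_margin are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a triplet loss whose margin adapts to the rectified-label
# gap between an anchor pair and its hard negative. All names and the default
# base_margin are illustrative assumptions.
import torch

def adaptive_margin_triplet(sim_pos, sim_neg, y_anchor, y_neg, base_margin=0.2):
    """sim_pos / sim_neg: similarities of positive / hard-negative pairs;
    y_anchor / y_neg: rectified correspondence labels in [0, 1]."""
    margin = base_margin * (y_anchor - y_neg).clamp(min=0.0)  # label gap scales margin
    return torch.relu(margin + sim_neg - sim_pos).mean()
```

A larger label gap thus demands a larger similarity separation, while pairs whose hard negative is nearly as relevant as the positive incur little penalty, matching the abstract's motivation for handling partially relevant pairs.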
Pages: 3884-3896
Number of Pages: 13