CCL: Cross-modal Correlation Learning With Multigrained Fusion by Hierarchical Network

Cited by: 186
|
Authors
Peng, Yuxin [1 ]
Qi, Jinwei [1 ]
Huang, Xin [1 ]
Yuan, Yuxin [1 ]
Affiliations
[1] Peking Univ, Inst Comp Sci & Technol, Beijing 100871, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cross-modal retrieval; fine-grained correlation; joint optimization; multi-task learning; REPRESENTATION; MODEL;
DOI
10.1109/TMM.2017.2742704
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Cross-modal retrieval has become a highlighted research topic for retrieval across multimedia data such as image and text. A two-stage learning framework is widely adopted by most existing methods based on deep neural networks (DNN): the first learning stage generates a separate representation for each modality, and the second learning stage learns the cross-modal common representation. However, the existing methods have three limitations: 1) In the first learning stage, they model only intramodality correlation and ignore intermodality correlation with its rich complementary context. 2) In the second learning stage, they adopt only shallow networks with single-loss regularization and ignore the intrinsic relevance between intramodality and intermodality correlation. 3) Only original instances are considered, while the complementary fine-grained clues provided by their patches are ignored. To address these problems, this paper proposes a cross-modal correlation learning (CCL) approach with multigrained fusion by hierarchical network, with the following contributions: 1) In the first learning stage, CCL exploits multilevel association with joint optimization to preserve the complementary context from intramodality and intermodality correlation simultaneously. 2) In the second learning stage, a multitask learning strategy is designed to adaptively balance the intramodality semantic category constraints and intermodality pairwise similarity constraints. 3) CCL adopts multigrained modeling, which fuses coarse-grained instances and fine-grained patches to make cross-modal correlation more precise. Compared with 13 state-of-the-art methods on 6 widely used cross-modal datasets, the experimental results show that our CCL approach achieves the best performance.
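The multitask strategy in contribution 2) can be illustrated with a minimal sketch: an intramodality semantic category constraint (cross-entropy on each modality's class prediction) combined with an intermodality pairwise similarity constraint (a contrastive term pulling matched image/text pairs together and pushing mismatched pairs apart). The function names, the fixed weight `alpha`, and the contrastive form are illustrative assumptions, not the paper's exact formulation; in CCL the balance is tuned adaptively rather than fixed.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def intra_loss(logits, label):
    # Intramodality semantic category constraint:
    # cross-entropy of one modality's class prediction.
    return -math.log(softmax(logits)[label])


def inter_loss(img_vec, txt_vec, similar, margin=1.0):
    # Intermodality pairwise similarity constraint (contrastive form,
    # an illustrative choice): matched pairs are pulled together,
    # mismatched pairs are pushed at least `margin` apart.
    d = math.dist(img_vec, txt_vec)
    return d * d if similar else max(0.0, margin - d) ** 2


def multitask_loss(img_logits, txt_logits, label,
                   img_vec, txt_vec, similar, alpha=0.5):
    # `alpha` balances the two constraints; fixed here for illustration,
    # whereas CCL balances them adaptively during training.
    l_intra = intra_loss(img_logits, label) + intra_loss(txt_logits, label)
    l_inter = inter_loss(img_vec, txt_vec, similar)
    return alpha * l_intra + (1 - alpha) * l_inter
```

A matched pair with identical common representations contributes zero intermodality loss, while a mismatched pair at the same location contributes the full squared margin, which is what drives the two modalities' embeddings apart for semantically different content.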
Pages: 405-420
Number of pages: 16
Related Papers
50 records in total
  • [41] CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition
    Zheng, Jinzhi
    Ji, Ruyi
    Zhang, Libo
    Wu, Yanjun
    Zhao, Chen
    NEURAL INFORMATION PROCESSING, ICONIP 2023, PT VI, 2024, 14452 : 421 - 433
  • [42] Learning a Dynamic Cross-Modal Network for Multispectral Pedestrian Detection
    Xie, Jin
    Anwer, Rao Muhammad
    Cholakkal, Hisham
    Nie, Jing
    Cao, Jiale
    Laaksonen, Jorma
    Khan, Fahad Shahbaz
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4043 - 4052
  • [43] Heterogeneous Interactive Learning Network for Unsupervised Cross-Modal Retrieval
    Zheng, Yuanchao
    Zhang, Xiaowei
    COMPUTER VISION - ACCV 2022, PT IV, 2023, 13844 : 692 - 707
  • [44] Cross-modal Common Representation Learning by Hybrid Transfer Network
    Huang, Xin
    Peng, Yuxin
    Yuan, Mingkuan
    PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 1893 - 1900
  • [45] Enhanced Linear Discriminant Canonical Correlation Analysis for Cross-modal Fusion Recognition
    Yu, Chengnian
    Wang, Huabin
    Liu, Xin
    Tao, Liang
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT I, 2018, 11164 : 841 - 853
  • [46] Semi-discriminant cross-modal correlation feature fusion with structure elasticity
    Zhu, Yanmin
    Peng, Tianhao
    Su, Shuzhi
    OPTIK, 2022, 254
  • [47] Cross-modal Scalable Hyperbolic Hierarchical Clustering
    Long, Teng
    van Noord, Nanne
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 16609 - 16618
  • [48] Hierarchical Consensus Hashing for Cross-Modal Retrieval
    Sun, Yuan
    Ren, Zhenwen
    Hu, Peng
    Peng, Dezhong
    Wang, Xu
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 824 - 836
  • [49] Cross-Modal and Hierarchical Modeling of Video and Text
    Zhang, Bowen
    Hu, Hexiang
    Sha, Fei
    COMPUTER VISION - ECCV 2018, PT XIII, 2018, 11217 : 385 - 401
  • [50] Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval
    Yu, Yi
    Tang, Suhua
    Raposo, Francisco
    Chen, Lei
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2019, 15 (01)