CCL: Cross-modal Correlation Learning With Multigrained Fusion by Hierarchical Network

Cited by: 186
|
Authors
Peng, Yuxin [1 ]
Qi, Jinwei [1 ]
Huang, Xin [1 ]
Yuan, Yuxin [1 ]
Affiliations
[1] Peking Univ, Inst Comp Sci & Technol, Beijing 100871, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cross-modal retrieval; fine-grained correlation; joint optimization; multi-task learning; REPRESENTATION; MODEL;
DOI
10.1109/TMM.2017.2742704
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Cross-modal retrieval has become a highlighted research topic for retrieval across multimedia data such as image and text. A two-stage learning framework is widely adopted by most existing methods based on deep neural networks (DNN): the first learning stage generates a separate representation for each modality, and the second learning stage learns the cross-modal common representation. However, the existing methods have three limitations: 1) In the first learning stage, they model only intramodality correlation and ignore intermodality correlation with its rich complementary context. 2) In the second learning stage, they adopt only shallow networks with single-loss regularization and ignore the intrinsic relevance between intramodality and intermodality correlation. 3) Only original instances are considered, while the complementary fine-grained clues provided by their patches are ignored. To address these problems, this paper proposes a cross-modal correlation learning (CCL) approach with multigrained fusion by hierarchical network, with the following contributions: 1) In the first learning stage, CCL exploits multilevel association with joint optimization to preserve the complementary context from intramodality and intermodality correlation simultaneously. 2) In the second learning stage, a multitask learning strategy is designed to adaptively balance the intramodality semantic category constraints and intermodality pairwise similarity constraints. 3) CCL adopts multigrained modeling, which fuses coarse-grained instances and fine-grained patches to make cross-modal correlation more precise. Compared with 13 state-of-the-art methods on 6 widely used cross-modal datasets, the experimental results show that our CCL approach achieves the best performance.
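The multitask strategy in contribution 2) can be illustrated with a minimal sketch: an intramodality semantic category constraint (cross-entropy on each modality's class prediction) combined with an intermodality pairwise similarity constraint (a contrastive term pulling matched image/text pairs together and pushing mismatched pairs apart). The function names, the fixed weight `alpha`, and the contrastive form are illustrative assumptions, not the paper's exact formulation; in CCL the balance is tuned adaptively rather than fixed.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def intra_loss(logits, label):
    # Intramodality semantic category constraint:
    # cross-entropy of one modality's class prediction.
    return -math.log(softmax(logits)[label])


def inter_loss(img_vec, txt_vec, similar, margin=1.0):
    # Intermodality pairwise similarity constraint (contrastive form,
    # an illustrative choice): matched pairs are pulled together,
    # mismatched pairs are pushed at least `margin` apart.
    d = math.dist(img_vec, txt_vec)
    return d * d if similar else max(0.0, margin - d) ** 2


def multitask_loss(img_logits, txt_logits, label,
                   img_vec, txt_vec, similar, alpha=0.5):
    # `alpha` balances the two constraints; fixed here for illustration,
    # whereas CCL balances them adaptively during training.
    l_intra = intra_loss(img_logits, label) + intra_loss(txt_logits, label)
    l_inter = inter_loss(img_vec, txt_vec, similar)
    return alpha * l_intra + (1 - alpha) * l_inter
```

A matched pair with identical common representations contributes zero intermodality loss, while a mismatched pair at the same location contributes the full squared margin, which is what drives the two modalities' embeddings apart for semantically different content.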
Pages: 405-420
Number of pages: 16
Related Papers
50 records in total
  • [41] CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition
    Zheng, Jinzhi
    Ji, Ruyi
    Zhang, Libo
    Wu, Yanjun
    Zhao, Chen
    NEURAL INFORMATION PROCESSING, ICONIP 2023, PT VI, 2024, 14452 : 421 - 433
  • [42] Learning a Dynamic Cross-Modal Network for Multispectral Pedestrian Detection
    Xie, Jin
    Anwer, Rao Muhammad
    Cholakkal, Hisham
    Nie, Jing
    Cao, Jiale
    Laaksonen, Jorma
    Khan, Fahad Shahbaz
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4043 - 4052
  • [43] Heterogeneous Interactive Learning Network for Unsupervised Cross-Modal Retrieval
    Zheng, Yuanchao
    Zhang, Xiaowei
    COMPUTER VISION - ACCV 2022, PT IV, 2023, 13844 : 692 - 707
  • [44] Cross-modal Common Representation Learning by Hybrid Transfer Network
    Huang, Xin
    Peng, Yuxin
    Yuan, Mingkuan
    PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 1893 - 1900
  • [45] Enhanced Linear Discriminant Canonical Correlation Analysis for Cross-modal Fusion Recognition
    Yu, Chengnian
    Wang, Huabin
    Liu, Xin
    Tao, Liang
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT I, 2018, 11164 : 841 - 853
  • [46] Semi-discriminant cross-modal correlation feature fusion with structure elasticity
    Zhu, Yanmin
    Peng, Tianhao
    Su, Shuzhi
    OPTIK, 2022, 254
  • [47] Cross-modal Scalable Hyperbolic Hierarchical Clustering
    Long, Teng
    van Noord, Nanne
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 16609 - 16618
  • [48] Hierarchical Consensus Hashing for Cross-Modal Retrieval
    Sun, Yuan
    Ren, Zhenwen
    Hu, Peng
    Peng, Dezhong
    Wang, Xu
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 824 - 836
  • [49] Cross-Modal and Hierarchical Modeling of Video and Text
    Zhang, Bowen
    Hu, Hexiang
    Sha, Fei
    COMPUTER VISION - ECCV 2018, PT XIII, 2018, 11217 : 385 - 401
  • [50] Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval
    Yu, Yi
    Tang, Suhua
    Raposo, Francisco
    Chen, Lei
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2019, 15 (01)