Disentangled Cross-Modal Transformer for RGB-D Salient Object Detection and Beyond

Cited by: 4
Authors
Chen, Hao [1,2]
Shen, Feihong [1,2]
Ding, Ding [1]
Deng, Yongjian [3 ]
Li, Chao [4 ]
Affiliations
[1] Southeast Univ, Sch Comp Sci & Engn, Nanjing 211189, Peoples R China
[2] Southeast Univ, Key Lab New Generat Artificial Intelligence Techno, Minist Educ, Nanjing 211189, Peoples R China
[3] Beijing Univ Technol, Coll Comp Sci, Beijing 100124, Peoples R China
[4] Alibaba Grp, Hangzhou 311121, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Feature extraction; Task analysis; Computer architecture; Computational modeling; Object detection; Context modeling; RGB-D salient object detection; cross-modal attention; disentanglement; transformer; NETWORK; IMAGE;
DOI
10.1109/TIP.2024.3364022
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Previous multi-modal transformers for RGB-D salient object detection (SOD) generally connect all patches from the two modalities directly to model cross-modal correlation, and combine the modalities without differentiation, which can lead to confusing and inefficient fusion. Instead, we disentangle the cross-modal complementarity from two views to reduce fusion ambiguity: 1) Context disentanglement. We argue that modeling long-range dependencies across modalities, as done before, is uninformative due to the severe modality gap. Rather, we propose to disentangle the cross-modal complementary contexts into intra-modal self-attention, which explores global complementary understanding, and spatial-aligned inter-modal attention, which captures local cross-modal correlations. 2) Representation disentanglement. Unlike the previous undifferentiated combination of cross-modal representations, we find that cross-modal cues complement each other both by enhancing common discriminative regions and by mutually supplementing modal-specific highlights. On top of this, we divide the tokens into consistent and private ones along the channel dimension to disentangle the multi-modal integration path and explicitly strengthen both complementary ways. By progressively propagating this strategy across layers, the proposed Disentangled Feature Pyramid module (DFP) enables informative cross-modal, cross-level integration and better fusion adaptivity. Comprehensive experiments on a large variety of public datasets verify the efficacy of our context and representation disentanglement and the consistent improvement over state-of-the-art models. Additionally, our cross-modal attention hierarchy is plug-and-play for different backbone architectures (both transformer and CNN) and downstream tasks, and experiments on a CNN-based model and on RGB-D semantic segmentation verify this generalization ability.
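As a concrete illustration of the two disentanglement ideas in the abstract, the following PyTorch sketch combines per-modality self-attention (global intra-modal context), a spatially aligned per-location gate standing in for the paper's spatial-aligned inter-modal attention (local cross-modal correlation), and a channel-wise split into consistent and private tokens. The module and parameter names (DisentangledCrossModalBlock, consistent_ratio, inter_gate) are illustrative assumptions, not the authors' released implementation.

# Illustrative sketch only: names and the gating-style local interaction are
# assumptions, not the DFP authors' implementation.
import torch
import torch.nn as nn


class DisentangledCrossModalBlock(nn.Module):
    """Context disentanglement (intra-modal self-attention + spatially aligned
    inter-modal interaction) followed by representation disentanglement
    (channel-wise split into consistent and private tokens)."""

    def __init__(self, dim: int, num_heads: int = 4, consistent_ratio: float = 0.5):
        super().__init__()
        # Global context inside each modality (full self-attention per modality).
        self.rgb_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.depth_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Spatially aligned inter-modal interaction: only tokens at the same
        # spatial index exchange information (a hypothetical gated stand-in).
        self.inter_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        # Channel split point between consistent (shared) and private channels.
        self.num_consistent = int(dim * consistent_ratio)
        fused_dim = self.num_consistent + 2 * (dim - self.num_consistent)
        self.proj = nn.Linear(fused_dim, dim)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb, depth: (B, N, C) token sequences from the two modalities.
        rgb_ctx, _ = self.rgb_self_attn(rgb, rgb, rgb)          # global intra-modal context
        dep_ctx, _ = self.depth_self_attn(depth, depth, depth)

        # Local cross-modal correlation restricted to spatially aligned tokens,
        # avoiding a full N x N attention map across modalities.
        gate = self.inter_gate(torch.cat([rgb_ctx, dep_ctx], dim=-1))  # (B, N, C)
        rgb_cm = rgb_ctx + gate * dep_ctx
        dep_cm = dep_ctx + gate * rgb_ctx

        # Representation disentanglement: consistent channels are fused by
        # mutual enhancement, private channels are kept side by side.
        k = self.num_consistent
        consistent = rgb_cm[..., :k] * dep_cm[..., :k]
        private = torch.cat([rgb_cm[..., k:], dep_cm[..., k:]], dim=-1)
        return self.proj(torch.cat([consistent, private], dim=-1))      # (B, N, C)


if __name__ == "__main__":
    block = DisentangledCrossModalBlock(dim=64)
    rgb_tokens = torch.randn(2, 196, 64)    # e.g. a 14x14 RGB token grid
    depth_tokens = torch.randn(2, 196, 64)  # spatially aligned depth tokens
    print(block(rgb_tokens, depth_tokens).shape)  # torch.Size([2, 196, 64])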
Pages: 1699-1709
Page count: 11
Related Papers
50 records in total
  • [1] RGB-D salient object detection with asymmetric cross-modal fusion
    Yu, Ming
    Xing, Zhang-Hao
    Liu, Yi
    [J]. Kongzhi yu Juece/Control and Decision, 2023, 38(09): 2487-2495
  • [2] Cross-modal hierarchical interaction network for RGB-D salient object detection
    Bi, Hongbo
    Wu, Ranwan
    Liu, Ziqi
    Zhu, Huihui
    Zhang, Cong
    Xiang, Tian-Zhu
    [J]. PATTERN RECOGNITION, 2023, 136
  • [3] Joint Cross-Modal and Unimodal Features for RGB-D Salient Object Detection
    Huang, Nianchang
    Liu, Yi
    Zhang, Qiang
    Han, Jungong
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23: 2428-2441
  • [4] Depth Enhanced Cross-Modal Cascaded Network for RGB-D Salient Object Detection
    Zhao, Zhengyun
    Huang, Ziqing
    Chai, Xiuli
    Wang, Jun
    [J]. NEURAL PROCESSING LETTERS, 2023, 55(01): 361-384
  • [5] Cross-Modal Fusion and Progressive Decoding Network for RGB-D Salient Object Detection
    Hu, Xihang
    Sun, Fuming
    Sun, Jing
    Wang, Fasheng
    Li, Haojie
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132(08): 3067-3085
  • [6] A cross-modal edge-guided salient object detection for RGB-D image
    Liu, Zhengyi
    Wang, Kaixun
    Dong, Hao
    Wang, Yuan
    [J]. NEUROCOMPUTING, 2021, 454: 168-177
  • [7] Multi-modal transformer for RGB-D salient object detection
    Song, Peipei
    Zhang, Jing
    Koniusz, Piotr
    Barnes, Nick
    [C]. 2022 IEEE International Conference on Image Processing (ICIP), 2022: 2466-2470
  • [8] Multi-scale Cross-Modal Transformer Network for RGB-D Object Detection
    Xiao, Zhibin
    Xie, Pengwei
    Wang, Guijin
    [C]. Multimedia Modeling (MMM 2022), Pt I, 2022, 13141: 352-363
  • [9] Cross-modal refined adjacent-guided network for RGB-D salient object detection
    Bi, H.
    Zhang, J.
    Wu, R.
    Tong, Y.
    Jin, W.
    [J]. Multimedia Tools and Applications, 2023, 82(24): 37453-37478