Disentangled Cross-Modal Transformer for RGB-D Salient Object Detection and Beyond

Cited by: 4
Authors
Chen, Hao [1 ,2 ]
Shen, Feihong [1 ,2 ]
Ding, Ding [1 ]
Deng, Yongjian [3 ]
Li, Chao [4 ]
Affiliations
[1] Southeast Univ, Sch Comp Sci & Engn, Nanjing 211189, Peoples R China
[2] Southeast Univ, Key Lab New Generat Artificial Intelligence Techno, Minist Educ, Nanjing 211189, Peoples R China
[3] Beijing Univ Technol, Coll Comp Sci, Beijing 100124, Peoples R China
[4] Alibaba Grp, Hangzhou 311121, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Transformers; Feature extraction; Task analysis; Computer architecture; Computational modeling; Object detection; Context modeling; RGB-D salient object detection; cross-modal attention; disentanglement; transformer; NETWORK; IMAGE;
DOI
10.1109/TIP.2024.3364022
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Previous multi-modal transformers for RGB-D salient object detection (SOD) generally connect all patches from the two modalities directly to model cross-modal correlation, and combine the multi-modal features without differentiation, which can lead to confusing and inefficient fusion. Instead, we disentangle the cross-modal complementarity from two views to reduce fusion ambiguity: 1) Context disentanglement. We argue that modeling long-range dependencies across modalities, as done before, is uninformative due to the severe modality gap. Instead, we disentangle the cross-modal complementary contexts into intra-modal self-attention, which explores a global complementary understanding, and spatially aligned inter-modal attention, which captures local cross-modal correlations. 2) Representation disentanglement. Unlike previous undifferentiated combinations of cross-modal representations, we find that cross-modal cues complement each other both by enhancing common discriminative regions and by mutually supplementing modality-specific highlights. Accordingly, we divide the tokens into consistent and private ones along the channel dimension to disentangle the multi-modal integration path and explicitly boost these two complementary modes. By progressively propagating this strategy across layers, the proposed Disentangled Feature Pyramid (DFP) module enables informative cross-modal, cross-level integration and better fusion adaptivity. Comprehensive experiments on a large variety of public datasets verify the efficacy of our context and representation disentanglement and show consistent improvements over state-of-the-art models. Additionally, our cross-modal attention hierarchy is plug-and-play for different backbone architectures (both transformer and CNN) and downstream tasks; experiments on a CNN-based model and on RGB-D semantic segmentation verify this generalization ability.
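The two disentanglement ideas in the abstract can be illustrated with a small numerical sketch. Everything below is an illustrative assumption, not the authors' implementation: the function names are invented, the spatially aligned inter-modal attention is simplified to a per-position gate, and the consistent/private fusion uses plain multiply/add operators as stand-ins for the paper's learned integration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def intra_modal_self_attention(tokens):
    # Global context within ONE modality: each of the N tokens attends
    # to all N tokens of the same modality (tokens: N x C).
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    return softmax(scores, axis=-1) @ tokens

def spatially_aligned_cross_attention(rgb, depth):
    # Local inter-modal interaction: each RGB token interacts only with
    # the depth token at the SAME spatial position (a per-position gate),
    # instead of attending to every depth patch across the image.
    gate = sigmoid(np.sum(rgb * depth, axis=1, keepdims=True)
                   / np.sqrt(rgb.shape[1]))  # shape (N, 1)
    return rgb + gate * depth

def disentangled_fusion(rgb, depth):
    # Representation disentanglement: split channels into a "consistent"
    # half, fused multiplicatively to enhance shared discriminative
    # regions, and a "private" half, fused additively to preserve
    # modality-specific highlights.
    half = rgb.shape[1] // 2
    consistent = rgb[:, :half] * depth[:, :half]
    private = rgb[:, half:] + depth[:, half:]
    return np.concatenate([consistent, private], axis=1)

# Toy tokens: N=16 spatial positions, C=8 channels per modality.
rgb = rng.standard_normal((16, 8))
depth = rng.standard_normal((16, 8))

rgb_ctx = intra_modal_self_attention(rgb)          # global, intra-modal
cross = spatially_aligned_cross_attention(rgb_ctx, depth)  # local, inter-modal
fused = disentangled_fusion(cross, depth)          # consistent + private paths
print(fused.shape)  # (16, 8)
```

The point of the sketch is the factorization: global attention never crosses the modality gap, cross-modal interaction stays spatially local, and the channel split gives the two complementary fusion behaviors separate paths.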
Pages: 1699-1709 (11 pages)