TSVT: Token Sparsification Vision Transformer for robust RGB-D salient object detection

Cited: 1
Authors
Gao, Lina [1 ]
Liu, Bing [1 ]
Fu, Ping [1 ]
Xu, Mingzhu [2 ]
Affiliations
[1] Harbin Inst Technol, Sch Elect & Informat Engn, Harbin 150001, Heilongjiang, Peoples R China
[2] Shandong Univ, Sch Software, Jinan 250101, Shandong, Peoples R China
Keywords
Salient object detection; RGB-D image; Self-attention mechanism; Vision transformer; Token sparsification;
D O I
10.1016/j.patcog.2023.110190
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision transformer-based salient object detection (SOD) models have attracted increasing research attention. However, existing transformer-based RGB-D SOD models usually operate on the full token sequences of RGB-D images and apply an identical tokenization process to the appearance and depth modalities, which limits feature richness and efficiency. To address these limitations, we present a novel token sparsification vision transformer architecture for RGB-D SOD, named TSVT, which explicitly extracts global-local multi-modality features from sparse tokens. TSVT is an asymmetric encoder-decoder architecture comprising a dynamic sparse token encoder, which adaptively selects and operates on sparse tokens, and a multiple cascade aggregation decoder (MCAD), which predicts the saliency results. Furthermore, we deeply investigate the differences and similarities between the appearance and depth modalities and develop an interactive diversity fusion module (IDFM) to integrate each pair of multi-modality tokens at different stages. Finally, to comprehensively evaluate the proposed model, we conduct extensive experiments on seven standard RGB-D SOD benchmarks with five evaluation metrics. The results show that the proposed model is more robust and effective than fifteen existing RGB-D SOD models. Moreover, with the sparsification module, the complexity of our model is less than half that of the variant without the dynamic sparse token module (DSTM).
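As a rough illustration of the token-sparsification idea, the sketch below prunes a token sequence to its top-scoring fraction before attention is applied. This is not the paper's learned DSTM; the L2-norm score is a hypothetical stand-in for a learned importance predictor, used only to show why dropping tokens cuts the quadratic self-attention cost:

```python
import numpy as np

def sparsify_tokens(tokens, keep_ratio=0.5):
    """Keep only the top-scoring fraction of tokens.

    tokens     : (N, D) array of token embeddings
    keep_ratio : fraction of tokens to retain
    Returns the kept tokens and their original indices.
    """
    # Score each token; here the L2 norm stands in for a learned scorer.
    scores = np.linalg.norm(tokens, axis=-1)
    k = max(1, int(round(keep_ratio * tokens.shape[0])))
    keep_idx = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    keep_idx.sort()                          # preserve original token order
    return tokens[keep_idx], keep_idx

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))  # 16 tokens, 8-dim embeddings
kept, idx = sparsify_tokens(tokens, keep_ratio=0.25)
print(kept.shape)  # (4, 8)
```

Because self-attention cost scales with the square of the sequence length, keeping a quarter of the tokens reduces the attention FLOPs by roughly a factor of sixteen, which is the kind of saving the abstract's complexity comparison refers to.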
Pages: 14
Related papers
(50 records)
  • [11] DVSOD: RGB-D Video Salient Object Detection
    Li, Jingjing
    Ji, Wei
    Wang, Size
    Li, Wenbo
    Cheng, Li
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [12] Disentangled Cross-Modal Transformer for RGB-D Salient Object Detection and Beyond
    Chen, Hao
    Shen, Feihong
    Ding, Ding
    Deng, Yongjian
    Li, Chao
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 1699 - 1709
  • [13] TANet: Transformer-based asymmetric network for RGB-D salient object detection
    Liu, Chang
    Yang, Gang
    Wang, Shuo
    Wang, Hangxu
    Zhang, Yunhua
    Wang, Yutao
    IET COMPUTER VISION, 2023, 17 (04) : 415 - 430
  • [14] Transformer-based difference fusion network for RGB-D salient object detection
    Cui, Zhi-Qiang
    Wang, Feng
    Feng, Zheng-Yong
    JOURNAL OF ELECTRONIC IMAGING, 2022, 31 (06)
  • [15] Advancing in RGB-D Salient Object Detection: A Survey
    Chen, Ai
    Li, Xin
    He, Tianxiang
    Zhou, Junlin
    Chen, Duanbing
    APPLIED SCIENCES-BASEL, 2024, 14 (17):
  • [16] Adaptive Fusion for RGB-D Salient Object Detection
    Wang, Ningning
    Gong, Xiaojin
    IEEE ACCESS, 2019, 7 : 55277 - 55284
  • [17] Swin Transformer-Based Edge Guidance Network for RGB-D Salient Object Detection
    Wang, Shuaihui
    Jiang, Fengyi
    Xu, Boqian
    SENSORS, 2023, 23 (21)
  • [18] Transformer Fusion and Pixel-Level Contrastive Learning for RGB-D Salient Object Detection
    Wu, Jiesheng
    Hao, Fangwei
    Liang, Weiyun
    Xu, Jing
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 1011 - 1026
  • [19] HierNet: Hierarchical Transformer U-Shape Network for RGB-D Salient Object Detection
    Lv, Pengfei
    Yu, Xiaosheng
    Wang, Junxiang
    Wu, Chengdong
    2023 35TH CHINESE CONTROL AND DECISION CONFERENCE, CCDC, 2023, : 1807 - 1811
  • [20] SiaTrans: Siamese transformer network for RGB-D salient object detection with depth image classification
    Jia, XingZhao
    DongYe, ChangLei
    Peng, YanJun
    IMAGE AND VISION COMPUTING, 2022, 127