TSVT: Token Sparsification Vision Transformer for robust RGB-D salient object detection

被引:1
|
作者
Gao, Lina [1 ]
Liu, Bing [1 ]
Fu, Ping [1 ]
Xu, Mingzhu [2 ]
机构
[1] Harbin Inst Technol, Sch Elect & Informat Engn, Harbin 150001, Heilongjiang, Peoples R China
[2] Shangdong Univ, Sch Software, Jinan 250101, Shangdong, Peoples R China
关键词
Salient object detection; RGB-D image; Self-attention mechanism; Vision transformer; Token sparsification;
D O I
10.1016/j.patcog.2023.110190
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Visual transformer-based salient object detection (SOD) models have attracted increasing research attention. However, the existing transformer-based RGB-D SOD models usually operate on the full token sequences of RGB-D images and use an equal tokenization process to treat appearance and depth modalities, which leads to limited feature richness and inefficiency. To address these limitations, we present a novel token sparsification vision transformer architecture for RGB-D SOD, named TSVT, that explicitly extracts global-local multi modality features with sparse tokens. The TSVT is an asymmetric encoder-decoder architecture with a dynamic sparse token encoder that adaptively selects and operates on sparse tokens, along with an multiple cascade aggregation decoder (MCAD) that predicts saliency results. Furthermore, we deeply investigate the differences and similarities between the appearance and depth modalities and develop an interactive diversity fusion module (IDFM) to integrate each pair of multi-modality tokens in different stages. Finally, to comprehensively evaluate the performance of the proposed model, we conduct extensive experiments on seven standard RGB-D SOD benchmarks in terms of five evaluation metrics. The experimental results reveal that the proposed model is more robust and effective than fifteen existing RGB-D SOD models. Moreover, the complexity of our model with the sparsification module is more than two times lower than that of the variant model without the dynamic sparse token module (DSTM).
引用
收藏
页数:14
相关论文
共 50 条
  • [1] GroupTransNet: Group transformer network for RGB-D salient object detection
    Fang, Xian
    Jiang, Mingfeng
    Zhu, Jinchao
    Shao, Xiuli
    Wang, Hongpeng
    [J]. NEUROCOMPUTING, 2024, 594
  • [2] MULTI-MODAL TRANSFORMER FOR RGB-D SALIENT OBJECT DETECTION
    Song, Peipei
    Zhang, Jing
    Koniusz, Piotr
    Barnes, Nick
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 2466 - 2470
  • [3] RGB-D salient object detection: A survey
    Tao Zhou
    Deng-Ping Fan
    Ming-Ming Cheng
    Jianbing Shen
    Ling Shao
    [J]. Computational Visual Media, 2021, 7 : 37 - 69
  • [4] RGB-D salient object detection: A survey
    Tao Zhou
    Deng-Ping Fan
    Ming-Ming Cheng
    Jianbing Shen
    Ling Shao
    [J]. Computational Visual Media, 2021, 7 (01) : 37 - 69
  • [5] RGB-D salient object detection: A survey
    Zhou, Tao
    Fan, Deng-Ping
    Cheng, Ming-Ming
    Shen, Jianbing
    Shao, Ling
    [J]. COMPUTATIONAL VISUAL MEDIA, 2021, 7 (01) : 37 - 69
  • [6] Salient Object Detection in RGB-D Videos
    Mou, Ao
    Lu, Yukang
    He, Jiahao
    Min, Dingyao
    Fu, Keren
    Zhao, Qijun
    [J]. IEEE Transactions on Image Processing, 2024, 33 : 6660 - 6675
  • [7] Calibrated RGB-D Salient Object Detection
    Ji, Wei
    Li, Jingjing
    Yu, Shuang
    Zhang, Miao
    Piao, Yongri
    Yao, Shunyu
    Bi, Qi
    Ma, Kai
    Zheng, Yefeng
    Lu, Huchuan
    Cheng, Li
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 9466 - 9476
  • [8] Lightweight cross-modal transformer for RGB-D salient object detection
    Huang, Nianchang
    Yang, Yang
    Zhang, Qiang
    Han, Jungong
    Huang, Jin
    [J]. Computer Vision and Image Understanding, 2024, 249
  • [9] TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network
    Liu, Zhengyi
    Wang, Yuan
    Tu, Zhengzheng
    Xiao, Yun
    Tang, Bin
    [J]. PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 4481 - 4490
  • [10] CATNet: A Cascaded and Aggregated Transformer Network for RGB-D Salient Object Detection
    Sun, Fuming
    Ren, Peng
    Yin, Bowen
    Wang, Fasheng
    Li, Haojie
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 2249 - 2262