Vision transformer-based salient object detection (SOD) models have attracted increasing research attention. However, existing transformer-based RGB-D SOD models usually operate on the full token sequences of RGB-D images and treat the appearance and depth modalities with an identical tokenization process, which limits feature richness and incurs computational inefficiency. To address these limitations, we present a novel token sparsification vision transformer architecture for RGB-D SOD, named TSVT, that explicitly extracts global-local multi-modality features with sparse tokens. TSVT adopts an asymmetric encoder-decoder architecture comprising a dynamic sparse token encoder, which adaptively selects and operates on sparse tokens, and a multiple cascade aggregation decoder (MCAD), which predicts the saliency maps. Furthermore, we investigate in depth the differences and similarities between the appearance and depth modalities and develop an interactive diversity fusion module (IDFM) to integrate each pair of multi-modality tokens at different stages. Finally, to comprehensively evaluate the proposed model, we conduct extensive experiments on seven standard RGB-D SOD benchmarks using five evaluation metrics. The experimental results show that the proposed model is more robust and effective than fifteen existing RGB-D SOD models. Moreover, the complexity of our model with the sparsification module is less than half that of the variant without the dynamic sparse token module (DSTM).
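
The abstract does not specify how the dynamic sparse token encoder selects tokens. As an illustration only, the sketch below shows one plausible form of dynamic token sparsification, assuming a learned scoring MLP followed by top-k selection (in the spirit of score-based token-pruning transformers); the class name `DynamicSparseTokenSelector` and the `keep_ratio` parameter are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn


class DynamicSparseTokenSelector(nn.Module):
    """Hypothetical sketch: score each token with a small MLP and keep
    only the top-k highest-scoring tokens for later transformer blocks."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Lightweight scoring head producing one importance score per token.
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        b, n, d = tokens.shape
        k = max(1, int(n * self.keep_ratio))
        scores = self.score(tokens).squeeze(-1)       # (batch, num_tokens)
        top_idx = scores.topk(k, dim=1).indices       # (batch, k)
        idx = top_idx.unsqueeze(-1).expand(-1, -1, d)  # (batch, k, dim)
        return tokens.gather(dim=1, index=idx)        # sparse token subset


if __name__ == "__main__":
    # Example: halve a sequence of 196 patch tokens with dim 384.
    selector = DynamicSparseTokenSelector(dim=384, keep_ratio=0.5)
    x = torch.randn(2, 196, 384)
    print(selector(x).shape)  # torch.Size([2, 98, 384])
```

Such a hard top-k keep is one way a sparsification module can cut encoder complexity roughly in proportion to the fraction of tokens dropped, consistent with the more-than-twofold reduction reported relative to the variant without the DSTM.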