The RGB-D salient object detection task still faces three challenges: (1) how to effectively integrate complementary information from different modalities, (2) how to effectively mine the common information shared by features at different levels, and (3) how to detect salient objects in complex scenes, such as those with cluttered backgrounds, low-quality depth maps, small targets, and high foreground-background similarity. To address these challenges, we propose a novel Perceptual Localization and Focus Refinement Network, termed PLFRNet, inspired by the mechanism by which the human visual system captures salient objects in images. The network comprises three key components: an encoder, a Perceptual Localization Module (PLM), and a Focus-Refinement Decoder (FRD). Specifically, we first adopt a two-stream asymmetric Pyramid Vision Transformer as the encoder to extract RGB and depth features. We then develop the PLM, guided by a carefully designed Perceptual Localization Unit (PLU); this module mines the common information of features at different levels and integrates the advantageous information from both modalities to localize salient objects. Finally, we propose the FRD, which focuses on detailed information under the guidance of an attention mechanism and progressively refines the localized objects through interaction with low-level features to complete salient object detection. Extensive experiments show that the proposed method achieves state-of-the-art performance compared with 13 RGB-D models on 6 public datasets. The code is released at https://github.com/hjy0518/PLFRNet/.
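
For illustration, the following minimal PyTorch sketch outlines the two-stream localize-then-refine pipeline described above. The internals of PLM, PLU, and FRD shown here are assumptions, since the abstract does not specify them (the released code is the authoritative reference): simple convolutional stems stand in for the asymmetric Pyramid Vision Transformer backbones, channel attention stands in for the PLU, and a single cross-level refinement step stands in for the full decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PLM(nn.Module):
    """Sketch of the Perceptual Localization Module: fuses RGB and depth
    features from one encoder level (internals are illustrative)."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(channels * 2, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Channel attention as a stand-in for the Perceptual Localization Unit
        self.plu = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, depth_feat):
        fused = self.fuse(torch.cat([rgb_feat, depth_feat], dim=1))
        return fused * self.plu(fused)  # reweight fused features


class FRD(nn.Module):
    """Sketch of the Focus-Refinement Decoder: refines the coarse
    localization by interacting with low-level detail features."""
    def __init__(self, high_ch, low_ch):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(high_ch + low_ch, low_ch, 3, padding=1),
            nn.BatchNorm2d(low_ch),
            nn.ReLU(inplace=True),
        )
        self.predict = nn.Conv2d(low_ch, 1, 1)

    def forward(self, high_feat, low_feat):
        high_up = F.interpolate(high_feat, size=low_feat.shape[2:],
                                mode="bilinear", align_corners=False)
        refined = self.refine(torch.cat([high_up, low_feat], dim=1))
        return self.predict(refined)  # single-channel saliency logits


class PLFRNetSketch(nn.Module):
    """Skeleton of the two-stream pipeline; conv stems are placeholders
    for the asymmetric Pyramid Vision Transformer encoders."""
    def __init__(self, low_ch=64, high_ch=128):
        super().__init__()
        self.rgb_stem = nn.Conv2d(3, high_ch, 3, stride=4, padding=1)
        self.depth_stem = nn.Conv2d(1, high_ch, 3, stride=4, padding=1)
        self.low_stem = nn.Conv2d(3, low_ch, 3, stride=2, padding=1)
        self.plm = PLM(high_ch)
        self.frd = FRD(high_ch, low_ch)

    def forward(self, rgb, depth):
        low = self.low_stem(rgb)  # low-level detail features
        fused = self.plm(self.rgb_stem(rgb), self.depth_stem(depth))
        logits = self.frd(fused, low)
        return F.interpolate(logits, size=rgb.shape[2:],
                             mode="bilinear", align_corners=False)


if __name__ == "__main__":
    net = PLFRNetSketch()
    saliency = torch.sigmoid(net(torch.rand(1, 3, 256, 256),
                                 torch.rand(1, 1, 256, 256)))
    print(saliency.shape)  # torch.Size([1, 1, 256, 256])
```

In the full model, the PLM is applied across multiple encoder levels to mine their common information, and the FRD repeats the refinement step level by level; the sketch above collapses both to a single stage for brevity.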