The extrinsic calibration of a LiDAR-camera system is a prerequisite for multimodal fusion. Target-based methods can obtain precise extrinsic parameters offline; however, they are time-consuming and laborious, and during long-term operation in real scenes, unforeseen movements may cause the LiDAR-camera system to drift, making recalibration necessary. Therefore, this article proposes a targetless calibration method, which first uses a hand-eye estimator to reduce the cross-modality divergence for coarse calibration and then introduces a transformer-based deep network with local-to-global constraints to regress the fine six-DoF extrinsic parameters. The deep network is composed of four modules: a token pyramid module reduces the size of the transformer; a transformer fuser module tightly couples the point cloud and image data, exploiting its self-attention mechanism to fuse the global spatial context of the two modalities; a feature injection module injects local tokens of the corresponding scale into the global features to augment the representation for subsequent semantic feature extraction; and a multiconstraint module constructs a loss function comprising a global regression constraint on the six-DoF extrinsic parameters, a local depth-projection constraint, and a semantic edge constraint, further improving the robustness and generalization ability of the proposed network. Finally, an iterative refinement strategy is introduced to make the calibration results more precise. Experiments conducted on the KITTI and Newer College datasets verify that the proposed method achieves promising performance. Ablation studies also demonstrate the effectiveness of the proposed modules.
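To illustrate how the three constraints named above could be combined into a single training objective, the following is a minimal PyTorch sketch, not the authors' implementation; the loss weights, depth maps, and edge maps are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multi_constraint_loss(pred_rt, gt_rt, pred_depth, gt_depth,
                          pred_edges, gt_edges,
                          w_global=1.0, w_depth=0.5, w_edge=0.1):
    """Hypothetical combination of the three constraints described in the abstract:
    global six-DoF regression, local depth projection, and semantic edge alignment."""
    # Global regression constraint: penalize the gap between the predicted and
    # ground-truth six-DoF extrinsic vectors (smooth-L1 chosen here as one option).
    loss_global = F.smooth_l1_loss(pred_rt, gt_rt)

    # Local depth-projection constraint: compare the LiDAR depth map projected with
    # the predicted extrinsics against the one projected with the ground-truth
    # extrinsics, evaluated only at pixels where LiDAR points actually project.
    mask = (gt_depth > 0).float()
    loss_depth = (mask * (pred_depth - gt_depth).abs()).sum() / mask.sum().clamp(min=1.0)

    # Semantic edge constraint: encourage projected point-cloud edges to align with
    # image edge maps (binary cross-entropy on per-pixel edge probabilities).
    loss_edge = F.binary_cross_entropy(pred_edges, gt_edges)

    return w_global * loss_global + w_depth * loss_depth + w_edge * loss_edge
```

The relative weights would in practice be tuned so that the global regression term drives the six-DoF estimate while the depth and edge terms act as local geometric and semantic regularizers.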