Modulation classification, the identification of different types of wireless signals, is a research hotspot in machine learning and knowledge-based wireless communication systems. However, radio signals typically form long-sequence, large-scale data, and modulation classification is further degraded by environmental noise, so existing methods often yield unsatisfactory performance. To overcome these challenges, we propose a novel Deep Hybrid Transformer Network (DH-TR) that makes full use of the intrinsic properties of multi-head self-attention and complementary neural modules, facilitating the identification of modulation types from global to local perspectives. In particular, DH-TR uses a convolution stem to extract local features inherent in In-phase and Quadrature (IQ) data, on top of which a Gated Recurrent Unit (GRU) captures the step-wise sequential patterns of the signals. Afterward, a self-attention-based Transformer branch learns the global, long-term dependencies among the signal patches. With this innovative hybrid design, DH-TR processes sequential signal data effectively and better captures the complex relationships within signals. We validate DH-TR's effectiveness through extensive experiments on four benchmark datasets, demonstrating superior classification accuracy and stronger robustness to noise than competing methods.
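The pipeline described above (convolution stem, then GRU, then a Transformer branch over the resulting sequence) can be sketched as follows. This is a minimal illustrative assumption of how such a hybrid could be wired up in PyTorch; the layer sizes, kernel width, fusion by mean pooling, and the class `DHTRSketch` are hypothetical and not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class DHTRSketch(nn.Module):
    """Illustrative sketch of a conv-stem + GRU + Transformer hybrid."""
    def __init__(self, n_classes=11, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        # Convolution stem: extracts local features from 2-channel IQ input.
        self.stem = nn.Sequential(
            nn.Conv1d(2, d_model, kernel_size=7, padding=3),
            nn.ReLU(),
        )
        # GRU branch: captures step-wise sequential patterns.
        self.gru = nn.GRU(d_model, d_model, batch_first=True)
        # Transformer branch: self-attention over signal steps for
        # global, long-term dependencies.
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, iq):                 # iq: (batch, 2, seq_len)
        x = self.stem(iq).transpose(1, 2)  # -> (batch, seq_len, d_model)
        x, _ = self.gru(x)                 # step-wise sequential features
        x = self.transformer(x)            # global self-attention
        return self.head(x.mean(dim=1))    # pooled logits per class

logits = DHTRSketch()(torch.randn(4, 2, 128))
print(logits.shape)  # (batch, n_classes) -> torch.Size([4, 11])
```

A real implementation would additionally split the sequence into patches before the Transformer branch, as the abstract mentions attention "among the signal patches"; here the per-step features stand in for patches to keep the sketch short.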