In gearbox fault diagnosis based on vibration and torque signals, traditional one-dimensional time-frequency domain analysis methods often express and mine features insufficiently and require complex denoising and filtering as preprocessing. To address this issue, this paper proposes an image generation method that combines the strengths of the recurrence plot (RP) and the Gramian angular summation field (GASF) to produce recurrence Gramian transformed (RGT) images. These images capture both global and local fault information, making fault characteristics more intuitive and easier to analyze. Multi-sensor collaboration can enhance feature representation, but feature-level fusion increases the computational burden and decision-level fusion tends to lose inter-sensor correlation information; this paper therefore adopts data-level fusion to enhance the image samples. On the diagnostic side, traditional convolutional neural networks (CNNs) struggle to extract the diverse geometric linear structures present in the fused images, so deformable convolutional blocks are introduced for initial feature extraction. A multi-scale feature fusion interaction network (MFFIN) is then constructed, which adds a channel-spatial interactive attention mechanism on top of multi-scale feature extraction, weighting features by their importance while promoting interaction among feature information. Finally, the method is validated on public datasets. The experimental results show advantages in classification accuracy and in robustness under variable operating conditions and noise, supporting the effectiveness and practicality of the proposed method.
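As a rough illustration of the two image encodings named above, the following minimal sketch computes a GASF and a thresholded RP from a one-dimensional signal using their standard textbook definitions. It is not the paper's implementation. The function names, the threshold `eps`, the omission of phase-space embedding in the RP, and the synthetic sine signal are all assumptions made for illustration, and the step that combines the two images into an RGT image is not specified in this abstract, so it is deliberately left out.

```python
import numpy as np

def rescale(x):
    """Min-max rescale a 1-D signal to [-1, 1] (needed before arccos in GASF)."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        return np.zeros_like(x)  # constant signal: map everything to 0
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def gasf(x):
    """Gramian angular summation field: G[i, j] = cos(phi_i + phi_j)."""
    s = np.clip(rescale(x), -1.0, 1.0)  # guard against rounding outside [-1, 1]
    phi = np.arccos(s)                  # polar-angle encoding of each sample
    return np.cos(phi[:, None] + phi[None, :])

def recurrence_plot(x, eps=0.1):
    """Binary recurrence plot: R[i, j] = 1 where |s_i - s_j| <= eps.

    Assumption: the scalar series is used directly, without time-delay
    embedding; eps is a hypothetical threshold, not a value from the paper.
    """
    s = rescale(x)
    dist = np.abs(s[:, None] - s[None, :])
    return (dist <= eps).astype(float)

# Hypothetical stand-in for a vibration segment: a 5 Hz sine, 64 samples.
t = np.linspace(0.0, 1.0, 64)
signal = np.sin(2 * np.pi * 5 * t)

G = gasf(signal)                       # 64 x 64, values in [-1, 1]
R = recurrence_plot(signal, eps=0.1)   # 64 x 64, values in {0, 1}
```

Both outputs are square, symmetric matrices of the same size as the input segment, which is what allows them to be treated as images and, in principle, combined pixel by pixel.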