RGB-T crowd counting (RGB-T CC) aims to estimate crowd size by exploiting the complementary information in visible and thermal images. Current deep models for RGB-T CC typically adopt a three-tier architecture, featuring a middle fusion layer that aggregates the RGB and thermal streams. However, we find that this dedicated fusion layer dominates the training process and leaves the two modality branches under-optimized, which becomes the performance bottleneck of mainstream multi-modal counting models. To address this challenge, we propose a simple yet effective counting architecture, the Spatial Exchanging Fusion Network (SEFNet). It is built on a Dual Attention Guided Spatial Exchanging (DASE) mechanism, which directly extracts and exchanges modality-complementary features between the two modalities without the extra fusion branch employed in most existing works. This design yields more balanced gradient back-propagation across the network and thus better-optimized multi-modality fused representations than prior models. In addition, the Modality Gradient Enhancement Module (MGEM) in SEFNet effectively learns modality-specific crowd representations through two counting sub-tasks, dynamically improving the gradient distribution and further strengthening the optimization of both modalities. Extensive experiments demonstrate that SEFNet significantly outperforms state-of-the-art methods on mainstream benchmark datasets and exhibits promising generalization across various counting backbones and losses.
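To make the attention-guided exchanging idea concrete, the following is a minimal PyTorch-style sketch of a dual (channel and spatial) attention gate that exchanges features between RGB and thermal maps without a separate fusion branch. It is an illustrative assumption of how such a module could look, not the authors' implementation: the class name, the shared gating networks, the sigmoid gates, and the exchange rule are all hypothetical.

```python
import torch
import torch.nn as nn


class DualAttentionSpatialExchange(nn.Module):
    """Hypothetical sketch of attention-guided spatial exchange between two
    modality feature maps (RGB and thermal). All design choices here are
    illustrative assumptions, not the published DASE module."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: global average pooling + bottleneck MLP (shared
        # across modalities here only for brevity of the sketch).
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: single-channel confidence map per location.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor):
        # Dual attention (channel x spatial), broadcast to (B, C, H, W).
        gate_rgb = self.channel_gate(rgb) * self.spatial_gate(rgb)
        gate_thm = self.channel_gate(thermal) * self.spatial_gate(thermal)
        # Exchange: each stream keeps features where its own gate is high and
        # imports the other modality's features where its gate is low.
        rgb_out = gate_rgb * rgb + (1.0 - gate_rgb) * thermal
        thm_out = gate_thm * thermal + (1.0 - gate_thm) * rgb
        return rgb_out, thm_out


# Example usage on dummy feature maps from the two backbone streams.
dase = DualAttentionSpatialExchange(channels=64)
rgb_feat = torch.randn(2, 64, 32, 32)
thm_feat = torch.randn(2, 64, 32, 32)
rgb_out, thm_out = dase(rgb_feat, thm_feat)
```

Because the exchange keeps both streams active (rather than routing everything through a separate fusion branch), gradients from the counting loss flow back into both modality backbones, which is the balancing effect the abstract describes.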