Semantic segmentation of road scenes aims to assign a category to each pixel in a road image and plays a crucial role in applications such as autonomous driving, scene recognition, and robotics. However, existing semantic segmentation methods struggle to maintain competitive accuracy while achieving real-time inference speed. To address this, we introduce the multi-stage feature fusion network (MSF2Net) for real-time semantic segmentation of road scenes, which strikes a balance between speed and accuracy. First, we develop a lightweight dilate symmetric interaction module (DSIM) to extract rich local and contextual information from images. Next, we enhance the spatial information of shallow features and the semantic information of deep features by stacking different numbers of DSIMs with varying dilation rates. Then, a position attention module supplements the global information of the image, and a feature fusion module integrates the shallow, deep, and global features, yielding real-time semantic segmentation through multi-stage feature fusion. Experimental results on multiple benchmark datasets demonstrate that MSF2Net achieves a good balance between segmentation performance and inference speed. © 2024 SPIE and IS&T.
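To make the role of dilation rates concrete, the following is a minimal NumPy sketch of a single-channel dilated convolution, the basic operation underlying modules like DSIM. It is an illustrative assumption, not the authors' implementation: the function name `dilated_conv2d` and the naive "valid" (no padding, stride 1) formulation are hypothetical, chosen only to show how a larger dilation rate enlarges the receptive field without adding parameters.

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation=1):
    """Naive 'valid' 2D correlation with a dilated kernel (no padding, stride 1).

    A k x k kernel with dilation d covers an effective window of
    (k-1)*d + 1 pixels per side, sampling every d-th pixel in it.
    """
    kh, kw = kernel.shape
    # Effective receptive field of the dilated kernel
    eh, ew = (kh - 1) * dilation + 1, (kw - 1) * dilation + 1
    H, W = x.shape
    out = np.zeros((H - eh + 1, W - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sample the input at strides of `dilation` inside the window
            patch = x[i:i + eh:dilation, j:j + ew:dilation]
            out[i, j] = np.sum(patch * kernel)
    return out
```

For example, on a 5x5 input a 3x3 kernel with dilation 1 produces a 3x3 output, while the same kernel with dilation 2 spans the full 5x5 input and produces a single value, illustrating how stacking such operators with different rates (as in the DSIM stacks) trades spatial detail for context.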