Near-infrared (NIR) band sensors capture achromatic images that contain complementary details of a scene, details that are diminished in visible (VS) band images when the scene is obscured by haze, mist, or fog. To exploit these complementary details, this paper proposes an integrated FPGA architecture and implementation of a video processing system that performs VS-NIR video fusion and produces an enhanced VS video in real time. The proposed architecture and implementation effectively handle the challenges of simultaneously processing video signals from different sources, namely the inevitable delay between corresponding frames and the time-varying deviation between frame rates. Moreover, the implementation is efficient: it produces the fused video at the same frame rate as the input videos, i.e., in real time, regardless of the input resolution, while keeping the consumed FPGA resources small. This is achieved by reusing data and calculations, and by performing operations concurrently, in both parallel and pipelined fashions, at the data and task levels. The proposed implementation is synthesized, validated on a low-end FPGA device, and compared to three other implementations. The comparison shows the superiority of the proposed implementation in terms of consumed resources, which has a direct industrial impact for integration into modern smartphones and cameras.
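To make the synchronization challenge concrete, the sketch below is a minimal software analogue of one common approach: each source continuously overwrites a latest-frame slot, and the fusion loop samples both slots at the output rate, so frames are implicitly dropped or repeated as the two frame rates drift apart. This is purely illustrative and is not the paper's hardware design; all names (Frame, LatestFrameSlot, fuse) and the chosen frame rates are hypothetical. In an FPGA, the analogous role would typically be played by frame buffers rather than mutex-guarded memory.

```cpp
// Software analogue of cross-source frame synchronization (illustrative only).
// Two capture threads with mismatched periods model the frame-rate deviation;
// the fusion loop runs at a fixed output rate and always reads the newest
// complete frame from each source.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

struct Frame {
    uint64_t seq = 0;                 // capture sequence number
    std::vector<uint8_t> pixels;      // luma samples (placeholder)
};

// Single-slot buffer: the writer overwrites, the reader always gets the
// newest complete frame; frames are dropped or repeated as rates drift.
class LatestFrameSlot {
public:
    void publish(Frame f) {
        std::lock_guard<std::mutex> lock(m_);
        slot_ = std::move(f);
    }
    Frame snapshot() {
        std::lock_guard<std::mutex> lock(m_);
        return slot_;                 // copy out, so the writer is never held long
    }
private:
    std::mutex m_;
    Frame slot_;
};

// Placeholder fusion: average overlapping samples (stands in for VS-NIR fusion).
Frame fuse(const Frame& vs, const Frame& nir) {
    Frame out;
    size_t n = std::min(vs.pixels.size(), nir.pixels.size());
    out.pixels.resize(n);
    for (size_t i = 0; i < n; ++i)
        out.pixels[i] = static_cast<uint8_t>((vs.pixels[i] + nir.pixels[i]) / 2);
    return out;
}

int main() {
    LatestFrameSlot vsSlot, nirSlot;
    std::atomic<bool> run{true};

    auto capture = [&run](LatestFrameSlot& slot, int periodMs) {
        uint64_t seq = 0;
        while (run) {
            Frame f;
            f.seq = seq++;
            f.pixels.assign(640 * 480, static_cast<uint8_t>(seq));
            slot.publish(std::move(f));
            std::this_thread::sleep_for(std::chrono::milliseconds(periodMs));
        }
    };

    // Deliberately mismatched capture periods model the rate deviation.
    std::thread vsCam(capture, std::ref(vsSlot), 33);   // ~30 fps VS source
    std::thread nirCam(capture, std::ref(nirSlot), 40); // ~25 fps NIR source

    std::this_thread::sleep_for(std::chrono::milliseconds(100)); // warm-up
    for (int i = 0; i < 30; ++i) {                      // output at ~30 fps
        Frame fused = fuse(vsSlot.snapshot(), nirSlot.snapshot());
        (void)fused;                                    // would be displayed
        std::this_thread::sleep_for(std::chrono::milliseconds(33));
    }
    run = false;
    vsCam.join();
    nirCam.join();
}
```

The key behavior the sketch conveys is that the output rate stays fixed regardless of how the input rates deviate: the slower source's frames are repeated and the faster source's frames are dropped, which is the drop-or-repeat policy a hardware frame buffer enforces naturally.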