Hyperspectral images (HSIs) encompass data across numerous spectral bands, making them valuable in practical fields such as remote sensing, agriculture, and marine monitoring. Unfortunately, noise inevitably introduced during acquisition limits their applicability, making denoising essential for effective utilization. Existing deep learning (DL)-based denoising methods suffer from various limitations: convolutional neural networks (CNNs) struggle to model long-range dependencies, while vision transformers (ViTs) have difficulty capturing fine local details. This article introduces a novel method, UNFOLD, that addresses these inherent limitations by harmoniously integrating the strengths of 3-D U-Net, 3-D CNN, and 3-D Transformer architectures. Unlike several existing methods that predominantly capture dependencies along either the spatial or the spectral dimension, UNFOLD treats HSI denoising as a 3-D task, synergizing spatial and spectral information through the combined use of a 3-D Transformer and a 3-D CNN. It employs the self-attention (SA) mechanism of Transformers to capture global dependencies and model long-range relationships across the spatial and spectral dimensions. To overcome the limitations of the 3-D Transformer in capturing fine-grained local spatial features, UNFOLD complements it with a 3-D CNN. Moreover, UNFOLD utilizes a modified 3-D U-Net architecture for HSI denoising, in which the conventional 3-D CNN-based encoder is replaced by a 3-D Transformer-based encoder. It further capitalizes on the ability of the U-Net to integrate features across multiple scales, thereby enhancing efficacy by preserving intricate structural details. Results from extensive experiments demonstrate that UNFOLD outperforms state-of-the-art HSI denoising methods.
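
To make the described architecture concrete, the following is a minimal PyTorch sketch of the idea summarized above: a 3-D U-Net-style denoiser whose encoder applies self-attention over flattened spatio-spectral tokens and whose decoder uses 3-D convolutions, joined by a skip connection across scales. The module names, channel widths, single-level depth, and residual (noise-prediction) output are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a 3-D U-Net with a Transformer-based encoder and a 3-D CNN
# decoder, in the spirit of the abstract above. All hyperparameters are assumed.
import torch
import torch.nn as nn


class TransformerEncoderBlock3D(nn.Module):
    """Self-attention over all voxels of a 3-D (bands x height x width) feature volume."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(channels, channels * 2),
                                 nn.GELU(),
                                 nn.Linear(channels * 2, channels))

    def forward(self, x):                      # x: (B, C, D, H, W)
        b, c, d, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, D*H*W, C): joint spatial-spectral tokens
        y = self.norm1(tokens)
        y, _ = self.attn(y, y, y)              # global dependencies across space and spectrum
        tokens = tokens + y
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens.transpose(1, 2).reshape(b, c, d, h, w)


class Unet3DSketch(nn.Module):
    """One-level 3-D U-Net: Transformer encoder, 3-D CNN decoder, one skip connection."""

    def __init__(self, channels: int = 16):
        super().__init__()
        self.stem = nn.Conv3d(1, channels, 3, padding=1)
        self.encoder = TransformerEncoderBlock3D(channels)
        self.down = nn.Conv3d(channels, channels * 2, 3, stride=2, padding=1)
        self.bottleneck = TransformerEncoderBlock3D(channels * 2)
        self.up = nn.ConvTranspose3d(channels * 2, channels, 2, stride=2)
        # 3-D CNN decoder recovers fine local detail from the concatenated skip features.
        self.decoder = nn.Sequential(nn.Conv3d(channels * 2, channels, 3, padding=1),
                                     nn.ReLU(),
                                     nn.Conv3d(channels, 1, 3, padding=1))

    def forward(self, noisy):                  # noisy: (B, 1, bands, H, W)
        e = self.encoder(self.stem(noisy))
        btm = self.bottleneck(self.down(e))
        u = self.up(btm)
        residual = self.decoder(torch.cat([u, e], dim=1))
        return noisy - residual                # assumed residual formulation: predict and subtract noise


if __name__ == "__main__":
    cube = torch.randn(1, 1, 16, 32, 32)       # toy HSI cube: 16 bands, 32x32 pixels
    print(Unet3DSketch()(cube).shape)          # torch.Size([1, 1, 16, 32, 32])
```

In this sketch, the Transformer block supplies the long-range spatial-spectral modeling attributed to the encoder, while the convolutional decoder and the skip connection illustrate how local detail and multi-scale features could be reintegrated; the real method's block counts, attention design, and training objective are given in the paper itself.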