Shadow detection helps reduce ambiguity in object detection and tracking. However, existing shadow detection methods treat all cases equally and therefore tend to misidentify complex shadows and similar patterns, such as soft shadow regions and shadow-like regions, which leaves the detected shadow regions structurally incomplete. To alleviate this issue, we propose a structure-aware transformer network (STNet) for robust shadow detection. Specifically, we first develop a transformer-based shadow detection network to learn significant contextual information interactions. To this end, we introduce a context-aware enhancement (CaE) block into the backbone to expand the receptive field and thereby enhance semantic interaction. Then, we design an edge-guided multi-task learning framework to produce intermediate and main predictions with rich structure. By fusing these two complementary predictions, we obtain an edge-preserving refined shadow map. Finally, we introduce an auxiliary semantic-aware learning scheme to overcome interference from complex scenes; it uses a semantic affinity loss to help the model distinguish shadow from non-shadow regions. Together, these components allow us to predict high-quality shadow maps in different scenarios. Experimental results demonstrate that our method reduces the balance error rate (BER) by 4.53%, 2.54%, and 3.49% compared to state-of-the-art (SOTA) methods on the benchmark datasets SBU, ISTD, and UCF, respectively.
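
For readers unfamiliar with the metric, the sketch below computes BER using the definition commonly adopted in the shadow detection literature, BER = (1 - 0.5 * (TP / N_p + TN / N_n)) * 100, where N_p and N_n are the numbers of ground-truth shadow and non-shadow pixels. This is an illustration under that assumption, not the evaluation code behind the numbers above; the function name `balance_error_rate` and the toy masks are hypothetical.

```python
from typing import Sequence


def balance_error_rate(pred: Sequence[Sequence[int]],
                       gt: Sequence[Sequence[int]]) -> float:
    """Return BER in percent for binary masks (1 = shadow, 0 = non-shadow).

    Assumes the common definition: the average of the shadow and
    non-shadow error rates, so both classes contribute equally
    regardless of how many pixels each one covers.
    """
    tp = tn = n_pos = n_neg = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if g == 1:
                n_pos += 1
                tp += int(p == 1)   # shadow pixel correctly detected
            else:
                n_neg += 1
                tn += int(p == 0)   # non-shadow pixel correctly rejected
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ground truth must contain both shadow and non-shadow pixels")
    return (1.0 - 0.5 * (tp / n_pos + tn / n_neg)) * 100.0


# Hypothetical 2x4 masks: one missed shadow pixel and one false alarm.
gt = [[1, 1, 0, 0],
      [1, 1, 0, 0]]
pred = [[1, 0, 0, 0],
        [1, 1, 1, 0]]
ber = balance_error_rate(pred, gt)
print(f"BER = {ber:.2f}")
```

Lower BER is better. Because the two class-wise accuracies are averaged, a model cannot score well simply by favoring the larger non-shadow class.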