Net primary productivity (NPP) is a core ecological indicator for carbon estimation and plateau climate change research, and effective prediction of grassland NPP can inform macro-level grassland management. Existing models predict grassland NPP with limited accuracy, and few studies have examined how different inputs and model structures affect the performance of grassland NPP prediction models. This study accounts for the spatiotemporal characteristics of grassland NPP by integrating spatial clustering with the Seasonal-Trend decomposition using Loess (STL) algorithm. Based on phenological data from the source region of the Yellow River from 2000 to 2021 and the Carnegie-Ames-Stanford Approach (CASA) model, a deep learning model suited to grassland NPP is proposed for monthly spatiotemporal NPP forecasting. Predictive accuracy is compared across a baseline recurrent network (GRU) and six deep learning architectures (CNN, LSTM, CNN-LSTM, CNN-BiLSTM, CNN-BiLSTM-Attention, and STL-CNN-BiLSTM-Attention). The results show that grassland NPP in the source region of the Yellow River exhibits significant spatial correlation and periodicity. In spatial clustering-based prediction, Group1 performs best, outperforming the other clustering methods and the overall prediction. In long-term prediction, the best-performing model is STL-CNN-BiLSTM-Attention, whereas for prediction over the whole grassland growing season, the best-performing model is CNN-BiLSTM-Attention. The study indicates that combining spatial clustering, seasonal decomposition, and hybrid deep models outperforms single-architecture models such as CNN and GRU in predicting grassland NPP. This approach provides robust technical support for future studies on grass yield forecasting and vegetation monitoring.
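
To illustrate the seasonal decomposition step mentioned above, the following is a minimal sketch of splitting a monthly series into trend, seasonal, and residual components. It is not the STL algorithm used in the study: it assumes a classical additive decomposition with a centered moving-average trend as a simpler stand-in for STL's Loess smoothing. The function names and the synthetic series are hypothetical and are not NPP values from the study.

```python
import math
from statistics import mean


def centered_moving_average(series, period):
    """Centered moving average for an even period; edges are left as None."""
    half = period // 2
    trend = [None] * len(series)
    for i in range(half, len(series) - half):
        window = series[i - half : i + half + 1]
        # Half weight on the two end points so the window spans exactly one period.
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[i] = total / period
    return trend


def decompose_additive(series, period=12):
    """Classical additive decomposition: series = trend + seasonal + residual.

    Assumes an even period and at least two full periods of data.
    """
    trend = centered_moving_average(series, period)
    detrended = [x - t if t is not None else None for x, t in zip(series, trend)]

    # Average the detrended values at each position within the cycle.
    seasonal_means = []
    for k in range(period):
        values = [d for i, d in enumerate(detrended) if d is not None and i % period == k]
        seasonal_means.append(mean(values))

    # Center the seasonal component so it sums to zero over one period.
    offset = mean(seasonal_means)
    seasonal_means = [s - offset for s in seasonal_means]
    seasonal = [seasonal_means[i % period] for i in range(len(series))]

    residual = [
        x - t - s if t is not None else None
        for x, t, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, residual


# Hypothetical monthly series: linear trend plus an annual cycle (4 years).
synthetic = [100 + 0.5 * i + 30 * math.sin(2 * math.pi * i / 12) for i in range(48)]
trend, seasonal, residual = decompose_additive(synthetic, period=12)
```

In a pipeline of the kind the abstract describes, each component would be forecast separately or fed to the network as an additional input. A Loess-based STL implementation would replace the moving-average trend here and is more robust to outliers and to slowly changing seasonality.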