Previous video matting methods suffer from two problems: they require additional auxiliary information and they lack temporal consistency. To address these problems, we propose a novel video matting framework (STMI-Net) based on temporal-spatial information mining and aggregation. The framework requires no auxiliary information and adopts a dual-decoder structure: one decoder is a recurrent network that fully exploits the temporal information across video frames to ensure temporal coherence in the results, while the other is a convolutional network that deeply restores frame-by-frame spatial features to achieve spatial continuity. By aggregating these two streams of information at the global level, our model achieves an MSE of 0.0066 on the VideoMatte240K dataset, surpassing the RVM baseline by 13%, and an MSE of 0.0047 on the PPM-100 portrait matting dataset, surpassing the MG baseline by 26.5%. We also conduct an ablation study to demonstrate the specific contributions of the temporal decoder and the spatial decoder in our model.
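To make the dual-decoder idea concrete, the following is a minimal, illustrative PyTorch sketch of the structure described above: a shared frame encoder, a recurrent (ConvGRU-style) temporal decoder, a convolutional spatial decoder, and a fusion step that aggregates the two streams before predicting the alpha matte. All module names (`STMINetSketch`, `ConvGRUCell`), channel sizes, and layer choices are assumptions for illustration only, not the authors' actual STMI-Net architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvGRUCell(nn.Module):
    """A basic convolutional GRU cell used to carry temporal information across frames."""
    def __init__(self, channels):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x, h):
        if h is None:
            h = torch.zeros_like(x)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        n = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * n


class STMINetSketch(nn.Module):
    """Shared encoder + recurrent temporal decoder + convolutional spatial decoder,
    with the two streams aggregated before the final alpha prediction."""
    def __init__(self, feat=32):
        super().__init__()
        self.encoder = nn.Sequential(               # shared per-frame encoder
            nn.Conv2d(3, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.temporal = ConvGRUCell(feat)            # recurrent temporal decoder
        self.spatial = nn.Sequential(                # frame-by-frame spatial decoder
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.fuse = nn.Conv2d(2 * feat, 1, 3, padding=1)  # aggregate both streams -> alpha

    def forward(self, frames):                       # frames: (B, T, 3, H, W)
        h, alphas = None, []
        for t in range(frames.shape[1]):
            x = self.encoder(frames[:, t])
            h = self.temporal(x, h)                  # temporal stream (recurrent state)
            s = self.spatial(x)                      # spatial stream (per frame)
            a = torch.sigmoid(self.fuse(torch.cat([h, s], dim=1)))
            a = F.interpolate(a, size=frames.shape[-2:], mode="bilinear",
                              align_corners=False)
            alphas.append(a)
        return torch.stack(alphas, dim=1)            # alpha mattes: (B, T, 1, H, W)


if __name__ == "__main__":
    clip = torch.rand(2, 4, 3, 64, 64)               # dummy 4-frame clip
    print(STMINetSketch()(clip).shape)               # torch.Size([2, 4, 1, 64, 64])
```

The key design point mirrored here is that the recurrent decoder propagates a hidden state across frames (temporal coherence), while the convolutional decoder processes each frame independently (spatial detail), and only the fusion layer combines the two before the alpha output.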