The deep fusion of sports and machine vision has become a research hotspot in sports video target detection, athlete state recovery, and sports promotion. Deep learning makes it possible to process large volumes of sports video, build and analyze human body detection models, and detect and evaluate the posture of the athletes in each video, which reduces cost and supports more professional athlete training. To this end, this paper generates automatic language descriptions of sports video based on a time-sharing memory algorithm. Its principle is to decompose athletes' motion data accurately through the mapping between the letter sequence and the corresponding video sequence in time-sharing memory. To capture the key postures in athletes' sports video, this paper proposes an object extraction algorithm based on athletes' skeleton motion enhancement. In practical application, a depth selection network must first be trained to extract the key poses of the skeleton; this network enhances the key postures in the skeleton information and expresses their features accurately. After the athlete's skeleton information has been extracted, the trained network is fine-tuned to recognize the key features accurately. Based on these algorithms, this paper designs a deep-learning-based athlete detection system for sports video and evaluates it experimentally on sports video. The experimental results show that the detection accuracy on athletes' sports video is nearly 10% higher than that of the traditional convolutional network recognition algorithm, a clear advantage in recognition accuracy.
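To make the idea of key pose capture from skeleton motion concrete, the following is a minimal sketch only, not the depth selection network described above. It assumes each frame is a list of 2D joint coordinates and swaps the learned network for a simple hand-written heuristic: rank frames by the total joint displacement from the previous frame and keep the top `k`. The names `motion_energy` and `select_key_poses` and the toy data are hypothetical.

```python
import math


def motion_energy(frames):
    """Per-frame sum of joint displacements relative to the previous frame.

    frames: list of frames, each a list of (x, y) joint coordinates.
    The first frame has no predecessor, so its energy is 0.0.
    """
    energy = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        energy.append(sum(math.dist(p, c) for p, c in zip(prev, cur)))
    return energy


def select_key_poses(frames, k):
    """Indices of the k frames with the highest motion energy, in temporal order.

    Heuristic stand-in (an assumption) for a learned key pose selector.
    """
    energy = motion_energy(frames)
    ranked = sorted(range(len(frames)), key=lambda i: energy[i], reverse=True)
    return sorted(ranked[:k])


# Toy two-joint skeleton over four frames (illustrative data only).
frames = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 0.0), (1.0, 0.0)],  # no joint moves
    [(3.0, 4.0), (1.0, 0.0)],  # joint 0 moves
    [(3.0, 4.0), (1.0, 1.0)],  # joint 1 moves
]
key_indices = select_key_poses(frames, k=2)
```

A learned selector would replace the fixed displacement score with features trained on labeled poses; the heuristic here only illustrates the input and output shape of such a step.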