Video processing has grown steadily in importance in recent years, which continues to drive the development of the underlying technologies. It is applied mainly to real-world tasks such as monitoring road traffic flow, enhancing video quality, detecting moving objects or persons, and building emotion-aware systems based on the dynamics of facial muscles. Recent research focuses mostly on detecting moving objects against complex backgrounds. This paper explores several novel approaches to object tracking: an optical flow and clustering model, a scene-specific detection model, and an image moments-based model. Three-dimensional (3D) sensor networks using multiple light detection and ranging (LiDAR) sensors are well suited to intersections, where the likelihood of accidents is higher. This approach concentrates on zones of high spatial importance and reduces the computational latency of object detection. A fast, lightweight, end-to-end 3D separable convolutional neural network with a multi-input multi-output (MIMO) strategy, "3DS_MM", is examined for moving object detection; it adopts 3D convolution to improve detection accuracy. LiDAR combined with streaming video can enable real-time object detection and tracking. The fused tracking results are then combined with radiological data to improve situational awareness and increase detection sensitivity. An end-to-end blur-aid feature aggregation network (BFAN) for video object detection has also been investigated. BFAN focuses on the aggregation process as affected by blur, including motion blur and defocus, and uses the estimated object blur degree of each frame as the weight for aggregation. It is also observed that an efficient and accurate spatiotemporal salient object detection method can be used to identify the most noticeable object in a video sequence.
Intuitively, the underlying motion in a video is a more stable saliency indicator than appearance cues such as color, which often contain large variations and complex structures. Based on this observation, an efficient and accurate spatiotemporal saliency detection method is constructed that uses motion information as leverage to locate the most dynamic regions in a video sequence. Overall, K-means thresholding works better than MMM thresholding; Confidence-Encoded SVM improves the detection rate of the scene-specific detector; and MDP improves object detection in the 3DResNet and LSTM model.
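
To make the idea of K-means thresholding concrete, the following minimal Python sketch separates moving from static pixels by clustering per-pixel motion magnitudes into two groups and placing the threshold midway between the two cluster centres. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function names `kmeans_threshold` and `motion_mask` are hypothetical, and the motion magnitudes are assumed to have been computed beforehand, for example from an optical flow field.

```python
from typing import List, Sequence


def kmeans_threshold(values: Sequence[float], max_iter: int = 100) -> float:
    """Two-cluster 1-D k-means; returns the midpoint of the final centroids.

    Hypothetical helper for illustration only.
    """
    if not values:
        raise ValueError("values must be non-empty")
    lo, hi = float(min(values)), float(max(values))
    if lo == hi:
        # All magnitudes are identical, so there is nothing to separate.
        return lo
    for _ in range(max_iter):
        boundary = (lo + hi) / 2.0
        # Assignment step: each value joins the nearer centroid.
        low = [v for v in values if v <= boundary]
        high = [v for v in values if v > boundary]
        # Update step: recompute each centroid as the cluster mean.
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high)
        if new_lo == lo and new_hi == hi:
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0


def motion_mask(magnitudes: List[List[float]]) -> List[List[bool]]:
    """Mark pixels whose motion magnitude exceeds the k-means threshold.

    `magnitudes` is assumed to hold precomputed per-pixel motion magnitudes.
    """
    flat = [v for row in magnitudes for v in row]
    threshold = kmeans_threshold(flat)
    return [[v > threshold for v in row] for row in magnitudes]


if __name__ == "__main__":
    # Toy 2x2 "frame": two nearly static pixels and two fast-moving ones.
    frame = [[0.1, 5.0], [6.0, 0.05]]
    print(motion_mask(frame))
```

Because the threshold is derived from the data in each frame, it adapts to the overall amount of motion rather than relying on a fixed, hand-tuned cut-off.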