Target object localization and pose estimation (PE) are two key technologies in robotic object grasping, and adding a robotic vision system can substantially improve the flexibility and accuracy of grasping. Considering the limited computing power and memory of embedded platforms, this study optimizes the classical convolutional structure of the target detection network and replaces the original anchor box mechanism with an adaptive anchor box mechanism that incorporates the fused depth map. To estimate the target's pose, a semantic segmentation network identifies a smooth planar region on the target's surface, and the pose is obtained by solving for the normal vector of that plane, so that the robotic arm can approach along the plane normal and pick up the object by suction on its surface. The adaptive anchor box mechanism maintains an average accuracy of 85.75% even as the number of anchor boxes increases, which indicates its robustness to overfitting. The target localization algorithm achieves a detection accuracy of 98.8%, the PE algorithm achieves an accuracy of 74.32%, and the pipeline runs at up to 25 frames/s, which satisfies the requirements of real-time physical grasping. Physical grasping experiments based on the proposed vision algorithm were then carried out, and the grasping success rate was above 75%, which verifies the practicability of the approach.
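As an illustration of the plane-normal step, the following is a minimal sketch and not the method used in the study. It assumes the segmented surface region has already been converted to 3D points in the camera frame (with the camera looking along +z), fits the plane z = a*x + b*y + c by ordinary least squares, and returns the unit normal oriented toward the camera. The function names `fit_plane_normal` and `_det3` are hypothetical, and the least-squares formulation is an assumption made for this example.

```python
import math


def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_plane_normal(points):
    """Fit z = a*x + b*y + c to (x, y, z) points by least squares.

    Returns the unit normal (nx, ny, nz) with nz < 0, i.e. pointing
    back toward a camera that looks along +z (an assumed convention).
    """
    n = len(points)
    if n < 3:
        raise ValueError("at least three points are required")

    # Accumulate the sums that appear in the normal equations.
    sx = sum(p[0] for p in points)
    sy = sum(p[1] for p in points)
    sz = sum(p[2] for p in points)
    sxx = sum(p[0] * p[0] for p in points)
    syy = sum(p[1] * p[1] for p in points)
    sxy = sum(p[0] * p[1] for p in points)
    sxz = sum(p[0] * p[2] for p in points)
    syz = sum(p[1] * p[2] for p in points)

    A = [[sxx, sxy, sx],
         [sxy, syy, sy],
         [sx, sy, n]]
    rhs = [sxz, syz, sz]

    det = _det3(A)
    if abs(det) < 1e-12:
        raise ValueError("points are degenerate (e.g. collinear)")

    # Solve for a and b with Cramer's rule; c is not needed for the normal.
    coeffs = []
    for col in range(2):
        Ai = [row[:] for row in A]
        for r in range(3):
            Ai[r][col] = rhs[r]
        coeffs.append(_det3(Ai) / det)
    a, b = coeffs

    # The plane a*x + b*y - z + c = 0 has normal (a, b, -1).
    norm = math.sqrt(a * a + b * b + 1.0)
    return (a / norm, b / norm, -1.0 / norm)


if __name__ == "__main__":
    # Four sample points lying on z = 2x + 3y + 1.
    pts = [(0, 0, 1), (1, 0, 3), (0, 1, 4), (1, 1, 6)]
    print(fit_plane_normal(pts))
```

Under this convention the suction approach direction would be the negative of the returned normal. Fitting z as a function of x and y breaks down for surfaces nearly parallel to the optical axis; a total least-squares fit (for example via the eigenvectors of the point covariance) avoids that limitation at the cost of a slightly longer implementation.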