Learning Robot Manipulation Skills From Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks With Self-Attention

Times Cited: 1
Authors
Xu, Xin [1 ,2 ]
Qian, Kun [1 ,2 ]
Jing, Xingshuo [1 ,2 ]
Song, Wei [3 ]
Affiliations
[1] Southeast Univ, Sch Automat, Minist Educ, Nanjing 210096, Peoples R China
[2] Southeast Univ, Key Lab Measurement & Control CSE, Minist Educ, Nanjing 210096, Peoples R China
[3] Zhejiang Lab, Res Ctr Intelligent Robot, Hangzhou 311121, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Learning from Demonstration (LfD); robot manipulation; self-attention; skills learning; Video-to-Command (V2C); OBJECT;
DOI
10.1109/TCDS.2022.3182877
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Learning manipulation skills by observing human demonstration videos is a promising capability for intelligent robotic systems. Recent advances in Video-to-Command (V2C) provide an end-to-end approach to translating a video into robot plans. However, performing V2C and action segmentation simultaneously remains a major challenge for bimanual manipulations with fine-grained actions. Another concern is the generalization capability of end-to-end approaches in dealing with varied task parameters, as well as environmental changes between the learned skills and the one-shot task demonstration that the robot must replay. In this article, we propose a two-stream network that enables robots to learn and segment manipulation subactions from human demonstration videos. Our framework, built on a self-attention mechanism, segments learned skills and generates action commands simultaneously. To arrive at refined plans when human demonstrations are underspecified or redundant, we utilize PDDL-based skill scripts to model the semantics of demonstrated activities and infer latent movements. Experimental results on the extended manipulation data set indicate that our approach generates more accurate commands than state-of-the-art methods. Real-world experiments on a Baxter robot also demonstrate the feasibility of our method in reproducing fine-grained actions from video demonstrations.
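The abstract mentions PDDL-based skill scripts for modeling the semantics of demonstrated activities and inferring latent movements. As a rough illustration only (the domain, predicate, and action names below are hypothetical and not taken from the paper), such a skill script might encode a grasp step whose preconditions let a planner infer a missing "approach" movement from an underspecified demonstration:

```pddl
(define (domain manipulation-demo)
  (:requirements :strips :typing)
  (:types object gripper)
  (:predicates
    (free ?g - gripper)           ; gripper holds nothing
    (near ?g - gripper ?o - object) ; gripper has approached the object
    (holding ?g - gripper ?o - object)
    (on-table ?o - object))

  ;; If a demonstration omits the approach motion, the unmet
  ;; (near ?g ?o) precondition of pick forces the planner to
  ;; insert an approach action, recovering the latent movement.
  (:action approach
    :parameters (?g - gripper ?o - object)
    :precondition (free ?g)
    :effect (near ?g ?o))

  (:action pick
    :parameters (?g - gripper ?o - object)
    :precondition (and (free ?g) (near ?g ?o) (on-table ?o))
    :effect (and (holding ?g ?o)
                 (not (free ?g))
                 (not (on-table ?o)))))
```

Under this sketch, a plan for a demonstrated "pick" that lacked an explicit approach would still be completed as (approach g o) followed by (pick g o) by any STRIPS-compatible planner.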
Pages: 1000-1011
Page count: 12