Joint Object Affordance Reasoning and Segmentation in RGB-D Videos

Cited by: 4
Authors
Thermos, Spyridon [1 ]
Potamianos, Gerasimos [2 ]
Daras, Petros [3 ]
Affiliations
[1] Univ Edinburgh, Sch Engn, Edinburgh EH9 3JL, Midlothian, Scotland
[2] Univ Thessaly, Dept Elect & Comp Engn, Volos 38221, Greece
[3] Informat Technol Inst, Ctr Res & Technol Hellas, Visual Comp Lab, Thessaloniki 57001, Greece
Keywords
Affordances; Cognition; Decoding; Task analysis; Image segmentation; Heating systems; Videos; Object affordances; human-object interaction; reasoning; semantic segmentation; deep learning; encoder-decoder model; attention mechanism; RGB-D video; RECOGNITION; MODEL;
DOI
10.1109/ACCESS.2021.3090471
Chinese Library Classification
TP [Automation technology; computer technology]
Discipline Classification Code
0812
Abstract
Understanding human-object interaction is a fundamental challenge in computer vision and robotics. Crucial to it is the ability to infer "object affordances" from visual data, namely the types of interaction supported by an object of interest and the object parts involved. Such inference can be approached as an "affordance reasoning" task, where object affordances are recognized and localized as image heatmaps, and as an "affordance segmentation" task, where affordance labels are obtained at a more detailed, image pixel level. To tackle the two tasks, existing methods typically: (i) treat them independently; (ii) adopt static image-based models, ignoring the temporal aspect of human-object interaction; and/or (iii) require additional strong supervision concerning object class and location. In this paper, we focus on both tasks, while addressing all three aforementioned shortcomings. For this purpose, we propose a deep-learning-based dual encoder-decoder model for joint affordance reasoning and segmentation, which learns from our recently introduced SOR3D-AFF corpus of RGB-D human-object interaction videos, without relying on object localization and classification. The basic components of the model comprise: (i) two parallel encoders that capture spatio-temporal interaction information; (ii) a reasoning decoder that predicts affordance heatmaps, assisted by an affordance classifier and an attention mechanism; and (iii) a segmentation decoder that exploits the predicted heatmap to yield pixel-level affordance segmentation. All modules are jointly trained, while the system can operate on both static images and videos. The approach is evaluated on four datasets, surpassing the current state-of-the-art in both affordance reasoning and segmentation.
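The abstract outlines the dual encoder-decoder design: parallel spatio-temporal encoders, an attention-assisted reasoning decoder with an auxiliary affordance classifier, and a segmentation decoder conditioned on the predicted heatmaps. Below is a minimal PyTorch sketch of that structure. The layer sizes, module names, fusion scheme, and temporal pooling are illustrative assumptions for clarity; they are not taken from the paper itself.

```python
# Minimal sketch of a dual encoder-decoder for joint affordance reasoning
# and segmentation. All architectural details here are assumptions, not the
# authors' exact design.
import torch
import torch.nn as nn

class StreamEncoder(nn.Module):
    """3D-conv encoder for one input stream (RGB or depth video clip)."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):            # x: (B, C, T, H, W)
        f = self.net(x)              # (B, 64, T', H/4, W/4)
        return f.mean(dim=2)         # average temporal pooling -> (B, 64, H/4, W/4)

class JointAffordanceModel(nn.Module):
    def __init__(self, num_affordances):
        super().__init__()
        self.rgb_enc = StreamEncoder(3)     # two parallel spatio-temporal encoders
        self.depth_enc = StreamEncoder(1)
        self.attn = nn.Sequential(          # simple spatial attention over fused features
            nn.Conv2d(128, 1, kernel_size=1), nn.Sigmoid())
        self.classifier = nn.Sequential(    # auxiliary affordance classifier
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, num_affordances))
        self.reason_dec = nn.Sequential(    # reasoning decoder -> affordance heatmaps
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, num_affordances, 4, stride=2, padding=1))
        self.seg_dec = nn.Sequential(       # segmentation decoder conditioned on heatmaps
            nn.Conv2d(128 + num_affordances, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_affordances + 1, 1))  # +1 channel for background

    def forward(self, rgb, depth):
        f = torch.cat([self.rgb_enc(rgb), self.depth_enc(depth)], dim=1)  # (B, 128, h, w)
        f = f * self.attn(f)                       # attention-weighted fused features
        logits = self.classifier(f)                # affordance class scores
        heatmaps = self.reason_dec(f)              # (B, A, 4h, 4w) affordance heatmaps
        hm_small = nn.functional.adaptive_avg_pool2d(heatmaps, f.shape[-2:])
        seg = self.seg_dec(torch.cat([f, hm_small], dim=1))  # pixel-level affordance logits
        return heatmaps, seg, logits

if __name__ == "__main__":
    model = JointAffordanceModel(num_affordances=9)   # class count is illustrative only
    rgb = torch.randn(2, 3, 8, 64, 64)                # (batch, channels, frames, H, W)
    depth = torch.randn(2, 1, 8, 64, 64)
    heatmaps, seg, logits = model(rgb, depth)
    print(heatmaps.shape, seg.shape, logits.shape)
```

In this sketch, the reasoning decoder's heatmaps are downsampled and concatenated with the fused features before the segmentation decoder, which is one plausible way to realize the heatmap-conditioned segmentation described in the abstract; joint training would combine a classification loss, a heatmap regression loss, and a segmentation loss.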
Pages: 89699-89713
Page count: 15
Related Papers
Showing records 21-30 of 50
  • [21] Object Detection and Tracking Under Occlusion for Object-Level RGB-D Video Segmentation
    Xie, Qian
    Remil, Oussama
    Guo, Yanwen
    Wang, Meng
    Wei, Mingqiang
    Wang, Jun
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2018, 20 (03) : 580 - 592
  • [22] Semantic Mapping Using Object-Class Segmentation of RGB-D Images
    Stueckler, Joerg
    Biresev, Nenad
    Behnke, Sven
    [J]. 2012 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2012, : 3005 - 3010
  • [23] Object Segmentation of Indoor Scenes Using Perceptual Organization on RGB-D Images
    Wang, Chaonan
    Xue, Yanbing
    Zhang, Hua
    Xu, Guangping
    Gao, Zan
    [J]. 2016 8TH INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS & SIGNAL PROCESSING (WCSP), 2016,
  • [24] HAND AND OBJECT SEGMENTATION FROM RGB-D IMAGES FOR INTERACTION WITH PLANAR SURFACES
    Weber, Henrique
    Jung, Claudio Rosito
    Gelb, Dan
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2015, : 2984 - 2988
  • [25] Learning Rich Features from RGB-D Images for Object Detection and Segmentation
    Gupta, Saurabh
    Girshick, Ross
    Arbelaez, Pablo
    Malik, Jitendra
    [J]. COMPUTER VISION - ECCV 2014, PT VII, 2014, 8695 : 345 - 360
  • [26] RGB-D Object Modelling for Object Recognition and Tracking
    Prankl, Johann
    Aldoma, Aitor
    Svejda, Alexander
    Vincze, Markus
    [J]. 2015 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2015, : 96 - 103
  • [27] RGB-D joint modelling with scene geometric information for indoor semantic segmentation
    Liu, Hong
    Wu, Wenshan
    Wang, Xiangdong
    Qian, Yueliang
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2018, 77 (17) : 22475 - 22488
  • [28] JALNet: joint attention learning network for RGB-D salient object detection
    Gao, Xiuju
    Cui, Jianhua
    Meng, Jin
    Shi, Huaizhong
    Duan, Songsong
    Xia, Chenxing
    [J]. INTERNATIONAL JOURNAL OF COMPUTATIONAL SCIENCE AND ENGINEERING, 2024, 27 (01) : 36 - 47
  • [30] Joint Semantic Mining for Weakly Supervised RGB-D Salient Object Detection
    Li, Jingjing
    Ji, Wei
    Bi, Qi
    Yan, Cheng
    Zhang, Miao
    Piao, Yongri
    Lu, Huchuan
    Cheng, Li
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34