Joint Object Affordance Reasoning and Segmentation in RGB-D Videos

Cited: 4
Authors
Thermos, Spyridon [1 ]
Potamianos, Gerasimos [2 ]
Daras, Petros [3 ]
Affiliations
[1] Univ Edinburgh, Sch Engn, Edinburgh EH9 3JL, Midlothian, Scotland
[2] Univ Thessaly, Dept Elect & Comp Engn, Volos 38221, Greece
[3] Informat Technol Inst, Ctr Res & Technol Hellas, Visual Comp Lab, Thessaloniki 57001, Greece
Keywords
Affordances; Cognition; Decoding; Task analysis; Image segmentation; Heating systems; Videos; Object affordances; human-object interaction; reasoning; semantic segmentation; deep learning; encoder-decoder model; attention mechanism; RGB-D video; RECOGNITION; MODEL;
DOI
10.1109/ACCESS.2021.3090471
Chinese Library Classification
TP [Automation & Computer Technology]
Discipline Code
0812
Abstract
Understanding human-object interaction is a fundamental challenge in computer vision and robotics. Crucial to it is the ability to infer "object affordances" from visual data, namely the types of interaction supported by an object of interest and the object parts involved. Such inference can be approached as an "affordance reasoning" task, where object affordances are recognized and localized as image heatmaps, and as an "affordance segmentation" task, where affordance labels are obtained at a more detailed, image pixel level. To tackle the two tasks, existing methods typically: (i) treat them independently; (ii) adopt static image-based models, ignoring the temporal aspect of human-object interaction; and/or (iii) require additional strong supervision concerning object class and location. In this paper, we focus on both tasks, while addressing all three aforementioned shortcomings. For this purpose, we propose a deep-learning-based dual encoder-decoder model for joint affordance reasoning and segmentation, which learns from our recently introduced SOR3D-AFF corpus of RGB-D human-object interaction videos, without relying on object localization and classification. The basic components of the model comprise: (i) two parallel encoders that capture spatio-temporal interaction information; (ii) a reasoning decoder that predicts affordance heatmaps, assisted by an affordance classifier and an attention mechanism; and (iii) a segmentation decoder that exploits the predicted heatmap to yield pixel-level affordance segmentation. All modules are jointly trained, while the system can operate on both static images and videos. The approach is evaluated on four datasets, surpassing the current state-of-the-art in both affordance reasoning and segmentation.
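The data flow the abstract describes (two parallel modality encoders, a reasoning decoder that emits an affordance heatmap, and a segmentation decoder gated by that heatmap) can be caricatured in a few lines of numpy. This is a toy sketch of the wiring only: the temporal averaging, sum-based fusion, sigmoid heatmap, and threshold gating are illustrative assumptions, not the authors' actual layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frames):
    # frames: (T, H, W) -> (H, W); stands in for a spatio-temporal encoder
    return frames.mean(axis=0)

def reasoning_decoder(features):
    # Sigmoid squashes the fused features into an affordance heatmap in [0, 1]
    return 1.0 / (1.0 + np.exp(-features))

def segmentation_decoder(features, heatmap, threshold=0.5):
    # The predicted heatmap gates which pixels receive an affordance label
    return (heatmap > threshold).astype(np.int64)

T, H, W = 4, 8, 8
rgb = rng.standard_normal((T, H, W))    # RGB stream
depth = rng.standard_normal((T, H, W))  # depth stream

fused = encode(rgb) + encode(depth)          # two parallel encoders, fused
heatmap = reasoning_decoder(fused)           # coarse affordance localization
mask = segmentation_decoder(fused, heatmap)  # pixel-level affordance mask
```

Note that the segmentation decoder consumes both the shared features and the reasoning decoder's output, mirroring the paper's claim that the heatmap is exploited to obtain the pixel-level result.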
Pages: 89699-89713
Page count: 15
Related Papers (50 records)
  • [1] Salient Object Detection in RGB-D Videos
    Mou, Ao
    Lu, Yukang
    He, Jiahao
    Min, Dingyao
    Fu, Keren
    Zhao, Qijun
    [J]. IEEE Transactions on Image Processing, 2024, 33 : 6660 - 6675
  • [2] Self-Supervised Learning of Object Segmentation from Unlabeled RGB-D Videos
    Lu, Shiyang
    Deng, Yunfu
    Boularias, Abdeslam
    Bekris, Kostas
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2023), 2023, : 7017 - 7023
  • [3] Semantic segmentation with Recurrent Neural Networks on RGB-D videos
    Gao, Chuan
    Wang, Weihong
    Chen, Mingxi
    [J]. 2019 CHINESE AUTOMATION CONGRESS (CAC2019), 2019, : 1203 - 1207
  • [4] A computational framework for attentional object discovery in RGB-D videos
    García, Germán Martín
    Pavel, Mircea
    Frintrop, Simone
    [J]. COGNITIVE PROCESSING, 2017, 18 (02) : 169 - 182
  • [6] Learning human activities and object affordances from RGB-D videos
    Koppula, Hema Swetha
    Gupta, Rudhir
    Saxena, Ashutosh
    [J]. INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH, 2013, 32 (08): : 951 - 970
  • [7] Object Pose Estimation From RGB-D Images With Affordance-Instance Segmentation Constraint for Semantic Robot Manipulation
    Wang, Zhongli
    Tian, Guohui
    [J]. IEEE ROBOTICS AND AUTOMATION LETTERS, 2024, 9 (01) : 595 - 602
  • [8] Learning of perceptual grouping for object segmentation on RGB-D data
    Richtsfeld, Andreas
    Moerwald, Thomas
    Prankl, Johann
    Zillich, Michael
    Vincze, Markus
    [J]. JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2014, 25 (01) : 64 - 73
  • [9] Multimodal Neural Networks: RGB-D for Semantic Segmentation and Object Detection
    Schneider, Lukas
    Jasch, Manuel
    Froehlich, Bjoern
    Weber, Thomas
    Franke, Uwe
    Pollefeys, Marc
    Raetsch, Matthias
    [J]. IMAGE ANALYSIS, SCIA 2017, PT I, 2017, 10269 : 98 - 109
  • [10] RGB-D object detection and semantic segmentation for autonomous manipulation in clutter
    Schwarz, Max
    Milan, Anton
    Periyasamy, Arul Selvam
    Behnke, Sven
    [J]. INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH, 2018, 37 (4-5): : 437 - 451