A Simple Framework for Scene Graph Reasoning with Semantic Understanding of Complex Sentence Structure

Cited by: 1
Authors
Heo, Yoonseok [1]
Kang, Sangwoo [2]
Affiliations
[1] Sogang Univ, Dept Comp Sci & Engn, Seoul 04107, South Korea
[2] Gachon Univ, Sch Comp, Seongnam 13120, South Korea
Funding
National Research Foundation, Singapore;
Keywords
multimodal deep learning; scene graph reasoning; multimodal transformer; multi-task learning;
DOI
10.3390/math11173751
Chinese Library Classification (CLC) number
O1 [Mathematics];
Discipline classification code
0701 ; 070101 ;
Abstract
The rapid expansion of multimedia environments in recent years has sharply increased demand for multimodal systems that can communicate with humans in various ways. Although the convergence of vision and language intelligence has achieved remarkable success over the last few years, a caveat remains: it is unknown whether such models truly understand the semantics of an image. More specifically, how they capture the relationships between objects in an image is still regarded as a black box. To test whether such relationships are well understood, this work focuses on the Graph-structured visual Question Answering (GQA) task, which evaluates image understanding by reasoning over a scene graph, a natural-language description of the structural characteristics of an image, together with the image itself. Unlike existing approaches, which rely on an additional encoder for scene graphs, we propose a simple yet effective framework that uses pre-trained multimodal transformers for scene graph reasoning. Inspired by the fact that a scene graph can be regarded as a set of sentences, each describing two related objects and their relationship, we fuse them into the framework separately from the question. In addition, we propose a multi-task learning method that uses evaluation of the grammatical validity of questions as an auxiliary task to better understand questions with complex structures. This method uses the semantic role labels of the question to randomly shuffle its sentence structure. We have conducted extensive experiments to evaluate the effectiveness of the framework in terms of task capabilities, ablation studies, and generalization.
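The two ideas in the abstract, treating a scene graph as a set of sentences and shuffling a question's structure to create examples for a grammatical-validity task, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the sentence template, the `p_shuffle` parameter, and the hand-written question spans (standing in for the output of a semantic role labeler) are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object); hypothetical format


def scene_graph_to_sentences(triples: List[Triple]) -> List[str]:
    """Render each scene-graph edge as a short sentence (assumed template)."""
    return [f"{subj} {rel} {obj}." for subj, rel, obj in triples]


def make_validity_example(
    spans: List[str], rng: random.Random, p_shuffle: float = 0.5
) -> Tuple[str, int]:
    """Build one example for a grammatical-validity auxiliary task.

    `spans` are question segments, assumed to come from semantic role labels.
    Returns (text, label): label 1 keeps the original order, label 0 means
    the spans were shuffled into a different order.
    """
    original = " ".join(spans)
    # Shuffling cannot change the order if fewer than two distinct spans exist.
    if len(set(spans)) < 2 or rng.random() >= p_shuffle:
        return original, 1
    shuffled = spans[:]
    while shuffled == spans:  # retry until the order actually differs
        rng.shuffle(shuffled)
    return " ".join(shuffled), 0


if __name__ == "__main__":
    graph = [("man", "holding", "umbrella"), ("umbrella", "above", "dog")]
    print(scene_graph_to_sentences(graph))

    # Hand-written spans; a real pipeline would obtain them from an SRL model.
    question = ["What color", "is", "the umbrella", "that the man is holding"]
    print(make_validity_example(question, random.Random(0), p_shuffle=1.0))
```

In a setup like this, the generated sentences would be passed to the model as a text input separate from the question, and the label would supervise a binary classification head trained alongside the answer prediction. How the published framework combines these inputs is not specified in this record.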
Pages: 15
Related papers
23 in total
  • [1] Improved Scene Understanding through Semantic Reasoning and Online Learning
    Moskal, Jakub J.
    Kokar, Mieczyslaw M.
    Whittington, Sydney J.
    [J]. SIGNAL PROCESSING, SENSOR/INFORMATION FUSION, AND TARGET RECOGNITION XXXI, 2022, 12122
  • [2] Image Understanding using vision and reasoning through Scene Description Graph
    Aditya, Somak
    Yang, Yezhou
    Baral, Chitta
    Aloimonos, Yiannis
    Fermueller, Cornelia
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2018, 173 : 33 - 45
  • [3] A Bayesian network framework for vision based semantic scene understanding
    Im, Seung-Bin
    Hwang, Keum-Sung
    Cho, Sung-Bae
    [J]. 2007 RO-MAN: 16TH IEEE INTERNATIONAL SYMPOSIUM ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION, VOLS 1-3, 2007, : 834 - 839
  • [4] MUFIN: Enriching Semantic Understanding of Sentence Embedding using Dual Tune Framework
    Goswami, Koustava
    Dutta, Sourav
    Assem, Haytham
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2021, : 2034 - 2039
  • [5] Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning
    Khan, Muhammad Jaleed
    Breslin, John G.
    Curry, Edward
    [J]. SEMANTIC WEB, ESWC 2022, 2022, 13261 : 93 - 112
  • [6] Model-based inexact graph matching on top of DNNs for semantic scene understanding
    Chopin, Jeremy
    Fasquel, Jean-Baptiste
    Mouchere, Harold
    Dahyot, Rozenn
    Bloch, Isabelle
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2023, 235
  • [7] LOGICDEF: An Interpretable Defense Framework against Adversarial Examples via Inductive Scene Graph Reasoning
    Yang, Yuan
    Kerce, James C.
    Fekri, Faramarz
    [J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 8840 - 8848
  • [8] A Bottom-up Framework for Construction of Structured Semantic 3D Scene Graph
    Yu, Bangguo
    Chen, Chongyu
    Zhou, Fengyu
    Wan, Fang
    Zhuang, Wenmi
    Zhao, Yang
    [J]. 2020 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2020, : 8224 - 8230
  • [9] STC: A Simple to Complex Framework for Weakly-Supervised Semantic Segmentation
    Wei, Yunchao
    Liang, Xiaodan
    Chen, Yunpeng
    Shen, Xiaohui
    Cheng, Ming-Ming
    Feng, Jiashi
    Zhao, Yao
    Yan, Shuicheng
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2017, 39 (11) : 2314 - 2320
  • [10] Generating a Novel Scene-Graph Structure for a Modern GIS Rendering Framework
    Tully, David
    El Rhalibi, Abdennour
    Carter, Christopher
    Sudirman, Sud
    [J]. 2016 9TH INTERNATIONAL CONFERENCE ON DEVELOPMENTS IN ESYSTEMS ENGINEERING (DESE 2016), 2016, : 169 - 174