Visual Relation Extraction via Multi-modal Translation Embedding Based Model

Times Cited: 0
Authors
Li, Zhichao [1 ]
Han, Yuping [1 ]
Xu, Yajing [1 ]
Gao, Sheng [1 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Beijing, Peoples R China
Keywords
Visual relation extraction; Multi-modal network; Translation embedding;
DOI
10.1007/978-3-319-93034-3_43
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
A visual relation, such as "person holds dog", is an effective semantic unit for image understanding as well as a bridge between computer vision and natural language. Recent work has proposed extracting object features from an image with the aid of the corresponding textual descriptions. However, little work has combined multi-modal information to model subject-predicate-object relation triplets for deeper scene understanding. In this paper, we propose a novel visual relation extraction model, the Multi-modal Translation Embedding Based Model, which integrates visual information with the corresponding textual knowledge base. To this end, the proposed model places the objects of an image and their semantic relationships in two different low-dimensional spaces, where a relation is modeled as a simple translation vector connecting the entity descriptions in the knowledge graph. Moreover, we propose a visual phrase learning method that captures the interactions between objects in the image to further improve visual relation extraction. Experiments on two real-world datasets show that the proposed model benefits from incorporating language information into the relation embeddings and yields significant improvements over state-of-the-art methods.
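As a rough sketch of the translation-embedding idea described in the abstract (a predicate acts as a translation vector so that subject + predicate ≈ object in a learned low-dimensional space), the following Python code shows a TransE-style scorer with a margin ranking loss. The feature dimensions, the single shared projection layer, and the margin value are illustrative assumptions for this sketch, not the authors' actual architecture.

import torch
import torch.nn as nn

class TranslationEmbedding(nn.Module):
    """Sketch: embed multi-modal object features and model a predicate as a translation vector."""
    def __init__(self, visual_dim=2048, text_dim=300, embed_dim=128, num_predicates=70):
        super().__init__()
        # Project concatenated visual + textual object features into a common relation space.
        self.project = nn.Linear(visual_dim + text_dim, embed_dim)
        # One learned translation vector per predicate (e.g. "holds", "rides").
        self.predicate = nn.Embedding(num_predicates, embed_dim)

    def score(self, subj_feat, obj_feat, pred_idx):
        s = self.project(subj_feat)      # subject embedding
        o = self.project(obj_feat)       # object embedding
        p = self.predicate(pred_idx)     # predicate translation vector
        # Smaller distance means a more plausible triplet (s + p should lie close to o).
        return torch.norm(s + p - o, p=2, dim=-1)

def margin_ranking_loss(model, subj, obj, pred, corrupted_pred, margin=1.0):
    # A correct triplet should score lower (closer) than a corrupted one by at least `margin`.
    pos = model.score(subj, obj, pred)
    neg = model.score(subj, obj, corrupted_pred)
    return torch.clamp(pos - neg + margin, min=0).mean()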
Pages: 538 - 548
Number of pages: 11
Related Papers
50 items in total
  • [31] Multi-modal measurement of the visual cortex
    Amano, Kaoru
    Takemura, Hiromasa
    I-PERCEPTION, 2014, 5 (04): 408 - 408
  • [32] Learning Visual Emotion Distributions via Multi-Modal Features Fusion
    Zhao, Sicheng
    Ding, Guiguang
    Gao, Yue
    Han, Jungong
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 369 - 377
  • [33] RetrievalMMT: Retrieval-Constrained Multi-Modal Prompt Learning for Multi-Modal Machine Translation
    Wang, Yan
    Zeng, Yawen
    Liang, Junjie
    Xing, Xiaofen
    Xu, Jin
    Xu, Xiangmin
    PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024, : 860 - 868
  • [34] Visual Translation Embedding Network for Visual Relation Detection
    Zhang, Hanwang
    Kyaw, Zawlin
    Chang, Shih-Fu
    Chua, Tat-Seng
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 3107 - 3115
  • [35] Multi-modal co-attention relation networks for visual question answering
    Guo, Zihan
    Han, Dezhi
    VISUAL COMPUTER, 2023, 39 (11): 5783 - 5795
  • [36] Multi-modal co-attention relation networks for visual question answering
    Zihan Guo
    Dezhi Han
    The Visual Computer, 2023, 39 : 5783 - 5795
  • [37] Multi-Modal Association based Grouping for Form Structure Extraction
    Aggarwal, Milan
    Sarkar, Mausoom
    Gupta, Hiresh
    Krishnamurthy, Balaji
    2020 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2020, : 2064 - 2073
  • [38] Multi-Modal Military Event Extraction Based on Knowledge Fusion
    Xiang, Yuyuan
    Jia, Yangli
    Zhang, Xiangliang
    Zhang, Zhenling
    CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 77 (01): 97 - 114
  • [39] Visual Sorting Method Based on Multi-Modal Information Fusion
    Han, Song
    Liu, Xiaoping
    Wang, Gang
    APPLIED SCIENCES-BASEL, 2022, 12 (06):
  • [40] Unsupervised Multi-modal Neural Machine Translation
    Su, Yuanhang
    Fan, Kai
    Nguyen Bach
    Kuo, C-C Jay
    Huang, Fei
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 10474 - 10483