Visual Relation Extraction via Multi-modal Translation Embedding Based Model

Cited: 0
Authors
Li, Zhichao [1 ]
Han, Yuping [1 ]
Xu, Yajing [1 ]
Gao, Sheng [1 ]
Affiliations
[1] Beijing University of Posts and Telecommunications, Beijing, People's Republic of China
Keywords
Visual relation extraction; Multi-modal network; Translation embedding
DOI
10.1007/978-3-319-93034-3_43
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
A visual relation such as "person holds dog" is an effective semantic unit for image understanding, as well as a bridge connecting computer vision and natural language. Recent work has extracted object features from an image with the aid of the corresponding textual descriptions, but little has been done to combine multi-modal information to model subject-predicate-object relation triplets for deeper scene understanding. In this paper, we propose a novel visual relation extraction model, the Multi-modal Translation Embedding Based Model, which integrates visual information with a corresponding textual knowledge base. To this end, the model embeds the objects of an image and their semantic relationships in two distinct low-dimensional spaces, where a relation is modeled as a simple translation vector connecting the entity descriptions in the knowledge graph. We further propose a visual phrase learning method that captures the interactions between objects in the image to enhance visual relation extraction. Experiments on two real-world datasets show that our model benefits from incorporating language information into the relation embeddings and achieves significant improvements over state-of-the-art methods.
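The translation-embedding idea described in the abstract follows the TransE family of knowledge-graph models: a triplet (subject, predicate, object) is plausible when the subject embedding plus the predicate vector lands near the object embedding. The sketch below illustrates that scoring scheme with a margin-based ranking loss; the class name, embedding dimension, and loss are illustrative assumptions, not the paper's actual implementation, which additionally fuses visual features from the image with the language-side embeddings.

```python
# Minimal sketch of a TransE-style translation embedding, assuming PyTorch.
# The visual-feature fusion described in the abstract is omitted for brevity.
import torch
import torch.nn as nn

class TranslationEmbedding(nn.Module):
    def __init__(self, num_entities: int, num_predicates: int, dim: int = 50):
        super().__init__()
        # One embedding table for object categories (subjects and objects
        # share it), one for predicates; dim=50 is an illustrative choice.
        self.entity = nn.Embedding(num_entities, dim)
        self.predicate = nn.Embedding(num_predicates, dim)

    def score(self, subj, pred, obj):
        # Lower is better: a plausible triplet satisfies s + p ≈ o,
        # so the L2 residual ||s + p - o|| measures implausibility.
        s, p, o = self.entity(subj), self.predicate(pred), self.entity(obj)
        return (s + p - o).norm(p=2, dim=-1)

def margin_ranking_loss(model, pos, neg, margin=1.0):
    # Push a true triplet's score below a corrupted one's by `margin`.
    return torch.clamp(margin + model.score(*pos) - model.score(*neg),
                       min=0.0).mean()

# Example: "person holds dog" vs. a corrupted triplet "dog holds person".
model = TranslationEmbedding(num_entities=100, num_predicates=70)
person, dog, holds = torch.tensor([0]), torch.tensor([1]), torch.tensor([2])
loss = margin_ranking_loss(model, (person, holds, dog), (dog, holds, person))
```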
Pages: 538-548
Number of pages: 11