ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs

Cited by: 0
Authors
Yu, Fei [1]
Tang, Jiji [1]
Yin, Weichong [1]
Su, Yu [1]
Tian, Hao [1]
Wu, Hua [1]
Wang, Haifeng [1]
Affiliations
[1] Baidu Inc, Beijing, Peoples R China
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We propose a knowledge-enhanced approach, ERNIE-ViL, which incorporates structured knowledge obtained from scene graphs to learn joint vision-language representations. ERNIE-ViL aims to build detailed semantic connections (objects, attributes of objects, and relationships between objects) across vision and language, which are essential to vision-language cross-modal tasks. Utilizing scene graphs of visual scenes, ERNIE-ViL constructs Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction, and Relationship Prediction tasks, in the pre-training phase. Specifically, these prediction tasks are implemented by predicting nodes of different types in the scene graph parsed from the sentence. Thus, ERNIE-ViL can learn joint representations that characterize the alignment of detailed semantics across vision and language. After pre-training on large-scale image-text aligned datasets, we validate the effectiveness of ERNIE-ViL on 5 cross-modal downstream tasks. ERNIE-ViL achieves state-of-the-art performance on all these tasks and ranks first on the VCR leaderboard with an absolute improvement of 3.7%.
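As a rough illustration of how a prediction task over scene-graph nodes could be set up on the text side, the sketch below masks the tokens of one node type (object, attribute, or relationship) and records the original tokens as prediction targets. This is only a minimal sketch under stated assumptions: the scene graph is written by hand rather than produced by a parser, the names `SCENE_GRAPH`, `mask_nodes`, and `mask_prob` are hypothetical, and the image regions that a real model would also condition on are omitted. It is not the authors' implementation.

```python
import random

MASK = "[MASK]"

# Hypothetical, hand-written scene graph for one caption (an assumption);
# in practice a scene graph parser would produce this from the sentence.
SCENE_GRAPH = {
    "object": ["cat", "car"],
    "attribute": ["brown"],
    "relationship": ["on"],
}


def mask_nodes(tokens, scene_graph, node_type, mask_prob=1.0, rng=None):
    """Mask tokens that are scene-graph nodes of `node_type`.

    Returns (masked_tokens, labels), where labels maps each masked
    position to the original token the model would have to predict.
    `mask_prob` is an illustrative parameter, not a value from the paper.
    """
    rng = rng or random.Random(0)
    targets = set(scene_graph[node_type])
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if tok in targets and rng.random() < mask_prob:
            masked.append(MASK)
            labels[i] = tok
        else:
            masked.append(tok)
    return masked, labels


tokens = "a brown cat on the car".split()
for node_type in ("object", "attribute", "relationship"):
    masked, labels = mask_nodes(tokens, SCENE_GRAPH, node_type)
    print(node_type, masked, labels)
```

Each node type yields its own masked copy of the sentence, so the three prediction tasks named in the abstract would correspond to three calls with different `node_type` values.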
Pages: 3208-3216
Number of pages: 9
Related papers
50 in total
  • [1] Structured Scene Memory for Vision-Language Navigation
    Wang, Hanqing
    Wang, Wenguan
    Liang, Wei
    Xiong, Caiming
    Shen, Jianbing
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021: 8451-8460
  • [2] e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks
    Kayser, Maxime
    Camburu, Oana-Maria
    Salewski, Leonard
    Emde, Cornelius
    Do, Virginie
    Akata, Zeynep
    Lukasiewicz, Thomas
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 1224-1234
  • [3] Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships
    Lou, Chao
    Han, Wenjuan
    Lin, Yuhuan
    Zheng, Zilong
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 15586-15595
  • [4] Language Features Matter: Effective Language Representations for Vision-Language Tasks
    Burns, Andrea
    Tan, Reuben
    Saenko, Kate
    Sclaroff, Stan
    Plummer, Bryan A.
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 7473-7482
  • [5] VinVL: Revisiting Visual Representations in Vision-Language Models
    Zhang, Pengchuan
    Li, Xiujun
    Hu, Xiaowei
    Yang, Jianwei
    Zhang, Lei
    Wang, Lijuan
    Choi, Yejin
    Gao, Jianfeng
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021: 5575-5584
  • [6] LifeGraph 4-Lifelog Retrieval using Multimodal Knowledge Graphs and Vision-Language Models
    Rossetto, Luca
    Kyriakou, Athina
    Lange, Svenja
    Ruosch, Florian
    Wang, Ruijie
    Wardatzky, Kathrin
    Bernstein, Abraham
    [J]. PROCEEDINGS OF 2024 ACM WORKSHOP ON THE LIFELOG SEARCH CHALLENGE, LSC 2024, 2024: 88-92
  • [7] Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective
    Salin, Emmanuelle
    Farah, Badreddine
    Ayache, Stephane
    Favre, Benoit
    [J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022: 11248-11257
  • [8] "This Is My Unicorn, Fluffy": Personalizing Frozen Vision-Language Representations
    Cohen, Niv
    Gal, Rinon
    Meirom, Eli A.
    Chechik, Gal
    Atzmon, Yuval
    [J]. COMPUTER VISION, ECCV 2022, PT XX, 2022, 13680: 558-577
  • [9] FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks
    Han, Xiao
    Zhu, Xiatian
    Yu, Licheng
    Zhang, Li
    Song, Yi-Zhe
    Xiang, Tao
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023: 2669-2680
  • [10] Vision-Language Pre-Training for Boosting Scene Text Detectors
    Song, Sibo
    Wan, Jianqiang
    Yang, Zhibo
    Tang, Jun
    Cheng, Wenqing
    Bai, Xiang
    Yao, Cong
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 15660-15670