ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs

被引:0
|
作者
Yu, Fei [1 ]
Tang, Jiji [1 ]
Yin, Weichong [1 ]
Su, Yu [1 ]
Tian, Hao [1 ]
Wu, Hua [1 ]
Wang, Haifeng [1 ]
机构
[1] Baidu Inc, Beijing, Peoples R China
关键词
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
We propose a knowledge-enhanced approach, ERNIE-ViL, which incorporates structured knowledge obtained from scene graphs to learn joint representations of vision-language. ERNIE-ViL tries to build the detailed semantic connections (objects, attributes of objects and relationships between objects) across vision and language, which are essential to vision-language cross-modal tasks. Utilizing scene graphs of visual scenes, ERNIE-ViL constructs Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction and Relationship Prediction tasks in the pre-training phase. Specifically, these prediction tasks are implemented by predicting nodes of different types in the scene graph parsed from the sentence. Thus, ERNIE-ViL can learn the joint representations characterizing the alignments of the detailed semantics across vision and language. After pre-training on large scale i mage-text aligned datasets, we validate the effectiveness of ERNIE-ViL on 5 cross-modal downstream tasks. ERNIE-ViL achieves state-of-the-art performances on all these tasks and ranks the first place on the VCR leaderboard with an absolute improvement of 3.7%.
引用
收藏
页码:3208 / 3216
页数:9
相关论文
共 50 条
  • [21] Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models
    Kan, Baoshuo
    Wang, Teng
    Lu, Wenpeng
    Zhen, Xiantong
    Guan, Weili
    Zheng, Feng
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15624 - 15634
  • [22] Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models
    Wang, Yubin
    Jiang, Xinyang
    Cheng, De
    Li, Dongsheng
    Zhao, Cairong
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024, : 5749 - 5757
  • [23] HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models
    Ning, Shan
    Qiu, Longtian
    Liu, Yongfei
    He, Xuming
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23507 - 23517
  • [24] Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation
    Dai, Wenliang
    Hou, Lu
    Shang, Lifeng
    Jiang, Xin
    Liu, Qun
    Fung, Pascale
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 2383 - 2395
  • [25] Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles
    Ye, Shuquan
    Xie, Yujia
    Chen, Dongdong
    Xu, Yichong
    Yuan, Lu
    Zhu, Chenguang
    Liao, Jing
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2634 - 2645
  • [26] MAMO: Fine-Grained Vision-Language Representations Learning with Masked Multimodal Modeling
    Zhao, Zijia
    Guo, Longteng
    He, Xingjian
    Shao, Shuai
    Yuan, Zehuan
    Liu, Jing
    [J]. PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 1528 - 1538
  • [27] Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting
    Xue, Chuhui
    Zhang, Wenqing
    Hao, Yu
    Lu, Shijian
    Torr, Philip H. S.
    Bai, Song
    [J]. COMPUTER VISION - ECCV 2022, PT XXVIII, 2022, 13688 : 284 - 302
  • [28] DeepUnseen: Unpredicted Event Recognition Through Integrated Vision-Language Models
    Sakaino, Hidetomo
    Gaviphat, Natnapat
    Zamora, Louie
    Insisiengmay, Alivanh
    Ningrum, Dwi Fetiria
    [J]. 2023 IEEE CONFERENCE ON ARTIFICIAL INTELLIGENCE, CAI, 2023, : 48 - 50
  • [29] Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network
    Wei, Jiajun
    Zhan, Hongjian
    Lu, Yue
    Tu, Xiao
    Yin, Bing
    Liu, Cong
    Pal, Umapada
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024, : 5885 - 5893
  • [30] Knowledge Boosting: Rethinking Medical Contrastive Vision-Language Pre-training
    Chen, Xiaofei
    He, Yuting
    Xue, Cheng
    Ge, Rongjun
    Li, Shuo
    Yang, Guanyu
    [J]. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT I, 2023, 14220 : 405 - 415