What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations

Cited by: 1
|
Authors
Ilinykh, Nikolai [1 ]
Dobnik, Simon [1 ]
Affiliations
[1] Univ Gothenburg, Dept Philosophy Linguist & Theory Sci FLoV, Ctr Linguist Theory & Studies Probabil Clasp, Gothenburg, Sweden
Funding
Swedish Research Council;
Keywords
language-and-vision; multi-modality; transformer; representation learning; effect of language on vision; self-attention; information fusion; natural language processing;
DOI
10.3389/frai.2021.767971
CLC number
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Neural networks have proven to be very successful in automatically capturing the composition of language and different structures across a range of multi-modal tasks. Thus, an important question to investigate is how neural networks learn and organise such structures. Numerous studies have examined the knowledge captured by language models (LSTMs, transformers) and vision architectures (CNNs, vision transformers) for respective uni-modal tasks. However, very few have explored what structures are acquired by multi-modal transformers where linguistic and visual features are combined. It is critical to understand the representations learned by each modality, their respective interplay, and the task's effect on these representations in large-scale architectures. In this paper, we take a multi-modal transformer trained for image captioning and examine the structure of the self-attention patterns extracted from the visual stream. Our results indicate that the information about different relations between objects in the visual stream is hierarchical and varies from local to a global object-level understanding of the image. In particular, while visual representations in the first layers encode the knowledge of relations between semantically similar object detections, often constituting neighbouring objects, deeper layers expand their attention across more distant objects and learn global relations between them. We also show that globally attended objects in deeper layers can be linked with entities described in image descriptions, indicating a critical finding: the indirect effect of language on visual representations. In addition, we highlight how object-based input representations affect the structure of learned visual knowledge and guide the model towards more accurate image descriptions. A parallel question that we investigate is whether the insights from cognitive science echo the structure of representations that the current neural architecture learns.
The proposed analysis of the inner workings of multi-modal transformers can be used to better understand and improve such tasks as pre-training of large-scale multi-modal architectures, multi-modal information fusion, and probing of attention weights. In general, we contribute to explainable multi-modal natural language processing and to the currently shallow understanding of how the input representations and the structure of the multi-modal transformer affect visual representations.
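The layer-wise shift the abstract describes (attention over neighbouring objects in early layers, over distant objects in deeper layers) can be quantified with an attention-weighted distance between object detections. The sketch below is a hypothetical NumPy illustration of such a metric, not the authors' actual analysis code; the function name and the toy attention matrices are assumptions for demonstration.

```python
import numpy as np

def mean_attention_distance(attn, positions):
    """Average spatial distance between each query object and the
    objects it attends to, weighted by the attention probabilities.

    attn      : (n, n) row-stochastic self-attention matrix
    positions : (n, 2) object centre coordinates (e.g. bounding-box centres)
    """
    # pairwise Euclidean distances between object centres
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)          # (n, n)
    # attention-weighted distance, averaged over query objects
    return float((attn * dist).sum(axis=-1).mean())

# toy example: three detected objects lying on a line at x = 0, 1, 2
pos = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])

# "local" pattern: each object attends mostly to itself / its neighbours
local = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])

# "global" pattern: attention spread uniformly over all objects
global_ = np.full((3, 3), 1.0 / 3.0)

print(mean_attention_distance(local, pos))    # smaller: local attention
print(mean_attention_distance(global_, pos))  # larger: global attention
```

Applied per layer to the attention matrices of the visual stream, a rising value of this metric with depth would reflect the reported progression from local to global object relations.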
Pages: 22