What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations

Cited: 1
Authors
Ilinykh, Nikolai [1 ]
Dobnik, Simon [1 ]
Affiliations
[1] Univ Gothenburg, Dept Philosophy Linguist & Theory Sci (FLoV), Ctr Linguist Theory & Studies Probabil (CLASP), Gothenburg, Sweden
Source
FRONTIERS IN ARTIFICIAL INTELLIGENCE, 2021, 4
Funding
Swedish Research Council;
Keywords
language-and-vision; multi-modality; transformer; representation learning; effect of language on vision; self-attention; information fusion; natural language processing;
DOI
10.3389/frai.2021.767971
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Neural networks have proven to be very successful in automatically capturing the composition of language and different structures across a range of multi-modal tasks. Thus, an important question to investigate is how neural networks learn and organise such structures. Numerous studies have examined the knowledge captured by language models (LSTMs, transformers) and vision architectures (CNNs, vision transformers) for their respective uni-modal tasks. However, very few have explored what structures are acquired by multi-modal transformers, where linguistic and visual features are combined. It is critical to understand the representations learned by each modality, their interplay, and the task's effect on these representations in large-scale architectures. In this paper, we take a multi-modal transformer trained for image captioning and examine the structure of the self-attention patterns extracted from the visual stream. Our results indicate that the information about different relations between objects in the visual stream is hierarchical and varies from a local to a global object-level understanding of the image. In particular, while visual representations in the first layers encode knowledge of relations between semantically similar object detections, often constituting neighbouring objects, deeper layers spread their attention across more distant objects and learn global relations between them. We also show that globally attended objects in deeper layers can be linked with entities described in image descriptions, which points to a critical finding: the indirect effect of language on visual representations. In addition, we highlight how object-based input representations affect the structure of the learned visual knowledge and guide the model towards more accurate image descriptions. A parallel question we investigate is whether insights from cognitive science echo the structure of the representations that the current neural architecture learns. The proposed analysis of the inner workings of multi-modal transformers can be used to better understand and improve such problems as the pre-training of large-scale multi-modal architectures, multi-modal information fusion, and the probing of attention weights. In general, we contribute to explainable multi-modal natural language processing and to the currently shallow understanding of how input representations and the structure of the multi-modal transformer affect visual representations.
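As a minimal illustration of the attention-probing methodology the abstract describes, the Python sketch below (not the authors' code; the function name, the synthetic attention weights, and the box centres are all hypothetical stand-ins) computes an attention-weighted mean distance between detected object regions for each layer. Under the paper's finding, this quantity would grow with depth as attention moves from neighbouring to more distant objects.

import numpy as np

def mean_attended_distance(attn, centers):
    # attn: (heads, n_objs, n_objs) self-attention weights for one layer,
    # each row summing to 1. centers: (n_objs, 2) object box centres.
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    # Since rows of attn sum to 1, summing attn * dist over keys yields the
    # attention-weighted mean distance per head and query; average those.
    return float((attn * dist[None]).sum(axis=-1).mean())

# Synthetic stand-in for a captioning model's visual stream:
# 6 layers, 8 heads, 36 detected regions in a unit-square image.
rng = np.random.default_rng(0)
centers = rng.uniform(0.0, 1.0, size=(36, 2))
for layer in range(6):
    logits = rng.normal(size=(8, 36, 36))
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax
    print(f"layer {layer}: mean attended distance = "
          f"{mean_attended_distance(attn, centers):.3f}")

With real per-layer attention weights and detector box centres in place of the random tensors, a rising curve across layers would reproduce the local-to-global pattern reported in the paper.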
Pages: 22
Related Papers
50 items in total
  • [21] Combining Vision and Language Representations for Patch-based Identification of Lexico-Semantic Relations
    Jha, Prince
    Dias, Gael
    Lechervy, Alexis
    Moreno, Jose G.
    Jangra, Anubhav
    Pais, Sebastiao
    Saha, Sriparna
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4406 - 4415
  • [22] Enriching visual feature representations for vision-language tasks using spectral transforms
    Ondeng, Oscar
    Ouma, Heywood
    Akuon, Peter
    IMAGE AND VISION COMPUTING, 2025, 154
  • [23] SCIENCE AND ART - HUMAN VISION IN ART AND WHAT CAMERA DOES NOT SEE AND CANNOT RECORD
    BORNE, M
    TRANSACTIONS OF THE NEW YORK ACADEMY OF SCIENCES, 1974, 36 (05): : 490 - 490
  • [24] Transmission Versus Truth, Imitation Versus Innovation: What Children Can Do That Large Language and Language-and-Vision Models Cannot (Yet)? (Oct, 10.1177/17456916231201401, 2023)
    Yiu, E.
    Kosoy, E.
    Gopnik, A.
    PERSPECTIVES ON PSYCHOLOGICAL SCIENCE, 2024,
  • [25] Do you see what I see? The impact of age differences in time perspective on visual attention
    Thomas, Ruthann C.
    Kim, Sunghan
    Goldstein, David
    Hasher, Lynn
    Wong, Karen
    Ghai, Amrita
    JOURNALS OF GERONTOLOGY SERIES B-PSYCHOLOGICAL SCIENCES AND SOCIAL SCIENCES, 2007, 62 (05): : P247 - P252
  • [26] Exploring the Relationship Between Visual Information and Language Semantic Concept in the Human Brain
    Jing, Haodong
    Du, Ming
    Ma, Yongqiang
    Zheng, Nanning
    ARTIFICIAL INTELLIGENCE APPLICATIONS AND INNOVATIONS, AIAI 2022, PART I, 2022, 646 : 394 - 406
  • [27] TransVG++: End-to-End Visual Grounding With Language Conditioned Vision Transformer
    Deng, Jiajun
    Yang, Zhengyuan
    Liu, Daqing
    Chen, Tianlang
    Zhou, Wengang
    Zhang, Yanyong
    Li, Houqiang
    Ouyang, Wanli
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (11) : 13636 - 13652
  • [28] The Brain as a Constructor: What Does Visual Perception Have in Common With Language?
    Chernorizov, Aleksandr
    INTERNATIONAL JOURNAL OF PSYCHOPHYSIOLOGY, 2021, 168 : S44 - S44
  • [29] Homework through the Eyes of Children: what does visual ethnography invite us to see?
    Hutchison, Kirsten
    EUROPEAN EDUCATIONAL RESEARCH JOURNAL, 2011, 10 (04): : 545 - 558
  • [30] Smoking, nicotine and visual plasticity: Does what you know, tell you what you can see?
    Debski, Elizabeth A.
    BRAIN RESEARCH BULLETIN, 2008, 77 (05) : 221 - 226