Towards Cognition-Aligned Visual Language Models via Zero-Shot Instance Retrieval

Cited by: 1
Authors
Ma, Teng [1 ]
Organisciak, Daniel [2 ]
Ma, Wenbao [1 ]
Long, Yang [3 ]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Humanities & Social Sci, Xian 710049, Peoples R China
[2] Northumbria Univ, Dept Comp & Informat Sci, Newcastle Upon Tyne NE1 8ST, England
[3] Univ Durham, Dept Comp Sci, Durham DH1 3LE, England
Keywords
large visual language models; zero-shot instance retrieval; cognition alignment;
DOI
10.3390/electronics13091660
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812
Abstract
The pursuit of Artificial Intelligence (AI) that emulates human cognitive processes is a cornerstone of ethical AI development, ensuring that emerging technologies can seamlessly integrate into societal frameworks requiring nuanced understanding and decision-making. Zero-Shot Instance Retrieval (ZSIR) stands at the forefront of this endeavour, potentially providing a robust platform for AI systems, particularly large visual language models, to demonstrate and refine cognition-aligned learning without the need for direct experience. In this paper, we critically evaluate current cognition alignment methodologies within traditional zero-shot learning paradigms using visual attributes and word embeddings generated by large AI models. We propose a unified similarity function that quantifies the level of cognitive alignment, bridging the gap between AI processes and human-like understanding. Through extensive experimentation, our findings illustrate that this similarity function can effectively mirror the visual-semantic gap, steering the model towards enhanced performance in Zero-Shot Instance Retrieval. Our models achieve state-of-the-art bi-directional image-attribute retrieval accuracy on both the SUN (92.8% and 82.2%) and CUB (59.92% and 48.82%) datasets. This work not only benchmarks the cognition alignment of AI but also sets a new precedent for the development of visual language models attuned to the complexities of human cognition.
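The record does not specify the paper's unified similarity function. As a generic, illustrative sketch only, bi-directional image-attribute retrieval of the kind reported in the abstract is commonly scored with a cosine-similarity matrix over a shared visual-semantic embedding space; all names and the toy data below are assumptions, not the authors' method.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarities between the rows of a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy data: 3 image embeddings and 3 attribute vectors that are
# slightly perturbed copies of them, standing in for a shared space.
rng = np.random.default_rng(0)
image_emb = rng.standard_normal((3, 32))
attr_emb = image_emb + 0.1 * rng.standard_normal((3, 32))

sim = cosine_similarity_matrix(image_emb, attr_emb)

# Bi-directional retrieval: each image picks its nearest attribute
# vector (image -> attribute), and each attribute vector picks its
# nearest image (attribute -> image).
img_to_attr = sim.argmax(axis=1)
attr_to_img = sim.argmax(axis=0)
print(img_to_attr, attr_to_img)
```

Retrieval accuracy in each direction is then the fraction of queries whose top-ranked match is the ground-truth pairing.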
Pages: 20
Related Papers
Total: 50 records
  • [41] Zero-shot Video Moment Retrieval With Off-the-Shelf Models
    Diwan, Anuj
    Peng, Puyuan
    Mooney, Raymond J.
    TRANSFER LEARNING FOR NATURAL LANGUAGE PROCESSING WORKSHOP, VOL 203, 2022, 203 : 10 - 21
  • [42] Instance-wise multi-view visual fusion for zero-shot learning
    Tang, Long
    Zhao, Jingtao
    Tian, Yingjie
    Yao, Changhua
    Pardalos, Panos M.
    APPLIED SOFT COMPUTING, 2024, 167
  • [43] Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval
    Ueki, Kazuya
    20TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2021), 2021, : 628 - 634
  • [44] Zero-Shot Learning via Structure-Aligned Generative Adversarial Network
    Tang, Chenwei
    He, Zhenan
    Li, Yunxia
    Lv, Jiancheng
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (11) : 6749 - 6762
  • [45] Language-only Efficient Training of Zero-shot Composed Image Retrieval
    Gu, Geonmo
    Chun, Sanghyuk
    Kim, Wonjae
    Kang, Yoohoon
    Yun, Sangdoo
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 13225 - 13234
  • [46] From Images to Textual Prompts: Zero-shot Visual Question Answering with Frozen Large Language Models
    Guo, Jiaxian
    Li, Junnan
    Li, Dongxu
    Tiong, Anthony Meng Huat
    Li, Boyang
    Tao, Dacheng
    Hoi, Steven
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 10867 - 10877
  • [47] Towards zero-shot human-object interaction detection via vision-language integration
    Xue, Weiying
    Liu, Qi
    Wang, Yuxiao
    Wei, Zhenao
    Xing, Xiaofen
    Xu, Xiangmin
    NEURAL NETWORKS, 2025, 187
  • [48] MEDAGENTS: Large Language Models as Collaborators for Zero-shot Medical Reasoning
    Tang, Xiangru
    Zou, Anni
    Zhang, Zhuosheng
    Li, Ziming
    Zhao, Yilun
    Zhang, Xingyao
    Cohan, Arman
    Gerstein, Mark
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 599 - 621
  • [49] Zero-Shot Translation of Attention Patterns in VQA Models to Natural Language
    Salewski, Leonard
    Koepke, A. Sophia
    Lensch, Hendrik P. A.
    Akata, Zeynep
    PATTERN RECOGNITION, DAGM GCPR 2023, 2024, 14264 : 378 - 393
  • [50] Label Propagation for Zero-shot Classification with Vision-Language Models
    Stojnic, Vladan
    Kalantidis, Yannis
    Tolias, Giorgos
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 23209 - 23218