Vision-Language Models for Robot Success Detection

Cited by: 0
Authors
Luo, Fiona [1]
Affiliation
[1] Univ Penn, Gen Robot Automat Sensing & Percept (GRASP) Lab, 3330 Walnut St, Philadelphia, PA 19104 USA
Keywords
DOI
Not available
CLC classification
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this work, we use Vision-Language Models (VLMs) as binary success detectors that take a robot observation and a task description, formulated as a Visual Question Answering (VQA) problem. We fine-tune the open-source MiniGPT-4 VLM to detect success on robot trajectories from the Berkeley Bridge and Berkeley AUTOLab UR5 datasets. We find that while a handful of in-distribution trajectories suffice to train an accurate detector, transferring a learned detector between different environments is challenging due to distribution shift. In addition, while our VLM is robust to language variations, it is less robust to visual variations. In the future, more powerful VLMs such as Gemini and GPT-4 have the potential to be more accurate and robust success detectors, and success detectors can provide a sparse binary reward to improve existing policies.
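The formulation in the abstract (observation + task description → yes/no VQA query → sparse binary reward) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt wording, the `vlm_answer` interface, and the stubbed model call are all assumptions standing in for the fine-tuned MiniGPT-4 described above.

```python
# Hedged sketch: robot success detection as binary VQA.
# The VLM call is stubbed; in the paper a fine-tuned MiniGPT-4 fills this role.

def build_vqa_prompt(task_description: str) -> str:
    """Phrase success detection as a yes/no VQA question (illustrative wording)."""
    return (
        f"The robot was asked to: {task_description}. "
        "Based on the image, did the robot complete the task? Answer yes or no."
    )

def parse_binary_reward(answer: str) -> int:
    """Map the VLM's free-form answer to a sparse binary reward."""
    return 1 if answer.strip().lower().startswith("yes") else 0

def stub_vlm_answer(image, question: str) -> str:
    # Placeholder for a real VLM inference call (assumption, not the paper's API).
    return "Yes, the object has been placed in the bin."

prompt = build_vqa_prompt("put the carrot in the bin")
reward = parse_binary_reward(stub_vlm_answer(None, prompt))  # reward is 0 or 1
```

Such a binary reward could then be logged per trajectory or fed to a policy-improvement loop, as the abstract suggests.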
Pages: 23750 - 23752
Page count: 3