Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach

Authors
Lee, Saehyung [1]
Yu, Sangwon [1]
Park, Junsung [1]
Yi, Jihun [1]
Yoon, Sungroh [1, 2]
Affiliations
[1] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul, South Korea
[2] Seoul Natl Univ, Interdisciplinary Program Artificial Intelligence, Seoul, South Korea
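Funding
National Research Foundation, Singapore;
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract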
In this paper, we primarily address the issue of dialogue-form context query within the interactive text-to-image retrieval task. Our methodology, PlugIR, actively utilizes the general instruction-following capability of LLMs in two ways. First, by reformulating the dialogue-form context, we eliminate the necessity of fine-tuning a retrieval model on existing visual dialogue data, thereby enabling the use of any arbitrary black-box model. Second, we construct the LLM questioner to generate non-redundant questions about the attributes of the target image, based on the information of retrieval candidate images in the current context. This approach mitigates the issues of noisiness and redundancy in the generated questions. Beyond our methodology, we propose a novel evaluation metric, Best log Rank Integral (BRI), for a comprehensive assessment of the interactive retrieval system. PlugIR demonstrates superior performance compared to both zero-shot and fine-tuned baselines in various benchmarks. Additionally, the two methodologies comprising PlugIR can be flexibly applied together or separately in various situations. Our code is available at https://github.com/Saehyung-Lee/PlugIR.
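The two components described in the abstract (reformulating the dialogue into a query that a black-box retriever can use, and asking non-redundant questions informed by the current candidates) can be pictured with a minimal sketch. This is not the authors' implementation: every name below (`interactive_retrieval`, `toy_reformulate`, `toy_retrieve`, `toy_ask`, `toy_answer`, `IMAGES`) is hypothetical, the LLM and the retrieval model are replaced by toy word-overlap stand-ins, and the redundancy check is a crude exact-match test assumed only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Dialogue = List[Tuple[str, str]]  # (question, answer) pairs collected so far


def interactive_retrieval(
    description: str,
    reformulate: Callable[[str, Dialogue], str],
    retrieve: Callable[[str], List[str]],
    ask: Callable[[str, Dialogue, List[str]], str],
    answer: Callable[[str], str],
    rounds: int = 3,
) -> List[str]:
    """Hypothetical loop: reformulate, retrieve, ask, repeat."""
    dialogue: Dialogue = []
    for _ in range(rounds):
        # Turn the dialogue-form context into a single retriever-friendly query.
        query = reformulate(description, dialogue)
        candidates = retrieve(query)
        # The questioner sees the current candidates when picking a question.
        question = ask(description, dialogue, candidates)
        if any(question == q for q, _ in dialogue):
            break  # assumed redundancy check: stop on a repeated question
        dialogue.append((question, answer(question)))
    return retrieve(reformulate(description, dialogue))


# Toy stand-ins (assumptions, not real models): captions play the role of images.
IMAGES: Dict[str, str] = {
    "img_a": "a brown dog on a beach",
    "img_b": "a brown dog in a park with a red ball",
    "img_c": "a cat on a sofa",
}


def toy_reformulate(description: str, dialogue: Dialogue) -> str:
    # A real system would prompt an LLM; here the answers are simply appended.
    return " ".join([description] + [a for _, a in dialogue])


def toy_retrieve(query: str) -> List[str]:
    # Rank images by word overlap between the query and each caption.
    words = set(query.lower().split())
    return sorted(IMAGES, key=lambda k: -len(words & set(IMAGES[k].split())))


def toy_ask(description: str, dialogue: Dialogue, candidates: List[str]) -> str:
    asked = {q for q, _ in dialogue}
    for q in ("Where is it?", "What objects are nearby?"):
        if q not in asked:
            return q
    return "Where is it?"  # nothing new left to ask


def toy_answer(question: str) -> str:
    # Simulated user who has img_b in mind.
    return {"Where is it?": "in a park", "What objects are nearby?": "a red ball"}[question]


ranking = interactive_retrieval(
    "a brown dog", toy_reformulate, toy_retrieve, toy_ask, toy_answer
)
print(ranking)
```

Because the retriever is only ever called with a plain text query, any text-to-image retrieval model could be swapped in for `toy_retrieve` without retraining, which is the plug-and-play property the abstract emphasizes.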
Pages: 791 - 809
Page count: 19