Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach

Cited by: 0

Authors:

Lee, Saehyung [1]

Yu, Sangwon [1]

Park, Junsung [1]

Yi, Jihun [1]

Yoon, Sungroh [1,2]

Affiliations:

[1] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul, South Korea

[2] Seoul Natl Univ, Interdisciplinary Program Artificial Intelligence, Seoul, South Korea

Source:

PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS | 2024

Funding:

National Research Foundation, Singapore;

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this paper, we primarily address the issue of dialogue-form context query within the interactive text-to-image retrieval task. Our methodology, PlugIR, actively utilizes the general instruction-following capability of LLMs in two ways. First, by reformulating the dialogue-form context, we eliminate the necessity of fine-tuning a retrieval model on existing visual dialogue data, thereby enabling the use of any arbitrary black-box model. Second, we construct the LLM questioner to generate non-redundant questions about the attributes of the target image, based on the information of retrieval candidate images in the current context. This approach mitigates the issues of noisiness and redundancy in the generated questions. Beyond our methodology, we propose a novel evaluation metric, Best log Rank Integral (BRI), for a comprehensive assessment of the interactive retrieval system. PlugIR demonstrates superior performance compared to both zero-shot and fine-tuned baselines in various benchmarks. Additionally, the two methodologies comprising PlugIR can be flexibly applied together or separately in various situations. Our code is available at https://github.com/Saehyung-Lee/PlugIR.


Pages: 791-809

Page count: 19
