Improving Text-Guided Object Inpainting with Semantic Pre-inpainting

Authors
Chen, Yifu [1, 2]
Chen, Jingwen [3]
Pan, Yingwei [3]
Li, Yehao [3]
Yao, Ting [3]
Chen, Zhineng [1, 2]
Mei, Tao [3]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
[2] Shanghai Collaborat Innovat Ctr Intelligent Visua, Shanghai, Peoples R China
[3] HiDreamai Inc, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Text-guided Object Inpainting; Diffusion Models
DOI
10.1007/978-3-031-72952-2_7
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The further pursuit of enhancing the editability of images has sparked significant interest in the downstream task of inpainting a novel object described by a text prompt within a designated region in the image. Nevertheless, the problem is not trivial from two aspects: 1) relying solely on a single U-Net to align the text prompt and the visual object across all the denoising timesteps is insufficient to generate desired objects; 2) the controllability of object generation is not guaranteed in the intricate sampling space of the diffusion model. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting that infers the semantic features of desired objects in a multi-modal feature space; 2) high-fidelity object generation in diffusion latent space that pivots on such inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioned on the unmasked context and the text prompt. The outputs of the semantic inpainter then act as informative visual prompts to guide high-fidelity object generation through a reference adapter layer, leading to controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion over state-of-the-art methods. Code is available at https://github.com/Nnn-s/CATdiffusion.
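The two-stage data flow described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: every function name, the feature sizes, the use of plain dot-product attention, and the placeholder denoising step are assumptions made for illustration only. It shows only the cascade, in which masked semantic features are first filled in from the unmasked context and the text, and the filled features then guide each denoising step through an adapter-style residual.

```python
import math
import random


def attend(queries, keys_values):
    """Single-head dot-product attention over plain lists (toy, no learned weights)."""
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                  for k in keys_values]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(len(q))])
    return out


def semantic_pre_inpaint(patch_feats, mask, text_feats):
    """Stage 1 (hypothetical): predict features for masked patches
    from the unmasked context and the text features."""
    context = [f for f, m in zip(patch_feats, mask) if not m] + text_feats
    dim = len(text_feats[0])
    # Assumption: the mean text feature stands in for a learned mask-token query.
    query = [sum(t[d] for t in text_feats) / len(text_feats) for d in range(dim)]
    masked_idx = [i for i, m in enumerate(mask) if m]
    predicted = attend([query] * len(masked_idx), context)
    filled = [list(f) for f in patch_feats]
    for i, p in zip(masked_idx, predicted):
        filled[i] = p
    return filled


def reference_adapter(latents, visual_prompts, scale=1.0):
    """Stage 2 hook (hypothetical): residual cross-attention to the visual prompts."""
    attended = attend(latents, visual_prompts)
    return [[x + scale * a for x, a in zip(lat, att)]
            for lat, att in zip(latents, attended)]


def denoise_step(latents, visual_prompts):
    """Placeholder for one denoising step; a real model would run a U-Net here."""
    guided = reference_adapter(latents, visual_prompts)
    return [[0.9 * x for x in row] for row in guided]


random.seed(0)
DIM, N_PATCH, N_TEXT = 4, 6, 3
patch_feats = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_PATCH)]
text_feats = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_TEXT)]
mask = [False, False, True, True, False, False]  # True marks the region to inpaint

visual_prompts = semantic_pre_inpaint(patch_feats, mask, text_feats)
latents = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_PATCH)]
for _ in range(3):
    latents = denoise_step(latents, visual_prompts)
```

Separating the two functions mirrors the decomposition named in the abstract: the semantic stage can be inspected or replaced without touching the generation stage, which only consumes its output as a conditioning signal.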
Pages: 110-126
Page count: 17