Improving Text-Guided Object Inpainting with Semantic Pre-inpainting

Cited by: 0
Authors
Chen, Yifu [1 ,2 ]
Chen, Jingwen [3 ]
Pan, Yingwei [3 ]
Li, Yehao [3 ]
Yao, Ting [3 ]
Chen, Zhineng [1 ,2 ]
Mei, Tao [3 ]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
[2] Shanghai Collaborat Innovat Ctr Intelligent Visua, Shanghai, Peoples R China
[3] HiDreamai Inc, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Text-guided Object Inpainting; Diffusion Models;
DOI
10.1007/978-3-031-72952-2_7
CLC number
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The further pursuit of enhancing the editability of images has sparked significant interest in the downstream task of inpainting a novel object described by a text prompt within a designated region in the image. Nevertheless, the problem is not trivial from two aspects: 1) Solely relying on one single U-Net to align text prompt and visual object across all the denoising timesteps is insufficient to generate desired objects; 2) The controllability of object generation is not guaranteed in the intricate sampling space of diffusion models. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting that infers the semantic features of desired objects in a multi-modal feature space; 2) high-fidelity object generation in diffusion latent space that pivots on such inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioned on the unmasked context and the text prompt. The outputs of the semantic inpainter then act as informative visual prompts that guide high-fidelity object generation through a reference adapter layer, leading to controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion against the state-of-the-art methods. Code is available at https://github.com/Nnn-s/CATdiffusion.
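The two-stage cascade described in the abstract can be sketched in a few lines. This is a minimal illustrative stand-in, not the paper's implementation: the real semantic inpainter is a Transformer over multi-modal tokens and the real reference adapter injects features into a diffusion U-Net via attention; here both are replaced with simple NumPy operations, and all function names and shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # feature dimension (illustrative)

def semantic_inpainter(context_feats, mask, text_feats):
    """Stage 1 (stand-in): predict semantic features of the masked object
    conditioned on the unmasked context and the text prompt. Here we simply
    blend the mean unmasked context feature with the text feature; the paper
    uses a trained Transformer in a multi-modal feature space."""
    ctx = context_feats[~mask].mean(axis=0)
    pred = 0.5 * ctx + 0.5 * text_feats
    # fill only the masked positions; unmasked tokens keep their features
    return np.where(mask[:, None], pred, context_feats)

def reference_adapter(unet_feats, semantic_feats, alpha=0.3):
    """Stage 2 (stand-in): use the inpainted semantic features as a visual
    prompt for the diffusion model. The paper injects them through a
    reference adapter layer; a weighted residual stands in for that here."""
    return unet_feats + alpha * semantic_feats

# toy data: 6 spatial tokens, 2 of them masked
context = rng.normal(size=(6, D))
mask = np.array([False, False, True, True, False, False])
text = rng.normal(size=(D,))

sem = semantic_inpainter(context, mask, text)   # stage 1
out = reference_adapter(rng.normal(size=(6, D)), sem)  # stage 2

assert np.allclose(sem[~mask], context[~mask])  # context is preserved
print(out.shape)  # (6, 8)
```

The point of the decomposition is visible even in this toy form: stage 1 resolves *what* the object should be (a semantic feature prediction), so stage 2 only has to resolve *how it looks*, pivoting on a fixed semantic target rather than re-aligning text and object at every denoising timestep.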
Pages: 110-126 (17 pages)