TRTST: Arbitrary High-Quality Text-Guided Style Transfer With Transformers

Cited by: 0
Authors
Chen, Haibo [1 ,2 ]
Wang, Zhoujie [1 ,2 ]
Zhao, Lei [3 ]
Li, Jun [1 ,2 ]
Yang, Jian [1 ,2 ]
Affiliations
[1] Nanjing Univ Sci & Technol, Minist Educ, PCA Lab, Key Lab Intelligent Percept & Syst High Dimens Inf, Nanjing, Peoples R China
[2] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Peoples R China
[3] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310027, Peoples R China
Funding
National Science Foundation (US);
Keywords
Transformers; Visualization; Training; Feature extraction; Training data; Image coding; Data models; Painting; Impedance matching; Encoding; Text-guided style transfer; transformer; unpaired; visual quality; generalization ability;
DOI
10.1109/TIP.2025.3530822
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text-guided style transfer aims to repaint a content image with the target style described by a text prompt, offering greater flexibility and creativity than traditional image-guided style transfer. Despite this potential, existing text-guided style transfer methods often suffer from issues such as insufficient visual quality, poor generalization ability, and reliance on large amounts of paired training data. To address these limitations, we leverage the inherent strengths of transformers in handling multimodal data and propose a novel transformer-based framework called TRTST that not only achieves unpaired arbitrary text-guided style transfer but also significantly improves visual quality. Specifically, TRTST combines a text transformer encoder with an image transformer encoder to project the input text prompt and content image into a joint embedding space and extract the desired style and content features. These features are then fed into a multimodal co-attention module to stylize the image sequence based on the text sequence. We also propose a new adaptive parametric positional encoding (APPE) scheme, which uses a position encoder to adaptively produce positional encodings that optimally match different inputs. In addition, to further improve content preservation, we introduce a text-guided identity loss to our model. Extensive experiments and comparisons demonstrate the effectiveness and superiority of our method.
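
For illustration only, the following is a minimal PyTorch sketch of how the components named in the abstract (image tokens conditioned on text tokens through multimodal co-attention, an adaptive parametric positional encoding, and a text-guided identity loss) could be wired together. All module names, shapes, and design details here are assumptions made for this example and are not taken from the paper's implementation.

```python
# Hypothetical sketch of the ideas described in the abstract.
# Names (APPE, CoAttentionBlock, d_model, ...) are illustrative assumptions,
# not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class APPE(nn.Module):
    """Adaptive parametric positional encoding (sketch): a small position
    encoder maps normalized token positions, conditioned on the token
    content, to per-token positional codes instead of using a fixed table."""
    def __init__(self, d_model: int):
        super().__init__()
        self.pos_mlp = nn.Sequential(
            nn.Linear(1 + d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, d_model)
        B, N, _ = tokens.shape
        pos = torch.linspace(0, 1, N, device=tokens.device).view(1, N, 1).expand(B, N, 1)
        # Condition the positional code on the input itself (the "adaptive" part).
        return tokens + self.pos_mlp(torch.cat([pos, tokens], dim=-1))


class CoAttentionBlock(nn.Module):
    """Multimodal co-attention (sketch): image tokens act as queries and
    attend to text tokens (keys/values), so the prompt drives stylization."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.cross_attn(self.norm1(img_tokens), txt_tokens, txt_tokens)
        img_tokens = img_tokens + attn_out
        return img_tokens + self.ffn(self.norm2(img_tokens))


def text_guided_identity_loss(stylize, content_tokens, content_text_tokens):
    """Identity loss (sketch): when the text prompt merely describes the
    content image itself, stylization should reproduce the input tokens."""
    reconstructed = stylize(content_tokens, content_text_tokens)
    return F.l1_loss(reconstructed, content_tokens)


if __name__ == "__main__":
    d = 256
    img = torch.randn(2, 196, d)   # e.g. 14x14 patch tokens from an image encoder
    txt = torch.randn(2, 16, d)    # token embeddings from a text encoder
    block = CoAttentionBlock(d)
    appe = APPE(d)
    out = block(appe(img), txt)    # text-conditioned image tokens
    print(out.shape)               # torch.Size([2, 196, 256])
```

In this sketch the text tokens modulate every image patch through cross-attention, and the APPE module conditions its positional code on the token content rather than a fixed sinusoidal table, mirroring the adaptive behavior the abstract describes; both choices are illustrative assumptions rather than the paper's design.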
Pages: 759 - 771
Number of pages: 13
Related Papers
50 related records in total
  • [31] Minimalist and High-Quality Panoramic Imaging With PSF-Aware Transformers
    Jiang, Qi
    Gao, Shaohua
    Gao, Yao
    Yang, Kailun
    Yi, Zhonghua
    Shi, Hao
    Sun, Lei
    Wang, Kaiwei
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 4568 - 4583
  • [32] UDiffText: A Unified Framework for High-Quality Text Synthesis in Arbitrary Images via Character-Aware Diffusion Models
    Zhao, Yiming
    Lian, Zhouhui
    COMPUTER VISION - ECCV 2024, PT XXXI, 2025, 15089 : 217 - 233
  • [33] Arbitrary style transfer using neurally-guided patch-based synthesis
    Texler, Ondrej
    Futschik, David
    Fiser, Jakub
    Lukac, Michal
    Lu, Jingwan
    Shechtman, Eli
    Sykora, Daniel
    COMPUTERS & GRAPHICS-UK, 2020, 87 : 62 - 71
  • [34] EmoAsst: emotion recognition assistant via text-guided transfer learning on pre-trained visual and acoustic models
    Wang, Minxiao
    Yang, Ning
    FRONTIERS IN COMPUTER SCIENCE, 2024, 6
  • [35] Multi-layer feature fusion based image style transfer with arbitrary text condition
    Yu, Yue
    Xing, Jingshuo
    Li, Nengli
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2025, 132
  • [36] Adaptive Prompt Routing for Arbitrary Text Style Transfer with Pre-trained Language Models
    Liu, Qingyi
    Qin, Jinghui
    Ye, Wenxuan
    Mou, Hao
    He, Yuxuan
    Wang, Keze
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 17, 2024 : 18689 - 18697
  • [37] PortaSpeech: Portable and High-Quality Generative Text-to-Speech
    Ren, Yi
    Liu, Jinglin
    Zhao, Zhou
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [38] QUERYD: A VIDEO DATASET WITH HIGH-QUALITY TEXT AND AUDIO NARRATIONS
    Oncescu, Andreea-Maria
    Henriques, Joao F.
    Liu, Yang
    Zisserman, Andrew
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 2265 - 2269
  • [39] EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture
    Miao, Chenfeng
    Liang, Shuang
    Liu, Zhencheng
    Chen, Minchuan
    Ma, Jun
    Wang, Shaojun
    Xiao, Jing
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139
  • [40] Efficient Frequency Domain-based Transformers for High-Quality Image Deblurring
    Kong, Lingshun
    Dong, Jiangxin
    Ge, Jianjun
    Li, Mingqiang
    Pan, Jinshan
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 5886 - 5895