Arbitrary Image Style Transfer Based on Swapping Attention Mechanism
2025, Pages: 1-13
Received: 2024-10-26
Revised: 2025-01-19
Accepted: 2025-02-18
Published online: 2025-02-20
DOI: 10.11834/jig.240652
Zhang Yuxin, Dong Weiming, Xu Changsheng. Arbitrary Image Style Transfer Based on Swapping Attention Mechanism[J/OL]. Journal of Image and Graphics, 2025: 1-13.
Objective
Image style transfer aims to apply an artistic style to a real photograph while keeping its content intact. Traditional style transfer methods suffer from artifacts and blurred styles, and text-to-image diffusion models struggle to express the creative intent of a specific artwork precisely through language. Existing studies improve style accuracy by fine-tuning models or by text embedding techniques, but fine-tuning introduces additional computational cost and is inefficient.
Method
To address these challenges, this paper proposes an efficient style transfer method that requires no additional model fine-tuning, built on a novel three-branch parallel style transfer framework consisting of an image generation path, a content guidance path, and a style guidance path. The method uses photographic and artistic images as visual prompts and, through a novel attention-swapping mechanism, injects information from the guidance images into the new image generation path, enabling flexible and efficient arbitrary image style transfer.
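For illustration, the following is a minimal PyTorch sketch of the attention-swapping idea described above; it is not the authors' released implementation, and the names swap_attention, hidden_gen, and hidden_ref are illustrative.

```python
import torch
import torch.nn.functional as F

def swap_attention(hidden_gen, hidden_ref, to_q, to_k, to_v, num_heads=8):
    """Self-attention in which queries come from the generation path while
    keys and values are swapped in from a reference path.

    hidden_gen, hidden_ref: (batch, tokens, dim) features of two parallel
    denoising paths at the same UNet layer and timestep.
    to_q, to_k, to_v: the layer's existing linear projections (weights are
    shared across paths; only the input features are exchanged).
    """
    b, n, d = hidden_gen.shape

    q = to_q(hidden_gen)   # queries stay on the generation path
    k = to_k(hidden_ref)   # keys are swapped in from the reference path
    v = to_v(hidden_ref)   # values are swapped in from the reference path

    # split heads: (b, tokens, d) -> (b, heads, tokens, d/heads)
    q, k, v = (t.view(b, -1, num_heads, d // num_heads).transpose(1, 2)
               for t in (q, k, v))

    # similarity between generation queries and reference keys re-weights
    # the reference values, injecting the reference image's information
    out = F.scaled_dot_product_attention(q, k, v)
    return out.transpose(1, 2).reshape(b, n, d)
```

Under the hierarchical reading given in the Method section below, hidden_ref would come from the content path in deep (low-resolution) self-attention blocks and from the style path in shallow (high-resolution) blocks.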
Result
Qualitative and quantitative comparisons show that the proposed method generates high-quality style transfer results at inference time alone. Qualitative experiments demonstrate that the method handles a wide range of subjects, such as portraits, landscapes, and still lifes, and a wide range of styles, such as ink wash, oil painting, and sketch, with visual quality surpassing current state-of-the-art style transfer methods. Quantitative experiments show that the method improves the style accuracy metric and achieves state-of-the-art results.
Conclusion
The proposed method is demonstrated to generate high-quality and style-accurate style transfer results.
Objective
Generative methods based on diffusion models have garnered widespread attention due to their diverse generation outcomes and high-quality results. Compared to traditional style transfer frameworks, diffusion models hold a distinct advantage in the diversity, quality, and realism of generated images, and are effective at capturing complex data distributions. Owing to their robust data modeling capabilities, diffusion models are particularly suitable for multimodal generation tasks. However, applying text-to-image diffusion models to style transfer still faces several challenges. First, the artistic styles involved in style transfer are highly visual and artistic in character and difficult to express precisely through language. Although text-guided stylization techniques can generate artistic images from natural-language prompts, textual prompts for a target style are often limited to a rough description of materials, artistic genres, artists, or famous artworks. Reproducing vivid content and style usually requires detailed auxiliary textual input to guide the generation process, and even then the results may fail to fully capture the creativity and conception of a specific painting. Second, researchers in image generation have explored reference images as visual cues to enhance the quality and diversity of generated images, mainly by fine-tuning diffusion models or by text embedding techniques; these approaches often come with high fine-tuning costs, including the need for large amounts of data and computational resources.
Method
The core of image style transfer lies in two aspects: transferring the artistic style of a reference image while maintaining the content information of the original image. We design a novel three-branch parallel generation framework, comprising the new image generation path, the content reference path, and the style reference path, with the goal of integrating style and content features into the image generation process. This feature fusion is achieved through an interactive attention module. Specifically, the process begins with three generation paths: one starts from text prompts and initial noise and naturally forms an image matching the prompts; the other two are anchored to an artistic image and a content image, respectively. To generate an image that integrates the reference style features, we exploit the hierarchical generation characteristics of the diffusion model: structural information is generated in the deep layers of the network, while texture, color, and similar information is generated in the shallow layers, corresponding to the content and style information in this task. Based on this characteristic, in the deep self-attention modules of the diffusion model's backbone, the key and value features of the original path are exchanged with the corresponding features of the content path; in the shallow self-attention modules, they are exchanged with the corresponding features of the style path. In this way, the attention mechanism uses the similarity between the original query features and the reference key features to re-weight and reconstruct the reference value features.

Tone consistency is a key factor in evaluating style transfer. Although interactive-attention methods transfer the brushstrokes and textures of artistic images well, they often struggle to align the overall tone of the content image with the style image. Image generation in diffusion models depends not only on the generation network and text conditions but is also significantly influenced by the initial noise; previous studies have pointed out that the color tone of a generated image is closely related to the sampling of the initial noise. Based on this finding, we address the tone consistency between the generated image and the style image by adjusting the initial noise: we propose aligning the initial noise of the generated image with the distribution of the style image.
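The abstract does not give the exact alignment rule, so the sketch below implements one plausible reading: channel-wise statistic matching of fresh Gaussian noise against the fully noised latent of the style image, using standard diffusers VAE/scheduler interfaces. The function name and the AdaIN-style matching rule are assumptions, not the paper's specification.

```python
import torch

@torch.no_grad()
def style_aligned_noise(style_image, vae, scheduler, generator=None):
    """style_image: (1, 3, H, W) tensor scaled to [-1, 1]."""
    # encode the style image into the diffusion latent space
    latent = vae.encode(style_image).latent_dist.sample(generator)
    latent = latent * vae.config.scaling_factor

    # noise the style latent to the last training timestep so it lies in
    # the same (near-Gaussian) distribution as fresh sampling noise
    t = torch.tensor([scheduler.config.num_train_timesteps - 1],
                     device=latent.device)
    noised_style = scheduler.add_noise(latent, torch.randn_like(latent), t)

    # channel-wise statistic matching (AdaIN-style): re-normalise fresh
    # noise so its per-channel mean/std follow the noised style latent
    fresh = torch.randn(latent.shape, generator=generator,
                        device=latent.device, dtype=latent.dtype)
    mu_f = fresh.mean(dim=(2, 3), keepdim=True)
    std_f = fresh.std(dim=(2, 3), keepdim=True)
    mu_s = noised_style.mean(dim=(2, 3), keepdim=True)
    std_s = noised_style.std(dim=(2, 3), keepdim=True)
    return (fresh - mu_f) / (std_f + 1e-6) * std_s + mu_s
```

The returned tensor would replace the i.i.d. Gaussian noise that normally seeds the generation path, biasing the sampled tone toward the style image.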
Result
We conducted a comparative analysis against five state-of-the-art traditional style transfer methods: ArtFlow, AdaAttN, StyTr2, CAST, and AesPA-Net. The test set comprised 100 artistic images from WikiArt and 100 real-world images from the Places365 dataset, randomly sampled from collections exceeding one hundred thousand images to ensure a fair comparison. All baselines were evaluated with their publicly available implementations and default configurations. Regarding implementation details, we employed SDXL with its default hyperparameters throughout all experiments. Our approach improved style accuracy by 4%, achieving state-of-the-art style transfer performance. Furthermore, we compared our method with two of the most advanced diffusion-based style transfer methods, InST and Z*; our method adeptly transferred the tonality and brushstrokes of the style image while achieving the best content preservation.
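For reference, the reported backbone setup (SDXL with default hyperparameters) can be loaded with the diffusers library as sketched below. The checkpoint identifier is the public SDXL base release; the paper does not state which exact variant was used, so this is an assumption.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# load the public SDXL base checkpoint with its default configuration
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# a plain generation call with default sampler settings; the three
# parallel paths in Method would share this pipeline's UNet and scheduler
image = pipe("an oil painting of a mountain lake",
             num_inference_steps=50).images[0]
image.save("stylized_sample.png")
```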
Conclusion
Through qualitative and quantitative comparisons with other arbitrary style transfer methods, our method is shown to generate higher-quality and more accurate style transfer results. Future work will focus on further optimizing the algorithm to enhance the model's adaptability to diverse styles and content, and on exploring more innovative application scenarios. We anticipate that this method will play a greater role in fields such as artistic creation and multimedia design, providing users with richer and more personalized visual experiences.
Liao Y H, Qian W H and Cao J D. 2023. MStarGAN: a face style transfer network with changeable style intensity. Journal of Image and Graphics, 28(12): 3784-3796 [DOI: 10.11834/jig.221149]
Liu Z L, Zhu W and Yuan Z Y. 2019. Image instance style transfer combined with fully convolutional network and CycleGAN. Journal of Image and Graphics, 24(8): 1283-1291 [DOI: 10.11834/jig.180624]
Sun M T, Dai L Q and Tang J H. 2023. Transformer-based multi-style information transfer in image processing. Journal of Image and Graphics, 28(11): 3536-3549 [DOI: 10.11834/jig.211237]
An J, Huang S, Song Y, Dou D, Liu W and Luo J. 2021. ArtFlow: unbiased image style transfer via reversible neural flows. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 862-871 [DOI: 10.1109/CVPR46437.2021.00092]
Avrahami O, Lischinski D and Fried O. 2022. Blended diffusion for text-driven editing of natural images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 18208-18218 [DOI: 10.1109/CVPR52688.2022.01767]
Chen H, Zhao L, Wang Z, Zhang H, Zuo Z, Li A, Xing W and Lu D. 2021. DualAST: dual style-learning networks for artistic style transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 872-881 [DOI: 10.1109/CVPR46437.2021.00093]
Deng Y, He X, Tang F and Dong W. 2023. Z*: zero-shot style transfer via attention rearrangement. arXiv preprint arXiv:2311.16491 [DOI: 10.48550/arXiv.2311.16491]
Deng Y, Tang F, Dong W, Ma C, Pan X, Wang L and Xu C. 2022. StyTr2: image style transfer with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 11326-11336 [DOI: 10.1109/CVPR52688.2022.01104]
Gatys L A, Ecker A S and Bethge M. 2016. Image style transfer using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition: 2414-2423 [DOI: 10.1109/CVPR.2016.265]
Gal R, Alaluf Y, Atzmon Y, Patashnik O, Bermano A H, Chechik G and Cohen-Or D. 2022. An image is worth one word: personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 [DOI: 10.48550/arXiv.2208.01618]
Ho J, Jain A and Abbeel P. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 6840-6851 [DOI: 10.48550/arXiv.2006.11239]
Hong K, Jeon S, Lee J, Ahn N, Kim K, Lee P, Kim D, Uh Y and Byun H. 2023. AesPA-Net: aesthetic pattern-aware style transfer networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision: 22758-22767 [DOI: 10.1109/ICCV51070.2023.02080]
Huang X and Belongie S. 2017. Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE International Conference on Computer Vision: 1501-1510 [DOI: 10.1109/ICCV.2017.167]
Johnson J, Alahi A and Fei-Fei L. 2016. Perceptual losses for real-time style transfer and super-resolution. In: Computer Vision - ECCV 2016, Part II: 694-711 [DOI: 10.1007/978-3-319-46475-6_43]
Kumari N, Zhang B, Zhang R, Shechtman E and Zhu J Y. 2023. Multi-concept customization of text-to-image diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 1931-1941 [DOI: 10.1109/CVPR52729.2023.00192]
Li Y, Fang C, Yang J, Wang Z, Lu X and Yang M H. 2017. Universal style transfer via feature transforms. Advances in Neural Information Processing Systems, 30 [DOI: 10.48550/arXiv.1705.08086]
Liu S, Lin T, He D, Li F, Wang M, Li X, Sun Z, Li Q and Ding E. 2021. AdaAttN: revisit attention mechanism in arbitrary neural style transfer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision: 6649-6658 [DOI: 10.1109/ICCV48922.2021.00658]
Mirza M and Osindero S. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 [DOI: 10.48550/arXiv.1411.1784]
Park D Y and Lee K H. 2019. Arbitrary style transfer with style-attentional networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 5880-5888 [DOI: 10.1109/CVPR.2019.00603]
Phillips F and Mackintosh B. 2011. Wiki Art Gallery, Inc.: a case for critical thinking. Issues in Accounting Education, 26(3): 593-608 [DOI: 10.2308/iace-50038]
Podell D, English Z, Lacey K, et al. 2023. SDXL: improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 [DOI: 10.48550/arXiv.2307.01952]
Rombach R, Blattmann A, Lorenz D, Esser P and Ommer B. 2022. High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 10684-10695 [DOI: 10.1109/CVPR52688.2022.01042]
Ruiz N, Li Y, Jampani V, Pritch Y, Rubinstein M and Aberman K. 2023. DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 22500-22510 [DOI: 10.1109/CVPR52729.2023.02155]
Song J, Meng C and Ermon S. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 [DOI: 10.48550/arXiv.2010.02502]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L and Polosukhin I. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30
Zhang Y, Huang N, Tang F, Huang H, Ma C, Dong W and Xu C. 2023. Inversion-based style transfer with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 10146-10156 [DOI: 10.1109/CVPR52729.2023.00978]
Zhang Y, Tang F, Dong W, Huang H, Ma C, Lee T Y and Xu C. 2022. Domain enhanced arbitrary image style transfer via contrastive learning. In: ACM SIGGRAPH 2022 Conference Proceedings: 1-8 [DOI: 10.1145/3528233.3530736]
Zhou B, Lapedriza A, Khosla A, Oliva A and Torralba A. 2017. Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6): 1452-1464 [DOI: 10.1109/TPAMI.2017.2723009]
Zhu J Y, Park T, Isola P and Efros A A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision: 2223-2232 [DOI: 10.1109/ICCV.2017.244]