面向驾驶场景精准图像翻译的条件扩散模型
Precise image translation based on conditional diffusion model for driving scenarios
2024年,第29卷,第11期,页码: 3305-3318
纸质出版日期: 2024-11-16
DOI: 10.11834/jig.230785
徐映芬, 胡学敏, 黄婷玉, 李燊, 陈龙. 2024. 面向驾驶场景精准图像翻译的条件扩散模型. 中国图象图形学报, 29(11):3305-3318
Xu Yingfen, Hu Xuemin, Huang Tingyu, Li Shen, Chen Long. 2024. Precise image translation based on conditional diffusion model for driving scenarios. Journal of Image and Graphics, 29(11):3305-3318
目的
针对虚拟到现实驾驶场景翻译中成对的数据样本匮乏、翻译结果不精准以及模型训练不稳定等问题,提出一种多模态数据融合的条件扩散模型。
方法
首先,为解决目前主流的基于生成对抗网络的图像翻译方法中存在的模式崩塌、训练不稳定等问题,以生成多样性强、训练稳定性好的扩散模型为基础,构建图像翻译模型;其次,为解决传统扩散模型无法融入先验信息从而无法控制图像生成这一问题,提出基于多头自注意力机制的多模态特征融合方法,该方法能将多模态信息融入扩散模型的去噪过程,从而起到条件控制的作用;最后,基于语义分割图和深度图能分别表征物体的轮廓信息和深度信息这一特点,将其与噪声图像进行融合后输入去噪网络,以此构建多模态数据融合的条件扩散模型,从而实现更精准的驾驶场景图像翻译。
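为便于理解,下面给出上述多模态特征融合思路的一个极简示意(PyTorch 实现;其中的模块名、通道数以及“先拼接后卷积”的早期融合方式均为示意性假设,并非本文公开的实现):

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Fuse the noisy image, semantic segmentation map, and depth map with
    multi-head self-attention (illustrative sketch, not the authors' code)."""

    def __init__(self, in_channels: int = 7, embed_dim: int = 128, num_heads: int = 4):
        super().__init__()
        # Early fusion: concatenate modalities along the channel axis,
        # then extract higher-level features with a convolution.
        self.conv = nn.Conv2d(in_channels, embed_dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, noisy_rgb, seg_map, depth_map):
        # noisy_rgb: (B, 3, H, W), seg_map: (B, 3, H, W), depth_map: (B, 1, H, W)
        x = torch.cat([noisy_rgb, seg_map, depth_map], dim=1)   # (B, 7, H, W)
        feats = self.conv(x)                                     # (B, C, H, W)
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)                # (B, H*W, C)
        fused, _ = self.attn(tokens, tokens, tokens)             # self-attention over pixels
        fused = self.norm(fused + tokens)                        # residual + norm
        return fused.transpose(1, 2).reshape(b, c, h, w)         # back to a feature map
```

在完整模型中,这类融合特征会送入去噪网络的后续子层,使语义与深度线索在每个去噪步骤中起到条件引导作用。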
结果
在Cityscapes数据集上训练本文提出的模型,并且将本文方法与先进方法进行比较,结果表明,本文方法可以实现轮廓细节更细致、距离远近更一致的驾驶场景图像翻译,在弗雷歇初始距离(Fréchet inception distance, FID) 和学习感知图像块相似度(learned perceptual image patch similarity, LPIPS)等指标上均取得了更好的结果,分别为44.20和0.377。
结论
本文方法能有效解决现有图像翻译方法中数据样本匮乏、翻译结果不精准以及模型训练不稳定等问题,提高驾驶场景的翻译精确度,为实现安全实用的自动驾驶提供理论支撑和数据基础。
Objective
Safety is the most important consideration for autonomous driving vehicles. New autonomous driving methods require extensive training and testing before they can be deployed on real vehicles. However, training and testing autonomous driving methods directly in real-world scenarios is costly and risky. Many researchers therefore first train and test their methods in simulated scenarios and then transfer the learned knowledge to real-world scenarios. However, the two kinds of scenarios differ considerably in scene modeling, lighting, and vehicle dynamics, so an autonomous driving model trained in simulated scenarios cannot be effectively generalized to real-world scenarios. With the development of deep learning technologies, image translation, which aims to transform the content of an image from one presentation form to another, has made considerable achievements in many fields, such as image beautification, style transfer, scene design, and video special effects. If image translation technology is applied to translating simulated driving scenarios into real ones, it can not only alleviate the poor generalization capability of autonomous driving models but also effectively reduce the cost and risk of training in real scenarios. Unfortunately, existing image translation methods applied to autonomous driving lack datasets of paired simulated and real scenarios, and most mainstream image translation methods are based on generative adversarial networks (GANs), which suffer from mode collapse and unstable training. The generated images also exhibit numerous detail problems, such as distorted object contours and unnatural small objects in the scene. These problems not only degrade the perception module of autonomous driving and, in turn, its decision-making, but also lower the evaluation metrics of image translation. In this paper, a multimodal conditional diffusion model based on the denoising diffusion probabilistic model (DDPM), which has achieved remarkable success in various image generation tasks, is proposed to address the problems of insufficient paired simulation-real data, mode collapse, unstable training, and inadequate diversity of generated data in existing image translation methods.
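For context, the DDPM mentioned above is defined by a fixed forward noising process and a learned reverse denoising process; the standard formulation (following Ho et al., 2020, in the usual notation rather than this paper's own symbols) is:

```latex
% Forward process: Gaussian noise is added over T steps with schedule \beta_t
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t\mathbf{I}\big), \qquad
q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big),
\quad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)

% Reverse process: a network predicts the noise and denoises step by step; in a conditional
% variant, extra inputs c (e.g., segmentation and depth maps) are fed to the network
p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t, c),\ \sigma_t^2\mathbf{I}\big)
```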
Method
First, an image translation method based on the diffusion model, which offers good training stability and generative diversity, is proposed to solve the mode collapse and unstable training problems of existing mainstream image translation methods based on GANs. Second, a multimodal feature fusion method based on a multihead self-attention mechanism is developed to address the limitation of traditional diffusion models, which cannot integrate prior information and therefore cannot control the image generation process. The proposed method sends the early-fused data to a convolutional layer to extract high-level features and then obtains the fused feature vectors through the multihead self-attention mechanism. Finally, because semantic segmentation maps and depth maps precisely represent contour and depth information, respectively, the conditional diffusion model (CDM) is designed by fusing the semantic segmentation and depth maps with the noise image before sending them to the denoising network. In this model, the semantic segmentation map, depth map, and noise image can perceive one another through the proposed multimodal feature fusion method, and the output fusion features are fed to the next sublayer in the network. After the iterative denoising process, the final output of the denoising network contains semantic and depth information; thus, the semantic segmentation and depth maps play a conditional guiding role in the diffusion model. Following the settings in the DDPM, the U-Net network is utilized as the denoising network. Compared with the U-Net in DDPM, the self-attention layer is modified to match the improved self-attention proposed in this paper so that the fusion features can be learned effectively. After the denoising network in the CDM is trained, the proposed model can be applied to the image translation of simulated-to-real scenarios. Noise is first added to the simulated images collected from the CARLA simulator, and the noisy images, together with the paired semantic segmentation and depth maps, are then sent to the denoising network to perform a step-by-step denoising process. Finally, real driving scene images are obtained, realizing image translation with highly precise contour details and consistent distance relationships between simulated and real images.
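A rough sketch of the simulated-to-real inference procedure described above is shown below (PyTorch; the `denoiser` interface, the noise schedule handling, and the partial-noising depth `t_start` are illustrative assumptions rather than the paper's exact implementation):

```python
import torch

@torch.no_grad()
def translate(sim_rgb, seg_map, depth_map, denoiser, betas, t_start):
    """Translate a simulated (e.g., CARLA) frame into a realistic one by noising it
    and then denoising step by step under segmentation/depth conditioning.
    `denoiser(x_t, t, seg, depth)` is assumed to predict the added noise."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    # 1) Add noise up to step t_start (keeps the coarse layout of the simulated frame).
    noise = torch.randn_like(sim_rgb)
    x = torch.sqrt(alpha_bar[t_start]) * sim_rgb + torch.sqrt(1 - alpha_bar[t_start]) * noise

    # 2) Denoise step by step, conditioned on the segmentation and depth maps.
    for t in range(t_start, -1, -1):
        t_batch = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)
        eps = denoiser(x, t_batch, seg_map, depth_map)          # predicted noise
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean
    return x  # translated, realistic-looking image
```

Because the coarse structure of the simulated frame survives the partial noising while fine textures are regenerated under segmentation and depth guidance, the output keeps object layout and distances consistent with the input.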
Result
The model is trained on the Cityscapes dataset and compared with state-of-the-art (SOTA) methods from recent years. Experimental results indicate that the proposed approach achieves superior translation results with improved semantic precision and richer contour details. The evaluation metrics are the Fréchet inception distance (FID) and the learned perceptual image patch similarity (LPIPS), which measure the similarity between the generated and original images and the difference among the generated images, respectively. A lower FID score represents better generation quality with a smaller gap between the generated and real image distributions, whereas a higher LPIPS value indicates better generation diversity. Compared with these SOTA methods, the proposed method achieves better results on both the FID and LPIPS metrics, with scores of 44.20 and 0.377, respectively.
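For clarity, the FID score reported above follows the standard definition: real and generated images are embedded with an Inception network, and the distance between the two Gaussian-fitted feature distributions is

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r\Sigma_g)^{1/2}\right)
```

where (μ_r, Σ_r) and (μ_g, Σ_g) are the feature means and covariances of the real and generated sets; LPIPS averages a learned perceptual distance over image pairs, so a larger value reflects greater diversity among the generated images.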
Conclusion
In this paper, a novel image-to-image translation method based on a conditional diffusion model, together with a multimodal fusion method built on a multihead attention mechanism, is proposed for autonomous driving scenarios. Experimental results show that the proposed method can effectively solve the problems of insufficient paired datasets, imprecise translation results, unstable training, and insufficient generation diversity in existing image translation methods. Thus, this method improves the image translation precision of driving scenarios and provides theoretical support and a data basis for realizing safe and practical autonomous driving systems.
虚拟到现实;图像翻译;扩散模型;多模态融合;驾驶场景
simulation to reality; image translation; diffusion model; multi-modal fusion; driving scenario
Abbasnejad I, Zambetta F, Salim F, Wiley T, Chan J, Gallagher R and Abbasnejad E. 2023. SCONE-GAN: semantic contrastive learning-based generative adversarial network for an end-to-end image translation//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Vancouver, Canada: IEEE: 1111-1120 [DOI: 10.1109/CVPRW59228.2023.00118]
Amit T, Shaharbany T, Nachmani E and Wolf L. 2022. SegDiff: image segmentation with diffusion probabilistic models [EB/OL]. [2023-11-23]. https://arxiv.org/pdf/2112.00390.pdf
Austin J, Johnson D D, Ho J, Tarlow D and van den Berg R. 2021. Structured denoising diffusion models in discrete state-spaces//Proceedings of the 35th International Conference on Neural Information Processing Systems. [s.l.]: Curran Associates Inc.: 17981-17993
Baranchuk D, Rubachev I, Voynov A, Khrulkov V and Babenko A. 2022. Label-efficient semantic segmentation with diffusion models [EB/OL]. [2023-11-23]. https://arxiv.org/pdf/2112.03126.pdf
Brempong E A, Kornblith S, Chen T, Parmar N, Minderer M and Norouzi M. 2022. Denoising pretraining for semantic segmentation//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 4174-4185 [DOI: 10.1109/CVPRW56347.2022.00462]
Brock A, Donahue J and Simonyan K. 2019. Large scale GAN training for high fidelity natural image synthesis//Proceedings of the 7th International Conference on Learning Representations. New Orleans, USA: [s.n.]: 1-35
Chen T D. 2006. The synthesis of non-photorealistic motion effects for cartoon//Proceedings of the 6th International Conference on Intelligent Systems Design and Applications. Ji’an, China: IEEE: 811-818 [DOI: 10.1109/ISDA.2006.253717]
Choi Y, Choi M, Kim M, Ha J W, Kim S and Choo J. 2018. StarGAN: unified generative adversarial networks for multi-domain image-to-image translation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8789-8797 [DOI: 10.1109/CVPR.2018.00916]
Choi Y, Uh Y, Yoo J and Ha J W. 2020. StarGAN v2: diverse image synthesis for multiple domains//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 8185-8194 [DOI: 10.1109/CVPR42600.2020.00821]
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S and Schiele B. 2016. The cityscapes dataset for semantic urban scene understanding//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 3213-3223 [DOI: 10.1109/CVPR.2016.350]
Dosovitskiy A, Ros G, Codevilla F, Lopez A and Koltun V. 2017. CARLA: an open urban driving simulator//Proceedings of the 1st Annual Conference on Robot Learning. Mountain View, USA: PMLR: 1-16
Fortune S. 1986. A sweepline algorithm for Voronoi diagrams//Proceedings of the 2nd Annual Symposium on Computational Geometry. Yorktown Heights, USA: ACM: 313-322 [DOI: 10.1145/10515.10549]
Gatys L A, Ecker A S and Bethge M. 2015. Texture synthesis using convolutional neural networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 262-270
Hertzmann A, Jacobs C E, Oliver N, Curless B and Salesin D H. 2001. Image analogies//Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. Los Angeles, USA: ACM: 327-340 [DOI: 10.1145/383259.383295]
Heusel M, Ramsauer H, Unterthiner T, Nessler B and Hochreiter S. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6629-6640
Ho J, Jain A and Abbeel P. 2020. Denoising diffusion probabilistic models//Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: 6840-6851 [DOI: 10.48550/arXiv.2006.11239]
Ho J, Saharia C, Chan W, Fleet D J, Norouzi M and Salimans T. 2022. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 23(1): 2249-2281
Hoff K E III, Culver T, Keyser J, Lin M and Manocha D. 2000. Fast computation of generalized Voronoi diagrams using graphics hardware//Proceedings of the 16th Annual Symposium on Computational Geometry. Hong Kong, China: ACM: 375-376 [DOI: 10.1145/336154.336226]
Hoffman J, Tzeng E, Park T, Zhu J Y, Isola P, Saenko K, Efros A A and Darrell T. 2018. CyCADA: cycle-consistent adversarial domain adaptation//Proceedings of the 35th International Conference on Machine Learning. Stockholm, Sweden: PMLR: 1989-1998
Huang X, Liu M Y, Belongie S and Kautz J. 2018. Multimodal unsupervised image-to-image translation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 179-196 [DOI: 10.1007/978-3-030-01219-9_11]
Isola P, Zhu J Y, Zhou T H and Efros A A. 2017. Image-to-image translation with conditional adversarial networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5967-5976 [DOI: 10.1109/CVPR.2017.632]
Kong Z F, Ping W, Huang J J, Zhao K X and Catanzaro B. 2021. DiffWave: a versatile diffusion model for audio synthesis//Proceedings of the 9th International Conference on Learning Representations. Virtual: IEEE: 1-17
Lee H Y, Tseng H Y, Mao Q, Huang J B, Lu Y D, Singh M and Yang M H. 2020. DRIT++: diverse image-to-image translation via disentangled representations. International Journal of Computer Vision, 128(10/11): 2402-2417 [DOI: 10.1007/s11263-019-01284-z]
Li X Y, Ye Z H, Wei S K, Chen Z, Chen X T, Tian Y H, Dang J W, Fu S J and Zhao Y. 2023. 3D object detection for autonomous driving from image: a survey——benchmarks, constraints and error analysis. Journal of Image and Graphics, 28(6): 1709-1740
李熙莹, 叶芝桧, 韦世奎, 陈泽, 陈小彤, 田永鸿, 党建武, 付树军, 赵耀. 2023. 基于图像的自动驾驶3D目标检测综述——基准、制约因素和误差分析. 中国图象图形学报, 28(6): 1709-1740 [DOI: 10.11834/jig.230036]
Li Y J, Fang C, Yang J M, Wang Z W, Lu X and Yang M H. 2017. Universal style transfer via feature transforms//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 385-395
Liu X H, Yin G J, Shao J, Wang X G and Li H S. 2019. Learning to predict layout-to-image conditional convolutions for semantic image synthesis//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: 570-580
Liu Y F, Hu X M, Chen G W, Liu S H and Chen L. 2021. Review of end-to-end motion planning for autonomous driving with visual perception. Journal of Image and Graphics, 26(1): 49-66
刘旖菲, 胡学敏, 陈国文, 刘士豪, 陈龙. 2021. 视觉感知的端到端自动驾驶运动规划综述. 中国图象图形学报, 26(1): 49-66 [DOI: 10.11834/jig.200276]
Luan F J, Paris S, Shechtman E and Bala K. 2017. Deep photo style transfer//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6997-7005 [DOI: 10.1109/CVPR.2017.740]
Lugmayr A, Danelljan M, Romero A, Yu F, Timofte R and van Gool L. 2022. RePaint: inpainting using denoising diffusion probabilistic models//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 11451-11461 [DOI: 10.1109/CVPR52688.2022.01117]
Mao Q, Lee H Y, Tseng H Y, Ma S W and Yang M H. 2019. Mode seeking generative adversarial networks for diverse image synthesis//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 1429-1437 [DOI: 10.1109/CVPR.2019.00152]
Meng C L, He Y T, Song Y, Song J M, Wu J J, Zhu J Y and Ermon S. 2022. SDEdit: guided image synthesis and editing with stochastic differential equations [EB/OL]. [2023-11-23]. https://arxiv.org/pdf/2108.01073.pdf
Nichol A Q and Dhariwal P. 2021. Improved denoising diffusion probabilistic models//Proceedings of the 38th International Conference on Machine Learning. Virtual: PMLR: 8162-8171
Park T, Efros A A, Zhang R and Zhu J Y. 2020. Contrastive learning for unpaired image-to-image translation//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 319-345 [DOI: 10.1007/978-3-030-58545-7_19]
Prakash A, Chitta K and Geiger A. 2021. Multi-modal fusion transformer for end-to-end autonomous driving//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 7073-7083 [DOI: 10.1109/CVPR46437.2021.00700]
Ranftl R, Lasinger K, Hafner D, Schindler K and Koltun V. 2022. Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3): 1623-1637 [DOI: 10.1109/TPAMI.2020.3019967]
Richardson E, Alaluf Y, Patashnik O, Nitzan Y, Azar Y, Shapiro S and Cohen-Or D. 2021. Encoding in style: a StyleGAN encoder for image-to-image translation//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 2287-2296 [DOI: 10.1109/CVPR46437.2021.00232]
Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer: 234-241 [DOI: 10.1007/978-3-319-24574-4_28]
Saharia C, Chan W, Chang H W, Lee C, Ho J, Salimans T, Fleet D and Norouzi M. 2022. Palette: image-to-image diffusion models//Proceedings of the 2022 ACM SIGGRAPH Conference. Vancouver, Canada: ACM: #15 [DOI: 10.1145/3528233.3530757]
Saharia C, Ho J, Chan W, Salimans T, Fleet D J and Norouzi M. 2023. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4): 4713-4726 [DOI: 10.1109/TPAMI.2022.3204461]
Schönfeld E, Sushko V, Zhang D, Gall J, Schiele B and Khoreva A. 2021. You only need adversarial supervision for semantic image synthesis//Proceedings of the 9th International Conference on Learning Representations. Virtual: IEEE: 1-32
Secord A. 2002. Weighted Voronoi stippling//The 2nd International Symposium on Non-photorealistic Animation and Rendering. Annecy, France: ACM: 37-43 [DOI: 10.1145/508530.508537]
Sohl-Dickstein J, Weiss E W, Maheswaranathan N and Ganguli S. 2015. Deep unsupervised learning using nonequilibrium thermodynamics//Proceedings of the 32nd International Conference on Machine Learning. Lille, France: JMLR.org: 2256-2265
Song Y, Sohl-Dickstein J, Kingma D P, Kumar A, Ermon S and Poole B. 2020. Score-based generative modeling through stochastic differential equations//Proceedings of the 9th International Conference on Learning Representations. Virtual: IEEE: 1-36
Su X, Song J M, Meng C L and Ermon S. 2023. Dual diffusion implicit bridges for image-to-image translation//Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: IEEE: 1-18
Suvorov R, Logacheva E, Mashikhin A, Remizova A, Ashukha A, Silvestrov A, Kong N, Goka H, Park K and Lempitsky V. 2022. Resolution-robust large mask inpainting with Fourier convolutions//Proceedings of 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 3172-3182 [DOI: 10.1109/WACV51458.2022.00323]
Tan Z T, Chai M L, Chen D D, Liao J, Chu Q, Liu B, Hua G and Yu N H. 2021. Diverse semantic image synthesis via probability distribution modeling//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 7958-7967 [DOI: 10.1109/CVPR46437.2021.00787]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wang T C, Liu M Y, Zhu J Y, Tao A, Kautz J and Catanzaro B. 2018. High-resolution image synthesis and semantic manipulation with conditional GANs//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8798-8807 [DOI: 10.1109/CVPR.2018.00917]
Yu F, Koltun V and Funkhouser T. 2017. Dilated residual networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 636-644 [DOI: 10.1109/CVPR.2017.75]
Zhang H Y, Wang X Y and Peng X W. 2023. Foreground segmentation-relevant multi-feature fusion person re-identification. Journal of Image and Graphics, 28(5): 1360-1371
张红颖, 王徐泳, 彭晓雯. 2023. 结合前景分割的多特征融合行人重识别. 中国图象图形学报, 28(5): 1360-1371 [DOI: 10.11834/jig.220683]
Zhang R, Isola P, Efros A A, Shechtman E and Wang O. 2018. The unreasonable effectiveness of deep features as a perceptual metric//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 586-595 [DOI: 10.1109/CVPR.2018.00068]
Zhu J Y, Park T, Isola P and Efros A A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2242-2251 [DOI: 10.1109/ICCV.2017.244]
Zhu Z, Xu Z L, You A S and Bai X. 2020. Semantically multi-modal image synthesis//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 5466-5475 [DOI: 10.1109/CVPR42600.2020.00551]