Research on diffusion model generated video datasets and detection benchmarks
2024, Pages: 1-13
Online publication date: 2024-10-22
DOI: 10.11834/jig.240259
Zheng Tianpeng, Chen Yanxiang, Wen Xinzhe, et al. Research on diffusion model generated video datasets and detection benchmarks[J]. Journal of Image and Graphics,
Objective
Diffusion models have achieved remarkable success in video generation. The diffusion models currently used for video generation are simple and easy to use, which also makes such videos easy to misuse. At present, video forensics datasets focus mostly on face forgery and lack coverage of general scenes, which limits research on generated video detection. With the development of video diffusion models, general-scene videos can now be generated, but existing generated video datasets cover only a single type of generation, contain little data, and some do not include real videos, so they are unsuitable for the generated video detection task. To address these problems, this paper proposes a multi-type, large-scale generated video dataset and detection benchmark covering both text-to-video (T2V) and image-to-video (I2V) generation.
Method
Existing text-to-video and image-to-video diffusion generation methods are used to produce diverse, large-scale generated video data, which are combined with real videos collected from the Internet to form the final dataset. For T2V generation, prompt texts covering 15 categories are used to generate scene-rich T2V videos; for I2V generation, a downloaded high-quality image dataset is used to generate high-quality I2V videos. To assess the quality of the generated videos in the dataset, state-of-the-art generated video evaluation methods are applied, and video detection methods are used to perform generated video detection.
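The generation step can be illustrated with a minimal Python sketch, assuming the Hugging Face diffusers library and two representative public checkpoints (ModelScope text-to-video for T2V and Stable Video Diffusion for I2V); the exact models, prompts, and sampling settings used to build DGVD may differ, and the diffusers API changes between versions.

import torch
from diffusers import DiffusionPipeline, StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

# T2V: generate a clip from one of the prompt texts.
t2v_pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
t2v_frames = t2v_pipe("a dog running on a beach at sunset", num_frames=16).frames[0]
export_to_video(t2v_frames, "t2v_sample.mp4", fps=8)

# I2V: animate a still image from the high-quality image set.
i2v_pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
image = load_image("source_image.jpg").resize((1024, 576))  # placeholder image path
i2v_frames = i2v_pipe(image, decode_chunk_size=8).frames[0]
export_to_video(i2v_frames, "i2v_sample.mp4", fps=7)

Real videos collected from the web are then mixed with such generated clips to form the labeled dataset.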
Result
A general-scene generated video dataset containing both T2V and I2V videos, the diffusion generated video dataset (DGVD), was created. By combining the state-of-the-art generated video evaluation methods EvalCrafter and AIGCBench, a quality estimation scheme covering both T2V and I2V videos is proposed. The detection benchmark uses 4 image-level detection methods, CNNdet (CNN Detection), DIRE (DIffusion Reconstruction Error), WDFC (Wavelet Domain Forgery Clues) and DIF (Deep Image Fingerprint), and 6 video-level detection methods, I3D (Inflated 3D), X3D (Expand 3D), C2D, Slow, SlowFast and MViT (Multiscale Vision Transformer). The image-level methods fail to detect unseen data effectively and generalize poorly, whereas the video-level methods perform well on videos generated by methods built on the same backbone network and show some generalization ability, but still cannot reach good metrics on the other networks.
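As an illustration of the kind of per-video measurement that the combined EvalCrafter and AIGCBench scheme relies on, the sketch below computes one representative metric, a frame-wise CLIP text-video alignment score; it is a simplified stand-in for the full metric suite, and the CLIP checkpoint name is an assumption.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_text_video_score(frames: list[Image.Image], prompt: str) -> float:
    # Average cosine similarity between the prompt and each decoded frame.
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()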
Conclusion
This paper builds a large-scale generated video dataset with rich generation categories and diverse scenes. The dataset and benchmark address the shortage of datasets and benchmarks for generated video detection in such scenarios and help advance the field of generated video detection. Code is available at:
https://github.com/ZenT4n/DVGD
Objective
Diffusion video models have shown remarkable success in video generation; with systems such as OpenAI's Sora, a single text prompt or image is enough to generate a video. However, this convenience also raises concerns about the abuse of generated videos for deceptive purposes. Existing detection techniques primarily target face videos, and there is a noticeable lack of datasets dedicated to detecting forged general-scene videos generated by diffusion models. Moreover, many existing datasets suffer from limitations such as a single conditional modality and insufficient data volume. To address these challenges, we propose a multi-conditional generated video dataset and a corresponding detection benchmark. Detectors trained on videos from a single conditional modality, such as text or image, are limited in their ability to detect a wide range of generated videos. For instance, an algorithm trained solely on videos generated by text-to-video (T2V) models may fail to identify videos generated by image-to-video (I2V) models. By introducing multi-conditional generated videos, we aim to provide a more comprehensive and robust dataset that encompasses both T2V and I2V generated videos. The dataset is built by collecting diverse videos generated under multiple conditions together with real videos downloaded from the Internet. Each generation method produces a substantial number of videos that can be used to train detection models. The diverse conditions and the large number of generated videos ensure that the dataset captures the broad characteristics of diffusion generated videos, thereby improving the effectiveness of generated video detection models trained on it.
Method
A generated video dataset provides training data for detection, allowing a detector to recognize whether a video is AI-generated. Our dataset uses existing advanced diffusion video models to generate a large number of videos, including videos from T2V models and videos from I2V models. One key to generating high-quality videos is the prompt text. We use ChatGPT to generate the prompt texts: to obtain general prompts, we define 15 entity words, such as dog and cat, and combine them with a template as the input to ChatGPT. In this way, we obtain 231 prompt texts and use them to generate the T2V videos. Unlike a T2V model, an I2V model takes an image as the input condition and animates its content. Using these existing advanced T2V and I2V methods for video generation, combined with real videos obtained from the web, we build the final dataset. The generation quality of the videos is evaluated with existing advanced generated video evaluation methods; we combine the metrics of EvalCrafter and AIGCBench to evaluate the generated videos. For the detection module, we use advanced detection methods, comprising 4 image-level detectors and 6 video-level detectors, to evaluate the performance of existing detectors on our dataset in different experiments.
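A minimal sketch of this prompt-generation step is shown below; the entity list beyond "dog" and "cat", the template wording, and the chat model name are illustrative assumptions rather than the exact settings used for DGVD.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
ENTITIES = ["dog", "cat"]  # the paper uses 15 entity categories in total
TEMPLATE = ("Write one short, concrete text-to-video prompt describing a realistic "
            "scene whose main subject is a {entity}.")

def make_prompts(entity: str, n: int = 3) -> list:
    # Ask the chat model for n candidate prompts about one entity word.
    prompts = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": TEMPLATE.format(entity=entity)}],
        )
        prompts.append(resp.choices[0].message.content.strip())
    return prompts

all_prompts = [p for e in ENTITIES for p in make_prompts(e)]

Each resulting prompt is then fed to the T2V models to produce one or more clips.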
Result
We introduce a general-scene generated video dataset, the diffusion generated video dataset (DGVD), and a detection benchmark constructed from multiple generated video detection methods. A generated video quality estimation scheme covering T2V and I2V videos is proposed by combining the current state-of-the-art generated video evaluation methods EvalCrafter and AIGCBench. Generated video detection experiments were conducted with 4 image-level detectors, CNNdet (CNN Detection), DIRE (DIffusion Reconstruction Error), WDFC (Wavelet Domain Forgery Clues) and DIF (Deep Image Fingerprint), and 6 video-level detectors, I3D (Inflated 3D), X3D (Expand 3D), C2D, Slow, SlowFast and MViT (Multiscale Vision Transformer). We design two experiments: within-class detection and cross-class detection. For within-class detection, we train the detectors on the T2V training set and evaluate them on the T2V test set. For cross-class detection, we train the detectors on T2V data in the same way but evaluate them on the I2V test set. The experimental results show that image-level detection methods cannot effectively detect unseen data and generalize poorly. Video-level detection methods perform better on videos generated by methods built on the same backbone network, but still fail to generalize well to the other classes. These results indicate that existing detectors cannot identify the majority of videos generated by diffusion video models.
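The two protocols can be sketched as follows, using pytorchvideo's slow_r50 loaded through torch.hub as a stand-in for the Slow detector; the data pipeline, hyperparameters, and the assumed location of the classification head (blocks[-1].proj) are simplifications rather than the benchmark's exact configuration.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def build_detector():
    # Load a video backbone and replace its final projection with a 2-way (real/generated) head.
    model = torch.hub.load("facebookresearch/pytorchvideo", "slow_r50", pretrained=True)
    model.blocks[-1].proj = nn.Linear(model.blocks[-1].proj.in_features, 2)
    return model

def run_epoch(model, loader, optimizer=None):
    # One pass over (clip, label) pairs; clips are [B, 3, T, H, W] tensors.
    criterion = nn.CrossEntropyLoss()
    training = optimizer is not None
    model.train(training)
    correct = total = 0
    with torch.set_grad_enabled(training):
        for clips, labels in loader:
            logits = model(clips)
            if training:
                optimizer.zero_grad()
                criterion(logits, labels).backward()
                optimizer.step()
            correct += (logits.argmax(1) == labels).sum().item()
            total += labels.numel()
    return correct / max(total, 1)

def benchmark(t2v_train, t2v_test, i2v_test, epochs=3):
    # Train on T2V only, then report within-class (T2V) and cross-class (I2V) accuracy.
    model = build_detector()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(epochs):
        run_epoch(model, DataLoader(t2v_train, batch_size=8, shuffle=True), optimizer)
    acc_within = run_epoch(model, DataLoader(t2v_test, batch_size=8))
    acc_cross = run_epoch(model, DataLoader(i2v_test, batch_size=8))
    return acc_within, acc_cross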
Conclusion
In this paper, we introduce a novel dataset, the diffusion generated video dataset (DGVD), designed to cover a diverse array of categories and generation scenarios and to address the need for advances in generated video detection. By providing a comprehensive dataset and benchmark, we offer a more challenging environment for training and evaluating detection models. The dataset and benchmark not only highlight the current gaps in generated video detection but also support further progress in the field. We hope they enable significant strides towards enhancing the robustness and effectiveness of generated video detection systems, ultimately driving innovation and advancement in the field. Code is available at:
https://github.com/ZenT4n/DVGD
video generation; diffusion model; generated video detection; prompt text generation; video quality evaluation
Blattmann A, Dockhorn T, Kulal S, Mendelevitch D, Kilian M, Lorenz D, Levi Y, English Z, Voleti V, Letts A, Jampani V and Rombach R. 2023. Stable video diffusion: Scaling latent video diffusion models to large datasets[EB/OL]. [2023-11-25]. https://arxiv.org/pdf/2311.15127.pdf
Brooks T, Peebles B, Holmes C, DePue W, Guo Y, Jing L, Schnurr D, Taylor J, Luhman T, Luhman E, Ng C, Wang R and Ramesh A. 2024. Video generation models as world simulators[EB/OL]. [2024-2-15]. https://openai.com/research/video-generation-models-as-world-simulators
Carreira J and Zisserman A. 2017. Quo vadis, action recognition? A new model and the kinetics dataset//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA: IEEE: 6299–6308 [DOI: 10.1109/CVPR.2017.502]
Chen H X, Xia M H, He Y Q, Zhang Y, Cun X D, Yang S S, Xing J B, Liu Y F, Chen Q F, Wang X T, Chao W and Shan Y. 2023. VideoCrafter1: Open diffusion models for high-quality video generation[EB/OL]. [2023-10-30]. https://arxiv.org/pdf/2310.19512.pdf
Deng Y F, Deng X, Duan Y P and Xu M. 2023. Diffusion-Generated Fake Face Detection by Exploring Wavelet Domain Forgery Clues//2023 International Conference on Wireless Communications and Signal Processing (WCSP), Hangzhou, China: IEEE: 1–6 [DOI: 10.1109/WCSP58612.2023.10404721]
Dolhansky B, Bitton J, Pflaum B, Lu J, Howes R, Wang M L and Ferrer C C. 2020. The DeepFake detection challenge (DFDC) dataset[EB/OL]. [2020-10-28]. https://arxiv.org/pdf/2006.07397.pdf
Fan F, Luo C J, Gao W L and Zhan J F. 2024. AIGCBench: Comprehensive evaluation of image-to-video content generated by AI. BenchCouncil Transactions on Benchmarks, Standards and Evaluations, 3(4): 100152 [DOI: 10.1016/j.tbench.2024.100152]
Fan H, Xiong B, Mangalam K, Li Y H, Yan Z C, Malik J and Feichtenhofer C. 2021. Multiscale vision transformers//Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, Canada: IEEE: 6824–6835 [DOI: 10.1109/ICCV48922.2021.00675]
Feichtenhofer C. 2020. X3D: Expanding architectures for efficient video recognition//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA: IEEE: 203–213 [DOI: 10.1109/CVPR42600.2020.00028]
Feichtenhofer C, Fan H Q, Malik J and He K M. 2019. SlowFast Networks for Video Recognition//2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South): IEEE: 6201–6210 [DOI: 10.1109/ICCV.2019.00630]
Guo Y W, Yang C Y, Rao A Y, Liang Z Y, Wang Y H, Qiao Y, Agrawala M, Lin D and Dai B. 2024. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning[EB/OL]. [2024-2-8]. https://arxiv.org/pdf/2307.04725.pdf
Ho J, Salimans T, Gritsenko A, Chan W L, Norouzi M and Fleet D J. 2022. Video diffusion models//Advances in Neural Information Processing Systems, New Orleans, USA: Curran Associates Inc.: 8633–8646
Li W, Huang T Q, Huang L Q, Zheng A K and Xu C. 2024. Large-scale datasets for facial tampering detection with inpainting techniques. Journal of Image and Graphics, 29(7): 1834-1848 [DOI: 10.11834/jig.230422]
Li Y Z, Qi H G, Yang X, Sun P and Lyu S. 2020. Celeb-DF: A Large-scale Challenging Dataset for DeepFake Forensics//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 3204-3213 [DOI: 10.1109/CVPR42600.2020.00327]
Liu A, Su Y T, Wang L J, Li B, Qian Z X, Zhang W M, Zhou L N, Zhang X P, Zhang Y D, Huang J W and Yu N H. 2024. Review on the progress of the AIGC visual content generation and traceability. Journal of Image and Graphics, 29(6): 1535-1554 [DOI: 10.11834/jig.240003]
Liu H T, Li C Y, Wu Q Y and Lee Y J. 2023. Visual instruction tuning//Advances in Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 34892–34916
Liu Y F, Cun X D, Liu X B, Wang X T, Zhang Y, Chen H X, Liu Y, Zeng T Y, Chan R and Shan Y. 2024. EvalCrafter: Benchmarking and Evaluating Large Video Generation Models[EB/OL]. [2024-3-23]. https://arxiv.org/pdf/2310.11440.pdf
Liu Y, Li L, Ren S, Gao R, Li S, Chen S, Sun X and Hou L. 2023. FETV: A benchmark for fine-grained evaluation of open-domain text-to-video generation//Advances in Neural Information Processing Systems, New Orleans, USA: Curran Associates Inc.: 62352–62387
Ma L, Zhang J J, Deng H P, Zhang N Y, Liao Y and Yu H Y. 2024. DeCoF: Generated Video Detection via Frame Consistency[EB/OL]. [2024-2-6]. https://arxiv.org/pdf/2402.02085.pdf
OpenAI. 2024. GPT-4 technical report[EB/OL]. [2024-3-4]. https://arxiv.org/pdf/2303.08774.pdf
Peebles W and Xie S N. 2023. Scalable diffusion models with transformers//Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France: IEEE: 4195–4205 [DOI: 10.1109/ICCV51070.2023.00387]
Rombach R, Blattmann A, Lorenz D, Esser P and Ommer B. 2022. High-Resolution Image Synthesis with Latent Diffusion Models//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA: IEEE: 10674–10685 [DOI: 10.1109/CVPR52688.2022.01042]
Rössler A, Cozzolino D, Verdoliva L, Riess C, Thies J and Nießner M. 2019. FaceForensics++: Learning to Detect Manipulated Facial Images//Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 1-11 [DOI: 10.1109/ICCV.2019.00009]
Sinitsa S and Fried O. 2024. Deep Image Fingerprint: Towards Low Budget Synthetic Image Detection and Model Lineage Analysis//2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA: IEEE: 4055–4064 [DOI: 10.1109/WACV57701.2024.00402]
Skorokhodov I, Sotnikov G and Elhoseiny M. 2021. Aligning Latent and Image Spaces to Connect the Unconnectable//2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada: IEEE: 14124–14133 [DOI: 10.1109/ICCV48922.2021.01388]
Tang Z, Yang Z Y, Zhu C G, Zeng M and Bansal M. 2023. Any-to-Any Generation via Composable Diffusion//Thirty-seventh Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 16083–16099
Wang J N, Yuan H J, Chen D Y, Zhang Y Y, Wang X and Zhang S W. 2023. ModelScope text-to-video technical report[EB/OL]. [2023-8-12]. https://arxiv.org/pdf/2308.06571.pdf
Wang S Y, Wang O, Zhang R, Owens A and Efros A A. 2020. CNN-Generated Images Are Surprisingly Easy to Spot… for Now//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA: IEEE: 8692–8701 [DOI: 10.1109/CVPR42600.2020.00872]
Wang X L, Girshick R, Gupta A and He K M. 2018. Non-local Neural Networks//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA: IEEE: 7794–7803 [DOI: 10.1109/CVPR.2018.00813]
Wang Z D, Bao J M, Zhou W G, Wang W L, Hu H Z, Chen H and Li H Q. 2023. DIRE for Diffusion-Generated Image Detection//2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France: IEEE: 22388–22398 [DOI: 10.1109/ICCV51070.2023.02051]
Zhang S W, Wang J Y, Zhang Y Y, Zhao K, Yuan H J, Qing Z W, Wang X, Zhao D L and Zhou J R. 2023. I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models[EB/OL]. [2023-11-7]. https://arxiv.org/pdf/2311.04145.pdf