HK-DETR: An Improved Knife-Holding Dangerous Behavior Detection Algorithm Based on RT-DETR
2024, pp. 1-15
Online publication date: 2024-09-18
DOI: 10.11834/jig.240295
Jin Tao, Hu Peiyu. HK-DETR: An Improved Knife-Holding Dangerous Behavior Detection Algorithm Based on RT-DETR [J]. Journal of Image and Graphics, 2024: 1-15
Objective
When analyzing video data captured by public-security network cameras, automatic detection of pedestrians' dangerous knife-holding behavior suffers from low detection accuracy and high false-detection rates, caused by the diversity of knife shapes and sizes as well as occlusion and overlap among multiple targets. To address these problems, this paper proposes HK-DETR (human-knife detection Transformer), a knife-holding dangerous behavior detection algorithm that improves on the real-time detection Transformer (RT-DETR).
Method
First, an inverted residual cascade block (IRCB) is designed as the BasicBlock of the backbone network, making the network more lightweight, reducing computational redundancy, and improving its ability to model global features and long-range dependencies. Second, a cross stage partial-parallel multi-atrous convolution (CSP-PMAC) structure is proposed, which focuses on multi-scale feature extraction so that the model can effectively recognize knives of different sizes and angles. Finally, a Haar wavelet-based downsampling (HWD) module replaces the downsampling operations of the original model, providing richer information for multi-scale feature fusion. In addition, the minimum point distance based intersection over union (MPDIoU) loss function is adopted to further improve detection performance.
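To make the Haar wavelet downsampling idea concrete, the sketch below applies a single-level 2D Haar transform to a feature map with NumPy. It is a generic illustration of the technique, not the authors' HWD module: spatial resolution halves, while the approximation and three detail sub-bands are stacked along the channel axis, so fine detail that plain strided pooling would discard survives as extra channels (in the HWD module proper, a convolution would follow to mix these channels).

```python
import numpy as np

def haar_downsample(x: np.ndarray) -> np.ndarray:
    """Single-level 2D Haar transform of a (C, H, W) feature map.

    Returns a (4*C, H//2, W//2) array: per input channel, the
    low-low (LL) approximation plus LH, HL, HH detail sub-bands.
    H and W must be even.
    """
    c, h, w = x.shape
    assert h % 2 == 0 and w % 2 == 0, "spatial dims must be even"
    # The four corner samples of every 2x2 block.
    a = x[:, 0::2, 0::2]  # top-left
    b = x[:, 0::2, 1::2]  # top-right
    d = x[:, 1::2, 0::2]  # bottom-left
    e = x[:, 1::2, 1::2]  # bottom-right
    ll = (a + b + d + e) / 2.0   # approximation (orthonormal Haar)
    lh = (a - b + d - e) / 2.0   # horizontal detail
    hl = (a + b - d - e) / 2.0   # vertical detail
    hh = (a - b - d + e) / 2.0   # diagonal detail
    return np.concatenate([ll, lh, hl, hh], axis=0)

feat = np.arange(32, dtype=np.float64).reshape(2, 4, 4)
out = haar_downsample(feat)
print(out.shape)  # (8, 2, 2): half the resolution, four times the channels
```

Because the orthonormal Haar basis is energy-preserving, no information is lost at the downsampling step itself; what the network keeps from the detail bands is learned by the layers that follow.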
Result
Comparative experiments show that, relative to the original RT-DETR algorithm, the improved model reduces the number of network parameters by 25% while raising precision, recall, and mean average precision (mAP) by 2.3%, 5.5%, and 5.2%, respectively. Compared with YOLOv5m, YOLOv8m, and Gold-YOLO-s, it achieves further mAP gains of 6.3%, 5.2%, and 1.8%, respectively, with fewer network parameters.
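For readers unfamiliar with the metric, mAP is the mean over classes of the average precision (AP), the area under the precision-recall curve. A minimal all-point-interpolation AP computation (one common convention; the paper does not state which variant it uses) might look like:

```python
def average_precision(recalls, precisions):
    """All-point interpolated AP: area under the precision-recall
    curve after making precision monotonically non-increasing."""
    # Pad the curve with sentinel endpoints.
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Sweep right to left so precision never increases with recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas wherever recall advances.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

# A perfect detector holds precision 1.0 at every recall level.
print(average_precision([0.5, 1.0], [1.0, 1.0]))  # 1.0
```

mAP is then simply the mean of this quantity over all object classes (here, e.g., person and knife).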
Conclusion
The proposed HK-DETR algorithm adapts effectively to knife-holding dangerous behavior detection under the varied, complex conditions captured by network cameras, and shows clear performance advantages over the other models in the comparison.
Objective
In contemporary society, public safety concerns have garnered increasing attention, particularly in crowded venues such as subway stations, railway terminals, and commercial centers, where timely and accurate detection of and response to potentially threatening behaviors are paramount for maintaining societal stability. The extensive deployment of network cameras by public security systems serves as a vital surveillance tool, capturing and recording vast amounts of video data in real time and providing a rich source of information for security analysis. Nevertheless, a pivotal challenge arises when analyzing these video data: the automated detection of pedestrians engaging in dangerous knife-wielding behaviors. The complexity of this task stems primarily from the diversity in knife shapes and sizes, ranging from conventional elongated knives to folding knives and daggers, each exhibiting distinct visual representations in images and posing significant challenges for detection algorithms. Furthermore, occlusions are common in real-world surveillance scenarios, including body occlusions between pedestrians and obstructions by trees or buildings, and can leave target feature information incomplete, thereby compromising detection performance. Additionally, multi-object occlusion, prevalent in densely populated areas where multiple pedestrians or objects overlap in images, exacerbates the difficulty of accurately distinguishing and localizing knife-wielding individuals. To address these issues and enhance the precision and efficiency of detecting dangerous knife-wielding behaviors, this paper proposes an algorithm named human-knife detection Transformer (HK-DETR), an improvement upon the real-time detection Transformer (RT-DETR). Building upon the inherent strengths of RT-DETR, HK-DETR incorporates optimizations and innovations tailored specifically to the characteristics of knife-detection tasks.
Method
First, we designed the inverted residual cascade block (IRCB) as a fundamental building block (BasicBlock) within the backbone network. This design yields a lightweight architecture, reduces redundant computation, and, by optimizing the processing flow of feature maps, substantially enhances the backbone network's ability to capture and distinguish diverse features, laying a solid foundation for the subsequent knife detection task. Next, we propose the cross stage partial-parallel multi-atrous convolution (CSP-PMAC) module, a feature fusion strategy that directs the network to capture and integrate multi-scale feature information during the fusion stage, which is pivotal for identifying knives of varying shapes and angles. This design gives the model the adaptability to accurately identify both small and large knives, improving its performance in complex scenarios. To further optimize the model, we adopt the Haar wavelet-based downsampling (HWD) module in place of the network's traditional downsampling mechanism. Through its hierarchical wavelet decomposition, the HWD module reduces data dimensionality while retaining richer details of object scale variations, enriching feature representations for subsequent multi-scale feature fusion and enhancing the model's robustness to scale changes. Finally, to comprehensively enhance detection accuracy, we adopt the minimum point distance based intersection over union (MPDIoU) loss function.
This improved loss function optimizes object localization accuracy by more precisely measuring the overlap between predicted bounding boxes and actual target boxes. It not only considers classification accuracy but also intensifies the pursuit of localization precision, enabling the model to maintain superior detection performance even in the presence of dense or overlapping targets.
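Following the MPDIoU formulation of Ma and Xu (2023), the metric subtracts from the IoU the squared distances between the two boxes' top-left and bottom-right corners, normalized by the image size; the regression loss is then 1 − MPDIoU. A plain-Python sketch of the metric (function and variable names are ours):

```python
def mpdiou(box_pred, box_gt, img_w, img_h):
    """MPDIoU for axis-aligned boxes given as (x1, y1, x2, y2).

    MPDIoU = IoU - d1^2/(w^2 + h^2) - d2^2/(w^2 + h^2), where d1, d2
    are the distances between the two boxes' top-left and
    bottom-right corners and (w, h) is the image size.
    """
    px1, py1, px2, py2 = box_pred
    gx1, gy1, gx2, gy2 = box_gt
    # Intersection and union areas.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = ((px2 - px1) * (py2 - py1)
             + (gx2 - gx1) * (gy2 - gy1) - inter)
    iou = inter / union if union > 0 else 0.0
    # Squared corner-point distances, normalized by the image diagonal^2.
    norm = img_w ** 2 + img_h ** 2
    d1 = (px1 - gx1) ** 2 + (py1 - gy1) ** 2
    d2 = (px2 - gx2) ** 2 + (py2 - gy2) ** 2
    return iou - d1 / norm - d2 / norm

# Identical boxes: IoU = 1 and both corner distances are zero,
# so MPDIoU = 1 and the loss 1 - MPDIoU = 0.
print(mpdiou((10, 10, 50, 50), (10, 10, 50, 50), 640, 640))  # 1.0
```

Unlike plain IoU, the corner-distance terms stay informative even when the boxes do not overlap, which is what drives better localization for dense or overlapping targets.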
Result
Ablation experiments were conducted on the pedestrian knife-carrying dataset; they revealed that each improvement strategy, applied individually, improved the original RT-DETR model to some degree, although missed detections and low-confidence predictions persisted in some cases. When the improvement strategies were combined, detection performance rose significantly. To validate the effectiveness of the proposed model, comparative experiments were performed on the same dataset. The results demonstrate that, compared with the original RT-DETR algorithm, the refined model reduces network parameters by 25% while improving precision, recall, and mean average precision (mAP) by 2.3%, 5.5%, and 5.2%, respectively. Benchmarked against YOLOv5m, YOLOv8m, and Gold-YOLO-s, the refined model, with fewer network parameters, delivers notable mAP gains of 6.3%, 5.2%, and 1.8%, respectively.
Conclusion
The HK-DETR algorithm proposed in this paper exhibits clear performance advantages in automatically detecting dangerous knife-carrying behaviors of pedestrians in video data captured by public security network cameras. It effectively addresses the challenges posed by the diversity of knife shapes and sizes, occlusion, and multi-target overlap in complex scenes, while significantly improving detection precision, recall, and mAP through a series of innovative designs. Compared with the original RT-DETR algorithm and other mainstream detection models such as YOLOv5m, YOLOv8m, and Gold-YOLO-s, HK-DETR achieves notable performance improvements. This result underscores the algorithm's ability to maintain high efficiency and accuracy in diverse and complex environments, offering robust technical support for public security surveillance. Within the realm of public safety, HK-DETR holds strong potential for adoption in the surveillance systems of public places such as railway stations, airports, subway stations, and shopping malls, enabling real-time detection and early warning of potential knife-related dangers and thereby providing timely, effective information support for law enforcement agencies. Moreover, as the technology matures, the HK-DETR algorithm could extend to other domains, such as intelligent transportation and industrial automation, offering solutions to a range of practical problems.
knife-holding behavior detection; RT-DETR (real-time detection Transformer); object detection; multi-scale feature fusion; Transformer; dangerous behavior detection
Akcay S and Breckon T. 2022. Towards automatic threat detection: a survey of advances of deep learning within X-ray security imaging. Pattern Recognition, 122: 108245 [DOI: 10.1016/j.patcog.2021.108245]
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A and Zagoruyko S. 2020. End-to-end object detection with Transformers//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 213-229 [DOI: 10.1007/978-3-030-58452-8_13]
Chen Q, Chen X K, Wang J, Feng H C, Han J Y, Ding E R, Zeng G and Wang J D. 2023. Group DETR: fast DETR training with group-wise one-to-many assignment//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 6610-6619 [DOI: 10.1109/ICCV51070.2023.00610]
Han K, Wang Y H, Chen H T, Chen X H, Guo J Y, Liu Z H, Tang Y H, Xiao A, Xu C J, Xu Y X, Yang Z H, Zhang Y M and Tao D C. 2023. A survey on visual Transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1): 87-110 [DOI: 10.1109/TPAMI.2022.3152247]
Liao Y, Li Z M and Liu S. 2022. A review of deep learning based human-object interaction detection. Journal of Image and Graphics, 27(9): 2611-2628 [DOI: 10.11834/jig.211268]
Liang J F, Li T, Yang J Q, Li Y N, Fang Z W and Yang F. 2023. Video anomaly detection by fusing self-attention and autoencoder. Journal of Image and Graphics, 28(4): 1029-1040 [DOI: 10.11834/jig.211147]
Long Z R, Cai L F, Ye B Q, Tang B, Zhao M F, Tang Y L, Wang J X and Zhou M. 2023. Research on improved lightweight object detection algorithm based on YOLOv5s. Journal of Chongqing University of Technology (Natural Science), 37(12): 244-251 [DOI: 10.3969/j.issn.1674-825(z).2023.12.028]
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016. SSD: single shot MultiBox detector//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 21-37 [DOI: 10.1007/978-3-319-46448-0_2]
Liu S, Li F, Zhang H, Yang X, Qi X B, Su H, Zhu J and Zhang L. 2022. DAB-DETR: dynamic anchor boxes are better queries for DETR [EB/OL]. [2024-03-10]. https://arxiv.org/pdf/2201.12329.pdf
Liu X Y, Peng H W, Zheng N X, Yang Y Q, Hu H and Yuan Y X. 2023. EfficientViT: memory efficient vision Transformer with cascaded group attention//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 14420-14430 [DOI: 10.1109/CVPR52729.2023.01386]
Li F, Zhang H, Liu S L, Guo J, Ni L M and Zhang L. 2024. DN-DETR: accelerate DETR training by introducing query denoising. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(6): 2239-2251 [DOI: 10.1109/TPAMI.2023.3335410]
Ma S L and Xu Y. 2023. MPDIoU: a loss for efficient and accurate bounding box regression [EB/OL]. [2024-03-10]. https://arxiv.org/pdf/2307.07662.pdf
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Redmon J, Divvala S, Girshick R and Farhadi A. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 779-788 [DOI: 10.1109/CVPR.2016.91]
Redmon J and Farhadi A. 2018. YOLOv3: an incremental improvement [EB/OL]. [2024-03-12]. https://arxiv.org/pdf/1804.02767.pdf
Tong Z J, Chen Y H, Xu Z W and Yu R. 2023. Wise-IoU: bounding box regression loss with dynamic focusing mechanism [EB/OL]. [2024-03-11]. https://arxiv.org/pdf/2301.10051.pdf
Wang Y M, Zhang X Y, Yang T and Sun J. 2022. Anchor DETR: query design for Transformer-based detector//Proceedings of the 36th AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI: 2567-2575 [DOI: 10.1609/aaai.v36i3.20158]
Wang C Y, Liao H Y M, Wu Y H, Chen P Y, Hsieh J W and Yeh I H. 2020. CSPNet: a new backbone that can enhance learning capability of CNN//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Seattle, USA: IEEE: 1571-1580 [DOI: 10.1109/CVPRW50498.2020.00203]
Wang C C, He W, Nie Y, Guo J Y, Liu C J, Han K and Wang Y H. 2023. Gold-YOLO: efficient object detector via gather-and-distribute mechanism//Proceedings of the 37th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc.: 51094-51112
Xu G P, Liao W T, Zhang X, Li C, He X W and Wu X L. 2023. Haar wavelet downsampling: a simple but effective downsampling module for semantic segmentation. Pattern Recognition, 143: 109819 [DOI: 10.1016/j.patcog.2023.109819]
Yu F and Koltun V. 2015. Multi-scale context aggregation by dilated convolutions [EB/OL]. [2024-03-25]. https://arxiv.org/pdf/1511.07122.pdf
Yang Z M, Wang X L and Li J G. 2021. EIoU: an improved vehicle detection algorithm based on vehiclenet neural network. Journal of Physics: Conference Series, 1924(1): 012001 [DOI: 10.1088/1742-6596/1924/1/012001]
Zhu X Z, Su W J, Lu L W, Li B, Wang X G and Dai J F. 2020. Deformable DETR: deformable Transformers for end-to-end object detection [EB/OL]. [2024-02-22]. https://arxiv.org/pdf/2010.04159.pdf
Zhao Y, Lv W Y, Xu S L, Wang G Z, Wei J M, Dang Q Q, Liu Y and Chen J. 2023. DETRs beat YOLOs on real-time object detection [EB/OL]. [2024-03-20]. https://arxiv.org/pdf/2304.08069.pdf
Zhang J N, Li X T, Li J, Liu L, Xue Z C, Zhang B S, Jiang Z K, Huang T X, Wang Y B and Wang C J. 2023. Rethinking mobile block for efficient attention-based models//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 1389-1400 [DOI: 10.1109/ICCV51070.2023.00134]
Zheng Z H, Wang P, Ren D W, Liu W, Ye R G, Hu Q H and Zuo W M. 2022. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Transactions on Cybernetics, 52(8): 8574-8586 [DOI: 10.1109/TCYB.2021.3095305]
Zhang H, Xu C and Zhang S J. 2023. Inner-IoU: more effective intersection over union loss with auxiliary bounding box [EB/OL]. [2024-03-11]. https://arxiv.org/pdf/2311.02877.pdf