Dual optical flow network-guided video object detection
2021, Vol. 26, No. 10: 2473-2484
Print publication date: 2021-09-16
Accepted: 2020-11-30
DOI: 10.11834/jig.200413
Wanqing Yu, Jing Yu, Xinqi Shi, Chuangbai Xiao. Dual optical flow network-guided video object detection[J]. Journal of Image and Graphics, 2021,26(10):2473-2484.
Objective
Convolutional neural networks are widely used in object detection. The task of video object detection is to classify and localize moving objects in sequential images. Building on still-image object detectors, most existing video object detection methods exploit the temporal correlation peculiar to video to address the false negatives and false positives caused by motion blur, occlusion, and other degradations of moving objects.
Method
This paper proposes a video object detection model guided by dual optical flow networks. Under the two-stage object detection framework, two different optical flow networks estimate optical flow fields between the current frame and adjacent frames at different temporal distances, and multi-frame features are fused accordingly: for adjacent frames close to the current frame, an optical flow network designed for small-displacement motion estimation is used, and for adjacent frames farther away, an optical flow network designed for large-displacement motion estimation is used. Guided by the optical flow, the features of multiple adjacent frames are fused to compensate the feature of the current frame.
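To make the frame-distance rule concrete, the following is a minimal Python sketch of one way to dispatch between the two flow networks. The threshold SMALL_RANGE and the two network handles flow_net_small and flow_net_large are hypothetical placeholders; the abstract does not specify a cutoff or name the exact networks.

# A minimal sketch of selecting a flow network by temporal distance.
# SMALL_RANGE and the two network handles are assumed placeholders.
SMALL_RANGE = 2  # assumption: frames within +/-2 count as "close"

def estimate_flow(cur_frame, adj_frame, offset, flow_net_small, flow_net_large):
    """Estimate the optical flow field from the current frame to an
    adjacent frame, choosing the network by the frame offset."""
    if abs(offset) <= SMALL_RANGE:
        # small-displacement motion estimation for nearby frames
        return flow_net_small(cur_frame, adj_frame)
    # large-displacement motion estimation for farther frames
    return flow_net_large(cur_frame, adj_frame)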
Result
Experimental results show that the mAP (mean average precision) of the proposed model is 76.4%, which is 28.9%, 8.0%, 0.6%, and 0.2% higher than those of the TCN (temporal convolutional networks) model, the TPN+LSTM (tubelet proposal network and long short-term memory network) model, the D(&T loss) model, and the FGFA (flow-guided feature aggregation) model, respectively.
Conclusion
By exploiting the temporal correlation peculiar to video, the proposed model uses the dual optical flow networks to accurately compensate the feature of the current frame from adjacent frames, improves the accuracy of video object detection, and effectively alleviates false negatives and false positives in video object detection.
Objective
Object detection is a fundamental task in computer vision, and it provides support for subsequent object tracking, instance segmentation, and behavior recognition. The rapid development of deep learning has facilitated the wide use of convolutional neural networks in object detection and shifted the field from traditional methods to recent methods based on deep learning. Still-image object detection, which aims to determine the category and position of each object in an image, has progressed considerably in recent years. The task of video object detection is to locate moving objects in sequential images and assign a category label to each object. The accuracy of video object detection suffers from degenerated object appearances in videos, such as motion blur, multi-object occlusion, and rare poses. Although still-image object detection methods have achieved excellent results, directly applying them to video object detection is challenging because still-image detectors may generate false negatives and false positives caused by motion blur and object occlusion. Most existing video object detection methods incorporate temporal consistency across frames to improve upon single-frame detections.
Method
We propose a video object detection method guided by dual optical flow networks, which precisely propagates features from adjacent frames to the current frame and enhances the feature of the current frame by fusing the features of the adjacent frames. Under the framework of two-stage object detection, a deep convolutional network extracts a feature map from each frame of the video. According to the optical flow field, the features of the adjacent frames are warped to compensate the feature of the current frame. Depending on the time interval between an adjacent frame and the current frame, two different optical flow networks are applied to estimate the optical flow fields: the optical flow network for small-displacement motion estimation handles closer adjacent frames, and the optical flow network for large-displacement motion estimation handles farther adjacent frames.
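As a concrete illustration of the warping step, here is a minimal PyTorch-style sketch. The tensor shapes, the flow channel order, and the use of bilinear sampling via F.grid_sample are assumptions for illustration, not the authors' exact implementation.

# A minimal PyTorch sketch of flow-guided feature warping
# (an assumption-level illustration, not the authors' code).
import torch
import torch.nn.functional as F

def warp_feature(feat, flow):
    """Warp an adjacent frame's feature map onto the current frame.

    feat: (1, C, H, W) feature map of the adjacent frame.
    flow: (1, 2, H, W) flow from the current frame to the adjacent
          frame, in pixels (x-displacement first, then y).
    """
    _, _, h, w = feat.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float()   # (H, W, 2)
    grid = grid + flow[0].permute(1, 2, 0)         # displace each pixel by its flow
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(feat, grid.unsqueeze(0), mode="bilinear",
                         padding_mode="zeros", align_corners=True)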
The compensated feature maps of multiple frames, as well as the feature map of the current frame, are aggregated according to adaptive weights. The adaptive weights indicate the importance of each compensated feature map to the current frame. The similarity between a compensated feature map and the feature map extracted from the current frame is measured with the cosine similarity metric: if a compensated feature map is close to the feature map of the current frame, it is assigned a larger weight; otherwise, it is assigned a smaller weight. An embedding network consisting of three convolutional layers is applied to the compensated feature maps and the current feature map to produce embedding feature maps, from which the adaptive weights are computed.
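The adaptive weighting can be sketched in the same style. The abstract specifies only a three-layer convolutional embedding network and cosine similarity, so the embed_net handle and the softmax normalization over frames below are assumptions.

# A minimal PyTorch sketch of cosine-similarity adaptive weighting
# (embed_net is a placeholder for the three-layer embedding network;
# the softmax normalization over frames is an assumption).
import torch
import torch.nn.functional as F

def aggregate(cur_feat, warped_feats, embed_net):
    """Fuse the current feature map with warped adjacent feature maps."""
    feats = [cur_feat] + warped_feats                  # each (1, C, H, W)
    embeds = [embed_net(f) for f in feats]             # each (1, E, H, W)
    # Per-pixel cosine similarity of each embedding to the current one.
    sims = [F.cosine_similarity(e, embeds[0], dim=1)   # each (1, H, W)
            for e in embeds]
    # Normalize over frames so the weights sum to 1 at every position.
    weights = torch.softmax(torch.stack(sims, dim=0), dim=0)
    return sum(w.unsqueeze(1) * f for w, f in zip(weights, feats))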
Result
Experimental results show that the mean average precision (mAP) of the proposed method on the ImageNet VID (video object detection) dataset reaches 76.42%, which is 28.92%, 8.02%, 0.62%, and 0.24% higher than those of the temporal convolutional network (TCN), the method combining the tubelet proposal network (TPN) with a long short-term memory (LSTM) network, the D(&T loss) method, and flow-guided feature aggregation (FGFA), respectively. We also report the mAP scores over slow, medium, and fast objects. Compared with FGFA, our method combining the two optical flow networks improves the mAP scores of slow, medium, and fast objects by 0.2%, 0.48%, and 0.23%, respectively. Furthermore, the dual optical flow networks improve the estimation of the optical flow fields between the adjacent frames and the current frame, so the feature of the current frame can be compensated more precisely using the adjacent frames.
Conclusion
Considering the temporal correlation peculiar to video, the proposed model improves the accuracy of video object detection through feature aggregation guided by dual optical flow networks under the framework of two-stage object detection. The dual optical flow networks accurately compensate the feature of the current frame from the adjacent frames. Accordingly, we can fully utilize the feature of each adjacent frame and reduce false negatives and false positives through temporal feature fusion in video object detection.
Keywords: object detection; convolutional neural network (CNN); motion estimation; motion compensation; optical flow network; feature fusion
References
Dai J F, Li Y, He K M and Sun J. 2016. R-FCN: object detection via region-based fully convolutional networks//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: NIPS: 379-387 [DOI: 10.5555/3157096.3157139]
Dosovitskiy A, Fischer P, Ilg E, Häusser P, Hazirbas C, Golkov V, van der Smagt P, Cremers D and Brox T. 2015. FlowNet: learning optical flow with convolutional networks//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2758-2766 [DOI: 10.1109/ICCV.2015.316]
Feichtenhofer C, Pinz A and Zisserman A. 2017. Detect to track and track to detect//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 3057-3065 [DOI: 10.1109/ICCV.2017.330]
Girshick R. 2015. Fast R-CNN//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1440-1448 [DOI: 10.1109/ICCV.2015.169]
Girshick R, Donahue J, Darrell T and Malik J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 580-587 [DOI: 10.1109/CVPR.2014.81]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Ilg E, Mayer N, Saikia T, Keuper M, Dosovitskiy A and Brox T. 2017. FlowNet 2.0: evolution of optical flow estimation with deep networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 1647-1655 [DOI: 10.1109/CVPR.2017.179]
Kang K, Li H S, Xiao T, Ouyang W L, Yan J J, Liu X H and Wang X G. 2017. Object detection in videos with tubelet proposal networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 889-897 [DOI: 10.1109/CVPR.2017.101]
Kang K, Ouyang W L, Li H S and Wang X G. 2016. Object detection from video tubelets with convolutional neural networks//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 817-825 [DOI: 10.1109/CVPR.2016.95]
Krizhevsky A, Sutskever I and Hinton G E. 2012. ImageNet classification with deep convolutional neural networks//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: NIPS: 1097-1105 [DOI: 10.5555/2999134.2999257]
Liu L, Ouyang W L, Wang X G, Fieguth P, Chen J, Liu X W and Pietikäinen M. 2020. Deep learning for generic object detection: a survey. International Journal of Computer Vision, 128(2): 261-318 [DOI: 10.1007/s11263-019-01247-4]
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016. SSD: single shot multibox detector//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 21-37 [DOI: 10.1007/978-3-319-46448-0_2]
Luo H L and Chen H K. 2020. Survey of object detection based on deep learning. Acta Electronica Sinica, 48(6): 1230-1239 [DOI: 10.3969/j.issn.0372-2112.2020.06.026]
Pei W, Xu Y M, Zhu Y Y, Wang P Q, Lu M Y and Li F. 2019. The target detection method of aerial photography images with improved SSD. Journal of Software, 30(3): 738-758 [DOI: 10.13328/j.cnki.jos.005695]
Redmon J, Divvala S, Girshick R and Farhadi A. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 779-788 [DOI: 10.1109/CVPR.2016.91]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Uijlings J R R, van de Sande K E A, Gevers T and Smeulders A W M. 2013. Selective search for object recognition. International Journal of Computer Vision, 104(2): 154-171 [DOI: 10.1007/s11263-013-0620-5]
Xiao F Y and Lee Y J. 2018. Video object detection with an aligned spatial-temporal memory//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 494-510 [DOI: 10.1007/978-3-030-01237-3_30]
Zhang H, Wang K F and Wang F Y. 2017. Advances and perspectives on applications of deep learning in visual object detection. Acta Automatica Sinica, 43(8): 1289-1305 [DOI: 10.16383/j.aas.2017.c160822]
Zhao Y Q, Rao Y, Dong S P and Zhang J Y. 2020. Survey on deep learning object detection. Journal of Image and Graphics, 25(4): 629-654 [DOI: 10.11834/jig.190307]
Zhu X Z, Wang Y J, Dai J F, Yuan L and Wei X C. 2017. Flow-guided feature aggregation for video object detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 408-417 [DOI: 10.1109/ICCV.2017.52]