Visual words and self-attention mechanism fusion based video object segmentation method
Vol. 27, Issue 8, Pages: 2444-2457 (2022)
Published: 16 August 2022
Accepted: 15 June 2021
DOI: 10.11834/jig.210155
Chuanjun Ji, Yadang Chen, Xun Che. Visual words and self-attention mechanism fusion based video object segmentation method[J]. Journal of Image and Graphics, 27(8): 2444-2457 (2022)
Objective
Video object segmentation (VOS) separates foreground objects from the background across a video sequence. Its applications include video detection, video classification, video summarization, and autonomous driving. Our research focuses on the semi-supervised setting, which estimates the mask of the target object in the remaining frames of the video from the target mask given in the initial frame. However, current video object segmentation algorithms are hampered by irregular object shapes, interference from the background, and very fast motion. Hence, we develop a video object segmentation algorithm that integrates visual words with a self-attention mechanism.
Method
For the reference frame, the image is first fed into the encoder to extract features at 1/8 the resolution of the original image. The extracted features are then fed into an embedding space composed of several 3 × 3 convolution kernels, and the result is up-sampled to the original size. During training, pixels from the same target are drawn close together in the embedding space, while pixels from different targets are pushed far apart, as in the sketch below.
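Purely as illustration, the following sketch shows one possible training objective with this property; the abstract does not give the actual loss, so the contrastive margin form, the margin value, and the function name are assumptions.

```python
# Illustrative only: one possible embedding objective with the stated property.
# The abstract does not specify the loss, so this contrastive margin form,
# the margin value, and the function name are assumptions.
import torch

def pixel_contrastive_loss(emb: torch.Tensor, labels: torch.Tensor, margin: float = 1.0):
    """emb: (N, D) pixel embeddings; labels: (N,) object id of each pixel."""
    dist = torch.cdist(emb, emb)                        # (N, N) pairwise distances
    same = labels[:, None] == labels[None, :]           # pairs from the same target
    pull = dist[same].mean()                            # same target: pull close
    push = torch.clamp(margin - dist[~same], min=0).mean()  # different: push apart
    return pull + push

loss = pixel_contrastive_loss(torch.randn(100, 64), torch.randint(0, 3, (100,)))
```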
Finally, the visual words representing the target object are formed by combining the mask information annotated in the reference frame with a clustering of the pixels in the embedding space.
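A minimal sketch of this step, assuming k-means as the clustering algorithm (the abstract does not name one); random features stand in for the up-sampled embedding-space output of a real encoder.

```python
# Minimal sketch of visual-word formation, assuming k-means as the clustering
# algorithm (the abstract does not name one). Random features stand in for the
# up-sampled embedding-space output of a real encoder.
import numpy as np
from sklearn.cluster import KMeans

def build_visual_words(embeddings: np.ndarray, mask: np.ndarray, k: int = 16) -> np.ndarray:
    """embeddings: (H, W, D) pixel embeddings; mask: (H, W) binary object mask.
    Returns (k, D) cluster centers that serve as the object's visual words."""
    fg = embeddings[mask.astype(bool)]   # (N, D) foreground pixel embeddings
    k = min(k, len(fg))                  # guard against very small objects
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(fg).cluster_centers_

words = build_visual_words(np.random.randn(32, 32, 64),
                           np.pad(np.ones((12, 12)), 10))   # toy 32x32 mask
print(words.shape)   # (16, 64)
```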
For the target frame, its image is first fed into the encoder and passed through the embedding space, and a word-matching operation then represents the pixels in the embedding space with a fixed number of visual words, producing similarity maps. Learning visual words is challenging, however, because there is no ground-truth information about their corresponding object parts. Therefore, a meta-training algorithm alternates between unsupervised learning of the visual words and supervised learning of pixel classification given these words. Visual words allow more robust matching: an object may be occluded, deformed, change viewpoint, or disappear and reappear within the same video, while the appearance of its parts remains stable. The self-attention mechanism is then applied to the generated similarity maps to capture global dependencies, and the maximum value along the channel direction is taken as the predicted result; a sketch of this path follows.
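The sketch below illustrates the target-frame path under stated assumptions: cosine similarity for word matching, and a plain non-local-style attention that uses the similarity maps themselves as query, key, and value with a residual add (the exact projections are not given in this abstract).

```python
# Sketch of the target-frame path under stated assumptions: cosine similarity
# for word matching, and a plain non-local-style attention that uses the
# similarity maps themselves as query, key, and value with a residual add.
import numpy as np

def word_matching(embeddings, words):
    """embeddings: (H, W, D); words: (K, D) -> (K, H, W) cosine similarity maps."""
    e = embeddings / (np.linalg.norm(embeddings, axis=-1, keepdims=True) + 1e-8)
    w = words / (np.linalg.norm(words, axis=-1, keepdims=True) + 1e-8)
    return np.einsum('hwd,kd->khw', e, w)

def self_attention(maps):
    """maps: (K, H, W). Attend across spatial positions to capture global dependency."""
    k, h, w = maps.shape
    x = maps.reshape(k, -1)                      # (K, HW): a K-dim token per position
    a = x.T @ x / np.sqrt(k)                     # (HW, HW) position affinities
    a = np.exp(a - a.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)           # softmax over positions
    return maps + (x @ a.T).reshape(k, h, w)     # aggregate values, residual add

maps = word_matching(np.random.randn(32, 32, 64), np.random.randn(16, 64))
pred = self_attention(maps).max(axis=0)          # channel-wise max -> (H, W) scores
```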
To handle significant appearance changes and global mismatches, an efficient online update mechanism and a global correction mechanism are adopted to further improve accuracy. For the online update mechanism, the update timing affects model performance: a shorter update interval refreshes the dictionary more often, which helps the network adapt to dynamic scenes and fast-moving objects, but an interval that is too short introduces noisy visual words that degrade the algorithm. An appropriate update frequency is therefore important; here, the visual dictionary is updated every 5 frames, as in the schedule sketched below.
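A minimal, self-contained sketch of this schedule; the 5-frame interval follows the paper, while `embed`, `segment`, and `update_words` are stubs standing in for the encoder, the matching/attention pipeline, and the re-clustering step.

```python
# Self-contained sketch of the update schedule; the 5-frame interval follows
# the paper, while `embed`, `segment`, and `update_words` are stubs standing in
# for the encoder, the matching/attention pipeline, and the re-clustering step.
import numpy as np

UPDATE_INTERVAL = 5                              # refresh dictionary every 5 frames

def embed(frame): return frame                   # stub: frame -> (H, W, D) embeddings
def segment(emb, words): return (emb.sum(-1) > 0).astype(np.uint8)   # stub prediction
def update_words(emb, mask, words): return words                     # stub re-clustering

def track(frames, init_mask, words):
    masks = [init_mask]
    for t, frame in enumerate(frames[1:], start=1):
        emb = embed(frame)
        mask = segment(emb, words)
        if t % UPDATE_INTERVAL == 0:             # shorter intervals adapt faster but
            words = update_words(emb, mask, words)   # risk noisier visual words
        masks.append(mask)
    return masks

masks = track([np.random.randn(8, 8, 4) for _ in range(12)],
              np.ones((8, 8), np.uint8), words=None)
```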
Furthermore, to ensure that the prediction masks used to update the visual words are reliable, a simple outlier-removal step is applied to them. Specifically, given a region with the same prediction annotation, the region is accepted only if it intersects the object mask predicted in the previous frame; if there is no intersection, the prediction mask is discarded and the prediction falls back to the previous result. A minimal version of this check is sketched below.
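This sketch assumes connected components (scipy's `ndimage.label`) define "a region with the same prediction annotation".

```python
# Minimal version of the check, assuming connected components (scipy's
# ndimage.label) define "a region with the same prediction annotation".
import numpy as np
from scipy import ndimage

def remove_outliers(pred_mask: np.ndarray, prev_mask: np.ndarray) -> np.ndarray:
    labels, n = ndimage.label(pred_mask)         # connected regions of the prediction
    kept = np.zeros_like(pred_mask)
    for r in range(1, n + 1):
        region = labels == r
        if (region & prev_mask.astype(bool)).any():   # keep only intersecting regions
            kept[region] = 1
    return kept if kept.any() else prev_mask.copy()   # no overlap: reuse previous result

pred = np.zeros((8, 8), np.uint8); pred[0:2, 0:2] = 1; pred[5:7, 5:7] = 1
prev = np.zeros((8, 8), np.uint8); prev[1:3, 1:3] = 1
print(remove_outliers(pred, prev))               # only the region touching prev survives
```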
Result
We validate the effectiveness and robustness of our method on the challenging DAVIS 2016 (densely annotated video segmentation) and DAVIS 2017 datasets. Compared with state-of-the-art methods, our method achieves J&F-mean (Jaccard and F-score mean) scores of 83.2% on DAVIS 2016 and 72.3% on DAVIS 2017. It reaches accuracy comparable to fine-tuning-based methods and a competitive speed/accuracy trade-off on both datasets.
Conclusion
The proposed algorithm can effectively deal with the interference caused by occlusion, deformation, and viewpoint changes, and achieves high-quality video object segmentation.
Keywords: video object segmentation (VOS); clustering algorithm; visual words; self-attention mechanism; online update mechanism; global correction mechanism