Visual words and self-attention mechanism fusion based video object segmentation method
Vol. 27, Issue 8, Pages: 2444-2457 (2022)
Published: 16 August 2022
Accepted: 15 June 2021
DOI: 10.11834/jig.210155
Chuanjun Ji, Yadang Chen, Xun Che. Visual words and self-attention mechanism fusion based video object segmentation method[J]. Journal of Image and Graphics, 27(8): 2444-2457 (2022)
Objective
Video object segmentation (VOS) separates foreground objects from the background across a video sequence. Its applications include video detection, video classification, video summarization, and autonomous driving. Our research focuses on the semi-supervised setting, which estimates the mask of the target object in the remaining frames of the video from the target mask given in the initial frame. However, current video object segmentation algorithms are hampered by irregular object shapes, interference from the background, and very fast motion. Hence, we develop a video object segmentation algorithm that integrates visual words with a self-attention mechanism.
Method
For the reference frame, the image is first fed into the encoder to extract features at 1/8 the resolution of the original image. The extracted features are then fed into an embedding space composed of several 3 × 3 convolution kernels, and the result is up-sampled to the original size. During training, pixels from the same target are drawn close together in the embedding space, while pixels from different targets are pushed far apart, as in the sketch below.
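Purely as illustration, the following sketch shows one possible training objective with this property; the abstract does not give the actual loss, so the contrastive margin form, the margin value, and the function name are assumptions.

```python
# Illustrative only: one possible embedding objective with the stated property.
# The abstract does not specify the loss, so this contrastive margin form,
# the margin value, and the function name are assumptions.
import torch

def pixel_contrastive_loss(emb: torch.Tensor, labels: torch.Tensor, margin: float = 1.0):
    """emb: (N, D) pixel embeddings; labels: (N,) object id of each pixel."""
    dist = torch.cdist(emb, emb)                        # (N, N) pairwise distances
    same = labels[:, None] == labels[None, :]           # pairs from the same target
    pull = dist[same].mean()                            # same target: pull close
    push = torch.clamp(margin - dist[~same], min=0).mean()  # different: push apart
    return pull + push

loss = pixel_contrastive_loss(torch.randn(100, 64), torch.randint(0, 3, (100,)))
```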
Finally, the visual words representing the target object are formed by combining the mask information annotated in the reference frame with a clustering of the pixels in the embedding space.
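A minimal sketch of this step, assuming k-means as the clustering algorithm (the abstract does not name one); random features stand in for the up-sampled embedding-space output of a real encoder.

```python
# Minimal sketch of visual-word formation, assuming k-means as the clustering
# algorithm (the abstract does not name one). Random features stand in for the
# up-sampled embedding-space output of a real encoder.
import numpy as np
from sklearn.cluster import KMeans

def build_visual_words(embeddings: np.ndarray, mask: np.ndarray, k: int = 16) -> np.ndarray:
    """embeddings: (H, W, D) pixel embeddings; mask: (H, W) binary object mask.
    Returns (k, D) cluster centers that serve as the object's visual words."""
    fg = embeddings[mask.astype(bool)]   # (N, D) foreground pixel embeddings
    k = min(k, len(fg))                  # guard against very small objects
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(fg).cluster_centers_

words = build_visual_words(np.random.randn(32, 32, 64),
                           np.pad(np.ones((12, 12)), 10))   # toy 32x32 mask
print(words.shape)   # (16, 64)
```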
For the target frame, its image is first fed into the encoder and passed through the embedding space, and a word-matching operation then represents the pixels in the embedding space with a fixed number of visual words, producing similarity maps. Learning visual words is challenging, however, because there is no ground-truth information about their corresponding object parts. Therefore, a meta-training algorithm alternates between unsupervised learning of the visual words and supervised learning of pixel classification given these words. Visual words allow more robust matching: an object may be occluded, deformed, change viewpoint, or disappear and reappear within the same video, while the appearance of its parts remains stable. The self-attention mechanism is then applied to the generated similarity maps to capture global dependencies, and the maximum value along the channel direction is taken as the predicted result; a sketch of this path follows.
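The sketch below illustrates the target-frame path under stated assumptions: cosine similarity for word matching, and a plain non-local-style attention that uses the similarity maps themselves as query, key, and value with a residual add (the exact projections are not given in this abstract).

```python
# Sketch of the target-frame path under stated assumptions: cosine similarity
# for word matching, and a plain non-local-style attention that uses the
# similarity maps themselves as query, key, and value with a residual add.
import numpy as np

def word_matching(embeddings, words):
    """embeddings: (H, W, D); words: (K, D) -> (K, H, W) cosine similarity maps."""
    e = embeddings / (np.linalg.norm(embeddings, axis=-1, keepdims=True) + 1e-8)
    w = words / (np.linalg.norm(words, axis=-1, keepdims=True) + 1e-8)
    return np.einsum('hwd,kd->khw', e, w)

def self_attention(maps):
    """maps: (K, H, W). Attend across spatial positions to capture global dependency."""
    k, h, w = maps.shape
    x = maps.reshape(k, -1)                      # (K, HW): a K-dim token per position
    a = x.T @ x / np.sqrt(k)                     # (HW, HW) position affinities
    a = np.exp(a - a.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)           # softmax over positions
    return maps + (x @ a.T).reshape(k, h, w)     # aggregate values, residual add

maps = word_matching(np.random.randn(32, 32, 64), np.random.randn(16, 64))
pred = self_attention(maps).max(axis=0)          # channel-wise max -> (H, W) scores
```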
To handle significant appearance changes and global mismatches, an efficient online update mechanism and a global correction mechanism are adopted to further improve accuracy. For the online update mechanism, the update timing affects model performance: a shorter update interval refreshes the dictionary more often, which helps the network adapt to dynamic scenes and fast-moving objects, but an interval that is too short introduces noisy visual words that degrade the algorithm. An appropriate update frequency is therefore important; here, the visual dictionary is updated every 5 frames, as in the schedule sketched below.
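A minimal, self-contained sketch of this schedule; the 5-frame interval follows the paper, while `embed`, `segment`, and `update_words` are stubs standing in for the encoder, the matching/attention pipeline, and the re-clustering step.

```python
# Self-contained sketch of the update schedule; the 5-frame interval follows
# the paper, while `embed`, `segment`, and `update_words` are stubs standing in
# for the encoder, the matching/attention pipeline, and the re-clustering step.
import numpy as np

UPDATE_INTERVAL = 5                              # refresh dictionary every 5 frames

def embed(frame): return frame                   # stub: frame -> (H, W, D) embeddings
def segment(emb, words): return (emb.sum(-1) > 0).astype(np.uint8)   # stub prediction
def update_words(emb, mask, words): return words                     # stub re-clustering

def track(frames, init_mask, words):
    masks = [init_mask]
    for t, frame in enumerate(frames[1:], start=1):
        emb = embed(frame)
        mask = segment(emb, words)
        if t % UPDATE_INTERVAL == 0:             # shorter intervals adapt faster but
            words = update_words(emb, mask, words)   # risk noisier visual words
        masks.append(mask)
    return masks

masks = track([np.random.randn(8, 8, 4) for _ in range(12)],
              np.ones((8, 8), np.uint8), words=None)
```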
Furthermore, to ensure that the prediction masks used to update the visual words are reliable, a simple outlier-removal step is applied to them. Specifically, given a region with the same prediction annotation, the region is accepted only if it intersects the object mask predicted in the previous frame; if there is no intersection, the prediction mask is discarded and the prediction falls back to the previous result. A minimal version of this check is sketched below.
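This sketch assumes connected components (scipy's `ndimage.label`) define "a region with the same prediction annotation".

```python
# Minimal version of the check, assuming connected components (scipy's
# ndimage.label) define "a region with the same prediction annotation".
import numpy as np
from scipy import ndimage

def remove_outliers(pred_mask: np.ndarray, prev_mask: np.ndarray) -> np.ndarray:
    labels, n = ndimage.label(pred_mask)         # connected regions of the prediction
    kept = np.zeros_like(pred_mask)
    for r in range(1, n + 1):
        region = labels == r
        if (region & prev_mask.astype(bool)).any():   # keep only intersecting regions
            kept[region] = 1
    return kept if kept.any() else prev_mask.copy()   # no overlap: reuse previous result

pred = np.zeros((8, 8), np.uint8); pred[0:2, 0:2] = 1; pred[5:7, 5:7] = 1
prev = np.zeros((8, 8), np.uint8); prev[1:3, 1:3] = 1
print(remove_outliers(pred, prev))               # only the region touching prev survives
```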
Result
We validate the effectiveness and robustness of our method on the challenging DAVIS 2016 (densely annotated video segmentation) and DAVIS 2017 datasets. Compared with state-of-the-art methods, our method achieves J&F-mean (Jaccard and F-score mean) scores of 83.2% on DAVIS 2016 and 72.3% on DAVIS 2017. It reaches accuracy comparable to fine-tuning-based methods and a competitive speed/accuracy trade-off on both datasets.
Conclusion
The proposed algorithm can effectively deal with the interference caused by occlusion, deformation, and viewpoint changes, and achieves high-quality video object segmentation.
Keywords: video object segmentation (VOS); clustering algorithm; visual words; self-attention mechanism; online update mechanism; global correction mechanism