Review of action recognition based on multimodal data
2022, Vol. 27, No. 11, Pages: 3139-3159
Print publication date: 2022-11-16
Accepted: 2021-11-26
DOI: 10.11834/jig.210786
Shuaichen Wang, Qian Huang, Yunfei Zhang, Xing Li, Yunqing Nie, Guocui Luo. Review of action recognition based on multimodal data[J]. Journal of Image and Graphics, 2022, 27(11): 3139-3159.
Action recognition is an important research topic in the video understanding area of computer vision. Accurately extracting human action features from video and recognizing the actions can provide valuable information for fields such as healthcare and security, making this a highly promising direction. From a data-driven perspective, this paper comprehensively reviews the research and development of action recognition technology and systematically describes representative action recognition methods and models. Action recognition data fall into the RGB, depth, skeleton, and fused modalities. We first introduce the main pipeline of action recognition and the public datasets of the different data modalities in the field. Then, organized by data modality, we review action recognition methods based on traditional hand-crafted features and on deep learning under the RGB, depth, and skeleton modalities, as well as fusion methods combining the RGB and depth modalities and methods fusing other modalities. The traditional hand-crafted feature methods include methods based on spatiotemporal volumes and spatiotemporal interest points (RGB modality), methods based on motion change and appearance (depth modality), and methods based on skeleton features (skeleton modality). The deep learning methods mainly involve convolutional networks, graph convolutional networks, and hybrid networks; we highlight their improvements, characteristics, and model innovations. Different action recognition techniques are compared on datasets grouped by modality. Through comparative analysis within and across categories, we derive the strengths, weaknesses, and applicable scenarios of each modality, the differences between hand-crafted feature methods and deep learning methods, and the advantages of multimodal fusion. Finally, we summarize the current problems and challenges of action recognition and, from the perspective of data modality, propose feasible future research directions and priorities.
Human action recognition is an essential problem in video understanding within computer vision: extracting action features accurately from video and recognizing the actions supports a wide range of applications. The data used for action recognition can be divided into four modalities: RGB, depth, skeleton, and multimodal fusion. From this data-driven perspective, this review surveys the research and development of action recognition and systematically examines representative algorithms and models. First, we introduce the general pipeline of action recognition, which consists of video input, feature extraction, classification, and output of results. We then describe the popular datasets of each modality, including the RGB-modality HMDB-51 (human motion database), UCF101, and Something-Something datasets; the depth- and skeleton-modality MSR-Action3D, MSR Daily Activity, and UTD-MHAD (multimodal human action recognition) datasets; and the NTU RGB+D 60/120 datasets, which provide RGB, depth, and skeleton data. The characteristics of each dataset are explained in detail. Compared with previous action recognition reviews, our contributions are as follows: 1) the classification by data modality, method, and dataset is more instructive; 2) modality fusion for action recognition is discussed more comprehensively; 3) whereas recent reviews cover only deep learning and omit the early hand-crafted feature methods, we analyze the strengths of both hand-crafted features and deep learning; and 4) the advantages and disadvantages of the different modalities, the remaining challenges of action recognition, and future research directions are discussed. Organized by data modality, we review the traditional hand-crafted feature methods and the deep learning methods for the RGB, depth, and skeleton modalities, as well as fusion methods that combine RGB and depth or other modalities. For the RGB modality, the hand-crafted methods are based on spatiotemporal volumes, spatiotemporal interest points, and skeleton trajectories, while the deep learning methods involve 3D convolutional neural networks and two-stream networks; 3D convolution models the relationship between the spatial and temporal dimensions, so spatiotemporal feature extraction is handled jointly. For the depth modality, the hand-crafted methods exploit motion-change and appearance features, and the representative deep learning approach is the point cloud network, which is adapted from image processing networks and extracts action features effectively by operating on point sets. For the skeleton modality, the hand-crafted methods are mainly based on skeleton features, and the deep learning methods mainly use graph convolutional networks, which suit the graph structure of skeleton data and facilitate the propagation of information between joints. Next, we summarize the recognition accuracy of representative algorithms and models on the RGB-modality HMDB-51, UCF101, and Something-Something V2 datasets, select results representing the hand-crafted and deep learning methods, and present a histogram for a more specific comparison. For the depth modality, we collect the recognition rates of algorithms and models on the MSR-Action3D and NTU RGB+D 60 depth datasets. For skeleton data, we compare model accuracy on the NTU RGB+D 60 and NTU RGB+D 120 skeleton datasets and also chart the number of trainable parameters of recent models. For multimodal fusion, we adopt the NTU RGB+D 60 dataset, which includes the RGB, depth, and skeleton modalities, and compare, for the same algorithm or model, the recognition rate of each single modality with the improved accuracy obtained after fusion. We conclude that the RGB, depth, and skeleton modalities each have strengths and drawbacks that suit different application scenarios, and that fusing multiple modalities lets them complement one another and improves recognition; hand-crafted feature methods suit small datasets and have lower algorithmic complexity, whereas deep learning suits large datasets and extracts features automatically from massive data; and research has shifted from deepening networks toward lightweight networks with high recognition rates. Finally, the current problems and challenges of action recognition are summarized with respect to multi-person action recognition, the recognition of fast and similar actions, and the collection of depth and skeleton data, and future directions are projected from the data-modality perspective: effective modality fusion, novel network designs, and the incorporation of attention modules.
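The review notes that 3D convolution models the spatial and temporal dimensions jointly. The following minimal sketch, assuming PyTorch and not reproducing any surveyed model, shows how a single Conv3d layer slides over frames as well as pixels, so the resulting features already mix appearance and motion; the clip shape and channel counts are illustrative:

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, RGB channels, frames, height, width)
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
features = conv3d(clip)                  # (1, 64, 16, 112, 112)
# Each output value depends on a 3x3x3 neighborhood spanning adjacent frames,
# so spatial appearance and temporal motion are encoded together.
```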
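For the depth modality, the review states that point cloud networks extract action features by processing point sets. Below is a hypothetical PointNet-style set encoder in that spirit, a sketch only: the shared per-point MLP and symmetric max pooling make the representation independent of point order; the layer sizes and the 60-class head (chosen to echo NTU RGB+D 60) are assumptions:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """PointNet-style set encoder: shared per-point MLP + max pooling (illustrative)."""
    def __init__(self, num_classes=60):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, 128), nn.ReLU())
        self.head = nn.Linear(128, num_classes)

    def forward(self, points):                   # points: (batch, num_points, 3)
        feats = self.mlp(points)                 # per-point features
        global_feat = feats.max(dim=1).values    # symmetric pooling over the set
        return self.head(global_feat)            # class logits

logits = TinyPointNet()(torch.randn(2, 1024, 3))  # (2, 60) for two 1024-point clouds
```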
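For the skeleton modality, graph convolution propagates information along the bones of the skeleton graph. A minimal sketch of one Kipf-and-Welling-style graph-convolution step follows, assuming a 25-joint skeleton as in NTU RGB+D; the normalization and weight shapes are illustrative, not the formulation of any specific surveyed network:

```python
import torch

def skeleton_gcn_layer(x, adj, weight):
    # x: (25, C_in) joint features; adj: (25, 25) bone connectivity; weight: (C_in, C_out)
    a_hat = adj + torch.eye(adj.size(0))                  # add self-loops
    d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))   # symmetric degree normalization
    return d_inv_sqrt @ a_hat @ d_inv_sqrt @ x @ weight   # propagate along bones

joints = torch.randn(25, 3)        # 3D coordinates of a 25-joint skeleton (assumed)
bones = torch.zeros(25, 25)        # fill with 1s for connected joint pairs
bones[0, 1] = bones[1, 0] = 1.0    # e.g., one bone, purely illustrative
out = skeleton_gcn_layer(joints, bones, torch.randn(3, 64))  # (25, 64)
```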
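Finally, one common way modalities complement each other is score-level (late) fusion, which combines the class probabilities produced by per-modality models. The sketch below is a generic illustration of this idea rather than the fusion scheme of any particular paper; the equal weights and the 60-class outputs are assumptions:

```python
import torch
import torch.nn.functional as F

def late_fusion(logits_list, weights=None):
    """Fuse per-modality classifier outputs by averaging their softmax scores."""
    probs = [F.softmax(l, dim=-1) for l in logits_list]
    weights = weights or [1.0 / len(probs)] * len(probs)
    fused = sum(w * p for w, p in zip(weights, probs))
    return fused.argmax(dim=-1)                 # fused class prediction

# e.g., RGB, depth, and skeleton models each emit logits over 60 classes (assumed)
rgb, depth, skel = (torch.randn(1, 60) for _ in range(3))
pred = late_fusion([rgb, depth, skel])
```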
Keywords: computer vision; action recognition; deep learning; neural network; multimodal; modal fusion
Ahmad Z, Illanko K, Khan N and Androutsos D. 2019. Human action recognition using convolutional neural network and depth sensor data//Proceedings of 2019 International Conference on Information Technology and Computer Communications. Singapore, Singapore: ACM: 1-5 [DOI: 10.1145/3355402.3355419]
Baradel F, Wolf C and Mille J. 2018. Human activity recognition with pose-driven attention to RGB [EB/OL]. [2021-09-02]. https://hal.inria.fr/hal-01828083/document
Bobick A F and Davis J W. 2001. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3): 257-267 [DOI: 10.1109/34.910878]
Bulbul M F, Tabussum S, Ali H, Zheng W L, Lee M Y and Ullah A. 2021. Exploring 3D human action recognition using STACOG on multi-view depth motion maps sequences. Sensors, 21(11): #3642 [DOI: 10.3390/s21113642]
Caetano C, Sena J, Brémond F, Dos Santos J A and Schwartz W R. 2019. SkeleMotion: a new representation of skeleton joint sequences based on motion information for 3D action recognition//Proceedings of the 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). Taipei, China: IEEE: 1-8 [DOI: 10.1109/AVSS.2019.8909840]
Carreira J and Zisserman A. 2017. Quo vadis, action recognition? A new model and the kinetics dataset//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4724-4733 [DOI: 10.1109/CVPR.2017.502]
Carreira J, Noland E, Banki-Horvath A, Hillier C and Zisserman A. 2018. A short note about kinetics-600 [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/1808.01340.pdf
Carreira J, Noland E, Hillier C and Zisserman A. 2019. A short note on the kinetics-700 human action dataset [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/1907.06987.pdf
Chakraborty B, Holte M B, Moeslund T B and Gonzàlez J. 2012. Selective spatio-temporal interest points. Computer Vision and Image Understanding, 116(3): 396-410 [DOI: 10.1016/j.cviu.2011.09.010]
Chen C, Jafari R and Kehtarnavaz N. 2015a. UTD-MHAD: a multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor//Proceedings of 2015 IEEE International Conference on Image Processing (ICIP). Quebec City, Canada: IEEE: 168-172 [DOI: 10.1109/ICIP.2015.7350781]
Chen W B and Guo G D. 2015. TriViews: a general framework to use 3D depth data effectively for action recognition. Journal of Visual Communication and Image Representation, 26: 182-191 [DOI: 10.1016/j.jvcir.2014.11.008]
Chen Y X, Zhang Z Q, Yuan C F, Li B, Deng Y and Hu W M. 2021a. Channel-wise topology refinement graph convolution for skeleton-based action recognition//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 13339-13348 [DOI: 10.1109/ICCV48922.2021.01311]
Chen Z, Li S C, Yang B, Li Q H and Liu H. 2021b. Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition [EB/OL]. [2021-09-02]. https://www.aaai.org/AAAI21Papers/AAAI-5287.ChenZ.pdf
Cheng K, Zhang Y F, Cao C Q, Shi L, Cheng J and Lu H Q. 2020a. Decoupling GCN with dropgraph module for skeleton-based action recognition//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 536-553 [DOI: 10.1007/978-3-030-58586-0_32]
Cheng K, Zhang Y F, He X F, Chen W H, Cheng J and Lu H Q. 2020b. Skeleton-based action recognition with shift graph convolutional network//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 180-189 [DOI: 10.1109/cvpr42600.2020.00026]
Das S, Sharma S and Dai R. 2020. VPN: learning video-pose embedding for activities of daily living//Proceedings of 2020 European Conference on Computer Vision. Online: Springer: 72-90
Das S, Dai R, Yang D and Bremond F. 2021. VPN++: rethinking video-pose embeddings for understanding activities of daily living [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/2105.08141.pdf
Davoodikakhki M and Yin K K. 2020. Hierarchical action classification with network pruning//15th International Symposium on Visual Computing. San Diego, USA: Springer: 291-305 [DOI: 10.1007/978-3-030-64556-4_23]
Duan H D, Zhao Y, Chen K, Lin D H and Dai B. 2022. Revisiting skeleton-based action recognition [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/2104.13586.pdf
Elmadany N E D, He Y F and Guan L. 2018. Information fusion for human action recognition via Biset/Multiset globality locality preserving canonical correlation analysis. IEEE Transactions on Image Processing, 27(11): 5275-5287 [DOI: 10.1109/tip.2018.2855438]
Evangelidis G, Singh G and Horaud R. 2014. Skeletal quads: human action recognition using joint quadruples//Proceedings of the 22nd International Conference on Pattern Recognition. Stockholm, Sweden: IEEE: 4513-4518 [DOI: 10.1109/ICPR.2014.772]
Fan H H, Yang Y and Kankanhalli M. 2021a. Point 4D transformer networks for spatio-temporal modeling in point cloud videos//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 14199-14208 [DOI: 10.1109/CVPR46437.2021.01398]
Fan H Q, Xiong B, Mangalam K, Li Y H, Yan Z C, Malik J and Feichtenhofer C. 2021b. Multiscale vision transformers [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/2104.11227.pdf
Gaidon A, Harchaoui Z and Schmid C. 2014. Activity representation with motion hierarchies. International Journal of Computer Vision, 107(3): 219-238 [DOI: 10.1007/s11263-013-0677-1]
Gowda S N, Rohrbach M and Sevilla-Lara L. 2020. SMART frame selection for action recognition [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/2012.10671.pdf
Goyal R, Kahou S E, Michalski V, Materzynska J, Westphal S, Kim H, Haenel V, Fruend I, Yianilos P, Mueller-Freitag M, Hoppe F, Thurau C, Bax I and Memisevic R. 2017. The "something something" video database for learning and evaluating visual common sense//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5843-5851 [DOI: 10.1109/ICCV.2017.622]
Hu J F, Zheng W S, Pan J H, Lai J H and Zhang J G. 2018. Deep bilinear learning for RGB-D action recognition//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 346-362 [DOI: 10.1007/978-3-030-01234-2_21]
Jalal A, Kim Y H, Kim Y J, Kamal S and Kim D. 2017. Robust human activity recognition from depth video using spatiotemporal multi-fused features. Pattern Recognition, 61: 295-308 [DOI: 10.1016/j.patcog.2016.08.003]
Ji X P, Cheng J, Feng W and Tao D P. 2018. Skeleton embedded motion body partition for human action recognition using depth sequences. Signal Processing, 143: 56-68 [DOI: 10.1016/j.sigpro.2017.08.016]
Jiang B Y, Wang M M, Gan W H, Wu W and Yan J J. 2019. STM: spatiotemporal and motion encoding for action recognition//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 2000-2009 [DOI: 10.1109/iccv.2019.00209]
Kalfaoglu M E, Kalkan S and Alatan A A. 2020. Late temporal modeling in 3D CNN architectures with BERT for action recognition//Proceedings of 2020 European Conference on Computer Vision. Glasgow, UK: Springer: 731-747 [DOI: 10.1007/978-3-030-68238-5_48]
Klaser A, Marszałek M and Schmid C. 2008. A spatio-temporal descriptor based on 3D-gradients//BMVC 2008-19th British Machine Vision Conference. [s. l.]: British Machine Vision Association: #99 [DOI: 10.5244/C.22.99]
Komkov S, Dzabraev M and Petiushko A. 2020. Mutual modality learning for video action classification [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/2011.02543.pdf
Koniusz P, Cherian A and Porikli F. 2016. Tensor representations via kernel linearization for action recognition from 3D skeletons//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 37-53 [DOI: 10.1007/978-3-319-46493-0_3]
Korban M and Li X. 2020. DDGCN: a dynamic directed graph convolutional network for action recognition//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 761-776 [DOI: 10.1007/978-3-030-58565-5_45]
Kuehne H, Jhuang H, Garrote E, Poggio T and Serre T. 2011. HMDB: a large video database for human motion recognition//Proceedings of 2011 IEEE International Conference on Computer Vision. Barcelona, Spain: IEEE: 2556-2563 [DOI: 10.1109/ICCV.2011.6126543]
Kwon H, Kim M, Kwak S and Cho M. 2020. MotionSqueeze: neural motion feature learning for video understanding//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 345-362 [DOI: 10.1007/978-3-030-58517-4_21]
Lee Y, Kim H I, Yun K and Moon J. 2021. Diverse temporal aggregation and depthwise spatiotemporal factorization for efficient video classification [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/2012.00317.pdf
Li C, Huang Q, Li X and Wu Q H. 2021. A multi-scale human action recognition method based on Laplacian pyramid depth motion images//Proceedings of the 2nd ACM International Conference on Multimedia in Asia. Singapore, Singapore: ACM: #31 [DOI: 10.1145/3444685.3446284]
Li J N, Wong Y, Zhao Q and Mohan S K. 2018. Unsupervised learning of view-invariant action representations [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/1809.01844.pdf
Li M S, Chen S H, Chen X, Zhang Y, Wang Y F and Tian Q. 2019. Actional-structural graph convolutional networks for skeleton-based action recognition//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 3590-3598 [DOI: 10.1109/CVPR.2019.00371]
Li W Q, Zhang Z Y and Liu Z C. 2010. Action recognition based on a bag of 3D points//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition—Workshops. San Francisco, USA: IEEE: 9-14 [DOI: 10.1109/CVPRW.2010.5543273]
Lin J, Gan C and Han S. 2019. TSM: temporal shift module for efficient video understanding//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 7082-7092 [DOI: 10.1109/iccv.2019.00718]
Liu J, Shahroudy A, Perez M, Wang G, Duan L Y and Kot A C. 2020a. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10): 2684-2701 [DOI: 10.1109/TPAMI.2019.2916873]
Liu J, Shahroudy A, Xu D and Wang G. 2016. Spatio-temporal LSTM with trust gates for 3D human action recognition//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 816-833 [DOI: 10.1007/978-3-319-46487-9_50]
Liu J H and Xu D. 2021. GeometryMotion-net: a strong two-stream baseline for 3D action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 31(12): 4711-4721 [DOI: 10.1109/tcsvt.2021.3101847]
Liu X Y, Yan M Y and Bohg J. 2019. MeteorNet: deep learning on dynamic 3D point cloud sequences//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE: 9245-9254 [DOI: 10.1109/iccv.2019.00934]
Liu Y, Xue P P, Li H and Wang C X. 2021. A review of action recognition using joints based on deep learning. Journal of Electronics and Information Technology, 43(6): 1789-1802 [DOI: 10.11999/JEIT200267]
Liu Z, Ning J, Cao Y, Wei Y X, Zhang Z, Lin S and Hu H. 2021. Video swin transformer [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/2106.13230v1
Liu Z Y, Zhang H W, Chen Z H, Wang Z Y and Ouyang W L. 2020b. Disentangling and unifying graph convolutions for skeleton-based action recognition//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 140-149 [DOI: 10.1109/cvpr42600.2020.00022]
Nguyen T V, Song Z and Yan S C. 2015. STAP: spatial-temporal attention-aware pooling for action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 25(1): 77-86 [DOI: 10.1109/TCSVT.2014.2333151]
Obinata Y and Yamamoto T. 2021. Temporal extension module for skeleton-based action recognition//Proceedings of the 25th International Conference on Pattern Recognition (ICPR). Milan, Italy: IEEE: 534-540 [DOI: 10.1109/ICPR48806.2021.9412113]
Ohn-Bar E and Trivedi M M. 2013. Joint angles similarities and HOG2 for action recognition//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Portland, USA: IEEE: 465-470 [DOI: 10.1109/cvprw.2013.76]
Oreifej O and Liu Z C. 2013. HON4D: histogram of oriented 4D normals for activity recognition from depth sequences//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, USA: IEEE: 716-723 [DOI: 10.1109/CVPR.2013.98]
Patrick M, Campbell D, Asano Y M, Misra I, Metze F, Feichtenhofer C, Vedaldi A and Henriques J F. 2021. Keeping your eye on the ball: trajectory attention in video transformers [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/2106.05392.pdf
Peng X J and Schmid C. 2016. Multi-region two-stream R-CNN for action detection//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 744-759 [DOI: 10.1007/978-3-319-46493-0_45]
Qi C R, Yi L, Su H and Guibas L J. 2017. PointNet++: deep hierarchical feature learning on point sets in a metric space [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/1706.02413.pdf
Qiu Z F, Yao T, Ngo C W, Tian X M and Mei T. 2019. Learning spatio-temporal representation with local and global diffusion//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 12048-12057 [DOI: 10.1109/CVPR.2019.01233]
Rahmani H, Mahmood A, Huynh D Q and Mian A. 2014. Real time action recognition using histograms of depth gradients and random decision forests//Proceedings of 2014 IEEE Winter Conference on Applications of Computer Vision. Steamboat Springs, USA: IEEE: 626-633 [DOI: 10.1109/WACV.2014.6836044]
Rahmani H, Mahmood A, Huynh D and Mian A. 2016. Histogram of oriented principal components for cross-view action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(12): 2430-2443 [DOI: 10.1109/tpami.2016.2533389]
Ren Z L, Zhang Q S, Cheng J, Hao F S and Gao X Y. 2021. Segment spatial-temporal representation and cooperative learning of convolution neural networks for multimodal-based action recognition. Neurocomputing, 433: 142-153 [DOI: 10.1016/j.neucom.2020.12.020]
Shahroudy A, Liu J, Ng T T and Wang G. 2016a. NTU RGB+D: a large scale dataset for 3D human activity analysis//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 1010-1019 [DOI: 10.1109/CVPR.2016.115]
Shahroudy A, Ng T T, Gong Y H and Wang G. 2016b. Deep multimodal feature analysis for action recognition in RGB+D videos [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/1603.07120.pdf
Shi L, Zhang Y F, Cheng J and Lu H Q. 2019. Skeleton-based action recognition with directed graph neural networks//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 7904-7913 [DOI: 10.1109/cvpr.2019.00810]
Shi L, Zhang Y F, Cheng J and Lu H Q. 2020. Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Transactions on Image Processing, 29: 9532-9545 [DOI: 10.1109/TIP.2020.3028207]
Shu Z X, Yun K and Samaras D. 2015. Action detection with improved dense trajectories and sliding window//Proceedings of 2015 European Conference on Computer Vision. Zurich, Switzerland: Springer: 541-551 [DOI: 10.1007/978-3-319-16178-5_38]
Song Y F, Zhang Z and Wang L. 2019. Richly activated graph convolutional network for action recognition with incomplete skeletons//Proceedings of 2019 IEEE International Conference on Image Processing (ICIP). Taipei, China: IEEE: 1-5
Song Y F, Zhang Z, Shan C F and Wang L. 2022. Constructing stronger and faster baselines for skeleton-based action recognition [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/2106.15125.pdf
Soomro K, Zamir A R and Shah M. 2012. UCF101: a dataset of 101 human actions classes from videos in the wild [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/1212.0402.pdf
Sun B, Kong D H, Wang S F, Wang L C, Wang Y P and Yin B C. 2019. Effective human action recognition using global and local offsets of skeleton joints. Multimedia Tools and Applications, 78(5): 6329-6353 [DOI: 10.1007/s11042-018-6370-1]
Trelinski J and Kwolek B. 2019. Ensemble of classifiers using CNN and hand-crafted features for depth-based action recognition//Proceedings of the 18th International Conference on Artificial Intelligence and Soft Computing. Zakopane, Poland: Springer: 91-103 [DOI: 10.1007/978-3-030-20915-5_9]
Vemulapalli R, Arrate F and Chellappa R. 2014. Human action recognition by representing 3D skeletons as points in a lie group//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 588-595 [DOI: 10.1109/CVPR.2014.82]
Wang H and Schmid C. 2013. Action recognition with improved trajectories//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE: 3551-3558 [DOI: 10.1109/ICCV.2013.441]
Wang J, Liu Z C, Wu Y and Yuan J S. 2012. Mining actionlet ensemble for action recognition with depth cameras//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 1290-1297 [DOI: 10.1109/CVPR.2012.6247813]
Wang L M, Xiong Y J, Wang Z, Qiao Y, Lin D H, Tang X O and Van Gool L. 2016. Temporal segment networks: towards good practices for deep action recognition [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/1608.00859.pdf
Wang P, Li W, Wan J, Ogunbona P and Liu X. 2018b. Cooperative training of deep aggregation networks for RGB-D action recognition [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/1801.01080v1.pdf
Wang P C, Li W Q, Gao Z M, Tang C and Ogunbona P O. 2018a. Depth pooling based large-scale 3-D action recognition with convolutional neural networks. IEEE Transactions on Multimedia, 20(5): 1051-1061 [DOI: 10.1109/TMM.2018.2818329]
Wang P C, Li W Q, Gao Z M, Zhang J, Tang C and Ogunbona P. 2015. Deep convolutional neural networks for action recognition using depth map sequences [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/1501.04686.pdf
Wang Y C, Xiao Y, Xiong F, Jiang W X, Cao Z G, Zhou J T and Yuan J S. 2020. 3DV: 3D dynamic voxel for action recognition in depth video//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 508-517 [DOI: 10.1109/CVPR42600.2020.00059]
Xia L and Aggarwal J K. 2013. Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, USA: IEEE: 2834-2841 [DOI: 10.1109/CVPR.2013.365]
Xiao Y, Chen J, Wang Y C, Cao Z G, Zhou J T and Bai X. 2018. Action recognition for depth video using multi-view dynamic images [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/1806.11269v2.pdf
Yan S J, Xiong Y J and Lin D H. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/1801.07455.pdf
Yang X D and Tian Y L. 2014. Super normal vector for activity recognition using depth sequences//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 804-811 [DOI: 10.1109/CVPR.2014.108]
Yang X D, Zhang C Y and Tian Y L. 2012. Recognizing actions using depth motion maps-based histograms of oriented gradients//Proceedings of the 20th ACM International Conference on Multimedia. Nara, Japan: ACM: 1057-1060 [DOI: 10.1145/2393347.2396382]
Ye F F, Pu S L, Zhong Q Y, Li C, Xie D and Tang H M. 2020. Dynamic GCN: context-enriched topology learning for skeleton-based action recognition//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM: 55-63 [DOI: 10.1145/3394171.3413941]
Ye Y C and Tian Y L. 2016. Embedding sequential information into spatiotemporal features for action recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Las Vegas, USA: IEEE: 1110-1118 [DOI: 10.1109/cvprw.2016.142]
Yu J, Gao H and Yang W. 2020. A discriminative deep model with feature fusion and temporal attention for human action recognition. IEEE Access, 8: 43243-43255
Zhang C, Zou Y X, Chen G and Gan L. 2020a. PAN: towards fast action recognition via learning persistence of appearance [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/2008.03462.pdf
Zhang J R, Shen F M, Xu X and Shen H T. 2019. Cooperative cross-stream network for discriminative action representation [EB/OL]. [2021-09-02]. https://arxiv.org/pdf/1908.10136.pdf
Zhang P F, Lan C L, Zeng W J, Xing J L, Xue J R and Zheng N N. 2020b. Semantics-guided neural networks for efficient skeleton-based human action recognition//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 1109-1118 [DOI: 10.1109/cvpr42600.2020.00119]
Zhu Y, Lan Z Z, Newsam S and Hauptmann A. 2018. Hidden two-stream convolutional networks for action recognition//Proceedings of the 14th Asian Conference on Computer Vision. Perth, Australia: Springer: 363-378 [DOI: 10.1007/978-3-030-20893-6_23]