Fine-grained food image recognition of a multi-level convolution feature pyramid
2019, Vol. 24, No. 6: 870-881
Print publication date: 2019-06-16
DOI: 10.11834/jig.180495
Huagang Liang, Xiaoqian Wen, Dandan Liang, Huaide Li, Feng Ru. Fine-grained food image recognition of a multi-level convolution feature pyramid[J]. Journal of Image and Graphics, 2019,24(6):870-881.
Objective
Food images have special characteristics: uncertainties in food appearance, complex backgrounds, inter-class similarities, and intra-class differences. Hence, these images are more difficult to identify than ordinary fine-grained pictures. Traditional food image recognition mainly uses manually designed features, including color, the histogram of oriented gradients (HOG), and local binary patterns (LBP), and then applies a classifier (e.g., a support vector machine (SVM)) to those features. However, manually designed features cannot establish connections between the various features; several integrated-feature methods only superimpose numerous features, so their recognition accuracy on each food image dataset reaches only about 70%. Compared with the weak expressive capability of manually designed features, deep learning offers a stronger feature representation capability: large-scale, labeled food images are used to train multi-level convolutional neural network models for food image recognition, improving recognition accuracy. However, current methods that apply a convolutional neural network to food image classification input the food image directly into the network to extract features, and a food image contains relatively complicated background information that critically influences the recognition result. We developed a model, called the multi-level convolution feature pyramid, for fine-grained food image recognition to improve accuracy and take full advantage of global information and local details.
Method
We extracted features from the whole image down to local regions, which not only avoids the shortcomings of baseline methods but also retains global information and local details. Features were extracted only from the target areas of the food image, and background information with large interference was discarded. The multi-level convolution feature pyramid model consists of three main parts, namely, the food feature extraction, attention localization, and feature fusion networks. A single-level feature extraction network cannot obtain the global and local features of a food image simultaneously, so we developed a three-level food feature extraction network by cascading, which transfers features from global to local. Moreover, a feature pyramid network was constructed between the feature maps of each food feature extraction network to deal with the large variation in food image scale. To automatically direct the network to the fine-grained area, an attention area localization network was designed between the levels of the feature extraction network, and the feature extraction range was reduced from global to local. The fine-grained area of the original picture was then cropped, enlarged, and inputted to the next-level feature extraction network. The features extracted by each level of the feature extraction network were subsequently sent to the feature fusion network; the merged features include the global features of the food image and the detailed features of the food target. Two loss functions were used to optimize the feature extraction, feature fusion, and attention localization networks: the softmax loss function, referred to as the classification loss, for the feature extraction and feature fusion networks, and an inter-stage loss function for the attention area localization network.
Result
We adopted step-by-step and alternating training methods to train the feature extraction and attention localization networks and the cascaded model separately. We conducted experiments on the current mainstream public food image datasets. Our model obtained Top-1 accuracy rates of 91.4%, 82.8%, and 90.3% on the Food-101, ChineseFoodNet, and Food-172 datasets, respectively. The implemented framework showed the best performance compared with baselines for food picture recognition, with a 1% to 8% improvement in recognition accuracy. Moreover, to verify the performance of our model fully, we also trained it on Food-202, a dataset we constructed ourselves. Food-202 is a food image dataset of 202 classes covering Chinese and Western food, with more than 1 000 images per class. Results show that the accuracy of the model with the feature pyramid network increased by 2.4%.
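The alternating training mentioned above (optimizing the feature extraction/fusion networks and the attention networks in turn, each with its own loss) can be illustrated with a toy gradient descent. The two quadratic "losses" below are placeholders, not the paper's actual classification and inter-stage losses; only the alternation pattern is the point.

```python
import numpy as np

# Toy stand-ins: w_cls are the feature extraction/fusion parameters
# trained with the classification (softmax) loss; w_att are the
# attention network parameters trained with the inter-stage loss.
rng = np.random.default_rng(1)
w_cls = rng.standard_normal(4)
w_att = rng.standard_normal(4)

def cls_loss_grad(w):      # placeholder gradient of the classification loss
    return 2 * w           # simple quadratic bowl with minimum at 0

def att_loss_grad(w):      # placeholder gradient of the inter-stage loss
    return 2 * (w - 0.5)   # quadratic bowl with minimum at 0.5

lr = 0.1
for epoch in range(100):
    if epoch % 2 == 0:     # even epochs: freeze attention, train classifier
        w_cls -= lr * cls_loss_grad(w_cls)
    else:                  # odd epochs: freeze classifier, train attention
        w_att -= lr * att_loss_grad(w_att)

print(np.round(w_cls, 3), np.round(w_att, 3))  # w_cls -> ~0, w_att -> ~0.5
```

Each phase holds one set of parameters fixed while descending the other loss, which is the alternating schedule the training description above refers to.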
Conclusion
We built a fine-grained food image recognition model with a multi-level feature pyramid convolutional neural network. The model automatically locates highly discriminative areas of food images and integrates their global and local features to achieve fine-grained recognition, effectively enhancing recognition accuracy and robustness to target size. Experimental results show that the proposed model performs better than baseline models on the current mainstream food image datasets.
food picture recognition; convolutional neural network; attention network; fine-grained recognition; feature pyramid