Fine-grained food image recognition of a multi-level convolution feature pyramid
2019, Vol. 24, No. 6: 870-881
Print publication date: 2019-06-16
DOI: 10.11834/jig.180495
Huagang Liang, Xiaoqian Wen, Dandan Liang, Huaide Li, Feng Ru. Fine-grained food image recognition of a multi-level convolution feature pyramid[J]. Journal of Image and Graphics, 2019,24(6):870-881.
Objective
Food images have special characteristics: uncertainties in food appearance, complex backgrounds, inter-class similarities, and intra-class differences. Hence, these images are more difficult to identify than ordinary fine-grained pictures. Traditional food image recognition mainly uses manually designed features, including color, the histogram of oriented gradients (HOG), and local binary patterns (LBP), and then applies a classifier (e.g., a support vector machine (SVM)) to those features. However, manually designed features cannot establish connections between the various features; several integrated-feature methods only superimpose numerous features, so their recognition accuracy on each food image dataset reaches only about 70%. Compared with the weak expressive capability of manually designed features, deep learning offers a stronger feature representation capability: large-scale, labeled food images are used to train multi-level convolutional neural network models for food image recognition, improving recognition accuracy. However, current methods that apply a convolutional neural network to food image classification input the food image directly into the network to extract features, and a food image contains relatively complicated background information that critically influences the recognition result. We developed a model, called the multi-level convolution feature pyramid, for fine-grained food image recognition to improve accuracy and take full advantage of global information and local details.
Method
We extracted features from the whole image down to local regions, which not only avoids the shortcomings of baseline methods but also retains global information and local details. Features were extracted only from the target areas of the food image, and background information with large interference was discarded. The multi-level convolution feature pyramid model consists of three main parts, namely, the food feature extraction, attention localization, and feature fusion networks. A single-level feature extraction network cannot obtain the global and local features of a food image simultaneously, so we developed a three-level food feature extraction network by cascading, which transfers features from global to local. Moreover, a feature pyramid network was constructed between the feature maps of each food feature extraction network to deal with the large variation in food image scale. To automatically direct the network to the fine-grained area, an attention area localization network was designed between the levels of the feature extraction network, and the feature extraction range was reduced from global to local. The fine-grained area of the original picture was then cropped, enlarged, and inputted to the next-level feature extraction network. The features extracted by each level of the feature extraction network were subsequently sent to the feature fusion network; the merged features include the global features of the food image and the detailed features of the food target. Two loss functions were used to optimize the feature extraction, feature fusion, and attention localization networks: the softmax loss function, referred to as the classification loss, for the feature extraction and feature fusion networks, and an inter-stage loss function for the attention area localization network.
Result
We adopted step-by-step and alternating training methods to train the feature extraction and attention localization networks and the cascaded model separately. We conducted experiments on the current mainstream public food image datasets. Our model obtained Top-1 accuracy rates of 91.4%, 82.8%, and 90.3% on the Food-101, ChineseFoodNet, and Food-172 datasets, respectively. The implemented framework showed the best performance compared with baselines for food picture recognition, with a 1% to 8% improvement in recognition accuracy. Moreover, to verify the performance of our model fully, we also trained it on Food-202, a dataset we constructed ourselves. Food-202 is a food image dataset of 202 classes covering Chinese and Western food, with more than 1 000 images per class. Results show that the accuracy of the model with the feature pyramid network increased by 2.4%.
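The alternating training mentioned above (optimizing the feature extraction/fusion networks and the attention networks in turn, each with its own loss) can be illustrated with a toy gradient descent. The two quadratic "losses" below are placeholders, not the paper's actual classification and inter-stage losses; only the alternation pattern is the point.

```python
import numpy as np

# Toy stand-ins: w_cls are the feature extraction/fusion parameters
# trained with the classification (softmax) loss; w_att are the
# attention network parameters trained with the inter-stage loss.
rng = np.random.default_rng(1)
w_cls = rng.standard_normal(4)
w_att = rng.standard_normal(4)

def cls_loss_grad(w):      # placeholder gradient of the classification loss
    return 2 * w           # simple quadratic bowl with minimum at 0

def att_loss_grad(w):      # placeholder gradient of the inter-stage loss
    return 2 * (w - 0.5)   # quadratic bowl with minimum at 0.5

lr = 0.1
for epoch in range(100):
    if epoch % 2 == 0:     # even epochs: freeze attention, train classifier
        w_cls -= lr * cls_loss_grad(w_cls)
    else:                  # odd epochs: freeze classifier, train attention
        w_att -= lr * att_loss_grad(w_att)

print(np.round(w_cls, 3), np.round(w_att, 3))  # w_cls -> ~0, w_att -> ~0.5
```

Each phase holds one set of parameters fixed while descending the other loss, which is the alternating schedule the training description above refers to.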
Conclusion
We built a fine-grained food image recognition model with a multi-level feature pyramid convolutional neural network. The model automatically locates highly discriminative areas of food images and integrates their global and local features to achieve fine-grained recognition, effectively enhancing recognition accuracy and robustness to target size. Experimental results show that the proposed model performs better than baseline models on the current mainstream food image datasets.
food picture recognition; convolutional neural network; attention network; fine-grained recognition; feature pyramid