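Objective Food images exhibit variable structure, heavy background interference, small inter-class differences, and large intra-class differences, which makes them harder to recognize than ordinary fine-grained images. Food image recognition and classification still suffer from low accuracy and poor generalization. To improve the accuracy of food image recognition and classification, make full use of the local details in food images, and avoid the limitation of earlier methods that attend only to global information, this paper proposes a fine-grained food image recognition model based on a multi-level convolutional feature pyramid. Method The model extracts features level by level from the whole image to local regions, discards the background information that causes heavy interference, and extracts features only from the food target region. It consists of three main parts: a food feature extraction network, an attention region localization network, and feature fusion, which handle feature extraction, localization of fine-grained local regions, and fusion of global and local features, respectively. Because a single-level food feature extraction network cannot capture the global and local features of a food image at the same time, three food feature extraction networks are cascaded so that features shift from global to local. In addition, to handle the large scale variation of food images, a feature pyramid network is built between the feature maps of each level of the food feature extraction network, which makes the model more robust to target size. Result The model was evaluated on the mainstream public food image datasets Food-101, ChineseFoodNet, and Food-172, and achieved Top-1 accuracies of 91.4%, 82.8%, and 90.3%, respectively, an improvement of 1%-8% over existing methods. Conclusion This paper proposes a multi-level convolutional neural network model for food image recognition that automatically locates the most discriminative regions of a food image and fuses global and local features to achieve fine-grained recognition, effectively improving recognition accuracy. Experimental results show that the model achieves the best results on the current mainstream food image datasets.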
Fine-grained food image recognition of multi-level convolution feature pyramid
Objective Food images have distinctive characteristics, including uncertain appearance, complex backgrounds, inter-class similarities, and intra-class differences, which make them more difficult to identify than ordinary fine-grained images. Traditional food image recognition mainly relied on hand-crafted features, such as color, HOG, and LBP, and then applied a classifier (e.g., SVM) to those features. However, hand-crafted features cannot capture the connections among the various features, and even integrated-feature methods merely stack numerous features together, so the recognition accuracy on each food image dataset is only up to 70%. Compared with the weak expressive ability of hand-crafted features, deep learning has a stronger feature representation ability. Training multi-level convolutional neural network models on large-scale labeled food images has improved recognition accuracy. However, current methods that use convolutional neural networks for food image classification feed the food image directly into the network to extract features. Because food images contain relatively complicated background information, this background strongly affects the recognition result. We propose a model, named the multi-level convolution feature pyramid for fine-grained food image recognition, to improve the accuracy of food image recognition and take full advantage of local details. Method Our model extracts features from the whole image to local regions, which not only avoids the shortcoming of the baseline methods but also retains both global information and local details. We extract features only from the target areas of the food and discard the background information that causes large interference. The multi-level convolution feature pyramid model consists of three main parts: a food feature extraction network, an attention-localization network, and a feature fusion network.
Because a single-level feature extraction network cannot obtain the global and local features of a food image at the same time, we propose a cascaded structure of three food feature extraction networks, which transfers features from global to local. A feature pyramid network is constructed between the feature maps of each food feature extraction network to deal with the large variation in food image scale. To locate fine-grained regions automatically, an attention-localization network is placed between successive levels of the feature extraction network, narrowing the feature extraction range from global to local. The fine-grained region of the original image is then cropped, enlarged, and fed to the next-level feature extraction network. Finally, the features extracted by each level of the feature extraction network are sent to the feature fusion network for fusion. The fused features include both the global features of the food image and the detailed features of the food target. Two loss functions are used to optimize the feature extraction network, the feature fusion network, and the attention-localization network. The feature extraction network and the feature fusion network use the Softmax loss function, referred to as the classification loss function, and the attention-localization network uses the inter-stage loss function. Result We adopt step-by-step training and alternating training to train the feature extraction network, the attention-localization network, and the cascade model separately. We conducted experiments on the current mainstream food image datasets, and our model obtained Top-1 accuracy rates of 91.4%, 82.8%, and 90.3% on Food-101, ChineseFoodNet, and Food-172, respectively. The model showed the best performance compared with baseline methods for food image recognition, with a 1%-8% improvement in recognition accuracy.
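The crop-and-enlarge step and the two loss functions described in the Method part above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the nearest-neighbour upsampling, the square crop window, the margin value, and in particular the pairwise ranking form of the inter-stage loss are all assumptions made for illustration, since the abstract names the losses without defining them.

```python
import math


def crop_and_enlarge(image, cx, cy, half, out_size):
    """Crop a square window centred at (cx, cy) and upsample it.

    `image` is a 2-D list of pixel values. Nearest-neighbour upsampling
    is an assumption; the abstract only says "cropped and enlarged".
    """
    h, w = len(image), len(image[0])
    # Clamp the window so that it stays inside the image.
    x0, x1 = max(0, cx - half), min(w, cx + half)
    y0, y1 = max(0, cy - half), min(h, cy + half)
    crop = [row[x0:x1] for row in image[y0:y1]]
    ch, cw = len(crop), len(crop[0])
    # Map every output pixel back to its nearest source pixel in the crop.
    return [
        [crop[i * ch // out_size][j * cw // out_size] for j in range(out_size)]
        for i in range(out_size)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(logits, label):
    """Softmax (cross-entropy) loss for one sample."""
    return -math.log(softmax(logits)[label])


def inter_stage_loss(logits_coarse, logits_fine, label, margin=0.05):
    """Hypothetical inter-stage loss written as a pairwise ranking loss.

    It is zero once the finer level assigns the true class a probability
    at least `margin` higher than the coarser level does. Both the form
    and the margin value are assumptions, not taken from the paper.
    """
    p_coarse = softmax(logits_coarse)[label]
    p_fine = softmax(logits_fine)[label]
    return max(0.0, p_coarse - p_fine + margin)


# Toy 4x4 "image" whose pixel value encodes its position.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
zoomed = crop_and_enlarge(toy_image, cx=2, cy=2, half=1, out_size=4)
```

Under this assumed formulation, the classification loss trains each feature extraction level and the fusion network, while the inter-stage term penalizes the localization step only when the cropped, enlarged region makes the next level less confident in the true class than the previous level.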
In addition, we trained the model on Food-202, a dataset we constructed ourselves, to verify its performance more fully. Food-202 is a food image dataset with 202 categories, each containing more than 1000 images, and it covers both Chinese and Western food. The results show that adding the feature pyramid network increases the accuracy of the model by 2.4%. Conclusion We propose a fine-grained food image recognition model based on a multi-level feature pyramid convolutional neural network, which automatically locates the most discriminative areas of food images and integrates their global and local features to achieve fine-grained recognition. Our model effectively enhances the accuracy of food recognition and the robustness to target size. Experimental results show that the model achieves the best results on the current mainstream food image datasets.