Abstract
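Objective Compared with text, a sketch can explicitly capture the appearance and structure of an object. Traditional sketch-based image retrieval methods concentrate on retrieving images of the same category and ignore the fine-grained features of sketches. To address this problem, we propose a new sketch-based image retrieval method that combines fine-grained features with a deep convolutional network. The method attends not only to holistic matching through deep cross-domain learning but also to the matching of fine-grained details. Method First, a multi-channel hybrid convolutional neural network is constructed that processes sketches and natural images differently. Second, an attention model is added to the network to obtain fine-grained features. Finally, the coarse and fine features are fused and a similarity measure is computed to obtain the retrieval result. Result Experiments were carried out on different datasets and compared with five baseline methods: the traditional SIFT and HOG features and the deep sketch models Deep SaN, Deep 3D and Deep TSN. Top-1 and top-10 accuracy were used for evaluation. On the shoe dataset, the top-1 accuracy of the proposed method improves by 12%; on the chair dataset, it improves by 11%. Top-10 accuracy also improves by 3%, so the proposed method achieves higher accuracy than traditional sketch retrieval methods. In the experimental results, given a sketch, the proposed method returns the target image in first position for the great majority of queries, which meets the goal of instance-level sketch retrieval. Conclusion This paper proposes a new sketch-based image retrieval method that offers a new approach to cross-domain retrieval between sketches and natural images. It performs instance-level sketch retrieval, and its retrieval accuracy is clearly higher than that of existing methods, which demonstrates the feasibility of the method.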
Sketch-Based Image Retrieval Based on Fine-Grained Feature and Deep Convolutional Neural Network

Li Zongmin, Liu Xiuxiu, Liu Yujie, Li Hua (College of Computer and Communication Engineering, China University of Petroleum, Qingdao; Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing)

Objective Content-based and text-based image retrieval have played a major role in practical computer vision applications. In some scenarios, however, example queries are not available or the target is difficult to describe with keywords. Compared with text, sketches intrinsically capture object appearance and structure. They are highly intuitive to humans and descriptive in nature, and they provide a convenient way to specify what an object looks like. As a query modality, they offer a degree of precision and flexibility that is missing in traditional text-based image retrieval. With the rapid spread of touch-screen devices, sketch-based image retrieval (SBIR) has become an increasingly prominent research topic in recent years. Conventional SBIR principally focuses on retrieving images of the same category and neglects the fine-grained features of sketches. SBIR is also challenging because humans draw free-hand sketches without any reference and focus only on the salient object structures. As a result, the shapes and scales in sketches are usually distorted compared with natural images. To deal with this problem, some studies have attempted to bridge the domain gap between sketches and natural images. These methods can be roughly divided into two groups: hand-crafted methods and cross-domain deep learning-based methods. Hand-crafted methods first generate approximate sketches by extracting edge or contour maps from the natural images. Hand-crafted features are then extracted from both the sketches and the edge maps of the natural images and fed into "Bag-of-Words" methods to generate the representations used for retrieval.
The major limitation of hand-crafted methods is that they cannot adequately close the domain gap between sketches and natural images, because it is difficult to match edge maps to non-aligned sketches that exhibit large variations and ambiguity. To address this problem, we propose a novel sketch-based image retrieval method based on fine-grained features and a deep convolutional neural network. This fine-grained SBIR (FG-SBIR) approach focuses not only on coarse holistic matching via deep cross-domain learning, but also on explicitly matching fine-grained details. The proposed deep convolutional neural network is designed specifically for sketch-based image retrieval. Method Most existing SBIR works focus on category-level sketch-to-photo retrieval. A bag-of-words representation combined with some form of edge detection on photo images is often employed to bridge the domain gap. Previous work that attempted to address the fine-grained SBIR problem is based on a deformable part-based model and graph matching. However, their definition of fine-grained is very different from ours: a sketch is considered a match to a photo if the depicted objects look similar. In addition, these approaches based on hand-crafted features are inadequate both for bridging the domain gap and for capturing the subtle intra-category and inter-instance differences, as demonstrated in our experiments. Our method proceeds as follows. First, we construct a multi-branch hybrid deep convolutional neural network that processes sketches and natural images differently. There are three branches: one sketch branch and two natural image branches. The sketch branch has four convolutional layers and two pooling layers, whereas each natural image branch has five convolutional layers and two pooling layers. The additional convolutional layer extracts more abstract natural image features, which addresses the inconsistency in abstraction level between the two domains. This design with different branches reduces the domain differences.
Second, we extract detail information by adding an attention model to the neural network. Most attention models learn an attention mask that assigns different weights to different regions of an image. Soft attention is the most commonly used form because it is differentiable and can therefore be learned end-to-end with the rest of the network. Our attention model is specifically designed for FG-SBIR in that it is robust against spatial misalignment through a shortcut connection architecture. Third, we combine coarse and fine semantic information to perform retrieval; combining the two yields more robust features. Finally, we train with a deep triplet loss, defined within the max-margin framework, to improve the results. Result We conduct experiments on two benchmark datasets: a shoe dataset and a chair dataset. We compare against five baselines. Two are traditional models based on hand-crafted features, SIFT and HOG. The other three, Deep SaN, Deep 3D and Deep TSN, use deep features designed for sketches. As evaluation metrics, we use the ratio of queries for which the true match is correctly ranked at top-1 and within the top-10. The results show that the proposed method achieves higher retrieval precision than the traditional methods. Our full model performs best overall on each metric and on both datasets. The improvement is particularly clear at top-1, with an increase of around 12% on the shoe dataset and around 11% on the chair dataset. At top-10, the increase is around 3%. In other words, for most queries the correct photo is returned as the first result. The aim of the proposed method is instance-level retrieval, and the results show that the proposed model performs better on the FG-SBIR task. Conclusion The proposed sketch-based image retrieval method provides a new way of thinking about cross-domain retrieval between sketches and natural images.
The proposed sketch convolutional neural network achieves better results for sketch-based image retrieval. This task is more challenging than the well-studied category-level SBIR task, but it is also more useful for commercial SBIR adoption. Achieving fine-grained retrieval across the sketch/image gap requires a deep network learned with triplet annotation requirements. We demonstrated how to sidestep these requirements in order to achieve good performance at this new and challenging task. By introducing attention modelling and a sketch convolutional neural network, the model is able to concentrate on the subtle differences between local regions of a sketch and photo images and to compute deep features containing both fine-grained and high-level semantics. The proposed sketch neural network is suitable for FG-SBIR.
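As an illustration of the max-margin triplet loss and the top-K evaluation metric mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. The abstract states only that the loss follows the max-margin framework; the squared Euclidean distance, the margin value of 0.3, the function names and the toy feature vectors are all assumptions made for this example.

```python
def euclidean_sq(a, b):
    """Squared Euclidean distance between two equal-length feature vectors
    (the choice of distance is an assumption, not taken from the paper)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(sketch, positive, negative, margin=0.3):
    """Max-margin triplet loss for one (sketch, matching photo, other photo) triplet.

    The loss is zero once the matching photo is closer to the sketch than the
    non-matching photo by at least `margin` (hypothetical value).
    """
    return max(0.0, margin + euclidean_sq(sketch, positive)
               - euclidean_sq(sketch, negative))


def top_k_accuracy(sketch_feats, photo_feats, true_index, k):
    """Fraction of sketch queries whose true photo is among the k nearest photos."""
    hits = 0
    for feat, target in zip(sketch_feats, true_index):
        # Rank all gallery photos by distance to the query sketch feature.
        ranked = sorted(range(len(photo_feats)),
                        key=lambda j: euclidean_sq(feat, photo_feats[j]))
        if target in ranked[:k]:
            hits += 1
    return hits / len(sketch_feats)


# Toy 2-D features standing in for learned embeddings (hypothetical data).
photos = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sketches = [[0.1, 0.0], [0.4, 0.1], [0.1, 0.8]]
truth = [0, 1, 2]  # index of the matching photo for each sketch

print(top_k_accuracy(sketches, photos, truth, k=1))
print(top_k_accuracy(sketches, photos, truth, k=2))
print(triplet_loss(sketches[1], photos[1], photos[0]))
```

In this toy example the second sketch lies closer to a non-matching photo than to its true match, so it counts as a miss at top-1 but a hit at top-2, and its triplet loss is positive. This is the situation the loss is meant to penalize during training.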