
Abstract
Objective: Crowd density estimation aims to estimate the density distribution and the crowd count by extracting and analyzing crowd features. A common strategy extracts crowd features at different scales with a multi-scale convolutional neural network (CNN) and averages the fused features to produce the final density estimate. However, the down-sampling operations in a CNN discard part of the crowd information, and the averaging fusion dilutes the multi-scale effect, so this strategy does not necessarily yield accurate estimates. To address these problems, this paper proposes a new multi-scale crowd density estimation model based on adversarial dilated convolutions. Method: The proposed model is built on the dilated convolution originally introduced for image semantic segmentation. On the one hand, dilated convolution extracts features from the input image without losing resolution, improving the accuracy of density estimation; different dilation coefficients aggregate multi-scale context information, handling the scale variation caused by large differences in crowd distribution and by perspective distortion. On the other hand, the adversarial loss function fuses the features extracted at different scales in a collaborative way, so that the scales complement one another and jointly produce an accurate density estimate. Result: Experiments on four major datasets, evaluated by the mean absolute error (MAE) and mean squared error (MSE) of the counting results, show that the model outperforms existing crowd density estimation algorithms. Conclusion: This paper proposes a new multi-scale crowd density estimation model based on adversarial dilated convolutions. The experimental results show that the model adapts well to scenes with widely varying crowd distributions; it can extract features and estimate the density distribution for different scenes and count the crowd accurately.
Multi-Scale Crowd Counting via Adversarial Dilated Convolutions

Liu Siqi, Lang Congyan, Feng Songhe (Beijing Jiaotong University)

Objective Crowd counting estimates the count and density distribution of a crowd by extracting and analyzing crowd features. A common strategy extracts crowd features at different scales with multi-scale convolutional neural networks (CNNs) and then fuses them to yield the final density estimate. However, the down-sampling operations in a CNN lose part of the crowd information, and the fusion method induces a model-averaging effect across scales, so this strategy does not necessarily produce accurate estimates. To solve these problems, we propose a novel model, multi-scale crowd counting via adversarial dilated convolutions. Method Our model is built on the dilated convolution originally proposed for image semantic segmentation. In image segmentation, the most common approach is a CNN: the input image is convolved and then pooled, so the feature map shrinks while the receptive field of the network grows. Because segmentation is a pixel-wise problem, the pooled feature map must then be up-sampled back to the original image size for prediction (usually realized with a deconvolution). Segmentation thus hinges on two operations: pooling, which reduces the image size and enlarges the receptive field, and up-sampling, which restores the image size. Shrinking and then resizing inevitably loses some information. Dilated convolution, which keeps the feature-map scale consistent throughout the network, was proposed to solve this issue.
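To make the receptive-field arithmetic concrete, here is a minimal sketch (not code from the paper) of the standard dilation formula: a k x k kernel with dilation rate d spans (k - 1) * d + 1 pixels per side, so the receptive field grows without pooling and without adding weights.

```python
def effective_kernel_size(k: int, d: int) -> int:
    """Spatial extent covered by a k x k kernel dilated by rate d:
    (k - 1) * d + 1 pixels per side, with zeros in the gaps."""
    return (k - 1) * d + 1

# A 3x3 kernel at dilation rates 1, 2, 4 spans 3, 5, 9 pixels per side,
# while the number of learned weights stays 3 * 3 = 9 in every case.
for d in (1, 2, 4):
    print(d, effective_kernel_size(3, d))
```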
The specific principle is to remove the pooling layers that reduce the resolution of the feature map. Without pooling, however, the model cannot learn a global view of the image, and simply enlarging the convolution kernel makes the computation grow sharply and overloads memory. To enlarge the kernel and increase the receptive field instead, the original kernel is expanded by a dilation coefficient, with the empty positions filled with zeros. The receptive field grows because the kernel is expanded, while the number of effective computation points in the kernel stays the same, so the computational cost is unchanged. Moreover, the scale of every feature map is preserved, so the image information is retained. The model proposed in this paper is based on adversarial dilated convolutions. On the one hand, dilated convolution extracts features from the input image without losing resolution, and the module uses different dilation rates to aggregate multi-scale context information. On the other hand, the adversarial loss function improves the accuracy of the estimates by fusing the information of different scales in a collaborative way. Result The proposed method reduces the mean absolute error (MAE) and mean squared error (MSE) to 60.5 and 109.7 on Part_A of the ShanghaiTech dataset, and to 10.2 and 15.3 on Part_B; compared with existing methods, the MAE improves by 7.7 and 0.4, respectively. Averaged over the five video-sequence sets of the WorldExpo'10 dataset, the prediction improves by 0.66 over the classical algorithm. On the UCF_CC_50 dataset, MAE and MSE improve by 18.6 and 22.9, respectively, which shows that the estimation accuracy is improved and that the method is effective in scenes with widely varying crowd sizes.
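The mechanism described above can be sketched in NumPy (an illustrative implementation, not the paper's network): sampling input pixels `rate` apart is equivalent to inserting `rate - 1` zeros between kernel weights, so with matching zero padding the output keeps the input resolution at every dilation rate, and multi-scale branches can be aggregated directly.

```python
import numpy as np

def dilated_conv2d(img: np.ndarray, kernel: np.ndarray, rate: int) -> np.ndarray:
    """2-D dilated (atrous) cross-correlation with 'same' zero padding.
    The k*k effective computation points are unchanged; only their
    spacing (the dilation rate) grows the receptive field."""
    k = kernel.shape[0]
    pad = (k - 1) * rate // 2          # keeps output size == input size
    padded = np.pad(img, pad)
    out = np.zeros_like(img, dtype=float)
    for i in range(k):
        for j in range(k):
            out += kernel[i, j] * padded[i * rate:i * rate + img.shape[0],
                                         j * rate:j * rate + img.shape[1]]
    return out

img = np.random.rand(16, 16)
kernel = np.random.rand(3, 3)
# Branches with different dilation rates see context at several scales,
# all at full resolution; here they are simply summed for illustration.
multi_scale = sum(dilated_conv2d(img, kernel, r) for r in (1, 2, 4))
print(multi_scale.shape)  # (16, 16): resolution is preserved at every rate
```

A quick sanity check of the padding: a kernel whose only nonzero weight is the center 1 reproduces the input exactly at any dilation rate.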
On the UCSD dataset, however, the MAE only decreases to 1.02 and the MSE does not improve, which suggests that the adversarial loss function limits the robustness of crowd counting in low-density environments. Conclusion A new learning strategy, multi-scale crowd counting via adversarial dilated convolutions, is proposed in this paper. To preserve as much image information as possible, the network uses dilated convolutions, and dilated convolutions with different dilation coefficients aggregate multi-scale contextual information, which solves the problem of counting heads of different scales caused by viewing-angle differences in crowd-scene images. The adversarial loss function exploits the image features extracted by the network to estimate the crowd density. The experimental results show that the model adapts well to scenes with widely varying crowd distributions: it can estimate the density distribution for different scenes and count the crowd accurately.
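The MAE and MSE criteria used throughout the Result section can be sketched as follows (a minimal illustration with made-up counts, not the paper's data; note that in the crowd-counting literature "MSE" conventionally denotes the root of the mean squared count error):

```python
import math

def mae(pred_counts, gt_counts):
    """Mean absolute error between predicted and ground-truth crowd counts."""
    n = len(gt_counts)
    return sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / n

def mse(pred_counts, gt_counts):
    """Root of the mean squared count error, the quantity usually
    reported as 'MSE' in crowd-counting papers."""
    n = len(gt_counts)
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred_counts, gt_counts)) / n)

# Hypothetical per-image counts, for illustration only.
pred = [105, 230, 98]
gt = [100, 240, 90]
print(mae(pred, gt))   # (5 + 10 + 8) / 3
print(mse(pred, gt))   # sqrt((25 + 100 + 64) / 3)
```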