Abstract: Image blurring refers to the loss of clarity and detail in an image during its capture or transmission due to factors such as motion of the lens or camera, lighting conditions, and other environmental variables. This loss of quality and usability can significantly influence the overall visual impact of the image. The technique of image deblurring has been developed to mitigate such effects. Its purpose is to predict the clear version of an image automatically by constructing mathematical models that measure the blurriness of the image. The research and development of image deblurring algorithms have not only provided convenience for other tasks in the field of computer vision, such as object detection, but have also supported applications in daily life, such as security monitoring. Depending on its cause, blurring can mainly be divided into motion blur, out-of-focus blur, and Gaussian blur. Out-of-focus and Gaussian blurs are less prevalent and relatively easier to handle, whereas motion blur is more likely to occur in situations such as road traffic cameras, pedestrian movement, and fast-moving vehicles, making it a more critical issue to be addressed. After an image is deblurred, evaluating the quality of the results becomes essential, which is carried out using image quality assessment (IQA) methods, categorized as either subjective or objective. Objective evaluation methods can be divided into three types: full-reference, reduced-reference, and no-reference. Owing to resource constraints, objective evaluation methods make up the majority of IQA approaches. The process of image blurring can be represented as the convolution of a clear image with a blur kernel, accompanied by varying degrees of noise. Therefore, image deblurring comprises two types: non-blind image deblurring (NBID) and blind image deblurring (BID). Non-blind deblurring involves the restoration of an image with a known blur kernel, requiring prior knowledge of the blur kernel's parameters. By contrast, blind deblurring aims to restore images with unknown blur kernels or unknown clear images, posing a more challenging problem because of the increased number of unknown factors. In light of these considerations, we systematically and critically review the recent advancements in image deblurring. A comprehensive and systematic introduction to image deblurring is presented from the following two aspects: 1) the evolution of traditional image deblurring and 2) the development of deep learning-based image deblurring. From the perspective of traditional image deblurring, existing methods can be divided into two categories: non-blind deblurring and blind deblurring. Specifically, traditional NBID algorithms rely on prior knowledge of the blur kernel for the restoration process; common methods include denoising- and iteration-based methods. Traditional BID methods primarily estimate the blur kernel first and then transform the problem into an NBID problem; the kernel and the clear image are often estimated iteratively until satisfactory results are obtained. Emerging deep learning methods extract blurred-image features by training a neural network and update the model with logistic regression. Unlike traditional methods that require prior knowledge of the degree of image blur, deep learning-based methods can directly process blurry images without prior estimation of the blur degree.
From the perspective of network architecture, deep learning-based image deblurring algorithms can be classified into convolutional neural network (CNN)-based, recurrent neural network (RNN)-based, generative adversarial network (GAN)-based, and Transformer-based networks. CNN-based methods can learn the mapping between blurry and clear images by training on a large number of image pairs, which enables them to perform blind deblurring. These algorithms take advantage of parameter sharing and local receptive fields, reducing the number of model parameters and improving the accuracy of image feature extraction. RNN-based image deblurring relies on a type of neural network that handles sequential data by learning the relationships within it. GAN-based deblurring approaches define the image deblurring problem as an adversarial game between a generator and a discriminator. Transformer-based methods employ a self-attention mechanism to encode global dependencies between different spatial positions, thereby capturing the global information of an entire image. Our critical review focuses on the main concepts and discusses the characteristics of each image deblurring method from the perspective of network architecture. In particular, we summarize the limitations of the different deblurring algorithms. We also briefly introduce popular public datasets. Then, we review the image deblurring literature from two aspects: traditional methods and deep learning-based methods. The capability of representative algorithms is analyzed using the peak signal-to-noise ratio and structural similarity evaluation indexes on the GoPro, human-aware motion deblurring, and other datasets. Furthermore, this review concludes with a critical analysis highlighting the remaining challenges in image deblurring.
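As context for the degradation model described above (a blurry image as the convolution of a clear image with a blur kernel plus noise) and the PSNR metric used throughout this survey, the following is a minimal, illustrative NumPy/SciPy sketch; the kernel, noise level, and random test image are assumptions for demonstration only.

```python
# Illustrative sketch of the blur degradation model B = k * I + n and the PSNR metric.
import numpy as np
from scipy.signal import convolve2d

def motion_blur_kernel(length=9):
    """Horizontal motion-blur kernel of the given length (an illustrative choice)."""
    k = np.zeros((length, length))
    k[length // 2, :] = 1.0
    return k / k.sum()

def degrade(clear, kernel, noise_sigma=0.01):
    """Synthesize a blurry observation: convolve with the kernel and add Gaussian noise."""
    blurred = convolve2d(clear, kernel, mode="same", boundary="symm")
    noise = np.random.normal(0.0, noise_sigma, clear.shape)
    return np.clip(blurred + noise, 0.0, 1.0)

def psnr(reference, test, peak=1.0):
    """Peak signal-to-noise ratio, the full-reference metric cited in this review."""
    mse = np.mean((reference - test) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

clear = np.random.rand(128, 128)            # stand-in for a clear grayscale image
blurry = degrade(clear, motion_blur_kernel())
print(f"PSNR of the degraded image: {psnr(clear, blurry):.2f} dB")
```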
Abstract: Objective: Owing to the lack of sufficient environmental light, images captured from low-light scenes often suffer from several kinds of degradation, such as low visibility, low contrast, intensive noise, and color distortion. Such degradations not only lower the visual perception quality of the images but also reduce the performance of subsequent middle- and high-level vision tasks, such as object detection and recognition, semantic segmentation, and automatic driving. Therefore, images taken under low-light conditions should be enhanced to meet subsequent utilization. Low-light image enhancement is one of the most important low-level vision tasks; it aims at improving the illumination and recovering the image details of dark, noisy regions and has been intensively studied. Many impressive traditional methods and deep learning-based methods have been proposed. Methods based on traditional image processing techniques mainly include value mapping (such as histogram equalization and gamma correction) and model-based methods (such as the Retinex model and the atmospheric scattering model). However, they only improve image quality from a single perspective, such as contrast or dynamic range, and neglect degradations such as noise and the recovery of scene details. By contrast, with the great development of deep neural networks in low-level computer vision, deep learning-based methods can simultaneously optimize the enhancement results from multiple perspectives, such as brightness, color, and contrast. Thus, the enhancement performance is significantly improved. Although significant progress has been achieved, existing deep learning-based enhancement methods still have drawbacks, such as underenhancement, overenhancement, and color distortion in local areas, and the enhanced results are inconsistent with the visual characteristics of human eyes. In addition, given the high distortion of extremely low-light images, recovering scene details and suppressing noise amplification during enhancement are usually difficult. Therefore, increased attention should be paid to low-light image enhancement methods. To this end, a low-light image enhancement algorithm based on an illumination and scene texture attention map is proposed in this paper. Method: First, unlike in normal-light images, the illumination intensity of the RGB channels is obviously different in low-light images, leading to apparent color distortion. Color equalization is therefore performed for low-light images to reduce the influence of color distortion on the attention map estimation module. We implement color equalization using the illumination intensity of the RGB channels estimated with the dark channel prior so that the light intensity of each channel becomes similar. Second, considering that the minimum channel constraint map has the characteristics of noise suppression and texture prominence, we estimate the illumination and texture attention map of normal-exposure images on the basis of the minimum channel constraint map of low-light images and provide information guidance for the subsequent enhancement module. To this end, an attention map estimation module based on the U-Net architecture is proposed. Third, an enhancement module is developed to improve image quality from the perspectives of the whole image and local patches. In the global enhancement module, the estimated illumination and scene texture attention map is used to guide illumination adjustment and noise suppression.
The attention mechanism enables the network to allocate different attention to areas of various brightness in low-light images during training, helping the network focus on useful information effectively. The globally enhanced result is then divided into small patches to deal with underenhancement and overenhancement in local areas and to improve the results further. Result: To verify the effectiveness of the proposed method, we compare it with six state-of-the-art enhancement methods, including two traditional methods, semi-decoupled decomposition (SDD) and the plug-and-play Retinex model (PnPR), and four deep learning-based methods: EnlightenGAN, zero-reference deep curve estimation (Zero-DCE), the Retinex-based deep unfolding network (URetinex-Net), and signal-to-noise-ratio aware low-light image enhancement (SNR-aware). We use the digital images from commercial cameras (DICM), low-light image enhancement (LIME), multi-exposure image fusion (MEF), and 9 other datasets to construct 178 low-light images for testing. These low-light images have no normal-exposure images for reference. Quantitative and qualitative evaluations are performed. For the quantitative evaluation, the natural image quality evaluator (NIQE), blind tone-mapped quality index (BTMQI), and no-reference image quality metric for contrast distortion (NIQMC) are used to assess image quality. NIQE examines the image against a designed natural image model. BTMQI evaluates image perception quality after tone mapping by analyzing naturalness and structure. For NIQE and BTMQI, the lower the value, the higher the natural quality of the image. NIQMC evaluates image quality by calculating the contrast between local properties and the related properties of the blocks in the image; the higher the score, the better the image quality. On the VV dataset, which is a challenging dataset, our method obtains the best results for the BTMQI and NIQMC indicators. Experiments on the 178 low-light images show that our method achieves the second-best values for the BTMQI and NIQMC metrics, but its advantages in texture prominence and noise suppression are significant. Conclusion: Experimental results indicate that the results enhanced by our method achieve the expected visual effects in terms of brightness, contrast, and noise suppression. In addition, our method can realize the expected enhancement results for extremely low-light images.
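The abstract above mentions equalizing per-channel illumination and building a minimum channel constraint map before attention estimation. Below is a hedged sketch of these two ideas only; the per-channel illumination here is approximated by the channel mean rather than the paper's dark-channel-prior estimate, and all sizes are illustrative.

```python
# Illustrative sketch (not the authors' exact formulation) of channel equalization
# and a minimum-channel map that suppresses noise and highlights texture.
import numpy as np
from scipy.ndimage import minimum_filter

def equalize_channels(img):
    """Scale R, G, B so their estimated illumination intensities become similar.

    The per-channel illumination is approximated by the channel mean here; the
    paper estimates it with the dark channel prior, which is not reproduced.
    """
    channel_illum = img.reshape(-1, 3).mean(axis=0) + 1e-6
    target = channel_illum.mean()
    return np.clip(img * (target / channel_illum), 0.0, 1.0)

def minimum_channel_map(img, patch=3):
    """Per-pixel minimum over RGB followed by a local minimum filter."""
    min_rgb = img.min(axis=2)
    return minimum_filter(min_rgb, size=patch)

low_light = np.random.rand(64, 64, 3) * 0.2   # stand-in for a dark RGB image
balanced = equalize_channels(low_light)
attention_input = minimum_channel_map(balanced)
print(balanced.shape, attention_input.shape)
```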
Abstract: Objective: Image super-resolution aims to enhance the resolution and quality of low-resolution images, making them more visually appealing and suitable for human or machine recognition. By utilizing a series of degraded low-resolution images with coarse details, the objective is to reconstruct high-resolution images with finer details. The applications of super-resolution algorithms are vast and encompass areas such as object detection, medical pathological analysis, remote sensing satellite imagery, and security monitoring. The promising prospects of these applications have led to increased recognition of the importance of image super-resolution algorithms among researchers. With the advancement of deep learning in computer vision, deep learning has been successfully applied to image super-resolution, leading to significant achievements. However, the substantial number of parameters and the computational requirements of super-resolution models result in slow running speeds, limiting their practicality in real-world deployment, particularly on mobile and edge devices. To address this issue, several lightweight super-resolution models have been proposed. Among these models, the Transformer-based approach stands out because it provides rich detail information in reconstructed images. However, this type of model still suffers from computational redundancy and large model size. To overcome these challenges, this study presents a novel lightweight super-resolution network based on the Transformer architecture. Method: A blueprint separable convolution Transformer network (BSTN) is proposed for lightweight image super-resolution. BSTN is divided into three parts: shallow feature extraction, deep feature extraction, and image reconstruction. In the shallow feature extraction stage, a 3 × 3 standard convolution is employed to extract low-level features from the input image. This initial feature extraction step captures basic image information, which is transmitted directly to the tail of the network via a long skip connection to provide residual information. The deep feature extraction component is composed of four successive residual attention Transformer groups (RATGs). The key elements within this stage are the shift channel attention block (SCAB) and the blueprint multi-head self-attention block (BMSAB). SCAB and BMSAB are combined to form the hybrid attention Transformer block (HATB). Two HATBs are connected with a residual connection, and a standard convolution follows them to construct the RATG. The blueprint feed-forward neural network is first designed to effectively suppress low-information features and retain only relevant and useful information. Then, the blueprint feed-forward neural network is introduced into the two aforementioned attention blocks to efficiently extract the significant deep features for super-resolution. SCAB consists of three major components: shift convolution, contrast-aware channel attention, and the blueprint feed-forward neural network. Shift convolution reduces the number of network parameters and performs spatial information aggregation, enabling effective information fusion across different regions of the image. The contrast-aware channel attention mechanism focuses on important channel information, enhancing the representation of crucial features. BMSAB consists of a blueprint multi-head self-attention and a blueprint feed-forward neural network.
This module allows self-attention to be extracted with reduced computational complexity while suppressing low-information features through the blueprint feed-forward neural network. Finally, the shallow features extracted in the earlier stage and the deep features obtained from the RATGs are added together. The combined features are then processed with pixel shuffle, a technique that rearranges features to increase their spatial resolution. This final step generates the reconstructed high-resolution image with improved quality and detail. With the designed architecture and its specific components, the proposed lightweight super-resolution network achieves effective feature extraction, self-attention calculation, and image reconstruction, addressing the parameter redundancy and large model size commonly encountered in Transformer-based super-resolution models. Our method is implemented in PyTorch on an NVIDIA RTX 3090 GPU. The training datasets used in this study are DIV2K and Flickr2K, which consist of 800 and 1 000 images, respectively. The batch size is set to 32, and the patch size of the training data is set to 48 × 48 pixels. The initial learning rate is set to 5 × 10^(-4) and updated with the Adam optimizer using a cosine descent strategy, and the total number of iterations is 10^6. Result: The proposed method is compared with 11 state-of-the-art approaches on 4 datasets. According to the quantitative results, the proposed method achieves varying degrees of improvement at different magnifications and on different datasets, while its parameter count and floating-point operations remain low. When the magnification factor is 2, the peak signal-to-noise ratio (PSNR) of this model ranks first on Set5, Set14, BSD100, and Urban100. It performs especially well on Set5 and Set14, surpassing the second-best model by 0.11 dB and 0.08 dB, respectively. When the magnification factor is 3, the PSNR also ranks first, surpassing the second-best model on Set5 and Urban100 by 0.16 dB and 0.06 dB, respectively. When the magnification factor is 4, it still ranks first and outperforms the second-best models by 0.17 dB, 0.05 dB, and 0.04 dB on Set5, BSD100, and Urban100, respectively. According to the qualitative results, the images reconstructed by the proposed method are clear, blurred areas are small, and details are rich. Conclusion: A large number of comparative experiments and ablation studies demonstrate that the proposed BSTN not only achieves state-of-the-art super-resolution results with excellent quantitative and visual performance but also has fewer parameters and floating-point operations. In particular, the proposed blueprint separable multi-head self-attention can effectively perform self-attention in Transformer blocks through a concise structure. The proposed blueprint feed-forward neural network can focus on helpful information and filter out useless information for super-resolution, resulting in high efficiency and low cost, and it can be seamlessly integrated into other modules. Although our method performs well, its advantage in terms of model lightness is not yet pronounced and should be further enhanced.
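To make the building blocks named above concrete, here is a minimal PyTorch sketch of a blueprint separable convolution (a 1 × 1 pointwise convolution followed by a depthwise convolution) and pixel-shuffle reconstruction; the layer sizes and the overall wiring are illustrative assumptions, not the BSTN configuration.

```python
# Hedged sketch of blueprint separable convolution and pixel-shuffle upsampling.
import torch
import torch.nn as nn

class BlueprintSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.depthwise = nn.Conv2d(out_ch, out_ch, kernel_size,
                                   padding=kernel_size // 2, groups=out_ch)

    def forward(self, x):
        return self.depthwise(self.pointwise(x))

class Upsampler(nn.Module):
    """Expand features to the target resolution with pixel shuffle."""
    def __init__(self, channels, scale):
        super().__init__()
        self.conv = nn.Conv2d(channels, 3 * scale ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.conv(x))

feat = torch.randn(1, 64, 48, 48)                 # stand-in feature map
sr = Upsampler(64, scale=4)(BlueprintSeparableConv(64, 64)(feat))
print(sr.shape)                                   # torch.Size([1, 3, 192, 192])
```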
Abstract: Objective: Convolutional neural networks (CNNs) and self-attention (SA) have achieved great success in the field of multimedia applications; this study concerns the dynamic association learning of SA and convolution for image restoration. However, owing to the intrinsic characteristics of local connectivity and translation equivariance, CNNs have at least two shortcomings: 1) a limited receptive field and 2) static sliding-window weights at inference, which cannot cope with content diversity. The former prevents the network from capturing long-range pixel dependencies, while the latter sacrifices adaptability to the input content. As a result, CNNs are far from meeting the requirement of modeling the global rain distribution and generate results with obvious rain residue. Meanwhile, because of the global calculation of SA, its computational complexity grows quadratically with the spatial resolution, making it infeasible to apply to high-resolution images. In view of the advantages and disadvantages of these two architectures, this study proposes an association learning method that comprehensively utilizes the advantages of both and suppresses their respective shortcomings to achieve high-quality and efficient restoration. Method: This study combines the advantages of the CNN and SA architectures, fully utilizing CNNs' local perception and translation invariance for local context representation and SA's global aggregation ability for global structural representation. We take inspiration from the observation that the rain distribution reflects the degradation location and degree; therefore, in addition to predicting the rain distribution, we propose to refine background textures with the predicted degradation prior in an association learning manner. We accomplish image deraining by associating rain streak removal and background recovery, in which an image deraining network and a background recovery network are specifically designed for these two subtasks. The key part of association learning is a novel multi-input attention module (MAM). It generates the degradation prior and produces the degradation mask according to the predicted rain distribution. Benefiting from the global correlation calculation of SA, MAM can extract informative complementary components from the rainy input (query) with the degradation mask (key) and then help realize accurate texture restoration. SA tends to aggregate feature maps with attention importance, whereas convolution diversifies them to focus on local textures. Unlike Restormer, which is equipped with pure Transformer blocks, our design arranges SA and CNNs in parallel, and a hybrid fusion network is proposed. The network involves one residual Transformer branch (RTB) and one encoder-decoder branch (EDB). The former takes a few learnable tokens (feature channels) as input and stacks multi-head attention and feed-forward networks to encode the global features of the image. The latter, conversely, leverages a multiscale encoder-decoder to represent contextual knowledge. We propose a lightweight hybrid fusion block to aggregate the outcomes of the RTB and EDB to yield a final solution to the subtask. In this way, we construct our final model as a two-stage Transformer-based method, namely ELF, for single-image deraining. Result: An ablation experiment is conducted on the Test1200 dataset to validate the effectiveness of the various parts of the algorithm. The experimental results show that the fusion of CNN and SA can effectively improve the model's expression ability.
At the same time, the association learning of degradation removal and background recovery can effectively improve the overall restoration quality. The method proposed in this paper is compared with more than 10 recent methods on the synthetic and real data of three restoration tasks, and the proposed method achieves significant improvement. In the image deraining task, the ELF method improves the peak signal-to-noise ratio (PSNR) by 0.9 dB compared with the multi-stage progressive image restoration network (MPRNet) on the synthetic dataset Test1200. In the underwater enhancement task, ELF exceeds Ucolor by 4.15 dB on the R90 dataset. In the low-illumination image enhancement task, ELF achieves a 1.09 dB improvement over the LLFlow algorithm. Conclusion: We rethink image deraining as a composite task of rain streak removal, texture recovery, and their association learning and propose the ELF model for image deraining. Accordingly, a two-stage architecture and an association learning module are adopted in ELF to account for the two goals of rain streak removal and texture reconstruction while facilitating the learning capability. The joint optimization promotes compatibility while maintaining model compactness. Extensive results on image deraining and joint detection tasks demonstrate the superiority of our ELF model over state-of-the-art techniques. The proposed method is both efficient and effective and is superior to representative methods in common tasks such as image deraining, low-light image enhancement, and underwater image enhancement.
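The idea behind the multi-input attention module described above is a cross-attention in which the rainy input provides queries and the predicted degradation mask provides keys. The following is a generic, hedged PyTorch sketch of that pattern, not the exact ELF implementation; feature shapes and head counts are assumptions.

```python
# Hedged sketch of MAM-style cross-attention between rainy features and a degradation mask.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rainy_feat, mask_feat, value_feat):
        # Flatten spatial positions into a token sequence: (B, H*W, C).
        b, c, h, w = rainy_feat.shape
        q = rainy_feat.flatten(2).transpose(1, 2)   # queries from the rainy input
        k = mask_feat.flatten(2).transpose(1, 2)    # keys from the degradation mask
        v = value_feat.flatten(2).transpose(1, 2)   # values carrying complementary content
        out, _ = self.attn(q, k, v)
        return out.transpose(1, 2).reshape(b, c, h, w)

rainy = torch.randn(1, 32, 16, 16)
mask = torch.randn(1, 32, 16, 16)
value = torch.randn(1, 32, 16, 16)
print(CrossAttention(32)(rainy, mask, value).shape)  # torch.Size([1, 32, 16, 16])
```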
Abstract: Objective: Research on super-resolution image reconstruction based on deep learning techniques has made exceptional progress in recent years. In particular, when the development of traditional convolutional neural networks reached a bottleneck, Transformer, which performs extremely well in natural language processing, was introduced into super-resolution image reconstruction. However, the computational complexity of Transformer grows with the square of the width and height of the input image, so Transformer cannot be fully migrated to low-level computer vision tasks. Recent methods, such as image restoration using Swin Transformer (SwinIR), have achieved excellent performance by dividing windows, performing self-attention within the windows, and exchanging information between the windows. However, this window-division scheme increases the computational burden as the window size increases. Moreover, it cannot completely model the global information of an image, resulting in partial loss of information. To solve the above problems, we model the long-range dependencies of images by constructing a Transformer block while keeping the number of parameters moderate. Excellent super-resolution reconstruction performance is achieved by constructing global dependencies of features. Method: The proposed super-resolution network based on transposed self-attention (SRTSA) consists of four main stages: a shallow feature extraction module, a deep feature extraction module, an image upsampling module, and an image reconstruction module. The shallow feature extraction part consists of a 3 × 3 convolution. The deep feature extraction part mainly consists of a global and local information extraction block (GLIEB). Our proposed GLIEB performs simple relational modeling through a sufficiently lightweight nonlinear activation free block (NAFBlock). Although dropout can improve the robustness of the model, we discard the dropout layer to avoid losing information before modeling the feature information globally. When globally modeling the feature information with the transposed self-attention mechanism, we keep the features that have positive effects on image reconstruction and discard those with negative effects by replacing the softmax activation function in the self-attention mechanism with the ReLU activation function, which makes the reconstructed global dependencies more robust. Given that an image contains global and local information, a residual channel attention module is used to supplement the local information and enhance the expressive ability of the model. Furthermore, a new dual-channel gating mechanism is introduced to control the flow of information in the model to improve its feature modeling capability and robustness. The image upsampling module uses subpixel convolution to expand the features to the target dimension, and the reconstruction module employs a 3 × 3 convolution to obtain the final reconstruction results. For the loss function, although many loss functions have been proposed to optimize model training, we use the same L1 loss function as SwinIR to supervise model training in order to demonstrate the advancement and effectiveness of our model. The L1 loss function provides a stable gradient that allows the model to converge quickly. In the training phase, 800 images from the DIV2K dataset are used for training.
The 800 training images are randomly rotated or horizontally flipped to expand the dataset, and 16 LR image blocks of size 48 × 48 pixels are used as input in each iteration. The Adam optimizer is used for training. Result: We test on five datasets commonly used in super-resolution tasks, namely, Set5, Set14, Berkeley segmentation dataset 100 (BSD100), Urban100, and Manga109, to demonstrate the effectiveness and robustness of the proposed method. We also compare the proposed method with the SRCNN, VDSR, EDSR, RCAN, SAN, HAN, NLSA, and SwinIR networks in terms of objective metrics. These networks are supervised with the L1 loss function during training. The peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are calculated on the Y channel of the YCbCr space of the output image to measure the reconstruction quality. Experimental results show that the PSNR and SSIM values obtained by our method are both optimal. In the ×2 super-resolution task, compared with SwinIR, the PSNR of the proposed method is improved by 0.03 dB, 0.21 dB, 0.05 dB, 0.29 dB, and 0.10 dB on the five datasets, and the SSIM is enhanced by 0.000 4, 0.001 6, 0.000 9, and 0.002 7 on the four datasets other than Manga109. The reconstruction results demonstrate that SRTSA can recover more detailed information and more texture structure than most methods. Attribution analysis with local attribution maps (LAM) shows that SRTSA uses a larger range of pixels during reconstruction than other methods such as SwinIR, which fully illustrates the global modeling capability of SRTSA. Conclusion: The proposed super-resolution image reconstruction algorithm based on a transposed self-attention mechanism can fully model the global relationships of feature information without losing the local relationships of features by converting global relationship modeling in the spatial dimension into global relationship modeling in the channel dimension. It also captures global and local information, which effectively improves image super-resolution reconstruction performance. The excellent PSNR and SSIM on five datasets and the significantly high quality of the reconstructed images, with rich details and sharp edges, fully demonstrate the effectiveness and superiority of the proposed method.
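The transposed self-attention described above computes attention across the channel dimension and replaces softmax with ReLU. The sketch below illustrates that general idea in PyTorch; the normalization details, projection layers, and shapes are illustrative assumptions rather than the exact SRTSA design.

```python
# Hedged sketch of channel-wise (transposed) self-attention with ReLU instead of softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransposedSelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.to_qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.project = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=1)
        q = F.normalize(q.flatten(2), dim=-1)          # (B, C, HW)
        k = F.normalize(k.flatten(2), dim=-1)
        v = v.flatten(2)
        attn = F.relu(q @ k.transpose(1, 2))           # (B, C, C) channel-wise attention
        out = (attn @ v).reshape(b, c, h, w)
        return self.project(out)

x = torch.randn(1, 64, 48, 48)
print(TransposedSelfAttention(64)(x).shape)            # torch.Size([1, 64, 48, 48])
```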
Abstract: Objective: The large amounts of data obtained by various terminal devices are often incomplete because of missing information or are plagued by degradation. Low-rank tensor completion has received significant attention for recovering contaminated data. Tensor decomposition can effectively explore the essential features of tensors, but the tensor rank function induced by traditional tensor decomposition methods cannot explore the correlation between different modes of a tensor. In addition, traditional tensor completion methods typically impose the total variation constraint on the overall tensor data, which cannot fully utilize the smoothness prior of the low-dimensional subspace of tensors. To address these two problems, this study proposes a low-rank tensor recovery algorithm using a sparse prior and multimode tensor factorization. Traditional low-rank tensor completion models based on tensor rank minimization restore tensors by directly minimizing the tensor rank, where the tensor rank can be the Tucker rank or the tensor nuclear norm (TNN). However, extensive research has shown that a correlation exists among the different modes of tensor data. The Tucker rank induced by Tucker decomposition and the TNN induced by tensor singular value decomposition cannot flexibly handle multimode correlations within tensors. Therefore, we introduce multimode tensor decomposition via the mode-n product and incorporate it into the tensor rank minimization model. In the process of iteratively completing the overall tensor, our model can effectively explore the correlations between different modes of the tensor, which addresses the limitation of the traditional TNN in inadequately capturing intermode correlations. Each factor matrix obtained from the multimode tensor decomposition framework encapsulates latent information corresponding to its respective mode, revealing valuable correlated auxiliary information within and across modes, such as the local sparsity exhibited by natural tensor data. By showing that the majority of factor gradients in the factor gradient histogram are zero or close to zero, we demonstrate that the factors in multimode tensor decomposition exhibit local sparsity. Therefore, on the basis of the tensor subspace assumption, we introduce a local sparsity prior to preserve the similarity in local segments. Method: The method incorporates multimode tensor factorization and the local sparsity of the decomposed factors into the tensor rank minimization model. First, the nuclear norm constraint is imposed on the original tensor to capture its global low-rankness, which makes the model robust in tensor completion tasks. Second, multimode tensor factorization is used to decompose the tensor into a series of low-dimensional tensors and a series of factor matrices along each mode, which explores the correlation between different modes. A factor gradient sparsity regularization constraint is imposed on the factor matrices to explore the local sparsity of the tensor subspace, which further improves the recovery performance. Specifically, after tensor decomposition, first-order differencing is applied to the factors, and a norm-based smoothness constraint is leveraged. Combining multimode tensor decomposition with tensor subspace sparsity, a robust tensor completion model is developed.
The proposed model is optimized within the alternating direction method of multipliers (ADMM) framework, iteratively updating the variables to accomplish tensor completion and tensor decomposition simultaneously. Result: The method in this paper is quantitatively and qualitatively compared with eight other restoration methods at three loss rates on hyperspectral images, multispectral images, YUV (also known as YCbCr) videos, and medical imaging data. The restoration quality of our method is basically the same as that of the deep learning-based GP-WLRR method but without its computational burden. Compared with the six other tensor modeling methods, our method achieves the best results in terms of the mean peak signal-to-noise ratio (MPSNR) and mean structural similarity (MSSIM) metrics. It exhibits superior recovery performance even at loss rates as high as 95%. These findings demonstrate the effectiveness of the proposed model in tensor data recovery. Conclusion: The low-rank tensor completion algorithm based on a sparse prior and multimode tensor decomposition proposed in this paper can simultaneously exploit the global low-rankness and local sparsity of a tensor and effectively recover contaminated multichannel visual data.
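Two primitives underlie models of this kind: mode-n unfolding of a tensor and singular value thresholding (SVT), the proximal operator of the nuclear norm that appears in each ADMM iteration. The NumPy sketch below shows only these primitives; the full multimode factorization and factor-gradient sparsity terms of the proposed model are not reproduced, and the test tensor is a random placeholder.

```python
# Illustrative sketch of mode-n unfolding/folding and singular value thresholding.
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix, mode, shape):
    """Inverse of `unfold` for a tensor of the given shape."""
    full_shape = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape(full_shape), 0, mode)

def svt(matrix, tau):
    """Singular value thresholding: shrink singular values by tau (nuclear norm prox)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return (u * s) @ vt

x = np.random.rand(10, 12, 8)
low_rank_approx = fold(svt(unfold(x, 1), tau=0.5), 1, x.shape)
print(low_rank_approx.shape)    # (10, 12, 8)
```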
Abstract: Objective: Panoramic movie technology has advanced notably to enrich the audiovisual experience for viewers, resulting in a heightened sense of immersion within the visual environment. Nevertheless, producing high-quality images poses a challenge for conventional rasterization techniques, necessitating the exploration of alternative approaches. Monte Carlo path tracing has proven effective in generating high-quality images, offering exceptional visual fidelity in various rendering applications. However, the computational overhead associated with this algorithm remains challenging. Thus, reducing the number of samples per pixel in Monte Carlo path tracing is a common approach to reducing computation. However, this reduction often introduces noticeable noise in the resulting images, compromising their overall quality. This paper aims to address the issue of image noise in Monte Carlo path tracing by exploring and proposing advanced denoising techniques. Two main denoising approaches are commonly used in the domain of Monte Carlo rendering. The first utilizes traditional filtering methods with hand-designed filters to remove image noise. This approach is versatile, but its effectiveness in noise removal may be limited, often resulting in residual noise. The second involves deep learning-based denoising methods, which can effectively eliminate noise but may exhibit performance limitations on specific image types. Most existing image denoising algorithms are developed and studied for ordinary flat images, with limited research dedicated to denoising algorithms specifically designed for panoramic images. Panoramic images possess unique characteristics, including a 360° field of view in the horizontal direction, a 180° field of view in the vertical direction, distorted edges, and varying prominence of equatorial and polar pixels as perceived by human observers. Conventional flat-image denoising methods often fail to fully account for these characteristics, leading to excessive blurring or residual noise in the equatorial, polar, and distorted edge regions after denoising. Therefore, this paper proposes a visual saliency-driven non-local means (VSD-NLM) filtering denoising algorithm explicitly tailored for Monte Carlo rendering of panoramic images. The algorithm leverages the distinctive characteristics of panoramic images, such as the 360° field of view, distorted edges, and varying pixel prominence, to effectively reduce noise while preserving the essential features of panoramic images. Through comprehensive experimentation and evaluation, the proposed algorithm demonstrates its efficacy in enhancing the quality of Monte Carlo-rendered panoramic images, providing a valuable contribution to the field of panoramic image denoising. Method: This paper presents the design and optimization of the VSD-NLM filtering algorithm for denoising Monte Carlo-rendered panoramic images. The proposed algorithm comprises two key components aimed at effectively removing noise and enhancing image quality in panoramic scenes. The first component focuses on enhancing the non-local means filtering process specifically for panoramic images. Initially, a panoramic image saliency detection model is used to generate a saliency image, incorporating an equatorial bias to improve saliency accuracy.
Subsequently, the saliency image is employed to delineate the saliency and non-saliency regions of the panoramic image. In the saliency region, the gradient magnitude similarity deviation between image blocks is calculated to refine the weights used in non-local means filtering. For the non-saliency region, parallel non-local means filtering algorithms are devised to accelerate the filter reconstruction process. Finally, the denoising results of the saliency and non-saliency regions are combined to produce the final denoised panoramic image. The second component of the algorithm focuses on optimized noise reduction in the distorted edge regions of the panoramic image. Improvements are made to the Canny algorithm to obtain a highly accurate edge gradient image. These improvements involve optimizing the weights for the 45° and 135° directions of the image, generating adaptive high and low thresholds with an improved Otsu method, and enhancing the local thresholds to optimize the performance of the Canny operator. Subsequently, anisotropic diffusion filtering is combined with guided filtering by using the gradient image as a guide to filter and enhance the combined images. These optimizations collectively contribute to effective noise reduction in the distorted edge regions of panoramic images, resulting in enhanced image quality and reduced noise artifacts. Result: This paper presents a comprehensive performance evaluation of the proposed denoising algorithm for panoramic images and utilizes structural similarity (SSIM) and FLIP metrics as objective evaluation indicators. The performance of the VSD-NLM algorithm is compared with that of other algorithms, such as non-local means filtering, multifeature non-local means filtering, and progressive denoising, to assess its effectiveness in reducing noise and improving the visual quality of panoramic images. Experimental results reveal that the proposed algorithm outperforms the comparison algorithms in terms of the objective evaluation indicators. The average FLIP value achieved by the proposed algorithm is 15.2% lower than that of the other algorithms. Similarly, the average SSIM value attained by the proposed algorithm is 14.7% higher than that of the other algorithms, indicating enhanced structural preservation. Furthermore, the visual effects of the algorithm are assessed, demonstrating its capability to mitigate blurring artifacts in panoramic images and enhance visual perception quality. This paper also experimentally verifies the effectiveness of two component algorithms: gradient magnitude similarity deviation-assisted non-local means (GMSDA-NLM) and parallel non-local means (P-NLM). The GMSDA-NLM algorithm combines the strengths of non-local means filtering and gradient magnitude similarity deviation to achieve superior noise reduction while maintaining the integrity of image details; it effectively identifies and suppresses noise while preserving essential image features. The P-NLM algorithm runs approximately six times faster on average than the nonparallel algorithm, facilitating real-time or near-real-time noise reduction, and the SSIM value between P-NLM results and the images generated by the nonparallel algorithm reaches 0.996. Conclusion: This paper introduces a specialized denoising algorithm tailored for panoramic images, specifically addressing the unique challenges associated with denoising in this domain.
From a practical perspective, the proposed algorithm holds substantial value for panoramic film production: by markedly reducing noise in panoramic images, it enhances the visual quality and fidelity of panoramic films. The results contribute to immersive visual storytelling, elevating the overall cinematic experience and capturing the attention of audiences. Overall, the algorithm offers valuable theoretical advancements and practical implications for panoramic film production, enhancing the quality and impact of visual narratives in immersive cinematography.
Keywords: panoramic image; non-local means filter; gradient magnitude similarity deviation (GMSD); guided filtering; image denoising
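For reference, the abstract above builds on the classical non-local means (NLM) principle: each pixel is replaced by a weighted average of pixels whose surrounding patches look similar. The sketch below is a plain, unoptimized single-pixel NLM, omitting the saliency-driven weighting and GMSD refinement of VSD-NLM; patch size, search window, and the smoothing parameter h are illustrative choices.

```python
# Simplified reference implementation of the non-local means principle.
import numpy as np

def nlm_pixel(img, y, x, patch=3, search=7, h=0.1):
    """Denoise one pixel with non-local means inside a small search window."""
    half_p, half_s = patch // 2, search // 2
    ref = img[y - half_p:y + half_p + 1, x - half_p:x + half_p + 1]
    weights, values = [], []
    for dy in range(-half_s, half_s + 1):
        for dx in range(-half_s, half_s + 1):
            yy, xx = y + dy, x + dx
            cand = img[yy - half_p:yy + half_p + 1, xx - half_p:xx + half_p + 1]
            dist = np.mean((ref - cand) ** 2)          # patch similarity
            weights.append(np.exp(-dist / (h * h)))    # similar patches get larger weights
            values.append(img[yy, xx])
    weights = np.asarray(weights)
    return float(np.dot(weights, values) / weights.sum())

noisy = np.random.rand(32, 32)   # stand-in for a noisy rendered image
print(nlm_pixel(noisy, 16, 16))
```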
Abstract: Objective: The halftoning method represents continuous-tone images with two levels of color, namely, black and white; it is commonly used in digital image printing, publishing, and display applications because of cost considerations. Compared with a continuous-tone image, a halftone image has only two values, so halftoning can save considerable storage space and network transfer bandwidth, making it a feasible and important image compression method. Inverse halftoning is a classic image restoration task that aims to recover continuous-tone images from halftone images containing only bilevel pixels. However, owing to the loss of original image content in halftone images, inverse halftoning is also a classic ill-posed problem. Although existing inverse halftoning algorithms have achieved good performance, their reconstruction results lose image details and features, causing varying degrees of curvature and roughness in some high-frequency regions and resulting in poor visual quality, which still cannot meet the requirement for highly detailed and textured images. Therefore, inverse halftoning remains a challenge in recovering high-quality continuous-tone images. Many previous methods focused on model design to improve performance while ignoring the important impact of training strategies on model optimization, which led to poor model performance. To solve these problems, we propose an inverse halftoning network to improve the quality of halftone image reconstruction and explore different training strategies to optimize model training. Method: In this paper, we propose an end-to-end multiscale progressively residual learning network (MSPRL), which is based on the UNet architecture and takes multiscale input images. To make full use of the information of the different input images, we design a shallow feature extraction module to capture the attention features of different-scale images. We divide our model into an encoder and a decoder, where the encoder focuses on restoring content information, and the decoder receives the aggregated features of the encoder to strengthen deep feature learning. The encoder and the decoder are composed of residual blocks (RBs). Our MSPRL comprises three levels, each receiving an input halftone image at a different scale. To collect the encoder features and transmit them to the decoder, we use the Concat operation and a convolutional kernel as the feature fusion module (FF) to aggregate the feature maps of the different-level encoders. In the overall model, input halftone images are progressively learned from the left encoder to the right decoder. We systematically study the effects of different training strategies on model training and reconstruction performance. For example, using a smaller training patch size yields slightly lower performance than a larger one but reduces training time by about 65%. Adding a fast Fourier transform loss can further improve model performance compared with using a single loss. We also compare different feature channel dimensions, feature extraction blocks, and activation functions.
Experimental results demonstrate that effective training strategies can optimize model training and significantly improve performance. Result: The experimental results are compared with those of six methods on seven datasets, including a denoising convolutional neural network, VDSR, an enhanced deep super-resolution network, a progressively residual learning network (PRL), a gradient-guided residual learning network, a multi-input multi-output UNet, and a retrained PRL (PRL-dt). On the Places365 and Kodak datasets, compared with the second-best-performing model, PRL-dt, the peak signal-to-noise ratio (PSNR) of our MSPRL is increased by 0.12 dB and 0.18 dB, respectively. On the five other test datasets commonly used for image super-resolution (Set5, Set14, BSD100, Urban100, and Manga109), compared with PRL-dt, the PSNR of MSPRL is increased by 0.11 dB, 0.25 dB, 0.08 dB, 0.39 dB, and 0.35 dB, respectively. With our training strategies, PRL-dt obtains an average PSNR improvement of 1.44 dB over the unoptimized PRL on the seven test datasets. Extensive experiments demonstrate that MSPRL achieves significant reconstruction quality in image details and textures. Conclusion: In this paper, we propose an inverse halftoning network to solve the problem of low-quality reconstruction in inverse halftoning. Our MSPRL contains a shallow feature extraction module, an FF, and an encoder and a decoder with RBs as the core. It combines the advantages of the UNet architecture and multiscale image information and chooses appropriate training strategies to improve image reconstruction quality and the visual effects in terms of details and textures. Extensive experiments demonstrate that our MSPRL outperforms previous approaches and achieves state-of-the-art performance.
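The training strategies above mention adding a fast Fourier transform loss on top of a spatial loss. The following PyTorch sketch shows one common form of such a loss (L1 between the 2D spectra of prediction and target added to a spatial L1 term); the exact formulation and weighting used for MSPRL are assumptions here.

```python
# Hedged sketch of a combined spatial + frequency-domain (FFT) L1 loss.
import torch
import torch.nn.functional as F

def fft_l1_loss(pred, target, freq_weight=0.1):
    """Spatial L1 loss plus L1 loss on the real/imaginary parts of the 2D FFT."""
    spatial = F.l1_loss(pred, target)
    pred_freq = torch.fft.rfft2(pred, norm="ortho")
    target_freq = torch.fft.rfft2(target, norm="ortho")
    frequency = F.l1_loss(torch.view_as_real(pred_freq),
                          torch.view_as_real(target_freq))
    return spatial + freq_weight * frequency

pred = torch.rand(2, 1, 64, 64)      # stand-in for reconstructed continuous-tone images
target = torch.rand(2, 1, 64, 64)    # stand-in for ground-truth continuous-tone images
print(fft_l1_loss(pred, target).item())
```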
Abstract: Objective: Images are often taken under sub-optimal lighting conditions and are disturbed by backlight, uneven illumination, and weak light because of unavoidable environmental and technical limitations, such as insufficient lighting and limited exposure time. The quality of such images is degraded, and the information they convey to high-level tasks, such as object tracking, recognition, and detection, is also unsatisfactory. Various methods have been proposed, but they often fail to produce visually attractive results; the enhanced images still suffer from unclear details, low contrast, and color distortion. Existing deep learning methods have better accuracy, robustness, and speed than traditional methods. However, their generalization performance is generally poor because of the reliance on synthetic datasets. For example, supervised learning methods require pairs of low-light and normal-light images, and the visual results of the trained models applied to real low-light images are remarkably poor. Considering the above problems, a low-light image enhancement method guided by semantic segmentation and the HSV color space is proposed. This method does not require excessive computing resources while restoring the true color and detailed texture of objects. Moreover, the generalization performance of the model is better than that of supervised learning because training requires no reference images. Method: The proposed framework is an end-to-end low-light image enhancement network based on seven convolutional layers with a symmetrical structure similar to U-Net. The input is a low-light image, and the output is a set of best-fit curve parameter maps. Through iterative application of the curve, all pixels in the RGB channels of the input low-light image are mapped to obtain the final enhanced image. The curve automatically maps the low-light image to the enhanced image, and the curve parameters are adaptive, depending only on the input image and learned through the network. After the network extracts the curve parameter maps of the input image, the curve is repeatedly applied for image enhancement, and the enhancement results are evaluated and guided by a series of no-reference loss functions. Simultaneously, the result of the last enhancement iteration is fed into an unsupervised semantic segmentation network to preserve the semantic information of the image. The loss functions include the following. 1) A spatial consistency loss is used to maintain the consistency of details between the enhanced and original images and address the unclear details common in low-light image enhancement. The enhanced result and the low-light image are divided into several small local regions, and the pixel differences between corresponding local regions in the enhanced result and the surrounding one-pixel-wide local regions in the low-light image are minimized as much as possible. 2) An HSV loss is used to restore the color information of the image. The enhanced result and the low-light input are converted from RGB to the HSV color space, and the hue and saturation differences for each pixel between the enhanced result and the corresponding low-light image are then calculated. A small difference in hue and saturation indicates that the color is close to the original color of the low-light image. 3) An exposure loss is used to enhance brightness by pushing each pixel toward a certain middle value, raising the overall brightness level of the final image.
This middle value represents the ideal exposure value. 4) A semantic loss is used to retain semantic information. The unsupervised semantic segmentation network performs pixel-wise segmentation on the enhanced image, obtaining the predicted probability for each pixel and using this probability to design the semantic loss. 5) A total variation loss is used to constrain the differences between adjacent pixels of the image. The estimated curve parameter map is smoothed to ensure that the curve parameter values of adjacent pixels are close to each other and to preserve the monotonicity of the curve as much as possible. Result: The proposed method is compared with five methods: low-light image enhancement (LIME), RetinexNet, EnlightenGAN, zero-reference deep curve estimation (Zero-DCE), and semantic-guided zero-shot learning (SGZ). The quality of the enhanced images is objectively evaluated using the full-reference metrics peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and mean absolute error (MAE) and the no-reference metric natural image quality evaluator (NIQE), while subjective visual effects are incorporated for a comprehensive evaluation. PSNR measures the level of noise and distortion in an image; a high value theoretically indicates a small error between the enhanced and reference images and thus high quality. SSIM is a perceptual model that aligns with human visual perception and measures the similarity between the enhanced and reference images in terms of contrast, brightness, and structure; a high SSIM value indicates closeness between the enhanced and reference images. A small MAE value indicates a small deviation from the reference image. NIQE compares the image with a designed natural image model, and a low NIQE value indicates high similarity to natural real images. In terms of PSNR, the proposed method is 0.32 dB higher than Zero-DCE; in terms of the natural image quality evaluator, the method outperforms EnlightenGAN by 6%. From a subjective viewpoint, the proposed method addresses the unclear details and color distortion present in other methods and has better visual effects. Conclusion: An unsupervised semantic segmentation network is introduced in this paper to perform pixel-wise segmentation on the enhanced images, preserving the semantic information during enhancement. The color of low-light images is restored by designing a loss function in the HSV color space. The spatial consistency loss ensures that the enhanced images are as detail-consistent as possible with their corresponding low-light images. Subjective and objective evaluations demonstrate the superiority of the proposed method over others. Experimental results show that the proposed enhancement method outperforms other methods in qualitative and quantitative aspects, effectively addressing the issues of unclear details and color distortion in low-light images and demonstrating its practical value.
Keywords: image processing; low-light image enhancement; deep learning; semantic segmentation; HSV color space
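The method above enhances an image by iteratively applying a pixel-wise curve whose parameter maps are predicted by the network. The sketch below assumes a Zero-DCE-style quadratic curve, LE(x) = x + a * x * (1 - x), applied repeatedly with per-pixel parameter maps; the actual curve family and iteration count of the proposed network are not specified in the abstract, so treat this purely as an illustration.

```python
# Hedged sketch of iterative curve-based enhancement with per-pixel parameter maps.
import numpy as np

def apply_curves(image, curve_params):
    """Iteratively brighten `image` with one parameter map per iteration.

    image:        H x W x 3 array in [0, 1]
    curve_params: list of H x W x 3 arrays in [-1, 1], normally predicted by the network
    """
    enhanced = image.astype(np.float64)
    for a in curve_params:
        enhanced = enhanced + a * enhanced * (1.0 - enhanced)   # assumed quadratic curve
    return np.clip(enhanced, 0.0, 1.0)

low_light = np.random.rand(64, 64, 3) * 0.3                     # stand-in for a dark image
params = [np.full((64, 64, 3), 0.6) for _ in range(8)]          # stand-in for network output
print(apply_curves(low_light, params).mean())
```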
Abstract: Objective: In the art field, an exquisite painting typically takes considerable effort, from sketch drawing in the early stage to coloring and polishing. With the rise of the animation, painting, graphic design, and other related industries, sketch colorization has become one of the most tedious and repetitive processes. Although some computer-aided design tools have appeared in past decades, they still require humans to accomplish the colorization operations, and drawing an exquisite painting is difficult for ordinary users. Meanwhile, automatic sketch colorization remains a difficult problem. Therefore, both academia and industry are in urgent need of convenient and efficient sketch colorization methods. With the development of deep neural networks (DNNs), DNN-based colorization methods have achieved promising performance in recent years. However, most studies have focused on grayscale image colorization for natural images, which is quite different from sketch colorization. At present, only a few studies have focused on automatic sketch colorization, and they typically require user guidance or are designed for certain types of content, such as anime characters. However, automatically understanding sparse lines and selecting appropriate colors remain extremely difficult and ill-posed problems. Thus, disharmonious colors frequently appear in recent automatic sketch colorization results, e.g., red grass and a black sun. Therefore, most sketch colorization methods reduce the difficulty through user guidance, which can be roughly divided into three types: reference-image-based, color-hint-based, and text-expression-based. Although these semi-automatic methods can produce more reasonable results, they still require inefficient user interaction. Method: In practice, we observe that the colors used in paintings of a particular style are typically fixed and finite rather than arbitrary. Therefore, this study focuses on automatic sketch colorization with a finite color space prior, which can effectively reduce the difficulty of understanding semantics and avoid undesired colors. In particular, a two-stage multi-scale colorization network is designed. In the first stage, a subnetwork is proposed to generate a dense grayscale image from the input sparse sketch. It adopts the commonly used U-Net structure, which obtains a large receptive field and is thus helpful for understanding high-level semantics. In the second stage, a multi-scale generative adversarial network is designed to colorize the grayscale image in accordance with the input color palette. In this study, we adopt three scales. At the minimum scale, the input color palette prior, which contains several specified dominant colors, is used to guide color reasoning. Then, the sketch content features extracted by the content encoder are converted into spatial attention and multiplied by the color features extracted by the color encoder to perform preliminary color reasoning. The structure of the other two scales is similar to that of the minimum scale. Color guidance information is gradually fused to generate higher-quality and higher-resolution results. Adversarial learning is adopted; hence, three discriminators corresponding to the generators at each scale and a discriminator for grayscale generation are used. An intermediate loss is designed to guide the generation of the grayscale image in the first stage. Pixel-wise loss, adversarial loss, and TV loss are used in the second stage.
Considering the application scenarios, we construct three datasets, namely, the Regong Tibetan painting dataset, the Thangka (Tibetan scroll painting) elements dataset, and a specific cartoon dataset. Each dataset contains images, color palettes, and corresponding sketches. Result: Our model is implemented on the TensorFlow platform. For the three datasets, all images with a size of 256 × 256 pixels are divided into training and testing sets with a ratio of 9:1. The proposed method is compared with the baseline model Pix2Pix and several automatic sketch colorization methods: AdvSegLoss, PaintsChainer, and Style2Paints V4.5. Notably, PaintsChainer and Style2Paints are colorization tools that can be used directly and require humans to input sketches manually; hence, we only retrain Pix2Pix and AdvSegLoss with their official implementations on the same datasets for a fair comparison. For quantitative evaluation, the peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and mean squared error (MSE) are used to measure the objective quality of the output colorized images. Meanwhile, the colorfulness score is used to measure the color vividness of the colorized images. The proposed method achieves nearly the highest PSNR and SSIM values and the lowest MSE values on the three datasets, and it obtains the best colorfulness scores on all datasets. Although the quantitative evaluation shows the effectiveness of the proposed method, comparing colorization results through human subjective perception is more meaningful. Hence, subjective results and user studies are provided to evaluate the performance of the proposed method. For qualitative evaluation, the proposed method reproduces harmonious colors and obtains better visual perception than the comparison methods. Furthermore, owing to the decoupled color reasoning and fusion modules in our model, the color of the output image can be changed flexibly with different input color priors. In the user study, most users prefer our results in terms of both color selection and overall image quality. Moreover, we conduct ablation experiments, including training a single-stage model (i.e., colorizing an image without first generating a grayscale image) and using only a single scale in the second stage. The performance of the single-stage model is the poorest, proving that learning automatic colorization directly is an extremely difficult task. The performance of the single-scale model is poorer than that of the multi-scale model, verifying the effectiveness of the multi-scale strategy. Conclusion: Experimental results on three datasets show that the proposed automatic sketch colorization method based on a finite color space prior can colorize sketches effectively and easily generate various colorful results by changing the color prior.
关键词:sketch colorization;finite color space;cartoon;painting;generative adversarial network
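The palette-guided fusion described in the second stage (content features converted into spatial attention and multiplied by palette color features) can be illustrated with a minimal sketch. The PyTorch module below is a hypothetical single-scale simplification; the channel sizes, palette size, and final concatenation step are assumptions for illustration and do not reproduce the authors' TensorFlow implementation.

```python
import torch
import torch.nn as nn

class PaletteColorFusion(nn.Module):
    """Illustrative single-scale fusion: sketch-content features gate
    palette-derived color features via spatial attention (assumed design)."""
    def __init__(self, content_ch=64, color_ch=64, n_colors=5):
        super().__init__()
        # color encoder: embeds the K dominant palette colors into one feature vector
        self.color_encoder = nn.Sequential(
            nn.Linear(n_colors * 3, color_ch), nn.ReLU(),
            nn.Linear(color_ch, color_ch), nn.ReLU())
        # 1x1 conv turns content features into a single-channel spatial attention map
        self.attn = nn.Sequential(nn.Conv2d(content_ch, 1, 1), nn.Sigmoid())
        self.out = nn.Conv2d(content_ch + color_ch, content_ch, 3, padding=1)

    def forward(self, content_feat, palette):
        # content_feat: (B, C, H, W) from the content encoder
        # palette: (B, n_colors, 3) normalized dominant colors
        b, _, h, w = content_feat.shape
        color_feat = self.color_encoder(palette.flatten(1))              # (B, color_ch)
        color_map = color_feat[:, :, None, None].expand(-1, -1, h, w)    # broadcast spatially
        attn = self.attn(content_feat)                                   # (B, 1, H, W)
        gated_color = attn * color_map                                   # color reasoning guided by content
        return self.out(torch.cat([content_feat, gated_color], dim=1))

# usage sketch
fusion = PaletteColorFusion()
feats = torch.randn(2, 64, 64, 64)
palette = torch.rand(2, 5, 3)
print(fusion(feats, palette).shape)   # torch.Size([2, 64, 64, 64])
```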
摘要:Objective: Information hiding techniques, including watermarking and steganography, have become effective and important means of secret communication. For a country or an organization, information hiding can be used to communicate confidential information covertly or, through watermarking, to protect the copyright of software and digital publications. Conventional information hiding techniques embed information by modifying the transmission carrier. However, with the continuous development of steganalysis technology, such algorithms run an increasing risk of detection. Therefore, some scholars have introduced coverless information hiding. Coverless information hiding does not mean secret communication without a transmission medium; rather, it transmits secret information without modifying the medium, i.e., the original media are transmitted unmodified. Coverless information hiding algorithms can be divided into selective and constructive information hiding. Selective information hiding encodes or extracts features from the images in an image library and chooses the images corresponding to the secret information. By contrast, constructive information hiding uses the secret information to drive the generation of the communication media. However, traditional constructive information hiding algorithms generally use visual signals such as pixels or patterns to hide the secret information directly in the image, resulting in a strong correlation between the image content and the secret information, which is easily identified by analyzing and detecting the image. Considering the widespread use of B-spline curves in computer graphics, this paper proposes a constructive information hiding algorithm that generates texture images based on B-splines. The positions of the control points are modified in accordance with the secret information; therefore, the secret information is hidden indirectly in the image spatial domain and has no direct relationship with the spatial characteristics. Method: In the information hiding stage, the hider first divides a blank canvas into blocks, then numbers and scrambles the subblocks for encryption. The midpoints of a group of subblocks are randomly selected to obtain a group of coordinate points as the initial control points of a B-spline curve. The control points of multiple groups of B-spline curves are then obtained by affine transformation of the initial control points, and the B-spline curves are drawn. The position of the control points of each curve is changed in accordance with the secret information; that is, a texture image comprising stego curves is generated. Finally, the colors corresponding to the numbers of the selected subblocks are chosen from the color library to fill the texture image, completing the construction of the color stego texture image. In the information extraction stage, the extractor first blocks, numbers, and scrambles the image with the same key and then extracts the indexes of the subblocks according to the colors of the stego image to calculate the initial control points. Edge detection is then conducted on the stego texture image to obtain the stego curves.
Finally, the control points of the stego curves are obtained by combining the stego curves with the initial control points, and the secret information is extracted. Result: The hiding capacity of the texture image generated by the proposed algorithm can change with the variation of the texture shape. In a comparison of hiding capacity for color images of the same size of 800 × 800 pixels, the proposed algorithm can hide 2 870 bits of secret information, which is 6.7 and 3.4 times the hiding capacity of two other texture-construction-based information hiding algorithms. The proposed algorithm has good robustness to common image attacks and strong resistance to JPEG compression. The SRNet algorithm is used for steganalysis, and the detection error is close to 0.5 when the hiding capacity is low, which indicates that detecting the proposed algorithm with traditional steganalysis algorithms is difficult. Conclusion: In this paper, the control points of B-spline curves are used as features to realize indirect information hiding in the image spatial domain, which solves the problem of strong correlation between image content and secret information in traditional constructive information hiding algorithms. The hiding capacity of this texture image generation algorithm can be adjusted flexibly, and the algorithm is robust to common image attacks and strongly resistant to steganalysis.
关键词:constructive information hiding;B-spline;texture image;hiding capacity;security;robustness
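The core embedding idea, shifting B-spline control points relative to key-derived initial positions so that each offset encodes one secret bit, can be sketched as follows. The offset rule, its magnitude, and the one-bit-per-point capacity are assumptions for illustration and are not the paper's exact construction.

```python
import numpy as np

DELTA = 2.0  # assumed offset (in pixels) that encodes one bit per control point

def embed_bits(initial_points: np.ndarray, bits: list) -> np.ndarray:
    """Shift each control point's y-coordinate by +DELTA for bit 1 and -DELTA
    for bit 0 (illustrative rule; the paper's actual mapping may differ)."""
    assert len(bits) <= len(initial_points)
    stego = initial_points.astype(float).copy()
    for i, bit in enumerate(bits):
        stego[i, 1] += DELTA if bit else -DELTA
    return stego

def extract_bits(stego_points: np.ndarray, initial_points: np.ndarray, n_bits: int) -> list:
    """Recover bits by comparing the stego control points with the initial ones,
    which the extractor re-derives from the shared key (block indexes)."""
    diff = stego_points[:n_bits, 1] - initial_points[:n_bits, 1]
    return [1 if d > 0 else 0 for d in diff]

# usage sketch
rng = np.random.default_rng(0)
init = rng.uniform(0, 800, size=(16, 2))      # e.g., midpoints of selected subblocks
secret = [1, 0, 1, 1, 0, 0, 1, 0]
stego = embed_bits(init, secret)
assert extract_bits(stego, init, len(secret)) == secret
```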
摘要:Objective: A texture exemplar is an input sample or template for texture synthesis that contains the desired texture features and structures. Texture synthesis refers to the generation of new texture images by combining or duplicating one or more texture samples. In exemplar-based texture synthesis, the diversity and structure of the texture exemplar play a decisive role in the quality of the synthesized result. In the field of computer vision, texture sample diversity is crucial for texture synthesis tasks, as it brings a richer, more diverse, and more realistic appearance to synthesized textures. It can also provide greater creative inspiration and design ideas to artists and designers. At present, texture exemplars can be obtained from multiple sources, such as public texture datasets, images clipped from the Internet, or photography; that is, texture exemplars are mostly obtained via manual cropping or automatic extraction algorithms. However, not everyone is an artist, and extracting a good texture sample or cropping a small texture exemplar from an existing image is difficult for ordinary users. In addition, manually cropping high-quality texture samples from a large number of images consumes considerable energy and time for texture artists, and the results are easily biased by subjectivity and limited in diversity. With the development of deep learning, the current state-of-the-art automatic texture exemplar extraction algorithm is the Trimmed T-CNN model based on a convolutional neural network (CNN). It can effectively extract a variety of texture exemplars from an input image. However, the model uses a selective search algorithm to generate candidate regions; this process is time-consuming and computationally complex, and the model suffers from slow inference speed. For these reasons, this study aims to use the rich image resources on the Internet to automatically, quickly, and accurately extract ideal and diverse texture exemplars from various images, providing users with more choices and better meeting the requirements of texture synthesis tasks. Method: Following the idea of object detection, we propose an automatic texture exemplar extraction algorithm that combines deep learning and broad learning. The algorithm generates candidate texture exemplar regions through a CNN and then uses broad learning for classification. To identify and generate texture exemplar candidates from the input image, this study first uses a residual feature pyramid network to extract feature maps from the original image and then uses a region proposal network to automatically and quickly obtain a large number of multi-scale texture exemplar candidate regions. Subsequently, we leverage a broad learning system to classify the candidate regions extracted in the previous step.
Finally, to obtain ideal texture exemplars, we design a scoring criterion based on classification accuracy, distribution characteristics, and size, and use it to score the classification results of the broad learning system and screen out the ideal texture exemplars. Result: To verify the effectiveness of the proposed method, we first collect a large number of ideal texture exemplars with distinguishable and representative features as a training dataset and divide them into six classes based on size and regularity for experimental verification. A large number of qualitative and quantitative experiments are performed. The experimental results show that the accuracy of the proposed model reaches 94.66%. Compared with the state-of-the-art method Trimmed T-CNN, the accuracy of our model increases by 0.22% and the speed is improved. In particular, for images with resolutions of 512 × 512 pixels, 1 024 × 1 024 pixels, and 2 048 × 2 048 pixels, the proposed algorithm is faster by 1.393 8 s, 1.864 3 s, and 2.368 7 s, respectively. Conclusion: In this study, we propose an automatic texture exemplar extraction algorithm based on deep learning and broad learning, which effectively combines the advantages of CNNs and broad learning classification systems. The experimental results show that our model outperforms several state-of-the-art texture exemplar extraction methods, making texture exemplar extraction more accurate and efficient.
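One way to realize the scoring criterion mentioned above (classification accuracy, distribution characteristics, and size) is a weighted combination of a confidence term, a centrality term, and a size term. The weights, the preferred area ratio, and the exact form of each term below are assumptions for illustration only, not the paper's definition.

```python
import numpy as np

def exemplar_score(prob, box, image_size, w=(0.6, 0.2, 0.2)):
    """Illustrative scoring of one texture-exemplar candidate.

    prob       : classifier confidence for the predicted texture class (0-1)
    box        : (x1, y1, x2, y2) candidate region in pixels
    image_size : (width, height) of the source image
    w          : assumed weights for the confidence, centrality, and size terms
    """
    x1, y1, x2, y2 = box
    iw, ih = image_size
    # distribution term: candidates near the image centre score higher (assumed)
    cx, cy = (x1 + x2) / 2 / iw, (y1 + y2) / 2 / ih
    centrality = 1.0 - np.hypot(cx - 0.5, cy - 0.5) / np.hypot(0.5, 0.5)
    # size term: prefer crops around ~10% of the image area (assumed target)
    area_ratio = (x2 - x1) * (y2 - y1) / (iw * ih)
    size_term = np.exp(-((area_ratio - 0.1) / 0.1) ** 2)
    return w[0] * prob + w[1] * centrality + w[2] * size_term

# rank broad-learning classification outputs and keep the best candidate
candidates = [(0.97, (100, 120, 356, 376)), (0.88, (0, 0, 64, 64))]
scores = [exemplar_score(p, b, (1024, 1024)) for p, b in candidates]
best = int(np.argmax(scores))
```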
摘要:Objective: The performance of traditional visual place recognition (VPR) algorithms depends on the imaging quality of optical images. However, optical cameras suffer from low temporal resolution and limited dynamic range. For example, in a scene with high-speed motion, an optical camera has difficulty continuously capturing the rapid changes in the position of the scene on the imaging plane, resulting in motion blur in the output image. When the scene brightness exceeds the recording range of the camera's photosensitive chip, the output image may be degraded by underexposure or overexposure. Blurring, underexposure, and overexposure cause the loss of image texture and structure information, which reduces the performance of visual place recognition algorithms. Therefore, the recognition performance of image-based VPR algorithms is poor in high-speed and high dynamic range (HDR) scenarios. The event camera is a new type of visual sensor inspired by biological vision, characterized by low latency and HDR. Using event cameras can effectively improve the recognition performance of VPR algorithms in high-speed and HDR scenes. Therefore, this paper proposes a VPR algorithm fused with event cameras, which utilizes the low latency and HDR characteristics of event cameras to improve recognition performance in extreme scenarios such as high speed and HDR. Method: The proposed method first fuses the information of the query image and the events within its exposure time interval to obtain the multimodal features of the query location, and then retrieves the reference image closest to these multimodal features in the reference image database. Specifically, the features of the good-quality reference images are extracted by an image feature extraction module, whereas the query image and its events within the exposure time interval are fed into a multimodal feature fusion module to obtain multimodal fusion features; the reference image most similar to the query is finally obtained through feature matching and retrieval, completing visual place recognition. The network training is supervised by a triplet loss, which drives the network to reduce the feature distance between the query and the positive sample and to increase the distance to the negative sample until the negative distance exceeds the positive distance by at least a similarity margin. Therefore, reference images whose fields of view are similar to or different from that of the query image can be distinguished according to their similarity in the feature vector space, completing the VPR task. Result: The experiments are conducted on the MVSEC and RobotCar datasets. The proposed method is compared with image-based methods, event camera-based methods, and methods that utilize both image and event information. Under different exposure and high-speed scenarios, the proposed method has advantages over existing visual place recognition algorithms. Specifically, on the MVSEC dataset, the proposed method reaches a maximum recall rate of 99.36% and a maximum recognition accuracy of 96.34%, improving the recall rate and precision by 5.39% and 8.55%, respectively, compared with existing VPR methods.
On the RobotCar dataset, the proposed method reaches a maximum recall rate of 97.33% and a maximum recognition accuracy of 93.30%, improving the recall rate and precision by 3.36% and 4.41%, respectively, compared with existing VPR methods. Experimental results show that in high-speed and HDR scenes, the proposed method outperforms existing VPR algorithms and achieves a remarkable improvement in recognition performance. Conclusion: This paper proposes a VPR algorithm fused with event cameras, which utilizes the low latency and HDR characteristics of event cameras and overcomes the problem of image information loss in high-speed and HDR scenes. The method effectively fuses information from the image and event modalities, thereby improving VPR performance in high-speed and HDR scenarios.
关键词:visual place recognition(VPR);event camera;multi-modality;feature fusion;feature matching
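The triplet supervision described above can be written compactly: the loss vanishes once the query-to-negative distance exceeds the query-to-positive distance by the similarity margin. The margin value and feature dimensions below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def vpr_triplet_loss(query, positive, negative, margin=0.3):
    """Triplet loss as described above: push d(query, positive) below
    d(query, negative) by at least a margin (margin value is assumed)."""
    d_pos = F.pairwise_distance(query, positive)   # distance to the same-place reference
    d_neg = F.pairwise_distance(query, negative)   # distance to a different-place reference
    return F.relu(d_pos - d_neg + margin).mean()

# usage with L2-normalized multimodal fusion features (shapes assumed)
q = F.normalize(torch.randn(8, 256), dim=1)
p = F.normalize(torch.randn(8, 256), dim=1)
n = F.normalize(torch.randn(8, 256), dim=1)
loss = vpr_triplet_loss(q, p, n)
```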
摘要:Objective: Curve images are an important form of data presentation; however, querying the specific values embedded in a curve image is difficult without the original data. Existing curve-to-data conversion methods require considerable manual assistance to remove interference in curve images, such as grid lines and axes, and thus have the disadvantage of being mechanically repetitive and labor-intensive. In addition, attacks such as image compression and scaling can degrade image quality, reducing curve-to-data conversion accuracy. Moreover, a curve has a certain line width, so the same x coordinate corresponds to multiple pixels, which makes obtaining the exact position of the point to be measured difficult. To solve these problems, this study proposes a curve extraction and thinning based curve-to-data conversion neural network. Method: First, we propose a side structure guidance and Laplace convolution based curve extraction network (SLCENet). SLCENet uses ResNet as its backbone and enhances curve extraction performance with side structure guidance, using deep supervision to make each layer of the network learn the details of the curve mask better. The side structure guidance contains four scales, and each scale consists of four residual blocks. To obtain clearer curve details, we add a multi-scale dilation module (MDM) to enrich the multi-scale curve features and a noise reduction module (NRM) to reduce the noise in the feature maps. Moreover, we design a Laplace module (LM) to enhance curve extraction performance within the side structure guidance. In general, the number of curve pixels is considerably smaller than the number of non-curve pixels; thus, this study uses a weighted cross-entropy loss to balance the penalty for curve and non-curve pixels. Consequently, SLCENet solves the problem in which the pooling operations in existing curve extraction methods lead to blurred curve edges, improving curve extraction accuracy. Second, to reduce the error caused by the curve line width and to balance computational complexity and curve thinning accuracy, we design 10 features that reflect the curve trend and propose a curve trend feature and multilayer perceptron (MLP) based curve thinning method (CMCT), which achieves highly accurate curve thinning results. Finally, PaddleOCR is used to identify the coordinate labels on the axes and establish the coordinate transformation formula between axis coordinates and pixel coordinates. Result: Extensive experimental results show that our algorithm achieves superior accuracy and speed. In curve extraction, SLCENet achieves an optimal dataset scale (ODS) of 0.985 and takes only 0.043 s for an image with a resolution of 640 × 480 pixels. For curve images degraded by JPEG compression, scaling, and noise attacks, SLCENet still achieves an ODS of 0.902. Although SLCENet is slightly slower than holistically-nested edge detection (HED), richer convolutional features for edge detection (RCF), and the dense extreme inception network (DexiNed), these methods fail to achieve high curve extraction accuracy. Therefore, considering accuracy and running speed together, SLCENet achieves the best performance. In curve-to-data conversion, our algorithm obtains a normalized mean error (NME) of 0.79 and a running speed of 0.83 s per image.
In terms of model size, SLCENet achieves high accuracy with a lightweight model of only about 17 MB. To balance curve thinning accuracy and computational cost, this study compares typical machine learning methods for the curve thinning task. The experimental results show that the decision tree exhibits the best curve-to-data conversion accuracy; nevertheless, considering conversion accuracy, model size, and running speed together, the MLP is chosen for its best overall performance. Conclusion: Our algorithm achieves fully automatic curve-to-data conversion with high accuracy and exhibits greater advantages over existing methods on curve images subjected to JPEG compression, image scaling, and noise attacks. Compared with existing methods, our algorithm is free from the need for considerable manual interaction while maintaining high accuracy and fast running speed.
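The final coordinate transformation between pixel and axis coordinates can be sketched as a linear mapping fitted from two OCR-recognized ticks per axis; linear axes and the tick format below are assumptions for illustration.

```python
import numpy as np

def pixel_to_axis(points_px, x_ticks, y_ticks):
    """Map thinned-curve pixel coordinates to data coordinates.

    points_px : (N, 2) array of (col, row) pixel positions on the thinned curve
    x_ticks   : two recognized x-axis ticks, ((pixel_col, value), (pixel_col, value))
    y_ticks   : two recognized y-axis ticks, ((pixel_row, value), (pixel_row, value))
    Assumes linear axes; logarithmic axes would need an extra transform.
    """
    (xp0, xv0), (xp1, xv1) = x_ticks
    (yp0, yv0), (yp1, yv1) = y_ticks
    pts = np.asarray(points_px, dtype=float)
    x = xv0 + (pts[:, 0] - xp0) * (xv1 - xv0) / (xp1 - xp0)
    # image rows grow downward, so the same linear form still applies to the y ticks
    y = yv0 + (pts[:, 1] - yp0) * (yv1 - yv0) / (yp1 - yp0)
    return np.stack([x, y], axis=1)

# usage: ticks at columns 50/590 labeled 0/10 and rows 430/30 labeled 0/100 (hypothetical)
data = pixel_to_axis([(320, 230)], ((50, 0.0), (590, 10.0)), ((430, 0.0), (30, 100.0)))
# -> [[5.0, 50.0]]
```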
摘要:Objective: Fully supervised semantic segmentation based on deep learning has made remarkable progress, promoting practical applications such as automatic driving and medical image analysis. However, fully supervised methods depend on complete pixel-wise annotation, and constructing large-scale pixel-wise annotation datasets requires a considerable amount of human labor and resources. To reduce the reliance on accurate annotations, researchers have recently studied semantic segmentation based on cheaper forms of supervision, such as bounding boxes, scribbles, points, and image-level labels. Weakly supervised semantic segmentation based on image-level labels uses only category labels to train the segmentation network, which can significantly reduce the annotation cost. Most existing weakly supervised semantic segmentation methods use the class activation map (CAM) to locate target objects. On the one hand, the CAM generated by a classification network is sparse and only focuses on the most discriminative areas of objects, and some mis-activated pixels in the CAM may provide improper guidance for the subsequent segmentation task. On the other hand, the performance of the segmentation network depends on the quality of the pseudo labels; obtaining accurate pseudo labels requires the shape and boundary of the object, but this information cannot be directly and accurately obtained from image-level labels, so guaranteeing the quality of pseudo labels is difficult. A new saliency-guided weakly supervised semantic segmentation algorithm is proposed in this paper to obtain complete CAMs and improve the performance of the segmentation model. Method: First, research shows that randomly hiding the target in the image can enhance the capability of the network to locate the complete target. However, part of the image information cannot be used when the image is directly hidden at random, whereas complementary hiding can use all the image information. Still, because of the randomness of the hiding method, guaranteeing that the target object is hidden as expected is difficult; in some cases only the background area is hidden. A saliency-guided object complementary hiding method is therefore proposed: guided by the foreground information provided by the saliency map, the object in the image is complementarily and randomly hidden to obtain complementary image pairs. The CAMs of the complementary image pairs are then fused as supervision to improve the capability of the network to obtain complete CAMs. Second, the convolution operations in the classification network used to generate CAMs have local receptive fields, which may cause differences in the features of same-class objects under changes in scale, illumination, and viewing angle. These differences may result in intra-class inconsistency, negatively affecting the activation and leading to mis-activation in the CAM. In addition, the classification network itself has a weak capability to extract complete objects, and expanding the object area using the complementary hiding method guided only by saliency is still difficult. Therefore, a dual attention refinement module is introduced to further correct the CAM using global information, and the obtained CAM is used to generate the pseudo labels that train the segmentation network.
The predictions of the segmentation network have higher accuracy than the original pseudo labels. However, the predictions also contain noise, so directly using them for iterative training cannot guarantee an improvement in segmentation performance. Finally, this paper adopts a label iteration refinement strategy that combines the initial prediction of the segmentation network, the CAM, and the saliency map to generate pseudo labels and iteratively trains the segmentation network to further improve its performance. Saliency maps can effectively distinguish foreground from background but cannot identify object categories; CAMs can accurately locate object categories but lack information on the complete shape of objects; segmentation network predictions can provide relatively complete object boundaries but may contain misclassified pixels. The impact of pixel misclassification is markedly reduced by fully utilizing the information provided by these three types of maps to refine the pseudo labels. Result: The experiment is divided into two parts to verify the effectiveness of the algorithm. In the first part, the proposed CAM generation algorithm is verified and compared with other methods. In the second part, the proposed method is compared with several classical weakly supervised semantic segmentation algorithms, and the effectiveness of the modules in the proposed model is analyzed through ablation experiments. The experiments are first conducted on the PASCAL VOC 2012 dataset. The CAM generated by our algorithm is more complete, and its mean intersection over union (mIoU) is improved by 10.21% compared with the baseline. The segmentation network produces better predictions than the six compared methods, demonstrating a 6.9% improvement over the baseline, and the proposed method outperforms the other methods in 13 categories. With an mIoU of 92% for the background category, the proposed method achieves the highest performance among the compared methods, indicating its effective utilization of saliency maps during training. A multi-object semantic segmentation experiment is also conducted on the COCO 2014 dataset. Compared with PASCAL VOC 2012, this dataset has richer categories and contains a larger number of images with multiple object categories, placing higher demands on the algorithm. Experimental results show that mIoU is improved by 0.5% on COCO 2014. Conclusion: The proposed algorithm can obtain complete CAMs, effectively alleviate the problem of insufficient supervision information, and improve the accuracy of weakly supervised semantic segmentation models.
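The saliency-guided complementary hiding and CAM fusion steps can be sketched as follows; the patch size, foreground threshold, hiding probability, and element-wise max fusion are assumptions for illustration rather than the paper's exact settings.

```python
import torch

def complementary_hide(image, saliency, patch=32, p_hide=0.5):
    """Saliency-guided complementary hiding (illustrative sketch).

    Patches whose mean saliency marks them as foreground are randomly split into
    two complementary sets; each set is erased in one of the two copies, so every
    foreground patch is hidden in exactly one image of the pair.
    image    : (C, H, W) tensor, saliency : (H, W) tensor with values in [0, 1]
    """
    c, h, w = image.shape
    img_a, img_b = image.clone(), image.clone()
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            if saliency[y:y + patch, x:x + patch].mean() > 0.5:   # foreground patch (assumed threshold)
                target = img_a if torch.rand(1).item() < p_hide else img_b
                target[:, y:y + patch, x:x + patch] = 0.0
    return img_a, img_b

def fuse_cams(cam_a, cam_b):
    """Fuse the CAMs of the complementary pair (element-wise max, assumed)."""
    return torch.maximum(cam_a, cam_b)

# usage sketch
img = torch.rand(3, 256, 256)
sal = torch.rand(256, 256)
hidden_a, hidden_b = complementary_hide(img, sal)
```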
摘要:Objective: Human motion recognition with deep learning has become a research hotspot in computer vision because of its extensive applications in video surveillance, virtual reality, and intelligent human-computer interaction. Deep learning has achieved excellent results in feature extraction from static images and has gradually been extended to behavior recognition. Traditional research on human behavior recognition focuses on depth image sequences. Depth images can not only capture 3D information but also provide depth information, which represents the distance between the target and the depth camera within the visual range and is insensitive to external factors such as lighting and background. Although depth images can capture 3D information, most depth-image algorithms use multi-view methods to extract behavior features. The extraction of spatiotemporal features is affected by the angle and number of views, which considerably reduces the utilization of 3D structural information, and the spatiotemporal structure information of the 3D data is largely lost. With the rapid development of 3D acquisition technology, 3D sensors, including various types of 3D scanners and LiDAR, are becoming increasingly accessible and affordable. The 3D data collected by these sensors provide rich geometry, shape, and scale information and have many applications in fields such as autonomous driving, robotics, remote sensing, and healthcare. The point cloud is a commonly used 3D representation; it retains the original geometric information in 3D space without any discretization and is therefore the preferred representation for scene understanding in many applications, such as autonomous driving and robotics. However, deep learning on 3D point clouds still faces major challenges, such as small dataset sizes. Method: In this study, the depth map sequence is first converted into a 3D point cloud sequence to represent human behavior information, and large, authoritative depth datasets are converted into point cloud datasets to compensate for the small size of existing point cloud datasets. Given the huge amount of point cloud data, traditional point cloud deep learning networks sample the point cloud before feature extraction, most commonly via random subsampling, which inevitably destroys some of the structural information of the point cloud. To improve the utilization of temporal and spatial structure information and compensate for the information lost during random subsampling, a point cloud human behavior recognition network that combines coordinate transformation and spatiotemporal information injection is proposed for motion recognition. The network consists of two modules: a feature extraction module and a spatiotemporal information injection module. The feature extraction module extracts deep appearance and contour features of the point cloud through operations such as the abstraction manipulation layer, multilayer perceptron, and max pooling, where the abstraction manipulation layer includes sampling, grouping, convolutional block attention module (CBAM), and PointNet layers.
In the spatiotemporal information injection module, temporal and spatial structure information is injected into the abstract features. When temporal information is injected, sine and cosine functions of different frequencies are used as temporal position codes, because these codes are unique and robust to the position of each vector in an unordered sequence. During spatial structure information injection, the position-encoded abstract features are multiplied by a group of learnable, normally distributed random tensors and projected onto the corresponding dimensional space. The coefficients of the random tensors are then learned through the network to find the optimal projection space that better captures the structural relations between point clouds. Subsequently, the features enter an inter-point attention module to further learn the structural relationships between point cloud points. Finally, the multilevel features from feature extraction and information injection are aggregated and input into the classifier for classification. Result: A large number of experiments are performed on three common datasets, and the proposed network structure exhibits good performance. The accuracy on the NTU RGB+D 60 dataset is 1.3% and 0.2% higher than those of PSTNet and SequentialPointNet, respectively, considerably exceeding the recognition accuracy of other networks. Although the accuracy on the NTU RGB+D 120 dataset is 0.1% lower than that of SequentialPointNet, it remains in a leading position compared with other networks, and the recognition accuracy of the proposed network is 1.9% higher than that of PSTNet. The NTU datasets are among the largest human action datasets. To verify the robustness of the network model on small datasets, an experimental comparison is also performed on the small MSR Action3D dataset; the recognition accuracy of the proposed network is 1.07% higher than that of SequentialPointNet and considerably higher than those of other networks. Conclusion: In this study, we propose a point cloud human behavior recognition network that combines coordinate transformation and spatiotemporal information injection for behavior recognition. Through coordinate transformation, the depth map sequence is converted into a 3D point cloud sequence to characterize human behavior information, compensating for the shortcomings of insufficient depth information, spatial information, and geometric features and improving the utilization of spatiotemporal structure information. The proposed network not only obtains static point cloud contour features but also integrates dynamic temporal and spatial information to compensate for the temporal and spatial losses caused by sampling during feature extraction.
关键词:human behavior recognition;coordinate transformation;point cloud sequence;feature extraction;spatiotemporal information
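The temporal injection described above uses sine and cosine codes of different frequencies, which can be sketched with the standard sinusoidal position encoding; the frame count and feature dimensions below are assumptions for illustration.

```python
import torch

def temporal_position_encoding(n_frames, dim):
    """Standard sine/cosine position codes, one per point-cloud frame (dim assumed even)."""
    pos = torch.arange(n_frames, dtype=torch.float32).unsqueeze(1)   # (T, 1)
    idx = torch.arange(0, dim, 2, dtype=torch.float32)               # even feature indices
    div = torch.pow(10000.0, idx / dim)
    pe = torch.zeros(n_frames, dim)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe                                                        # (T, dim)

# inject timing information into per-frame abstract features (shapes assumed)
features = torch.randn(24, 1024, 128)            # (frames, points, feature dim)
features = features + temporal_position_encoding(24, 128)[:, None, :]
```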
摘要:Objective: Cancer is the second leading cause of death worldwide, with nearly one in five cancer patients dying from lung cancer. Many cancers have a high chance of cure through early detection and effective therapeutic care. However, the atypical early symptoms of lung cancer can easily cause the optimal treatment window to be missed. Successful identification of benign and malignant lesions allows treatment procedures to reduce the risk of death. Manual diagnosis of lung cancer is a time-consuming and error-prone process, so effective and accurate lung cancer detection techniques are becoming increasingly important in computer-aided diagnosis. Method: Computed tomography (CT) is a common clinical modality that examines lung conditions by localizing lesion structures through anatomical information, and positron emission tomography (PET) can reveal the pathophysiological features of lesions by detecting glucose metabolism. Combined PET/CT has been shown to be effective in cases where conventional imaging is inadequate, identifying and localizing lesions simultaneously, which improves accuracy and clinical value. However, in PET/CT images of lung cancer, the adhesion of tumors to surrounding tissue leads to blurred edges and low contrast, and problems such as small lesion areas and uneven size distribution are encountered. A cross-modal attention YOLOv5 (CA-YOLOv5) model for lung cancer detection is proposed in this paper to address these problems. The model focuses on the following. First, a two-branch parallel self-learning attention is designed in the backbone network to learn the scaling factor using instance normalization and to calculate the amount of information contained in each feature using the difference between the feature and the average value. Self-learning attention enhances cancer features and improves contrast. Second, cross-modal attention is designed to facilitate the interactive learning of multimodal features and fully learn the dominant information of the 3D multimodal images. A Transformer is designed to model the long-range interdependence of deep and shallow features and learn key functional and anatomical information to improve lung cancer recognition. Third, a dynamic feature enhancement module is established to address small lesion areas and uneven size distribution by using multibranch grouped dilated and deformable convolutions with different receptive fields, enabling the network to fully and efficiently mine the multiscale semantic information of lung cancer features. Result: In a comparison with 10 other methods, CA-YOLOv5 obtains the best performance, with 97.37% precision, 94.01% recall, 96.36% mean average precision (mAP), and a 95.67% F1 score on the PET/CT lung cancer dataset, and its training time on the same device is the shortest. Compared with YOLOv5, the four indexes are improved by 2.55%, 4.84%, 3.53%, and 3.49%, respectively. On the precision-recall (PR) curve, the area of the proposed model is optimal for every category, and the model also encloses the largest area on the F1 curve, maintaining a high F1 score at high confidence levels. The heat maps of the proposed model not only identify all the labels but also localize them accurately.
On the LUNA16 dataset, the proposed model obtains the highest performance, with 97.52% accuracy and 97.45% mAP, and its PR curve has the largest overall coverage. Conclusion: This paper establishes CA-YOLOv5, a lung cancer detection model. A lightweight and effective self-learning attention mechanism is designed to enhance cancer features and improve contrast. A Transformer is also added at the end of the backbone network to exploit the advantages of convolution and self-attention and extract local and global information from deep and shallow features. Dynamic feature enhancement modules are constructed in the feature enhancement neck to fully and efficiently mine the multiscale semantic information of lung cancer features. Experimental results on the two datasets show that the proposed model has superior lung cancer recognition and strong network characterization capabilities, effectively improving detection accuracy and reducing the missed detection rate. Thus, the model effectively facilitates computer-aided diagnosis and improves the efficiency of preoperative preparation. The effectiveness and robustness of the model are further verified using the heat map visualization technique and the LUNA16 dataset, respectively.
关键词:YOLOv5 detection;self-learning attention;cross-modal attention;dynamic feature enhancement module;PET/CT lung cancer datasets
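The two-branch self-learning attention (a scaling factor learned through instance normalization and an information measure based on the deviation of each feature from its mean) might be sketched as follows; the way the two branches are combined, the squared-deviation measure, and the residual connection are assumptions for illustration, not the paper's definition.

```python
import torch
import torch.nn as nn

class SelfLearningAttention(nn.Module):
    """Illustrative two-branch attention: one branch scales features through a
    learnable instance normalization, the other weights positions by how far
    each feature deviates from its channel mean (details assumed)."""
    def __init__(self, channels):
        super().__init__()
        self.inorm = nn.InstanceNorm2d(channels, affine=True)  # learnable scale/shift
        self.act = nn.Sigmoid()

    def forward(self, x):
        # branch 1: scaling learned through instance normalization
        scale = self.act(self.inorm(x))
        # branch 2: "information" measured as squared deviation from the channel mean
        info = self.act((x - x.mean(dim=(2, 3), keepdim=True)).pow(2))
        return x * scale * info + x      # residual keeps the original features

# usage sketch
feat = torch.randn(1, 64, 40, 40)
out = SelfLearningAttention(64)(feat)    # same shape, re-weighted features
```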
摘要:Objective: Breast cancer recognition based on deep learning is a challenging task because of the large size of breast histopathology images (a single image is approximately 1 GB). Thus, under current computational power limitations, these images must be cut into patches before recognition. Current research on breast cancer recognition focuses on single-scale networks, ignoring the multiple magnifications and pyramidal storage structure of breast histopathology images. The few studies on multiscale networks only feed images of different magnifications into the network and concatenate or aggregate the resulting features after multiple convolutional layers. Such feature fusion is simplistic and ignores both the correlation between images of different scales and the guidance they can provide to each other when texture features are extracted in the shallow layers of the network. Therefore, problems such as low feature utilization and a lack of information interaction exist between images of different magnifications. Method: This paper proposes a convolutional neural network improvement strategy based on multiscale and group attention mechanisms to address these problems. The strategy mainly comprises two modules: an information interaction module and a feature fusion module. The first module extracts clear cell morphological structures and global context information from high- and low-magnification images, respectively, through a spatial attention mechanism. Feature information that is highly relevant to the classification target of the main branch is given additional weight; these features are then weighted and accumulated, and the results are fed back to the original branch for dynamic selection to achieve feature interaction and circulation. The second module considers that the number of channels in the feature maps multiplies as the network deepens, and that general channel attention suffers from heavy computation and a low feature activation rate. Therefore, this paper proposes group attention based on group convolution and incorporates it into the feature fusion module. In addition, images at different magnifications have different receptive fields (i.e., the actual length represented by each pixel differs), so this paper uses a feature pyramid to eliminate the receptive field difference during feature fusion. Result: The above strategy is applied to a variety of convolutional neural networks and compared with the latest methods. A fivefold cross-validation experiment is conducted on the public Camelyon16 dataset, and the mean and standard deviation are calculated for each evaluation metric. Compared with the single-scale convolutional networks, the proposed method demonstrates a 0.9%–1.1% improvement in accuracy and a 1.1%–1.2% improvement in F1-score. Compared with TransPath, the best-performing single-scale network, the enhanced DenseNet201 in this paper demonstrates a 0.6% improvement in accuracy, 0.8% in precision, and 0.6% in F1-score, and the standard deviations of these indicators are lower than those of TransPath, indicating that the network incorporating the strategy has better stability. Conclusion: Overall, the proposed strategy can compensate for the shortcomings of general multiscale networks and has a certain generality, achieving superior performance in breast cancer image classification.
Thus, this strategy is useful for future multiscale research and feature extraction for downstream tasks.
关键词:classification of breast pathological images;dense convolutional network;multiscale;attention;fusion of features
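The group attention built on group convolution can be sketched as a channel attention whose squeeze-and-excitation projections are grouped, which reduces computation relative to full channel attention; the group count and reduction ratio below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GroupAttention(nn.Module):
    """Channel attention computed with grouped 1x1 convolutions, a lighter
    alternative to full channel attention (configuration assumed)."""
    def __init__(self, channels, groups=8, reduction=4):
        super().__init__()
        assert channels % groups == 0 and (channels // reduction) % groups == 0
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, groups=groups, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, groups=groups, bias=False),
            nn.Sigmoid())

    def forward(self, x):
        w = self.fc(self.pool(x))   # (B, C, 1, 1) per-channel weights, computed group-wise
        return x * w

# usage sketch
feat = torch.randn(2, 256, 32, 32)
out = GroupAttention(256)(feat)     # same shape, channel-reweighted
```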
摘要:Objective: Alzheimer's disease (AD) is a neurodegenerative disease that commonly occurs in middle-aged and elderly populations and is accompanied by cognitive impairment and memory loss. With the aging of the global population, timely diagnosis of AD, together with visualization and accurate localization of its pathological regions, is of considerable clinical importance. In current research, one conventional approach extracts patch-level features based on voxel morphology and prior knowledge to detect structural changes and identify AD-related voxel structures. Another approach learns AD-related pathological regions by focusing the network on specific brain regions of interest (e.g., cortical and hippocampal regions) based on regional features. However, these approaches ignore other pathological locations in the brain and fail to obtain accurate global structural information for AD diagnosis. A joint learning model for the localization and diagnosis of AD pathological regions is proposed using the idea of counterfactual reasoning to obtain a convincing model architecture and increase the interpretability of the output by highlighting the pathological regions. An attention-guided cycle generative adversarial network (ACGAN) is constructed based on a foreground-background attention mask. Method: In the vast majority of image classification methods, the network model aims to find which part of the input X influences the decision of the classifier so that the final result is Y. From another viewpoint, one may ask: in a hypothetical scenario where the input X were instead C, would the result be Z rather than Y? This idea is defined as counterfactual reasoning. The AD classification model is first trained as the classifier whose output is to be constructed under the hypothetical scenario, and the pathological features of AD are then obtained. The hypothetical scenario is constructed using a generative adversarial network that learns the mapping of images from the source domain to the target domain. However, achieving good results through direct image-to-image translation is difficult because of the complexity of whole-brain structural magnetic resonance imaging (sMRI) images and the considerable amount of information in 3D space. Drawing inspiration from CycleGAN and AttentionGAN, the image can be mapped from the source to the target domain by changing only the region of the original image that affects the category judgment, and foreground-background attention is used to guide the model to focus on the dynamically changing region, which reduces the complexity of the model and facilitates model fitting. Therefore, this paper proposes an attention-guided cycle generative adversarial network to construct a counterfactual mapping model for AD and output the corresponding pathological regions. A counterfactual map conditioned on the target label (i.e., the hypothetical scenario) is generated and added to the input image so that the transformed image is diagnosed as the target type. For example, when the counterfactual map is added to the sMRI image of a subject with AD, modifying the corresponding region changes the input sMRI image so that the classifier diagnoses it as a normal subject. The pathological regions represented by the counterfactual map are used as privileged information (i.e., the location information of the counterfactual map influences the category determination) to further optimize the diagnostic model.
Therefore, the diagnostic model focuses on learning and discovering disease-related discriminative regions, thereby combining the pathological region generation and AD diagnosis models. Result: The proposed model is evaluated against traditional convolutional neural network (CNN) models and several highly advanced AD diagnostic models on the publicly available ADNI dataset using quantitative evaluation metrics, including accuracy (ACC), F1-score, and area under the curve (AUC). Experimental results show that the model improves ACC, F1-score, and AUC by 3.60%, 5.02%, and 1.94%, respectively, compared with the best-performing method. The generated pathological region images are also evaluated qualitatively and quantitatively, and the normalized correlation scores and peak signal-to-noise ratios of the pathological region images obtained by the method are better than those of the compared methods. More importantly, the proposed AD diagnostic model visualizes the global features and fine-grained discriminative regions of the pathological areas compared with the benchmark model, and the average accuracy over three iterations is improved by 4.90%, 11.03%, and 11.08% compared with the benchmark method. Conclusion: Compared with existing methods, the ACGAN model can learn the transformation of sMRI images between the source and target domains and accurately capture global features and pathological regions. The learned knowledge of the pathological regions is used to improve the AD diagnosis model, which therefore achieves excellent classification performance.
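The counterfactual transformation described above (adding a target-label-conditioned counterfactual map to the input under a foreground-background attention mask) can be sketched as follows; the masking formula and tensor shapes are assumptions inspired by CycleGAN/AttentionGAN-style generators, not the exact ACGAN definition.

```python
import torch

def apply_counterfactual(x, cf_map, attention):
    """Add a counterfactual map only inside the attended (foreground) region,
    leaving the background untouched (formulation assumed for illustration).

    x         : input sMRI volume, (B, 1, D, H, W)
    cf_map    : generator output, same shape as x
    attention : foreground mask in [0, 1], same shape as x
    """
    return attention * (x + cf_map) + (1.0 - attention) * x

# the transformed scan should be diagnosed as the target class by the frozen classifier
x = torch.randn(1, 1, 64, 64, 64)
cf = torch.randn_like(x) * 0.1
attn = torch.rand_like(x)
x_target = apply_counterfactual(x, cf, attn)
```

During training, a classification loss on the transformed scan with the target label, together with adversarial and cycle-consistency terms, would push the counterfactual map toward the pathological regions.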