Abstract: Deep learning has achieved great success in many fields. However, the solid data-fitting ability of deep learning hides the unexplained phenomenon of “shortcut learning”, which leads to the vulnerability of the deep model. Many studies have shown that if an attacker adds slight perturbations to normal data that human beings cannot perceive, the model may produce catastrophic wrong output, which severely limits the application of deep learning in security-sensitive fields. Therefore, to deal with the threat of malicious attacks, an adversarial defense should be set up, and the robustness of the model should be improved. In this regard, researchers have proposed a variety of adversarial defense methods. The existing defense methods for deep neural networks can be divided into three categories, namely, modifying-input-data-based methods, directly-enhancing-network-based methods, and adversarial-training-based methods. Modifying-input-data-based defense methods aim to alter the input in advance and reduce the attack intensity at the input end via denoising or image transformation, among others. Despite showing a certain anti-attack ability, this method is not only limited by the attack intensity but also faces the problem of over-correcting the normal input data. The former limitation hinders this method from dealing with slight disturbances that human beings cannot perceive, while the latter problem exposes this method to the risk of making wrong judgments on normal data, thereby reducing its classification accuracy. Directly-enhancing-network-based methods directly improve the anti-attack capability of the network by adding subnetworks and by changing the loss function, activation function, batch normalization layer, or network training process. Adversarial-training-based methods are typical heuristic defense methods compared with the other two. These methods inject the adversarial attack and adversarial defense into one framework, wherein adversarial examples are initially generated by attacking the existing models. Afterward, these adversarial examples are used to train the target model to produce an accurate output for these examples and enhance its robustness. Therefore, this paper primarily focuses on adversarial training. Apart from showing a certain ability to defend against attacks, adversarial training also improves the robustness of the model at the cost of reducing its classification or recognition accuracy for normal data. Many researchers find that the more robust the model is, the lower its classification or recognition accuracy for normal examples becomes. In addition, the defense effect of the current adversarial training remains unsatisfactory for strong adversarial attacks with diversified attack modes. To address this issue, recent studies have improved the standard adversarial training from different perspectives. For instance, some studies have generated adversarial examples with high diversity or portability in the attack stage. To enhance model robustness, many scholars have combined adversarial training with network enhancement to resist an adversarial attack. This process involves network structure modification, model parameter adjustment, and adversarial training acceleration, which help the model resist different types of attacks. The standard adversarial training only considers the classification of adversarial examples in the defense stage and ignores the classification accuracy for the original examples.
In this connection, many works not only introduce the spatial or semantic consistency constraints between the original and adversarial examples but also require the model to produce an accurate output with respect to the latter, thus ensuring that the model considers both robustness and accuracy. To enhance the transferability of the model, curriculum learning, reinforcement learning, metric learning, and domain adaptation technologies are integrated into adversarial training. This paper then comprehensively reviews adversarial training technologies. First, the basic framework of adversarial training is elaborated. Second, typical methods and critical technologies for the generation of adversarial samples are reviewed. We summarize the adversarial examples generation methods based on image space, feature space, and physical space attacks. To improve the diversity of adversarial examples, we also introduce interpolation- and reinforcement-learning-related adversarial example generation strategies. Given that standard adversarial training is extremely time consuming, we briefly describe optimization strategies based on temporal, spatial, and spatiotemporal mixed momentum, which are conducive to improving training efficiency. Defense is the fundamental problem of adversarial training that is devoted to absorbing the generated adversarial examples for training via loss minimization. Therefore, we briefly review the technologies typically used in the defensive training stage, including the loss regularization term, model enhancement mechanism, parameter adaptation, early stop, and semi-supervised or unsupervised expansion strategies. To evaluate the robustness of the model, we summarize the popular datasets and typical attack methods. After sorting out relevant adversarial training technologies, we still face challenges in dealing with multi-disturbance integrated attacks and the low efficiency of the model. We put forward these problems as directions for future research on adversarial training.
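To make the basic attack-then-defend framework concrete, the following is a minimal PGD-style adversarial training sketch in PyTorch. It is an illustrative sketch rather than the procedure of any specific surveyed method; the model, loss, and hyperparameters (epsilon, step size, number of steps) are assumptions.

```python
import torch
import torch.nn as nn

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Inner maximization: craft adversarial examples inside an L-inf ball around x."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = nn.functional.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()                     # ascent step
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)  # project back
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    """Outer minimization: train the target model on the generated adversarial examples."""
    model.eval()
    x_adv = pgd_attack(model, x, y)
    model.train()
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The consistency-constrained variants discussed above would add a term on the clean batch (or on the distance between clean and adversarial features) to the outer loss.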
Abstract: Skeleton-based human action recognition aims to correctly analyze the classes of actions from skeleton sequences, which contain one or more actions. Skeleton-based human action recognition has recently emerged as a hot research topic in the field of computer vision. Because actions can be used to perform tasks and express human emotions, action recognition can be widely applied in various fields, such as intelligent monitoring systems, human-computer interaction, virtual reality, and smart healthcare. Compared with RGB-based human action recognition, skeleton-based human action recognition methods are less affected by interference factors, such as background and human appearance, and have higher accuracy and robustness. In addition, these methods require a small amount of data and show a high computational efficiency, thereby increasing their prospects in practical applications. In this case, comprehensively and systematically summarizing and analyzing skeleton-based human action recognition methods becomes critical. Compared with other reviews on skeleton-based action recognition, our contributions are as follows: we provide a more comprehensive summary of skeleton-based action datasets; we provide a more comprehensive summary of skeleton-based action recognition methods, including the latest Transformer technology; we offer a more instructive classification of graph convolutional methods; and we not only summarize the existing problems but also forecast the prospects for future research. First, we introduce nine datasets that are commonly used for skeleton-based action recognition, including the MSR Action3D, MSR Daily Activity 3D, 3D Action Pairs, SYSU 3DHOI, UTD-MHAD, Northwestern-UCLA, NTU RGB+D 60, Skeleton-Kinetics, and NTU RGB+D 120 datasets. To highlight the characteristics of these datasets, we divide them into single-view and multi-view datasets from the data collection perspective and then explore the traits and uses of each category. Second, based on the backbone network used by the models, we categorize the skeleton-based action recognition methods into those based on handcrafted features, recurrent neural network (RNN), convolutional neural network (CNN), graph convolutional network (GCN), and Transformer. Before the rise of deep learning methods, traditional algorithms (handcrafted features) were often used to model human skeleton data. The key problem in using such methods is how to create an effective feature representation of human skeleton sequences. However, after the rise of deep learning methods, which demonstrate excellent performance in various fields, such as face recognition, image classification, and image super-resolution, researchers have begun using deep learning networks to model skeleton data. Among them, RNN effectively processes data in the form of continuous time series and is adept at learning temporal dependency information in sequence data, while CNN can effectively learn high-level semantic information of skeleton data. Training a CNN-based model requires lower computational costs than an RNN-based model. Unlike RNN-based methods, before using CNN, the skeleton data should be reshaped into pseudo-images. The columns of the pseudo-image represent the features of all joints in one frame, while the rows represent the features of a certain joint across all frames. However, when RNN or CNN methods are used to model skeleton data, the topological structure of the human skeleton is ignored.
Transforming the skeleton data into sequence vectors of joint coordinates or a 2D grid cannot accurately describe the dynamic skeleton of the human body. Previous studies show that graph convolution has a powerful ability to model topological graph structures, making this method particularly suitable for modeling the human skeleton. Given their successful application, graph convolutional methods have been widely used in skeleton-based action recognition. This paper specifically adopts a novel inductive approach and provides a comprehensive review of GCN-based methods. These GCN-based methods are further classified according to the problems targeted in the literature with the aim of providing researchers with additional ideas and methods. These studies can be divided into optimization of the graph structure, network lightweighting, optimization of temporal and spatial features, and optimization of missing and noisy joints. This paper also provides a comprehensive summary of the issues faced by the currently available methods. This paper not only points out the limitations and challenges faced by these methods but also evaluates the future development trend and provides insightful prospects for the field. By doing so, this review not only helps readers gain a deep understanding of the current state of this task but also provides valuable guidance for future research in this area.
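As a minimal illustration of why graph convolution suits skeleton data, the sketch below aggregates joint features over a normalized adjacency matrix that encodes the bone connections. The five-joint toy skeleton, feature sizes, and single-layer design are hypothetical and not taken from any surveyed model.

```python
import torch
import torch.nn as nn

class SkeletonGraphConv(nn.Module):
    """One spatial graph convolution: X' = ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        a_hat = adjacency + torch.eye(adjacency.size(0))     # add self-loops
        norm = torch.diag(a_hat.sum(dim=1).pow(-0.5))
        self.register_buffer("a_norm", norm @ a_hat @ norm)  # symmetric normalization
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):                 # x: (batch, joints, in_dim)
        return torch.relu(self.linear(self.a_norm @ x))

# Toy 5-joint chain skeleton (hypothetical): 0-1-2-3-4
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
A = torch.zeros(5, 5)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

layer = SkeletonGraphConv(in_dim=3, out_dim=16, adjacency=A)
joints = torch.randn(8, 5, 3)             # batch of 8 frames, 5 joints, xyz coordinates
print(layer(joints).shape)                # torch.Size([8, 5, 16])
```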
Abstract: Image composition has always been a research hotspot in the field of image processing and has a wide range of application prospects. This process involves accurately extracting the foreground objects in an image and compositing them with a new background image. However, traditional image composition methods are often time consuming and labor intensive. Users not only need to manually complete the accurate extraction and reasonable placement of foreground objects but also need to adjust the lighting conditions, saturation, edge details, shadows, and other information of foreground objects to make the image quality close to that of the real image. With the development of deep learning technology in recent years, image composition technology has found increasingly wide application and has demonstrated its efficiency. To promote the research and development of image composition technology based on deep learning, this paper expounds four main problems faced in current image composition tasks. First, the foreground object adaptation problem mainly involves foreground object size adjustment, spatial position placement, blurred edge detail processing of foreground objects, and unreasonable mutual occlusion of foreground and background. The current deep learning methods for solving this problem include appropriate bounding box prediction for foreground objects in background images, spatial transformation networks, foreground object location distribution prediction and adversarial training, image fusion technology, and guided placement based on domain information. Second, the foreground object harmonization problem mainly concerns the non-uniformity in the visual information, such as illumination, color, saturation, and contrast, of the foreground and background images after image composition. The current deep learning methods for solving this problem include the attention-based guidance mechanism, domain-information-based verification and discrimination methods, codecs, the context-dependent capabilities of Transformers, assisting input with high dynamic range (HDR), and methods borrowed from style transfer. Third, the foreground object shadow harmonization problem mainly involves generating the missing shadows of foreground objects in composite images. The current deep learning methods for solving this problem include image-rendering-based methods, shadow generation using generative adversarial networks, methods relying on background ambient lighting assistance, and attention-based methods and mechanisms. Fourth, the habitat adaptation problem between the foreground object and background mainly focuses on biological information matching, which should be considered when compositing foreground objects and background images. Whether foreground objects, such as animals and plants, can be composited in background images is the first problem that should be considered in image composition tasks. The background image selection of an object cannot deviate from its corresponding habitat information. For instance, seagulls do not appear in the desert, and flowers do not grow from ice and snow. The foreground object adaptation problem can be regarded as the key problem in image composition. As long as the foreground objects are correctly and reasonably composited, the subsequent optimization task of the composite image can be performed efficiently. Effectively solving the visual harmonization problem of foreground objects can further improve the authenticity of composite images from the perspective of users.
The most important problem to be considered is the adaptation of the foreground and background habitats. Objects and background images cannot be chosen arbitrarily but need to satisfy the logical relationship of reality, that is, to satisfy habitat adaptation, which can be regarded as the primary task in an image composition task. If the habitat information does not fit, then the foreground object and background scenes lose their logical authenticity, and all subsequent tasks fail to make the composite image realistic. This study summarizes the current deep learning methods, publicly available datasets, and evaluation indices for each of the above problems, compares the different deep learning methods, and introduces the application of image synthesis technology. A composite image not only reduces the cost of real data acquisition but also improves the generalization ability of the model. The shortcomings of image composition technology based on deep learning are also analyzed, feasible research suggestions are put forward, and the future development direction of image synthesis technology is forecasted.
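For reference, the core cut-and-paste operation that the learning-based adaptation, harmonization, and shadow-generation methods above all refine can be written as a simple alpha blend. The array shapes and random inputs below are purely illustrative.

```python
import numpy as np

def alpha_composite(foreground, background, alpha):
    """Blend an extracted foreground onto a new background.

    foreground, background: float arrays in [0, 1] with shape (H, W, 3)
    alpha: soft foreground mask in [0, 1] with shape (H, W, 1)
    """
    return alpha * foreground + (1.0 - alpha) * background

# Illustrative 4x4 example with random content.
fg = np.random.rand(4, 4, 3)
bg = np.random.rand(4, 4, 3)
mask = np.zeros((4, 4, 1))
mask[1:3, 1:3] = 1.0                      # hard mask of a 2x2 "object"
composite = alpha_composite(fg, bg, mask)
print(composite.shape)                    # (4, 4, 3)
```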
Abstract: Objective: Image inpainting is a process of reconstructing the missing regions of corrupted images, which can make the images visually complete and semantically plausible. This process is widely used in many applications, such as object removal, old photo restoration, and image editing. Until now, deep-learning-based inpainting methods have achieved good performance on natural and human face images. Nevertheless, the methods used to ensure consistency in the image texture and structure have limitations in text image inpainting because they do not focus on the text itself. Meanwhile, studies on text images have mainly concentrated on text image super-resolution, text detection, and text recognition. However, many ancient documents contain broken text regions, which present an obstacle for downstream detection or recognition tasks and for the digital protection of ancient literature. Therefore, reconstructing broken text on images is worthy of further study. This paper proposes a novel text image inpainting model guided by text structure priors to solve the above problem. Method: First, the model proposes a structure prior reconstruction network. Given that the text skeleton contains important text stroke information and that the text edge contains texture and structure information, the network chooses both of these priors to guide the inpainting. Due to the limitation of convolutional neural network (CNN) receptive fields, the network applies Transformer to capture the long-term dependency of the text image and reconstructs robust and readable text skeleton and edge images based on the useful feature information extracted from the masked RGB image, the masked text skeleton, and the masked edge image. To reduce the computational cost caused by self-attention in Transformer, the network first downsamples the input image and then sends the compressed features to sequential Transformer layers. The network then upsamples these features to recover the prior images. To construct an accurate text skeleton, the network is trained by the combination of the binary cross-entropy loss function and the Dice loss function. Second, to explore the sequence feature information of the text itself on the images, this paper designs a static-to-dynamic residual block (StDRB). The text image inpainting network adopts an encoder-decoder as the main architecture and integrates sequential StDRBs to enhance the inpainting performance. The text skeleton image and edge image contain significant text stroke and structure information about the whole image, and the StDRB module can make use of this prior information to effectively help the inpainting. First, the input image is sent to the CNN encoder to obtain the static fused features. StDRB then converts the static fused features into dynamic text sequence features. By assuming that the text follows a pseudo-dynamic process from left to right and top to bottom, StDRB uses bi-directional gated recurrent units along the horizontal and vertical directions in parallel to extract useful text semantic information. The residual block also deepens the network and facilitates network convergence. Finally, the CNN decoder recovers the missing regions from the features to obtain the inpainting results. To make the restored text images visually realistic and semantically explicit, the network uses preset weights to combine several loss functions, such as adversarial, pixel reconstruction, perceptual, and style losses.
Given that the aim of text image inpainting is to reconstruct the text stroke, the network also introduces a gradient prior loss as one of the joint losses. The gradient prior loss uses the gradient field between the inpainted and ground truth images to push the network to generate text strokes with sharp contrast against the background. The training set consists of Tibetan and English text images that are randomly synthesized using corpus and noisy background images. All the input images are resized to 256 × 256 pixels for training. The model is implemented in PyTorch and accelerated using an NVIDIA GeForce GTX 1080Ti GPU. The model trains the structure prior reconstruction and text image inpainting networks in two stages to obtain the inpainting results. Result: Due to the limited number of studies on text image inpainting, we compare our model with four natural image and face image inpainting models qualitatively and quantitatively. The codes of all compared models are official versions. From the perspective of human vision, the proposed model obtains better holistic inpainting results than the other methods and achieves more detailed and accurate text reconstruction results in large missing regions. As quantitative evaluation metrics, this paper not only uses the image quality evaluations that are widely used in previous inpainting methods but also uses optical character recognition (OCR) results for comparison. These results can effectively show the inpainting effect of broken text on images. Our model achieves an average peak signal-to-noise ratio (PSNR) of 42.31 dB and structural similarity (SSIM) of 98.10% on the Tibetan dataset, and 39.23 dB and 98.55% on the English dataset. The character accuracy of Tesseract OCR for the Tibetan dataset is 62.83%, and the character accuracies of Tesseract OCR, convolutional recurrent neural network (CRNN), and attentional scene text recognizer (ASTER) for the English dataset are 85.13%, 86.04%, and 76.71%, respectively. Our model obviously outperforms the other algorithms on both datasets. Conclusion: This paper proposes a structure prior guided text image inpainting model that aims to reconstruct and use priors to guide text image inpainting. To obtain accurate priors, we use Transformer to improve the quality of our results. In the inpainting process, the StDRBs that are integrated into the network extract useful text sequence information and boost the text inpainting performance. The model is also trained by using effective joint loss functions to improve its results. The results on the Tibetan and English datasets prove the effectiveness of the proposed model.
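A minimal sketch of a gradient prior loss of the kind described above is given below, assuming simple finite-difference gradient fields and an L1 penalty; the exact gradient operator and weighting used by the paper may differ.

```python
import torch

def gradient_field(img):
    """Finite-difference gradients along width and height. img: (B, C, H, W)."""
    dx = img[:, :, :, 1:] - img[:, :, :, :-1]
    dy = img[:, :, 1:, :] - img[:, :, :-1, :]
    return dx, dy

def gradient_prior_loss(inpainted, ground_truth):
    """Penalize the mismatch between the gradient fields of the output and the target."""
    dx_out, dy_out = gradient_field(inpainted)
    dx_gt, dy_gt = gradient_field(ground_truth)
    return (dx_out - dx_gt).abs().mean() + (dy_out - dy_gt).abs().mean()

pred = torch.rand(2, 3, 256, 256, requires_grad=True)
target = torch.rand(2, 3, 256, 256)
loss = gradient_prior_loss(pred, target)   # scalar tensor usable inside the joint loss
print(loss.item())
```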
Keywords: image inpainting; text image inpainting; structure prior; static-to-dynamic residual block (StDRB); joint loss
Abstract: Objective: With the rapid development of the Internet and imaging devices, the security of digital image storage and file sharing has become an important concern. Robust watermarking techniques can be used to solve these problems. The general idea of these techniques is to embed watermark information, such as copyright labels and user identification, into the to-be-protected image imperceptibly and then extract the watermark from the watermarked image even after it undergoes some attacks. The two most important properties of robust watermarking are the robustness and visual quality of the watermarked image. Therefore, the watermarked image should be robust against different kinds of attacks and show satisfactory visual quality. As a typical robust watermarking technique, screen-shooting robust watermarking can resist the noises involved during the screen-shooting procedure. In other words, watermark information can still be accurately extracted from the watermarked image after screen shooting. Method: In this paper, we propose an effective, end-to-end network framework based on deep learning for screen-shooting robust watermarking. In this framework, a screen-shooting noise layer, including a Moiré pattern simulation, is introduced to simulate the noise within the screen-shooting channel so that the network learns, through training, to enhance its robustness against realistic noise during the screen-shooting procedure. To further improve the visual quality of the generated watermarked image, we define and introduce a just noticeable distortion (JND) loss function that aims to control the strength of the residual image containing the watermark information by supervising the visual perceptual loss between the JND maps of the original and residual images. We also propose two automatic localization methods for watermarked images. The first method locates the watermarked image region in a screen-shooting scenario, wherein the obtained screenshot may contain not only the image displayed on the screen but also some background information, which can affect the result of watermark extraction at the decoding end and even render this result useless. To address this problem, this paper proposes a region localization method that combines deep learning with traditional image processing. This method assumes that the image region from which the watermark needs to be extracted accounts for most of the pixels in the screen-shooting result and that the background color is relatively uniform with no obvious mutation. The localization of the image region containing the watermark can be equated to the problem of foreground extraction in this case. The second method addresses the watermarking of images under digital attack. The robustness of the watermarking algorithm should not be limited to the screen-shooting process but should also cover attacks in the digital environment, such as image filtering, image noise addition, and digital cropping. While the vast majority of digital attacks can be approximated by the noise introduced by the screen-shooting process, digital cropping attacks cannot be regarded as a kind of screen-shooting noise. For this reason, this paper introduces an anti-crop region localization method based on symmetric noise templates. This method divides the image into four sub-images, namely, top-left, bottom-left, top-right, and bottom-right.
A two-channel watermark information residual map is generated and embedded in the green and blue channels to create four copies of the same watermark information in one image. Additionally, a symmetric noise template is embedded in the red channel for anti-crop localization. Even when the watermarked image suffers from cropping attacks, the localization method can still accurately extract the watermark information as long as more than 1/4 of the image area exists.ResultExperimental results show that after introducing the JND loss function and embedding watermark, the visual quality of watermarked image is improved, and the average peak signal-to-noise ratio(PSNR) and structural similarity (SSIM) reach 30.937 1 dB and 0.942 4, respectively. After adding the Moiré noise simulation layer, the bit error rate of the proposed scheme is reduced to 1%~3%, which demonstrates the ability of this scheme to resist the noise generated from the screen shooting. This scheme also effectively resists strong cropping attacks by embedding the anti-cropping template into the R channel of the image. The total running time of embedding and extracting a single image is less than 0.1 s, which is suitable for deployment in application scenarios with real-time requirements. Meanwhile, the performance of the proposed algorithm is compared with that of state-of-the-art screen-shooting robust watermarking algorithms across various experimental settings, including screen shooting and digital attack settings. Results of the bit error rate comparison demonstrate that the proposed algorithms not only help the network simulate screen-shooting noise with a high level of robustness against actual screen-shooting noise but also equip the network with the ability to withstand specific digital cropping attacks.ConclusionThis paper proposes an end-to-end embedding-extraction network for robust watermarking against screen shooting. In this network, a Moiré noise simulation layer and a JND loss function module are introduced to enhance the robustness and visual quality of the watermarked images generated by the network. We also design two watermark localization methods to address two realistic scenarios, namely, screen shooting and digital cropping. Our experimental results demonstrate that our proposed scheme achieves a satisfactory embedding capacity and visual quality of the generated watermarked image and that the robustness of our scheme under different shooting distances, angles, and capturing/displaying devices is better than those of some state-of-the-art schemes. In our future research, we aim to investigate the decoding of watermarks when only a portion of the screen image is captured, which is a more intricate process than mere digital cropping and improving the visual quality of watermarked images in scenarios with high embedding capacity.
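The anti-crop layout described above can be pictured with the following sketch, which tiles the same two-channel watermark residual into the four quadrants of the green and blue channels and places a horizontally and vertically symmetric noise template in the red channel. The embedding strength, the random residual, and the template construction are illustrative assumptions, not the trained network's actual residuals.

```python
import numpy as np

def embed_anti_crop(image, residual_gb, strength=0.05, seed=0):
    """image: (H, W, 3) RGB in [0, 1]; residual_gb: (H//2, W//2, 2) watermark residual."""
    h, w, _ = image.shape
    out = image.copy()

    # Tile the same G/B residual into all four quadrants so any 1/4 crop keeps a copy.
    tiled = np.tile(residual_gb, (2, 2, 1))[:h, :w, :]
    out[:, :, 1:3] = np.clip(out[:, :, 1:3] + strength * tiled, 0, 1)

    # Symmetric noise template in the R channel for crop localization:
    # one random quadrant mirrored horizontally and vertically.
    rng = np.random.default_rng(seed)
    quad = rng.standard_normal((h // 2, w // 2))
    template = np.block([[quad, quad[:, ::-1]],
                         [quad[::-1, :], quad[::-1, ::-1]]])
    out[:, :, 0] = np.clip(out[:, :, 0] + strength * template, 0, 1)
    return out

host = np.random.rand(256, 256, 3)
wm_residual = np.random.rand(128, 128, 2) - 0.5
watermarked = embed_anti_crop(host, wm_residual)
print(watermarked.shape)   # (256, 256, 3)
```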
Abstract: Objective: When taking an image, the image tends to come out blurred due to interference from camera shake, out-of-focus lenses, dust, or atmospheric light. Image blurs can be categorized into atmospheric, defocus, and motion blurs. The most common type of blur is motion blur, which is caused by jittering objects during shooting and has a great impact on computer vision tasks, such as image classification, object detection, and text recognition. Meanwhile, methods for motion deblurring can be grouped into non-blind and blind deblurring. On the one hand, non-blind deblurring assumes a known blur kernel and can output a sharp image by directly deconvolving the blurred image after removing noise. This method is simple and effective, but its premise is that the blur kernel is known. On the other hand, blind deblurring faces the serious problem where the blur kernel is unknown. This method needs to estimate the blur kernel and then deblur using the same procedure as the non-blind deblurring method. The method is complicated, computationally expensive, and time consuming. However, the blur kernel is unknown in real scenarios, and most cases of deblurring are actually blind deblurring. Therefore, this paper focuses on blind deblurring. Owing to its powerful feature extraction ability, a convolutional neural network can deblur an image after learning its blur features from a dataset of sharp and blurred image pairs, hence presenting a breakthrough for the task of image deblurring. Method: To address motion blur, a two-scale network based on deep feature fusion attention is proposed in this paper. First, a two-scale network is designed to extract different scales of spatial information. During the transformation from high to low scale, the state of blurred features in the motion-blurred image changes from smooth to sharp. Therefore, the network pays further attention to the blurry areas in the low-scale image, hence allowing the network to capture blur features. This network not only improves the capability of recovering frequency details from the original-scale image but also effectively uses the spatial information of the blurred image to enhance the deblurring effect of the model. Second, the deep feature fusion attention module is constructed. The main structure of the network is very similar to that of U-Net. After the encoding and decoding structure, the deep feature fusion attention module is constructed to obtain the best fusion feature information. The encoding and decoding structure can obtain multi-level information about blurred images. Such information is then fed into the full-scale feature and squeeze-and-excitation modules in the deep feature fusion attention module to produce full-scale features, which in turn are spliced and fused with the decoded features of the same level to further enhance the recovery performance of the network. Third, in order to make the network recover the high-frequency details effectively, the perceptual loss function is replaced by a frequency loss function. The loss function in this paper is composed of two parts, namely, content loss and frequency loss. The content loss function uses the mean absolute error and combines multi-scale information to calculate the absolute disparity of each pixel between the sharp and restored images and obtain the value of content loss. The first step in measuring the frequency loss is to calculate the 2D Fourier transform of the restored and sharp images.
After calculating the Fourier transform, the average absolute error between the restored and sharp images is determined. The multi-scale frequency loss is obtained by multiplying the multi-scale weight by the average absolute error. Our loss function can improve the sensitivity of the network to blur features and enhance the recovery ability of the model in frequency detail. Result: We compare the performance of our model with that of 12 other methods on 3 different datasets. We evaluate the performance of our model on the GoPro dataset by utilizing the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) between the restored and sharp images. Compared with the scale-recurrent network, the proposed method obtains a 2.29 dB higher PSNR, thus allowing this method to recover detailed information. We also compare the optimal results of each method. We test the generalization performance of the proposed method on the Kohler dataset, where this method achieves the highest PSNR of 29.91 dB without retraining. Meanwhile, we compare the deblurring performance of these methods for real blurred images on the Lai dataset and determine the best results via subjective comparison. Conclusion: To improve the quality of motion deblurring, a two-scale network based on deep feature fusion attention is proposed in this paper. This work offers three contributions, namely, a novel two-scale network, a deep feature fusion attention module, and a multi-scale loss. Objective and subjective experimental results show that the proposed deblurring model can efficiently integrate image spatial information and feature semantic information, thus improving its deblurring performance, PSNR, and SSIM.
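A minimal sketch of a frequency loss of this form is shown below, assuming a 2D FFT followed by a mean absolute error and fixed per-scale weights; the actual weighting scheme and scale setup in the paper may differ.

```python
import torch

def frequency_loss(restored, sharp):
    """Mean absolute error between the 2D Fourier spectra of two image batches."""
    f_restored = torch.fft.fft2(restored)   # (B, C, H, W), complex-valued
    f_sharp = torch.fft.fft2(sharp)
    return (f_restored - f_sharp).abs().mean()

def multi_scale_frequency_loss(restored_pyramid, sharp_pyramid, weights=(1.0, 0.5)):
    """Weighted sum of the frequency loss over the two network scales."""
    return sum(w * frequency_loss(r, s)
               for w, r, s in zip(weights, restored_pyramid, sharp_pyramid))

restored = [torch.rand(2, 3, 256, 256), torch.rand(2, 3, 128, 128)]
sharp = [torch.rand(2, 3, 256, 256), torch.rand(2, 3, 128, 128)]
print(multi_scale_frequency_loss(restored, sharp).item())
```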
Keywords: deep feature fusion attention; double-scale network; motion image deblurring; full-scale feature fusion; multi-scale loss function
Abstract: Objective: Research on single image super-resolution reconstruction based on deep learning technology has made great progress in recent years. However, to improve reconstruction performance, previous studies have mostly focused on building complex networks with a large number of parameters. How to effectively reduce the complexity of the model while improving the reconstruction performance to meet the needs of low-cost and real-time applications has become an important research direction. While state-of-the-art lightweight super-resolution methods are mainly based on convolutional neural networks, only a few methods have been designed with Transformer, which shows an excellent performance in image restoration tasks. To solve these problems, we propose a lightweight super-resolution network called image super-resolution with channel-attention-embedded Transformer (CAET), which can achieve excellent super-resolution performance with a small number of parameters. Method: CAET involves four stages, namely, shallow feature extraction, hierarchical feature extraction, multi-layer feature fusion, and image reconstruction. The hierarchical feature extraction stage is performed by a basic building block called the channel-attention-embedded Transformer block (CAETB), which adaptively embeds channel attention (CA) into Transformer and convolutional features, hence not only taking full advantage of the convolutional network and Transformer in image feature extraction but also adaptively enhancing and fusing the corresponding features. Convolutional layers provide stable optimization and extraction results during early vision feature processing, and convolution layers with spatially invariant filters can enhance the translation equivariance of the network. The stacking of convolutional layers can effectively increase the receptive field of the network. Therefore, three cascaded convolutional layers are placed in front of CAETB to receive the features output from the previous module, and the LeakyReLU activation function is used to activate them. The features extracted by the convolution layers are embedded with channel attention. To effectively adjust the channel attention parameters, we adopt a linear weighting method to combine channel attention with features from different levels. These features are then input into Swin Transformer layers, as used in SwinIR, for further deep feature extraction. Given that increasing the network depth leads to saturation, we set the number of CAETBs to 4 to maintain a balance between model complexity and super-resolution performance. The hierarchical information at different stages is helpful in interpreting the final reconstruction results. Therefore, CAET combines all the low- and high-level information from the deep feature extraction and multi-level feature fusion stages. In the image reconstruction phase, we use a convolution layer and the pixel shuffle layer to upsample the features to the corresponding dimensions of a high-resolution image. During the training stage, we use 800 images from the DIV2K dataset to train CAET, and we augment all training images by randomly flipping them vertically and horizontally to increase the diversity of the training data. For each mini-batch, we randomly crop image patches to a size of 64 × 64 pixels as our low-resolution (LR) images.
We then optimize our network using the Adam algorithm and apply L1 loss as our loss function. Result: We conduct experiments on five public datasets, namely, Set5, Set14, Berkeley segmentation dataset (BSD) 100, Urban100, and Manga109, to compare the performance of our proposed method with that of six state-of-the-art models, including the super-resolution convolutional neural network (SRCNN), cascading residual network (CARN), information multi-distillation network (IMDN), super-resolution with lattice block (LatticeNet), and image restoration using swin Transformer (SwinIR). We measure the performances of these methods using peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as metrics. Given that humans are highly sensitive to the brightness of images, we measure these metrics in the Y channel of an image. Experiment results show that the proposed method achieves the highest PSNR and SSIM values and recovers more detailed information and more accurate texture compared with the state-of-the-art methods at ×2, ×3, and ×4 amplification factors. At the ×4 amplification factor, the PSNR of the proposed method is improved by 0.09 dB on the Urban100 dataset and by 0.30 dB on the Manga109 dataset compared to that of SwinIR. In terms of model complexity, CAET achieves a better performance with fewer parameters and multiply-accumulate operations compared to SwinIR, which also uses Transformer as the backbone of the network. Although CAET consumes more parameters and multiply-accumulate operations compared to IMDN and LatticeNet, this method achieves significantly higher performance in terms of PSNR and SSIM. Conclusion: The proposed CAET can effectively improve the image super-resolution reconstruction performance by fusing convolution and Transformer features and applies adaptive channel attention embedding to enhance the features. CAET effectively improves the image super-resolution performance while controlling the complexity of the whole network. Experiment results on several public experimental datasets verify the effectiveness of our method.
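The channel-attention embedding can be sketched as follows: a squeeze-and-excitation style attention vector reweights the convolutional features, and a learnable linear weight blends them with the Transformer-branch features. The module sizes and the exact blending form are assumptions for illustration, not CAET's published definition.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                       # x: (B, C, H, W)
        w = self.fc(x).unsqueeze(-1).unsqueeze(-1)
        return x * w

class AttentionEmbedding(nn.Module):
    """Linearly weight attention-enhanced conv features against Transformer features."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learnable blending weight

    def forward(self, conv_feat, transformer_feat):
        return self.alpha * self.ca(conv_feat) + (1 - self.alpha) * transformer_feat

block = AttentionEmbedding(64)
out = block(torch.rand(1, 64, 48, 48), torch.rand(1, 64, 48, 48))
print(out.shape)   # torch.Size([1, 64, 48, 48])
```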
Abstract: Objective: Semantic segmentation is a fundamental task in computer vision and image processing whose aim is to assign a class label to each pixel. However, the training of the segmentation model often relies on dense pixel-wise annotations, which are both time consuming and labor intensive to collect. To eliminate the dependence on pixel-level labels, weakly supervised semantic segmentation (WSSS) has been widely applied due to its weak/cheap supervision of points, scribbles, image-level labels, and bounding boxes. Image-level and pseudo labels are the weakest/easiest and most difficult labels to obtain, respectively. The main challenge in WSSS based on a convolutional neural network with image-level supervision lies in the gap between classification and segmentation tasks, which reduces the activation of the target regions, thus failing to satisfy the requirements of segmentation tasks. Despite activating most of the foreground objects, the classifier using the Transformer also introduces much background noise, thus decreasing the quality of the pseudo masks. To take full advantage of these two types of feature extraction networks and combine the attention features of different levels of the Transformer, a self-attention fusion and modulation network is constructed in this paper for weakly supervised semantic segmentation. Method: To make full use of the local features extracted by the convolutional neural network and the global features extracted by the Transformer, this paper turns to the convolution-enhanced Transformer (Conformer) as the feature extraction network that can encode the image comprehensively and obtain the initial class activation maps. The attention maps learned by the Transformer branch differ between the shallow and deep layers. Influenced by the convolution information, the attention maps in shallow layers tend to capture detailed information of the target regions, while those in deeper layers prefer mining the global information. Meanwhile, the noises in the background regions are caused by the attention maps in deeper layers owing to the incorrect relation between the background and foreground. Therefore, adding different attention maps directly is a suboptimal choice. We propose a self-attention adaptive fusion module that assigns a weight to each layer to balance their importance. On the one hand, we argue that the attention maps in shallow layers are more accurate than those in deeper layers, so large and small weights are assigned to the shallow- and deep-layer maps, respectively, to reduce the influence of the noise caused by deep layers. On the other hand, we consider the discrete activation value of the attention map, and a map with a larger discrete activation value has greater importance. The fused self-attention can effectively suppress background noises and describe the similarity between pixel pairs. To further increase the activation response of foreground pixels, we design a self-attention modulation module. We initially normalize the attention map before mapping it via the exponential function to measure the importance of each pixel pair. Given that the target object pixels are relatively similar and that the attention value of the pixel pair may be larger than that of others, we increase this connection via a large modulation parameter. When a pixel pair has a small attention value, these pixels may not have a close relation, thus introducing some noise.
Therefore, we reduce this connection via a small modulation parameter. After modulating the attention map, the distance between the foreground and background pixels becomes large, and the attention maps pay more attention to the foreground regions than the background ones. Result: Our experiment results demonstrate that our model can achieve state-of-the-art performance. Our model obtains a 70.2% mean intersection over union (mIoU) on the validation set and 70.5% mIoU on the test set of the most popular PASCAL VOC 2012 dataset, and a 40.1% mIoU on the validation set of COCO 2014. We do not utilize saliency maps to provide the background cues, and our results are comparable to those of works using saliency maps. Our model outperforms the state-of-the-art multi-class token Transformer (MCTformer) model, which uses the Transformer structure for feature extraction, by 2% and 2.1% on the validation and test sets, respectively, in terms of mIoU. Compared to TransCAM, which directly uses attention to adjust the class activation maps, our model obtains a 0.9% performance boost on both the validation and test sets, hence demonstrating that our model can effectively reduce noise in background regions. Our model also outperforms IRNet, SEAM, AMR, SIPE, and URN, which use the convolutional neural network as their backbone, by 6.7%, 5.7%, 1.4%, 1.4%, and 0.7%, respectively, on the validation set, thus confirming that our dual-branch feature extraction structure is effective and feasible. Given that we extract our features from both the local and global aspects, we also conduct an ablation experiment to show the importance of the complementarity of the information. If we only use the information of the convolution branch, then we obtain a 27.7% mIoU for the class activation map (CAM). However, when it is fused with the global feature generated by the Transformer branch, we obtain a 35.1% mIoU, thereby indicating that both local and global information are helpful in generating the CAM. Conclusion: The proposed self-attention adaptive fusion and modulation network in this paper is effective for image-level weakly supervised semantic segmentation tasks.
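A minimal sketch of the fusion and modulation steps described above is given below: layer-wise attention maps are averaged with per-layer weights favoring shallow layers, then each normalized attention value passes through an exponential modulation with different parameters for strong and weak pixel-pair connections. The weighting rule, threshold, and modulation constants are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def fuse_attention(attn_layers, layer_weights):
    """attn_layers: list of (N, N) self-attention maps ordered from shallow to deep."""
    weights = torch.tensor(layer_weights, dtype=torch.float32)
    weights = weights / weights.sum()
    return sum(w * a for w, a in zip(weights, attn_layers))

def modulate_attention(attn, strong=2.0, weak=0.5, threshold=0.5):
    """Amplify confident pixel-pair connections and suppress weak, noisy ones."""
    a = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)   # normalize to [0, 1]
    exponent = torch.where(a >= threshold,
                           torch.full_like(a, strong),
                           torch.full_like(a, weak))
    return torch.exp(exponent * a) - 1.0

# Three hypothetical attention maps over N = 196 patch tokens.
layers = [torch.rand(196, 196) for _ in range(3)]
fused = fuse_attention(layers, layer_weights=[0.5, 0.3, 0.2])  # shallow layers weighted more
refined = modulate_attention(fused)
print(refined.shape)   # torch.Size([196, 196])
```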
Abstract: Objective: Video object segmentation is a basic computer vision task that is widely used in video editing, video synthesis, autopilot, and other fields. This paper studies the problem of semi-supervised video object segmentation, that is, when the ground-truth mask of the target in the first frame of the video is given, the segmentation mask of the target specified by the first frame is predicted in the remaining frames. First, in the video sequence, the target object undergoes great changes in appearance due to continuous motion and variable camera viewing angles. Second, if the target is occluded by other objects, then it may disappear from the frame. Third, similar targets of the same category increase the difficulty of segmenting specific targets. Therefore, although annotations are provided in the first frame, semi-supervised video object segmentation (VOS) remains a challenge. Recently, algorithms based on memory networks have become mainstream in video object segmentation. Space-time memory VOS (STMVOS) uses the memory network to store additional feature information of historical frames. When segmenting each frame, STMVOS uses memory information to match the feature information of the current frame of the video pixel by pixel. While STMVOS outperforms all previous methods, this algorithm suffers from slow segmentation speed because of its high computational complexity. Unlike STMVOS, fast and robust target models (FRTM) also uses the memory network to store historical frame information yet uses memory information to update its proposed target model. The target model takes the feature information from the backbone network as input and outputs a rough mask of the target. This mask is then used as the input of the subsequent refinement and segmentation network, and the fine segmentation mask of the target is eventually outputted. After processing each frame, FRTM stores the features and mask of the frame in the memory module for subsequent updates of the target model. The speed of FRTM is 3.5 times higher than that of STMVOS while achieving competitive accuracy. However, FRTM faces several problems. First, after processing each frame, FRTM stores the corresponding feature information and mask in the memory module, which undoubtedly generates too much repetitive and redundant information in this module. Second, when storing memory frames, FRTM only mechanically gives a fixed proportion of weight to the latest stored feature information without considering the quality of the current frame, which is obviously disadvantageous in training a target model with strong discrimination. Method: To solve the above problems, this paper proposes a video object segmentation algorithm based on a memory module and adaptive weight update. First, given that the benchmark algorithm simply uses the linear update method to give the nearest frame the highest weight and does not consider the quality of the feature itself, in order to achieve a reasonable weight distribution in the benchmark algorithm, this study proposes a feature quality discrimination method based on mask mapping that takes into account inter-frame connection and feature quality when calculating the weight for each feature to be stored in the memory module. The corresponding weight is then given adaptively. Second, the benchmark algorithm stores the features and corresponding masks of each frame in the memory module, resulting in a certain degree of information redundancy.
In order to remove redundant historical frame information, improve the running speed of the algorithm, and reduce the memory consumption of the algorithm by optimizing the information storage strategy, a lightweight memory module is constructed. Result: On the DAVIS2016 dataset, the region similarity J of the proposed algorithm is 85.9%, its contour accuracy F is 85.7%, its average J&F is 85.8%, and its speed is 13.5 frame/s. The average J&F of the proposed algorithm is 8.2% and 5.6% higher than those of MaskTrack and OSVOS, respectively, and its speed is two orders of magnitude higher. The proposed algorithm also outperforms the other mainstream algorithms introduced from 2017 to 2021 in terms of average J&F. Specifically, the proposed algorithm outperforms FRTM and G-FRTM in terms of average J&F by 2.3% and 1.5%, respectively. FRTM is also inferior to the proposed algorithm in terms of speed. On the DAVIS2017 dataset, the proposed algorithm has a region similarity J of 75.5%, contour accuracy F of 81.1%, average J&F of 78.3%, and speed of 9.4 frame/s. This algorithm outperforms the early classical algorithms MaskTrack and OSVOS in terms of average J&F by 24% and 18%, respectively, and in terms of speed by two orders of magnitude. The proposed algorithm also outperforms the mainstream algorithms introduced from 2017 to 2021 in terms of average J&F. Specifically, the average J&F values of FRTM and G-FRTM are 1.6% and 1.9% lower than that of the proposed algorithm, respectively, and this algorithm even has a higher speed than FRTM. Conclusion: In this paper, a video object segmentation algorithm based on a memory module and adaptive weight update is proposed. First, to capture the target area accurately and reduce the influence of noise information on the target model, the proposed algorithm assigns the corresponding weight after evaluating the quality of the stored feature information. Second, this algorithm uses a lightweight memory module to store the relevant information of the history frames. In some challenging scenarios, the proposed algorithm can still generate an accurate and robust segmentation mask of the target, which also proves its effectiveness.
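The adaptive weighting idea can be sketched as follows: each candidate memory frame receives a quality score from the agreement between its predicted mask and the mask mapped from its stored features, and the stored weights blend this quality with recency instead of a fixed linear schedule. The IoU-based scoring function and the mixing coefficient are hypothetical choices for illustration.

```python
import numpy as np

def mask_quality(pred_mask, mapped_mask):
    """Score a frame's features by the IoU between its predicted mask and the
    mask mapped back from the stored features (both boolean arrays)."""
    inter = np.logical_and(pred_mask, mapped_mask).sum()
    union = np.logical_or(pred_mask, mapped_mask).sum()
    return inter / union if union > 0 else 0.0

def memory_weights(qualities, recency_decay=0.9, mix=0.5):
    """Blend feature quality with recency (newest frame last) and normalize."""
    n = len(qualities)
    recency = np.array([recency_decay ** (n - 1 - i) for i in range(n)])
    q = np.array(qualities)
    w = mix * q + (1.0 - mix) * recency
    return w / w.sum()

# Toy example: three stored frames with random masks.
masks_pred = [np.random.rand(64, 64) > 0.5 for _ in range(3)]
masks_map = [np.random.rand(64, 64) > 0.5 for _ in range(3)]
quals = [mask_quality(p, m) for p, m in zip(masks_pred, masks_map)]
print(memory_weights(quals))
```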
Abstract: Objective: The style transfer algorithm can transfer the style of an art image to an original natural image. The style image provides features such as style texture and strokes, while the content image provides the contour structure. The goal of the style transfer algorithm is to synthesize a new stylized image with the texture strokes of the style image and the contour structure of the content image. The early face style transfer algorithm applies mathematical modeling to build a filter that counts the local features of the target image to understand its style. This algorithm then establishes a statistical model to describe the image style. However, the early face style transfer algorithm only generates a single style, the resulting image style is not obvious, and the style needs to be modeled manually, thereby limiting its efficiency. With the rise of deep learning, the style transfer algorithm has started using the deep learning model as its core. Given that a generative adversarial network (GAN) can generate images that satisfy certain distribution laws, we can generate a target image that is similar to the real image by training a GAN. Therefore, GAN has been widely used in image style transfer algorithms. The main image style transfer algorithms are divided into two categories. The algorithms in the first category only improve GAN without using a pre-encoder, such as pix2pix and CycleGAN, while those in the second category use a pre-encoder. Due to the addition of encoders before the GAN structure, the resulting network structure becomes complex yet achieves highly realistic results, as in StyleGAN and StarGAN. To overcome the shortcomings of some face style transfer algorithms, such as StarGAN and MSGAN, which suffer from poor detail style learning, insignificant style transfer effects, and distorted generated images, we present a face style transfer algorithm called multi-layer StarGAN (MStarGAN) with controllable style intensity. Method: First, we construct the pre-encoder through the feature pyramid network (FPN) to generate multi-layer feature vectors containing image detail features. Compared with the original 1 × 64 feature vector, the pre-encoder constructed by FPN can output a 6 × 256 feature vector, which contains additional details of the original image. Therefore, the generated image can learn the detailed style of the style image during style transmission. Second, we use the pre-encoder to generate style vectors for the original and style images and then combine these vectors. We then use the combined style vector for style transmission. We can also adjust the number of layers of this vector so that the style of the generated image is biased toward either the original or the style image, hence resulting in different style transfer intensities for the generated image. Third, we introduce a new loss function to maintain balance in the style of the generated image and ensure that the image will not be too biased toward either the original or the style image. Fourth, we apply the weight demodulation algorithm as our style transmission module in the generator. The traditional AdaIN method has been proven to distort the generated image. By replacing the normalization operation on the feature map with an operation on the convolution weights, we eliminate the feature artifacts in the feature map and reduce the distortion in the generated image. Result: We implement our model in Python and test it on the Celeba_HQ dataset with an RTX2080Ti.
Our model not only generates high-quality random face images but also makes the generated images learn the style of the style images, including hair and skin color. Compared with the multimodal unsupervised image-to-image translation, diverse image-to-image translation, MSGAN, and StarGAN V2 algorithms, in the latent-guided synthesis experiment, the Fréchet inception distance (FID) index of the proposed algorithm is reduced by 18.5, 39.2, 20.2, and 0.8, respectively, while its learned perceptual image patch similarity (LPIPS) index is increased by 0.181, 0.366, 0.155, and 0.092, respectively. In the reference-guided synthesis experiment, the FID index of the proposed algorithm is reduced by 86.4, 32.6, 18.9, and 3.1, respectively, while its LPIPS index is increased by 0.23, 0.095, 0.094, and 0.018, respectively. In sum, our algorithm can generate result images with different styles and intensities. Conclusion: The proposed algorithm can transfer the detail style of the image, control the style intensity of the output image, and reduce the distortion of the generated image.
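The controllable style intensity can be pictured with the sketch below: the 6 × 256 style vectors of the original and style images are combined layer by layer, and the number of layers taken from each image biases the result toward one side. The layer count and vector sizes follow the abstract, while the mixing rule itself is an illustrative assumption.

```python
import torch

def combine_style_vectors(content_style, reference_style, layers_from_reference):
    """content_style, reference_style: (6, 256) multi-layer style vectors.
    The first `layers_from_reference` layers come from the style image and the rest
    from the original image; more reference layers means stronger style intensity."""
    combined = content_style.clone()
    combined[:layers_from_reference] = reference_style[:layers_from_reference]
    return combined

content_vec = torch.randn(6, 256)    # from the pre-encoder on the original image
style_vec = torch.randn(6, 256)      # from the pre-encoder on the style image
weak = combine_style_vectors(content_vec, style_vec, layers_from_reference=2)
strong = combine_style_vectors(content_vec, style_vec, layers_from_reference=5)
print(weak.shape, strong.shape)      # torch.Size([6, 256]) torch.Size([6, 256])
```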
Keywords: face style transfer network; StarGAN; style intensity; feature pyramid network (FPN); weight demodulation
Abstract: Objective: Target detection models based on deep convolutional neural networks are susceptible to complex environments (e.g., occlusion, illumination, long distance, and small targets), thereby leading to missed detection, false detection, and blurred target contour features. Moreover, the existing models cannot be easily generalized to small target detection tasks in aerial photography scenarios. To effectively solve these problems, this paper proposes a small-target vehicle detection algorithm called non-adjacent hop network you only look once version 5s multi-scale residual edge contour feature extraction strategy (NHN-YOLOv5s-MREFE) that fuses non-adjacent hopping and multi-scale residual structures. Method: First, four detection layers of different scales are designed, each responsible for detecting vehicles of a different size according to its receptive field size. Second, drawing on the idea of DenseNet dense hopping, a non-adjacent hopping feature pyramid structure is constructed, and through the hopping summing strategy, additional unaffected original information is fused while strengthening the information interaction of non-adjacent layers, thereby addressing the problem where the location information is gradually diluted during the transmission process and effectively reducing the false detection rate of the model. Third, under the premise of reducing feature loss, a deconvolution and parallelism strategy is introduced to expand small target detail information by means of parameter learning to achieve pixel filling and to enrich the amount of information in each dimension. Fourth, a multi-scale residual edge contour feature extraction strategy is designed following the principle of gradual feature refinement. This strategy builds a multi-scale residual structure that captures multi-scale information at different levels using a two-branch parallel approach, achieves image edge feature extraction based on the pixel-by-pixel difference between the high semantic information and the initial shallow information at multiple scales, and assists the network model in completing target classification. Finally, the K-Means++ algorithm is used to decentralize the clustering centers to drive the results toward the global optimum and accelerate the convergence of the model. Result: Experimental results show that the multimodal fusion strategy of the non-adjacent hopping and multi-scale residual structures effectively improves the accuracy and robustness of small target detection while enhancing the model operation efficiency and reducing the consumption of the model computational resources. The generalization ability of the model in different scenarios is strengthened through the enhancement of sample data in multiple scenarios, time periods, and perspectives. Finally, NHN-YOLOv5s-MREFE outperforms four mainstream target detection methods on an aerial image dataset containing multiple vehicle types in dual scenarios of intersections and along lanes.
Compared with the benchmark model (YOLOv5s), the precision, recall, and mean average precision of NHN-YOLOv5s-MREFE are improved by 13.7%, 1.6%, and 8.1%, respectively. Conclusion: The proposed NHN-YOLOv5s-MREFE balances detection speed and accuracy, significantly improving detection accuracy at the cost of only a very small increase in the number of parameters. The algorithm also adapts to complex traffic environments and meets the real-time requirements of small-target vehicle detection in aerial photography scenarios.
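The non-adjacent hop summation described above can be illustrated with a minimal PyTorch sketch; the pyramid levels, channel counts, 1 × 1 alignment convolutions, and nearest-neighbor resizing are assumptions for illustration, not the authors' exact NHN structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonAdjacentHopFusion(nn.Module):
    """Toy illustration: sum each pyramid level with a resized *non-adjacent*
    level so that original location cues skip over intermediate layers."""
    def __init__(self, channels=256):
        super().__init__()
        # 1x1 convs to align features before hop summation (assumed design choice)
        self.align = nn.ModuleList([nn.Conv2d(channels, channels, 1) for _ in range(4)])

    def forward(self, feats):            # feats: [P2, P3, P4, P5], fine -> coarse
        out = []
        for i, f in enumerate(feats):
            hop = i + 2                  # skip the adjacent level, jump two levels
            if hop < len(feats):
                g = F.interpolate(self.align[hop](feats[hop]),
                                  size=f.shape[-2:], mode="nearest")
                f = f + g                # hop summation keeps undiluted coarse cues
            out.append(f)
        return out

feats = [torch.randn(1, 256, s, s) for s in (80, 40, 20, 10)]
fused = NonAdjacentHopFusion()(feats)
```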
摘要:ObjectiveVideo action quality assessment aims to evaluate the execution and completion quality of specific actions in a video. Automated action quality assessment can effectively reduce losses in human resources and generate accurate and fair evaluations of video content. Meanwhile, traditional video action quality assessment task methods mainly suffer from three problems. First, most of these methods exhibit problems involving multi-scale spatial and temporal features. Specifically, the spatial and temporal location of the action in a video is critical for action quality assessment, and the sample video contains much information unrelated to the action. Thus, the current video action quality assessment methods encounter multi-scale spatial feature issues, in which different videos may have varying subject scale sizes in the spatial dimension, thus introducing challenges in capturing action information. In addition, action quality assessment confronts problems involving multi-scale temporal features, in which different durations and execution rates may exist in the temporal dimension and where the correlations between various time segments and labels are different. Second, the existing methods ignore problems related to the inherent ambiguity of labels caused by cognitive differences. These methods tend to focus on individual score labels and ignore the inherent ambiguity of score labels, the possibility of different judges providing different scores, and the subjectivity behind the given scores. For example, diving scores are presented by seven different judges and are not determined by a single label. Third, the current attention mechanisms faces redundancy in their self-attention heads. For instance, previous studies have employed many self-attention mechanism heads, but these heads exhibit redundancy during training. Moreover, removing the majority of these heads does not significantly affect the model performance. Experiments show that increasing the number of heads only worsens the effect of action quality assessment. To address these problems, this paper proposes self-attention and label distribution learning (SALDL), an action quality assessment model that focuses on different spatio-temporal locations in video sequences and generates fine-grained labels.MethodThis paper designs a new video action quality assessment model called SALDL that focuses on action information at different spatio-temporal locations in video sequences and generates fine-grained labels via the label distribution learning method, thus effectively addressing label ambiguity. SALDL comprises three main parts, namely, the video representation, pos-neg temporal attention (PNTA), and label distribution learning (LDL) modules. In the video representation module, SALDL employs an inflated 3D ConvNet (I3D) network structure with multi-receptive field convolution kernels to extract the spatial features within video clips. This model also proposes an Attention-Inc module that utilizes embedding, multi-head self-attention (MHSA), and multi-layer perceptron (MLP) to progressively incorporate the self-attentive mechanism into the Inception module, hence enabling the model to obtain contextual information between convolutional features at different scales. 
In the PNTA module, a temporal attention module with positive and negative attention heads is used to fully exploit temporal attention features through the PNTA loss, thereby reducing the redundancy of self-attention heads and extracting attention features from different time segments. In the LDL module, the SALDL model uses label distribution learning to generate fine-grained action quality labels, thereby resolving the inherent ambiguity of the labels. We also introduce the prior knowledge that the score label fits a certain distribution and then apply label enhancement methods to convert single labels into label distributions. The predicted label distribution is pushed toward the ground-truth label distribution via the Kullback-Leibler divergence loss function. Result: Extensive comparison experiments are performed on the multitask learning-action quality assessment (MTL-AQA) and JHU-ISI gesture and skill assessment working set (JIGSAWS) datasets. The Spearman rank correlation coefficient (Sp.Corr) reaches 0.941 6 on the MTL-AQA dataset and 0.836 4, 0.866 0, and 0.753 1 on the JIGSAWS dataset, all of which are state-of-the-art results. Extensive ablation experiments are also performed for the PNTA, LDL, and Attention-Inc structures in the SALDL model. In a regression-based variant of SALDL, where the output dimension of the fully connected layer is changed to 1 and the softmax function is excluded, the model directly generates a prediction score with an Sp.Corr of 0.932 0. SALDL-w/o PNTA, which denotes the SALDL model without the PNTA module, obtains an Sp.Corr of 0.938 4, while SALDL-w/o Attention-Inc, which denotes the SALDL model without the Attention-Inc structure, obtains an Sp.Corr of 0.939 9. These results highlight the contribution of each module to SALDL. We also conduct ablation experiments on the selection of the segmentation strategy and distribution function. Results show that the segmentation strategy and distribution function should be selected dynamically according to the dataset type. Therefore, future research should explore the ideal distribution function, the fusion of different distribution functions, and other methods to achieve adaptive label enhancement. Conclusion: The proposed SALDL model addresses problems involving multi-scale spatio-temporal features by fully mining spatio-temporal features at different scales. The model also resolves the intrinsic ambiguity of labels and the redundancy of self-attention heads by introducing the prior knowledge that labels conform to a certain distribution for label enhancement and by achieving label distribution learning. The proposed SALDL model achieves state-of-the-art performance on several action quality assessment datasets, fully validating its effectiveness.
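As a rough illustration of the label-enhancement and KL-divergence training described above (not the authors' exact SALDL implementation), a single score can be expanded into a Gaussian label distribution over discretized score bins as follows; the score range, bin count, and sigma are assumed values.

```python
import torch
import torch.nn.functional as F

def score_to_distribution(score, bins, sigma=1.0):
    """Label enhancement: turn a single quality score into a discrete Gaussian
    label distribution over score bins (sigma is an assumed hyperparameter)."""
    d = torch.exp(-(bins - score) ** 2 / (2 * sigma ** 2))
    return d / d.sum()

bins = torch.linspace(0, 100, steps=101)          # assumed score range and granularity
target = score_to_distribution(torch.tensor(86.5), bins)

logits = torch.randn(1, 101)                      # model output over the same bins
log_pred = F.log_softmax(logits, dim=-1)
kl_loss = F.kl_div(log_pred, target.unsqueeze(0), reduction="batchmean")

expected_score = (F.softmax(logits, dim=-1) * bins).sum()   # final score prediction
```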
摘要:ObjectiveThe goal of the image style transfer algorithm is to render the content of one image with the style of another image. Image style transfer methods can be divided into traditional and neural style transfer methods. Traditional style transfer methods can be broadly classified to stroke-based rendering (SBR) and image analogy (IA). SBR simulates human drawings with different sizes of strokes. Meanwhile, the main idea of IA is as follows: given a pair of images A (unprocessed source image) and A′ (processed image) and the unprocessed image B, the processed image B′ is obtained by processing B in the same way as A to A′. Meanwhile, neural style transfer methods can be categorized into slow image reconstruction methods based on online image optimization and fast image reconstruction methods based on offline model optimization. Slow image reconstruction methods optimize the image in the pixel space and minimize the objective function via gradient descent. Using a random noise as the starting image, the pixel values of the noise images are iteratively changed to obtain a target result image. Given that each reconstruction result requires many iterative optimizations in the pixel space, this approach consumes much time and computational resources and requires a high time overhead. In order to speed up this process, fast image reconstruction methods are proposed to train the network in advance in a data-driven manner using a large amount of data. Given an input, the trained network only needs one forward transmission to output a style transfer image. In recent years, the seminal works on style transfer have focused on building a neural network that can effectively extract the content and style features of an image and then combine these features to generate highly realistic images. However, building a model for each style is inefficient and requires much labor and time resources. One example of this model is the neural style transfer (NST) algorithm, which aims at transferring the texture of the style image to the content image and optimizing the noise at the pixel level step by step. However, hand-painted paintings comprise different strokes that are made using different brush sizes and textures. Compared with human paintings, the NST algorithm only generates photo-realistic imageries and ignores paint strokes or stipples. Given that the existing style transfer algorithms, such as Ganilla and Paint Transformer, suffer from loss of brush strokes and poor stroke flexibility, we propose a novel style transfer algorithm to quickly recreate the content of one image with curved strokes and then transfer another style to the re-rendered image. The images generated using our method resemble those made by humans.MethodFirst, we segment the content image into subregions with different scales via content mask according to the customized number of super pixels. Given that we do not pay attention to the background, we segment the image background into small subregions. To preserve large amounts of details, we save the image foreground as much as possible via segmentation into small subregions. The segmentations for the image foreground are twice greater than those for the image background. For each subregion, four control points are selected in the convex hull of a subregion, and then the Bezier equation is used to generate thick strokes in the background and thin strokes in the foreground. 
The image rendered with strokes is then stylized with the style image by using the style transfer algorithm to generate a stylized image that retains the stroke traces. Result: Compared with the arbitrary style transfer (AST) and Kotovenko's methods, the deception rate of the proposed method is increased by 0.13 and 0.04, respectively, while its human deception rate is increased by 0.13 and 0.01, respectively. Compared with Paint Transformer and other stroke-based rendering algorithms, the proposed method can generate thin strokes in the texture-rich foreground region and thick strokes in the background, thus preserving large amounts of image detail. Conclusion: Unlike whitening and coloring transforms (WCT), AdaIN, and other style transfer algorithms, the proposed method uses an image segmentation algorithm to generate stroke parameters without training, thus improving efficiency and generating multi-style images that preserve the stroke drawing traces with vivid colors.
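A minimal sketch of the stroke-generation step described above, assuming a cubic Bezier curve sampled from four control points; the control-point coordinates and stroke widths are illustrative values, not those produced by the paper's convex-hull selection.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=64):
    """Sample a cubic Bezier curve from four control points (stroke skeleton)."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# Four control points, here hand-picked inside a hypothetical subregion's convex hull.
pts = cubic_bezier(np.array([10, 40]), np.array([30, 10]),
                   np.array([60, 70]), np.array([90, 40]))

is_foreground = True
stroke_width = 2 if is_foreground else 6   # thin strokes on the foreground, thick on the background
```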
摘要:ObjectiveThe main task of aspect-level multimodal sentiment analysis is to determine the sentiment polarity of a given target (i.e., aspect or entity) in a sentence by combining relevant modal data sources. This task is considered a fine-grained target-oriented sentiment analysis task. Traditional sentiment analysis mainly focuses on the content of text data. However, with the increasing amount of audio, image, video, and other media data, merely focusing on the sentiment analysis of single text data would be insufficient. Multimodal sentiment analysis surpasses traditional sentiment analysis based on a single text content in understanding human behavior and hence offers more practical significance and application value. Aspect-level multimodal sentiment analysis (AMSA) has attracted increasing application in revealing the fine-grained emotions of social users. Unlike coarse-grained multimodal sentiment analysis, AMSA not only considers the potential correlation between modalities but also focuses on guiding the aspects toward their respective modalities. However, the current AMSA methods do not sufficiently consider the directional effect of aspect words in the context modeling of different modalities and the fine-grained alignment between modalities. Moreover, the fusion of image and text representations is mostly coarse grained, thereby leading to the insufficient mining of collaborative associations between modalities and limiting the performance of aspect-level multimodal sentiment analysis. To solve these problems, the aspect-level multimodal co-attention graph convolutional sentiment analysis model (AMCGC) is proposed to simultaneously consider the aspect-oriented intra-modal context semantic association and the fine-grained alignment across the modality to improve sentiment analysis performance.MethodAMCGC is an end-to-end aspect-level sentiment analysis method that mainly involves four stages, namely, input embedding, feature extraction, pairwise graph convolution of cross-modality alternating co-attention, and aspect mask setting. First, after obtaining the image and text embedding representations, a contextual sequence of text features containing aspect words and a contextual sequence of visual local features incorporating aspect words are constructed. To explicitly model the directional semantics of aspect words, position encoding is added to the context sequences of text and visual local features based on the aspect words. Afterward, the context sequences of different modalities are inputted into bidirectional long short-term memory networks to obtain the context dependencies of the respective modalities. To obtain the local semantic correlations of intra-modality for aspect-oriented modalities, a self-attention mechanism with orthogonal constraints is designed to generate semantic graphs for each modality. A textual semantic graph representation containing aspect words and a visual semantic graph representation incorporating aspect words are then obtained through a graph convolutional network to accurately capture the local semantic correlation within the modality. Among them, the orthogonal constraint can model the local sentiment semantic relationship of data units inside the modality as explicitly as possible and enhance the discrimination of the local features within the modality. A gated local cross-modality interaction mechanism is also designed to embed the text semantic graph representation into the visual semantic graph representation. 
The graph convolution network is then used again to learn the local dependencies of the different modalities' graph representations, and the text-embedded visual semantic graph representation is inversely embedded into the text semantic graph representation to achieve fine-grained cross-modality association alignment, thereby reducing the heterogeneous gap between modalities. An aspect mask is designed to select the aspect node features in each modality's semantic graph representation as the final sentiment representation, and a cross-modal loss is introduced to reduce the differences between cross-modal aspect features. Result: The performance of the proposed model is compared with that of nine recent methods on two public multimodal datasets. The accuracy (ACC) of the proposed model is improved by 1.76% and 1.19% on the Twitter-2015 and Twitter-2017 datasets, respectively, compared with the second-best models. Experimental results confirm the advantage of using graph convolutional networks to model the local semantic relation interaction alignment within modalities from a local perspective and highlight the superiority of performing multimodal interaction in a cross-collaborative manner. The model is then subjected to an ablation study from the perspectives of orthogonal constraints, cross-modal loss, cross-coordinated multimodal fusion, and feature redundancy, with experiments conducted on the Twitter-2015 and Twitter-2017 datasets. Experimental results show that all ablation variants are inferior to the full AMCGC model, thus validating the rationality of each part of the model. Moreover, the orthogonal constraint has the greatest effect, and its absence greatly reduces the effectiveness of the model. Specifically, removing this constraint reduces the ACC of the proposed model by 1.83% and 3.81% on the Twitter-2015 and Twitter-2017 datasets, respectively. In addition, the AMCGC+BERT model, which is based on bidirectional encoder representation from Transformers (BERT) pre-training, outperforms the AMCGC model based on GloVe. The ACC of the AMCGC+BERT model is increased by 1.93% and 2.19% on the Twitter-2015 and Twitter-2017 datasets, respectively, suggesting that large-scale pretraining-based models have more advantages in obtaining word representations. The hyperparameters of the model, such as the number of image regions and the weights of the orthogonal constraint terms, are set through extensive experiments. Visualization experiments prove that the AMCGC model can capture the local semantic correlation within modalities. Conclusion: The proposed AMCGC model can efficiently capture the local semantic correlation within modalities under the constraint of orthogonal terms. The model can also effectively achieve fine-grained alignment between modalities and improve the accuracy of aspect-level multimodal sentiment analysis.
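The orthogonality-constrained intra-modal semantic graph and the subsequent graph convolution can be sketched roughly as follows; the feature sizes, the dot-product attention form, and the penalty weight are assumptions rather than the exact AMCGC formulation.

```python
import torch
import torch.nn.functional as F

def semantic_graph(h):
    """Self-attention-style adjacency over one modality's context sequence h: (n, d)."""
    return torch.softmax(h @ h.t() / h.shape[-1] ** 0.5, dim=-1)

def orthogonal_penalty(a):
    """Encourage rows of the semantic graph to attend to distinct data units."""
    g = a @ a.t()
    return ((g - torch.eye(a.shape[0])) ** 2).mean()

def gcn_layer(a, h, w):
    """One graph convolution step over adjacency a, features h, weights w."""
    return F.relu(a @ h @ w)

h_text = torch.randn(20, 128)             # text context features (sizes are assumptions)
adj = semantic_graph(h_text)
w = torch.randn(128, 128) * 0.02
h_out = gcn_layer(adj, h_text, w)
loss_reg = orthogonal_penalty(adj)        # added to the task loss with a small weight
```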
摘要:ObjectiveScene text recognition (STR) is a hot research field in computer vision that aims to recognize text information from natural scenes. STR is important in many tasks and applications, such as image search, robot navigation, license plate recognition, and automatic driving. Most of the early STR models usually comprise a rectification network and a recognition network, while recent STR models usually comprise a convolutional neural network (CNN)-based feature encoder and a Transformer-based decoder or a customized CNN module and Transformer encoder-decoder. These STR models usually have a complex model architecture, large computational load, and large memory consumption. A vision Transformer (ViT)-based STR model called ViTSTR maintains balance among accuracy, speed, and computational load. However, without data augmentation targeted for STR, ViTSTR requires improvement in its accuracy. One reason for its low accuracy is that the naive use of multihead self-attention in ViT does not guarantee that different attention heads capture distinct features, especially the diverse features in complex scene text images. To address this problem, this paper studies the application of orthogonality constraints in ViT based on ViTSTR and proposes novel orthogonality constraints for the multihead self-attention mechanism in ViT, which explicitly encourages diversity among multiple self-attention heads, improves the ability of multihead self-attention to capture information in different subspaces, and further improves the accuracy of the network while ensuring speed and computational efficiency.MethodThe proposed orthogonality constraints comprise two parts, namely, the orthogonality constraints for the features of query (Q), key (K) and value (V) on different self-attention heads and the orthogonality constraints for the linear transformation weights of Q, K, and V on different self-attention heads. Q, K, and V play important roles in the self-attention mechanism as input features of the attention head. The orthogonality of features on different attention heads clearly encourages diversity among multiple attention head features. The orthogonality constraints for the Q, K, and V features allow different self-attention heads to focus on features in different query, key, and value subspaces, hence explicitly enabling different self-attention heads to capture distinct features and guide the ViT model in improving its performance in text recognition tasks for very diverse scenes. Specifically, after normalizing the Q, K, and V features of each head, the orthogonality of Q, K, and V features between different heads is calculated, and the corresponding orthogonality is added as the regularization term after the loss function. The orthogonality of the Q, K, and V features between different heads is penalized by back-propagation, which acts as a constraint on the orthogonality of the corresponding features. Adding orthogonality constraints to the linear transformation weights of the Q, K, and V features on different self-attention heads provides an orthogonal weight space in the learning process of these features, hence triggering implicit regularization in network training and fully utilizing the feature and weight spaces of multihead self-attention. A similar approach is used for the Q, K, and V weights to constrain the orthogonality of the Q, K, and V weight spaces by penalizing the orthogonality of the corresponding weights. 
The feature and weight orthogonality constraints can produce improvements when used individually or in combination.ResultExperiment results show that compared to the benchmark method, the overall accuracy of the proposed method on the test dataset is improved by 0.447% when adding the feature orthogonality constraint. Meanwhile, when adding the weight orthogonality constraint, the overall accuracy of the proposed method is improved by 0.364%. When both feature and weight orthogonality constraints are added, the overall accuracy of the proposed method is improved by 0.513%. We then compare the orthogonality changes in different ablation experiments, including the orthogonality changes of the Q, K, and V features and weights among different self-attention heads. The proposed orthogonality constraint can lead to a significant improvement in the corresponding orthogonality. In addition, the feature orthogonality constraint favors the orthogonality of the weights, and the weight orthogonality constraint favors the orthogonality of the features, but these effects are small. We also produce attention maps of the model with added orthogonal constraints and the baseline model in the CUTE80(CT) dataset. These attention maps show that the model with added orthogonality constraints focuses on more information in the attention region compared to the baseline model, which is helpful for recognizing the correct results. We also compare the performance of our method and that of previous competitive methods on several popular benchmarks. Our proposed method shows improvements on both regular and irregular datasets. Compared with the baseline, the accuracy of the proposed method is improved by 0.5% on regular datasets IIIT5K-words(IIIT), street view text(SVT), and ICDAR2003(IC03) (860), by 0.5% and 0.8% on the irregular datasets ICDAR2015(IC15) (1811) and IC15 (2077), respectively, by 0.8% on SVT perspective(SVTP), and by 1.1% on CT. In sum, the proposed method shows an overall accuracy improvement of 0.5%.ConclusionThis paper proposes a novel orthogonality constraint for the multihead self-attention mechanism that explicitly encourages this mechanism to capture diverse subspace information. The Q, K, and V feature orthogonality constraints are used to improve the ability of the multihead self-attention mechanism in capturing the feature space information of the input sequence, and the Q, K, and V weight orthogonality constraints are used to provide orthogonal weight space exploration for the learning of features. Experiment results validate the effectiveness of the proposed plug-and-play orthogonality constraints in STR tasks, especially in improving the accuracy of the ViT model in irregular text recognition tasks. The code is publicly available:https://github.com/lexiaoyuan/XViTSTR.
关键词:scene text recognition (STR);vision Transformer (ViT);multihead self-attention;orthogonality constraint;computer vision
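A hedged sketch of the head-wise orthogonality penalty described in the abstract above: per-head Q, K, and V features are flattened, L2-normalized, and their pairwise similarities penalized. The tensor shapes and the regularization weight are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def head_orthogonality_penalty(x):
    """x: per-head features, shape (heads, tokens, dim).
    Penalize cosine similarity between flattened, normalized head features."""
    h = x.shape[0]
    f = F.normalize(x.reshape(h, -1), dim=-1)        # one vector per head
    gram = f @ f.t()                                 # (heads, heads) similarities
    off_diag = gram - torch.diag(torch.diag(gram))
    return (off_diag ** 2).sum() / (h * (h - 1))

q, k, v = (torch.randn(8, 196, 64) for _ in range(3))   # 8 heads; sizes are illustrative
reg = sum(head_orthogonality_penalty(t) for t in (q, k, v))
# total_loss = recognition_loss + lambda_reg * reg      (lambda_reg is an assumed weight)
```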
摘要:Objective: Face age synthesis is one of the most popular research fields in computer vision, aiming to synthesize face images of specified ages while maintaining high fidelity. With the continuous progress of science and technology, face age synthesis is gradually being applied in face recognition, film special effects, public security, and other fields with a very wide range of application scenarios. The generative adversarial network (GAN) is one of the most widely used deep learning models in face synthesis. The generator and discriminator of a GAN compete with each other to generate synthetic images realistic enough to pass for real ones. While GAN and its variant models have achieved good synthesis results, some deficiencies remain unaddressed. First, to synthesize images close to the target age, the current face age synthesis models limit the process of age change to texture information and ignore multi-scale features on the face, such as contour, hair color, and texture. Second, the limited receptive field of the convolutional layer hinders a fully convolutional network from extracting multi-scale features in the image. These problems greatly restrict the face age synthesis effect of GANs. To solve these problems, this paper proposes a GAN composed of parallel dilated convolution and a channel-coordinate attention mechanism (PDA-GAN). Method: PDA-GAN introduces a parallel three-channel dilated convolutional residual block (PTDCRB) and a channel-coordinate attention mechanism (CCAM) on top of a generative adversarial network. PTDCRB is introduced into the generator network of the baseline. Each PTDCRB comprises three parallel dilated convolution channels that extract features simultaneously; the dilated convolutions on the different branches use dilation rates of 1, 2, and 3, respectively. The branches of PTDCRB share weights, which reduces the number of network parameters. The first layer of each branch in PTDCRB uses a 1 × 1 convolutional layer, the second layer is a dilated convolution with a branch-specific dilation rate, and the third layer uses a 1 × 1 convolutional layer to reduce dimensionality and improve computational efficiency. Meanwhile, CCAM screens the channel dimension of the feature vector, retains meaningful channel information, and learns the importance of different channels to avoid feature redundancy. CCAM then embeds position information into the feature vector after channel attention and fuses them after computing attention along the two orthogonal directions of length and width. The purpose of CCAM is to capture the dependencies of features at different positions more easily. Result: Experiments are conducted on the FFHQ dataset, with samples from the Celeba-HQ dataset selected as the test set, and PDA-GAN is qualitatively and quantitatively compared with three recent face age synthesis networks, HRFAE, LIFE, and SAM, to verify its effectiveness. Age accuracy and identity consistency are adopted as quantitative indicators. PDA-GAN achieves the best accuracy for synthesized age images, with an average prediction difference of 4.09. Its identity confidence reaches 99.2% when synthesizing a 30-year-old face. In the age-independent attribute retention experiment, PDA-GAN outperforms the other models in both quantitative indicators, with a gender retention rate of 99.7% and an emotion retention rate of 93.2%.
An ablation experiment is performed to further prove the effectiveness of each module of PDA-GAN, where PTDCRB is introduced into different layers of the generator backbone network. Experimental results show that PTDCRB-3 significantly improves identity confidence and age estimation accuracy. Four sets of PTDCRB dilation rates are then used to train the network, and the set [1, 2, 3] is confirmed to be optimal in terms of identity confidence and predicted age distribution. The standard generator structure and the generator structure with the channel-coordinate attention mechanism are then compared in terms of age synthesis accuracy and identity verification confidence. Experimental results show that identity retention and age synthesis are significantly improved after adding the channel-coordinate attention mechanism. Conclusion: This study proposes a parallel three-channel dilated convolution residual block with shared weights that captures feature information at each scale and enriches the detail features of the model. To enhance the expressiveness of the model on sensitive features, this paper proposes a channel-coordinate attention mechanism that learns features in the channel and spatial dimensions simultaneously. Under the combined effect of the parallel three-channel dilated convolution residual block and the channel-coordinate attention mechanism, the identity preservation ability and age synthesis accuracy of the model for face images are improved. Experimental results show that the proposed method outperforms other popular methods for face age synthesis tasks and can synthesize natural and realistic face images of the target age with high fidelity and accuracy.
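The parallel three-branch dilated residual block can be approximated by the following PyTorch sketch, in which one 3 × 3 kernel is shared across branches applied with dilation rates 1, 2, and 3; the channel-reduction ratio and activation placement are assumptions rather than the exact PTDCRB design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PTDCRBSketch(nn.Module):
    """Rough sketch: 1x1 reduce -> shared 3x3 kernel applied with dilation 1/2/3
    in parallel -> 1x1 expand, plus a residual connection."""
    def __init__(self, c, r=4):
        super().__init__()
        self.reduce = nn.Conv2d(c, c // r, 1)
        self.shared3x3 = nn.Conv2d(c // r, c // r, 3, padding=1, bias=False)
        self.expand = nn.Conv2d(c // r, c, 1)

    def forward(self, x):
        y = F.relu(self.reduce(x))
        # Shared weights, branch-specific dilation; padding keeps the spatial size.
        branches = [F.conv2d(y, self.shared3x3.weight, padding=d, dilation=d)
                    for d in (1, 2, 3)]
        y = self.expand(F.relu(sum(branches)))
        return F.relu(x + y)

out = PTDCRBSketch(64)(torch.randn(1, 64, 32, 32))
```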
摘要:Objective: With the application of 3D point clouds in many fields, such as automatic driving, navigation and positioning, AR house viewing, and model reconstruction, point cloud research and applications have attracted increasing attention. However, given their disordered and unorganized nature, irregular point clouds are difficult to process or feed directly into network training, because standard deep neural network models require regularly structured input data. To address this issue, the PointNet network was proposed. This pioneering network learns per-point features using a shared multilayer perceptron (MLP) and global features using a symmetric pooling function. PointNet focuses on global information but ignores the local information and neighborhood features of the point cloud. As an improvement, PointNet++ adds sampling-based feature extraction of local neighborhood information. This improved model has three main parts, namely, the sampling, grouping, and PointNet layers, which not only extract local information but also retain the advantage of PointNet in extracting global information. However, PointNet++ also has drawbacks; for instance, it ignores the geometric relationship between points and cannot fully capture local features. To address these drawbacks, the dynamic graph convolutional neural network (DGCNN) proposes EdgeConv, which enhances data representation ability by establishing topological relationships between points. EdgeConv not only maintains the permutation invariance of point clouds but also captures local geometric features. Most related studies rely on a large amount of supervised data, which is too time consuming and labor intensive to annotate. Given the limited application of few-shot learning to 3D data, this paper proposes a few-shot meta-learning algorithm to semantically segment 3D point cloud data. The prototype alignment algorithm, which can efficiently learn the information of the support set, is used to segment the query set, and the learning ability is adjusted so that the model can complete the segmentation task even with minimal supervised data. Method: This paper proposes a method for the semantic segmentation of 3D point clouds that differs from traditional deep learning segmentation methods based on a large amount of supervised information. The proposed method segments point clouds in the few-shot learning mode. To avoid using a large amount of labeled data for training, a small-sample meta-learning algorithm is adopted. Specifically, in the form of multiple N-way K-shot meta-tasks, the dataset is fed into the network to learn meta-knowledge during the meta-training stage. The support-set-training/query-set-validation procedure is repeated until the model learns to recognize new classes, and the final point cloud segmentation is then applied to new classes that have not been seen during the meta-test stage. To avoid overfitting, after several experiments, we use two EdgeConv layers and a six-layer multilayer perceptron to construct the DGCNN network as our feature extractor. Point clouds have an uneven density, with closer distances corresponding to higher density; therefore, using farthest point sampling would lead to a large amount of computation.
We thus use EdgeConv in the DGCNN network, apply k-nearest neighbors (KNN) search to find the nearest neighbors and construct a graph structure, extract the features of each edge using an MLP, and aggregate the edge features via average pooling to dynamically update the features of the center point. Given that the comprehensively learned information can express the corresponding category, and drawing on the idea of the prototype network, the features obtained after the support set and the query set pass through the network are average-pooled to obtain the prototype of each category. A prototype is used to represent a class, and the fused prototype alignment algorithm can efficiently obtain the prototype of the support set and reverse the process of support-set training and query-set verification. In this reversed process, the query set features and the predicted segmentation mask constitute a new "support set" whose prototype is learned and used to segment the original support set data, allowing the model to learn the information of the support set, extract a robust prototype, and compute the Euclidean distance between the point cloud features of the query set and the prototypes of the support set to implement point cloud segmentation. Result: Point cloud semantic segmentation is performed on the S3DIS, ScanNet, and Minnan ancient buildings (collected by the researchers) datasets to verify the segmentation performance of the proposed model. Compared with the prototype network and the matching network, which are classical few-shot learning networks, the mean intersection over union (mIoU) of the proposed method is comprehensively improved by 6%. For the single-category 1-way setting, the highest mIoU of the proposed method reaches 95%, which is 12% higher than that of the prototype network; for the 2-way setting, the mIoU of the proposed method is 4% higher than that of the prototype network. Compared with the matching network, the mIoU of the proposed method is comprehensively improved by 6%. Comparative experiments that use DGCNN and PointNet++ as feature extractors also confirm that DGCNN has a superior learning effect as a feature extractor. When segmenting the ceiling and floor categories, DGCNN improves the segmentation mIoU by 5% and 30%, respectively, compared with PointNet++, representing an overall increase of 17%. Meanwhile, the segmentation mIoUs of DGCNN on the ScanNet and Minnan ancient buildings datasets are 63% and 51%, respectively. These experimental results prove that the proposed algorithm can achieve better results than traditional prototype network algorithms in point cloud segmentation even with a small amount of labeled data. Conclusion: Compared with previous models trained with a large amount of labeled data, the proposed point cloud segmentation algorithm can segment a new class with little supervision information, thus improving the generalization of the model. The algorithm also saves manpower, material resources, and time in practical applications. When the labeled data of some samples are difficult to obtain, few-shot learning can play a key role.
关键词:point cloud segmentation;few-shot learning(FSL);meta-learning;prototype alignment;Minnan ancient buildings
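The prototype-based segmentation step described above can be sketched as follows, assuming per-point embeddings are already available from the feature extractor; the 2-way episode, point counts, and feature dimension are illustrative, and the prototype-alignment (reverse-direction) step is omitted for brevity.

```python
import torch

def class_prototypes(feats, labels, num_classes):
    """feats: (n_points, d) support features; labels: (n_points,) point labels.
    Prototype = mean feature of each class (masked average pooling).
    Assumes every class appears at least once in the support set."""
    return torch.stack([feats[labels == c].mean(dim=0) for c in range(num_classes)])

def segment_by_prototypes(query_feats, protos):
    """Assign each query point to the nearest prototype by Euclidean distance."""
    d = torch.cdist(query_feats, protos)          # (n_query, num_classes)
    return d.argmin(dim=-1)

support = torch.randn(2048, 64)                   # per-point embeddings (sizes assumed)
support_lbl = torch.randint(0, 2, (2048,))        # 2-way episode
query = torch.randn(2048, 64)
pred = segment_by_prototypes(query, class_prototypes(support, support_lbl, 2))
```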
摘要:ObjectiveAs a harmful, high-prevalence disease, colorectal cancer is seriously threatening human life and health. Nearly 95% of colorectal cancer cases are caused by the development of early colon polyps. Therefore, if colorectal polyps can be detected in time and closely observed by specialists, then the incidence of colorectal cancer can be effectively reduced. However, artificial diagnosis often has a high rate of missing polyps. The use of deep learning technology can provide fine-grained information that is helpful for diagnosis, such as the location and shape of polyps, and assist doctors in screening, thus providing great value for the prevention and treatment of colorectal cancer. The rapid development of deep learning in recent years has introduced great breakthroughs in the use of computer-aided diagnosis technologies in the medical field. Several models, such as convolutional neural networks and vision Transformer(ViT), have demonstrated their excellent medical task processing capabilities, and the use of computer technology for auxiliary diagnosis has gradually become a trend. In view of the characteristics of colorectal polyp images, such as their excessive morphological differences and unclear edges, we propose a edge-probability-distribution-guided high-resolution network for colorectal polyp segmentation called HRNetED, which performs well in multiple colorectal polyp datasets and has good clinical application significance.MethodThe proposed HRNetED network takes the HRNet structure as its backbone to ensure a full exchange of multi-scale features and guarantee the accuracy of the model output by maintaining a high-resolution convolutional branch. A stack residual convolution (SRC) module is also designed to extract the output of each convolution kernel by splitting a single convolution into four subconvolutions and connecting them serially so as to obtain the characteristics of multi-receptive fields. Pointwise convolution is then applied for feature fusion, and residual connection is introduced to avoid model performance degradation. To a certain extent, SRC solves the limitation of insufficient receptive fields in a single convolution operation and significantly reduces the number of model parameters and improves model performance through convolution splitting. Given the different morphological sizes, large color differences, and inconsistent imaging quality of colorectal polyp images, we design a multi-scale decoder to simultaneously supervise and learn the output results of different scales and introduce edge detection tasks into the structure to strengthen the perception of polyp edges. To address the unclear edges of polyps, we use the edge probability distribution model based on Gaussian distribution to describe the polyp edge so that the model does not need to return the accurate edge position information but only needs to predict the heat map of the edge distribution, thus effectively reducing the difficulty of model convergence and improving the perception ability and robustness of the model in the edge semantic ambiguous region. In the dataset configuration, we follow the experimental steps of mainstream networks, such as Pra-Net. Specifically, we use 900 images from the Kvasir-Seg dataset and 550 images from CVC-ClinicDB as the training set, amounting to 1 450 images. All images from ETIS, CVC-ColonDB, and CVC-300 and the remaining images from Kvasir-Seg and CVC-ClinicDB are then combined as test sets. 
We scale all these images to 256 × 256 pixels. In the model training phase, we use FocalLoss and BCELoss for the supervised training of the edge detection and polyp segmentation tasks, respectively. We also adopt a cosine annealing learning rate schedule and the Adam optimizer. In the model testing phase, we evaluate our model using the Dice coefficient and the mean intersection over union (mIoU) metric. Result: We test our method on five publicly available colorectal polyp datasets, namely, Kvasir-Seg, ETIS, CVC-ColonDB, CVC-ClinicDB, and CVC-300, and compare its performance with that of existing colorectal polyp segmentation algorithms, including HRNetv2, Pra-Net, UACANet, MSRF-Net, BDG-Net, SSFormer, and ESFPNet. The comparison results reveal that the Dice coefficient and mIoU of HRNetED on the CVC-ClinicDB and CVC-300 datasets are greater than those of the other algorithms. Compared with the previously optimal model on the CVC-ClinicDB dataset, HRNetED achieves 1.25% and 1.37% improvements in Dice and mIoU, respectively. On the ETIS dataset, the Dice and mIoU of HRNetED are 82.41% and 71.21%, respectively, with the former being higher than that of the existing optimal algorithm. On the CVC-ColonDB dataset, the Dice and mIoU of HRNetED are 80.55% and 71.56%, respectively. In addition, the HD95 distance of HRNetED on the Kvasir-Seg, ETIS, and CVC-ColonDB datasets is 0.315%, 29.19%, and 2.95% lower than that of the existing optimal algorithms, whereas on the CVC-ClinicDB and CVC-300 datasets HRNetED performs well but is only the second-best algorithm in this respect. Conclusion: The proposed HRNetED network performs well in colorectal polyp segmentation tasks. The subjective segmentation results show that the network performs stably on multiple datasets, has a good perception of small targets and fuzzy polyps, and has a strong ability to detect polyp contours. The ablation experiments show that the proposed stacked residual convolution module can greatly reduce the number of model parameters while improving model performance, and that the edge probability distribution model proposed for the edge-fuzzy-region problem can effectively improve the performance of the network.
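A minimal sketch of the Gaussian edge-probability supervision described above: a hard edge mask is softened into a heat map whose value decays with distance to the nearest edge pixel. The sigma and image size are assumed values, not the paper's settings.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def edge_probability_heatmap(edge_mask, sigma=3.0):
    """Turn a binary edge mask into a soft Gaussian edge-probability heat map.
    Each pixel's value decays with its distance to the nearest edge pixel."""
    dist = distance_transform_edt(1 - edge_mask)          # distance to the nearest edge pixel
    return np.exp(-(dist ** 2) / (2 * sigma ** 2))

edge = np.zeros((256, 256), dtype=np.uint8)
edge[100, 50:200] = 1                                     # toy edge segment
heatmap = edge_probability_heatmap(edge, sigma=3.0)       # supervision target for the edge branch
```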
摘要:Objective: Red tide is a harmful ecological phenomenon in the marine ecosystem that seriously threatens the safety of the marine economy. The accurate detection of the occurrence and distribution area of small-scale red tides can provide basic information for the prediction and early warning of this phenomenon. Red tides have a short duration and change rapidly, and on-site observations can hardly meet the requirements for their timely and accurate detection. By contrast, remote sensing has become an important technology for red tide monitoring. However, the traditional method of index-based extraction using spectral features is easily influenced by ocean background noise, and the threshold cannot be easily determined because the water color at the red tide margin is not obvious. Deep-learning-based methods can extract red tide information end to end without manually set thresholds, yet they treat low- and high-frequency red tide information equally, which hinders the representation ability of the convolutional neural network. To solve the problem of positioning and identifying small-scale red tide marginal waters, a semantic segmentation method for the remote sensing detection of small-scale red tides is proposed in this paper by combining high-frequency feature learning of red tides with position semantics. Method: The residual-in-residual (RIR) structure is used to extract the high-frequency characteristics of red tide marginal waters, and the residual branch is alternately composed of multiple residual groups and multi-receptive field structures. The residual groups use coordinate attention and a dynamic weight mechanism to capture the position semantic information of red tide water bodies, while the multi-receptive field structures are used to capture multi-scale information. A small-scale red tide detection network called RTDNet is then constructed to enhance the detailed information of red tide marginal waters and suppress useless features. To verify the validity of the model, experiments are conducted on the GF1-WFV red tide dataset. Owing to limitations in computing resources, the remote sensing images are cropped to 64 × 64 pixels, and data augmentation operations, such as flipping, translation, and rotation, are performed on the data. Through the above processing steps, a total of 1 050 samples are obtained. Adam is selected as the model optimizer with a learning rate of 0.000 1, a batch size of 2, 100 training epochs, and a binary cross-entropy loss function. The experiments are carried out under the Ubuntu 18.04 operating system with an NVIDIA GeForce RTX 2080Ti GPU, and the network model is implemented in Python 3.6 with the Keras 2.4.0 framework. The precision (P), recall (R), F1-score (F1), and intersection over union (IoU) of the model are evaluated to quantitatively analyze its effects. Result: Experimental results on the GF1-WFV red tide dataset show that RTDNet is superior to SVM, U-Net, DeepLabv3+, HRNet, the red tide band index method GF1_RI, RDU-Net, and other general or specialized red tide detection models in both qualitative and quantitative aspects. The results of RTDNet are close to the ground truth, and its extraction of red tide marginal waters is better than that of the other models, with far fewer false and missed extractions. Quantitatively, the F1-scores of RTDNet on the two test images reach 0.905 and 0.898, with corresponding IoU values of 0.827 and 0.815.
Compared with the second-best-performing model, DeepLabv3+, the F1-score of RTDNet is increased by more than 0.02, while its IoU is increased by more than 0.05. Moreover, the number of parameters of RTDNet is only 2.65 MB, about 13% of that of DeepLabv3+. An ablation experiment is also carried out, and the results verify that each module in RTDNet helps improve the effect of red tide detection. The visualization of feature maps across different stages of the network shows the gradual refinement process through which the network extracts the red tide. Conclusion: This paper proposes a small-scale red tide remote sensing detection network model called RTDNet based on the residual-in-residual structure, multi-receptive field structure, and attention mechanism. The model effectively addresses the false and missed extractions caused by the inconspicuous water color at the edge of the red tide, improves the accuracy and stability of red tide marginal water detection, and effectively reduces the computational load. Experimental results show that RTDNet is superior to other methods and models in detecting small-scale red tides in remote sensing images. The method is suitable for the accurate localization and area extraction of early marine ecological disasters (e.g., red tide, green tide, and golden tide) in remote sensing imagery and has reference value and applicability for other semantic segmentation tasks with fuzzy edges.
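The residual-in-residual backbone of RTDNet can be outlined roughly as follows; the coordinate attention, dynamic weighting, and multi-receptive field structures are omitted and the group/channel counts are assumptions, so this is only a skeleton of the reported design.

```python
import torch
import torch.nn as nn

class ResidualGroup(nn.Module):
    """Stand-in residual group; the paper's version also applies coordinate
    attention and dynamic weighting, omitted here for brevity."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(c, c, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class ResidualInResidual(nn.Module):
    """Residual-in-residual: stacked residual groups wrapped by a long skip,
    so low-frequency content bypasses the branch and the groups focus on
    high-frequency detail such as red tide marginal waters."""
    def __init__(self, c=64, n_groups=3):
        super().__init__()
        self.groups = nn.Sequential(*[ResidualGroup(c) for _ in range(n_groups)])

    def forward(self, x):
        return x + self.groups(x)

y = ResidualInResidual()(torch.randn(1, 64, 64, 64))
```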
摘要:ObjectiveHyperspectral image (HSI) contains rich spectral information and has advantages over multispectral image (MSI) in accurately distinguishing different types of materials. Therefore, HSI has been widely used in many computer vision tasks, including vegetation detection, face recognition, and feature segmentation. However, due to the limitations in hardware equipment and the acquisition environment, an inevitable trade-off arises between spatial resolution and spectral resolution. Thus, HSIs under real scenes often have low spatial resolution, which negatively affects the performance of subsequent vision tasks. By fusing the low-resolution HSI (LR-HSI) with a high-resolution MSI (HR-MSI) under the same scene using the HSI super-resolution algorithm, the spatial resolution of HSIs can be effectively improved. Existing HSI fusion algorithms can be roughly classified into traditional-model-based and deep-learning-based methods. Traditional-model-based fusion methods employ various handcrafted shallow priors (e.g., matrix/tensor factorization, total variation, and low rank) to utilize the intrinsic statistics of observed spectral images. However, these methods lack generalization ability to complex real scenarios and consume much time in iteratively optimizing the designed prior. Meanwhile, deep-learning-based fusion methods can automatically learn the prior knowledge from large-scale datasets. Although these methods often achieve better fusion results compared with traditional-model-based fusion methods, they do not jointly explore the inner self-similarity of multi-source spectral images, where the LR-HSI shows high correlation in the spectral dimension and the HR-MSI shows spatial similarities in texture and edges. In addition, the weights of these convolution-based networks are learned during training but are fixed during testing, hence limiting the potential adaptability of networks. To effectively exploit the inner spatial and spectral similarity of spectral images, we propose an MSI and HSI fusion network with a joint self-attention fusion module and Transformer.MethodGiven that LR-HSI has reliable information in the spectral dimension, the critical task of the HSI fusion method is to fill the missing texture details in the spatial dimension without losing discriminable spectral information. Given the LR-HSI and its matching HR-MSI, our proposed method fuses these two spectral images to obtain the desired HR-HSI in three steps. First, the similarity information of LR-HSI and HR-MSI is extracted by the joint self-attention module. Specifically, the spectral similarity features from LR-HSI are extracted by the channel attention module, and the spatial similarity features from HR-MSI are extracted by the spatial attention module. The obtained similarity features are then used to guide the fusion process. Second, to achieve a deep representation and explore the long-range dependencies of the fusion features, the preliminary fusion features are fed into the deep Transformer network, which comprises a shift window attention module, LayerNorm, and multilayer perceptron. The convolution layer and skip connection are also included in the proposed Transformer fusion network to further enhance the model flexibility. Third, the fusion features from Transformer are mapped to the desired high-resolution HSI. The overall network is implemented by the Pytorch framework and trained in an end-to-end manner. 
To generate training data, the training images are cropped to a size of 96 × 96 × 31, resulting in approximately 8 000 training patches, which are smoothed by a Gaussian blur kernel and spatially down-sampled to obtain the LR-HSIs. The MSI images are generated by the spectral response function of a Nikon D700 camera. Result: We compare our method with seven state-of-the-art fusion methods, including one traditional-model-based method and six deep-learning-based methods. The peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), erreur relative globale adimensionnelle de synthèse (ERGAS), and spectral angle mapper (SAM) are utilized as quantitative metrics in evaluating the performance of these fusion methods. To verify the effectiveness of the proposed model, we perform experiments on two widely used HSI datasets, namely, the CAVE and Harvard datasets. For the CAVE dataset, the first 20 images are selected for training, and the last 12 images are used for testing. Similarly, for the Harvard dataset, the first 30 images are selected for training, and the last 20 images are used for testing. Experimental results under different scale factors show that the proposed method achieves better fusion results in terms of quantitative metrics and visual effects than the other state-of-the-art methods. Under a scale factor of 8, the PSNR, SAM, and ERGAS of the proposed method are improved by 0.5 dB, 0.13, and 0.2, respectively, compared with EDBIN, the second-best-performing method on the CAVE dataset. Under a scale factor of 16, the PSNR of the proposed method is improved by at least 0.4 dB compared with the other methods on the Harvard dataset. The visual results show that our proposed method outperforms the other methods in recovering both fine-grained spatial textures and spectral details. The ablation study also proves that the employed Transformer fusion network significantly improves the fusion process. Conclusion: In this paper, we propose a Transformer-based MSI and HSI fusion network with a joint self-attention fusion module, which can effectively utilize the spectral similarity of the LR-HSI and the spatial similarity of the HR-MSI to guide the fusion process through a 2D attention mechanism. The preliminary fusion results pass through the residual Transformer network to obtain a deep feature representation and to reconstruct the desired HR-HSI. Qualitative and quantitative experiments show that the proposed method has better spectral fidelity and spatial resolution than state-of-the-art HSI fusion methods.
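A hedged sketch of the joint self-attention fusion idea described above, in which channel attention derived from the upsampled LR-HSI and spatial attention derived from the HR-MSI guide a simple concatenation-based fusion; the band counts, kernel sizes, and reduction ratio are assumptions rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class JointSelfAttentionFusion(nn.Module):
    """Toy version of the joint attention idea: channel attention from the
    (upsampled) LR-HSI re-weights spectral bands, spatial attention from the
    HR-MSI re-weights pixel locations, and the guided features are concatenated."""
    def __init__(self, hsi_bands=31, msi_bands=3):
        super().__init__()
        self.channel_fc = nn.Sequential(nn.Linear(hsi_bands, hsi_bands // 4), nn.ReLU(),
                                        nn.Linear(hsi_bands // 4, hsi_bands), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(nn.Conv2d(msi_bands, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, lr_hsi_up, hr_msi):
        ca = self.channel_fc(lr_hsi_up.mean(dim=(2, 3)))   # (B, C) spectral weights
        sa = self.spatial_conv(hr_msi)                     # (B, 1, H, W) spatial weights
        hsi_guided = lr_hsi_up * ca[:, :, None, None]
        msi_guided = hr_msi * sa
        return torch.cat([hsi_guided, msi_guided], dim=1)  # fed to the Transformer fusion network

fused = JointSelfAttentionFusion()(torch.randn(1, 31, 96, 96), torch.randn(1, 3, 96, 96))
```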