Abstract: Image deblurring is a fundamental task in computer vision that holds significant importance in various applications, such as medical imaging, surveillance cameras, and satellite imagery. Over the years, image deblurring has garnered much research attention, leading to the development of numerous dedicated methods. However, in real-world scenarios, the imaging process may be subject to various disturbances that can lead to complex blurring. Certain factors, such as inconsistent object motion, camera lens defocusing, pixel compression during transmission, and insufficient lighting, can lead to a range of intricate blurring phenomena that further amplify the deblurring challenges. In this case, image deblurring in real-world scenarios becomes a complex ill-posed problem, and conventional image deblurring methods based on simulated blur degradations often fall short when confronted with these real-world deblurring challenges. These limitations stem from the restrictive assumptions on which these conventional methods depend. These assumptions include, but are not limited to, the following: 1) the noise in the image follows a Gaussian distribution; 2) the blur is spatially invariant and uniform; and 3) the individual blurring phenomena are independent of one another. Although convenient for theoretical analysis and algorithm development, these assumptions prove to be restrictive when applied to the complex deblurring problems encountered in real-world scenarios. Consequently, there is a pressing need to conduct specialized research tailored to the challenges of real-world image deblurring and to enhance the effectiveness of image restoration methods. Real-world image deblurring is an intricate task that requires the development of innovative algorithms and techniques that are capable of accommodating the diversity of blurring factors and the complexities present in practical environments. This paper attempts to create unique deblurring solutions that can efficiently handle real-world scenarios and enhance the practical applicability of deblurring methods. One approach to addressing the above challenges is to design algorithms that are robust to various types of noise and are capable of handling non-uniform and coupled blurring effects. Additionally, machine learning and deep learning have emerged as powerful tools for addressing complex real-world deblurring problems. Deep learning models, such as convolutional neural networks and generative adversarial networks, have shown remarkable adaptability in learning from diverse data and producing high-quality deblurred images. Furthermore, researchers are exploring the integration of multiple sensor inputs, including depth information, to improve deblurring accuracy and effectiveness. These multi-modal approaches leverage additional data sources to disentangle complex blurring effects and enhance deblurring performance. As real-world image deblurring continues to gain attention, the research community is expected to contribute valuable insights and develop innovative solutions to further improve image restoration in complex scenarios. The ongoing collaboration among researchers from different fields, including computer vision, machine learning, optics, and imaging, will likely yield breakthroughs in addressing real-world deblurring challenges. In conclusion, real-world image deblurring is a multifaceted problem that requires tailored solutions to overcome the limitations of conventional deblurring methods.
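For reference, the conventional degradation model underlying the assumptions listed above can be written as follows. This is the standard spatially invariant formulation that conventional methods assume, not a model proposed in this paper:

```latex
% Classical spatially invariant blur degradation model:
%   B : observed blurry image, S : latent sharp image,
%   K : a single (uniform) blur kernel, \ast : convolution,
%   N : additive noise, commonly assumed to be Gaussian.
B(x, y) = (K \ast S)(x, y) + N(x, y), \qquad N(x, y) \sim \mathcal{N}(0, \sigma^2)
```

Real-world blur violates this model when the kernel varies across the image, when several degradations are coupled, or when the noise is not Gaussian.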
By acknowledging the complexities of real-world blurring phenomena and harnessing the power of advanced algorithms, machine learning, and multi-modal approaches, researchers are working toward enhancing image restoration in practical, challenging environments. Despite the growing interest in real-world deblurring, there remains a dearth of comprehensive surveys on the subject. To bridge this gap, this paper conducts a systematic review of real-world deblurring problems. From the perspective of image degradation models, this paper delves into various aspects and breaks down the associated challenges into isolated blur removal methods, coupled blur removal methods, and methods for unknown blur in real-world scenarios. This paper also provides a holistic overview of the state-of-the-art research in this domain, summarizes and contrasts the strengths and weaknesses of various methods, and elucidates the challenges that hinder further improvements in image restoration performance. This paper also offers insights into the prospects and research trends in real-world deblurring tasks and provides potential solutions to the challenges ahead, including the following: 1) Shortage of paired real-world training data: Acquiring high-quality training data with blur and sharp images that accurately represent the diversity of real-world scenarios is a significant challenge. The scarcity of comprehensive, real-world datasets hinders the development of supervised deblurring methods. To address the lack of real-world data, researchers are exploring data synthesis and unsupervised learning techniques. By generating synthetic data that simulate real-world scenarios, researchers can train algorithms on highly diverse data, while unsupervised learning is particularly suitable for improving adaptability to real-world conditions. 2) Efficiency of complex models: Recent deblurring algorithms rely on complex deep learning models to achieve high-quality results. However, these models are often computationally inefficient, making such algorithms impractical for real-time or resource-constrained applications. The computational overhead and memory requirements of these models also limit their deployment in practical scenarios. Researchers are developing highly efficient model architectures, such as lightweight neural networks and model compression techniques, to strike a balance between computational efficiency and deblurring performance, making them suitable for real-time applications and resource-constrained environments. 3) Overemphasis on degradation metrics: Many deblurring methods prioritize optimizing quantitative metrics related to the reconstruction of image details. While these metrics provide a quantitative measure of image quality, they may not align with the perceptual quality as perceived by the human visual system. Therefore, a narrow focus on these metrics may neglect the importance of achieving results that are visually realistic and aesthetically pleasing to human observers. There has also been a growing emphasis on perceptual quality metrics, which evaluate the visual quality of deblurred images based on human perception. Integrating these metrics into the evaluation process can help ensure that deblurred images are not only quantitatively accurate but also visually pleasing to humans. As research on real-world image deblurring continues, the above challenges are expected to be gradually addressed, thus leading to effective and practical deblurring solutions.
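As a concrete illustration of the reconstruction-oriented metrics discussed in point 3), a minimal PSNR computation is sketched below; the function is illustrative only, and a high PSNR does not by itself guarantee perceptually pleasing results:

```python
import numpy as np

def psnr(reference: np.ndarray, restored: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between a sharp reference and a deblurred result.

    Higher PSNR means lower mean squared error, but two restorations with similar
    PSNR can still differ markedly in perceived sharpness and realism.
    """
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)
```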
This survey aims to provide a comprehensive understanding of the current landscape of research in real-world deblurring and offers a roadmap for further advancements in this critical area of computer vision.
Abstract: Human pose estimation (HPE) is a prominent area of research in computer vision whose primary goal is to accurately localize annotated keypoints of the human body, such as wrists and eyes. This fundamental task serves as the basis for numerous downstream applications, including human action recognition, human-computer interaction, pedestrian re-identification, video surveillance, and animation generation, among others. Thanks to the powerful nonlinear mapping capabilities offered by convolutional neural networks, HPE has experienced notable advancements in recent years. Despite this progress, HPE remains a challenging task, particularly when facing complex postures, variations in keypoint scales, occlusion, and other factors. Notably, the current heatmap-based methods suffer from severe performance degradation when encountering occlusion, which remains a critical challenge in HPE given that diverse human postures, complex backgrounds, and various occluding objects can all give rise to occlusion. To comprehensively delve into the recent advancements in occlusion-aware HPE, this paper not only explores the intricacies of occlusion prediction difficulties but also delves into the reasons behind these challenges. The first of these challenges is the absence of annotated occluded data. Annotating occluded data is inherently complex and demanding. Most of the prevalent datasets for HPE predominantly focus on visible keypoints, with only a few datasets addressing and annotating occlusion scenarios. This deficiency in annotated occluded data during model training significantly compromises the robustness of models in effectively handling situations that involve a partial or complete obstruction of body keypoints. Feature confusion presents a second key challenge for top-down HPE methods, where the reliance on detected bounding boxes extracted from the image leads to the cropping of the target person’s region for keypoint prediction. However, in the presence of occlusion, these detection boxes may include individuals other than the target person, thereby interfering with the accurate prediction of keypoints. This issue is particularly problematic because the high feature similarity between the target person and the interfering individuals prevents the model from distinguishing features effectively, thereby compromising the accuracy of keypoint predictions and emphasizing the need to develop strategies for addressing feature confusion in occluded scenes. A third challenge is that inference becomes particularly difficult in the presence of substantial occlusion. The expansive coverage of occlusion leads to the loss of valuable contextual and structural information that is essential for accurately predicting the occluded keypoints. Contextual cues and structural insights play pivotal roles in the inference process, and their absence impedes the model’s ability to draw precise conclusions. The significant loss of contextual information also hampers the model’s capacity to glean necessary details from adjacent keypoints, which is crucial for making informed predictions about occluded keypoints. This, in turn, results in the potential omission of keypoints or the emergence of anomalous pose estimations. Besides, this paper systematically reviews representative methods since 2018.
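To make the heatmap-based paradigm mentioned above concrete, the sketch below encodes a single keypoint as a 2D Gaussian target, which is the common supervision signal in heatmap-based HPE; the heatmap resolution and sigma are illustrative defaults rather than values prescribed by the surveyed methods:

```python
import numpy as np

def keypoint_heatmap(x: float, y: float, height: int = 64, width: int = 48,
                     sigma: float = 2.0) -> np.ndarray:
    """Encode one keypoint (x, y) as a Gaussian heatmap.

    Under occlusion, the supervision at (x, y) is often missing or unreliable,
    which is one source of the performance degradation discussed above.
    """
    xs = np.arange(width, dtype=np.float32)
    ys = np.arange(height, dtype=np.float32)[:, None]
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
```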
Based on the training data, model structure, and output results contained in neural networks, this paper categorizes methods into three types, namely, preprocessing based on data augmentation, structural design based on feature discrimination, and result optimization based on human body priors. Preprocessing methods based on data augmentation generate occluded data to augment the training samples, compensate for the lack of annotated occluded data, and alleviate the performance degradation of the model in the presence of occlusion. These techniques utilize synthetic methods to introduce occlusive elements and simulate occlusion scenarios observed in real-world settings. Through these techniques, the model is exposed to a diverse set of samples featuring occlusion during the training process, thereby enhancing its robustness in complex environments. This data augmentation strategy aids the model in understanding and adapting to occluded conditions for keypoint prediction. By incorporating diverse occlusion patterns, the model can learn a broad range of scenarios, thus improving its generalization ability in practical applications. This method not only helps enhance the model’s performance in occluded scenes but also provides comprehensive training to boost its adaptability to complex situations. Feature-discrimination-based methods utilize attention mechanisms and similar techniques to reduce interference features. By strengthening features associated with the target person and suppressing those related to non-target individuals, these methods effectively mitigate the interference caused by feature confusion. These methods rely on mechanisms, such as attention, to selectively emphasize relevant features, thereby allowing the model to focus on distinguishing the keypoint features of the target person from those of interfering individuals. By enhancing the discriminative power of features belonging to the target individual, the model becomes adept at navigating scenarios where feature confusion is prevalent. Methods based on human body structure priors optimize occluded poses by leveraging prior knowledge of the human body structure. The use of human body structure priors is particularly effective in providing valuable information about the structural aspects of the human body. These priors serve as constraints that improve the robustness of the model during the inference process. By incorporating these priors, the model is further informed about the expected configuration of body parts, even in the presence of occlusion. This prior knowledge helps guide the model’s predictions and ensures that the estimated poses adhere closely to anatomically plausible configurations. A comparative analysis is also conducted to highlight the strengths and limitations of each method in handling occlusion. This paper also discusses the challenges inherent to occluded pose estimation and offers some directions for future research in this area.
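A minimal sketch of the occlusion-augmentation idea described above: paste a random patch over an annotated keypoint so that occluded keypoints appear during training. The image layout (H×W×C uint8), patch size, and random-texture fill are assumptions made for illustration:

```python
import numpy as np

def synthesize_occlusion(image: np.ndarray, keypoints: np.ndarray,
                         patch: int = 32, rng=None) -> np.ndarray:
    """Cover one randomly chosen keypoint with a random-texture patch.

    This mimics real occluders so that the model sees occluded keypoints at
    training time even when the dataset annotates only visible ones.
    """
    rng = rng or np.random.default_rng()
    out = image.copy()                       # assumed H x W x C, dtype uint8
    h, w = out.shape[:2]
    kx, ky = keypoints[rng.integers(len(keypoints))][:2].astype(int)
    x0, y0 = max(kx - patch // 2, 0), max(ky - patch // 2, 0)
    x1, y1 = min(x0 + patch, w), min(y0 + patch, h)
    out[y0:y1, x0:x1] = rng.integers(0, 256, size=(y1 - y0, x1 - x0, out.shape[2]),
                                     dtype=out.dtype)
    return out
```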
Keywords: human pose estimation (HPE); occlusion; data augmentation; human body structure prior; insufficient occlusion annotation data
Abstract: Cross-view geo-localization aims to estimate a target geographical location by matching images from different viewpoints. It is usually treated as an image retrieval task, a formulation that has been widely adopted in various artificial intelligence tasks, such as person re-identification, vehicle re-identification, and image registration. The main challenge of this localization task lies in the drastic changes among different viewpoints, which reduce the retrieval performance of the model. Conventional techniques for cross-view geo-localization rely on handcrafted feature extraction, which limits localization precision. With the development of deep learning techniques, deep learning-based cross-view geo-localization methods have become the current mainstream technology. However, because cross-view geo-localization involves multiple steps and extensive knowledge transfer across domains, few systematic reviews of this field have been conducted. In this paper, we present the first review of cross-view geo-localization methods based on deep learning. We analyze the various developments in data preprocessing, deep learning networks, feature attention modules, and loss functions within the context of cross-view geo-localization tasks. To address the challenges in this field, the data preprocessing phase involves feature alignment, sampling strategies, and data augmentation. Feature alignment serves as prior knowledge for cross-view geo-localization that contributes to improving the localization accuracy. The use of generative adversarial networks (GANs) has emerged as a prominent trend for feature alignment. Additionally, the discrepancy in sample quantities among satellite, ground, and drone images necessitates the use of effective sampling strategies and data augmentation techniques to achieve training balance. Deep learning networks play a critical role in extracting image features, and their performance directly impacts the accuracy of cross-view geo-localization tasks. In general, the methods that use Transformer as the backbone network have a higher accuracy than those based on ResNet, while the methods that use the ConvNeXt network show the best performance. To further extract image features and enhance the discriminative power of the model, feature attention modules need to be designed. By learning effective attention mechanisms, these modules adaptively weight the input images or feature maps to improve their focus on task-relevant regions or features. Experimental results show that these modules can explore previously unattended feature information, further extract image features, and enhance the discriminative power of the model. Loss functions are used to improve the fit of the model to the data and to accelerate its convergence. Based on their results, these functions guide the training direction of the entire network, thus enabling the model to learn better representations and further improve the accuracy of cross-view geo-localization tasks. Some of the most commonly used loss functions include contrastive loss and triplet loss. With improvements in these loss functions, the sampling scheme evolves from one-to-one to one-to-many, thus allowing the model to cover all samples during training and further enhance its performance. By analyzing nearly a hundred influential publications, we summarize the characteristics of existing methods and propose some ideas for improving cross-view geo-localization tasks, which can inspire researchers to design new methods.
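As an example of the loss functions discussed above, a minimal triplet loss for cross-view matching is sketched below. The embeddings are assumed to come from a two-branch network (e.g., a ground-view anchor with matching and non-matching satellite views), and the margin is an illustrative value rather than one reported by the surveyed methods:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    """Pull the anchor (e.g., a ground view) toward the matching satellite view
    and push it away from a non-matching one by at least `margin`."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # squared distance to the match
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # squared distance to a non-match
    return F.relu(d_pos - d_neg + margin).mean()
```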
We also test 10 deep learning-based cross-view geo-localization methods on two representative datasets. This evaluation considers the backbone network type and input data size of these methods. On the University-1652 dataset, we evaluate the accuracy metrics R@1 and AP, the model parameters, and the inference speed. On the CVUSA dataset, we mainly evaluate four accuracy metrics, namely, R@1, R@5, R@10, and R@Top1. Experimental results show that a better backbone network and a larger image input size positively affect the performance of the model. Building upon an extensive review of the current state-of-the-art cross-view geo-localization methods, we also discuss the related challenges and provide several directions for further research on cross-view geo-localization.
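A sketch of how the R@K retrieval metrics reported above can be computed from a query-gallery similarity matrix; the matrix layout and label convention are assumptions for illustration:

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, gallery_labels: np.ndarray,
                query_labels: np.ndarray, k: int = 1) -> float:
    """similarity: (num_queries, num_gallery) cosine similarities.

    A query counts as a hit if any of its top-k gallery entries shares its label
    (i.e., depicts the same geographic location)."""
    topk = np.argsort(-similarity, axis=1)[:, :k]
    hits = (gallery_labels[topk] == query_labels[:, None]).any(axis=1)
    return float(hits.mean())
```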
Abstract: Objective: Information warfare has put forward higher requirements for military reconnaissance, and military target identification, as one of the main tasks of military reconnaissance, needs to be able to deal with fine-grained military targets and provide personnel with more detailed target information. Optical remote sensing image datasets play a crucial role in remote sensing target detection tasks. These datasets provide valuable standard remote sensing data for model training and objective and uniform benchmarks for the comparison of different networks and algorithms. However, the current lack of high-quality fine-grained military target remote sensing image datasets constrains research on the automatic and accurate detection of military targets. As a special class of remote sensing targets, military vehicles have certain characteristics, such as environmental camouflage, shape and structural changes, and movement “color shadows”, that make their detection particularly challenging. Fig. 1 shows the challenges posed by the fine-grained target characteristics of military vehicles in optical remote sensing images, which can be categorized into the following types according to the source of the characteristics: 1) characteristics introduced by the satellite remote sensing imaging system; 2) characteristics of the vehicle target itself; 3) characteristics specific to military vehicle targets; 4) characteristics arising from the combination of these factors; and 5) properties related to fine-grained classification. To promote the development of deep-learning-based research on the fine-grained accurate detection of military vehicles in high-resolution remote sensing images, we construct a new high-resolution optical remote sensing image dataset called the military vehicle remote sensing dataset (MVRSD). Based on this dataset, we design an improved model based on YOLOv5s to improve the target detection performance for military vehicles. Method: We construct our dataset using Google Earth data, collecting 3 000 remotely sensed images from more than 40 military scenarios within Asia, North America, and Europe and acquiring 32 626 military vehicle targets from these images. These images have a spatial resolution of 0.3 m and a size of 640 × 640 pixels. Our dataset consists of remotely sensed images and the corresponding annotation files, and the targets were manually selected and classified by experts through the interpretation of high-resolution optical images. We divide the fine-grained categories in the dataset into the following classes based on vehicle size and military function: small military vehicles (SMV), large military vehicles (LMV), armored fighting vehicles (AFV), military construction vehicles (MCV), and civilian vehicles (CV). The geographic environments of the samples include cities, plains, mountains, and deserts. To address the difficulty of recognizing military vehicles in remote sensing images, the proposed benchmark model takes into account the characteristics of military vehicle targets, namely, their small size, fuzzy shapes and appearances, interclass similarity, and intraclass variability. The number of instances of each category in the dataset and the number of instances in each image depend on their actual distribution in the remote sensing scene, which reflects the realism and challenges of the dataset.
We then design a cross-scale detection head based on target sizes and a context aggregation module on top of YOLOv5s to improve the detection performance for fine-grained military vehicle targets. Result: We analyze the characteristics of military vehicle targets in remote sensing images and the challenges faced in the fine-grained detection of these vehicles. To address the poor detection accuracy of the YOLOv5 algorithm for small targets and reduce its risks of omission and misdetection, we design an improved model with YOLOv5s as our baseline and select a cross-scale detector head based on the dimensions of the targets in the dataset to efficiently detect targets at different scales. We insert an attention mechanism module in front of this detector head to suppress the interference of complex backgrounds on the targets. Based on this dataset, we apply five target detection models in our experiments. Results show that our proposed benchmark model improves mean average precision by 1.1% compared with the classical target detection model. Moreover, the deep learning models achieve good performance in the fine-grained accurate detection of military vehicles. Conclusion: The MVRSD dataset can support researchers in analyzing the features of remote sensing images of military vehicles from different countries and provide training and test data for deep learning methods. The proposed benchmark model can also effectively improve the detection accuracy for remotely sensed military vehicles. The MVRSD dataset is available at https://github.com/baidongls/MVRSD.
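The abstract does not specify which attention mechanism is inserted in front of the detection head; the sketch below uses a generic squeeze-and-excitation-style channel attention block purely to illustrate the idea of suppressing background interference before detection:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Illustrative SE-style block: reweights feature channels so that responses
    dominated by cluttered backgrounds can be suppressed before the detection head."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # global context per channel
            nn.Conv2d(channels, channels // reduction, 1),
            nn.SiLU(),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.fc(x)                             # channel-wise gating
```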
Abstract: Objective: Texture shows different characteristics at different scales. On a smaller scale, texture may appear more intricate and detailed, but on a larger scale, texture may present large structures and patterns. Therefore, texture patterns are complex and diverse and show various characteristics across patterns. For example, structural texture has clear geometric shapes and arrangements, natural texture has randomness and complexity, and abstract texture presents a combination of different colors, lines, and patterns. While the human visual system can effectively distinguish an ordered structure from a disordered one, computers are generally unable to do so. Texture filtering is a basic and important tool in the fields of computer vision and computer graphics whose main purpose is to filter out unnecessary texture details while maintaining the stability of the core structure. The mainstream texture filtering methods are mainly divided into local and global methods. However, the existing texture filtering methods do not effectively guarantee structural stability while filtering the texture. To address this problem, we propose an image smoothing algorithm based on adaptively regularized weighted relative total variation. Method: The main idea of this algorithm is to obtain a structure measure amplitude image with high texture-structure discrimination and then use the relative total variation model to smooth this image according to the difference between texture and structure. Our method implements texture filtering and structure preservation in three steps. First, we propose a multi-scale interval circular gradient operator that can effectively distinguish texture from structure. By inputting the intensity change information of the interval gradient in the horizontal and vertical directions (captured by the interval circular gradient operator) into the framework of directional anisotropic structure measurement (DASM), we generate a structure measure amplitude image with high contrast. In each iteration, we adjust the scale radius of the interval circular gradient operator, which decreases as the number of iterations increases. On the one hand, this approach can capture the low-level semantic information of the texture structure over a large range at the initial stage of iteration and suppress the texture effectively. On the other hand, this approach can accurately capture the high-level semantic information of the texture structure at the end of the iteration to keep the structure stable. Second, given the high accuracy of the Gaussian mixture model in data classification, we separate the texture and structure layers of the structure measure amplitude image by using this model along with the expectation-maximization (EM) algorithm. Before the separation operation, we conduct a morphological erosion operation on the image to refine the structure edges and shrink the structure area so as to improve the accuracy of the separation result. Finally, we adaptively assign regularization weights according to the structure measure amplitude image and the texture-structure separation image.
We assign a regularization term with a high weight to texture regions for texture suppression and a regularization term with a small weight to structure regions to maintain the stability of fine structures, thereby ensuring that texture is filtered out over large areas to the greatest extent while the integrity of the structure is maintained. Result: We run our experiments on the Windows platform and implement our algorithm using OpenCV and MATLAB. We define three main parameters: the maximum scale radius of the multi-scale interval circular gradient operator, the regularization weight of the texture region, and the regularization weight of the structure region. The maximum scale radius controls how much texture is suppressed. A larger texture-region weight corresponds to smoother filtering results, while a smaller structure-region weight corresponds to a better structure retention ability. At the visual level, tests on images of oil paintings, cross-stitch embroideries, graffiti, murals, and natural scenes, together with comparisons against existing mainstream texture filtering methods, show that our proposed algorithm not only effectively suppresses strong-gradient textures but also maintains the stability of weak-gradient structure edges. In terms of quantitative measurement, when removing compression traces from JPG images and smoothing Gaussian noise images, our proposed algorithm obtains the highest peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) compared with relative total variation, rolling guidance filtering, bilateral texture filtering, scale-aware texture filtering, and gradient minimization. Conclusion: Compared with the existing texture filtering methods, the proposed algorithm achieves strong-gradient texture suppression and fine structure preservation through the adaptive allocation of regularization weights and completes the differentiated filtering of texture and structure. Experiments show that our algorithm can maintain the main structure of the image and achieve gradient smoothing. This algorithm can be used to design powerful image preprocessing methods for image stylization, detail enhancement, HDR tone mapping, superpixel segmentation, and other applications sensitive to strong-gradient textures.
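For context, the relative total variation objective that the method builds on (originally proposed by Xu et al.), written here with a spatially adaptive weight in the spirit of the description above; the exact windowed terms and weighting scheme used in the paper may differ:

```latex
% Weighted relative total variation smoothing with an adaptive weight \lambda(p):
%   S : smoothed output, I : input/guidance image,
%   D_{x,y}(p) : windowed total variations, L_{x,y}(p) : windowed inherent variations
%   around pixel p; \lambda(p) is set large in texture regions and small in structure
%   regions, and \varepsilon avoids division by zero.
\min_{S} \sum_{p} \Big( (S_p - I_p)^2
  + \lambda(p) \Big( \frac{D_x(p)}{L_x(p) + \varepsilon}
                   + \frac{D_y(p)}{L_y(p) + \varepsilon} \Big) \Big)
```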
Keywords: image smoothing; texture filtering; relative total variation; multi-scale; regularization term adaptation
Abstract: Objective: In recent years, with the development of the internet and computer technology, manipulating images and changing their content have become trivial tasks. Therefore, robust image tampering detection methods need to be developed. From the perspective of passive forensics, image forgery methods can be categorized into copy-move, splicing, and inpainting. Copy-move involves copying part of the original image to another part of the same image. Many excellent copy-move forgery detection (CMFD) methods have been developed in recent years; they can be categorized into block-based, keypoint-based, and deep learning methods. However, these methods have the following drawbacks: 1) they cannot easily detect small or smooth tampered regions; 2) a massive number of features leads to a high computational cost; and 3) false alarm rates are high when the tampered images involve self-similar regions. To solve these issues, a novel CMFD method based on matched-pair density-based spatial clustering of applications with noise (MP-DBSCAN) and point density filtering is proposed in this paper. Method: First, a large number of scale-invariant feature transform (SIFT) keypoints are extracted from the input image by lowering the contrast threshold and normalizing the image scale, thus allowing the detection of a sufficient number of keypoints in small and smooth regions. Second, the generalized two nearest neighbor (G2NN) matching strategy is employed to manage multiple keypoint matching, thus allowing the detection algorithm to perform smoothly even when the tampered region has been copied multiple times. A hierarchical matching strategy is then adopted to solve keypoint matching problems involving a massive number of keypoints. To accelerate the matching process, keypoints are initially grouped by their grayscale values, and the G2NN matching strategy is then applied to each group instead of to the keypoints detected from the entire image. The efficiency and accuracy of the matching procedure can thus be improved without deleting correct matched pairs. Third, an improved clustering algorithm called MP-DBSCAN is proposed. The matched pairs are grouped into different tampered regions, and the directions of the matched pairs are adjusted before the clustering process. The cluster objects represent only one side of the matched pairs rather than all the extracted keypoints, and the keypoints from the other side are used as constraints in the clustering process. A satisfying detection result is obtained even when the tampered regions are close to one another. The proposed method obtains better F1 measures compared with the state-of-the-art copy-move forgery detection methods. Fourth, prior regions are constructed based on the clustering results. These prior regions can be regarded as approximate tampered regions. A point density filtering policy is also proposed, whereby the point density of each region is calculated and the region with the lowest point density is deleted to reduce the false alarm rate. Finally, the tampered regions are located accurately using affine transforms and the zero-mean normalized cross-correlation (ZNCC) algorithm. Result: The proposed method is compared with the state-of-the-art CMFD methods on four standard datasets, including FAU, MICC-F600, GRIP, and CASIA v2.0.
Provided by Christlein et al., the FAU dataset has an average resolution of about 3 000 × 2 300 pixels and includes tampered images subjected to post-processing operations (e.g., additive noise and JPEG compression) and various geometric attacks (e.g., scaling and rotation). This dataset contains 48, 480, 384, 432, and 240 images with plain copy-move, scaling, rotation, JPEG compression, and noise addition operations, respectively. The MICC-F600 dataset includes images in which a region is duplicated at least once. The resolutions of these images range from 800 × 533 to 3 888 × 2 592 pixels. Among the 600 images in this dataset, 440 are original images and 160 are forged images. The GRIP dataset includes 80 original images and 80 tampered images with a low resolution of 1 024 × 768 pixels. Some tampered regions in these images are very smooth or small. The size of the tampered regions ranges from about 4 000 to 50 000 pixels. The CASIA v2.0 dataset contains 7 491 authentic and 5 123 forged images, of which 1 313 images are forged using copy-move methods. Precision, recall, and F1 scores are used as assessment criteria in the experiments. The F1 scores of the proposed method on the FAU, MICC-F600, GRIP, and CASIA v2.0 datasets at the pixel level are 0.914 3, 0.890 6, 0.939 1, and 0.856 8, respectively. Extensive experimental results demonstrate the superior performance of the proposed method compared with the existing state-of-the-art methods. The effectiveness of the MP-DBSCAN algorithm and the point density filtering policy is also demonstrated via ablation studies. Conclusion: To detect tampered regions accurately, a novel CMFD method based on the MP-DBSCAN algorithm and the point density filtering policy is proposed in this paper. The matched pairs of an image can be divided into different tampered regions by the MP-DBSCAN algorithm to detect these regions accurately. The mismatched pairs are then discarded by the point density filtering policy to reduce false alarm rates. Extensive experimental results demonstrate that the proposed method exhibits satisfactory accuracy and robustness compared with the existing state-of-the-art methods.
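A minimal sketch of the keypoint extraction and G2NN matching steps described in the Method section. The contrast threshold and ratio are illustrative values, a brute-force search is used for clarity (the paper accelerates this with hierarchical grayscale grouping), and the later pipeline stages (MP-DBSCAN, point density filtering, ZNCC localization) are omitted:

```python
import cv2
import numpy as np

def extract_and_match(gray: np.ndarray, ratio: float = 0.6):
    """Dense SIFT keypoints (lowered contrast threshold) + generalized 2NN matching."""
    sift = cv2.SIFT_create(contrastThreshold=0.01)   # more keypoints in small/smooth areas
    kps, desc = sift.detectAndCompute(gray, None)
    matches = []
    for i, d in enumerate(desc):
        dist = np.linalg.norm(desc - d, axis=1)
        order = np.argsort(dist)[1:]                 # skip the keypoint itself
        # G2NN: accept neighbors while successive distance ratios stay below `ratio`,
        # which allows a region copied multiple times to produce multiple matches
        for j in range(len(order) - 1):
            if dist[order[j]] / (dist[order[j + 1]] + 1e-8) < ratio:
                matches.append((i, int(order[j])))
            else:
                break
    return kps, matches
```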
Keywords: multimedia forensics; image forensics; image forgery detection; copy-move forgery; density-based spatial clustering of applications with noise (DBSCAN)
Abstract: Objective: Point clouds captured by depth sensors or generated by reconstruction algorithms are essential for various 3D vision tasks, including 3D scene understanding, scan registration, and 3D reconstruction. However, even a simple scene or object contains massive numbers of unstructured points, leading to challenges in the storage and transmission of point cloud data. Therefore, developing point cloud geometry compression algorithms is important for effectively handling and processing point cloud data. Existing point cloud compression algorithms typically convert point clouds into a storage-efficient data structure, such as an octree representation or sparse points with latent features. These intermediate representations are then encoded into a compact bitstream using either handcrafted or learning-based entropy coders. Although the correlation of spatial points effectively improves compression performance, existing algorithms may not fully exploit these points as representations of the object surface and topology. Recent studies have addressed this problem by exploring implicit representations and neural networks for surface reconstruction. However, these methods are primarily tailored to 3D objects represented as occupancy fields and signed distance fields, thus limiting their applicability to point clouds and non-watertight meshes in terms of surface representation and reconstruction. Furthermore, the neural networks used in these approaches often rely on simple multi-layer perceptron structures, which may lack capacity and compression efficiency for point cloud geometry compression tasks. Method: To deal with these limitations, we propose a novel point cloud geometry compression framework comprising a signed logarithmic distance field, an implicit network structure with multiplicative branches, and an adaptive marching cubes algorithm for surface extraction. First, treating the point cloud surface as the zero level set, we map arbitrary points in space to the distance values of their nearest points on the point cloud surface. We design an implicit representation called the signed logarithmic distance field (SLDF), which utilizes a thickness assumption and logarithmic parameterization to fit arbitrary point cloud surfaces. Afterward, we apply a multiplicative implicit neural encoding network (MINE) to encode the surface as a compact neural representation. MINE combines sinusoidal activation functions and multiplicative operators to enhance the capability and distribution characteristics of the network. The overfitting process transforms the mapping from point cloud coordinates to the implicit distance field into a neural network, which is subsequently utilized for model compression. From the decompressed network, the continuous surface is reconstructed using the adaptive marching cubes algorithm (AMC), which incorporates a dual-layer surface fusion process to further enhance the accuracy of surface extraction for SLDF. Result: We compare our algorithm with six state-of-the-art algorithms, including surface compression approaches based on implicit representations and point cloud compression methods, on three public datasets, namely, ABC, Famous, and MPEG PCC. The quantitative evaluation metrics include the rate-distortion curves of the chamfer-L1 distance (L1-CD), normal consistency (NC), and F-score for continuous point cloud surfaces and the rate-distortion curve of D1-PSNR for quantized point cloud surfaces.
Compared with the second-best method (i.e., INR), our proposed method reduces the L1-CD loss by 12.4% and improves the NC and F-score performance by 1.5% and 13.6% on the ABC and Famous datasets, respectively. Moreover, the compression efficiency increases by an average of 12.9% along with the growth of model parameters. On multiple MPEG PCC datasets, with samples taken from the 512-resolution MVUB dataset, the 1024-resolution 8iVFB dataset, and the 2048-resolution Owlii dataset, our method achieves a D1-PSNR performance of over 55 dB within the 10 KB range, which highlights its higher effective compression limit compared with G-PCC. Ablation experiments show that in the absence of SLDF, the L1-CD loss increases by 18.53%, while the D1-PSNR performance decreases by 15 dB. Similarly, without the MINE network, the L1-CD loss increases by 3.72%, and the D1-PSNR performance decreases by 2.67 dB. Conclusion: This work explores implicit representations for point cloud surfaces and proposes an enhanced point cloud compression framework. We initially design SLDF to extend implicit representations to arbitrary topologies in point clouds, and then we use the multiplicative-branch network to enhance the capability and distribution characteristics of the network. We then apply a surface extraction algorithm to enhance the quality of the reconstructed point cloud. In this way, we obtain a unified framework for the geometric compression of point cloud surfaces at arbitrary resolutions. Experimental results demonstrate that our proposed method achieves a promising performance in point cloud geometry compression.
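The exact MINE architecture is not specified in this abstract; the sketch below only illustrates the general idea of combining sinusoidal activations with multiplicative feature interactions in a coordinate network that regresses a scalar distance value, with all layer sizes and the gating scheme chosen arbitrarily:

```python
import torch
import torch.nn as nn

class MultiplicativeImplicitNet(nn.Module):
    """Toy coordinate network: sinusoidal features modulated by a multiplicative
    branch, mapping 3D points to a scalar (e.g., a signed log-distance value)."""
    def __init__(self, hidden: int = 128, layers: int = 3, omega: float = 30.0):
        super().__init__()
        self.omega = omega
        self.inp = nn.Linear(3, hidden)
        self.main = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(layers)])
        self.gate = nn.ModuleList([nn.Linear(3, hidden) for _ in range(layers)])
        self.out = nn.Linear(hidden, 1)

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        h = torch.sin(self.omega * self.inp(xyz))
        for lin, gate in zip(self.main, self.gate):
            # multiplying sinusoidal branches enlarges the set of representable frequencies
            h = torch.sin(self.omega * lin(h)) * torch.sin(gate(xyz))
        return self.out(h)
```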
Abstract: Objective: Convolutional neural networks have made breakthroughs in computer vision, speech recognition, and other fields. However, with the continuous pursuit of neural network models with excellent performance, the structure of these models has become increasingly complex, as mainly reflected in the width and number of their layers. Accordingly, the size and computing resource requirements of these models are also constantly expanding, and such huge resource consumption limits these models to server platforms with abundant computing power and other resources. As deep learning networks gradually move into application end devices, many network models cannot be deployed on resource-constrained embedded devices, such as smartphones, low-end mainboards, and edge devices. To address the contradiction between the computing resource requirements of network models and resource-constrained embedded devices, existing complex models should be compressed. Building on extant model pruning methods, this article proposes a two-stage filter pruning method that incorporates cosine spatial correlation (CSCTFP), which improves pruning performance by utilizing the spatial correlation between filters. CSCTFP also relies on such spatial correlation to identify the filter bank that contributes the most to the network, thus avoiding the suboptimal pruning results caused by the assumption that “if the measurement index is small, the measurement object is not important”. Method: Existing model pruning methods are mainly divided into two types. The first type, called unstructured pruning, uses the weight parameters of the filter as the minimum pruning unit. However, this pruning method leads to unstructured sparsity of the filters. The pruned network cannot be accelerated by existing software and hardware; instead, a dedicated accelerator must be designed to speed up the computation of unstructured sparse matrices. The second type, called structured pruning, takes the whole filter as the smallest pruning unit. This pruning method yields a structured, sparse network, thus facilitating acceleration with existing software and hardware. Existing filter pruning methods mainly use the assumption that “if the measurement index is small, the measurement object is not important” as the criterion for evaluating filter importance, for example, by using the norm of the filter kernel as the importance index. Alternatively, the “similarity is redundancy” assumption can be used as a criterion for evaluating filter redundancy, for example, by using the distance between filters as a measure of redundancy. Both assumptions rely on prerequisite conditions that do not always hold true in actual scenarios. CSCTFP aims to address these shortcomings as follows. First, in the pre-pruning stage, instead of deleting small-norm filters, CSCTFP identifies the filter with the maximum norm value, which is referred to as the key filter in this article. Second, in the pruning stage, a set of filters that are highly correlated with the key filter is preserved by computing the cosine distance.
Measuring the correlation between filters in these two stages also avoids poor pruning results when the above two assumptions do not hold. Result: Experiments were conducted using various network structures, such as visual geometry group (VGG) 16, residual neural network (ResNet) 56, and MobileNet V1, to verify that the proposed method can be adapted to different types of network models with sequential, residual, and depthwise separable structures. The experimental results on the CIFAR10 and CIFAR100 datasets were compared with those of previous methods. On the CIFAR10 dataset, the parameter count and floating-point operations (FLOPs) of VGG16 were compressed by 72.9% and 73.5%, respectively, while the model accuracy was improved by 0.1%. Compared with the Hrank pruning method, CSCTFP can compress more floating-point operations while incurring less accuracy loss (the accuracy of the Hrank method decreased by 0.62%). For the efficient residual network ResNet56, CSCTFP can compress 53.81% of FLOPs with an accuracy increase of 0.33%, and its accuracy loss is much lower than those obtained by SFP, FPGM, and NSPPR. The efficient depthwise separable network MobileNet V1 can also be effectively compressed, with CSCTFP compressing 46.23% of FLOPs and 46.89% of the parameters while improving accuracy by 0.11%. CSCTFP demonstrates a better compression effect than DCP, which reduces accuracy by 0.3% and compresses only 42.86% of FLOPs and 30.07% of the parameters. CSCTFP also achieves good compression performance on more complex datasets, such as CIFAR100. For VGG16, CSCTFP can compress more FLOPs (33.35%) with a much lower accuracy loss compared with Variational and DeepPruningEs. For ResNet56, CSCTFP can compress 43.02% of the parameters and 40.36% of FLOPs while achieving an accuracy improvement of 0.48%, whereas the comparison methods OICSR and NSPPR compress fewer FLOPs and incur higher accuracy loss. In addition, CSCTFP is applicable not only to image classification tasks but also to object detection tasks. The pruned lightweight face detection model RetinaFace performs well on the easy and medium validation subsets of the WiderFace dataset. CSCTFP is also compared with the assumptions “if the measurement index is small, the measurement object is not important” and “similarity is redundancy” and continues to show accuracy improvements at different pruning ratios. Conclusion: CSCTFP takes into account the uncertainty of the assumptions “if the measurement index is small, the object being measured is not important” and “similarity is redundancy”, thus avoiding suboptimal results in the pruned model resulting from the failure of these two assumptions. CSCTFP further improves the accuracy and compression rate of pruning by searching for key filters and using the spatial correlation between filters. Extensive experiments have confirmed the effectiveness of CSCTFP and its advantages over other extant methods. The iterative pruning method used in this article can compress the network model finely, but further research is needed to reduce the time cost and avoid manually setting the pruning ratio.
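A minimal sketch of the two-stage selection described above: identify the maximum-norm key filter of a convolutional layer, then keep the filters most correlated with it under cosine similarity. The keep ratio and tensor layout are assumptions for illustration, and the surrounding iterative pruning/fine-tuning loop is omitted:

```python
import torch

def select_filters(weight: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """weight: conv weight of shape (out_channels, in_channels, k, k).

    Returns indices of filters to keep: the key (maximum-norm) filter plus the
    filters with the highest cosine similarity to it."""
    flat = weight.flatten(1)                                 # one row per filter
    key = int(torch.argmax(flat.norm(dim=1)))                # key filter = largest norm
    cos = torch.nn.functional.cosine_similarity(flat, flat[key:key + 1], dim=1)
    n_keep = max(1, int(keep_ratio * flat.size(0)))
    return torch.topk(cos, n_keep).indices                   # most correlated filters
```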
Abstract: Objective: With the development of face recognition technology, face images have been used for identity verification in many fields. As important biometric features, face images usually contain personal identity information. When illegally obtained and used by attackers, these images may cause serious losses and harm to individuals. Protecting face privacy and security has always been an urgent problem. This paper studies the de-identification of face images and the convenient and efficient use of class universal perturbations for face privacy protection. The class universal perturbation method generates exclusive perturbation information for each user, and this exclusive perturbation is then superimposed on the user’s face images for de-identification, thus resisting malicious analysis of user information by deep face recognizers. Given the limited number of face images provided by users, using class universal perturbations for de-identification often faces the problem of insufficient samples. In addition, extracting face image features can be difficult due to variations in shooting angles, which increases the difficulty of learning user features through class universal perturbations. At the same time, class universal perturbations face a complex protection scenario: they are generated on a local proxy model and need to be able to resist different face recognition models. These face recognition models use different datasets, loss functions, and network architectures, thus increasing the difficulty of generating class universal perturbations with transferability. In view of the insufficient user training data and the need to further improve the protection effect of perturbations in the field of class universal perturbations, a generation method for class universal perturbations constrained by the triplet loss function is proposed in this paper, called face image de-identification with class universal perturbations based on triplet constraints (TC-CUAP). Negative samples are constructed based on a feature subspace to augment the training data and to obtain samples in triplets. Method: The ResNet50 deep neural network is adopted to extract the features of user face images, which are used as positive samples for training. The feature subspace is then constructed using three affine combination methods (i.e., affine hull, convex hull, and class center) applied to the positive samples. The maximum distance between the samples and the feature subspace is solved by a convex optimization method. The training samples are optimized along the direction away from the feature subspace, and the optimized samples are labeled as negative samples. Perturbations are randomly generated as initial values for the class universal perturbations before they are added to the original images. Features are then extracted from the perturbed images to obtain the training samples. The positive, negative, and training samples constitute the triplets required for training. The cosine distance is used as the measure when training the perturbations. The distance between the training samples and the positive samples is maximized, while that between the training samples and the negative samples is minimized, so that, at the same distance from the positive samples, the training samples are pushed closer to the negative samples, thus allowing the perturbations to learn more adversarial information within a limited range. A scaling transformation is then applied to the generated perturbation.
Those parts of the perturbation whose values are greater than 0 are set to the upper limit of the perturbation threshold, while those parts whose values are less than 0 are set to the lower limit. The class universal perturbation is ultimately obtained. Result: The data required for the experiments are taken from the MegaFace challenge, MS-Celeb-1M, and LFW datasets. The Privacy-Commons dataset, which represents ordinary users, and the Privacy-Celebrities dataset, which represents celebrity users, are then constructed, and test sets corresponding to these two datasets are built using data from the MegaFace challenge, MS-Celeb-1M, and LFW datasets. Black-box tests are conducted on the Privacy-Commons and Privacy-Celebrities datasets against face recognition models with different loss functions and network architectures. Three of the black-box models use different loss functions, namely, CosFace, ArcFace, and SFace, while the other three black-box models use different network architectures, namely, SENet, MobileNet, and IResNet variants. The proposed TC-CUAP is then compared with the generalizable data-free objective for crafting universal perturbations (GD-UAP), generative adversarial perturbations (GAP), universal adversarial perturbations (UAP), and one person one mask (OPOM). On the Privacy-Commons dataset, the highest Top-1 protection success rates of each method against the different face recognition models are 8.7% (GD-UAP), 59.7% (GAP), 64.2% (UAP), 86.5% (OPOM), and 90.6% (TC-CUAP), while the highest Top-5 protection success rates are 3.5% (GD-UAP), 46.7% (GAP), 51.7% (UAP), 80.1% (OPOM), and 85.8% (TC-CUAP). Compared with the well-known OPOM method, the TC-CUAP method improves the protection success rate by an average of 5.74%. On the Privacy-Celebrities dataset, the highest Top-1 protection success rates of each method against the different face recognition models are 10.7% (GD-UAP), 53.3% (GAP), 59% (UAP), 69.6% (OPOM), and 75.9% (TC-CUAP), while the highest Top-5 protection success rates are 4.2% (GD-UAP), 42.7% (GAP), 47.8% (UAP), 60.6% (OPOM), and 67.9% (TC-CUAP). Compared with the well-known OPOM method, the TC-CUAP method improves the protection success rate by an average of 5.81%. The time spent generating perturbations for 500 users is used as an indicator of the efficiency of each method. The time consumption of each method is 19.44 min (OPOM), 10.41 min (UAP), 6.52 min (TC-CUAP), 4.51 min (GAP), and 1.12 min (GD-UAP). The above experimental results verify the superiority of the TC-CUAP method in face de-identification and its transferability across different models. The TC-CUAP method with the perturbation scaling transformation achieves average Top-1 protection success rates of 80% and 64.6% on the Privacy-Commons and Privacy-Celebrities datasets, respectively, while the TC-CUAP method without the perturbation scaling transformation achieves average Top-1 protection success rates of 78.1% and 62.5%. The TC-CUAP method with the perturbation scaling transformation thus increases the protection success rate by about 2%, proving its effectiveness. In addition to using the convex hull to model the user feature subspace and generate negative samples, these samples can also be constructed using feature iterative universal adversarial perturbations (FI-UAP), FI-UAP incorporating intra-class interactions (FI-UAP+), and Gaussian random perturbation.
On the Privacy-Commons and Privacy-Celebrities datasets, these approaches obtain the highest Top-1 protection success rates of 85.6% (FI-UAP), 86% (FI-UAP+), 44.8% (Gauss), and 90.6% (convex hull). Using the convex hull yields a 4.9% higher average protection success rate than using the second-best FI-UAP+ construction, thereby verifying the rationality of the negative sample construction described in this paper. Conclusion: The proposed method uses positive, negative, and training samples as constraints to obtain the class universal perturbations for face image de-identification. The negative samples are constructed from the original training data, thus alleviating the problem of insufficient training samples. The class universal perturbation trained with these triplet constraints provides feature attack information. At the same time, the introduction of perturbation scaling increases the strength of the class universal perturbation and improves the face image de-identification effect. The superiority of this method is further verified by comparing its face de-identification performance with that of GD-UAP, GAP, UAP, and OPOM.
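A simplified sketch of the triplet-style objective and the final scaling step described above: cosine similarities to the positive and negative features drive the perturbation update, and the scaling transformation snaps the perturbation to the threshold bounds by sign. The feature extractor, the threshold value, and the absence of any loss weighting are placeholders and assumptions:

```python
import torch
import torch.nn.functional as F

def triplet_cosine_objective(feat_adv: torch.Tensor, feat_pos: torch.Tensor,
                             feat_neg: torch.Tensor) -> torch.Tensor:
    """Minimized during training: push the perturbed-image features away from the
    user's (positive) features and toward the subspace-derived negative features."""
    return F.cosine_similarity(feat_adv, feat_pos).mean() \
         - F.cosine_similarity(feat_adv, feat_neg).mean()

def scale_perturbation(delta: torch.Tensor, epsilon: float = 8.0 / 255.0) -> torch.Tensor:
    """Final scaling transformation: positive entries go to +epsilon and negative
    entries to -epsilon, so the full perturbation budget is used."""
    return torch.sign(delta) * epsilon
```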
Abstract: Objective: The extraction and utilization of contour information, a low-level visual feature of the target subject, contribute to the efficient execution of advanced visual tasks, such as object detection and image segmentation. When processing complex images, contour detection based on biological vision mechanisms can quickly extract object contour information. However, the perception of primary contour information is currently based on a single-scale receptive field template or a simple fusion of multi-scale receptive field templates, which ignores the dynamic characteristics of receptive field scales and makes it difficult to accurately extract contours in complex scenes. Considering the serial-parallel transmission and integration mechanism of visual information in the magnocellular (M) and parvocellular (P) dual visual pathways, we propose a new contour detection method based on the fusion of scale information from the dual visual pathways. Method: First, we introduce Lab, a color space that is close to human visual physiological characteristics, to extract color difference and brightness information from an image. Compared with the conventional RGB color space, Lab is more in line with the way the human eye perceives visual information. Considering that the receptive field scale of ganglion cells varies with the size of local stimuli to adapt to different visual task requirements across various scenes, where a smaller receptive field scale corresponds to a more refined perception of detailed information, we simulate the coarse and fine perception of stimuli by ganglion cells using receptive fields at two different scales, and we use color difference and brightness contrast information to guide the adaptive fusion of the large- and small-scale receptive field responses and highlight contour details. Second, considering the differences in the perception of orientation information among receptive fields at different scales in the lateral geniculate nucleus, we introduce the standard deviation of the optimal orientation obtained from multi-scale perception as the encoding weight for the orientation difference, thereby modulating the texture suppression weights. We also combine local contrast information to guide the lateral inhibition intensity of non-classical receptive fields based on the difference between the central and peripheral orientations. Through the collaborative integration of these two cues, we enhance contour regions and suppress background textures. Finally, to simulate the complementary fusion mechanism of color and brightness information in the primary visual cortex (V1), we propose a weight association model integrating contrast information. Based on the fusion weight coefficients obtained from the local color contrast and brightness contrast, we achieve a complementary fusion of the information flows in the M and P pathways, thereby enriching contour details. Result: We compare our model with three models based on biological vision mechanisms (SCSI, SED, and BAR) and one deep-learning-based model (PiDiNet). On the BSDS500 dataset, we use several quantitative evaluation indicators, including the optimal dataset scale (ODS), optimal image scale (OIS), and average precision (AP) indicators and precision-recall (PR) curves, and select five images to compare the detection performance of each method. Experimental results show that our model has a better overall performance than the other models.
Compared with SCSI, SED, and BAR, our model obtains a 4.45%, 2.94%, and 4.45% higher ODS index, a 2.82%, 5.80%, and 8.96% higher OIS index, and a 7.25%, 4.23%, and 5.71% higher AP index, respectively. Although our model falls slightly short of the deep-learning-based PiDiNet on some indicators, it requires no data pre-training, offers biological interpretability, and demands little computational power. We further extracted four images from the NYUD dataset to visually compare the false detection rate, missed detection rate, and overall performance of the models. We also conducted a series of ablation experiments to demonstrate the contribution of each module in the model to its overall performance. Conclusion: In this paper, we use the M and P dual-path mechanism and the encoding process of luminance and color information in the front-end visual path to realize contour information processing and extraction. Our proposed approach can effectively realize contour detection in natural images, especially the detection of subtle contour edges, and provides novel insights for studying visual information mechanisms in the higher-level cortex.
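The adaptive fusion of large- and small-scale receptive field responses guided by local contrast can be illustrated with the simple NumPy/SciPy sketch below; the Difference-of-Gaussians responses, window size, and min-max weighting are stand-ins for the receptive field model and guidance used in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def contrast_guided_fusion(lum, sigma_small=1.0, sigma_large=3.0, win=9):
    """Fuse fine- and coarse-scale edge responses of a luminance image."""
    lum = np.asarray(lum, dtype=float)

    # edge responses at two receptive-field scales (Difference of Gaussians)
    r_small = np.abs(lum - gaussian_filter(lum, sigma_small))
    r_large = np.abs(lum - gaussian_filter(lum, sigma_large))

    # local brightness contrast: standard deviation in a sliding window
    mean = uniform_filter(lum, win)
    var = uniform_filter(lum ** 2, win) - mean ** 2
    contrast = np.sqrt(np.clip(var, 0.0, None))

    # high-contrast regions rely more on the fine-scale response
    w = (contrast - contrast.min()) / (np.ptp(contrast) + 1e-8)
    return w * r_small + (1.0 - w) * r_large
```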
摘要:ObjectiveDeep neural networks (DNNs) have witnessed widespread application across diverse domains and demonstrated their remarkable performance, particularly in the realm of computer vision. However, adversarial examples pose a significant security threat to DNNs. Adversarial attacks are categorized into white-box and black-box attacks based on their access to the target model’s architecture and parameters. On the one hand, white-box attacks utilize techniques, such as backpropagation, to attain high attack success rates by leveraging knowledge about the target model. On the other hand, black-box attacks generate adversarial examples on an alternative model before launching attacks on the target model. Despite their alignment with real-world scenarios, black-box attacks generally exhibit low success rates due to the limited knowledge about the target model. The existing methods for addressing adversarial attacks typically focus on perturbations in the spatial domain or the influence of frequency information in images yet neglect the importance of the other domain. The spatial and frequency domain information of images are crucial for model recognition. Therefore, considering only one domain leads to insufficient generalization of the generated adversarial examples. This paper addresses this gap by introducing a novel black-box adversarial attack method called multi-domain feature mixup (MDFM), which aims to enhance the transferability of adversarial examples by considering both domains.MethodIn the initial iteration, discrete cosine transform is employed to convert the original images from the spatial domain to the frequency domain and to store the clean frequency domain features of the original images. Subsequently, inverse discrete cosine transform is employed to transform the images from the frequency domain back to the spatial domain. An alternative model is then applied to extract the clean spatial domain features of the original images. In subsequent iterations, the perturbed images are transitioned from the spatial domain to the frequency domain. The preserved clean features are then arranged based on the images, thus enabling the mixing of these images with their own clean features or those of other images. The frequency domain features of the perturbed and clean images are mixed. Random mixing ratios are applied within the corresponding channels of the image to introduce arbitrary variations that are influenced by clean frequency domain features, thus instigating diverse interference effects. The mixed features are then reconverted to the spatial domain where they undergo further mixing with the clean spatial domain features during the alternative model processing. Shuffle and random channel mixing ratios are also implemented, and adversarial examples are ultimately generated.ResultExtensive experiments are conducted on the CIFAR-10 and ImageNet datasets. On the CIFAR-10 dataset, ResNet-50 is utilized as the surrogate model to generate adversarial examples, and MDFM is tested on the VGG-16, ResNet-18, MobileNet-v2, Inception-v3, and DenseNet-121 ensemble models trained under different defense configurations to evaluate its performance in addressing advanced black-box adversarial attack methods, such as VT, Admix, and clean feature mixup (CFM). Experimental results demonstrate that MDFM achieves the highest attack success rates across these models, reaching 89.8% on average. 
Compared with the state-of-the-art CFM method, MDFM achieves a 0.5% improvement in its average attack success rate. On the ImageNet dataset, ResNet-50 and Inception-v3 are employed as surrogate models, and MDFM is tested on the VGG-16, ResNet-18, ResNet-50, DenseNet-121, Xception, MobileNet-v2, EfficientNet-B0, Inception ResNet-v2, Inception-v3, and Inception-v4 target models. When ResNet-50 serves as the surrogate model, the experimental results indicate that MDFM attains the highest attack success rates across all target models, surpassing the other attack methods. Compared with CFM, MDFM achieves a 1.6% higher average attack success rate. This improvement reaches 3.6% when tested on the MobileNet-v2 model. When Inception-v3 is employed as the surrogate model, MDFM consistently achieves the highest attack success rates across all nine target models, surpassing the other methods. MDFM consistently outperforms CFM on all models, demonstrating a maximum improvement of 2.5% in terms of attack success rate. The average success rate reaches 40.6%, which is 1.4% higher than that of the state-of-the-art CFM. To further validate the effectiveness of MDFM, it is tested on adv-ResNet-50 and five Transformer-based models. ResNet-50 and adv-ResNet-50 are used as surrogate models in these tests. When ResNet-50 serves as the surrogate model, MDFM achieves the highest attack success rates across all five models, with an average improvement of 1.5% over CFM. The most significant improvement is observed on the PiT model, where MDFM achieves a 2.8% gain in attack success rate and surpasses CFM by 1.5%. Meanwhile, when adv-ResNet-50 is employed as the surrogate model, MDFM achieves an average attack success rate of 59.4%, surpassing the other methods. On the ConViT model, MDFM improves over CFM by 1.9%, and its average attack success rate surpasses that of CFM by 0.8%. Conclusion: This paper introduces MDFM, a novel method specifically designed for adversarial attacks in black-box scenarios. MDFM mixes clean features across multiple domains, prompting adversarial examples to leverage a diverse set of features to overcome the interference caused by clean features. As a result, highly diverse adversarial examples are generated, and their transferability is enhanced.
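The frequency-domain half of the mixing strategy described above can be sketched as follows: the perturbed images are moved into the DCT domain, blended channel-wise with cached clean DCT features (their own or a shuffled image's) using random ratios, and transformed back; the ratio range and shuffling policy are placeholders rather than MDFM's exact settings.

```python
import numpy as np
from scipy.fft import dctn, idctn

def mix_clean_frequency_features(x_adv, clean_dct, low=0.0, high=0.25,
                                 shuffle=True, rng=None):
    """Blend perturbed images with stored clean DCT features, channel-wise.

    x_adv     : (N, C, H, W) current perturbed images
    clean_dct : (N, C, H, W) DCT of the original clean images (cached at iteration 0)
    """
    rng = np.random.default_rng() if rng is None else rng
    n, c, _, _ = x_adv.shape

    # move the perturbed images into the frequency domain
    f_adv = dctn(x_adv, axes=(-2, -1), norm="ortho")

    # optionally pair each image with another image's clean features
    idx = rng.permutation(n) if shuffle else np.arange(n)

    # one random mixing ratio per channel
    lam = rng.uniform(low, high, size=(n, c, 1, 1))
    f_mix = (1.0 - lam) * f_adv + lam * clean_dct[idx]

    # back to the spatial domain for the next forward pass
    return idctn(f_mix, axes=(-2, -1), norm="ortho")
```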
摘要:ObjectiveDeep neural networks (DNNs) have been successfully applied in many fields, especially in computer vision, which cannot be achieved without large-scale labeled datasets. However, collecting large-scale datasets with accurate labels is difficult in practice, especially in some professional fields. The labeling of these datasets requires the involvement of relevant experts, thus increasing manpower and financial resources. To cut costs, researchers have started using datasets built by crowdsourcing annotations, search engine queries, and web crawling, among others. However, these datasets inevitably contain noisy labels that seriously affect the generalization of DNNs because DNNs memorize these noise labels during training. Learning algorithms based on co-teaching methods, including Co-teaching+, JoCoR, and CoDis, can effectively alleviate the learning problem of neural networks on noisy label data. Scholars have put forward different opinions regarding the use of two networks to solve noisy labels. However, in a noisy label environment, the deep learning model based on CE loss is very sensitive to the noisy label, thus making the model easily fit the noisy label sample and unable to learn the real pattern of the data. With the progress of training, Co-teaching causes the parameters of the two networks to gradually become consistent and prematurely converge to the same network, thus stopping the learning process. As the iteration progresses, the network inevitably remembers some of the noisy label samples and thus failing to distinguish the noisy from the clean samples accurately based on the cross entropy (CE) loss value. In this case, relying solely on CE loss as a small loss selection strategy is not reliable. To solve these problems, this paper proposes learning with noisy labels by co-teaching with history losses (Co-history) that considers historical information in collaborative learning.MethodFirst, to solve the overfitting problem of cross entropy loss (CE) in a noisy label environment, a correction loss is proposed by analyzing the history of sample loss. The revised loss function adjusts the weight of the CE loss in the current iteration in order for the CE loss of the sample to remain stable in the historical iteration as a whole, hence conforming to the law that the classifier should be maintained after separating the noisy from the clean samples so as to reduce the influence of overfitting caused by CE loss. Second, the difference loss is proposed to address the problem of premature convergence of two networks in the co-teaching algorithm. Inspired by contrast loss, the difference loss makes the two networks maintain a certain distance from the feature representation of the same sample so as to maintain the difference between these networks in the training process and to avoid their degradation into a single network. Given the differences in the network parameters, various decision boundaries are generated, and different types of errors are filtered. Therefore, maintaining such difference can benefit collaborative learning. Finally, due to the existence of overfitting, those samples with noisy labels tend to have larger loss fluctuations than those samples with clean labels. By combining the historical loss information of these samples and following the small loss selection strategy, a new sample selection method is proposed to select clean samples accurately. 
Specifically, those samples with low classification losses and low fluctuations in historical losses are selected as clean samples for training.ResultSeveral experiments are conducted to demonstrate the effectiveness of the Co-history algorithm, including comparison experiments on four standard datasets (F-MNIST, SVHN, CIFAR-10, and CIFAR-100) and one real dataset (Clothing1M). Four categories of artificially simulated noise are added to the standard dataset, including symmetric, asymmetric, pairflip, and tridiagonal noise types, with 20% and 40% noise rates for each noise type. In the real dataset, the labels are generated by the text around the image, which contains the noise label itself, thus generating no additional label noise. At the symmetric noise type with 20% noise rate, the co-history algorithm demonstrates 2.05%, 2.19%, 3.06%, and 2.58% improvements over the co-teaching algorithm in the F-MNIST, SVHN, CIFAR-10, and CIFAR-100 datasets, respectively. With 40% noise rate, the corresponding improvements are 3.52%, 4.77%, 6.16%, and 6.96%. In the real Clothing1M dataset, the best and lowest accuracies of co-history have improved by 0.94% and 1.2%, respectively, compared with the co-teaching algorithm. The effectiveness of the proposed loss is proven by ablation experiments.ConclusionA correction loss is proposed in this paper to address the overfitting problem of CE loss training and the historical law of sample loss, and a difference loss function is introduced to solve the premature convergence of two networks in Co-teaching. In view of the traditional small-loss sample selection strategy, the historical law of sample loss is fully considered in this paper, and a highly accurate sample selection strategy is developed. The proposed Co-history algorithm demonstrates its superiority over the existing co-teaching strategies in a large number of experiments. This algorithm also shows strong robustness in datasets with noisy labels and is particularly suitable for noisy label scenarios. The various improvements in this algorithm are also clearly demonstrated in ablation experiments. Given that this algorithm needs to analyze the historical loss information of each sample, the historical loss value of each sample should be saved. Increasing the number of training samples would occupy more memory space, thus increasing computing and storage costs. In addition, with a large number of sample categories, the performance of the proposed algorithm becomes suboptimal under some noisy environments (e.g., asymmetric noise type with 40% noise rate and the CIFAR-100 dataset with 20% noise rate). Future work will focus on the development of high-performance solutions under the premise of guaranteed accuracy and excellent robust classification algorithms for learning with noisy labels.
关键词:deep neural network(DNN);classification;noisy labels;co-teaching;historical loss
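A minimal sketch of the selection rule used by Co-history as summarized above: each sample's cross-entropy loss is recorded across epochs, and samples with both a small current loss and a small historical fluctuation are kept as clean; the additive score and the keep ratio are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

class LossHistorySelector:
    """Select likely-clean samples from per-sample loss histories."""

    def __init__(self, num_samples, keep_ratio=0.8):
        self.history = [[] for _ in range(num_samples)]
        self.keep_ratio = keep_ratio

    def update(self, indices, losses):
        """Record this epoch's CE loss for every sample in the batch."""
        for i, loss in zip(indices, losses):
            self.history[i].append(float(loss))

    def select(self, indices, losses):
        """Return the batch positions treated as clean samples."""
        fluct = np.array([np.std(self.history[i]) if len(self.history[i]) > 1 else 0.0
                          for i in indices])
        score = np.asarray(losses) + fluct        # small loss + small fluctuation
        k = max(1, int(self.keep_ratio * len(indices)))
        return np.argsort(score)[:k]              # positions of the k smallest scores
```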
摘要:ObjectiveWith the rapid development of the virtual reality (VR) industry, the omnidirectional image acts as an important medium of visual representation of VR and may degrade in the procedure of acquisition, transmission, processing, and storage. Omnidirectional image quality assessment (OIQA) is an evaluation technique that aims to quantitatively describe the degradation of omnidirectional images and plays a crucial role in algorithm improvement and system optimization. Generally, the omnidirectional image has some inherent characteristics, i.e., geometric deformation in the polar region and semantic information more concentrated on the equatorial region. The viewing behavior can conspicuously affect the perceptual quality of an omnidirectional image. Early OIQA methods that simply fuse this inherent characteristic in 2D-IQA do not consider the significant user viewing behavior, thus obtaining suboptimal performance. Considering the viewport representation that is in line with the user viewing behavior, some deep learning-based OIQA methods have recently achieved promising performance by taking the predicted viewport sequence as the model input and computing the degradation. However, the prediction of the viewport sequence is difficult and viewport extraction needs a series of pixel-wise computations, thus leading to a significant computation load and hampering the application in the industry environment. To address the above problems, we proposed a new no-reference OIQA model, which introduces an equirectangular modulated deformable convolution (EquiMdconv) that can deal with the irregular semantics and the regular deformation caused by equirectangular projection simultaneously without the predicted viewport sequence.MethodWe propose a viewport-independent and deformation-unaware no-reference OIQA model for omnidirectional image quality assessment. Our model is composed of three parts: a prior-guided patch sampling (PPS) module, a deformable-unaware feature extraction (DUFE) module, and an intra-interpatch attention aggregation (A-EPAA) module. The PPS module samples a set of patch images on the basis of prior probability distribution in a slice-based manner to represent the complete image quality information. DUFE aims to extract the perceptual quality features of the input patch images, considering the irregular semantics and regular deformation in this process. It contains eight blocks, and each block comprises an EquiMconv layer, a 1 × 1 convolutional layer, a batch normalization layer, and a 3 × 3 max pooling layer. The EquiMconv layer employs a modulated deformable convolution layer that introduces learnable offset parameters to model distortions in the images more accurately. Furthermore, we incorporate fixed offsets based on distortion regularity factors into the deformable convolution’s offset to effectively eliminate the regular deformation. The A-EPAA comprises a convolutional block attention module (CBAM) and a patch attention module (PA). The CBAM assigns weights to each channel to adjust perceptual quality features in both channel and spatial dimensions. The PA adjusts the contribution weights between patch images for an overall quality assessment. We train the proposed model on the CVIQ, OIQA, and JUFE databases. In the training stage, we split each database into two parts: 80% for training and 20% for testing. We sample 10 patch images from each omnidirectional image, and the size of the patch image is set to 224 × 224. 
All experiments are implemented on a server with an NVIDIA GTX A5000 GPU. The adaptive moment estimation (Adam) optimizer is utilized to optimize our model. We train the model for 300 epochs on the CVIQ and OIQA databases and 20 epochs on the JUFE database; the learning rate is 0.000 1 and the batch size is 16. Result: We conduct experiments covering three databases, namely, CVIQ, OIQA, and JUFE. We demonstrate the performance of the proposed model by comparing it with nine viewport-independent models and five viewport-dependent models. To ensure a persuasive comparison result, we select the Pearson linear correlation coefficient and Spearman's rank correlation coefficient (SRCC) as performance evaluation standards. The results indicate that compared with those of the state-of-the-art viewport-dependent model, i.e., Assessor360, the parameters of our model are reduced by 93.7% and the floating point operations are reduced by 95.4%. Compared with MC360IQA, which has a similar model size, the SRCC is increased by 1.9%, 1.7%, and 4.3% on the CVIQ, OIQA, and JUFE databases, respectively. Conclusion: Our proposed viewport-independent and deformation-unaware no-reference OIQA model thoroughly considers the characteristics of the omnidirectional image. It can effectively extract quality features and accurately assess the quality of omnidirectional images with limited computational cost.
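The prior-guided patch sampling step can be illustrated with the sketch below, which draws patch centers from an equirectangular image with a cosine-latitude prior so that equatorial content is sampled more often than polar content; the specific prior and patch count are assumptions and not necessarily the distribution used by the PPS module.

```python
import numpy as np

def sample_equator_biased_patches(erp, num_patches=10, patch=224, rng=None):
    """Crop square patches from an equirectangular image (H, W, 3),
    with patch centers biased toward the equator via a cos(latitude) prior."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = erp.shape
    half = patch // 2

    # latitude prior: rows near the equator (v = h / 2) are more probable
    rows = np.arange(half, h - half)
    lat = (rows / (h - 1) - 0.5) * np.pi          # latitude in [-pi/2, pi/2]
    prob = np.cos(lat)
    prob /= prob.sum()

    patches = []
    for _ in range(num_patches):
        cy = rng.choice(rows, p=prob)
        cx = rng.integers(half, w - half)
        patches.append(erp[cy - half:cy + half, cx - half:cx + half])
    return np.stack(patches)
```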
摘要:ObjectiveSkeletal motion retargeting is a key technology that involves adapting skeletal motion data from a source character, after suitable modification, to a target character with a different skeleton structure, thereby ensuring that the target character performs actions identical to the source. This process, which is particularly crucial in animation production and game development, can greatly promote the reuse of existing motion data and significantly reduce the need to create new motion data from scratch. Skeletal motion data have an inherently strong relationship with a character’s skeleton structure, and the core challenge in retargeting lies in extracting motion data features that are independent of the source skeleton and solely embody the essence and pattern of the action. The complexity in this process increases markedly during practical applications, especially when the source and target characters stem from distinct datasets (e.g., translating motion capture data from real human subjects onto virtual animated characters with heterogeneous skeletal structures). The differences between such datasets extend beyond mere skeletal disparities and may encompass inconsistencies in capturing equipment, physiological variations among individuals, and diverse action execution environments. Collectively, these factors produce significant discrepancies between the source and target characters in terms of global movement ranges, joint angle variation range, and other motion attributes, thus posing formidable challenges for retargeting algorithms. This paper addresses the problem of overcoming data heterogeneity to enable a precise motion retargeting from real human motion data to heterogeneous yet topologically equivalent virtual animated characters. To this end, this paper proposes several strategies for feature separation and high-order skeletal convolution operators.MethodDuring the data preprocessing stage, feature separation is applied on the motion data to isolate those components that are independent of the skeletal structure. This approach significantly reduces the complexity of the data and consequently reduce the difficulty of the heterogeneous retargeting task and facilitate the attainment of superior retargeting outcomes. Moreover, given the high sensitivity of motion retargeting tasks to local features, this paper delves into the distance information between joints and, in conjunction with higher-order graph convolution theory, introduces innovative improvements to conventional skeletal convolution methods, ultimately proposing a novel high-order skeletal convolution operator. In high-order graph convolutional operations, the employed adjacency matrices of higher powers encapsulate a more abundant and tangible information profile. These matrices not only encompass fundamental structural information, i.e., direct adjacencies between nodes, but may also be extended to embody the multi-level distance characteristics among nodes. 
This new operator harnesses the rich adjacency relationships and distance information encapsulated within higher-order adjacency matrices, thereby enabling convolution operations to thoroughly and comprehensively extract the intrinsic structural features of the skeleton and enhancing the accuracy and visual effect of the retargeting results. Result: In the heterogeneous motion retargeting task, the proposed algorithm demonstrates a significant improvement (38.6%) in retargeting accuracy compared with the current state-of-the-art methods when evaluated using the synthetic animation dataset Mixamo. To further understand the model's characteristics, the root joint errors are analyzed to assess its precision in handling root joint positions. Results show that relative to extant methods, the proposed algorithm reduces the root joint position errors by 35.5%, hence substantiating its exceptional capability in addressing retargeting tasks with large ranges of root joint position variations. This algorithm also demonstrates its applicability and superiority in homogeneous motion retargeting tasks, achieving a 74.8% higher accuracy compared with extant methods. In practical applications, when applying real-world motion data captured from humans to the retargeting of virtual animated characters in a heterogeneous context, our algorithm excels at delivering high levels of authenticity in reproducing specific actions and significantly reducing retargeting errors. Conclusion: This paper presents a framework that is capable of handling challenging motion retargeting tasks between heterogeneous yet topologically equivalent skeletons. When the training data originate from two significantly diverse datasets, the proposed data preprocessing methods and high-order skeletal convolutional operators enable the neural network models to effectively extract motion features from the source data and integrate them into the target skeleton, thereby generating skeletal motion data for the target character. By separating features of the motion data that are independent of the skeleton structure, the proposed model can focus on structure-relevant information, thereby effectively decoupling motion information from structural details and achieving motion retargeting. Additionally, by assigning different weights to joints at varying distances, the high-order skeletal convolutional operators gather enhanced skeletal structural information to improve network performance.
关键词:deep learning;motion retargeting;graph convolutional network;autoencoder;Human3.6M motion data
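A generic form of the high-order skeletal convolution described above is sketched below: row-normalized powers of the skeleton adjacency matrix expose multi-hop joint relations, and each order receives its own weight matrix; the order count and normalization are illustrative choices rather than the paper's exact operator.

```python
import torch
import torch.nn as nn

class HighOrderSkeletonConv(nn.Module):
    """Aggregate joint features over k-hop neighborhoods of the skeleton graph."""

    def __init__(self, in_dim, out_dim, adjacency, max_order=3):
        super().__init__()
        a = adjacency + torch.eye(adjacency.shape[0])          # add self-loops
        powers, p = [], torch.eye(a.shape[0])
        for _ in range(max_order):
            p = p @ a                                          # A, A^2, A^3, ...
            d = p.sum(dim=1, keepdim=True).clamp(min=1e-6)
            powers.append(p / d)                               # row-normalize A^k
        self.register_buffer("powers", torch.stack(powers))    # (K, J, J)
        self.weights = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(max_order)])

    def forward(self, x):
        # x: (batch, joints, in_dim); each hop order gets its own weight matrix
        out = 0
        for a_k, w_k in zip(self.powers, self.weights):
            out = out + w_k(a_k @ x)
        return out
```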
摘要:Objective: Federated learning enables multiple parties to collaboratively train a machine learning model without communicating their local data. In practical applications, the data between nodes usually follow a non-independent identical distribution (non-IID). In the local update, each client model will be optimized toward its local optima (i.e., fitting its individual feature distribution) instead of the global optimal objective, thus raising a client update drift. Meanwhile, in global updates that aggregate these diverged local models, the server model is further distracted by the set of mismatching local optima, which subsequently leads to a global drift at the server. To solve the problems of slow global convergence and an increasing number of training communication rounds caused by non-IID data, this paper proposes a joint dynamic correction federated learning algorithm (FedJDC) that is optimized from both the client and the server. Method: To reduce the influence of non-IID data on federated learning, this paper carries out a joint optimization from the two aspects of local model update and global model update and proposes the FedJDC algorithm. This paper then uses the cosine similarity between the local and global update directions to measure the offset of each participating client. Afterward, given that each client has a different degree of non-IID data, if the degree of the model offset is only determined by the cosine similarity calculated in this round, then the model update may become unstable. Therefore, the FedJDC algorithm defines the cumulative offset and introduces the attenuation coefficient ρ. In calculating the cumulative offset of the model, the current and historical cumulative offsets are taken into account. In addition, by changing ρ to reduce the proportion of the cumulative offset of the current round, the influence of the offset of the current round on the final result can be reduced. This paper also proposes a strategy for dynamically adjusting the constraint terms for the local model update offset. Specifically, the constraint terms of the local loss function are dynamically adjusted according to the calculated cumulative offset of the local model, and the algorithm is automatically adapted to various non-IID settings without a careful selection of hyperparameters, thus improving the flexibility of the algorithm. To dynamically change the weight of global model aggregation in each round and effectively improve the convergence speed and model accuracy, this paper also designs a dynamic weighted aggregation strategy that takes the accumulated offset uploaded by all clients as the weight of global model aggregation in each round of communication. Result: The proposed method is tested on three datasets using different deep learning models. LeNet-5, VGG16, and ResNet18 are used for training on the MNIST, FMNIST, and CIFAR10 datasets, respectively. Four experiments are designed to prove the effectiveness of the proposed algorithm. To verify the accuracy of FedJDC at different degrees of non-IID data, the hyperparameter β of the Dirichlet distribution is varied, and the performance of different algorithms is compared. Experimental results show that FedJDC can improve the model accuracy by 5.48%, 1.62%, 2.10%, and 2.28% on average compared with FedAvg, FedProx, FedAdp, and FedLAW, respectively.
To evaluate the communication efficiency of FedJDC, the number of communication rounds is counted as FedJDC reaches a target accuracy, and this number is compared with that obtained by other algorithms. Experimental results show that under different degrees of non-IID, FedJDC can reduce communication rounds by 62.29%, 20.90%, 24.93%, and 20.47% on average compared with FedAvg, FedProx, FedAdp, and FedLAW, respectively. This paper also investigates the effect of the number of local epochs on the accuracy of the final model. Experimental results show that FedJDC outperforms the other four methods under different epochs in terms of final model accuracy. FedJDC also demonstrates better robustness against the larger offset caused by more local update epochs. Ablation experiments also show that each optimization method performs well on all datasets, and FedJDC combines these two strategies to achieve the global optimal performance.ConclusionThis paper optimizes the local and global model offsets from two aspects and proposes a joint dynamic correction algorithm for these offsets in federated learning. The cumulative offset is defined, and the attenuation coefficient is introduced into the calculation of the cumulative offset. Considering the historical and current offset information, the size of the cumulative offset is dynamically adjusted to ensure the stability of the training parameter update. The dynamic constraint strategy takes the cumulative offset calculated by the client in each round as the constraint parameter of the client model. The dynamic weighted aggregation strategy changes the weight of each local model during the global model aggregation based on the cumulative offset of each participating client so as to dynamically update the global model in each round. The combination of the two optimization strategies has achieved good results, effectively alleviated the performance degradation of the federated learning model caused by non-IID data, and provided a good foundation for the further implementation of federated learning in this field.
关键词:federated learning(FL);non-independent identical distribution (non-IID);loss function;model aggregation;convergence
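The two ingredients of FedJDC summarized above, a cosine-similarity offset accumulated with the attenuation coefficient ρ and offset-driven aggregation weights, can be sketched as follows; the exact update rule and the softmax-based weighting are assumptions for illustration rather than the paper's formulas.

```python
import torch
import torch.nn.functional as F

def update_cumulative_offset(prev_offset, local_update, global_update, rho=0.9):
    """Accumulate how far a client's update direction drifts from the global one."""
    cos = F.cosine_similarity(local_update.flatten(), global_update.flatten(), dim=0)
    offset = 1.0 - cos                      # 0 when aligned, up to 2 when opposite
    return rho * prev_offset + (1.0 - rho) * offset

def aggregate(client_states, cumulative_offsets):
    """Weight client models inversely to their accumulated drift."""
    weights = torch.softmax(-torch.tensor(cumulative_offsets), dim=0)
    new_state = {}
    for key in client_states[0]:
        stacked = torch.stack([s[key].float() for s in client_states])
        new_state[key] = (weights.view(-1, *[1] * (stacked.dim() - 1)) * stacked).sum(0)
    return new_state
```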
摘要:ObjectiveLow overlap point cloud registration presents a significant obstacle in the realm of computer vision, specifically in the context of deep-learning-based approaches. After acquiring knowledge from global point cloud scenes for feature matching, these deep-learning-based methods often fail to consider the interactions among local features, thus greatly impeding the efficiency of registration in settings where local feature interactions are vital for establishing precise alignment. The intricate interplay among local characteristics, which is crucial for accurately identifying and aligning partially intersecting point clouds, is also inadequately represented. This lack of consideration not only affects the reliability of point cloud registration in situations with limited overlap but also restricts the use of deep learning methods in varied and intricate settings. Therefore, techniques that include the comprehension of local feature interactions into the deep learning framework are crucial for point cloud registration, especially in situations with limited overlap.MethodThe present study introduces a novel technique for aligning point clouds with low overlap. This technique uses the edge adaptive KPConv (EAKPConv) module to enhance the identification of edge characteristics. The integration of local and global features is effectively accomplished by the combination of the hierarchical attention fusion module (HAFM) and the local spatial comparison attention module (LSCAM). LSCAM exploits the capacity of the attention mechanism to consolidate information, thus enabling the model to prioritize those connections with target nodes and to position itself near the clustered center of mass. In this way, the model can flexibly capture complex details of the point cloud. The SSAM system utilizes a hierarchical architecture, in which each tier of local matching modules applies its own similarity metric to quantify the similarities among point clouds. The local features are subsequently modified and transmitted to the subsequent layer of attention modules to establish a hierarchical structure. This structure also allows the model to collect and merge the inputs from local matches at different scales and levels of complexity, thereby forming global feature correspondences. In this model, the multilayer perceptron (MLP) is used to accurately find the ideal correspondences and successfully complete the alignment procedure.ResultThis work provides empirical evidence supporting the improved efficacy of the proposed algorithm as demonstrated by its consistent performance across multiple datasets. Notably, this algorithm achieved impressive registration recall rates of 93.2% and 67.3% on the 3DMatch and 3DLoMatch datasets, respectively. In the experimental evaluation conducted on the ModelNet-40 and ModelLoNet-40 datasets, this algorithm achieved minimal rotational errors of 1.417 degrees and 3.141 degrees, respectively, and recorded translational errors of 0.013 91 and 0.072. These outcomes highlight the effectiveness of this algorithm in point cloud registration and demonstrate its capability to accurately align point clouds with low rotational and translational discrepancies. These results also point to a significant enhancement in the accuracy of the proposed algorithm compared with the REGTR approach. Specifically, in contrast to REGTR, the proposed algorithm achieved significantly reduced inference times of 27.205 ms and 30.991 ms on the 3DMatch and ModelNet-40 datasets, respectively. 
The findings of this study emphasize the performance of the proposed algorithm in effectively addressing the challenge of neglected local feature interactions in point cloud registration tasks with minimal overlap. Conclusion: This article presents a novel point cloud matching technique that combines edge improvement with hierarchical attention. This technique integrates polynomial kernel functions into the EAKPConv framework to improve the identification of edge features in point clouds and uses HAFM to extract specific local information. The module improves feature matching by using the similarities in edge features. This approach successfully achieves a harmonious combination of local and global feature matching, hence enhancing the comprehension of point cloud data. Implementing a hierarchical analysis technique greatly increases the registration accuracy by accurately matching local-global information. Furthermore, enhancing the cross-entropy loss function improves the accuracy of local matching and reduces misalignments. This study assesses the performance of the proposed algorithm on the ModelNet-40, ModelLoNet-40, 3DMatch, and 3DLoMatch datasets, and results indicate that this algorithm substantially enhances registration accuracy, particularly in difficult situations with limited data overlap. This algorithm also exhibits superior registration efficiency compared with standard approaches.
关键词:3D point cloud registration;low-overlap point cloud;edge features;hierarchical attention;local similar matching
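The registration pipeline above ends with learned correspondences feeding a rigid alignment step. As a generic stand-in for that stage (not the paper's MLP-based matcher), the sketch below forms soft correspondences from feature similarity and solves for rotation and translation with a weighted SVD (Kabsch) step.

```python
import torch

def soft_correspondence_alignment(src_xyz, tgt_xyz, src_feat, tgt_feat, temperature=0.1):
    """Estimate a rigid transform aligning src to tgt from learned point features.

    src_xyz / tgt_xyz   : (N, 3) / (M, 3) coordinates
    src_feat / tgt_feat : (N, D) / (M, D) L2-normalized descriptors
    """
    sim = src_feat @ tgt_feat.T / temperature
    attn = torch.softmax(sim, dim=1)                 # soft match for each source point
    matched = attn @ tgt_xyz                         # expected corresponding position
    conf = attn.max(dim=1).values                    # confidence weight per source point

    w = conf / conf.sum()
    src_c = (w[:, None] * src_xyz).sum(0)            # weighted centroids
    tgt_c = (w[:, None] * matched).sum(0)
    h = (src_xyz - src_c).T @ (w[:, None] * (matched - tgt_c))
    u, _, vt = torch.linalg.svd(h)
    diag = torch.eye(3)
    diag[2, 2] = torch.sign(torch.linalg.det(vt.T @ u.T))   # avoid reflections
    r = vt.T @ diag @ u.T
    t = tgt_c - r @ src_c
    return r, t
```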
摘要:ObjectiveFine-grained image classification aims to classify a super-category into multiple sub-categories. This task is more challenging than general image classification due to the subtle inter-class differences and large intra-class variations. The attention mechanism enables the model to focus on the key areas of the input image and the discriminative regional features of the image, which are particularly useful for fine-grained image classification tasks. The attention-based classification model also shows high interpretability. To improve the focus of this model on the image discriminative region, attention-based methods have been applied in fine-grained image classification. Although the current attention-based fine-grained image classification models achieve high classification accuracy, they do not adequately consider the number of model parameters and computational volume. As a result, they cannot be easily deployed on low-resource devices, thus greatly limiting their practical application. The concept of knowledge distillation involves transferring knowledge from a high-accuracy, high-parameter, and computationally expensive large teacher model to a low-parameter and computationally efficient small student model to enhance the performance of the latter and to reduce the cost of model learning. To further reduce the model learning cost, researchers have proposed the self-knowledge distillation method that, unlike traditional knowledge distillation methods, enables models to improve their performance by utilizing their own knowledge instead of relying on teacher networks. However, this method falls short in addressing fine-grained image classification tasks due to its ineffective extraction of discriminative region features from images, which results in unsatisfactory distillation outcomes. To address this issue, we propose a self-knowledge distillation learning method for fine-grained image classification by fusing efficient channel attention (ECASKD).MethodThe proposed method embeds an efficient channel attention mechanism into the structure of the self-knowledge distillation framework to effectively extract the discriminative regional features of images. The framework mainly consists of a self-knowledge distillation network with a lightweight backbone and a self-teacher subnetwork and a joint loss with classification loss, knowledge distillation loss, and multi-layer feature-based knowledge distillation loss. First, we introduce the efficient channel attention (ECA) module, propose the ECA-Residual block, and construct the ECA-Residual Network18 (ECA-ResNet18) lightweight backbone to improve the extraction of multiscale features in discriminative regions of the input image. Compared with the residual module in the original ResNet18, the ECA-Residual block introduces the ECA module after each batch normalization operation. This module consists of two ECA-Residual blocks to form a stage of the ECA-ResNet18 backbone network, enhance the network’s focus on discriminative regions of the image, and facilitate the extraction of multiscale features. Unlike ResNet18, which is commonly used in self-knowledge distillation methods, the proposed backbone is based on the ECA-Residual module, which can significantly enhance the ability of the model to extract multi-scale features while maintaining lightweight and highly efficient computational performance. 
Second, considering the differences in the importance of different scale features output from the backbone network, we design and propose the efficient channel attention bidirectional feature pyramid network (ECA-BiFPN) block that assigns weights to the channels during the feature fusion process to differentiate the contribution of features from various channels to the fine-grained image classification task. Finally, we propose a multi-layer feature-based knowledge distillation loss to enhance the backbone network’s learning from the self-teacher subnetwork and to focus on discriminative regions.ResultOur proposed method achieves classification accuracies of 76.04%, 91.11%, and 87.64% on three publicly available datasets, namely, Caltech-UCSD Birds 200 (CUB), Stanford Cars (CAR), and FGVC-Aircraft (AIR). To ensure a comprehensive and objective evaluation, we compared the proposed ECASKD method with 15 other methods, including data-augmentation, auxiliary-network, and attention-based methods. Compared with data-augmentation-based methods, ECASKD improves the accuracy by 3.89%, 1.94%, and 4.69% on CUB, CAR, and AIR, respectively, with respect to the state-of-the-art (SOTA) method. Compared to the auxiliary network-based method, ECASKD improves the accuracy by 6.17%, 4.93%, and 7.81% on CUB, CAR, and AIR, respectively, with respect to SOTA method. Compared to the joint auxiliary network and data augmentation methods, ECASKD improves the accuracy by 2.63%, 1.56%, and 3.66% on CUB, CAR, and AIR, respectively, with respect to SOTA method. In sum, ECASKD demonstrates a better fine-grained image classification performance compared with the joint auxiliary network and data augmentation methods even without data augmentation. Compared with the attention-based self-knowledge distillation method, ECASKD improves about 23.28%, 8.17%, and 14.02% on CUB, CAR and AIR, respectively, with respect to SOTA method. In sum, the ECASKD method outperforms all three types of self-knowledge distillation methods and demonstrates a better fine-grained image classification performance. We also compare this method with four mainstream modeling methods in terms of the number of parameters, floating-point operations (FLOPs), and TOP-1 classification accuracy. Compared with ResNet18, the ECA-ResNet18 backbone used in the proposed method significantly improves the classification accuracy with an increase of only 0.4 M parameters and 0.2 G FLOPs. Compared with the larger-scale ResNet50, the performance of the proposed method is less than one-half of that of ResNet50 in terms of number of parameters and computation, but its classification accuracy on the CAR dataset differs from ResNet50 by only 0.6%. Compared with the larger ViT-Base and Swin-Transformer-B, the proposed method is about one-eighth of both in terms of number of parameters and computation, and its classification accuracies on the CAR and AIR datasets are 3.7% and 5.3% lower than the optimal Swin-Transformer-B. These results demonstrate that the classification accuracy of the proposed method is significantly improved with only a small increase in model complexity.ConclusionThe proposed self-knowledge distillation fine-grained image classification method achieves good performance results with 11.9 M parameters and 2.0 G FLOPs, and its lightweight network model is suitable for edge computing applications for embedded devices.
关键词:fine-grained image classification;channel attention;knowledge distillation(KD);self-knowledge distillation(SKD);feature fusion;convolutional neural network(CNN);lightweight model
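The ECA module embedded into each residual block above follows the widely used efficient-channel-attention recipe: global average pooling, a small 1-D convolution across channels, and a sigmoid gate. The sketch below fixes the kernel size at 3 for simplicity instead of deriving it adaptively from the channel count.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: a 1-D convolution over channel descriptors."""

    def __init__(self, k_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                      # x: (N, C, H, W)
        y = x.mean(dim=(2, 3))                 # global average pooling -> (N, C)
        y = self.conv(y.unsqueeze(1))          # (N, 1, C): local cross-channel interaction
        w = self.sigmoid(y).squeeze(1)         # (N, C) channel weights
        return x * w[:, :, None, None]         # reweight the feature map
```

In an ECA-Residual block of the kind described above, such a module would typically be applied after each batch normalization layer before the residual addition.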
摘要:Objective: Knowledge distillation aims to transfer the knowledge of a teacher model with powerful performance and a large number of parameters to a lightweight student model and to improve the latter's performance without affecting the performance of the original model. Previous research on knowledge distillation mostly focuses on distillation from one teacher to one student and neglects the potential for students to learn from multiple teachers simultaneously. Multi-teacher distillation can help the student model synthesize the knowledge of each teacher model, thereby improving its expressive ability. Few studies have examined teacher distillation across these different situations, although learning from multiple teachers at the same time can integrate additional useful knowledge and information and consequently improve student performance. In addition, most of the existing knowledge distillation methods only focus on the global information of the image and ignore the importance of spatial local information. In image classification, local information refers to the features and details of specific regions in the image, including textures, shapes, and boundaries, which play important roles in distinguishing various image categories. The teacher network can distinguish local regions based on these details and make accurate predictions for similar appearances in different categories, but the student network may fail to do so. To address these issues, this article proposes a knowledge distillation method based on global and local dual-teacher collaboration, which integrates global and local information and effectively improves the classification accuracy of the student model. Method: The original input image is initially represented as global and local image views. The original image (global image view) is randomly cropped locally, and the ratio of the cropped area to the original image is specified within 40%~70% to obtain local input information (local image view). Afterward, a teacher (scratch teacher) is randomly initialized and trained synchronously with the student on the global information, and its intermediate global outputs are used to gradually help the student approach the teacher's final prediction along an optimal path. Meanwhile, a pre-trained teacher (expert teacher) is introduced to process local information. The proposed method uses a dual-teacher distillation architecture to jointly train the student network on the premise of integrating global and local features. On the one hand, the scratch teacher works with the student to train and process global information from scratch. With the scratch teacher, the student no longer learns only from the final smooth output of the pre-trained model (expert teacher); instead, the scratch teacher's temporary outputs gradually guide the student model, forcing the latter to approach the final output logits with higher accuracy along the optimal path. During the training process, the student model obtains not only the difference between the target and the scratch output but also the possible path to the final goal provided by a complex model with strong learning ability. On the other hand, the expert teacher processes local information and separates the output local features into source category knowledge and other category knowledge.
In this collaborative teaching, the student model reaches a local optimum, and its performance becomes close to that of the teacher model. Result: The proposed method is compared with other knowledge distillation methods used in the field of image classification. The experimental datasets include CIFAR-100 and Tiny-ImageNet, and image classification accuracy is used as the evaluation index. On the CIFAR-100 dataset, compared with the best-performing feature distillation method, SemCKD, the average distillation accuracy of the proposed method increased by 0.62% under the same teacher and student architectures. In the case of heterogeneous teachers and students, the average accuracy of the proposed method increased by 0.89%. Compared with the state-of-the-art response distillation method NKD, the average classification accuracy of the proposed method increased by 0.63% and 1.00% in the cases of homogeneous and heterogeneous teachers and students, respectively. On the Tiny-ImageNet dataset, the teacher network is ResNet34, and the student network is ResNet18. The final test accuracy of the proposed method reached its optimal level at 68.86%, which was 0.74% higher than that of NKD and other competing models. This method also achieved the highest classification accuracy across different teacher and student architecture combinations. Ablation experiments and visual analysis are also conducted on CIFAR-100 to demonstrate the effectiveness of the proposed method. Conclusion: A dual-teacher collaborative knowledge distillation framework that integrates global and local information is proposed in this paper. This method separates the teacher and student output features into source category knowledge and other category knowledge and transfers them to the student separately. Experimental results show that the proposed method outperforms several state-of-the-art knowledge distillation methods in the field of image classification and can significantly improve the performance of the student model.
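The sketch below illustrates two pieces described above: a random local view covering roughly 40%-70% of the image area, and a combined loss that distills from the scratch teacher on the global view and from the expert teacher on the local view. The temperature and equal loss weights are placeholders, and the source-category/other-category separation is omitted.

```python
import random
import torch
import torch.nn.functional as F

def random_local_view(img, low=0.4, high=0.7):
    """Crop a local view covering 40%-70% of the image area (img: C, H, W)."""
    _, h, w = img.shape
    ratio = random.uniform(low, high)
    ch, cw = int(h * ratio ** 0.5), int(w * ratio ** 0.5)
    top, left = random.randint(0, h - ch), random.randint(0, w - cw)
    return img[:, top:top + ch, left:left + cw]

def dual_teacher_kd_loss(student_global, scratch_teacher_global,
                         student_local, expert_teacher_local, labels, t=4.0):
    """Cross-entropy on labels plus KL distillation from both teachers."""
    ce = F.cross_entropy(student_global, labels)
    kd_scratch = F.kl_div(F.log_softmax(student_global / t, dim=1),
                          F.softmax(scratch_teacher_global / t, dim=1),
                          reduction="batchmean") * t * t
    kd_expert = F.kl_div(F.log_softmax(student_local / t, dim=1),
                         F.softmax(expert_teacher_local / t, dim=1),
                         reduction="batchmean") * t * t
    return ce + kd_scratch + kd_expert
```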
摘要:Objective: Dental computer-aided therapy relies on the use of dental models to aid dentists in their practice. One of the most fundamental tasks in dental computer-aided therapy is the automated segmentation of teeth using point cloud data obtained from intra-oral scanners (IOS). The precise segmentation of each individual tooth in this procedure provides vital information for a variety of subsequent tasks. These segmented dental models facilitate customized treatment planning and modeling, thus providing extensive assistance in carrying out further treatments. However, the automated segmentation of individual teeth from dental models faces three significant challenges. First, the indistinct boundary between teeth and gums poses difficulties in segmentation based solely on geometric features. Second, certain factors, such as occlusion during scanning, can lead to suboptimal results, particularly in posterior dental regions, thereby further complicating the segmentation process. Lastly, teeth often exhibit complex anomalies in patients, including crowding, missing teeth, and misalignment issues, which further complicate the task of accurate segmentation. To address these challenges, two categories of conventional methods have been proposed for segmenting teeth in images obtained from IOS. The first category employs a projection-based approach, wherein a 3D dental scan image is initially projected into a 2D space, segmentation is then performed in the 2D space, and the result is remapped back into the 3D space. The second category adopts a geometry-based approach and typically utilizes geometric attributes, such as surface curvature, geodesic information, harmonic fields, and other geometric properties, to distinguish tooth structures. However, these methods are not fully automated and rely on domain-specific knowledge and experience. Moreover, the predefined low-level attributes used by these methods lack robustness when dealing with the complex appearance of patients' teeth. Considering the impactful application of convolutional neural networks (CNNs) in computer vision and medical image processing, several deep learning methods rooted in CNNs have been introduced. Some of these methods directly extract translation-invariant depth geometric features from 3D point cloud data but suffer from a lack of the receptive field necessary for fine-grained tasks, such as dental model segmentation. Moreover, the network structure exhibits redundancy and neglects the crucial details of dental models. To address these issues, a fully automatic tooth segmentation network called TRNet is proposed in this paper, which can automatically segment teeth on unprocessed intra-oral scanned point cloud models. Method: In the proposed end-to-end 3D point cloud-based multi-scale fusion dental model segmentation method, an encoder with a fine-grained receptive field is employed to address the challenges posed by the small size of each tooth relative to the entire dental model and the lack of distinct features between the teeth and gums. A fine-grained receptive field is therefore essential for extracting features from this model. The network adopts a small radius for querying the neighborhood, thus narrowing the receptive field and enabling the network to focus on detailed features.
Additionally, downsampling can lead to uneven point cloud density, thereby causing a network trained on sparse point clouds to struggle in recognizing fine-grained local structures. Multiscale feature fusion coding is implemented to address these issues. Given that the encoder uses a small query radius to create a fine-grained receptive field, the relative coordinates become relatively small. Consequently, the network needs to learn large weights to operate on these relative coordinates, thereby introducing further challenges in network optimization. TRNet normalizes the relative coordinates in the feature extraction layer to facilitate network optimization and enhance segmentation performance. The network also employs a highly efficient decoder. Previous segmentation methods often utilize the U-Net structure, which incorporates skip connections for multi-level feature aggregation between the input features of the cascaded decoder and the outputs of the corresponding encoder layer. However, this top-down propagation is considered inefficient for feature aggregation. The decoding approach used by TRNet directly combines the features outputted from all cascade encoders, thereby allowing the network to learn the importance of each cascade. The discrepancies in the scales or dimensions of the features represented by fused information in the network may also introduce unwanted bias during the fusion process. To address these issues and ensure that the network focuses on crucial information within the fused features, a soft attention mechanism is incorporated into the fusion process. Specifically, a soft attention operation is performed on the newly combined features after their connection, thereby enabling the network to adaptively balance the discrepancies of different scales or levels in the propagated features. Result: A dataset comprising dental models taken from numerous patients with irregular tooth shapes, such as crowding, misalignment, and underdeveloped teeth, was compiled. To establish the labeled values, an experienced dentist meticulously segmented and annotated these models. The dataset was then randomly divided into two subsets, with 146 models allocated for training and 20 models reserved for testing. Data augmentation techniques, such as random translation and scaling, were employed to enhance the diversity of the training set. In each iteration, intra-oral scan images were shifted by a randomly selected value within the range of [-0.1, 0.1] and scaled by a randomly chosen magnification within the range of [0.8, 1.25], thereby generating new training data. Experimental results from a 5-fold cross-validation reveal that TRNet achieved an overall accuracy (OA) of 97.015 ± 0.096% and a mean intersection over union (mIoU) of 92.691 ± 0.454%, significantly outperforming the existing methods. Conclusion: An end-to-end deep learning network called TRNet is introduced in this paper for the automatic segmentation of teeth in 3D dental images acquired from intra-oral scanners. An encoder with fine-grained receptive fields was also implemented to enhance the local feature extraction capabilities essential for dental model segmentation. Additionally, a decoder based on hierarchical connections was employed to allow the network to decode efficiently by learning the significance of each level. This refinement significantly improves the precision of dental model segmentation.
A soft attention mechanism was also integrated into the feature fusion process to enable the network to focus on key information within dental model features. Experimental results indicate that TRNet shows excellent performance on intra-oral scanned point cloud models and enhances the ability of the network to segment dental models, thereby improving the accuracy of point cloud segmentation results.
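The radius-normalized grouping used in the TRNet encoder, as described above, can be sketched as follows: neighbors gathered within a small query radius have their relative coordinates divided by that radius so the network does not need large weights on tiny offsets. The brute-force neighbor search and the nearest-point fallback are simplifications for illustration.

```python
import torch

def group_and_normalize(xyz, centers, feats, radius=0.05, k=32):
    """Ball-query grouping with radius-normalized relative coordinates.

    xyz     : (N, 3) all points      centers : (M, 3) sampled centroids
    feats   : (N, C) point features  returns : (M, k, 3 + C) grouped features
    """
    dist = torch.cdist(centers, xyz)                        # (M, N) pairwise distances
    in_ball = dist <= radius
    # fall back to the nearest points when a ball holds fewer than k neighbors
    dist = torch.where(in_ball, dist, dist + 1e6)
    idx = dist.topk(k, largest=False).indices               # (M, k) neighbor indices

    rel = xyz[idx] - centers[:, None, :]                    # (M, k, 3) relative coordinates
    rel = rel / radius                                      # normalize by the query radius
    return torch.cat([rel, feats[idx]], dim=-1)
```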
摘要:ObjectiveDiabetic retinopathy (DR) is a leading cause of blindness in humans, and regular screening is helpful for its early detection and containment. While automated and accurate lesion segmentation is crucial for DR grading and diagnosis, this approach encounters many challenges due to the complex structures, inconsistent scales, and blurry edges of different kinds of lesions. However, the manual segmentation of DR lesions is time-consuming and labor-intensive, thus making the large-scale popularization of the approach particularly difficult due to the limited doctor resources and the high cost of manual annotation. Therefore, an automatic DR lesion segmentation method should be developed to reduce clinical workload and increase efficiency. Recently, convolutional neural networks have been widely applied in the fields of medical image segmentation and disease classification. The existing deep-learning-based methods for DR lesion segmentation are classified into image-based and patch-based approaches. Some studies have adopted the attention mechanism to segment lesions using the whole fundus image as input. However, these methods may lose the edge details of lesions, thus introducing challenges in obtaining fine-grained lesion segmentation results. Other studies have cropped the original images to patches and inputted them into the encoder-decoder networks for DR lesion segmentation. However, most of the approaches proposed in the literature utilize fixed weights to fuse coding features at different levels while ignoring the information differences among them, thus hindering the effective integration of multi-level features for accurate lesion segmentation. To address these issues, this paper proposes a multi-attention and cascaded context fusion network (MCFNet) for the simultaneous segmentation of multiple lesions.MethodThe proposed network adopts an encoder-decoder framework, including the VGG16 backbone network, triple attention module (TAM), cascaded context fusion module (CFM), and balanced attention module (BAM). First, directly fusing multi-level features from different stages of the encoder easily results in inconsistent feature scales and information redundancy. Dynamically selecting important information from multi-resolution feature maps not only preserves contextual information in low-resolution feature maps but also effectively reduces background noise interference in high-resolution feature maps. TAM is proposed to extract three types of attention features, i.e., channel attention, spatial attention, and pixel-point attention. Second, the channel attention assigns different weights to different feature channels to enable the selection of specific feature patterns for lesion segmentation. The spatial attention also highlights the location information of lesions in the feature map, thus making the proposed model pay attention to lesion areas. Lastly, the pixel-point attention mechanism extracts small-scale lesion features. TAM ensures feature consistency and selectivity by learning and fusing these attention features. In addition, traditional receptive field ranges can hinder the capture of subtle features due to the small size of lesions. To address this problem, CFM is proposed to capture global context information at different levels and to perform summation with local context information from TAM. The module is designed to expand the scope of multi-scale receptive fields and consequently improve the accuracy and robustness of small-scale lesion segmentation. 
Finally, BAM is used to address rough and inconspicuous lesion edges. This module calculates foreground, background, and boundary attention maps to reduce the adverse interference of background noise and to sharpen the lesion contours.

Result: The lesion segmentation performance of the proposed method was compared with that of existing methods on the IDRiD, DDR, and E-Ophtha datasets. Experimental results show that despite the variations in the number and appearance of retinal images from different countries and ethnicities, the proposed model outperforms the state of the art in terms of accuracy and robustness. Specifically, on the IDRiD dataset, MCFNet achieves AUC values of 0.917 1, 0.719 7, 0.655 7, and 0.708 7 for the segmentation of hard exudates (EX), hemorrhages (HE), microaneurysms (MA), and soft exudates (SE), respectively. The mAUC, mIOU, and mDice over the four kinds of lesions on the IDRiD dataset reach 0.750 3, 0.638 7, and 0.700 3, respectively. On the DDR dataset, the proposed model achieves mAUC, mIOU, and mDice values of 0.679 0, 0.434 7, and 0.598 9 for these lesions; compared with PSPNet, these values are 52.7%, 18.63%, and 33.06% higher, respectively. On the E-Ophtha dataset, the proposed MCFNet achieves mAUC, mIOU, and mDice values of 0.660 1, 0.449 5, and 0.628 5, respectively, which are 15.11%, 4.06%, and 20.68% higher than those of MLSF-Net. Qualitative comparisons further show that the segmentation results of the proposed model are closer to the ground truth and that the obtained edges are finer and more accurate. To verify the effectiveness of the proposed TAM, CFM, and BAM, comprehensive ablation experiments were conducted on the IDRiD, DDR, and E-Ophtha datasets. With only the baseline, the model obtains mAUC, mIOU, and mDice values of 0.597 5, 0.451 2, and 0.584 8 on the IDRiD dataset, whereas fusing VGG16 with TAM, CFM, and BAM achieves the best segmentation results for all four types of multi-scale lesions, suggesting that the proposed modules contribute to improving the multiple-lesion segmentation performance to various degrees.

Conclusion: This paper proposes a multi-attention and cascaded context fusion network for the multiple-lesion segmentation of diabetic retinopathy images. The proposed MCFNet introduces TAM to learn and fuse channel attention, spatial attention, and pixel-point attention features to ensure feature consistency and selectivity. CFM utilizes adaptive average pooling and non-local operations to capture local and global contextual features for concatenation fusion and to expand the receptive field for fundus lesions. BAM calculates attention maps for the foreground, background, and lesion contours and uses squeeze-and-excitation modules to rebalance the attention features of these regions, preserve edge details, and reduce interference from background noise. Experimental results on the IDRiD, DDR, and E-Ophtha datasets demonstrate the superiority of the proposed method over state-of-the-art approaches. The method also effectively overcomes the interference of background and other lesion noise, thus achieving accurate segmentation of different types of multi-scale lesions.
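As a rough illustration of the balanced-attention idea described in this abstract (foreground, background, and boundary attention maps rebalanced by squeeze-and-excitation gates), the sketch below shows one plausible realization; all layer choices and shapes are assumptions, not the paper's BAM code.

```python
# Hypothetical balanced-attention sketch: predict foreground/background/boundary
# maps, gate the features per region with squeeze-and-excitation, then fuse.
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gate(x) * x

class BalancedAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Three 1x1 heads predict foreground, background, and boundary maps.
        self.region_heads = nn.Conv2d(channels, 3, kernel_size=1)
        self.se_blocks = nn.ModuleList([SqueezeExcite(channels) for _ in range(3)])
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        region_maps = torch.softmax(self.region_heads(x), dim=1)           # (B, 3, H, W)
        parts = [self.se_blocks[i](region_maps[:, i:i + 1] * x) for i in range(3)]
        return self.fuse(torch.cat(parts, dim=1)) + x                      # residual fusion
```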
摘要:Objective: Cholangiocarcinoma is a cancer with a high fatality rate, and early detection and treatment can significantly reduce its mortality. The digital diagnosis of pathological sections can effectively improve the accuracy and efficiency of cancer diagnosis. Microscopic hyperspectral images of pathological sections contain richer spectral information than color images. Owing to the specific spectral responses of biological tissues, pathological tissues have spectral characteristics that differ from those of normal tissues, and this rich spectral information offers great potential for distinguishing cancer cells from healthy cells. The performance of most pathological hyperspectral image classification algorithms depends heavily on high-quality labeled datasets, but pathological hyperspectral images must be manually labeled by experienced pathologists, which is time-consuming and laborious. Self-supervised feature extraction algorithms first learn features from unlabeled image data by designing pretext tasks and then transfer the learned representations to downstream tasks. After fine-tuning the downstream task network with a limited number of labeled samples, these algorithms can achieve performance comparable to supervised learning and alleviate the data annotation problem. However, traditional contrastive self-supervised learning is limited in extracting high-level semantic information, and a data augmentation method specific to pathological hyperspectral images is not yet available. Therefore, this paper proposes a self-supervised method to extract sequential spectral data and semantic information from hyperspectral images of cholangiocarcinoma and to improve the feature extraction capability and classification accuracy of self-supervised learning.

Method: Hyperspectral images differ from natural images in that common data augmentations, such as color transformations, can change the spectral information, so it is meaningful to use an encoder structure as the augmentation method for hyperspectral images. However, the encoders used in existing methods are based on convolutional neural networks (CNNs), whose feature maps correspond to local receptive fields and ignore global information along the spectral dimension. Given the limited ability of CNNs to characterize spectral sequence data, this paper first designs a Transformer encoder structure for augmentation, which retains the sequence details of the original image. Borrowing from natural language processing, the Transformer architecture, with its capability for modeling sequential information, treats the spectral curve of each pixel of the hyperspectral image as a spectral sequence and uses position embeddings and attention modules to attend to the differences among spectral sequences and to learn spectral sequence information efficiently. Second, after an image is augmented with the Transformer encoder structure to obtain positive samples, a convolutional autoencoder serves as another augmentation to obtain the negative samples required for contrastive learning. To address the problem that traditional contrastive learning relies on low-level image differences and is thus limited in extracting high-level semantic information, this paper applies prototypical contrastive learning to extract features from pathological hyperspectral images.
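To make the spectral-sequence modeling concrete, the block below is a minimal sketch of how each pixel's spectral curve could be tokenized band by band and fed to a standard Transformer encoder with learned position embeddings; the band count, embedding width, and pooling are assumptions, not the paper's exact encoder.

```python
# Sketch: treat each pixel's spectral curve as a token sequence for a Transformer
# encoder (illustrative; hyperparameters are assumed, not taken from the paper).
import torch
import torch.nn as nn

class SpectralTransformerEncoder(nn.Module):
    def __init__(self, num_bands: int = 60, d_model: int = 64, nhead: int = 4, depth: int = 2):
        super().__init__()
        self.band_embed = nn.Linear(1, d_model)                            # one token per spectral band
        self.pos_embed = nn.Parameter(torch.zeros(1, num_bands, d_model))  # learned position embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=2 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, spectra: torch.Tensor) -> torch.Tensor:
        # spectra: (num_pixels, num_bands) reflectance values of sampled pixels
        tokens = self.band_embed(spectra.unsqueeze(-1)) + self.pos_embed
        encoded = self.encoder(tokens)       # self-attention across spectral bands
        return encoded.mean(dim=1)           # pooled spectral representation per pixel

# Usage with dummy spectra: SpectralTransformerEncoder()(torch.randn(8, 60))
```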
The positive and negative samples are then trained through the clustering and instance discrimination tasks of a prototypical contrastive learning network to learn high-level semantic information in the images. The above feature extraction process uses only unlabeled data. Finally, the classification results are obtained by fine-tuning the downstream classification network with a small number of labeled samples.

Result: Experiments were conducted on eight scenes from the multidimensional cholangiocarcinoma pathology hyperspectral dataset, selected from eight patients. To ensure representativeness, the scenes differ in cancer cell morphology, cancer cell proportion, and spectral response curves, and the cancer regions in scenes 2, 3, and 8 account for only about 1/8 of the whole image. In each scene, 5% of the data was labeled for training and 95% was used for testing. To verify the effectiveness of the proposed self-supervised method on pathological hyperspectral datasets, it was compared with 12 widely used algorithms and networks, including 7 supervised and 5 unsupervised feature extraction methods. Experimental results show that the features extracted by the proposed method achieve optimal results in the downstream classification task, with an average overall accuracy of 96.63%, an average accuracy of 95.37%, and an average Kappa coefficient of 0.91. Ablation experiments further show that, compared with the convolutional module, the Transformer module with its self-attention and multi-head attention mechanisms pays more attention to sequence details when extracting features, effectively retaining the original image information and achieving high classification accuracy. The prototypical contrastive learning module adds a clustering process on top of contrastive learning and also achieves high classification accuracy, showing that it can effectively extract high-level semantic information from microscopic hyperspectral images of cholangiocarcinoma. The dimensionality reduction experiment further shows that the semantic features extracted by the proposed method are linearly separable.

Conclusion: The proposed method can extract effective features from unlabeled hyperspectral images of cholangiocarcinoma, and these features can be applied to classification tasks to achieve high classification accuracy and alleviate the problem of pathological hyperspectral image data labeling. The method has research value and practical significance for the medical diagnosis of cholangiocarcinoma.
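Since the abstract only names the clustering and instance discrimination tasks, the following is a hedged, ProtoNCE-style sketch of how the two terms might be combined into a single training loss; the function name, temperatures, and the assumption that cluster prototypes come from an external k-means step are all illustrative, not the paper's exact objective.

```python
# Illustrative prototypical contrastive loss: instance discrimination against a
# negative set plus a prototype (cluster) classification term.
import torch
import torch.nn.functional as F

def proto_contrastive_loss(anchor, positive, negatives, prototypes, proto_ids,
                           temp_inst: float = 0.1, temp_proto: float = 0.3):
    """anchor, positive: (B, D) embeddings of two augmented views of the same pixels;
    negatives: (K, D) embeddings used as negatives; prototypes: (C, D) cluster centers
    (e.g., from k-means on the embeddings); proto_ids: (B,) cluster index of each anchor."""
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    negatives = F.normalize(negatives, dim=1)
    prototypes = F.normalize(prototypes, dim=1)

    # Instance discrimination: the positive pair should score higher than all negatives.
    l_pos = (anchor * positive).sum(dim=1, keepdim=True)       # (B, 1)
    l_neg = anchor @ negatives.t()                             # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temp_inst
    targets = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    instance_loss = F.cross_entropy(logits, targets)

    # Prototype term: pull each embedding toward its assigned cluster center.
    proto_logits = anchor @ prototypes.t() / temp_proto        # (B, C)
    prototype_loss = F.cross_entropy(proto_logits, proto_ids)
    return instance_loss + prototype_loss
```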