Latest Issue

    Vol. 30, No. 3, 2025

      Review

    • A survey of research progress on anomaly detection in surveillance videos, offering new ideas for social governance.
      Wang Yang, Zhou Jiaogen, Yan Jun, Guan Jihong
      Vol. 30, Issue 3, Pages: 615-640(2025) DOI: 10.11834/jig.240329
      Survey of anomaly detection methods in surveillance videos based on deep learning
      Abstract: Video anomaly detection plays a crucial role in social governance by utilizing surveillance footage, making it an important and challenging topic within the field of computer vision. This paper presents a detailed classification and review of current key video anomaly detection methods from a deep learning perspective, analyzing existing technical challenges and future development trends. First, the paper provides a comprehensive introduction to the definition of video anomalies, including the delineation of anomalies and video anomalies, the five types of video anomalies (intuitive anomalies, action change anomalies, trajectory change anomalies, group change anomalies, and spatiotemporal anomalies), and the three characteristics of anomaly detection (abstraction, uncertainty, and sparsity). The paper then reviews the development trends in video anomaly detection research from 2008 to the present based on the digital bibliography & library project (DBLP) database and provides a detailed analysis of the progress of fully supervised, weakly supervised, and unsupervised deep learning methods in the field of video anomaly detection. The core innovations, structures, and advantages and disadvantages of each method are discussed, with particular focus on the latest research advances involving large models. For instance, some studies address the challenge of applying virtual anomaly video datasets to real-world scenarios by designing anomaly prompts that guide mapping networks to generate unseen anomalies in real-world settings. Other works design dual-branch model structures based on multimodal large model frameworks: one branch uses the contrastive language-image pre-training (CLIP) visual encoding module for coarse-grained binary classification, while the other branch aligns textual features of anomaly category labels with visual encoding features for fine-grained anomaly classification, surpassing the current state-of-the-art performance in video anomaly detection. Furthermore, research has explored the potential of using GPT-4V, a powerful large visual language model, to tackle general anomaly detection tasks, examining its applications in multimodal and multidomain anomaly detection covering image, video, point cloud, and time-series data, and spanning industrial, medical, logical, video, and 3D anomaly detection and localization. The introduction of large models presents new opportunities and challenges for video anomaly detection. Moreover, the paper introduces 10 commonly used and recent datasets, providing a comparative analysis of their characteristics, presenting detailed content through figures, and giving the corresponding download links. These datasets play a crucial role in video anomaly detection research, and this paper offers a comprehensive evaluation of them. The paper also introduces anomaly determination standards (frame-based, pixel-based, and trajectory-based) and three performance evaluation standards (area under the receiver operating characteristic curve (AUC), equal error rate (EER), and average precision (AP)), and conducts a comparative analysis of the performance of various algorithms. The strengths and weaknesses of current video anomaly detection algorithms are summarized, and suggestions for improvement are proposed. Based on this analysis, datasets may have become a bottleneck in the development of current methods.
In complex real-world scenarios, research methods developed solely on simple scenes may not effectively address real-world anomaly issues. Future datasets will aim to better reflect real-world anomalies, for example by collecting data from the remote sensing field, improving the quality of existing image and video data through models, and collecting multi-camera, multidimensional annotated data, so as to detect more diverse and challenging anomaly events and effectively promote research development. Additionally, in terms of evaluation standards, common evaluation methods primarily rely on calculating the true and false positive rates and computing the area under the receiver operating characteristic curve. However, in practical applications, some methods may achieve a high AUC yet exhibit a high false alarm rate, because the choice of anomaly determination method directly influences the true and false positive rates. Therefore, this paper proposes the need to design an evaluation system that simultaneously considers AUC performance and false alarm rates to comprehensively evaluate methods (an illustrative frame-level computation of both quantities is sketched after the keywords below). Finally, the outlook of the paper emphasizes the new opportunities presented by large models in video anomaly detection. The emergence of large models in recent years has substantially improved the performance of deep learning-based methods on commonly used video anomaly detection datasets, and the field has accumulated a solid academic research foundation. Therefore, future research should not only focus on improving anomaly detection performance but also consider applying this field to practical problems to address existing challenges. Future research should aim to design more fine-grained and general models, leveraging the rich prior knowledge of large models to gradually develop video anomaly detection models that can distinguish specific types of anomalies. With the powerful multimodal information understanding capabilities of large models, video anomaly detection models will evolve toward a more general direction, ultimately blurring the boundaries between supervised, weakly supervised, and unsupervised learning methods. Overall, this paper substantially enhances readers' understanding of the field of video anomaly detection and provides valuable references and guidance for future research directions. Through a systematic review and analysis of existing research, this paper offers crucial insights for the further development of the video anomaly detection field.
      Keywords: video anomaly detection; deep learning; dataset; large model; supervised learning; weakly supervised learning; unsupervised learning; multimodal
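      An illustrative sketch (not taken from the survey) of the frame-level evaluation discussed above: computing AUC together with the false alarm rate at a fixed threshold, assuming synthetic per-frame anomaly scores and binary ground-truth labels.

# Frame-level evaluation sketch: AUC plus false alarm rate at a fixed threshold.
# Assumes per-frame anomaly scores in [0, 1] and binary labels (1 = anomalous frame).
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_frames(scores, labels, threshold=0.5):
    """Return (AUC, false alarm rate) for frame-level anomaly scores."""
    auc = roc_auc_score(labels, scores)            # area under the ROC curve
    pred = scores >= threshold                     # thresholded decisions
    normal = labels == 0
    far = np.logical_and(pred, normal).sum() / max(int(normal.sum()), 1)  # FP / (FP + TN)
    return auc, far

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = (rng.random(1000) < 0.1).astype(int)  # sparse anomalies (~10% of frames)
    scores = np.clip(0.3 * labels + 0.7 * rng.random(1000), 0.0, 1.0)
    auc, far = evaluate_frames(scores, labels)
    print(f"AUC = {auc:.3f}, false alarm rate = {far:.3f}")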
    • Viewpoint planning technology for object detection offers new solutions for the computer vision field and advances the development of intelligent living.
      Wang Jianyu, Zhu Feng, Hao Yingming, Wang Qun, Zhao Pengfei, Sun Haibo
      Vol. 30, Issue 3, Pages: 641-659(2025) DOI: 10.11834/jig.240319
      View planning methods of object detection: a survey
      Abstract: Object detection is one of the fundamental research directions in the field of computer vision and is also the cornerstone of advanced vision research. When objects are densely arranged or located under poor lighting conditions, crucial details can be lost during image acquisition. When using images with missing details as input, the detection results from conventional target detection algorithms often fail to meet task requirements. To address these challenges, intelligent perceptual methods for viewpoint planning in target detection have emerged. These methods can autonomously analyze the factors affecting detection tasks under current conditions, adjust the camera's pose parameters to mitigate these effects, and achieve accurate target detection. This paper reviews and analyzes relevant studies since 2007 and summarizes domestic and foreign research methods to reflect the research status and the latest developments of viewpoint planning methods for object detection. For simplicity, these methods are referred to as active object detection (AOD) in this article. According to the different use scenarios, this paper divides active object detection methods into three categories: AOD in two-dimensional scenes, AOD in three-dimensional scenes, and AOD combining the two. The third category is uncommon; thus, this paper mainly introduces the first two. Specifically, in two-dimensional scenes, AOD methods are divided into pixel-based methods and those that simulate camera parameters, depending on whether a single pixel or the overall image is being planned. The most important parts of the pixel-based approach are the selection of the target pixel point and the strategy for planning the next pixel. Typically, integral features, scale features, or key points, which are the parts of the target that differ most from the background, are used by researchers to locate the possible locations of target pixels. After positioning the target pixel, the moving position of the next pixel is set in accordance with the category of the region to ensure the continuity of consecutive frames and avoid task failure caused by planning errors. For AOD methods that simulate camera parameters, different influencing factors cause various difficulties in target detection. Therefore, researchers have designed different planning scenarios by analyzing the types of influencing factors, and some excellent results have emerged in recent years. Over time, the popularity of mobile robots has introduced AOD into a new development environment: 3D scenes. In three-dimensional environments, the AOD method enables the intelligent agent to actively select the next viewpoint pose in space, thereby mitigating the influence of interference factors on the target detection process. We classify 3D scenes based on the degree of known spatial location information into two categories: 3D scenes with known spatial relationships and 3D scenes with unknown spatial relationships. In the first type of scenario, the placement of the target object and surrounding objects, the display of spatial category markers, and the range of viewpoint planning are all known, and the AOD method can perform viewpoint planning based on the known information. In this type of approach, researchers focus more on the representation of relationships and the selection of the next viewpoint in a fixed search space.
The second type of space has no information to assist, and the agent can only rely on the observation results to select the next viewpoint. In real life, situations where relationships are unknown are highly common; therefore, the design of AOD methods for this situation is currently a popular research direction. Because the planning strategy in such scenarios is closely tied to the observed results, researchers have made considerable efforts to provide detailed descriptions of the observations. In AOD, observation information is usually referred to as state expression, and a more detailed expression leads to improved strategy generation. In addition, researchers have made numerous efforts on the evaluation function of the next view, which is used to evaluate candidate viewpoints and modify the planning strategy. AOD has two main objectives in unknown environments: path optimization and detection effect optimization. The evaluation function is generally divided into single-element and multi-element evaluations based on the types of evaluation factors (a toy multi-element scoring example is sketched after the keywords below). Although multi-element evaluation is more accurate, the elements selected must remain consistent across different problems, which is difficult to achieve. Identifying the same components across various scenarios to design a universal evaluation function remains a potential breakthrough area for researchers in the future. In addition to analyzing the methods mentioned above, this paper also provides a brief introduction to the datasets that AOD methods can use in different types of scenarios. Viewpoint planning in two-dimensional scenes uses scenes consistent with those of conventional object detection methods; therefore, their datasets overlap considerably with large-scale public datasets such as COCO and Pascal VOC. Meanwhile, the evaluation indicators of the two are also the same, so performance comparisons can be conducted directly. Considering motion factors, directly comparing detection results on 3D datasets such as AVD and T-LESS cannot determine the accuracy of the movement path. Therefore, researchers have designed the task success rate (SR) and average travel distance as the leading indicators for measuring the effectiveness of AOD algorithms. Notably, although many excellent results have been achieved in detection-oriented viewpoint planning, some aspects of scene design and research methodology can still be improved. First, real physical elements can be added to the scene design to transform the planning problem into an optimization problem under certain constraints. Second, the methods suitable for two- and three-dimensional scenes can be closely combined, further realizing accurate detection by changing the sensor parameters in inaccessible areas. Finally, detection-oriented viewpoint planning methods typically output discrete actions and are also tightly bound to the task. Therefore, viewpoint planning in continuous environments or establishing a generic framework for task-independent viewpoint planning can also be considered future directions.
      Keywords: object detection; active vision; parameter adjustment; view planning; intelligent perception
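      An illustrative sketch (not taken from the paper) of a toy multi-element next-view evaluation that trades an expected detection gain against travel cost. The term names, weights, and scoring form are hypothetical and do not correspond to any specific AOD method.

# Toy multi-element evaluation of candidate viewpoints: weighted sum of an
# expected detection-confidence gain and a (negative) travel cost.
from dataclasses import dataclass
import math

@dataclass
class CandidateView:
    x: float
    y: float
    expected_gain: float            # hypothetical predicted increase in detector confidence

def score_view(cand, current_xy, w_gain=1.0, w_cost=0.2):
    """Higher is better: detection gain minus weighted travel distance."""
    travel = math.hypot(cand.x - current_xy[0], cand.y - current_xy[1])
    return w_gain * cand.expected_gain - w_cost * travel

def next_best_view(candidates, current_xy):
    return max(candidates, key=lambda c: score_view(c, current_xy))

if __name__ == "__main__":
    views = [CandidateView(1.0, 0.0, 0.30), CandidateView(4.0, 3.0, 0.55), CandidateView(0.5, 0.5, 0.10)]
    print(next_best_view(views, (0.0, 0.0)))    # prints the selected CandidateView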

      Object Detection and Industrial Applications

    • In the field of intelligent manufacturing, experts propose an unsupervised industrial defect detection model that integrates knowledge distillation with a memory mechanism, effectively improving detection accuracy and efficiency.
      Liu Bing, Shi Weifeng, Liu Mingming, Zhou Yong, Liu Peng
      Vol. 30, Issue 3, Pages: 660-671(2025) DOI: 10.11834/jig.240202
      Unsupervised industrial defect detection by integrating knowledge distillation and memory mechanism
      Abstract: Objective: From airplane wings to chip dies, industrial products are ubiquitous in modern society. Industrial defect detection, which aims to identify appearance defects in various industrial products, is a crucial technology for ensuring product quality and maintaining stable production. Previous defect detection methods rely on manual screening, which is costly, inefficient, and often inadequate for large-scale quality inspection needs. In recent years, the continuous emergence of new technologies in industrial imaging, computer vision, and deep learning has notably advanced vision-based industrial defect detection, making it an effective solution for inspecting product appearance quality. However, many types of industrial defects occur in real scenes, and the lack of sufficient defect samples poses challenges for existing unsupervised industrial defect detection methods. These methods often struggle to effectively detect local normal logic defects, such as when a normal target appears in the wrong position or is missing altogether. This difficulty arises from the lack of prior knowledge regarding normal samples during the testing phase, which can lead to defective parts being incorrectly identified as normal. Additionally, deep neural networks possess strong generalization capability, but existing methods often misidentify interference factors in the image background as defects, leading to over-detection. To address the challenges of logic defect detection failures and over-detection in unsupervised industrial defect detection, a new unsupervised industrial defect detection model is proposed. Method: First, a saliency detection network and Perlin noise are used to synthesize defect images, enhancing the distribution consistency between synthesized and real defect images while alleviating the over-detection problem of traditional models. Second, the proposed model comprises a teacher-student branch and a memory branch. The teacher-student branch trains the student network through knowledge distillation and synthesized defect images, allowing it to extract normal image features consistent with those of the teacher network while also repairing defective areas, effectively alleviating the overgeneralization issue of the student network (a minimal sketch of this teacher-student anomaly scoring idea follows the keywords below). The memory branch can effectively learn the prototype features that represent normal samples by introducing the average memory module, thereby enhancing the capability of the model to detect logical defects. The two branches adaptively fuse multiscale defect features, enabling accurate detection of various defects through joint discrimination. Result: Experiments on MVTec AD, a benchmark dataset for industrial defect detection, show that the proposed method achieves excellent detection performance across all types of defect images. For texture defect images, the average image-level area under the receiver operating characteristic curve (AUROC) metric improved from 99.3% to 99.8% compared to the baseline model DeSTSeg, while the average pixel-level AUROC metric increased from 98.1% to 98.7%. For object class defect images, the average image-level AUROC metric rose from 97.5% to 99.1%, and the average pixel-level AUROC metric increased from 97.9% to 99.1%, relative to DeSTSeg. Notably, for transistor logic defect detection, the proposed method showed an improvement of 9.1%.
Across the entire MVTec AD dataset, compared to the baseline model, the average image-level AUROC metric increased from 98.1% to 99.3%, and the average pixel-level AUROC metric improved from 97.9% to 98.9%. Additionally, the proposed approach achieved improvements of 0.9% and 2.5% in the more challenging pixel-level PRO (per-region-overlap) and average precision (AP) metrics, respectively. In addressing logic defects on the more challenging Breakfast box dataset, the proposed method achieved an 11.5% improvement in the image-level AUROC metric compared to the baseline model, along with a 4.0% improvement in the pixel-level AUROC metric. In the ablation experiments, each module of the proposed method is validated. Introducing saliency detection to constrain synthetic defects to the foreground markedly reduces the over-detection caused by background interference, improving classification performance by 1% compared to the baseline model. With the addition of the memory branch, the model can effectively detect logic defects and substantially enhance segmentation performance. However, direct average fusion compromises the respective advantages of the two branches, leading to poor defect detection performance. The normalization module instead combines the advantages of the two segmentation networks effectively, resulting in improvements of 0.7% and 0.5% in classification and segmentation performance, respectively, compared to the direct averaging approach. Conclusion: The proposed method is not limited by traditional defect synthesis techniques and can effectively alleviate the over-detection problems caused by existing synthesis methods. The introduction of the average memory module not only reduces memory costs but also saves time in searching the memory bank without requiring a complicated search algorithm. In this paper, the proposed defect synthesis method is organically combined with the memory mechanism, enabling accurate detection of various types of industrial defects.
      Keywords: defect detection; knowledge distillation; memory mechanism; defect synthesis; saliency detection
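      An illustrative sketch (not the paper's model) of the teacher-student scoring idea described above: the anomaly map is the per-location discrepancy between frozen teacher features and student features. The tiny convolutional backbones are placeholders.

# Teacher-student feature discrepancy as an anomaly map (toy backbones).
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_backbone():
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())

teacher, student = make_backbone(), make_backbone()
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)                        # the teacher is frozen; only the student is trained

def anomaly_map(img):
    """img: (B, 3, H, W) -> anomaly map (B, 1, H, W); larger values mean more anomalous."""
    with torch.no_grad():
        ft = teacher(img)
    fs = student(img)
    dist = 1.0 - F.cosine_similarity(ft, fs, dim=1, eps=1e-6)   # (B, h, w), in [0, 2]
    return F.interpolate(dist.unsqueeze(1), size=img.shape[-2:],
                         mode="bilinear", align_corners=False)

print(anomaly_map(torch.randn(2, 3, 64, 64)).shape)             # torch.Size([2, 1, 64, 64])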
    • In the field of defect detection for electrical power equipment inspection images, experts propose a zero-shot defect detection model based on visual language models, effectively improving defect detection accuracy and offering a new approach to safeguarding power grid operation.
      Wu Hua, Jia Donghao, Zhang Tingting, Bai Xiaojing, Sun Li, Pu Mengyang
      Vol. 30, Issue 3, Pages: 672-682(2025) DOI: 10.11834/jig.240246
      Multimodal zero-shot anomaly detection using dual-experts for electrical power equipment inspection images
      Abstract: Objective: Anomaly detection in electrical power equipment inspection images plays a crucial role in enhancing the safety of power transmission and the reliability of grid operations. Traditional anomaly detection methods primarily rely on supervised learning and depend heavily on high-quality datasets. However, in inspection images, power equipment is typically in normal working condition, with very few instances of abnormal devices, resulting in a severe imbalance between normal and abnormal samples. Moreover, constructing datasets of electrical power equipment inspection images involves complex steps, such as image acquisition, image screening, and pixel-level segmentation mask annotation, which incur substantial costs. These factors make it challenging for traditional supervised learning methods to adapt to anomaly detection in power equipment inspection images. Additionally, the working environments of power equipment vary substantially, leading to inspection images with diverse and changing backgrounds, such as forests, power towers, snow-capped mountains, and grasslands. Furthermore, inspection images are typically captured by unmanned aerial vehicles (UAVs), which makes it challenging to control factors such as weather, location, and time during the capture process. This leads to notable differences in illumination and viewing angles for the same type of power equipment, which can seriously hinder the model's capability to accurately identify defect areas. Method: Multimodal large models based on the Transformer framework are pre-trained on massive datasets and possess powerful zero-shot generalization capabilities. Visual language models (VLMs), a type of multimodal large model, can interpret image content based on textual prompts. A zero-shot anomaly detection model for electrical power equipment inspection images that leverages VLMs combined with textual prompts is proposed to address the challenges associated with constructing inspection image datasets for training. The model uses textual descriptions of normal and abnormal conditions as prompts, extracting textual features by processing these prompts through the text encoder of the VLM. Simultaneously, the images to be inspected acquire multiscale visual information through the visual encoder of the VLM, generating multiple visual features from various intermediate layers. The visual and textual features are further processed by multiple dual-expert modules. Two experts independently process the visual features and combine them with the textual features to derive two decisions, which are then integrated to obtain a joint decision (a toy sketch of text-prompt patch scoring follows the keywords below). This dual-expert design effectively mitigates the effects of varying backgrounds, illumination conditions, and viewing angles in inspection images, allowing the model to focus on defect areas. Multiple joint decisions are fused to generate the final anomaly detection results, incorporating additional contextual and local detail information from the images. The dual-expert modules are pre-trained on public industrial anomaly detection datasets, with the VLM's visual and text encoders kept frozen. Power inspection image datasets with pixel-level segmentation mask annotations are currently lacking.
A corresponding anomaly detection test dataset featuring diverse backgrounds, lighting conditions, and viewpoints is therefore constructed to comprehensively evaluate the model. Result: The experiments compared the proposed method with several benchmarks, including SAA+ (segment any anomaly+), AnomalyGPT, WinCLIP (window-based CLIP), PaDiM (patch distribution modeling), and PatchCore, on the constructed electrical power equipment inspection image dataset. For pixel-level anomaly segmentation, the average AUROC (area under the receiver operating characteristic curve) improved by 18.1%, and the average F1-max (F1 score at the optimal threshold) improved by 26.1%. For image-level anomaly classification, the average AUROC improved by 20.2%, while the average AP (average precision) improved by 10.0%. Specifically, the proposed model demonstrated the best pixel-level anomaly segmentation results for various types of electrical power equipment. Across the insulator categories, the model achieved at least a 15% improvement in F1-max for pixel-level anomaly segmentation, and it also performed excellently on other types of electrical power equipment. In terms of the AUROC metric, the method likewise outperformed most others on various power equipment. Notably, the image-level anomaly classification AUROC of all methods was poor on line clamps. This result is due to the backgrounds of line clamp images, which often contain objects such as pylons and wire rods with textures or colors similar to those of the line clamps, complicating the model's interpretation of the semantic content of the entire image. However, the proposed model effectively uses multilayer features and dual-expert modules to reduce interference from background objects that resemble the foreground, allowing it to achieve relatively good performance on line clamp images. An ablation study is also conducted. With a single expert, the model exhibited a high number of false positives. In contrast, the dual-expert module enables reasonable cooperation between the experts, allowing the model to concentrate on defects while minimizing attention to irrelevant areas, leading to a remarkable improvement in anomaly detection accuracy. Conclusion: The model uses normal and abnormal text prompts to realize zero-shot anomaly detection in electrical power equipment inspection images, thus avoiding the imbalance between normal and abnormal samples in electrical inspection image datasets and their high construction costs. The model incorporates a wide range of contextual and local detail information by using multilayer features from the image encoder. In the dual-expert module, the joint decision design between the two experts effectively focuses on defect areas, minimizing interference from background areas. In addition, an electrical power equipment inspection image dataset is constructed, and experiments with various models are conducted on this dataset to meet the evaluation requirements of pixel-level anomaly segmentation and image-level anomaly classification for power equipment anomaly detection models in outdoor work scenarios. The proposed model outperforms other VLM-based zero-shot anomaly detection methods in anomaly segmentation and anomaly classification. Furthermore, the ablation study demonstrated the excellent performance of the dual-expert module in focusing on defect areas and mitigating background interference.
      Keywords: zero-shot anomaly detection; dual-experts; visual language models; multimodal; electrical power equipment inspection images
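      An illustrative sketch (not the paper's model) of scoring image patches against "normal"/"abnormal" text prompts, where a softmax over the two similarities gives a per-patch anomaly probability. Random tensors stand in for the frozen CLIP-style encoders.

# Text-prompt patch scoring: compare patch embeddings with two text embeddings.
import torch
import torch.nn.functional as F

def patch_anomaly_scores(patch_emb, text_emb, temperature=0.07):
    """patch_emb: (N, D) L2-normalized patch features.
    text_emb: (2, D), rows = [normal prompt, abnormal prompt], L2-normalized.
    Returns (N,) probability that each patch is abnormal."""
    logits = patch_emb @ text_emb.t() / temperature     # (N, 2) cosine similarities as logits
    return logits.softmax(dim=-1)[:, 1]                 # P(abnormal)

if __name__ == "__main__":
    d = 512
    patches = F.normalize(torch.randn(14 * 14, d), dim=-1)   # stand-in visual patch features
    prompts = F.normalize(torch.randn(2, d), dim=-1)         # stand-in text prompt features
    scores = patch_anomaly_scores(patches, prompts)
    print(scores.shape)                                      # torch.Size([196])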
    • In the field of remote sensing object detection, experts propose a remote sensing object detection method based on the YOLOv5s network with local parameter-free attention and a combined loss, effectively improving the detection of small objects in complex scenes.
      Xia Bo, Xue Weitao, Zhou Xinyao, Huang Hong
      Vol. 30, Issue 3, Pages: 683-695(2025) DOI: 10.11834/jig.240316
      Remote sensing object detection via local parameter-free attention and combined loss
      Abstract: Objective: Remote sensing object detection technology has been widely used in various fields, including remote sensing mapping, smart cities, rural revitalization, resource exploration, national security, and military affairs. Particularly with the completion and improvement of notable Chinese projects in high-resolution Earth observation systems, strong data support and development opportunities are available for remote sensing object detection. Traditional detection algorithms have relatively weak generalization capabilities and are easily affected by noise, data distribution, and other factors. Consequently, they struggle with the diverse scales and orientations of remote sensing objects and with complex data distributions. In recent years, the strong representation learning capability and generalization capacity of deep learning have led to its widespread use in the field of object detection. Current detection algorithms based on deep learning can be broadly divided into three categories: region-, pixel-, and query-based methods. Region-based methods typically offer high detection accuracy but require heavy computation and exhibit low efficiency. Pixel-based methods are usually single-stage detectors with low computation and high efficiency, but they often suffer from lower detection accuracy, particularly in small object detection tasks. Query-based methods require a large amount of data and have low efficiency. Moreover, because of the complexity of background information and the many small objects in remote sensing images, the object features extracted by the aforementioned methods become overshadowed by background information as the network deepens. This condition is not conducive to subsequent detection and limits the final detection performance. In response to the issue of insufficient small object detection performance under complex backgrounds, this work proposes a local parameter-free attention YOLO (you only look once) network (LPFA-YOLO) based on YOLOv5s. Method: First, a local parameter-free attention (LPFA) mechanism is proposed. This mechanism can improve the attention for objects within a local region based on the current features, without introducing any trainable parameters, and is used to construct a bottleneck with parameter-free attention (BPFA). The attention mechanism in this block only assigns corresponding weights to the residuals and does not notably affect the main weights. This approach helps avoid the vanishing gradient problem and accelerates convergence by utilizing pre-trained weights. The C3 with attention module (C3A) constructed based on BPFA is then embedded into different stages of the backbone network to cater to objects of different scales. The shallow stage enhances the features of small objects, whereas the deep stage improves the features of medium and large objects, thereby realizing multiscale object feature enhancement and background information suppression, which addresses the issue of redundant background information. On this basis, the Wasserstein distance is utilized to measure the similarity of bounding boxes. A combined measurement method called Wasserstein-complete intersection over union (W-CIoU) and the related loss function are developed (a sketch of the underlying Gaussian Wasserstein box similarity follows the keywords below). This method can alleviate the sensitivity of small objects to position deviation and enable the separation of objects of different scales.
Consequently, this method alleviates the issue of label misallocation caused by the substantial difference between the anchor box and the ground truth, reducing the missed detection rate of small objects. Result: Experiments are conducted on two datasets and compared with seven advanced algorithms, including EfficientNet, YOLOv4, detection Transformer (DETR), Swin Transformer, detecting objects with recursive feature pyramid and switchable atrous convolution (DetectoRS), dynamic anchor boxes for DETR (DAB-DETR), and YOLOv8s. On the remote sensing object detection (RSOD) dataset, the mean average precision (mAP) reaches 98.2%, which is 0.9% and 0.3% higher than the baseline and YOLOv8s, respectively. The average precision for small objects (APS) reaches 42.7%, showing a 2.4% improvement over the baseline. The average precision for medium (APM) and large (APL) objects reaches 67.7% and 82.1%, respectively, both improving on the baseline. On the remote sensing super-resolution object detection (RSSOD) dataset, the mAP achieves 87.4%, which is 2.7% and 2.6% higher than the baseline and YOLOv8s, respectively. The APS increases to 38.0%, a 2.7% rise over the baseline. The APM and APL achieve 52.7% and 46.4%, respectively, both improving on the baseline. At the same time, the model has the fewest parameters and the lowest computational cost among the compared algorithms. Experiments on two external remote sensing images obtained from Google Earth are conducted to assess the generalization of the algorithm; the results indicate that the algorithm detects all objects of interest, demonstrating good generalization capability. Conclusion: In this work, a YOLO network model based on local parameter-free attention is introduced. The experimental results demonstrate that, compared with several existing methods, this model can effectively address the requirements of small object detection in complex scenes.
      Keywords: remote sensing images; object detection; local parameter-free attention (LPFA); Wasserstein distance; combined loss function
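      An illustrative sketch (not taken from the paper) of the Gaussian Wasserstein box similarity that underlies W-CIoU-style measures. Boxes (cx, cy, w, h) are modeled as 2D Gaussians and compared by the squared 2-Wasserstein distance; the normalization constant c is a hypothetical hyperparameter, and the paper's W-CIoU may differ in detail.

# Gaussian Wasserstein distance between axis-aligned boxes and its normalized similarity.
import math

def wasserstein_sq(box_a, box_b):
    """Squared 2-Wasserstein distance between boxes given as (cx, cy, w, h)."""
    (cxa, cya, wa, ha), (cxb, cyb, wb, hb) = box_a, box_b
    center_term = (cxa - cxb) ** 2 + (cya - cyb) ** 2
    shape_term = ((wa - wb) ** 2 + (ha - hb) ** 2) / 4.0
    return center_term + shape_term

def normalized_wasserstein(box_a, box_b, c=12.8):
    """Similarity in (0, 1]; larger means the boxes are more alike."""
    return math.exp(-math.sqrt(wasserstein_sq(box_a, box_b)) / c)

print(normalized_wasserstein((10, 10, 8, 8), (12, 11, 6, 9)))   # ~0.82 for two nearby small boxes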
    • In the field of remote sensing image object detection, researchers propose a new two-stage anchor-free detection model based on Faster R-CNN that combines anchor-free detection with supervised mask attention. The model uses an attention mechanism and a mask supervision method to guide the detector toward object regions and improve the quality of object features, and it adopts a dynamically adjusted soft-label strategy to assign labels reasonably and improve detection accuracy. On the DOTA and HRSC2016 datasets, the mean average precision reaches 76.36% and 90.51%, respectively, surpassing most oriented detection models and demonstrating the advancement and effectiveness of the method.
      Yu Lingxiao, Hao Jie, Zuo Liang
      Vol. 30, Issue 3, Pages: 696-709(2025) DOI: 10.11834/jig.240247
      Supervised attention-based oriented object detection in remote sensing images
      Abstract: Objective: With the emergence of convolutional networks in recent years, deep learning-based object detection algorithms for remote sensing images have achieved considerable effectiveness. Compared with object detection in natural scenes, object detection in remote sensing images faces remarkable challenges, including arbitrary orientation, dense arrangements, and multiscale distributions. Traditional bounding box annotations use horizontal bounding boxes aligned with the coordinate axes to represent object locations. However, using horizontal boxes as anchor boxes or proposals for detecting oriented objects leads to notable drawbacks. Therefore, object detection in remote sensing images generally employs oriented bounding boxes to accurately describe the objects. However, existing oriented object detection algorithms mostly identify potential regions using predefined anchor boxes. While these predefined anchor boxes can help oriented object detection algorithms recognize objects of different scales and shapes, they also have notable limitations. First, the anchor boxes are predefined; therefore, any deviation of object dimensions from these boxes can result in decreased detection accuracy, potentially leading to higher miss or false detection rates. In addition, the number, scale, and aspect ratio of anchor boxes must be determined based on experience or parameter tuning, which can hinder the model's capability to generalize across different scenarios or datasets. Second, anchor-based object detection models often define a large number of anchor boxes to cover objects of various sizes and aspect ratios to achieve a high recall rate. However, this notably increases the computational complexity and further exacerbates the positive-negative sample imbalance problem. Additionally, remote sensing images often depict highly complex scenes, such as dense urban buildings, which introduce a substantial amount of disturbing information into the images. This complexity makes it challenging for traditional feature extraction networks based on classic backbone networks and feature pyramid networks to accurately extract and highlight the important features of the objects. This paper proposes a novel high-precision oriented object detection model for remote sensing images by combining a supervised mask attention module (SMAM) and an anchor-free oriented region proposal network (AFORPN) to address these challenges. Method: In this paper, a two-stage detector based on Faster R-CNN (region-based convolutional neural network) is introduced for oriented object detection in remote sensing images. This model comprises four components: the feature extraction backbone network, the SMAM, the AFORPN, and the R-CNN. The main contributions are as follows. An SMAM is constructed for the extracted feature pyramid to focus more on the object regions, suppress background noise, and achieve fine feature extraction. The SMAM comprises three parts: multiscale feature fusion, spatial attention, and a supervised mask enhancement module. First, a sub-pixel convolution technique is adopted for the multiscale fusion of feature pyramids to unify spatial scales by upsampling the feature maps through convolutional learning and channel rearrangement, retaining more object information than nearest-neighbor upsampling (a minimal sub-pixel upsampling sketch follows the keywords below).
Then, a channel attention mechanism is used to learn the weight coefficients of each channel and adaptively fuse feature maps from different levels, improving the representation capability of the model for input features. Subsequently, a self-attention mechanism is employed in the design of the spatial attention module for processing the fused feature maps, allowing each pixel in the feature map to consider the information from all other pixels and thereby establishing global pixel-wise dependencies. This approach contributes to an improved understanding of the image background and semantic correlations within the model, enhancing its capability to comprehend the surrounding environment and suppressing the blending effects caused by fusion. Finally, in the design of the supervised mask enhancement module, the model is guided to learn semantic features and object contour information through mask loss feedback, which drastically improves the accuracy of classification and localization. An AFORPN is proposed to avoid complex and redundant anchor box designs. The AFORPN comprises localization and classification branches. In the localization branch, the keypoint regression method based on fully convolutional one-stage object detection (FCOS) is adopted, and the gliding vertex technique is introduced to generate oriented bounding boxes. The prediction of midpoint offsets effectively alleviates the sensitivity to angle variations, allowing the bounding boxes to better conform to object shapes and improving keypoint regression performance. During the training stage, a new label assignment criterion is proposed based on the spatial alignment between samples and objects as well as the regression performance of the samples. This criterion gradually increases the weight given to regression performance through dynamic adjustment, enabling accurate label assignment and effectively mitigating the inconsistency between classification and regression that may arise from static heuristic assignment rules. Result: Ablation analysis is performed on the dataset for object detection in aerial images (DOTA) to demonstrate the effectiveness of the method. Experimental results show that the proposed SMAM and AFORPN improve the mean average precision (mAP) by 1.52% and 2.65%, respectively, compared with the baseline. The proposed approach is evaluated on the DOTA and HRSC2016 datasets and compared with other oriented detection algorithms to demonstrate its state-of-the-art performance. Without special processing, a mAP of 75.36%, which is higher than that of most oriented object detection models, is achieved on the DOTA dataset. On the HRSC2016 dataset, the model also achieves a mAP of up to 90.51%. Conclusion: Extensive experimental results demonstrate that the proposed SMAM improves the quality of feature maps, while the proposed AFORPN generates high-quality region proposals, further enhancing the detection performance for oriented objects. Overall, the proposed oriented detection model, which combines SMAM with AFORPN, exhibits promising detection capabilities and can effectively adapt to complex oriented object detection scenarios in remote sensing images.
      Keywords: deep learning; remote sensing image; oriented object detection; anchor-free; attention mechanism; multi-scale feature fusion
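      An illustrative sketch (not the paper's module) of sub-pixel convolution upsampling as described above: a 1 x 1 convolution expands channels by r^2 and PixelShuffle rearranges them into an r-times larger feature map. Channel sizes are placeholders, not the paper's configuration.

# Sub-pixel (pixel shuffle) upsampling block for multiscale feature fusion.
import torch
import torch.nn as nn

class SubPixelUpsample(nn.Module):
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch * scale * scale, kernel_size=1)
        self.shuffle = nn.PixelShuffle(scale)      # learned channel rearrangement, no interpolation

    def forward(self, x):
        return self.shuffle(self.proj(x))

if __name__ == "__main__":
    up = SubPixelUpsample(in_ch=256, out_ch=128, scale=2)
    deep_level = torch.randn(1, 256, 25, 25)       # deeper pyramid level
    print(up(deep_level).shape)                    # torch.Size([1, 128, 50, 50])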
    • In the field of infrared aircraft target detection, experts propose a detection algorithm based on an adaptive weighted fusion mechanism of global and local context, effectively improving the ability to discriminate targets from the background.
      Xu Hongpeng, Liu Gang, Xi Jiangtao, Tong Jun
      Vol. 30, Issue 3, Pages: 710-723(2025) DOI: 10.11834/jig.240271
      Infrared aircraft detection algorithm based on adaptive weighted fusion of global-local context
      Abstract: Objective: An infrared aircraft target detection algorithm based on adaptive weighted fusion of global-local context (AWFGLC) is proposed to address the challenge of insufficient target feature extraction caused by the small imaging area and weak radiation intensity in long-range infrared aircraft target detection. The global context focuses on the overall distribution of the target, providing the detection algorithm with global information about targets with strong radiation intensity and clear contours. In contrast, the local context emphasizes the local details of the target and the surrounding background information, offering the detection algorithm local information about targets with weak radiation intensity and blurred contours. Therefore, in practical applications, the global and local context should be combined in accordance with the target characteristics. Method: Based on the global-local context adaptive weighted fusion mechanism, the input feature map is randomly divided and reorganized along the channel dimension, resulting in two separate feature maps. Compared with modeling the global and local context of the input feature map in a fixed arrangement or simply splitting it into two feature maps, randomly reorganizing the channels with a different random seed in each training round diversifies the arrangement of the input feature map during iterative training, and modeling the global and local context of these different combinations helps the detection algorithm learn more diverse and comprehensive features. Moreover, dividing the input feature map equally along the channel dimension before performing global and local contextual modeling halves the complexity of the contextual modeling. One feature map is subjected to global context modeling using self-attention to establish the dependencies between each pixel and all other pixels, capture the correlation between target and background features, emphasize the more salient features of the target, and enable the detection algorithm to perceive the global features of the target effectively. The other feature map is divided into windows, and maximum and average pooling are performed within each window to highlight the local features of the target. The pooled feature map is then subjected to local contextual modeling using self-attention to establish the correlation between the target and its surrounding neighborhood, further enhancing the weaker parts of the target features and allowing the detection algorithm to perceive the local features of the target effectively. According to the target characteristics, the global and local context feature maps are adaptively weighted and channel-concatenated using an adaptive weighted fusion strategy with learnable parameters, which are updated by the optimizer under the guidance of the detection loss (a minimal fusion sketch follows the keywords below). Subsequently, feature maps containing more complete target information are obtained using convolution, batch normalization, and activation functions, enhancing the capability of the detection algorithm to discriminate between target and background.
The AWFGLC mechanism is incorporated into the YOLOv7 feature extraction network, and context modeling is performed on the 4× downsampled and 32× downsampled feature maps to make full use of the physical information in the shallow feature maps and the semantic information in the deeper feature maps, improving the capability of the model to extract infrared aircraft target features. Result: Experimental results show that, on the self-built infrared aircraft dataset, the proposed AWFGLC mechanism outperforms contextual mechanisms such as global context, position attention, and local Transformer in detection accuracy, at the cost of an increase in the number of parameters and computation. On this dataset, the proposed AWFGLC mechanism tends to learn the global features of the target. Compared with the YOLOv7 detection algorithm, which has the second-best performance, the proposed AWFGLC-YOLO algorithm improves mAP50 and mAP50:95 by 1.4% and 4.4%, respectively. On the publicly available dataset for weak aircraft target detection and tracking in infrared images with ground/air backgrounds, the proposed AWFGLC mechanism learns the local features of the target. Compared with the YOLOv8 detection algorithm, which exhibits the second-best performance, the proposed AWFGLC-YOLO algorithm improves mAP50 and mAP50:95 by 3.0% and 4.0%, respectively. Conclusion: The infrared aircraft detection algorithm proposed in this paper is superior to classical target detection algorithms and effectively achieves infrared aircraft target detection.
      Keywords: infrared aircraft; target detection; global context; local context; adaptive weighted
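      An illustrative sketch (not the paper's module) of adaptive weighted fusion of a global-context and a local-context feature map with learnable weights, followed by channel concatenation, a 1 x 1 convolution, batch normalization, and an activation. Shapes and the activation choice are placeholders.

# Learnable-weight fusion of global and local context feature maps.
import torch
import torch.nn as nn

class AdaptiveWeightedFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(2))      # learnable fusion weights, updated by the optimizer
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, f_global, f_local):
        a = torch.softmax(self.w, dim=0)           # normalized weights for the two branches
        return self.fuse(torch.cat([a[0] * f_global, a[1] * f_local], dim=1))

if __name__ == "__main__":
    fusion = AdaptiveWeightedFusion(64)
    g = torch.randn(2, 64, 20, 20)                 # global-context features
    l = torch.randn(2, 64, 20, 20)                 # local-context features
    print(fusion(g, l).shape)                      # torch.Size([2, 64, 20, 20])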
    • In the field of ship detection in synthetic aperture radar images, researchers propose an improved algorithm based on YOLOv8 that effectively increases detection accuracy and speed.
      Yu Guanghao, Chen Runlin, Xu Jinyan, Xu Qianxiang, Wang Dahan, Chen Feng
      Vol. 30, Issue 3, Pages: 724-736(2025) DOI: 10.11834/jig.240210
      Lightweight model for SAR ship detection incorporating deformable convolution and attention mechanism
      Abstract: Objective: Synthetic aperture radar (SAR) has recently been widely used in fields such as maritime monitoring, military intelligence acquisition, and maritime management, primarily because of its capability to acquire data at any time under all weather conditions. Better-performing detection algorithms not only help improve ocean monitoring and navigation safety but also play a key role in areas such as maritime rescue, border security, and ocean resource management. Ship target detection methods can be divided into two categories: those based on deep learning and traditional methods. Deep learning methods offer high accuracy and strong generalization capabilities. These methods can be further classified into one-stage and two-stage detection. Compared with two-stage detection methods, one-stage detection methods generally achieve faster detection speeds at the expense of lower detection accuracy. One-stage detection methods, such as YOLO and the single shot multibox detector (SSD), extract features through a backbone network, followed by direct classification and spatial position regression. Two-stage detection methods, such as R-CNN (region-based convolutional neural network) and Fast R-CNN, typically involve initial region generation followed by final region classification and regression. Currently, an increasing number of scholars are focusing on deep learning-based algorithms for ship target detection in SAR images. However, most of these methods have struggled to achieve an optimal balance between detection accuracy and processing efficiency. In this study, a lightweight model based on YOLOv8 was proposed to improve the performance of SAR ship detection while balancing detection accuracy and efficiency. Method: This study proposed a new method that substantially improves YOLOv8, called LDCE (lightweight-deformable convolution-CBAM-EIoU)-YOLOv8. The network structure of YOLOv8 was first reconstructed to reduce network redundancy while maintaining sensitivity to ship features in SAR images. Furthermore, the introduction of deformable convolution (DConv) allows the network to better perceive the environmental information around ship targets, thereby improving its capability to understand and capture these targets. The convolutional block attention module (CBAM) was introduced to minimize the interference of background information, enabling the network to focus on the key features of ship targets. Additionally, the efficient intersection over union (EIoU) loss function was adopted to enhance the convergence speed of the model (a sketch of this loss formulation follows the keywords below). The experiments were initially conducted on the publicly available SAR ship detection dataset (SSDD), which comprises 1 160 images with an average size of 500 × 500 pixels and a total of 2 587 ship target instances. SSDD was randomly divided into training and testing sets at a ratio of 8∶2. During training, the input images were resized to 640 × 640 pixels. The batch size and initial learning rate were set to 32 and 0.001, respectively, while the momentum and weight decay coefficients were 0.937 and 0.000 5, respectively. Multiple ablation experiments were conducted to validate the effectiveness of the newly proposed model, using the original YOLOv8 as the baseline for comparison.
Furthermore, additional comparisons were conducted with other recently proposed methods (i.e., Yang's method and MSSDNet) as well as widely used detection algorithms, including Faster R-CNN, SSD, RetinaNet, YOLOv5, and YOLOv6. Additional experiments were conducted on the high-resolution SAR images dataset for ship detection (HRSID) to further validate the effectiveness and generalization of LDCE-YOLOv8. This dataset contains 5 604 images with an average size of 800 × 800 pixels and a total of 16 951 ship target instances. Result: For YOLOv8 (the baseline), the accuracy evaluation indexes mAP@0.5 and mAP@0.75 were 98.16% and 82.46%, respectively, and the speed evaluation index, frames per second (FPS), was 263 frame/s. For LDCE-YOLOv8, mAP@0.5 and mAP@0.75 were 98.84% and 85.78%, respectively, and the FPS was 285 frame/s. The parameter count of LDCE-YOLOv8 decreased by 24.58% compared with YOLOv8. The mAP@0.5, mAP@0.75, and FPS of LDCE-YOLOv8 were 0.62%, 2.23%, and 18.30% higher than those of MSSDNet, and 0.90%, 2.96%, and 4.40% higher than those of Yang's method, respectively. The iterative curves of the bounding box loss for each training run in the ablation experiment showed that LDCE-YOLOv8 had the lowest loss value and the fastest convergence. Overall, the results clearly showed that the newly proposed model, LDCE-YOLOv8, exhibited the best detection performance in terms of parameter count, precision, recall, average precision, and FPS. These results indicate a stronger feature extraction capability of LDCE-YOLOv8 for detecting ship targets in SAR images. Detection result graphs for five representative scenarios were presented to intuitively compare the detection performance of the different methods. In each case, LDCE-YOLOv8 achieved the best performance, accurately detecting all ship targets across these scenarios. Consequently, the newly proposed method demonstrated strong anti-interference capability when handling substantial irregular noise in SAR images. Moreover, this method suppressed false alarms from strong bright spots with high similarity to ship features, while also performing well in small object detection. Whether in complex nearshore scenes or simpler sea scenes, LDCE-YOLOv8 effectively reduced missed and false detections while maintaining a high level of detection confidence. Additionally, the newly proposed method achieved better experimental results on the HRSID dataset, with mAP@0.5 at 88.91%, mAP@0.75 at 73.74%, and FPS at 312 frame/s, representing improvements of 1.29%, 3.13%, and 6.1% compared with YOLOv8. Accordingly, the results on HRSID demonstrated the superior detection performance of LDCE-YOLOv8 in SAR ship detection. Conclusion: In this study, a lightweight model based on YOLOv8 was proposed to enhance SAR ship detection in terms of accuracy and efficiency. This model, named LDCE-YOLOv8, incorporates deformable convolution and an attention mechanism. Specifically, LDCE-YOLOv8 features substantially reduced network redundancy while maintaining its capability to capture ship features in SAR images. The integration of deformable convolution enhances the network's capability to perceive environmental information around ship targets. Additionally, convolutional block attention modules and the efficient intersection over union loss function were incorporated to enhance target localization.
Experiments on two publicly available datasets (i.e., SSDD and HRSID) validated the effectiveness of LDCE-YOLOv8 in SAR ship detection. However, the newly proposed algorithm still exhibits limitations, particularly in accurately detecting all ship targets in SAR images with densely packed ships. Ongoing investigations are focused on addressing this specific challenge.  
      Keywords: synthetic aperture radar (SAR); object detection; YOLOv8; convolutional block attention module (CBAM); deformable convolution; efficient intersection over union (EIoU)
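      An illustrative sketch of the EIoU loss referenced above, following the commonly published formulation: 1 - IoU plus penalties on the center distance and on the width/height differences, each normalized by the smallest enclosing box. The paper's exact implementation may differ.

# Efficient IoU (EIoU) loss for axis-aligned boxes given as (x1, y1, x2, y2).
import torch

def eiou_loss(pred, gt, eps=1e-7):
    """pred, gt: (N, 4) boxes. Returns per-box losses of shape (N,)."""
    # intersection and IoU
    ix1, iy1 = torch.max(pred[:, 0], gt[:, 0]), torch.max(pred[:, 1], gt[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], gt[:, 2]), torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)
    # smallest enclosing box
    cw = (torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])).clamp(min=eps)
    ch = (torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])).clamp(min=eps)
    # center-distance, width, and height penalties
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    gcx, gcy = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2
    center = ((pcx - gcx) ** 2 + (pcy - gcy) ** 2) / (cw ** 2 + ch ** 2)
    dw = ((pred[:, 2] - pred[:, 0]) - (gt[:, 2] - gt[:, 0])) ** 2 / cw ** 2
    dh = ((pred[:, 3] - pred[:, 1]) - (gt[:, 3] - gt[:, 1])) ** 2 / ch ** 2
    return 1.0 - iou + center + dw + dh

print(eiou_loss(torch.tensor([[0., 0., 10., 10.]]), torch.tensor([[2., 2., 12., 12.]])))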

      Image Processing and Coding

    • In the field of digital cultural heritage protection, experts propose a tomb mural inpainting method with adaptive convolutional constraints and global context inference that effectively repairs multiple complex types of deterioration and provides a reference for the physical restoration carried out by expert restorers.
      Wu Meng, Guo Ge, Sun Zengguo, Lu Zhiyong, Zhang Qianwen
      Vol. 30, Issue 3, Pages: 737-754(2025) DOI: 10.11834/jig.240277
      Tomb mural inpainting with adaptive convolutional constraints and global context inference
      Abstract: Objective: As an important cultural heritage of ancient civilization, murals have suffered over the years from environmental factors such as humidity and ground settlement. This has led to problems such as peeling, cracks, mud spots, and mildew, which seriously affect the sustainable preservation of murals and hinder activities related to appreciation, cultural creativity, and cultural dissemination. Considering the influence of the underground environment, researchers often use block excavation methods to transfer and restore the murals. Traditionally, the restoration process requires professional restorers to manually redraw the murals, which demands a high level of skill and results in a lengthy and inefficient repair cycle. Therefore, in response to the complex semantic environment and the lack of diverse information, deep learning-based restoration methods have gradually been applied to the reconstruction of murals, providing scientific and technological protection for them. However, most existing methods perform restoration in a single dimension or within a fixed area, which fails to fully capture sparse mural features or to repair multiple complex diseases simultaneously, resulting in semantic inconsistencies or incoherent structures. This paper proposes a tomb mural inpainting model with adaptive convolutional constraints and global context inference to solve these problems. The model can repair various types of damage and disease, producing a rich database of digital cultural heritage. Method: Based on an end-to-end encoder-decoder architecture, the model first designs a multiscale enhanced convolution (MEConv) module in the encoder path for content constraints, extracting complementary features of the image from the frequency and spatial domains simultaneously. An enhanced activation unit fused with differential convolution is also added to the repair path to introduce edge prior information and correct the adaptive multiscale feature mapping, which can effectively capture global and local latent feature information. Second, considering the differences between textural and structural patterns in the mural painting process, a multiscale feature aggregation (MSFA) module is added between the encoder and the decoder. Through the fusion of multiscale feature components and dynamic selection by the attention mechanism, the adaptive selection and enhancement of effective global context information are strengthened, the representation and generalization capability of the original feature map are enhanced, and the drawing accuracy is improved. In addition, the model compensates for the difference between low-level and high-level features through automatic mask updating and skip connections in each layer to obtain true and accurate results during feature transfer, enabling the decoder to accurately draw the missing content. Result: This paper uses three objective evaluation indexes to verify the disease restoration effect of the network: peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and learned perceptual image patch similarity (LPIPS) (a minimal PSNR computation is sketched after the keywords below). These indexes are used to conduct three types of disease restoration experiments on the constructed polo mural dataset, and the results are compared with those of four recent mainstream methods.
Experimental results show that the restored images by this method exhibit improvements in subjective vision and objective evaluation, and the content and style of the generated murals are closer to those of the real murals. For example, compared with the second-ranked algorithm, the mean values of each index were increased by 0.143 2 dB, 0.016 7, and 2.17% for mural restoration in the epitaxial damage area. For the mural repair in the irregular damage area, the mean values of each index increased by 0.413 2 dB, 0.030 4, and 15.11%. For the mural restoration in the random damage area, the mean values of each index increased by 2.365 3 dB, 0.012 8, and 12.75%.ConclusionThe image inpainting model proposed in this paper can not only fully capture the latent features of different characteristics of the image, but also extract the long-term contextual semantic features in the deep features. The model can effectively deal with a variety of complex mural diseases and restore images with consistent semantics, rich details, complete content, and natural coherence, providing a feasible solution for the digital restoration and display of murals. Future work will collect additional high-definition tomb murals and consider the mural painting and historical background of different periods and years to restore the real damaged mural images. This approach will allow the model to learn highly accurate and comprehensive mural content and style, improving the restoration performance of the model.  
      Keywords:mural restoration;multi-scale enhanced convolution module;multi-scale feature aggregation module;enhanced activation unit;differential convolution;disease restoration
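As a rough illustration of the kind of multiscale convolution block described in this abstract, which extracts complementary spatial-domain and frequency-domain features, the following PyTorch sketch combines parallel spatial convolutions with an FFT-based branch. Module names, kernel choices, and the fusion layout are assumptions for illustration only, not the authors' MEConv implementation.

```python
# A minimal sketch of a multiscale spatial/frequency convolution block
# (illustrative assumption; not the paper's MEConv module).
import torch
import torch.nn as nn


class MultiScaleSpatialFrequencyConv(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Spatial branch: parallel convolutions with different receptive fields.
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv5 = nn.Conv2d(channels, channels, 5, padding=2)
        # Frequency branch: 1x1 convolution applied to the FFT amplitude.
        self.freq_conv = nn.Conv2d(channels, channels, 1)
        self.fuse = nn.Conv2d(channels * 3, channels, 1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s3 = self.act(self.conv3(x))
        s5 = self.act(self.conv5(x))
        # Frequency-domain features: transform, filter the amplitude, invert.
        spec = torch.fft.rfft2(x, norm="ortho")
        amp = self.freq_conv(torch.abs(spec))
        phase = torch.angle(spec)
        f = torch.fft.irfft2(torch.polar(amp, phase), s=x.shape[-2:], norm="ortho")
        return x + self.fuse(torch.cat([s3, s5, f], dim=1))


if __name__ == "__main__":
    y = MultiScaleSpatialFrequencyConv(16)(torch.randn(1, 16, 64, 64))
    print(y.shape)  # torch.Size([1, 16, 64, 64])
```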
    • In image deblurring, researchers propose a new multiscale deblurring method based on prompt learning that effectively addresses artifacts, blurred details, and residual noise, offering a new direction for image deblurring research.
      Xie Bin, Li Yanxian, Shao Xiang, Dai Bangqiang
      Vol. 30, Issue 3, Pages: 755-768(2025) DOI: 10.11834/jig.240315
      Multiscale image deblurring based on prompt learning and gated feedforward networks
      摘要:ObjectiveImage deblurring aims to restore a clean image from blurry images while still maintaining the structure and details of the original image during the restoration. With the rapid development of Internet technology, the way people obtain images becomes highly diversified. However, the image is often blurred or distorted by various factors during the acquisition process. Therefore, deblurring the image is necessary. Image deblurring is of considerable importance to improve image quality and plays a key role in numerous fields such as medical imaging, satellite image processing, and security monitoring, which has attracted the attention of many researchers. Additional prior knowledge is needed to recover images with high quality due to the ill-posed image deblurring task. At present, the existing deblurring methods include traditional and deep learning-based approaches. In the traditional methods, despite the simplicity and convenience of the filter-based deblurring method, the recovered images often have artifacts, content loss, and other problems, which fail to meet the needs of various applications. The deblurring method based on the idea of regularity has received increasing attention from researchers for a long time, and various methods of constructing regular terms have been proposed to solve this kind of ill-posed problems. These traditional methods can achieve the purpose of deblurring to a certain extent. However, they rely on the prior information of images, which is difficult to obtain accurately in practical applications. Therefore, this kind of method cannot be effectively promoted in a wide range. With the extensive application of deep learning technology, an increasing number of researchers begin to use this technology to address ill-posed problems. The image deblurring methods generally fall into three main categories: convolutional neural network (CNN)-based method, generative adversarial network (GAN)-based method, and Transformer-based method. In the CNN-based methods, the powerful feature extraction capability of CNNs allows the model to learn complex mapping relationships. By minimizing the loss function, these methods guide the model’s convergence to obtain the best output images. However, such approaches often lack multiscale features and can introduce artifacts and result in the loss of image details. Researchers have proposed a new framework named GAN to address these shortcomings. In this approach, the generator and discriminator are trained alternately to continuously improve the performance of the generator, leading to higher-quality output images. Following the success of Transformers in natural language processing, researchers have begun to introduce them into the field of image processing. The advantage of Transformer-based methods is their capability to better capture local context information, leading to improved image deblurring. However, incorporating Transformer blocks inevitably increases the computational complexity of the model. A novel multiscale image deblurring method based on prompt learning is proposed to address the problems of noticeable artifacts, fuzzy details and residual noise in previous image deblurring methods.MethodIn this paper, three improvements are made. First, the degraded information coding module based on Prompt learning can use the context information in the degraded image to dynamically guide the deep network to complete different image deblurring tasks. 
Next, a gated feedforward network is designed to control the flow of information at each level and build a richer and more hierarchical feature representation. Therefore, prompt U-shaped block (PUBlock) is designed. In addition, considering the original loss function, the adaptive total variation regularization is added to effectively suppress the noise residue in the process of image restoration and improve the visual performance of the result image. Through the introduction of a gating mechanism, the network can dynamically control the flow of information to effectively capture complex feature relationships. Using deep convolution can improve the efficiency of the model while ensuring its performance. Prompt learning can effectively help the model utilize degraded images, and adaptive regularization can selectively smooth the image, which not only removes the noise but also prevents the image from being over-smooth.ResultDeblurring experiments are performed on the GoPro and REDS datasets and compared them with other advanced methods to demonstrate the effectiveness of the proposed method. In addition, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are used as objective evaluation metrics. Experimental results show that the proposed method outperforms all other methods in GoPro and REDS datasets and achieves 33.04 dB and 0.962, respectively, on the GoPro dataset and 28.70 dB and 0.859, respectively, on the REDS dataset under the two metrics. These results are better than the PSNR and SSIM values of the conventional image deblurring method. The comparison results with the segment anything model-deblur (SAM-deblur) algorithm show that PSNR improves by 1.77 dB on the REDS dataset, while those with double-scale network with deep feature fusion attention (DFFA-Net) based on the GoPro dataset show that the proposed method improve the PSNR and SSIM by 0.49 dB and 0.005, respectively. In addition, the visual results reveal that the images recovered by the proposed model are closest to the original real image, maintaining the original structure and features, and exhibits a finer edge.ConclusionIn this paper, aiming to address the problems of existing image deblurring methods, a novel multiscale image deblurring method based on prompt learning is introduced. The experimental results show that the new method not only preserves the details of the result image but also effectively overcomes the problems of evident artifacts and noise residue. The result image also has superior performance in the objective evaluation metrics on PSNR and SSIM.  
      Keywords:image deblurring;prompt learning;multi-scale;gated feedforward network(GFFN);depthwise convolution
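To make the gating idea in this abstract concrete, the sketch below shows a gated feedforward block with depthwise convolution: the feature map is split into two streams and one stream gates the other. This is a minimal sketch under assumed layer sizes, not the paper's exact GFFN.

```python
# A minimal gated feedforward block with depthwise convolution
# (illustrative assumption; not the paper's exact GFFN).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedFeedForward(nn.Module):
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.norm = nn.GroupNorm(1, channels)            # channel-wise normalization
        self.proj_in = nn.Conv2d(channels, hidden * 2, 1)
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, 3,
                                padding=1, groups=hidden * 2)   # depthwise convolution
        self.proj_out = nn.Conv2d(hidden, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.dwconv(self.proj_in(self.norm(x)))
        a, b = y.chunk(2, dim=1)                         # split into two streams
        return x + self.proj_out(F.gelu(a) * b)          # one stream gates the other


if __name__ == "__main__":
    out = GatedFeedForward(32)(torch.randn(2, 32, 64, 64))
    print(out.shape)  # torch.Size([2, 32, 64, 64])
```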
    • In information hiding, a reversible data hiding scheme for encrypted images based on bit-plane combined encoding containing the opposite bit is proposed, effectively improving the embedding rate.
      Chen Zhenyu, Yin Zhaoxia, Zhan Hongjian, Lyu Shujing, Hu Menghan
      Vol. 30, Issue 3, Pages: 769-783(2025) DOI: 10.11834/jig.240287
      Reversible data hiding in encrypted images based on combined encoding method containing the opposite bit
      摘要:ObjectiveThe technology of reversible data hiding in encrypted images (RDHEI) aims to embed secret information into encrypted images, ensuring that the secret information and the original image can be extracted and restored without loss. This technology is gaining increasing attention from researchers and is widely applied in cloud services to protect users’ privacy. Currently, RDHEI can be mainly divided into two categories: the VRAE (vacating room after encryption) algorithm and the RRBE (reserving room before encryption) algorithm, based on the order of image encryption and room operation. The VRAE algorithm vacates space by compressing the pixels of the encrypted image. This compression yields only a limited amount of space due to the high information entropy of the encrypted image. In contrast, the RRBE algorithm primarily compresses the image using pixel correlation before encrypting it. The original image has lower information entropy; thus, more space can be reserved before encryption. This paper proposes a new RDHEI scheme based on bit-plane compression containing opposite bits and leverages the correlation between the encoding information delivery efficiency and adjacent pixels to improve the performance of reversible data hiding in encrypted images.MethodFirst, bit-plane rearrangement and pixel prediction methods are adopted to ensure the utilization of the correlation between adjacent pixels. The image owner initially divides the original image into several equally sized blocks and calculates the prediction errors of the original image pixels. Afterward, the eight bit-planes of the processed image are rearranged. In the phase of bit-plane compression, a combined encoding method containing the opposite bit is introduced. Specifically, based on the threshold length, the image bitstream is divided into continuous and discontinuous streams for compression. After compressing a continuous bitstream string, the next opposite digit at the end of the string is include; that is, each long compressed bitstream adds an opposite digit. According to this rule, the rearranged images are compressed and sequentially placed in each high-level plane with additional information. Encryption and scrambling operations occur at this point. The room in the low-level plane is then vacated, and the information hider embeds the data into the reserved space of the encrypted image. Finally, the image receiver extracts the original image or secret information without loss based on the different keys used.ResultExperimental comparisons with six advanced methods on six standard test images and two common datasets, namely BOSSBase and BOWS2, are conducted to evaluate the effectiveness of this algorithm. The embedding rate is used to measure algorithm performance, while peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) indicators serve as quantitative evaluation metrics for lossless reversible recovery. Experimental results show that the average embedding rates of the proposed algorithm on the BOSSBase and BOWS2 datasets are 3.818 3 bpp and 3.694 3 bpp, respectively, demonstrating superior performance compared to similar algorithms. PSNR and SSIM are constant values equal to +∞ and 1, indicating the reversibility of the algorithm.ConclusionThe proposed algorithm uses the image correlation of the original image and effectively explores the potential of the encoded and compressed information during the encoding and compression processes. 
This algorithm addresses the issue of compression loss caused by the short and large number of continuous bitstream strings in practical applications, thereby improving compression efficiency and enhancing the embedding rate.  
      Keywords:reversible data hiding in encrypted image(RDHEI);containing opposite bit compression;combined encoding;prediction error;bit-plane;embedding rate(ER)
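The core compression idea described in this abstract is that the bit immediately following a run of identical bits is, by definition, the opposite bit, so it can be absorbed into the code word at no extra cost. The toy Python sketch below illustrates that principle with a simplified run-length style encoder and decoder; the code words, threshold, and field widths are assumptions and differ from the paper's actual codec.

```python
# Toy illustration of run-length encoding where each encoded run implicitly
# carries the following opposite bit (assumed code words; not the paper's codec).
def encode(bits, min_run=4, run_field=8):
    out, i, n = [], 0, len(bits)
    while i < n:
        j = i
        while j < n and bits[j] == bits[i]:
            j += 1
        run = j - i
        # Long runs are encoded compactly; the bit after the run (if any) is
        # necessarily the opposite of bits[i], so it is absorbed for free.
        if min_run <= run < 2 ** run_field:
            absorbed = 1 if j < n else 0
            out.append(("R", bits[i], run, absorbed))
            i = j + absorbed
        else:
            out.append(("L", bits[i]))   # short runs stay literal
            i += 1
    return out


def decode(codes):
    bits = []
    for c in codes:
        if c[0] == "R":
            _, b, run, absorbed = c
            bits.extend([b] * run)
            if absorbed:
                bits.append(1 - b)       # the implicit opposite bit
        else:
            bits.append(c[1])
    return bits


if __name__ == "__main__":
    msg = [1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1]
    assert decode(encode(msg)) == msg
```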
    • In image super-resolution, researchers propose a cross-scale Transformer model with fused channel attention that effectively improves reconstruction performance, and its effectiveness is verified on multiple datasets.
      Li Yan, Dong Shihao, Zhang Jiawei, Zhao Ru, Zheng Yuhui
      Vol. 30, Issue 3, Pages: 784-797(2025) DOI: 10.11834/jig.240279
      Cross-scale Transformer image super-resolution reconstruction with fusion channel attention
      摘要:ObjectiveThe image super-resolution reconstruction technique refers to a method for converting low-resolution (LR) images to high-resolution (HR) images within the same scene. In recent years, this technique has been widely used in computer vision, image processing, and other fields due to its wide practical application value and far-reaching theoretical importance. Although the model based on convolutional neural networks has made remarkable progress, most super-resolution network structures remain in a single-layer level end-to-end format to improve the reconstruction performance. This approach often overlooks the multilayer level feature information during the network reconstruction process, limiting the reconstruction performance of the model. With the advancement of deep learning technology, Transformer-based network architectures have been introduced into the field of computer vision, yielding substantial results. Researchers have applied Transform models to underlying vision tasks, including image super-resolution reconstruction. However, in this context, the Transformer model suffers from a single feature extraction pattern, loss of high-frequency details in the reconstructed image, and structural distortion. A cross-scale Transformer image super-resolution reconstruction model with fusion channel attention is proposed to address these problems.MethodThe model comprises the following four modules: shallow feature extraction, cross-scale deep feature extraction, multilevel feature fusion, and a high-quality reconstruction module. Shallow feature extraction uses convolution to process early images to obtain highly stable outputs, and the convolutional layer can provide stable optimization and extraction results during early visual feature processing. The cross-scale deep feature extraction module uses the cross-scale Transformer and the enhanced channel attention mechanism to acquire features at different scales. The core of the cross-scale Transformer lies in the cross-scale self-attention mechanism and the gated convolutional feedforward network, which down samples the feature maps to different scales by scale factors and learns contextual information using image self-similarity, and the gated convolutional network encodes spatial neighboring pixel position information and helps learn the local image structure, replacing the feedforward network in the traditional Transformer. A reinforced channel attention mechanism is used after the cross-scale Transformer to expand the sensory field and extract different scale features to replace the original features via weighted filtering for backward propagation. Increasing the depth of the network will lead to saturation. Thus, the number of residual cross-scale Transformer blocks is set to 3 to maintain a balance between model complexity and super-resolution reconstruction performance. After stacking different scale features in the multilevel feature fusion module, the enhanced channel attention mechanism is used to dynamically adjust the channel weights of different scale features and learn rich contextual information, thereby enhancing the network reconstruction capability. In the high-quality reconstruction module, convolutional layers and pixel blending methods are used to up-sample features to the corresponding dimensions of high-resolution images. 
In the training phase, the model is trained using 900 HR images from the DIV2K dataset, and the corresponding LR images are generated from the HR images using double-triple downsampling (with downsampling multiples of ×2, ×3 and ×4). The network is optimized using Adam’s algorithm with L1loss as the loss function.ResultTests on five standard datasets, namely, Set5, Set14, BSD100, Urban100, and Manga109, are performed, and the performance of the proposed model is compared with 10 state-of-the-art models. These models include the following: enhanced deep residual networks for single image super-resolution (EDSR), residual channel attention networks (RCAN), second-order attention network (SAN), cross-scale non-local attention (CSNLA), the cross-scale internal graph neural network (IGNN), holistic attention network (HAN), non-local sparse attention (NLSA), image restoration using Swin Transformer (SwinIR), efficient long-range attention network (ELAN), and permuted self-attention (SRFormer). Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are used as metrics to measure the performance of these methods. Humans are very sensitive to the brightness of an image; therefore, these metrics are measured in the Y-channel of the image. Experimental results show that the proposed model obtains high PSNR and SSIM values and recovers additional detailed information and highly accurate textures at magnification factors of ×2, ×3, and ×4. The proposed method improves 0.13~0.25 dB over SwinIR and 0.07~0.21 dB over ELAN on the Urban100 dataset and 0.07~0.21 dB over SwinIR and 0.06~0.19 dB over ELAN on the Manga109 dataset. The localized attribution map (LAM) is used to further explore the model performance. The experimental results revealed that the proposed model can utilize a wider range of pixel information, and the proposed model exhibits a higher diffusion index (DI) compared to SwinIR, proving the effectiveness of the proposed model from the interpretability viewpoint.ConclusionThe proposed cross-scale Transformer image super-resolution reconstruction model with multilevel fusion channel attention reduces noise and redundant information in the image by fusing convolutional features with Transformer features. This model also uses a strengthened channel attention mechanism to reduce the likelihood of image blurring and distortion in the model, and the image super-resolution performance is effectively improved. The test results verify the effectiveness of the multi-tip model in numerous public experimental datasets. The model visually obtains a reconstructed image that is sharper and closer to the real image with fewer artefacts.  
      Keywords:image super-resolution;cross-scale Transformer;channel attention mechanism;feature fusion;deep learning
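The channel attention used throughout this abstract follows the general squeeze-and-excitation pattern: pool each channel to a global descriptor, learn per-channel weights, and re-weight the feature map. The sketch below shows that pattern only; the paper's "enhanced" variant and its reduction ratio are not reproduced here.

```python
# A minimal squeeze-and-excitation style channel attention block
# (illustrative assumption; not the paper's enhanced variant).
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global context per channel
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(self.pool(x))                        # excitation: per-channel weights
        return x * w                                     # re-weight feature channels


if __name__ == "__main__":
    feat = torch.randn(1, 64, 48, 48)
    print(ChannelAttention(64)(feat).shape)  # torch.Size([1, 64, 48, 48])
```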
    • In image dehazing, researchers propose a multi-branch non-homogeneous image dehazing algorithm that effectively enhances non-homogeneous hazy images and demonstrates strong robustness and competitive performance metrics.
      Jin Xinle, Liu Chunxiao, Ye Shuangshuang, Wang Chenghua, Zhou Zixiang
      Vol. 30, Issue 3, Pages: 798-810(2025) DOI: 10.11834/jig.240253
      Multi-branch non-homogeneous image dehazing based on concentration partitioning and image fusion
      摘要:ObjectiveWhen capturing images using a camera, atmospheric floating particles, such as smoke, dust, and fog, can affect image quality, leading to decreased clarity. These compromised images not only increase the likelihood of human visual misjudgment but also hinder the development of visual tasks such as remote sensing monitoring and autonomous driving. Current dehazing methods are effective for homogeneous thin hazy images but often perform poorly on the nonhomogeneous hazy images. Therefore, a multibranch non-homogeneous image dehazing method combined with concentration partitioning and image fusion is proposed to address these challenges. A single non-homogeneous hazy image is regarded as a combination of multiple local regions with homogeneous thin or dense haze. The entire non-homogeneous image is dehazed by separately addressing different homogeneous hazy regions in a single nonhomogeneous hazy image.MethodConcentration partitioning and image fusion based multi-branch image dehazing neural network (CPIFNet), a two-stage network framework for image enhancement and image fusion, is then designed. Experiment results revealed that training models based on homogeneous haze image datasets with different haze concentrations can lead to enhancement in image models with varying enhancement intensities. Homogeneous hazed image datasets with different haze concentrations are necessary to obtain varying enhancement models. FiveK-Haze is a synthesized dehazing dataset based on the atmospheric physical scattering model, encompassing nine types of different homogeneous hazed images with varying haze concentrations. The hazy images in the FiveK-Haze dataset are re-partitioned based on haze concentration, dividing the dataset into 1 to 5 different haze concentration levels to exclude the hazy image samples with excessive haze concentrations. Then, the image enhancement network is trained on those new homogeneous dehazing datasets to obtain image enhancement models for different haze concentrations. In the image enhancement networks, the deep image features of hazy images are continuously extracted to obtain the stretching coefficient of the image enhancement model. This coefficient is multiplied with the hazy image to produce the image enhancement result. The image enhancement network replaces network layers with residual modules to extract deep feature information, avoiding feature information loss as the network deepens. The ReLU activation function is used after each convolutional layer to accelerate the convergence speed of network training and avoid the transmission of negative values in the feature layers. Each enhancement network performs well for the corresponding haze concentration image region. However, a single enhancement network can only effectively enhance image regions with corresponding haze concentrations, which leads to insufficient or excessive dehazing in other regions. Therefore, an image fusion network is designed to combine the advantageous regions in the multiple initial enhancement results, producing the final dehazed result. In the image fusion network, deep image features of different image enhancement results are continuously extracted, and the dehazed result is obtained by stacking and merging these deep image features. In addition to reconstruction, perceptual, and structural losses, the image enhancement and image fusion networks also utilize color loss to constrain the image dehazing results of the network modules. 
This condition is due to the severe loss of pixel information in dense hazy images, complicating color restoration and allowing the color loss function to guide the color of the image dehazing result closer to the reference image.ResultTheoretically, a dehazing dataset with more finely divided haze concentration levels could enable network models to learn additional information of hazed images. However, experiments reveal otherwise. When the number of haze datasets is 3, that is, the number of image enhancement models is 3, CPIFNet achieves the best dehazing performance. Large-scale experiments are conducted in comparison with more than 10 latest image dehazing algorithms, revealing that the proposed method is optimal in terms of performance indicators and dehazing effects. Compared with the second-ranked self-paced semi-curricular attention network (SCAN) method, the method improves the reference indicators of peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) by 5.286 6 dB and 0.113 8, respectively, compared with the synthetic hazy image dataset FiveK-Haze. Compared with the second-ranked DEAN method, the proposed method reduces the non-reference metrics as FADE and HazDes by 0.079 3 and 0.051 2, respectively, over the real-world hazy image dataset. Additionally, additional tests are conducted on some publicly available datasets to obtain more comparative experiments and indicator evaluations. On the SOTS-indoor dataset, the method improves PSNR and SSIM by 2.518 2 dB and 0.012 3, respectively, compared to the second-ranked DeFormer method. On the SOTS-outdoor dataset, the method improves PSNR by 2.83 2 dB compared to the second-ranked SGID-PFF method and enhances SSIM by 0.023 8 compared to the second-ranked DeFormer method.ConclusionA two-stage, multi-branch deep neural network is designed to remove haze from a single non-homogeneous hazy image by separately addressing different homogeneous hazy regions. Compared to existing methods, the proposed method can enhance the structural contrast of dense hazy regions and slightly improve thin hazy regions while restoring color information.  
      Keywords:image dehazing;non-homogeneous hazy images;haze concentration partitioning;image fusion;multi-branch neural network
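The two-stage pipeline in this abstract first runs several enhancement branches, each predicting a stretching coefficient that multiplies the hazy image, and then fuses the initial results with a small fusion network. The PyTorch sketch below illustrates that data flow; layer sizes, the number of branches, and module names are illustrative assumptions, not the CPIFNet architecture.

```python
# Schematic two-stage dehazing: per-concentration enhancement branches + fusion
# (illustrative assumption; not the CPIFNet implementation).
import torch
import torch.nn as nn


class EnhanceBranch(nn.Module):
    """Predicts a per-pixel stretching coefficient and multiplies it with the hazy input."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 3, 3, padding=1), nn.Softplus(),   # positive coefficients
        )

    def forward(self, hazy: torch.Tensor) -> torch.Tensor:
        return hazy * self.net(hazy)


class FusionNet(nn.Module):
    """Merges the initial enhancement results into the final dehazed image."""
    def __init__(self, n_branches: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * n_branches, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 3, 3, padding=1),
        )

    def forward(self, results):
        return self.net(torch.cat(results, dim=1))


if __name__ == "__main__":
    hazy = torch.rand(1, 3, 128, 128)
    branches = [EnhanceBranch() for _ in range(3)]   # one branch per haze-concentration level
    dehazed = FusionNet(3)([b(hazy) for b in branches])
    print(dehazed.shape)  # torch.Size([1, 3, 128, 128])
```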

      Image Analysis and Recognition

    • In 3D model classification, a deep learning network is proposed that effectively fuses consistent and complementary information across multiple views, significantly improving classification accuracy.
      Wu Han, Hu Liangchen, Yang Ying, Jie Biao, Luo Yonglong
      Vol. 30, Issue 3, Pages: 811-823(2025) DOI: 10.11834/jig.240060
      Multiview consistent and complementary information fusion method for 3D model classification
      摘要:Objective3D model classification holds considerable promise across diverse applications, including autonomous driving, game design, and 3D printing. With the rapid advancement of deep learning, numerous deep neural networks have been investigated for 3D model classification. Among these approaches, view-based methods consistently outperform voxel mesh and 3D point cloud-based methods. The view-based method captures multiple 2D perspectives from various angles of 3D objects to represent their 3D information. This approach closely aligns with human visual processing, transforming 3D problems into manageable 2D tasks that can be solved using standard convolutional neural networks (CNNs). In contrast, voxel-based and point cloud-based methods primarily focus on the spatial characteristics of 3D models, necessitating the generation of substantial datasets. The view-based method visually transforms 3D challenges into 2D tasks by obtaining multiple 2D views from different angles of 3D models, mirroring human approaches and facilitating resolution through conventional CNNs. The utilization of CNNs typically involves employing established models such as the visual geometry group (VGG) network, the inception network (GoogLeNet), and the residual network (ResNet) to derive a view representation of 3D models. Methods such as MVCNN and the group-view convolutional neural network leverage pretrained network weights to obtain view descriptors from multiple perspectives. However, these approaches often neglect the complementary information between views, an aspect crucial for shaping the final descriptor. As shape descriptors are integral to the recognition task, acquiring 3D model shape descriptors through view descriptors remains a fundamental challenge for achieving optimal 3D model classification. Recent studies, including MVCNN and the dynamic routing convolutional neural network (DRCNN), employ a view pooling scheme to generate discriminative descriptors from feature representations of multiple views, marking crucial milestones in 3D model classification with notable performance improvements. Notably, existing methods inadequately exploit the view characteristics among multiple perspectives, severely limiting the efficacy of shape descriptors. On the one hand, the inherent differences in the two-dimensional views projected from three-dimensional objects constitute complementary information that enhances the generation of the final shape descriptor. On the other hand, each 2D view can, to some extent, represent its corresponding 3D object, indicating consistent features between views. Integrating these consistent features enhances the accuracy of recognition tasks. Consequently, learning complementary information between views and integrating it with consistent information emerges as a critical aspect for advancing 3D model classification.MethodAddressing this challenge, our paper introduces a network model that integrates complementary and consistent information derived from multiple views, thereby enhancing the overall comprehensiveness of the information. Specifically, the model aims to fuse association information between views for 3D object classification. Initially, an enhanced residual network is employed to extract feature representations from multiple views, resulting in view descriptors. Subsequently, a pre-classification network, combined with an attention mechanism and a weight learning strategy, is utilized to merge these view descriptors and generate shape descriptors. 
Multiscale dilated convolution is introduced following ordinary convolution within the residual module during one-way network propagation in a single view to improve the residual structure of the ResNet model. This improvement extends the receptive field of the convolution operation, facilitating the extraction of complementary information. Additionally, a pre-classification module is proposed to assess the recognition degree of each view based on shape characteristics. Using this information, the views are categorized into complementary and consistent groups. A subset of both types of views is combined into feature views, each possessing two key characteristics. These feature views are then input into an attention network to strengthen consistency and complementarity. Subsequently, a learnable weight fusion module is applied to the two feature views, weighting and merging them to generate shape descriptors. Finally, the overall network structure is refined and the pre-classification and attention layers are strategically positioned to ensure optimal performance for the proposed methodology.ResultThis study conducted a series of experiments to validate the effectiveness of the proposed model on the ModelNet10 and ModelNet40 datasets. Initially, an ablation experiment was performed to compare the positions of module insertion, feature extraction networks, and the number of views. Experimental results indicate that tightly coupling the pre-classification module with the attention module and placing them in the second or third layer of the residual network yields superior final classification accuracy compared to inserting them in other layers. The method introduced in this study demonstrates higher average single-class classification accuracy than models such as ResNet50 with an equivalent number of training iterations. Additionally, this method exhibits lower average losses and more robust convergence of losses in terms of loss reduction. The performance of the model is further evaluated across varying numbers of views. Regardless of the number of views, the method consistently outperforms or matches MVCNN and DRCNN. As the number of views increases from 3 to 6 to 12, the proposed model and the DRCNN model exhibit a continuous improvement in accuracy. Finally, when compared to 15 classic multiview 3D object classification algorithms, the proposed model achieved an average instance accuracy of 97.27% on ModelNet10 using the designed feature extraction network with the same 12 views. However, this model falls slightly behind DRCNN, possibly due to limited data in the ModelNet10 dataset, which may lead to overfitting and reduced classification accuracy. In ModelNet40, the model achieved an average instance accuracy of 96.32%. Comparisons with ResNet50 and VGG-M also highlight the advantages of the model in classification accuracy, confirming its capability to more effectively extract information between views on the backbone network than other methods.ConclusionThis study presents a robust deep 3D object classification method that leverages multiview information fusion to integrate consistency and complementarity information among views through adjustments in the network structure. The experimental findings substantiate the favorable performance of the proposed model.  
      Keywords:multi-view learning;3D model classification;consistency and complementarity;improved residual network;view fusion
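The abstract's modification of the residual structure, adding multiscale dilated convolution after the ordinary convolution to widen the receptive field, can be sketched as below. The dilation rates and fusion layout are assumptions for illustration rather than the authors' network.

```python
# A residual block augmented with parallel multiscale dilated convolutions
# (illustrative assumption; not the paper's improved ResNet block).
import torch
import torch.nn as nn


class DilatedResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        # Parallel dilated convolutions with different rates enlarge the receptive field.
        self.dil1 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        self.dil2 = nn.Conv2d(channels, channels, 3, padding=4, dilation=4)
        self.fuse = nn.Conv2d(channels * 2, channels, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.act(self.conv(x))
        y = self.fuse(torch.cat([self.dil1(y), self.dil2(y)], dim=1))
        return self.act(x + y)                            # residual connection


if __name__ == "__main__":
    print(DilatedResidualBlock(32)(torch.randn(2, 32, 56, 56)).shape)
```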

      Image Understanding and Computer Vision

    • In computer vision, researchers propose a new monocular depth estimation model that improves depth estimation accuracy through a multilevel perceptual conditional random field and a hybrid pyramid feature fusion strategy.
      Jia Di, Song Huilun, Zhao Chen, Xu Chi
      Vol. 30, Issue 3, Pages: 824-841(2025) DOI: 10.11834/jig.240260
      Multilevel perceptual conditional random field model for monocular depth estimation
      摘要:ObjectivePredicting scene depth from a single RGB photograph is a complex and challenging issue. Accurate depth estimates are essential in various computer vision applications, including 3D reconstruction, autonomous driving, and robotics navigation. Accurately determining depth information from a two-dimensional image is a difficult task due to the ambiguity and absence of clear depth indicators. Modern approaches to this issue involve creating intricate neural networks that attempt to estimate depth maps in a direct and approximate way. These networks frequently utilize deep learning methods and large quantities of labeled data to understand the complex relationships between RGB pixels and their associated depth values. Although these methods have demonstrated promising outcomes, they frequently encounter issues such as computational inefficiency, overfitting, and poor generalization skills. This research introduces a multilevel perceptual conditional random field model that relies solely on the Swin Transformer.MethodFirst, an adaptive hybrid pyramid feature fusion approach is a fundamental component of the entire architecture. This technique is precisely crafted to encompass various existing dependencies across multiple spatial positions, including short-distance and long-distance linkages. The proposed technique also efficiently gathers overall and specific contextual information by smoothly combining feature fusion techniques that include various kernel shapes, offering a thorough comprehension of the data. This consolidation not only guarantees the smooth transmission of information within the model but also considerably boosts the distinguishing capability of the feature representations. Therefore, the model becomes better at recognizing and understanding complex patterns and structures in the data, resulting in enhanced performance and accuracy. Second, the decoder includes dynamic scaling attention, a clever approach that markedly enhances the capacity of the model to capture complex dependency relationships among various regions in the input image. The attention mechanism enables the model to selectively concentrate on the most pertinent areas while disregarding irrelevant or noisy data. This mechanism optimizes the efficiency and robustness of the model against different sorts of distortions and noise. A distinct update initialization mechanism is employed to identify and adjust the most appropriate parameters for the task. This method successfully avoids problems related to linear projection limitations and severe network behaviors, guaranteeing a seamless and steady learning experience. Finally, a hierarchical perception adaptor is presented to handle the intricate interplay among several feature modalities. This adaptor acts as an intermediary between several feature representations, increasing the feature mapping dimension and enabling improved interaction between channels. The feature learning capability of the model is notably enhanced by encouraging this interaction, allowing its effective management of increasingly intricate jobs. Particularly, situations when various sources of information must be combined and understood efficiently involve tasks such as picture identification, object detection, or semantic segmentation.ResultMeticulous comparative tests and ablation studies were conducted on the NYU Depth v2 dataset to thoroughly assess the performance of the proposed network. 
The trial findings demonstrated a notable enhancement in all performance metrics, conclusively confirming the superiority of the proposed approach. The solution outperformed the previous advanced methods by a substantial margin of 6.8% on the absolute relative error (Abs Rel) indicator, achieving a low error rate of 0.088. This enhancement demonstrates the precision and accuracy of the network in calculating depth from a solitary RGB image. The technique outperformed others with a root mean squared error (RMSE) score of 0.031 6, representing a 13% performance improvement. This notable enhancement demonstrates the capability of the model to manage intricate scenarios and generate precise depth maps. Furthermore, the proposed approach resulted in a 5% improvement in the root mean square logarithmic error (log RMSE) and square relative error (Sq Rel) measures. This finding highlights the strength of the model in handling pixels with extreme depth values, guaranteeing accurate depth predictions in several situations. The proposed technique showed substantial enhancements in terms of accuracy. The δ<1.25 indicator, which quantifies the proportion of estimated depths that are within a specified range of the actual depth, increased by 10%. The more rigorous criteria of δ<1.253 scored remarkably high at 99.8%, nearing the optimal level. The results provide additional proof of the high accuracy and dependability of the model in depth estimation tasks. Pre-training was conducted on the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago(KITTI) dataset with 50 rounds to confirm the effectiveness of the model in real-world situations. The KITTI dataset, known for its lifelike urban environments, served as a difficult platform for assessing the generalization capabilities of the model. The methodology revealed substantial enhancements in all assessment criteria on the KITTI dataset when compared to existing advanced depth estimation methods. The RMSE demonstrated a performance improvement of 53%, showcasing the superior depth estimation capabilities of the model. The threshold indications (δ<1.252 and δ<1.253) showed approximately 100% accuracy, demonstrating the strength and flexibility of the model in handling the complexity and variations found in street situations. In addition, the strong generalization of the model was verified on the MatterPort3D dataset, and substantial improvements were achieved in all indicators.ConclusionA multilevel feature extractor that substantially improves the Swin Transformer design is presented in the paper. This innovative method seeks to enable highly accurate and seamless information transfer by reducing the semantic gap between the encoder and decoder. A mixed pyramid feature fusion methodology, which is essential to its design, allows for the extraction and integration of numerous features at various scales, efficiently capturing contextual information at the local and global levels. This technique ensures that the decoder obtains relevant and in-depth feature representations while simultaneously improving network flow by bridging the semantic gaps between the encoder and decoder. This phenomenon raises the quality of the output while also increasing the overall efficiency of the network. Moreover, fully connected decoding, a method that notably improves the precision of monocular depth estimate, is included in the suggested strategy. 
The model produces highly accurate depth maps by utilizing this decoding technique, which is a considerable advancement over conventional techniques.  
      Keywords:monocular depth estimation;conditional random field;hybrid pyramid feature fusion(HPF);dynamic scaling attention;hierarchical awareness adapter(HA)
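For readers following the quantitative results in this abstract, the standard monocular depth metrics (Abs Rel, Sq Rel, RMSE, log RMSE, and the δ-threshold accuracies) can be computed as in the NumPy sketch below; the evaluation protocol here (valid-pixel masking, no scale alignment) is a generic assumption, not the paper's exact setup.

```python
# Standard monocular depth estimation metrics (generic evaluation sketch).
import numpy as np


def depth_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6):
    pred, gt = pred.astype(np.float64), gt.astype(np.float64)
    valid = gt > eps                       # ignore pixels without ground truth
    pred, gt = pred[valid], gt[valid]

    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    log_rmse = np.sqrt(np.mean((np.log(pred + eps) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = {f"delta<1.25^{k}": np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)}
    return {"abs_rel": abs_rel, "sq_rel": sq_rel,
            "rmse": rmse, "log_rmse": log_rmse, **deltas}


if __name__ == "__main__":
    gt = np.random.uniform(0.5, 10.0, size=(240, 320))
    pred = gt * np.random.uniform(0.9, 1.1, size=gt.shape)   # simulated prediction
    print(depth_metrics(pred, gt))
```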
    • In temporal action localization, researchers propose a complementary multi-type prompt model that exploits the complementarity of textual prompt information to improve localization accuracy.
      Ren Xiaolong, Zhang Feifei, Zhou Wanting, Zhou Ling
      Vol. 30, Issue 3, Pages: 842-854(2025) DOI: 10.11834/jig.240354
      Complementary multi-type prompts for weakly-supervised temporal action location
      摘要:ObjectiveWeakly supervised temporal action localization uses only video-level annotations to locate the start and end times of action instances and identify their categories. Only video-level annotations are available in weakly supervised environments; thus, directly designing a loss function for the task is impossible. Therefore, the existing work generally adopts the strategy of “localization by classification” and utilizes multi-example learning for training. However, this process has some limitations: 1) Localization and classification are two different tasks, revealing a notable gap between them; therefore, localization based on classification results may affect the final performance. 2) In weakly-supervised environments, fine-grained supervisory information to effectively distinguish between actions and backgrounds in videos is lacking, thereby posing a remarkable challenge for localization. Visual language models have recently received extensive attention. These models aim to model the correspondence between images and texts for more comprehensive visual perception. Specific textual prompts can improve the performance and robustness of the models to effectively apply large models to downstream tasks. Visual language-based approaches currently utilize auxiliary textual prompt information to compensate for supervisory information and improve the performance and robustness of temporal action localization models. In visual language models, action label text is regularly encapsulated as textual prompts, which can be categorized into Handcrafted Prompts and Learnable Prompts. Handcrafted Prompts comprise fixed templates and action labels (e.g., “a video of {class})”, which can learn a more generalized knowledge of the action class but lacks the specific knowledge of the relevant action. In contrast, Learnable Prompts comprise a set of learnable vectors, which can be adjusted and optimized during the training process. Therefore, the learnable type cues can learn more specific knowledge. The two types of text cues complement each other, improving the capability to discriminate different categories. However, the existing methods ignore the complementarity between the two, hindering the introduced text cues to maximize its guiding role and bringing certain noise information, which leads to inaccurate localization of action boundaries. Therefore, this paper proposes a multitype prompt complementary model for weakly-supervised temporal action location, which maximizes the guiding effect of textual cues to improve the accuracy of action localization. The methodology of this paper focuses on improving the accuracy of action location by maximizing the guidance of textual cues.MethodFirst, this paper designs a prompt interaction module from the complementarity of textual prompts, which matches manual and learnable type prompts with action segments to obtain different similarity matrices. Then, the intrinsic connection between textual and segment features is mined through the attention layer and fused to realize the interaction between different types of textual prompts. Additionally, text cues must be effectively matched with action segments to play their guiding role. Therefore, this paper introduces the segment-level contrast loss, which is used to constrain the matching between cue messages and action segments. 
Finally, this paper designs a threshold filtering module to filter the class activation sequence (CAS) obtained from the guidance of different types of textual cue messages according to the threshold value. The final CAS is obtained by summing the CAS obtained after multilayer filtering according to a specific scale parameter, with the CAS obtained using only video-level features, covering the parts of each sequence with higher confidence scores, thus realizing the complementary advantages between different types of text cue information.ResultExtensive experiments on three representative datasets, THUMOS14, ActivityNet1.2, and ActivityNet1.3, validate the effectiveness of the method proposed in this paper. Among them, different mAP(0.1∶0.5), mAP(0.1∶0.7), and mAP(0.3∶0.7) on THUMOS14 datasets achieved 58.2%, 39.1%, and 47.9% average performance, respectively, which is comparable to P-MIL (proposal-based multiple instance learning) average performance by up to 1.1%. In the ActivityNet1.2 datasets, the method proposed in this paper achieves a performance of 27.3% at mAP (0.5∶0.95), which is an average improvement of nearly 1% compared to the state-of-the-art method. The mAP (0.5∶0.95) of ActivityNet1.3 datasets achieves comparable performance to the same work, with an average mAP of 26.7%. In addition, numerous ablation experiments were conducted on the THUMOS14 datasets, and the experimental results proved the effectiveness of the modules.ConclusionThis paper proposes a new weakly supervised temporal action localization model based on the complementarity of multiple types of cues, which alleviates the problem of inaccurate localization boundaries by leveraging the complementarity between different types of textual cues. The cue interaction module is designed in a targeted way to realize the interaction of different types of textual cue information. In addition, this paper introduces clip-level contrast loss to model the correspondence between text cue information and video, effectively constraining the matching between the introduced text cue information and action clips. Finally, this paper designs a multilayer nested threshold screening module, which can maximize the advantages of two different types of textual cue information, avoid the interference of noisy information on the model, and maximize the auxiliary role of textual information. Extensive experiments on two challenging datasets demonstrate the effectiveness of the method proposed in this paper.  
      Keywords:weakly supervised temporal action location(WTAL);visual language model;handcrafted prompts;learnable prompts;class activation sequence(CAS)
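The threshold filtering and fusion step described in this abstract, where prompt-guided class activation sequences are filtered by confidence and then combined with the video-level CAS, is sketched below in NumPy. The thresholds and scale parameters are placeholder assumptions; the paper's multilayer nested filtering is simplified to a single pass.

```python
# Toy sketch of threshold filtering and weighted fusion of class activation
# sequences (CAS) from different prompt types (assumed weights/thresholds).
import numpy as np


def filter_cas(cas: np.ndarray, threshold: float) -> np.ndarray:
    """Keep only snippet activations above the confidence threshold."""
    return np.where(cas >= threshold, cas, 0.0)


def fuse_cas(cas_handcrafted, cas_learnable, cas_video, thr=0.5, alpha=0.3, beta=0.3):
    """Weighted sum of the filtered prompt-guided CAS with the video-level CAS."""
    return (alpha * filter_cas(cas_handcrafted, thr)
            + beta * filter_cas(cas_learnable, thr)
            + (1.0 - alpha - beta) * cas_video)


if __name__ == "__main__":
    T, C = 100, 20                        # temporal snippets, action classes
    h = np.random.rand(T, C)              # CAS guided by handcrafted prompts
    l = np.random.rand(T, C)              # CAS guided by learnable prompts
    v = np.random.rand(T, C)              # CAS from video-level features only
    print(fuse_cas(h, l, v).shape)        # (100, 20)
```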

      Medical Image Processing

    • In medical image processing, researchers propose a retinal vessel image segmentation network that improves vessel segmentation accuracy through scale feature representation learning, assisting the diagnosis of retinal vascular diseases.
      Yang Kexin, Liu Li, Fu Xiaodong, Liu Lijun, Peng Wei
      Vol. 30, Issue 3, Pages: 855-869(2025) DOI: 10.11834/jig.240120
      Scale feature representation learning network for retinal vessels image segmentation
      摘要:ObjectiveRetinal vessel image segmentation refers to the process of separating vessel pixels in a color fundus image from the background pixels. The morphology of retinal vessels is closely associated with various ophthalmic diseases and plays a crucial role in computer-aided diagnosis and smart medicine. Additionally, retinal vessel images provide important biological information that can be used as a basis for personal identification systems in the field of social security. Furthermore, segmented retinal vessel images can serve as a priori for other anatomical sites, such as the macula. Currently, retinal image segmentation methods can be categorized into traditional and deep learning methods. Existing methods for retinal vessel image segmentation demonstrate good performance in segmenting large-scale vessels, primarily due to the ease of capturing features related to these prominent structures. Particularly, U-Net can effectively handle the complicated anatomical semantics involved in retinal vessel segmentation tasks, fusing adjacent-level features to learn additional local and global semantic information for highly accurate segmentation. Although remarkable progress has been made in retinal vessel segmentation with the advancement of deep learning, several challenging issues remain. First, current methods do not adequately represent vessels feature at multiple scales, resulting in poor segmentation results for retinal vessels with large differences in size and shape. Second, thin vessels, particularly those located at the ends of extremely low-contrast branches, are easily missed by current methods, resulting in incomplete vessel segmentation. Additionally, the medical semantics surrounding retinal vessels are complex. Specific regions, such as the optic cup, optic disc, and lesions, can interfere with vessel segmentation and seriously affect the accuracy of retinal vessel segmentation. Moreover, most images in the STARE dataset have severe lesions, and the information in different datasets notably varies, resulting in lower sensitivity of vessel segmentation results. To address these issues, a scale feature representation learning network for retinal vessel image segmentation is proposed by introducing the following three modules: scale feature representation, texture feature enhancement, and double contrastive learning.MethodIn this study, the images are first inputted into the retinal image set, and the initial layer of retinal vessel features is extracted using a U-Net-based backbone network. Hierarchical representation and stepwise fusion strategies are employed to fully capture the scale features of the vessels. This strategy is realized by introducing average pool operations and a spatial self-attention mechanism, which enriches multiscale information and generate a vessel scale feature representation. Then, based on the vessel scale feature, the last four layers of scale-encoded features are obtained through downsampling. During the skip connection process, the scale-encoded features are combined with deeper features to create intermediate features using three types of convolutions. These intermediate features are further enhanced using contextual information-guided texture filters, resulting in enhanced texture features that effectively focus on the edges of thin vessels by supplementing texture information. 
Finally, the vessel scale and texture-enhanced features are sampled to obtain vessel pixels, background pixels near vessels, and other background pixels in the two feature space domains. These samples are used as inputs for double constraint learning to calculate the double constraint loss. Double constraint learning helps reduce intra-class distance, increase inter-class variance, and substantially improve the segmentation of thin vessels and vessels in specific regions, such as the optic cup, optic disc, or lesions regions.ResultThe illustrated method is validated on three challenging retinal vessel image datasets: digital retinal images for vessel extraction(DRIVE), structured analysis of the retina(STARE), and child heart and health study in England(CHASE_DB1). The accuracy(Acc), sensitivity(Se), specificity(Sp) on the STARE and CHASE_DB1 datasets are (0.976 5, 0.841 5 and 0.987 4) and (0.978 4, 0.886 4 and 0.992 3), respectively. These results indicate that the proposed method outperforms most competing methods and remarkably improves performance in extracting thin vessels in regions with lesions or near the optic disc. Compared with other methods, the Acc of the proposed method in the DRIVE dataset is increased by 0.67%, while the Sp is improved by an average of 0.48%. In the STARE dataset, the Se value is increased by 6.01% and the Sp value is increased by 6.86% on average. In the CHASE_DB1 dataset, the Se value is increased by 1.88%, and the F1 score (F1) value is improved by 1.98% compared to other methods. The advantages of the method are visually analyzed by demonstrating the vessel details, demonstrating its capability to achieve better results for thin vessels with fewer vessel breaks. Additionally, improved results are obtained in dark and unevenly illuminated images. Some existing methods struggle with inaccurate segmentation results for crossed vessels at the optic disc and often fail to segment thin vessels around this area. In contrast, the segmentation results of the proposed method for crossed vessels at the optic disc do not show the phenomenon of vessel rupture. Notably improvements are observed in the segmentation of vessels and thin vessels in regions such as the optic disc, lesions, and other regions. The results indicate that the constructed retinal vessel image segmentation network effectively represents the scale features of vessels and enhances their texture, enabling accurate segmentation of complete retinal capillaries, especially in challenging regions. Finally, the findings from ablation studies and analyses demonstrate that the combination of the three modules is essential for simultaneously addressing the variable scale and anatomical semantic variations of retinal vessels.ConclusionThis paper presents a retinal vessel image segmentation network that accurately segments multiscale vessels, thin vessels, and vessels in specialized regions by effectively representing scale features and enhancing texture features, thereby assisting in the diagnosis of vascular diseases.  
      Keywords:retinal vessel image segmentation;scale feature representation;texture feature enhancement;texture filters;double constraint learning
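The evaluation in this abstract reports pixel-wise accuracy, sensitivity, specificity, and F1 on binary vessel masks; these follow directly from the confusion matrix, as in the NumPy sketch below (a generic metric implementation, not the paper's evaluation code).

```python
# Pixel-wise segmentation metrics for binary vessel masks (generic sketch).
import numpy as np


def vessel_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    acc = (tp + tn) / (tp + tn + fp + fn + eps)
    se = tp / (tp + fn + eps)                     # sensitivity / recall
    sp = tn / (tn + fp + eps)                     # specificity
    precision = tp / (tp + fp + eps)
    f1 = 2 * precision * se / (precision + se + eps)
    return {"Acc": acc, "Se": se, "Sp": sp, "F1": f1}


if __name__ == "__main__":
    gt = np.random.rand(512, 512) > 0.9             # sparse vessel mask
    pred = gt ^ (np.random.rand(512, 512) > 0.98)   # prediction with some noise
    print(vessel_metrics(pred, gt))
```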

      Remote Sensing Image Processing

    • In intelligent remote sensing interpretation, a lightweight change detection network, FIFLNet, is designed that significantly improves change detection performance, alleviating missed small targets and boundary false detections and providing a solution for high-quality change detection results.
      Wang Renfang, Yang Zijian, Qiu Hong, Wang Feng, Gao Guang, Wu Dun
      Vol. 30, Issue 3, Pages: 870-882(2025) DOI: 10.11834/jig.240280
      Lightweight change detection network integrating feature interaction and fusion
      摘要:ObjectiveChange detection in remote sensing imagery is a process that leverages remote sensing technology to compare and analyze images from the same geographical area but captured at different time intervals. This process mainly aims to identify changes on the Earth’s surface. The main challenge in this process lies in the extraction of effective change features from a large volume of image data and subsequently mapping them onto pixel-level change labels for high-precision detection. Detection methods for changes in remote sensing imagery can be broadly divided into traditional and deep learning-based methods. Traditional methods primarily rely on image processing and pattern recognition techniques. However, these methods often require manual selection of suitable features and thresholds, which can introduce a degree of subjectivity and limitations. By contrast, deep learning methods can automatically learn abstract and high-level change features from remote sensing images, thereby enabling end-to-end change detection. This approach notably enhances the accuracy and efficiency of change detection. Among them, change detection models based on convolutional neural networks (CNNs) and Transformer architectures have shown remarkable performance. Models that utilize these mechanisms have demonstrated notable advancements in recent years due to the extensive research conducted by scholars worldwide. However, for the currently effective models based on the Transformer architecture, the complexity of the model also increases as the detection accuracy of the model improves. Therefore, designing a change detection method with a small parameter size, low computational cost, and high detection accuracy is a pressing issue that must be urgently addressed in this field.MethodThis paper proposes a lightweight change detection method for remote sensing images based on feature interaction and fusion. The main idea of this method is to use EfficientNet B7 as a lightweight backbone network to extract deep- and low-level features from bi-temporal remote sensing images. Channel and pixel swap modules are introduced to enable the interaction and combination of bi-temporal features, enhancing the spatiotemporal feature fusion. Low-level skip-connections are employed to transfer the original image information to the upsampling phase, aiming to preserve additional edge and texture details and reduce artifact generation. A feature fusion group convolution module that reduces the computational overhead and the number of parameters is designed to effectively fuse the deep- and low-level features obtained in the downsampling stage. Finally, the feature fusion group convolution and upsampling modules are used to fuse and recover the deep- and low-level features, and the pixel-level change detection map is generated.ResultIn this paper, experiments on two datasets are conducted for remote sensing image change detection landearth view image retrieval building change detection dataset(LEVIR-CD) and Sun Yat-sen University change detection dataset(SYSU-CD). Each dataset is split into 7∶1∶2 for training, validation, and testing, respectively, and each image is segmented into 256 × 256 pixels. This approach facilitated the processing and increased the generalization capability of the model. The binary cross-entropy (BCELoss) is used as the loss function, and the performance of the proposed method is evaluated using three metrics: F1 score (F1), intersection over union (IoU), and overall accuracy (OA). 
The proposed method achieves F1 scores of 91.51% and 82.19%, IoU of 84.35% and 69.76%, and OA of 99.14% and 91.99% on LEVIR-CD and SYSU-CD, respectively. Compared with previous classical methods, the model obtains the best change detection results and, in particular, preserves more detail along change boundaries. Ablation experiments on the same datasets illustrate the effect of the low-level skip connections and the channel and spatial exchange modules. The results show that the channel and spatial exchange modules substantially improve the utilization and representation of spatiotemporal features in the network, while the low-level skip connections compensate for the detail features lost during downsampling and further enhance the feature learning capability of the network. Conclusion: The network uses channel and spatial exchange modules to increase the utilization and understanding of spatiotemporal features, and low-level skip connections to focus the model on local detail features. A binary cross-entropy loss at the output layer is used to achieve the best change detection performance. Experiments show that the proposed method improves the capability to recognize changed regions while keeping the network lightweight, enhancing change detection performance across various environments and terrains.
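The channel and pixel (spatial) exchange modules summarized above let the two temporal branches interleave information before fusion. The abstract gives no implementation details, so the following PyTorch snippet is only a minimal sketch of one common way a channel exchange between bi-temporal features can be realized; the function name, the alternating-channel rule, and the tensor shapes are illustrative assumptions rather than the authors' code.

import torch

def channel_exchange(feat_t1: torch.Tensor, feat_t2: torch.Tensor, period: int = 2):
    # Swap every `period`-th channel between the two temporal feature maps.
    # feat_t1, feat_t2: (B, C, H, W) features extracted by a shared backbone.
    assert feat_t1.shape == feat_t2.shape
    c = feat_t1.shape[1]
    swap = (torch.arange(c, device=feat_t1.device) % period) == 0  # channels to exchange
    mask = swap[None, :, None, None]                               # broadcast to (1, C, 1, 1)
    out_t1 = torch.where(mask, feat_t2, feat_t1)
    out_t2 = torch.where(mask, feat_t1, feat_t2)
    return out_t1, out_t2

# Example: exchange half of the channels of two hypothetical 64-channel feature maps.
a = torch.randn(1, 64, 32, 32)
b = torch.randn(1, 64, 32, 32)
a_ex, b_ex = channel_exchange(a, b)

A spatial (pixel) exchange can be sketched analogously by building the mask over spatial positions instead of channels; either way, the exchanged features are then passed on to the fusion and upsampling stages.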
关键词:remote sensing image;change detection;local feature;feature interaction;lightweight network
• In the field of remote sensing applications, researchers propose a multi-source remote sensing image classification method based on a gated attention aggregation network, effectively fusing hyperspectral images with LiDAR/SAR data and significantly improving classification performance.
      Jin Xuepeng, Gao Feng, Shi Xiaochen, Dong Junyu
      Vol. 30, Issue 3, Pages: 883-894(2025) DOI: 10.11834/jig.240359
      Gated cross-modal aggregation network for multi-source remote sensing data classification
摘要:Objective: In recent years, multisource remote sensing data fusion has become a research hotspot in remote sensing applications, driven by the need to overcome the technical limitations of single sensors and the constraints of relying on a single data source. Traditional remote sensing methods that depend on a single sensor type face considerable difficulty in providing comprehensive and accurate information. For instance, hyperspectral sensors capture detailed spectral information but may lack spatial resolution, while LiDAR (light detection and ranging) and SAR (synthetic aperture radar) sensors excel at capturing structural information but provide insufficient spectral detail. Integrating hyperspectral images with LiDAR/SAR data therefore holds remarkable promise for enhancing remote sensing applications. However, current fusion classification methods do not fully exploit the rich spectral features of hyperspectral images and the structural information of ground objects provided by LiDAR/SAR data. The two data types have fundamentally different characteristics, which makes effective feature correlation difficult: hyperspectral images contain abundant spectral information that can distinguish materials, LiDAR provides 3D structural information, and SAR offers high-resolution imaging under various weather conditions. These differences create substantial challenges in correlating multisource image features. Although some deep learning-based methods have shown promising results on hyperspectral and LiDAR/SAR fusion classification tasks, they often fail to fully exploit the texture and geometric information embedded in the multisource data during fusion; they may perform well in specific scenarios but lack the robustness and versatility required for broader applications. Consequently, more sophisticated approaches that can leverage the complementary strengths of different data sources are urgently needed to improve classification accuracy and reliability. Method: This paper proposes a multisource remote sensing image classification method based on a gated cross-modal aggregation network (GCA-Net) to address this issue. The approach aims to comprehensively exploit the complementary information in multisource data, and its core innovation lies in effectively integrating the advantages of different sensor data types through a series of neural network modules. First, a gated cross-modal aggregation module is designed to integrate the detailed structural information from LiDAR/SAR data with the spectral features of hyperspectral images. This module uses cross-attention mechanisms to enhance feature fusion, ensuring that the distinctive information from each data source is effectively utilized; the cross-attention mechanism allows the model to focus on the most relevant features from each modality and enhances the representation of the fused data. Second, a refined gating module integrates valuable LiDAR/SAR features into the hyperspectral image features. This gating mechanism selectively incorporates the most informative features, thereby enhancing the fusion of multisource data.
The gating module acts as a filter that prioritizes the features contributing most to the classification task, ensuring that the fused data retains the critical information from both sources. This design not only improves classification accuracy but also enhances the robustness of the model across different datasets and scenarios. Result: The proposed method was evaluated on two widely used datasets, Houston2013 and Augsburg, which cover a diverse range of scenes and conditions and are therefore well suited to assessing the effectiveness of the proposed method. The performance of GCA-Net was compared with seven mainstream methods. Experimental results show that the proposed method achieves the best performance in terms of overall accuracy (OA), average accuracy (AA), and Kappa coefficient (Kappa), outperforming the other methods by a substantial margin and highlighting its superior capability to fuse and classify multisource remote sensing data. In particular, on the Augsburg dataset, which includes a variety of urban and rural scenes with diverse textures and structures, the proposed method achieves the best indicators in most categories, demonstrating its robustness and versatility. The classification visualization results further show a clear performance advantage: GCA-Net produces more precise and consistent classification maps than the other methods, underscoring its potential for real-world remote sensing applications. Conclusion: The experimental results on the Houston2013 and Augsburg datasets support the excellent performance of the proposed GCA-Net, which clearly surpasses current mainstream methods such as HCT (hierarchical CNN and Transformer) and MACN (mixing self-attention and convolutional network). A key factor in this success is the network's capability to fully integrate information from different modalities according to their unique characteristics, allowing the model to leverage the strengths of each data source for accurate and reliable classification. The superior performance can be attributed to the use of gated attention mechanisms and cross-modal feature fusion, which enable the model to selectively focus on the most relevant features from each modality and improve the quality of the fused data. By effectively combining the spectral richness of hyperspectral images with the structural details of LiDAR/SAR data, GCA-Net sets a new benchmark for multisource remote sensing data fusion and provides a solid theoretical foundation for the fusion classification of multisource remote sensing data. Its capability to address the limitations of single-sensor approaches and exploit the complementary strengths of different data types makes it a valuable tool for advancing remote sensing applications.
The proposed method not only improves classification accuracy but also enhances the practical applicability of remote sensing technologies, paving the way for more effective solutions in domains such as environmental monitoring, urban planning, and disaster management.
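The refined gating module described in the Method section selectively injects LiDAR/SAR features into the hyperspectral features. Since the abstract does not specify the implementation, the PyTorch block below is only a hedged sketch of a generic sigmoid-gated cross-modal fusion step; the class name, layer choices, and channel sizes are illustrative assumptions, not the GCA-Net design.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # A sigmoid gate decides, per channel and per position, how much of the
    # auxiliary (LiDAR/SAR) feature is injected into the hyperspectral feature.
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, hsi_feat: torch.Tensor, aux_feat: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([hsi_feat, aux_feat], dim=1))  # gate values in (0, 1)
        fused = hsi_feat + g * aux_feat                        # selective injection
        return self.fuse(fused)

# Example with hypothetical 64-channel feature maps.
block = GatedFusion(64)
hsi = torch.randn(2, 64, 16, 16)
sar = torch.randn(2, 64, 16, 16)
out = block(hsi, sar)  # (2, 64, 16, 16)

In the actual network, such a gated output would feed the subsequent cross-attention aggregation and the classifier; the sketch only illustrates the filtering idea of the gating step.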
关键词:hyperspectral image (HSI);light detection and ranging (LiDAR);synthetic aperture radar (SAR);backscattering;multi-source feature fusion