Latest Issue

    Vol. 29, No. 1, 2024

      Review

    • Ren Yangfu, Li Zhiqiang, Zhang Songhai
      Vol. 29, Issue 1, Pages: 1-21(2024) DOI: 10.11834/jig.221147
      Survey of visualization methods for multiscene visual cue information in immersive environments
      Abstract: Immersive environments present users with a near-real experience through technologies such as virtual reality (VR). Virtual reality is a computer-generated simulation of the natural world that offers users rich immersion, interactivity, and imagination. In VR, the eyes are both the window onto the scene and a meaningful channel for interacting with the virtual world, and visual means are a convenient way for users to obtain information about their location; interaction with the scene further enhances users' perception. Augmented reality (AR) places virtual information in a real setting, and users can interact with that information at its actual location. To survey how visual cue information is applied in different immersive scenarios such as VR, and considering the relationships among visual information cue methods, this paper distinguishes information by display position, functional use, and practical application field, and it summarizes and discusses recent studies on visualization methods for visual cues in immersive multiscene environments. First, prompt information outside the field of view is analyzed. Because of hardware limitations, the range of scene content that users can view differs across devices. Methods for visual prompt information outside the field of view can be divided into overview + detail and focus + context approaches and are discussed separately for 2D and 3D scenarios. Within the field of view, the focus is on label placement around the model, which differs slightly from in-scene label layout and from out-of-view information. When information within the field of view is displayed, more attention is paid to whether the displayed content is occluded or overlapped, with the goal of keeping users comfortable while viewing it; at the same time, different requirements apply to the layout and distribution of label information. The primary purpose of research on in-scene label layout is to reduce label overlap and clutter and thereby lower users' cognitive load during observation. From the perspective of information management, placing in-scene information reasonably and avoiding occlusion can improve both the user's scene awareness and the user's experience in the scene. Considering the functional use of visual prompt information, quick visual cues added in VR help users learn their position or the scene information as soon as possible while roaming the scene. Just as in reality, when users enter an unfamiliar scene, they can determine their location through map navigation or eye-catching signs. Therefore, enhancing user perception through visual prompts improves users' interactive experience in the scene, helping them grasp scene information quickly and find points of interest as soon as possible. Regarding practical applications, the problems encountered in panoramic videos and their solutions are discussed, along with problems in industrial and everyday scenarios, such as applying the currently popular video barrage to panoramic videos; panoramic videos are viewed in immersive scenes and applied in entertainment and industrial production. However, judging from the current state of the technology, the visualization of multiscene visual cue information in virtual reality environments still faces challenges in the following aspects. First, in 2D scenes, methods need to improve the user experience on display devices of different sizes, raise the efficiency with which users acquire and understand information, reduce users' cognitive load, and further explore prompt methods that adapt to screens of multiple sizes. Second, out-of-view information prompting in different 3D scenes should preserve immersion and provide more convenient interaction while giving users comprehensive information prompts. Third, research on label placement and layout in the scene, from the standpoint of user experience, mainly covers three aspects: how users view and interact with labels, the location and display time of labels, and the movement of labels with the scene; users also care about whether the interaction is convenient, whether the viewing method is comfortable, and whether the effect on the scene content is small enough. Fourth, map navigation and attention guidance in the scene are critical for users to obtain information; therefore, improving the interactive experience of map navigation, the acquisition of important information, and the accuracy of attention guidance, while reducing the effect of prompt information on immersion, are of great research importance. Fifth, watching videos in virtual reality can provide users with a good viewing experience, so how to further enhance immersion while viewing, better guide the storyline, and reduce dizziness are all issues worth discussing. Sixth, in scenarios such as industrial production, education, and entertainment, how visual prompt information can interact well with users and help them complete a series of tasks is important. By discussing practical applications such as out-of-view cues, in-scene label layout and attention guidance in 2D and 3D scenes, and panoramic video viewing, this paper presents in detail the effects of visual cue information in immersive multiscene environments, together with research prospects and development directions.
      Keywords: immersive environment; virtual reality (VR); augmented reality (AR); multiscene; visual prompt information; panoramic video; guidance of attention
      Published: 2024-01-16
    • Hu Bo, Xie Guoqing, Li Leida, Li Jing, Yang Jiachen, Lu Wen, Gao Xinbo
      Vol. 29, Issue 1, Pages: 22-44(2024) DOI: 10.11834/jig.220722
      Progress of image retargeting quality evaluation: a survey
      Abstract: With the popularization of the Internet, intelligent technology, and various sensing devices, images and videos are playing an increasingly important role in video surveillance, health care, and distance education. All kinds of information can be obtained through images or videos on different terminal devices (such as mobile phones, laptops, and tablets) anytime and anywhere. Because different terminal devices have dissimilar sizes and resolutions, how to display the same image with high visual quality on these devices has become a common concern of academia and industry. To solve this problem, image retargeting technology has emerged and become a popular, cutting-edge research direction in computer vision and image processing. Image retargeting aims to adjust the image resolution without destroying the visual content so as to adapt to information acquisition on various terminals. Traditional image retargeting algorithms achieve this goal through simple operations (such as scaling and cropping), but such operations usually cause serious distortions of the visual content and the loss of important information; thus, obtaining a visually satisfying retargeted image is difficult. To compensate for the performance disadvantages of traditional algorithms, a series of more advanced content-aware image retargeting algorithms have been proposed in recent years. Generally, these algorithms adopt a two-stage framework. The first step is to calculate the importance map of the input image and assign an importance weight to each pixel; the higher the weight, the higher the probability that the pixel should be retained. The second step is to apply a corresponding image resizing method that preserves the important content of the image as much as possible while meeting the geometric constraints. According to the underlying technology, content-aware image retargeting algorithms can be divided into four categories: discrete methods, continuous methods, multioperator methods, and deep learning methods. Although great progress has been made in this field, no algorithm can be guaranteed to meet the requirements of multiple display devices without reducing visual quality, and distortions are inevitably introduced during image retargeting. Therefore, evaluating the quality of retargeted images objectively and accurately is very important for the selection, optimization, and development of image retargeting algorithms. Image quality assessment (IQA) is a basic problem in image processing and computer vision. In general, it can be divided into subjective assessment and objective assessment. Subjective assessment, the most direct and effective way, is completed through the judgment of human subjects. Objective assessment predicts and evaluates image quality automatically by constructing models. Compared with subjective assessment, objective assessment has the advantages of low cost, reusability, and easy deployment, so it is the focus of existing research. According to whether a reference image is used, it can be further divided into full-reference, reduced-reference, and no-reference quality metrics. According to the problem addressed, it can also be divided into objective assessment of natural images, screen content images, cartoon images, and image retargeting. In recent years, a massive effort has been devoted to objective image retargeting quality assessment (IRQA), and encouraging research progress has been achieved. However, up to now, no review paper has studied image retargeting quality evaluation, which could hinder further development of this field. To this end, this review analyzes the challenges facing the field, reviews and summarizes the existing methods, examines their advantages and disadvantages, and states possible development directions to help researchers quickly understand and master the basic situation and to promote the development of this field. First, the related work, namely, image retargeting and traditional IQA, is briefly introduced according to classification principles. Specifically, image retargeting is divided into traditional algorithms and content-aware algorithms, and traditional IQA is described and summarized according to full-reference, reduced-reference, and no-reference metrics. Second, the datasets and objective evaluation of image retargeting are introduced with emphasis. For the datasets, subjective quality assessment serves as the performance upper bound of objective quality assessment and provides performance comparisons for objective quality algorithms. From these datasets, the distortion characteristics of image retargeting are analyzed, and the important steps in the construction of different databases are compared and summarized. Two types of representative IRQA models, namely, traditional feature similarity-based IRQA models and image registration-based IRQA models, are introduced and summarized in detail. Traditional feature similarity-based IRQA models extract perception-sensitive features from the reference and retargeted images, respectively, and quantify the distortion degree of retargeted images by calculating their feature similarity; to cater to the characteristics of the human visual system, saliency maps are often incorporated into the design of these models. Image registration-based IRQA models first conduct sparse or dense image registration between the two images before feature extraction to extract features effectively and improve the performance of the algorithms. Next, their main ideas, advantages, and disadvantages are thoroughly analyzed and compared. Third, the performances of representative IRQA metrics are compared and analyzed on three public datasets in terms of prediction accuracy and monotonicity. Experimental results show that current models are only partially consistent with human perception and can still be improved; more effort is needed in this field. Finally, the current problems and challenges in IRQA are summarized, and possible future development directions are identified.
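A minimal sketch of the two-stage content-aware retargeting framework described above, assuming gradient magnitude as a stand-in importance measure and greedy column removal as a stand-in resizing operator (real methods use saliency-based importance maps and operators such as seam carving or warping); the function names are illustrative:

```python
import numpy as np

def importance_map(gray: np.ndarray) -> np.ndarray:
    """Stage 1: per-pixel importance weights (here simply gradient magnitude)."""
    gy, gx = np.gradient(gray.astype(np.float64))
    return np.abs(gx) + np.abs(gy)

def retarget_width(gray: np.ndarray, target_width: int) -> np.ndarray:
    """Stage 2: discrete resizing that repeatedly drops the least important column."""
    img = gray.copy()
    while img.shape[1] > target_width:
        imp = importance_map(img)
        col = int(np.argmin(imp.sum(axis=0)))   # column with the lowest total importance
        img = np.delete(img, col, axis=1)
    return img

# Usage: shrink a 256-column toy image to 200 columns while keeping high-gradient content.
demo = np.random.rand(128, 256)
print(retarget_width(demo, 200).shape)          # (128, 200)
```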
      Keywords: image quality assessment (IQA); image retargeting; image registration; content loss; geometric distortion
      Published: 2024-01-16
    • Liu Shiyu, Dai Wenrui, Li Chenglin, Xiong Hongkai
      Vol. 29, Issue 1, Pages: 45-64(2024) DOI: 10.11834/jig.230069
      From graph convolution networks to graph scattering networks: a survey
      Abstract: In image processing and computer graphics, non-Euclidean data such as graphs have gained increasing attention in recent years because Euclidean data such as images and videos fail to represent data with structural information. Compared with traditional Euclidean data, the scale of a graph can be arbitrarily large, and the structure of a graph usually carries information such as the relations between vertices, logical sequences, and properties of the graph itself. While images can easily be converted into graphs based on the Euclidean positions of pixels, graphs (especially irregular graphs) can hardly be converted into images. Therefore, graphs require a higher level of representation learning than traditional Euclidean data. However, in the era of deep learning, traditional convolution neural networks (CNNs) fail to learn representations for graphs because they lack permutation covariance for nodewise features and permutation invariance for outputs such as classification labels. The performance of CNNs in graph representation learning remains limited even if inputs are augmented by arbitrary permutations during training to learn permutation covariance. The development of graph neural networks and graph convolution operations is a milestone in representation learning on non-Euclidean data such as graphs. Commonly, graph convolution neural networks (GCNs) are divided into two categories: spatial GCNs and spectral GCNs. Spatial GCNs focus on the construction of neighborhoods and update nodewise features with aggregation functions that combine the features of the center vertex and its neighbors. Although GCNs based on neighborhood feature aggregation encourage the propagation of nodewise features, deep GCNs usually suffer from the oversmoothness issue, in which the features of vertices become indistinguishable. Therefore, later works introduce skip connections into deep GCNs or construct shallow GCNs that consider multiscale neighborhoods within each convolution to alleviate this issue. Spectral GCNs build on the graph spectral theorem and update their parameters by filtering signals in the spectral domain with designed filters. However, eigen-decomposition of the graph shift operator is costly for large graphs because its computational complexity is O(N³). Therefore, spectral GCNs usually apply K-order polynomials (e.g., Chebyshev polynomials) to approximate the target filters and avoid eigen-decomposition. Although spectral GCNs may avoid the oversmoothness issue through graph filter design, the limited number of learnable parameters and filter responses usually limits their expressive ability. Spatial GCNs and spectral GCNs are not necessarily independent of one another. For example, first-order Chebyshev polynomials with a diffusion matrix are equivalent to feature aggregation within the 1-hop neighborhood; therefore, spatial GCNs based on feature aggregation with a diffusion Laplacian matrix or a lazy random walk matrix usually have a spectral form, which bridges spatial and spectral GCNs. The rapid development of graph representation learning gives rise to the demand for surveys and reviews that summarize existing works and serve as guidance for beginners. Currently, graph neural networks such as graph convolution neural networks, graph embeddings, and graph autoencoders have been reviewed, but existing surveys and reviews miss one domain in graph representation learning: graph scattering transforms (GSTs) and graph scattering networks (GSNs). GSNs are non-trainable spectral GCNs based on wavelet decomposition. With the benefit of multiscale wavelets and the structure of the networks, GSNs generate diverse features with nearly nonoverlapping frequency responses in the spectral domain. As one of the newly developed graph representation learning methods, GSNs are used in tasks such as graph classification and node classification, and recent works employ graph scattering transforms in spatial GCNs to overcome the oversmoothness issue. Compared with spectral GCNs, GSNs generate diverse features that strengthen the expressive capability of the model without introducing the oversmoothness issue. However, the non-trainable property of GSNs may limit their flexibility in graph representation learning on datasets with different spectral distributions, and GSNs suffer from the exponential growth of diffusion paths as the number of scattering layers increases, which limits the depth of GSNs in practice. This paper comprehensively reviews the designs from GCNs to GSNs. First, GCNs are divided into spatial GCNs and spectral GCNs. Spatial GCNs are categorized into the following types: 1) diffusion-based GCNs, 2) GCNs on large graphs with neighbor sampling or subgraph sampling, 3) GCNs with attention mechanisms, and 4) GCNs with dynamic neighborhood construction. Spectral GCNs are reviewed according to different filters (filter kernels): Chebyshev polynomials, Cayley polynomials, and K-order polynomials. After addressing the drawbacks of spatial and spectral GCNs, the definition of GSTs, the structure of classical GSNs, and the advantages of GSNs over GCNs are introduced. The current state of the art in GSNs is elaborated from the perspectives of network design and application as well as theoretical stability. The networks and applications of graph scattering transforms are reviewed in the following categories: 1) classical GSNs, 2) graph scattering transforms in GCNs for solving the oversmoothness issue, 3) graph attention networks with GSTs, 4) graph scattering transforms on spatial-temporal graphs, 5) pruning of scattering paths to increase the efficiency of graph scattering transforms, 6) GSNs with multiresolution graphs, and 7) trainable GSNs with wavelet scale selection and learnable spectral filters. In theory, the frame theorem and the stability theorems under signal and topology perturbations are summarized. The limitations of GSNs (GSTs) in current works are analyzed, and possible directions for the future development of graph scattering techniques are proposed.
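A minimal numpy sketch of the K-order Chebyshev approximation mentioned above, which filters graph signals without the O(N³) eigen-decomposition; the function name and the assumption λ_max ≈ 2 for the scaled Laplacian are illustrative, not taken from the survey:

```python
import numpy as np

def cheb_conv(adj: np.ndarray, x: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """y = sum_k theta_k * T_k(L_scaled) @ x, with len(theta) = K + 1 >= 2.

    Uses the Chebyshev recurrence T_0 = I, T_1 = L, T_k = 2 L T_{k-1} - T_{k-2}.
    """
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(n) - d_inv_sqrt @ adj @ d_inv_sqrt    # normalized Laplacian
    lap_scaled = lap - np.eye(n)                       # 2L/lambda_max - I with lambda_max ~ 2
    t_prev, t_curr = x, lap_scaled @ x                 # T_0 x and T_1 x
    out = theta[0] * t_prev + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_next = 2 * lap_scaled @ t_curr - t_prev      # Chebyshev recurrence
        out += theta[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return out

# Usage on a toy 4-node cycle graph with scalar node features.
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], float)
x = np.arange(4, dtype=float).reshape(4, 1)
print(cheb_conv(A, x, theta=np.array([0.5, 0.3, 0.2])).ravel())
```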
      Keywords: deep learning; graph convolution network (GCN); graph scattering network (GSN); representation learning; stability; signal perturbation; topology perturbation
      Published: 2024-01-16

      Image Processing and Coding

    • Huang Ying, Peng Hui, Li Changsheng, Gao Shengmei, Chen Feng
      Vol. 29, Issue 1, Pages: 65-79(2024) DOI: 10.11834/jig.230063
      LLFlowGAN: a low-light image enhancement method for constraining invertible flow in a generative adversarial manner
      Abstract: Objective: Low-light images are produced when imaging devices cannot capture sufficient light due to unavoidable environmental or technical limitations (such as nighttime, backlighting, and underexposure). Such images usually exhibit low brightness, low contrast, a narrow grayscale range, color distortion, and strong noise, and they convey little useful information. Low-light images with these problems do not meet human visual requirements and directly limit the performance of subsequent high-level vision systems. Low-light image enhancement is an ill-posed problem because low-light images lose illumination information; that is, a low-light image may correspond to countless normal-light images. Low-light image enhancement should therefore be regarded as selecting the most suitable solution from all possible outputs. Most existing reconstruction methods rely on pixel-level reconstruction algorithms that aim to learn a deterministic mapping between low-light inputs and normal-light images. They provide a single normal-light result for a low-light image rather than modeling the complex lighting distribution, which usually results in inappropriate brightness and noise. Furthermore, most existing image generation methods use only one (explicit or implicit) generative model, which limits flexibility and efficiency. Flow models have recently demonstrated promising results for low-level vision tasks. This paper develops a hybrid explicit-implicit generative model, which can flexibly and efficiently reconstruct normal-light images with satisfying lighting, cleanliness, and realism from degraded inputs. The model alleviates the blurred details and the singularity problems produced by purely explicit or implicit generative modeling. Method: This paper proposes a low-light image enhancement network, named LLFlowGAN, that hybridizes an explicit flow model with an implicit generative adversarial network (GAN) and contains three parts: a conditional encoder, a flow generation network, and a discriminator. The flow generation network operates at multiple scales conditioned on encoded information from the low-light input. First, a residual attention conditional encoder is designed to process the low-light input, calculate a low-light color map, and extract rich features to reduce the color deviation of generated images. Owing to the flexibility of the flow model, the conditional encoder mainly consists of several residual blocks and an efficient stack of channel attention modules. Then, the features extracted by the encoder are used as a conditional prior for the generative flow model. The flow model learns a bidirectional mapping between high-dimensional random variables obeying the normal-exposure image distribution and simple, tractable latent variables (a Gaussian distribution). By modeling the conditional distribution of normal-exposure images, the model allows sampling of multiple normal-exposure results and thus generates diverse samples. Finally, the GAN-based discriminator constrains the model and improves the detailed information of the image in the reverse mapping. Because the model learns a bidirectional mapping, both mapping directions are constrained by the loss function, providing the network with stability and resistance to mode collapse. Result: The proposed algorithm is validated by experiments on two datasets, namely, the Low-Light (LOL) dataset and the MIT-Adobe FiveK dataset. Quantitative evaluation metrics include peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), learned perceptual image patch similarity (LPIPS), and natural image quality evaluator (NIQE). On the LOL dataset, our model is compared with 18 models, covering traditional methods as well as supervised and unsupervised deep learning methods, including state-of-the-art methods in this field. Compared with the model with the second-best performance, our method improves the PSNR value by 0.84 dB and reduces the LPIPS value (the smaller, the better) by 0.02; SSIM obtains the second-best value, lower by 0.01, and NIQE decreases by 1.05. Visual results of each method are also provided for comparison. Our method better preserves rich detail and color information while enhancing image brightness, artifacts are rarely observed, and better perceptual quality is achieved. On the MIT-Adobe FiveK dataset, the five most advanced methods are compared; compared with the model with the second-best performance, the PSNR value increases by 0.58 dB, and the SSIM value is tied for first place. In addition, a series of ablation experiments and cross-dataset tests on the LOL dataset verify the effectiveness of each module. Experimental results show that the proposed algorithm improves low-light image enhancement. Conclusion: In this paper, a hybrid explicit-implicit generative model is proposed. The model inherits the flow-based explicit generative model, which can accurately perform the conversion between the natural image space and a simple Gaussian distribution and can flexibly generate diverse samples. The adversarial training strategy is further used to improve the detailed information of the generated image, enrich the saturation, and reduce color distortion. The proposed approach achieves competitive performance compared with representative state-of-the-art low-light image enhancement methods.
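A minimal PyTorch sketch of one conditional affine coupling step of the kind such a flow model could stack, with the scale and shift predicted from half of the channels plus the conditional encoder features so the mapping stays exactly invertible; the layer name and layout are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class CondAffineCoupling(nn.Module):
    def __init__(self, channels: int, cond_channels: int, hidden: int = 64):
        super().__init__()
        assert channels % 2 == 0
        half = channels // 2
        self.net = nn.Sequential(                  # predicts per-pixel log-scale and shift
            nn.Conv2d(half + cond_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1))

    def forward(self, x, cond):
        x1, x2 = x.chunk(2, dim=1)
        log_s, t = self.net(torch.cat([x1, cond], dim=1)).chunk(2, dim=1)
        log_s = torch.tanh(log_s)                  # keep scales well-behaved
        z2 = x2 * torch.exp(log_s) + t
        logdet = log_s.flatten(1).sum(dim=1)       # contribution to the log-likelihood
        return torch.cat([x1, z2], dim=1), logdet

    def inverse(self, z, cond):
        z1, z2 = z.chunk(2, dim=1)
        log_s, t = self.net(torch.cat([z1, cond], dim=1)).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        x2 = (z2 - t) * torch.exp(-log_s)          # exact inverse of forward
        return torch.cat([z1, x2], dim=1)

# Usage: map to the latent and back; the reconstruction error should be ~0.
layer = CondAffineCoupling(channels=8, cond_channels=4)
x, c = torch.randn(2, 8, 32, 32), torch.randn(2, 4, 32, 32)
z, _ = layer(x, c)
print((layer.inverse(z, c) - x).abs().max())
```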
      Keywords: low-light image enhancement; flow model; generative adversarial network (GAN); bidirectional mapping; complex illumination distribution
      Published: 2024-01-16
    • Fang Lei, Shi Zelin, Liu Yunpeng, Li Chenxi, Zhao Enbo, Zhang Yingdi
      Vol. 29, Issue 1, Pages: 80-94(2024) DOI: 10.11834/jig.230113
      Joint geometric and piecewise photometric line-scan image registration
      Abstract: Objective: Image registration is a fundamental problem in computer vision and image processing. It aims to eliminate the geometric difference of an object between images collected by different cameras at various times and poses. Image registration has been widely used in visual applications such as image tracking, image fusion, image analysis, and anomaly detection. Image registration methods can be classified into feature-based and direct methods. The former estimates the parameters of a geometric transformation model by extracting and matching features such as corners or edges, while the latter infers the parameters directly from image intensities. Evidently, choosing a reasonable geometric transformation model is the key to image alignment. The principles of line-scan and area-scan cameras are identical, and both conform to pinhole imaging. However, the imaging model of a line-scan camera differs from that of an area-scan camera because of the characteristics of its sensor. Under the same change in camera pose, the same 3D world points map to different locations in the two types of images; that is, the geometric transformation law of an object induced by a pose change differs between the two camera types. When the image plane of a line-scan camera is not parallel to the object plane, geometric transformation models commonly used for area-scan image registration, such as the rigid, affine, and projective models, do not conform to the geometric transformation law of line-scan images, so direct registration methods based on area-scan geometric models cannot geometrically align line-scan images. Moreover, most existing direct registration methods rely on the brightness constancy assumption and consider only geometric transformation. In real applications, brightness variation is unavoidable, and the brightness constancy assumption cannot handle the brightness attenuation that occurs when capturing images with a wide-angle lens. Therefore, this study considers the line-scan image registration problem of jointly estimating the geometric and photometric transformations between two images and proposes a direct registration method for line-scan images based on geometric and piecewise photometric transformations. Method: First, the optimization objective of line-scan image registration is constructed as the sum of squared differences of image intensities. In accordance with the geometric transformation model of line-scan images and the piecewise gain-bias photometric model, the registration problem is expressed as a nonlinear least squares problem. Second, the Gauss-Newton method is used to optimize the geometric and photometric transformation parameters. The nonlinear objective is linearized by a first-order Taylor expansion, and the Jacobian of the warp and photometric transformation is derived from the line-scan geometric model and the gain-bias model. Finally, to obtain the optimal geometric and photometric parameters, the increments of the warp and photometric transformation are computed repeatedly from the normal equations until they fall below a threshold. Because the identity warp used as the initial value cannot be guaranteed to lie near the optimal solution, the iteration may fail to converge during registration. This problem is solved by designing an initial value fast matching method that provides an initial solution closer to the optimal one. The method proceeds as follows: fixed-size areas are selected from the four corners of the template image and matched to the corresponding positions in the target image; the minimum and maximum coordinates of the optimal matching positions in the horizontal and vertical directions are selected; then the scale and translation factors in the horizontal and vertical directions are solved, and the result is taken as the initial value for the iteration. The initial value provided by this method reduces the geometric difference between the template and target images and improves the success rate of registration. Result: To verify the proposed method, a line-scan image acquisition system was built to obtain line-scan images of a planar object under different imaging poses and illumination variations. The experimental data also included line-scan images of electric multiple unit (EMU) trains collected by a line-scan camera in a natural environment. The images collected by the acquisition system and the EMU train images were annotated separately, and the root-mean-square error (RMSE) of the annotated point coordinates was used as the evaluation index of geometric error. The performance of the initial value fast matching method was verified on the collected line-scan image dataset. The geometric error between the template image and the warped target image based on the initial value provided by the fast matching method was smaller than that based on the identity warp, indicating that the provided initial value is closer to the optimal geometric transformation. In registration experiments on the collected dataset and the EMU train line-scan images, the RMSE of the annotated point coordinates is less than 1 pixel, and the registration accuracy is excellent. Conclusion: Our algorithm is more robust to lighting changes and improves the success rate of line-scan image registration. The joint geometric and piecewise photometric line-scan image registration method proposed in this study can accurately align images collected in practical application scenes, which is also a foundation for train anomaly detection based on line-scan images. Therefore, the proposed direct registration method can accurately and robustly align line-scan images collected under nonparallel poses.
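A simplified numpy/scipy sketch of joint geometric and photometric direct registration with Gauss-Newton, in the spirit of the method above; for brevity the warp here is a pure translation and the photometric model a single global gain and bias, whereas the paper uses the line-scan warp model and a piecewise gain-bias model:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def gauss_newton_register(template, target, iters=100, tol=1e-8, margin=4):
    """Estimate p = [tx, ty, gain, bias] minimizing sum (gain*I(x+t) + bias - T(x))^2."""
    rows, cols = np.mgrid[margin:template.shape[0] - margin,
                          margin:template.shape[1] - margin]
    t_vals = template[rows, cols].ravel()
    gy, gx = np.gradient(target.astype(float))           # image gradients of the target
    p = np.array([0.0, 0.0, 1.0, 0.0])                   # identity warp, unit gain, zero bias
    for _ in range(iters):
        coords = np.vstack([(rows + p[1]).ravel(), (cols + p[0]).ravel()])
        i_w = map_coordinates(target, coords, order=1)    # target sampled at warped positions
        ix_w = map_coordinates(gx, coords, order=1)
        iy_w = map_coordinates(gy, coords, order=1)
        r = p[2] * i_w + p[3] - t_vals                    # photometric + geometric residual
        J = np.stack([p[2] * ix_w, p[2] * iy_w, i_w, np.ones_like(i_w)], axis=1)
        dp = np.linalg.solve(J.T @ J, -J.T @ r)           # Gauss-Newton normal equations
        p += dp
        if np.linalg.norm(dp) < tol:                      # stop when the increment is tiny
            break
    return p

# Usage: the target is a shifted, re-exposed copy of the template (tx = 2.0, ty = 1.5).
rr, cc = np.mgrid[0:64, 0:64].astype(float)
template = np.sin(rr / 9.0) * np.cos(cc / 7.0)
target = 1.3 * np.sin((rr - 1.5) / 9.0) * np.cos((cc - 2.0) / 7.0) + 0.1
print(gauss_newton_register(template, target))            # roughly [2.0, 1.5, 0.77, -0.077]
```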
      Keywords: line-scan camera; line-scan image; direct registration method; geometric transformation; photometric transformation
      Published: 2024-01-16
    • Chen Jiani, Xu Dawen
      Vol. 29, Issue 1, Pages: 95-110(2024) DOI: 10.11834/jig.221182
      Reversible data hiding in encrypted images using variable prediction
      Abstract: Objective: With the growing demand for privacy protection in cloud computing and cloud storage scenarios, reversible data hiding in encrypted images (RDHEI) has gained widespread attention. RDHEI involves three independent roles, namely, the image holder, the service provider, and the image receiver. Before uploading images to the cloud, the holder encrypts them, and the service provider embeds some necessary information into the encrypted image on the cloud server for management and other purposes. After obtaining the marked encrypted image, the receiver can opt to extract the embedded data or recover the image according to its own secret key. Existing RDHEI schemes can be divided into two categories: vacating room after encryption (VRAE) and reserving room before encryption (RRBE). The largest difference between the two types of methods lies in the processes before and after the encryption phase. In RRBE, before uploading the encrypted image to the service provider, essential preprocessing must be performed by the image holder to reserve the data space. In VRAE, the image holder can upload the encrypted image directly, and the service provider vacates the data space in preparation for the subsequent embedding. In both types of methods, vacating room relies on the redundancy of images; the core is choosing where to exploit pixel correlation, which is also a compromise between embedding capacity and transmission security. However, most RDHEI algorithms complicate image preprocessing to improve the embedding rate and ensure the security of image encryption. This paper proposes a reversible data hiding algorithm for encrypted images based on variable prediction and multi-most significant bit (multi-MSB) replacement. Method: The proposed algorithm mainly consists of the following parts: image preprocessing, XOR encryption, data embedding, data extraction, and image recovery. This paper proposes a variable prediction bit-plane inversion (VPBI) strategy, which aims to make full use of the overall pixel correlation of the image. First, VPBI iteratively predicts multiple relevant bit-planes of the current pixel from adjacent pixel values. When the prediction value is closer to the target pixel than the inverted value, the prediction is accurate; the current prediction bit-plane can then be used for data hiding, and its bit value is set to zero. Because VPBI works from the second row and second column of the image, a linear prediction method is designed to obtain the prediction error in the first row and first column (except for the first pixel) so as to increase the number of embeddable pixels as much as possible. The sign of the prediction error is stored in the last bit of its binary sequence with one bit, which means the absolute value of the prediction error cannot exceed 127; a sign indication map is designed to record this type of prediction error. At the same time, a location map adaptively marks the embeddable positions; it is sparse and can be losslessly compressed with arithmetic coding. After the room is reserved, the image is XOR-encrypted by the image holder, who inserts side information such as the compressed location map, the sign indication map, and the original first MSBs back into the first MSB bit-plane. In the data embedding phase, the service provider embeds the secret data and the compressed location map through the multi-MSB replacement strategy. Finally, the extraction of the secret data and the recovery of the image are the reverse processes. Therefore, if the image receiver holds the corresponding key, the secret data can be extracted without loss or the original image can be recovered perfectly. Result: To evaluate the performance of the proposed algorithm, experiments compare it with five other state-of-the-art RDHEI algorithms on six common grayscale images and one public database: Break Our Watermarking System 2nd (BOWS-2). Information entropy, embedding capacity, embedding rate, peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM) are used as quantitative evaluation metrics. First, the experiments test the information entropy of each bit-plane of 1 000 images before and after preprocessing. The information entropy of the highest bit-plane is 0.76 lower than that of the original MSB, and the second, third, fourth, and fifth bit-planes decrease by 0.25, 0.45, 0.61, and 0.75, respectively, indicating that VPBI generates more zeros in multiple significant bit-planes, makes them sparse, and effectively increases the embeddable space. Experimental results show that the average embedding rate of the proposed algorithm on BOWS-2 reaches 2.953 bit/pixel, which is 0.423 bit/pixel higher than that of the latest algorithm. The secret data can be extracted without error, and the PSNR and SSIM are constant values equal to ∞ and 1, respectively, which shows that the proposed algorithm is reversible. Conclusion: In this paper, a reversible data hiding algorithm for encrypted images based on variable prediction and multi-MSB replacement is proposed. By using the redundancy between pixels and reducing the space occupied by the sign indication map, VPBI is proposed to process the multi-MSB planes. Comparing the variable prediction value, the inverted value, and the target pixel provides considerable space to embed data. In the embedding stage, multi-MSB replacement is used to hide the secret data. The adaptive location map and other side information are saved in the highest bit-plane to ensure that no additional data are required when the image is transmitted to the cloud server. Experiments show that the proposed method has a high embedding rate and ensures reversibility and security. In the future, an effective scheme for optimizing high-texture images will be developed.
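A toy sketch of the multi-MSB replacement idea described above (not the full VPBI scheme): at an embeddable pixel the top k bits are overwritten by secret bits, the receiver reads them back, and the pixel is restored from a neighbour-based prediction whose top k bits match the original; the values and helper names are illustrative:

```python
K = 3  # number of most significant bit-planes used for embedding

def embed(pixel: int, bits: str) -> int:
    """Replace the K MSBs of an 8-bit pixel with K secret bits."""
    assert len(bits) == K
    return (int(bits, 2) << (8 - K)) | (pixel & ((1 << (8 - K)) - 1))

def extract(marked: int) -> str:
    """Read the K secret bits back from the marked pixel."""
    return format(marked >> (8 - K), f"0{K}b")

def recover(marked: int, prediction: int) -> int:
    """Restore the pixel: MSBs come from the prediction, LSBs were never touched."""
    return (prediction & ~((1 << (8 - K)) - 1)) | (marked & ((1 << (8 - K)) - 1))

# Usage: pixel 154 whose neighbours predict 150 (same top-3 bits, so it is embeddable).
orig, pred = 154, 150
marked = embed(orig, "101")
print(extract(marked), recover(marked, pred) == orig)  # '101' True
```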
      Keywords: reversible data hiding; image encryption; variable prediction; multi-most significant bit replacement; adaptive location map
      Published: 2024-01-16

      Image Analysis and Recognition

    • Zhao Minghua, Dong Shuangshuang, Hu Jing, Du Shuangli, Shi Cheng, Li Peng, Shi Zhenghao
      Vol. 29, Issue 1, Pages: 111-122(2024) DOI: 10.11834/jig.230053
      Attention-guided three-stream convolutional neural network for microexpression recognition
      Abstract: Objective: In recent years, microexpression recognition has shown remarkable application value in fields such as psychological counseling, lie detection, and intention analysis. However, unlike macroexpressions produced in conscious states, microexpressions often occur in high-stakes scenarios and are produced unconsciously. They are characterized by small action amplitudes and short duration, and they usually affect only local facial areas. These characteristics make microexpression recognition difficult. Traditional methods used in early research mainly include methods based on local binary patterns and methods based on optical flow. The former effectively extracts the texture features of microexpressions, whereas the latter calculates pixel changes in the temporal domain and the relationship between adjacent frames, providing rich, key input information for the network. Although traditional methods based on texture and optical flow features made good progress in early microexpression recognition, they often require considerable cost and leave room for improvement in recognition accuracy and robustness. Later, with the development of machine learning, microexpression recognition based on deep learning gradually became the mainstream of research in this field. Such methods use neural networks to extract features from input image sequences after a series of preprocessing operations (facial cropping, alignment, and grayscale conversion) and classify them to obtain the final recognition result. The introduction of deep learning has substantially improved recognition performance. However, given the characteristics of microexpressions, recognition accuracy can still be improved considerably, and the limited scale of existing microexpression datasets also restricts recognition performance. To address these problems, this paper proposes an attention-guided three-stream convolutional neural network (ATSCNN) for microexpression recognition. Method: First, considering that the motion changes between adjacent frames of a microexpression are very subtle, and to reduce redundant information and computation while preserving the important features, only the two key frames of each microexpression (the onset frame and the apex frame) are preprocessed with facial alignment and cropping, yielding single-channel grayscale images with a resolution of 128 × 128 pixels and reducing the influence of nonfacial areas on recognition. Then, because optical flow captures representative motion features between the two frames, achieves a higher signal-to-noise ratio than the raw data, and provides rich, critical input features for the network, the total variation-L1 (TV-L1) energy functional is used to extract the optical flow features between the two key frames (the horizontal component of the flow, the vertical component of the flow, and the optical strain). Next, in the feature extraction stage, to overcome the overfitting caused by the limited sample size, three identical four-layer convolutional neural networks extract features from the optical flow horizontal component, the optical flow vertical component, and the optical strain (the input channel numbers of the four convolutional layers are 1, 3, 5, and 8, and the output channel numbers are 3, 5, 8, and 16), thereby improving network performance. Afterward, because the image sequences in the microexpression datasets used in this paper inevitably contain redundant information beyond the face, a convolutional block attention module (CBAM), with channel attention and spatial attention connected serially, is added after each shallow convolutional neural network in each stream to focus on the important information of the input and suppress irrelevant information while attending to both the channel and spatial dimensions, which enhances the network's ability to obtain effective features and improves recognition performance. Finally, the extracted features are fed into a fully connected layer to classify microexpression emotions (negative, positive, and surprise). In addition, the entire architecture uses the scaled exponential linear unit (SELU) activation function to overcome the potential dying-neuron and vanishing-gradient problems of the commonly used rectified linear unit (ReLU) activation function and to speed up network convergence. Result: Experiments were conducted on the combined microexpression dataset using the leave-one-subject-out (LOSO) cross-validation strategy, in which each subject in turn serves as the test set and all remaining samples are used for training. This validation method makes full use of the samples, has a certain generalization ability, and is the most commonly used validation in current microexpression recognition research. The unweighted average recall (UAR) and unweighted F1-score (UF1) reached 0.735 1 and 0.720 5, respectively. Compared with the Dual-Inception model, which performed best among the comparison methods, UAR and UF1 increased by 0.060 7 and 0.068 3, respectively. To further verify the effectiveness of the proposed ATSCNN architecture, several ablation experiments were also conducted on the combined dataset, and the results confirmed the feasibility of the method. Conclusion: The proposed microexpression recognition network effectively alleviates overfitting, focuses on the important information of microexpressions, and achieves state-of-the-art (SOTA) recognition performance on small-scale microexpression datasets under LOSO cross-validation, outperforming other mainstream models. In addition, the ablation results make the proposed method more convincing. In conclusion, the proposed method remarkably improves the effectiveness of microexpression recognition.
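A minimal sketch of preparing the three ATSCNN input streams described above, assuming the opencv-contrib package for its TV-L1 implementation; the optical strain is derived from the flow gradients in the usual way, and the function name is illustrative:

```python
import cv2
import numpy as np

def three_stream_inputs(onset_gray: np.ndarray, apex_gray: np.ndarray):
    """Return the horizontal flow, vertical flow, and optical strain between two key frames."""
    tvl1 = cv2.optflow.createOptFlow_DualTVL1()            # TV-L1 energy functional
    flow = tvl1.calc(onset_gray, apex_gray, None)          # H x W x 2, float32
    u, v = flow[..., 0], flow[..., 1]                      # horizontal / vertical components
    u_y, u_x = np.gradient(u)
    v_y, v_x = np.gradient(v)
    strain = np.sqrt(u_x**2 + v_y**2 + 0.5 * (u_y + v_x)**2)  # optical strain magnitude
    return u, v, strain

# Usage with two 128 x 128 grayscale key frames (uint8).
onset = np.random.randint(0, 256, (128, 128), np.uint8)
apex = np.roll(onset, 1, axis=1)
u, v, strain = three_stream_inputs(onset, apex)
print(u.shape, v.shape, strain.shape)
```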
      Keywords: microexpression recognition; optical flow; three-stream convolutional neural network; convolutional block attention module (CBAM); SELU activation function
      Published: 2024-01-16
    • Cui Xinyu, He Chong, Zhao Hongke, Wang Meili
      Vol. 29, Issue 1, Pages: 123-133(2024) DOI: 10.11834/jig.230043
      Combining ViT with contrastive learning for facial expression recognition
      Abstract: Objective: Facial expression is one of the important factors in human communication that helps people understand the intentions of others. The task of facial expression recognition is to output the expression category corresponding to a given face image. Facial expression recognition has broad applications in areas such as security monitoring, education, and human-computer interaction. Currently, facial expression recognition under uncontrolled conditions suffers from low accuracy due to factors such as pose variations, occlusions, and lighting differences. Addressing these issues will remarkably advance facial expression recognition in real-world scenarios and holds great relevance for artificial intelligence. Self-supervised learning applies specific data augmentations to input data and generates pseudo labels for training or pretraining models; it leverages a large amount of unlabeled data and extracts the prior distribution of the images themselves to improve the performance of downstream tasks. Contrastive learning, a form of self-supervised learning, can further learn the intrinsically consistent features shared by similar images under changes of pose and lighting by increasing the difficulty of the task. This paper proposes an unsupervised contrastive learning-based facial expression classification method to address the low accuracy caused by occlusion, pose variation, and lighting changes in facial expression recognition. Method: To address occlusions in facial expression recognition datasets under real-world conditions, a self-supervised contrastive learning method based on negative samples is employed. The method consists of two stages: contrastive learning pretraining and model fine-tuning. First, in the pretraining stage, an unsupervised contrastive loss is introduced to reduce the distance between images of the same class and increase the distance between images of different classes, improving discrimination under the intraclass diversity and interclass similarity of facial expression images. Positive sample pairs formed by the original images and their occlusion-augmented counterparts are added for contrastive learning, enhancing the robustness of the model to image occlusion and illumination changes. Additionally, a dictionary mechanism is applied to MoCo v3 to overcome insufficient memory during training. The recognition model is pretrained on the ImageNet dataset. Next, the model is fine-tuned on the facial expression recognition dataset to improve classification accuracy for the facial expression recognition task. This approach effectively enhances facial expression recognition in the presence of occlusions. Moreover, the vision Transformer (ViT) is employed as the backbone network to enhance the model's feature extraction capability. Result: Experiments were conducted on four datasets to compare the performance of the proposed method with 13 recent methods. On the RAF-DB dataset, the recognition accuracy is 0.48% higher than that of the Face2Exp model; on the FERPlus dataset, it is 0.35% higher than that of the knowledgeable teacher network (KTN) model; on the AffectNet-8 dataset, it is 0.40% higher than that of the self-cure network (SCN) model; on the AffectNet-7 dataset, it is slightly lower (by 0.26%) than that of the deep attentive center loss (DACL) model, which demonstrates the effectiveness of the proposed method. Conclusion: A self-supervised contrastive learning-based method for facial expression recognition is proposed to address the challenges of occlusion, pose variation, and illumination changes under uncontrolled conditions. The method consists of two stages: pretraining and fine-tuning. The contribution of this paper lies in integrating ViT into the contrastive learning framework, which enables a large amount of unlabeled, noise-occluded data to be used to learn the distribution characteristics of facial expression data. The proposed method achieves promising accuracy on the RAF-DB, FERPlus, AffectNet-7, and AffectNet-8 facial expression recognition datasets. By leveraging the contrastive learning framework and advanced feature extraction networks, this work advances the application of deep learning methods to everyday visual tasks.
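A minimal PyTorch sketch of the batch-wise contrastive (InfoNCE) objective underlying such pretraining, treating two augmented or occluded views of the same face as a positive pair and the rest of the batch as negatives; this illustrates the principle only and is not the exact MoCo v3 formulation with its dictionary and momentum encoder:

```python
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, k: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """q, k: (N, D) embeddings of two views; row i of q matches row i of k."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature          # (N, N) cosine-similarity matrix
    labels = torch.arange(q.size(0))          # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Usage: embeddings produced by, e.g., a ViT backbone applied to two views of each image.
q = torch.randn(16, 128)
k = torch.randn(16, 128)
print(info_nce(q, k).item())
```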
      Keywords: facial expression recognition; contrastive learning; self-supervised learning; Transformer; positive and negative samples
      Published: 2024-01-16
    • Lai Jie, Peng Ruihui, Sun Dianxing, Huang Jie
      Vol. 29, Issue 1, Pages: 134-146(2024) DOI: 10.11834/jig.221189
      Detection of camouflage targets based on attention mechanism and multi-detection layer structure
      Abstract: Objective: Camouflaged target detection is a critical area of research in computer vision whose major goal is to extract the position and category of a target from a complex background environment. In addition to being widely used in the military, camouflaged target detection has considerable application and research value in medical image segmentation, industrial defect detection, agricultural fruit detection, and other fields. Typically, a camouflaged target blends to a remarkable degree into the surrounding background and has weak visual edges, low resolution, and insufficient feature information. As a result, conventional target detection algorithms struggle to meet the requirements of camouflaged target detection and typically suffer from a high missed detection rate and low detection accuracy. To address these issues, this work provides a camouflaged target detection approach based on YOLOv5 (MAH-YOLOv5). Method: First, the YOLOv5 algorithm detects large, medium, and small objects in the network prediction head with three detection layers of different scales: 80 × 80, 40 × 40, and 20 × 20. The smallest detection layer can only recognize objects larger than 8 × 8 pixels, so targets with an extremely low pixel ratio are overlooked. Consequently, a non-significant target detection layer is added to the prediction head to improve the network's perception of targets with insufficient feature information and to reduce missed detections and false alarms during detection and recognition. Second, a convolutional neural network is used for feature extraction, but every part of the image is assigned the same weight during extraction, which prevents the network from focusing on the target's effective information and wastes computing resources. Therefore, the convolutional block attention module (CBAM) is integrated into the feature extraction backbone to make better use of target feature information. CBAM comprises two components, channel attention and spatial attention; it fuses attention features from the spatial and channel dimensions, adjusts the weights assigned to targets with insufficient feature information, and makes the network pay more attention to the camouflaged target to be detected, improving the average detection accuracy. Third, during training, a multiscale training method is used to expand the variety of the dataset and improve the model's robustness and generalization ability through scale variation. Finally, indexes for missed alarms and false alarms in military target detection are defined, and a comprehensive detection ability index for camouflaged targets is proposed, which, combined with average detection accuracy and speed, provides a mechanism for quantitative comparison between different techniques. Result: The experiments are trained and verified on the research group's camouflage dataset, which includes 3 200 training samples and 1 100 test samples, and are compared with faster region convolutional neural network (Faster R-CNN), YOLOv4-tiny, single shot multibox detector (SSD), detection Transformer (DETR), YOLOx, YOLOv7, YOLOv8, and other algorithms. Results show that the proposed method's mean average precision (mAP) on the self-made dataset is 76.64%, the number of frames detected per second (FPS) is 53, the missed detection rate (MA) is 8.53%, the false alarm rate (FA) is only 0.14%, and the comprehensive detection index of camouflaged targets reaches 88.17%. Compared with the YOLOv5 algorithm, the mAP is 3.89% higher, the MA is 2.75% lower, the FA is 0.56% lower, and the comprehensive detection index is 0.74% higher. Furthermore, the detection effect of YOLOv5 can be improved by adding a small target detection layer, integrating an attention mechanism, and training with the multiscale method. After adding the small target detection layer, the mAP of YOLOv5 rises by 4.12%, the FA falls by 0.71%, and the MA and comprehensive detection index change slightly. After applying the attention mechanism, the mAP of YOLOv5 improves by 3.89%, the FA falls by 0.63%, the MA falls by 0.71%, and the comprehensive detection index improves by 0.29%. After applying multiscale training, the mAP increases by 3.13%, the MA falls by 2.85%, and the FA falls by 0.56%. To demonstrate the usefulness of the proposed technique, the MAH-YOLOv5 algorithm is compared with Faster R-CNN, SSD, YOLOv4-tiny, DETR, YOLOx, YOLOv7, YOLOv8, and other algorithms. Testing results reveal that the proposed method outperforms these algorithms in terms of mAP, FPS, MA, FA, and other indicators, and its comprehensive detection index is second only to that of the most advanced YOLOv8 algorithm. Conclusion: This paper improves the YOLOv5 method by adding a small target detection layer and fusing an attention mechanism and proposes a camouflaged target detection method. The experimental results show that the proposed method greatly improves detection accuracy and recognition rate and can effectively identify camouflaged targets in complex background environments. Comparisons show that the comprehensive detection performance of this method is much better than that of Faster R-CNN, SSD, YOLOv4-tiny, DETR, YOLOv7, and other algorithms. The proposed method can provide technical assistance and reference for the rapid, accurate identification of battlefield camouflaged targets.
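A minimal PyTorch sketch of the CBAM block referred to above, with channel attention followed serially by spatial attention as it would be inserted into the feature extraction backbone; the reduction ratio and spatial kernel size are illustrative defaults:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, kernel: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(                        # shared MLP for channel attention
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x):
        n, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))               # channel attention: avg- and max-pooled
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(n, c, 1, 1)
        s = torch.cat([x.mean(dim=1, keepdim=True),      # spatial attention: pool over channels
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

# Usage on a YOLOv5-style feature map.
feat = torch.randn(1, 64, 80, 80)
print(CBAM(64)(feat).shape)   # torch.Size([1, 64, 80, 80])
```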
      关键词:camouflage target detection;non-significant target detection layer;attention mechanism;multi-scale training;composite detection index   
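      The attention module described above follows the standard CBAM design. The block below is a minimal PyTorch sketch of that idea (channel attention followed by spatial attention), not the authors' exact configuration; the reduction ratio of 16, the 7 × 7 spatial kernel, and the example feature-map shape are illustrative assumptions.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max-pooled descriptor
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * w

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, applied to a feature map."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))

feat = torch.randn(1, 256, 40, 40)   # e.g., a backbone feature map
out = CBAM(256)(feat)                # same shape, attention-reweighted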
    • Peng Hong,Wang Qian,Jia Di,Zhao Jinyuan,Pang Yuheng
      Vol. 29, Issue 1, Pages: 147-162(2024) DOI: 10.11834/jig.230096
      Fast convergence network for target posetracking driven by synthetic data
      摘要:ObjectiveRigid object pose estimation is one of the fundamental, most challenging problems in computer vision, which has garnered substantial attention in recent years. Researchers are seeking methods to localize the multiple degrees of freedom of rigid objects in a 3D scene, such as position translation and directional rotation. At the same time, progress in the field of rigid object pose estimation has been considerable with the development of computer vision techniques. This task has become increasingly important in various applications, including robotics, space orbit servicing, autonomous driving, and augmented reality. Rigid object pose estimation can be divided into two stages: the traditional pose estimation stage (e.g., feature-based, template matching, and 3D coordinate-based methods) and the deep learning-based pose estimation stage (e.g., improved traditional methods and direct or indirect estimation methods). Despite the achievement of high tracking accuracy by existing methods and their improved variants, the tracking precision substantially deteriorates when they are applied to new scenes or novel target objects, exhibiting poor performance in complex environments. In such cases, a large amount of training data is required for deep learning across multiple scenarios, incurring high costs for data collection and network training. To address this issue, this paper proposes a real-time tracking network for rigid object 6D pose with fast convergence and high robustness, driven by synthetic data. The network provides long-term stable 6D pose tracking for target rigid objects, greatly reducing the cost of data collection and the time required for network convergence.MethodThe network convergence speed is mainly improved by the overall design of the network, the residual sampling filtering module, and the characteristic aggregation module. The rigid 6D pose transformation is calculated using Lie algebra and Lie group theory. The current frame RGB-D image and the previous frame’s pose estimation result are transformed into a pair of 4D tensors and input into the network. The pose difference is obtained through residual sampling filtering processing and feature encoder, and the current 6D pose of the target is calculated jointly with the previous frame’s pose estimation. In the design of the residual sampling filtering module, the self-gated swish activation function is used to retain target detail features, and the displacement and rotation matrix is obtained by decoupling the target pose through feature encoding and decoder, which improves the accuracy of target pose tracking. In the design of the characteristic aggregation module, the features are decomposed into horizontal and vertical components, and a 1D feature encoding is obtained through aggregation, capturing long-term dependencies and preserving position information from time and space. A set of complementary feature maps with position and time awareness is generated to strengthen the target feature extraction ability, thereby accelerating the convergence of the network.ResultTo ensure consistent training and testing environments, all experiments are conducted on a desktop computer with an Intel Core i7-8700@3.2 GHz processor and NVIDIA RTX 3060 GPU. Each target in the complete dataset contains approximately 23 000 sets of images with a size of 176 × 176 pixels, totaling about 15 GB in capacity. During training and validation, the batch size is set to 80, and the model is trained for 300 epochs. 
The initial learning rate is set to 0.01, with decay rate parameters of 0.9 and 0.99 applied starting from the 100th and the 200th epochs, respectively. When evaluating the tracking performance, the average distance of model points (ADD) metric is commonly used to assess the accuracy of pose estimation for non-symmetric objects. This approach involves calculating the Euclidean distance between each predicted point and the corresponding ground truth point, followed by summing these distances and taking their average. However, the ADD metric is not suitable for evaluating symmetric objects because multiple correct poses may exist for a symmetric object in the same image. In such cases, the ADD-S metric is used, which projects the ground truth and predicted models onto the symmetry plane and calculates the average distance between the projected points. This metric is more appropriate for evaluating the pose tracking results of symmetric objects. The Yale-CMU-Berkeley-video (YCB-Video) dataset and Yale-CMU-Berkeley in end-of-arm-tooling (YCBInEoAT) dataset are used to evaluate the performance of relevant methods in the experiments. The YCB-Video dataset contains complex scenes captured by a moving camera under severe occlusion conditions, whereas the YCBInEoAT dataset involves tracking rigid objects with a robotic arm. These two datasets are utilized to validate the generality and robustness of the network across different scenarios. Experimental results show the tracking speed of the proposed method reaches 90.9 Hz, and the average distance of model points (ADD) and the average distance of nearest points (ADD-S) reach 93.24 and 95.84, respectively, which are higher than similar related methods. Compared with the se(3)-TrackNet method, which has the highest tracking accuracy, the ADD and ADD-S of this method are 25.95 and 30.91 higher under the condition of 6 000 sets of synthetic data, respectively, 31.72 and 28.75 higher under the condition of 8 000 sets of synthetic data, respectively, and 35.75 higher under the condition of 10 000 sets of synthetic data. The method achieves highly robust 6D pose tracking for targets in severely occluded scenes.ConclusionA novel fast-converging network is proposed for tracking the pose of rigid objects, which combines the residual sampling filtering module and the characteristic aggregation module. This network can provide long-term effective 6D pose tracking of objects with only one initialization. By utilizing a small amount of synthetic data, the network quickly reaches a state of convergence and achieves desirable performance in complex scenes, including severe occlusion and drastic displacement. The network demonstrates outstanding real-time pose tracking efficiency and tracking accuracy. Experimental results on different datasets validate the superiority and reliability of this approach. In future work, we will continue to optimize our model, further improve object tracking accuracy and network convergence speed, address the limitation of requiring computer-aided design (CAD) models for the network, and achieve category-level pose tracking.  
      关键词:6D pose estimation;real-time tracking;synthetic data;image processing;feature fusion   
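      The ADD and ADD-S metrics described in this abstract can be written compactly in NumPy. The sketch below follows the common definitions (mean corresponding-point distance for ADD, mean closest-point distance for ADD-S); the symmetry-plane projection variant mentioned in the abstract is not reproduced, so treat this as a generic reference implementation rather than the authors' evaluation code.

import numpy as np

def transform(points, R, t):
    """Apply a rigid transform (R: 3 x 3 rotation, t: translation) to N x 3 points."""
    return points @ R.T + t

def add_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding model points under the two poses."""
    d = transform(points, R_gt, t_gt) - transform(points, R_pred, t_pred)
    return np.linalg.norm(d, axis=1).mean()

def add_s_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: for symmetric objects, mean distance from each ground-truth point
    to its nearest point on the model placed at the predicted pose."""
    p_gt = transform(points, R_gt, t_gt)
    p_pred = transform(points, R_pred, t_pred)
    dists = np.linalg.norm(p_gt[:, None, :] - p_pred[None, :, :], axis=2)
    return dists.min(axis=1).mean()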
    • Liu Xiang,Li Hui,Cheng Yuanzhi,Kong Xiangzhen,Chen Shuangmin
      Vol. 29, Issue 1, Pages: 163-178(2024) DOI: 10.11834/jig.221003
      3D multi-object tracking based on image and point cloud multi-information perception association
      摘要:Objective3D multi object tracking is a challenging task in autonomous driving, which plays a crucial role in improving the safety and reliability of the perception system. RGB cameras and LiDAR sensors are the most commonly used sensors for this task. While RGB cameras can provide rich semantic feature information, they lack depth information. LiDAR point clouds can provide accurate position and geometric information, but they suffer from problems such as dense near distance and sparse far distance, disorder, and uneven distribution. The multimodal fusion of images and point clouds can improve multi object tracking performance, but due to the complexity of the scene and multimodal data types, the existing fusion methods are less effective and cannot obtain rich fusion features. In addition, existing methods use the intersection ratio or Euclidean distance between the predicted and detected bounding boxes of objects to calculate the similarity between objects, which can easily cause problems such as trajectory fragmentation and identity switching. Therefore, the adequacy of multimodal data fusion and the robustness of data association are still urgent problems to be solved. To this end, a 3D multi object tracking method based on image and point cloud multi-information perception association is proposed.MethodFirst, a hybrid soft attention module is proposed to enhance the image semantic features using channel separation techniques to improve the information interaction between channel and spatial attention. The module includes two submodules. The first one is the soft channel attention submodule, which first compresses the spatial information of image features into the channel feature vector after the global average pooling layer, followed by two fully connected layers to capture the correlation between channels, followed by the Sigmoid function processing to obtain the channel attention map, and finally multiplies with the original features to obtain the channel enhancement features. The second is the soft spatial attention submodule. To make better use of the channel attention map in spatial attention, first, according to the channel attention map, the channel enhancement features are divided into two groups along the channel axis using the channel separation mechanism, namely, the important channel group and the minor channel group, noting the channel order is not changed in the separation. Then, the two groups of features are enhanced separately using spatial attention and summed to obtain the final enhanced features. Then, a semantic feature-guided multimodal fusion network is proposed, in which point cloud features, image features, and point-by-image features are fused in a deep adaptive way to suppress the interference information of different modalities and to improve the tracking effect of the network on small and obscured objects at long distances by taking advantage of the complementary point cloud and image information. Specifically, the network first maps point cloud features, image features, and point-by-image features to the same channel, stitches them together to obtain the stitched features, uses a series of convolutional layers to obtain the correlation between the features, obtains the respective adaptive weights after the sigmoid function, multiplies them with the respective original features to obtain the respective attention features, and adds the obtained attention features to obtain the final fused features. 
Finally, a multiple information perception affinity matrix is constructed to combine multiple information such as intersection ratio, Euclidean distance, appearance information, and direction similarity for data association to increase the matching rate of trajectory and detection and improve tracking performance. First, the Kalman filter is used to predict the state of the trajectory in the current frame. Then, the intersection ratio, Euclidean distance, and directional similarity between the detection frame and the prediction frame are calculated and combined to represent the position affinity between objects, and the appearance affinity matrix and the position affinity matrix are weighted and summed as the final multiple information perception affinity matrix. Finally, based on the obtained affinity matrix, the Hungarian matching algorithm is used to complete the association matching task between objects in two adjacent frames. Result: First, the proposed modules are validated on the KITTI validation set, and the results of the ablation experiments show each of the proposed modules, namely, hybrid soft attention, semantic feature-guided multimodal fusion, and multiple information perception affinity matrix, can improve the tracking performance of the model to different degrees, which proves the effectiveness of the proposed modules. Then, the method is evaluated on two benchmark datasets, KITTI and NuScenes, and compared with existing advanced 3D multi-object tracking methods. On the KITTI dataset, the higher order tracking accuracy (HOTA) and multi-object tracking accuracy (MOTA) metrics of the proposed method reach 76.94% and 88.12%, respectively, improvements of 1.48% and 3.49% over the best-performing compared method. On the NuScenes dataset, the average MOTA (AMOTA) and MOTA metrics of the proposed method reach 68.3% and 57.9%, respectively, improvements of 0.6% and 1.1% over the best-performing compared method, and the overall performance on both datasets surpasses that of the advanced tracking methods. Conclusion: The proposed method effectively solves the problems of missed detection of obscured objects and small long-range objects, object identity switching, and trajectory fragmentation and can accurately and stably track multiple objects in complex scenes. Compared with existing competing methods, the proposed method has more advanced tracking performance and better tracking robustness and is more suitable for application in scenarios such as autonomous driving environment awareness and intelligent transportation.
      关键词:point cloud;3D multi-object tracking;attention;multimodal fusion;data association   
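      The association step described above combines several affinity cues and solves the assignment with the Hungarian algorithm. The following is a simplified sketch of that pipeline using only a Euclidean position affinity and an appearance cosine affinity; the weights, distance scale, acceptance threshold, and the omission of 3D IoU and direction similarity are assumptions made for illustration.

import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_affinity(a, b):
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T                                   # (num_tracks, num_detections)

def associate(track_centers, det_centers, track_feats, det_feats,
              w_pos=0.5, w_app=0.5, max_dist=5.0, min_affinity=0.3):
    """Build a combined affinity matrix and match tracks to detections."""
    d = np.linalg.norm(track_centers[:, None, :] - det_centers[None, :, :], axis=2)
    pos_aff = 1.0 - np.clip(d / max_dist, 0.0, 1.0)  # 1 when close, 0 when far
    affinity = w_pos * pos_aff + w_app * cosine_affinity(track_feats, det_feats)
    rows, cols = linear_sum_assignment(-affinity)    # Hungarian step (maximize affinity)
    return [(r, c) for r, c in zip(rows, cols) if affinity[r, c] > min_affinity]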
    • Yang Qiang,Luo Jian,Huang Yuchen
      Vol. 29, Issue 1, Pages: 179-191(2024) DOI: 10.11834/jig.221142
      Gait image spatio-temporal restoration network and its application under occlusion conditions
      摘要:Objective: Gait recognition is a kind of identity recognition method based on the human walking mode, which has been widely used in the fields of video surveillance and public security. Compared with the face, fingerprint, and other biometric features, it has the advantages of long-distance recognition without the need for participant cooperation and the difficulty of camouflaging and hiding. At present, gait recognition algorithms are based on vision and deep learning, and most of them use gait sequences without occlusion to form gait features for recognition. However, in reality, people under the monitoring of various public places are inevitably blocked, so the gait sequences obtained are usually occluded. The occlusion sequence has a great effect on gait recognition, such as the inability to obtain accurate gait periods from the sequence, and the lack of gait spatio-temporal information is also more serious, which leads to a substantial reduction in recognition performance. The existing occlusion gait processing algorithms are divided into two kinds: one directly extracts occlusion-robust features from the occluded sequence for identification, which often requires the gait period to be known in advance, yet the gait period is difficult to obtain from an occluded sequence; the other performs identification by reconstructing the gait silhouette or repairing the gait features, but the existing algorithms often have poor performance when the occlusion area is large or the entire sequence is occluded. Method: A gait spatio-temporal reconstruction network (GSTRNet), which consists of the occlusion detection network you only look once (YOLO), the spatio-temporal codec network, and the feature extraction network (Gaitset), based on prior knowledge is proposed to repair occluded gait sequences. GSTRNet uses YOLOv5 to detect the occlusion region in the sequence (assigning the occlusion area to 1 and the nonocclusion area to 0) as a piece of prior knowledge to assign a higher weight to the loss of the occlusion area. The spatio-temporal codec network consists of a 3D convolutional neural network (3DCNN) and Transformers. The 3DCNN can repair the spatial information of each gait image while maintaining the time coherence between frames. The encoder uses a 3DCNN with a stride of 2 to reduce the dimension of the data so that each element can participate in the current sampling and more detailed information can be retained. The decoder uses skip connections to stitch together features in the encoder to further reduce the loss of detail due to encoder downsampling. To ensure the time and space consistency of the entire repaired sequence, multiple Transformers composed of multiscale self-attention modules are added between the encoder and decoder, extracting useful information from the global and local scope to repair the gait sequence. Because the 3DCNN performs a global repair, the nonocclusion region data in the repaired gait sequence also change, and GSTRNet uses prior knowledge to take only the occlusion region repair result from the decoder output and then adds it to the original sequence as the output of the network. The Gaitset network is also introduced to extract features from three sequences as triplet losses to maintain feature consistency between the repaired sequence and the original sequence, namely, occlusion sequences, genuine sequences (other nonocclusion sequences with the same identity as the occlusion sequences), and imposter sequences (sequences that have different identities from the occlusion sequences). In the OU-ISIR gait database, multiview large population dataset (OU_MVLP), 24 occlusion gait sequences are synthesized as experimental data by simulating various occlusion types in real life, and our algorithm is evaluated using three sets of experiments: the occlusion mode is known and the gallery and probe occlusion modes are consistent, the occlusion mode is known but the gallery and probe occlusion modes are inconsistent, and the occlusion mode is unknown. Result: Results show the proposed algorithm performs better than the existing occlusion sequence repair algorithms. Compared with other repair algorithms, the rank-1 recognition rates of our algorithm in single occlusion mode and non-single occlusion mode when the occlusion mode is known are 4.1% and 4.1% higher, respectively, and a maximum recognition accuracy improvement of 6.7% is achieved in the case of large-area occlusion of the gait silhouette when the occlusion mode is unknown. Compared with gait recognition algorithms such as 3D local convolutional neural networks, the recognition rate in non-single occlusion mode has a maximum increase of about 50%. Conclusion: The proposed GSTRNet model has a good repair effect, to varying degrees, on gait sequences in various occlusion modes and has strong feasibility in practice.
      关键词:gait recognition;occluded silhouette reconstruction;prior knowledge;three-dimensional convolutional neural network(3DCNN);Transformer   
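      Two ingredients of the method, compositing the decoder output only inside the detected occlusion region and enforcing feature consistency with a triplet-style loss, can be sketched as follows in PyTorch. Tensor shapes, the margin value, and the function names are hypothetical; the actual network also involves YOLOv5 occlusion detection and Gaitset feature extraction, which are not shown here.

import torch.nn.functional as F

def compose_repair(original, decoder_out, occ_mask):
    """Keep the decoder output only inside the detected occlusion region
    (mask == 1) and the original silhouettes elsewhere.
    All tensors share the shape (batch, frames, height, width)."""
    return occ_mask * decoder_out + (1.0 - occ_mask) * original

def feature_consistency_loss(f_repaired, f_genuine, f_imposter, margin=0.2):
    """Triplet-style loss: repaired-sequence features should be close to a genuine
    (same identity) sequence and far from an imposter sequence."""
    d_pos = F.pairwise_distance(f_repaired, f_genuine)
    d_neg = F.pairwise_distance(f_repaired, f_imposter)
    return F.relu(d_pos - d_neg + margin).mean()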
    • Li Sicong,Zhu Feng,Wu Qingxiao
      Vol. 29, Issue 1, Pages: 192-204(2024) DOI: 10.11834/jig.221183
      Precise positioning method of deformed workpiece by fusing edge and grayscale features
      摘要:ObjectiveIn industrial robot vision, accurately detecting deformed workpieces caused by assembly, stamping, or lamination is often necessary. These workpieces sometimes show non-rigid characteristics, such as dislocation or twist deformation. Most features do not maintain good shape consistency, while the remaining undeformed parts are generally simple, for example, sparse edges, which are not globally unique. In addition, obtaining massive training samples before workpieces are mass produced is not realistic. Hence, some commonly used object detection methods have insufficient accuracy or weak robustness, challenging meeting the actual needs.MethodTo address the problem, a two-stage precise positioning method of deformed workpieces by fusing edge and grayscale features is proposed. The first stage is the coarse position detection of deformed targets based on grayscale features, and the second stage is precise positioning based on shape features. The innovation of the first stage lies in that a multi normalized cross correlation (MNCC) matching method is proposed, which includes offline and online parts. In the offline part, the grayscale clustering algorithm at cosine distance is used to obtain the class-mean template, which characterizes a class center of similar features in the target deformation space. Therefore, fewer class-mean templates can be used to represent the grayscale features of the target’s deformation after discretization. In the online part, by sliding window combined with pyramid tracking, the class-mean template is searched preferentially from top to bottom to acquire the class-mean candidates. Then, a detailed search of the candidates within the class is carried out to obtain the best match, achieve the efficient matching of deformed workpieces, and complete the task of coarse position detection during the first stage. A truncated shape-based matching (T-SBM) method is proposed in the second stage to achieve precise positioning using the target edge. By changing the similarity measurement based on the gradient’s inner product, the gradient vector of opposite direction is truncated, so no negative evaluation of the local edge points exists. The improvement restricts the negative contribution on the overall similarity score when the local gradient direction is inverted due to the inconsistency of the target background. The simple representations of sparse edges leading to low robustness are prevented. Finally, a 2D Gaussian conditional density evaluation is proposed to combine grayscale features, shape features, and deformation quantity reasonably. The candidate with the top score wins the best match. The proposed evaluation provides a comprehensive estimation for the ideal target position detection under the 3-sigma principle, realizes the precise positioning of the deformed workpiece, and completes the sequential detection.ResultIn the experimental part, the proposed method is compared with classical shape-based matching (SBM), normalized cross-correlation (NCC), linearizing the memory 2D (LINE2D), and you only look once version 5 small (YOLOv5s) on 472 authentic industrial images consisting of five types of workpieces, namely, TV back, led panel, screw hole, metal tray, and aluminum plate. Industry vision software, HALCON, provides the implementation of SBM and NCC, and LINE2D is from OpenCV. 
The evaluation contains the F1-score, recall, detection accuracy, and average pixel distance, where the first three measure detection robustness and the last measures positioning accuracy. At an intersection over union (IoU) of 0.9, a strict enough threshold for precise positioning, the average detection accuracy and the F1-score of the proposed method are 81.7% and 95%, respectively, improvements of 10.8% and 8.3% over the other test methods. When the minscore threshold is less than 0.8, the recall of the proposed method is slightly inferior to that of the NCC method. However, when the minscore is greater than 0.8, a commonly used threshold interval, the proposed method substantially outperforms the other methods. In terms of average positioning accuracy, the positioning error based on the Euclidean distance of the proposed method is as low as 2.44 pixels at the IoU threshold of 0.9, which is much better than that of the other test methods. Conclusion: A two-stage precise positioning method for deformed workpieces made by assembly, stamping, or lamination is proposed. In the experiments, the proposed method outperforms the other test methods in detection robustness and positioning accuracy, which shows it is suitable for precisely positioning deformed workpieces in industrial scenes.
      关键词:machine vision;target positioning;two-stage detection;normalized cross correlation matching;shape-based matching (SBM)   
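      The truncation idea behind T-SBM, suppressing negative gradient inner products so that locally inverted contrast cannot penalize the overall similarity, can be expressed in a few lines. The sketch below assumes unit gradient vectors already sampled at corresponding positions and omits the pyramid search and the 2D Gaussian candidate evaluation described in the abstract.

import numpy as np

def truncated_shape_score(template_grads, image_grads):
    """Similarity between the unit gradient vectors of template edge points and the
    image gradients sampled at the corresponding positions. Negative inner products
    (locally inverted gradient direction) are truncated to zero instead of being
    allowed to lower the overall score. Both inputs: (N, 2) arrays of unit vectors."""
    dots = np.sum(template_grads * image_grads, axis=1)
    return float(np.mean(np.maximum(dots, 0.0)))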

      Image Understanding and Computer Vision

    • Sun Liujie,Zeng Tengfei,Fan Jingxing,Wang Wenju
      Vol. 29, Issue 1, Pages: 205-217(2024) DOI: 10.11834/jig.220943
      Double-view feature fusion network for LiDAR semantic segmentation
      摘要:ObjectivePoint cloud semantic segmentation, as the basic technology of 3D point cloud data target detection, point cloud classification, and other projects, is an important part of the current 3D computer vision. At the same time, point cloud segmentation technology is the key for the computer to understand scenes, and it has been widely used in many fields such as autonomous driving, robotics, and augmented reality. Point cloud semantic segmentation refers to the point-by-point classification operation of points in the point cloud scene, that is, to judge the category of each point in the point cloud and finally segment and integrate accordingly. Generally, point cloud semantic segmentation technology can be divided into two categories according to different application scenarios: small-scale point cloud semantic segmentation and large-scale point cloud semantic segmentation. Small-scale point cloud semantic segmentation only performs semantic segmentation operations on indoor point cloud scenes or small-scale point cloud scenes, whereas the large-scale point cloud semantic segmentation replaces the deployment environment of the algorithm with outdoor large-scale point cloud data. Classification and integration for point clouds are usually performed on driving scenes or urban scenes. Compared with the point cloud semantic segmentation of small scenes, the semantic segmentation of large-scale point clouds has a wider range of applications and is extensively used in driving scene understanding, urban scene reconstruction, and other fields. However, due to the large amount of data and the complexity of point cloud data, the task of semantic segmentation for point cloud in large scenes is more difficult. To improve the extraction quality of point features in a large-scale point cloud, a semantic segmentation method based on double-view feature fusion network for LiDAR semantic segmentation is proposed.MethodOur method is composed of two parts, double-view feature fusion module and feature integration based on asymmetric convolution. In the down sampling stage, a double-view feature fusion module, which includes a double-view point cloud feature extraction module and a feature fusion block, is suggested. The double-view feature fusion module combines the cylindrical feature with the global feature of key points to reduce the feature loss caused by downsampling. The features in different views of the point cloud are combined by feature splicing in this module. Finally, the combined point cloud features are placed into the feature fusion block for feature dimensionality reduction and fusion. In the feature integration stage, a point cloud feature integration module is proposed based on asymmetric convolution, including asymmetric convolution and multiscale dimension-decomposition context modeling, achieving the enhancement and reconstruction of point cloud features by the operation of asymmetric point cloud feature processing and multi-scale context feature integration. The feature integration processes the double-view feature by asymmetric convolution and then uses multi-dimensional convolution and multiscale feature integration for feature optimization.ResultIn our experimental environment, our algorithm has the second-highest frequency weighted intersection over union accuracy rate and the highest mean intersection over union (mIoU) accuracy rate among recent algorithms. 
Our work focuses on the improvement of segmentation accuracy and achieves the highest segmentation accuracy in multiple categories. In vehicle categories such as cars and trucks, our method achieves a high segmentation accuracy. In categories such as bicycles, motorcycles, and pedestrians with small individuals and complex shapes, our method performs better than other methods. In buildings, railings, vegetation, and other categories that are at the edge of the point cloud scene and where the point cloud distribution is relatively sparse, the double-view feature fusion module in our method not only retains the geometric structure of the point cloud but also extracts the global features of the data, thereby achieving the high-precision segmentation of these categories. Our method achieves 63.9% mIoU on the SemanticKITTI dataset and leads the open-source segmentation methods. Compared with CA3DCNet, our method also achieves better segmentation results on the nuScenes dataset, and the mIoU accuracy is improved by 0.7%.ConclusionOur method achieves a high-precision semantic segmentation for a large-scale point cloud. A double-view feature fusion network is proposed for LiDAR semantic segmentation, which is suitable for the segmentation for a large-scale point cloud. Experimental results show the double-view feature fusion module can reduce the loss of edge information in a large-scale point cloud, thereby improving the segmentation accuracy of edge object in the scene. Experiments prove the feature integration module based on asymmetric convolution can effectively segment small-sized objects, such as pedestrians, bicycles, motorcycles, and other categories. Our method is compared with a variety of semantic segmentation methods for large-scale point cloud. In terms of accuracy, our method performs better and achieves an average segmentation accuracy of 63.9% on the SemanticKITTI dataset.  
      关键词:deep learning;semantic segmentation;point cloud;cylindrical voxel;context information   
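      The feature integration stage is described as being built on asymmetric convolution. As a rough illustration only, the block below replaces a dense 3D convolution with a pair of asymmetric kernels plus a residual connection; the kernel shapes, the use of dense rather than sparse convolution, and the cylindrical-voxel layout of the example tensor are assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class AsymmetricBlock3D(nn.Module):
    """Replace a dense 3 x 3 x 3 convolution with a pair of asymmetric kernels
    (3 x 1 x 3 and 1 x 3 x 1 here) plus a residual connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv_a = nn.Conv3d(channels, channels, (3, 1, 3), padding=(1, 0, 1))
        self.conv_b = nn.Conv3d(channels, channels, (1, 3, 1), padding=(0, 1, 0))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.conv_b(self.act(self.conv_a(x))))

voxels = torch.randn(1, 32, 16, 64, 64)  # (batch, channels, height, radius, angle)
out = AsymmetricBlock3D(32)(voxels)      # same shape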
    • Wan Junhui,Liu Xinpu,Chen Lili,Ao Sheng,Zhang Peng,Guo Yulan
      Vol. 29, Issue 1, Pages: 218-230(2024) DOI: 10.11834/jig.230106
      Geometric attribute-guided 3D semantic instance reconstruction
      摘要:ObjectiveThe objective of 3D vision is to capture the geometric and optical features of the real world from multiple perspectives and convert this information into digital form, enabling computers to understand and process it. 3D vision is an important aspect of computer graphics. Nonetheless, sensors can only provide partial observations of the world due to viewpoint occlusion, sparse sensing, and measurement noise, resulting in a partial and incomplete representation of a scene. Semantic instance reconstruction is proposed to solve this problem. It converts 2D/3D data obtained from multiple sensors into a semantic representation of the scene, including modeling each object instance in the scene. Machine learning and computer vision techniques are applied to achieve high-precision reconstruction results, and point cloud-based methods have demonstrated prominent advantages. However, existing methods disregard prior geometric and semantic information of objects, and the subsequent simple max-pooling operation loses key structural information of objects, resulting in poor instance reconstruction performance.MethodIn this study, a geometric attribute-guided semantic instance reconstruction network (GANet), which consists of a 3D object detector, a spatial Transformer, and a mesh generator, is proposed. We design the spatial Transformer to utilize the geometric and semantic information of instances. After obtaining the 3D bounding box information of instances in the scene, box sampling is used to obtain the real local point cloud of each target instance in the scene on the basis of the instance scale information, and then semantic information is embedded for foreground point segmentation. Compared with ball sampling, box sampling reduces noise and obtains more effective information. Then, the encoder’s feature embedding and Transformer layers extract rich and crucial detailed geometric information of objects from coarse to fine to obtain the corresponding local features. The feature embedding layer also utilizes the prior semantic information of objects to help the algorithm quickly approximate the target shape. The attention module in the Transformer integrates the correlation information between points. The algorithm also uses the object’s global features provided by the detector. Considering the inconsistency between the scene space and the canonical space, a designed feature space Transformer is used to align the object’s global features. Finally, the fused features are sent to the mesh generator for mesh reconstruction. The loss function of GANet consists of two parts: detection and shape losses. Detection loss is the weighted sum of the instance confidence, semantic classification, and bounding box estimation losses. Shape loss consists of three parts: Kullback-Leibler divergence between the predicted and standard normal distributions, foreground point segmentation loss, and occupancy point estimation loss. Occupancy point estimation loss is the cross-entropy between the predicted occupancy value of the spatial candidate points and the real occupancy value.ResultThe experiment was compared with the latest methods on the ScanNet v2 datasets. The algorithm utilized computer aided design (CAD) models provided by Scan2CAD, which included 8 categories, as ground truth for training. The mean average precision of semantic instance reconstruction increased by 8% compared with the second-ranked method, i.e., RfD-Net. 
The average precision of bathtub, trash bin, sofa, chair, and cabinet is better than that from RfD-Net. In accordance with the visualization results, GANet can reconstruct object models that are more in line with the scene. Ablation experiments were also conducted on the corresponding dataset. The performance of the complete network was better than the other four networks, which included a GANet that replaced ball sampling with box sampling, replaced the Transformer with PointNet, and removed the semantic embedding of point cloud features and feature transformation. The experimental results indicate that box sampling obtains more effective local point cloud information, the Transformer-based point cloud encoder enables the network to use more critical local structural information of the foreground point cloud during reconstruction, and semantic embedding provides prior information for instance reconstruction. Feature space transformation aligns the global prior information of an object, further improving the reconstruction effect.ConclusionIn this study, we proposed a geometric attribute-guided network. This network considers the complexity of scene objects and can better utilize the geometric and attribute information of objects. The experiment results show that our network outperforms several state-of-the-art approaches. Current 3D-based semantic instance reconstruction algorithms have achieved good results, but acquiring and annotating 3D data are still relatively expensive. Future research can focus on how to use 2D data to assist in semantic instance reconstruction.  
      关键词:scene reconstruction;three-dimensional point cloud;semantic instance reconstruction;mesh generation;object detection   
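      Box sampling, selecting the raw points that fall inside a detected 3D bounding box, is central to the local branch described above. A minimal NumPy sketch for the axis-aligned case is given below; the paper's boxes come from a 3D detector and are generally oriented, so an oriented version would first rotate the points into the box frame.

import numpy as np

def box_sample(points, center, size):
    """Return the points falling inside an axis-aligned 3D bounding box.
    points: (N, 3); center and size: length-3 sequences."""
    local = points - np.asarray(center, dtype=float)
    mask = np.all(np.abs(local) <= np.asarray(size, dtype=float) / 2.0, axis=1)
    return points[mask], mask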
    • Xiao Zhaolin,Su Zhan,Zuo Fengyuan,Jin Haiyan
      Vol. 29, Issue 1, Pages: 231-242(2024) DOI: 10.11834/jig.230093
      Low-light optical flow estimation with hidden feature supervision using a Siamese network
      摘要:ObjectiveOptical flow estimation has been widely used in target tracking, video time-domain super-resolution, behavior recognition, scene depth estimation, and other vision applications. However, imaging under low-light conditions can hardly avoid low signal-to-noise ratio and motion blur, making low-light optical flow estimation very challenging. Applying a pre-stage low-light image enhancement can effectively improve the image visual perception, but it may not be helpful for further optical flow estimation. Unlike the “low light enhancement first and optical flow estimation next” strategy, the low-light image enhancement should be considered with the optical flow estimation simultaneously to prevent the loss of scene motion information. The optical flow features are encoded into the latent space, which enables supervised feature learning for paired low-light and normal-light datasets. This paper also reveals the post-task-oriented feature enhancement outperforms the general visual enhancement of low-light images. The main contributions of this paper can be summarized as follows: 1) A dual-branch Siamese network framework is proposed for low-light optical flow estimation. A weight-sharing block is used to establish the correlation of motion features between low-light images and normal-light images. 2) An iterative low-light flow estimation module, which can be supervised using normal-light hidden features, is proposed. Our solution is free of explicit enhancement of low-light images.MethodThis paper proposes a dual-branch Siamese network to encode low-light and the normal-light optical flow features. Then, the encoded features are used to estimate the optical flow in a supervised manner. Our dual-branch feature extractor is constructed using a weight-sharing block, which encodes the motion features. Importantly, our algorithm does not need a pre-stage low-light enhancement, which is usually employed in most existing optical flow estimations. To overcome the high spatial-temporal computational complexity, this paper proposes to compute the K-nearest neighbor correlation volume instead of the 4D all-pair correlation volume. To fuse local and global motion features better, an attention mechanism for the 2D motion feature aggregation is introduced. After the feature extraction, a discriminator is used to distinguish the low-light image features from the normal-light image features. The feature extractor training is completed when the discriminator is incapable to recognize the two. To avoid the explicit enhancement of low-light images, the final optical flow estimation module is composed of a feature enhancement block and a gated recurrent unit (GRU). In an iterative way, the optical flow is decoded from the enhanced feature in the block. A latent feature supervised loss and an iterative similarity loss are used to keep the convergence of the training stage. In the experiment part, the network is trained on an NVIDIA GeForce RTX 3080Ti GPU. The input images are uniformly cropped to 496 × 368 pixels in spatial resolution. Because the low-light and normal-light image paired datasets are limited, the flying chairs dark noise (FCDN) and the various brightness optical flow (VBOF) datasets are jointly used for the model training.ResultThe proposed algorithm is compared with three state-of-the-art optical flow estimation models on several low-light datasets and normal-light datasets, including FCDN, VBOF, Sintel, and KITTI datasets. 
Besides the visual comparison, quantitative evaluation with the end-point-error (EPE) metric is conducted. Experimental results show the proposed method achieves a performance comparable with the best available optical flow estimation under normal illumination conditions. The proposed solution improves up to 0.16 in terms of the EPE index on the FCDN dataset compared with the second-best solution under the low-light condition. On the VBOF dataset, the proposed solution improves 0.08 in terms of the EPE index compared with the second-best algorithm. Visual comparisons with all the compared methods are also provided. The results show the proposed model preserves more accurate details than other optical flow estimations, especially under low-light conditions.ConclusionIn this paper, a dual-branch Siamese network is proposed for realizing the accurate encoding of the optical flow features under normal-light and low-light conditions. The feature extractor is constructed with a weight-sharing block, which enables better-supervised learning for low-light optical flow estimation. The proposed model has remarkable advantages in accuracy and generalizability for the flow estimation. The experimental results indicate the proposed supervised low-light flow estimation outperforms the state-of-the-art solutions in terms of precision.  
      关键词:optical flow estimation;siamese network;correlation volume;global motion aggregation;low-light image enhancement   
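      The K-nearest-neighbor correlation volume mentioned above keeps only the strongest matches for each source feature instead of the full 4D all-pair volume. The sketch below shows the idea in PyTorch by computing the dense correlation matrix and then taking a top-k per row; the normalization, the value of k, and the dense intermediate step are illustrative assumptions rather than the authors' memory-efficient implementation.

import torch

def knn_correlation(f1, f2, k=32):
    """For each feature vector of frame 1, keep only the k largest correlations
    with frame-2 features instead of storing the full all-pair volume.
    f1, f2: (C, H, W) feature maps; returns values and flat indices, each (H*W, k)."""
    c = f1.shape[0]
    a = f1.reshape(c, -1).t()            # (H*W, C)
    b = f2.reshape(c, -1)                # (C, H*W)
    corr = (a @ b) / c ** 0.5            # dense correlation, built only transiently
    return torch.topk(corr, k, dim=1)

vals, idx = knn_correlation(torch.randn(64, 46, 62), torch.randn(64, 46, 62))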

      Computer Graphics

    • Ren Jingwen,Dai Junfei,Lin Hongwei
      Vol. 29, Issue 1, Pages: 243-255(2024) DOI: 10.11834/jig.221199
      Simulation of cloth with thickness based on isogeometric continuum elastic model
      摘要:Objective: Cloth simulation is a research hotspot and difficulty in the field of computer animation. Cloth simulation can be seen in a variety of topics such as visual effects, game development, industrial design, and interactive virtual environments. With the demand for high-quality experience from users today, various models have been proposed to improve simulation performance. Although the models based on the particle system are fast and efficient, they have difficulty accurately capturing behaviors in accordance with the real physical properties of cloth. These physical properties can be described by the elastic model of the continuum employing the finite element method (FEM). However, solving with FEM in cloth simulation requires a large number of degrees of freedom (or elements), and it is much more complex and time-consuming. Therefore, existing methods usually model cloth as a surface or a shell, which leads to a weak ability to simulate thick cloth. To ease the awkwardness of compromising among geometric modeling, physical authenticity, and computation speed in these models, a new cloth simulation model with thickness is proposed, which describes the deformation behavior of cloth with different thicknesses more appropriately, and a fast dynamic physically based cloth simulation algorithm is carried out by isogeometric analysis (IGA). IGA treats the physical domain (the geometry) as the computational field, avoiding the mesh generation that introduces approximation error and is time-consuming in classical FEM. IGA uses nonuniform rational B-spline (NURBS) basis functions for the physical domain and the solution field, which has the merit of a higher-order continuous solution compared with the traditional linear basis. The direct computation on the control mesh of the physical domain makes solving the physical problems more accurate and faster. Method: The thick cloth is initially modeled as a very thin plate expressed by a trivariate B-spline solid. The weft direction and the warp direction of the fabric are free to design, while the basis for the thickness direction is usually linear to decrease the degrees of freedom, or higher order for thicker cloth. The deformations of the B-spline solid with elasticity represent the displacement behaviors of the cloth. Focusing on the IGA-Galerkin method, the weak form of the linear elastic equations of the 3D continuum is derived under the given boundary conditions. Then, the integrals in the weak form are computed by Gauss quadrature. By assembling the global stiffness matrix from the local element matrices, a linear system is yielded. The Dirichlet boundary conditions are dealt with by Gauss elimination, and the preprocessing of the matching between the index of the local basis and the index of the global basis is also needed. The control mesh is simulated and analyzed as the computing mesh, so the unknowns in the linear system are the control coefficients in the B-spline-expressed solution of the displacement. The damping behavior caused by the dissipation of the energy of the system is modeled as a damping coefficient on the velocity of the control points to simplify the simulation. Considering the dynamic process, the time integration is realized by the Newmark implicit method to allow a larger time step and enhance the stability of the system.
Finally, the linear equations are solved directly due to the less degrees of freedom compared with other models, and the displacements, velocities, and accelerations of the control points of the cloth are updated for each time step. The current state of the cloth can be visualized through the current positions of the control points.ResultOur IGA continuum elastic cloth model is discussed in various aspects. First, the smoothness of the simulation results is compared with commonly used discrete models, which displays remarkable smoothness advantage due to NURBS construction, and the computational time is compared with the classical finite element continuum model at different degrees of freedom, which shows that when the root mean squared error (RMSE) of the simulation results of the two models is less than 0.04, the method can reduce at most 90.23% of the degrees of freedom and 99.43% of the computational time. Compared with the continuum shell-based model of the same thickness, the computational time can be improved by about 30%. Second, for classic scenarios such as hanging cloth, falling flag, and contact problems, realistic, fast dynamic simulation effects are achieved. In addition, the influences of the density of the control mesh, the order of the basis function, and the selection of physical parameters on the simulation effect are demonstrated and discussed. Using higher-resolution control mesh or higher-order basis with appropriate geometric and physical parameters promotes more detailed simulation effects.ConclusionIn this paper, an IGA-Galerkin-based cloth model with thickness is proposed to improve the physically based simulation, which is very intuitive and easy to implement. The trivariate B-splines-expressed model of a very thin plate can keep the smoothness of the cloth and uses less degrees of freedom and elements. The focus on solving the elastic equilibrium equations of the continuum enables matching the simulated cloth with the fabrics in the real world. The proposed IGA-Galerkin cloth model is an effective approach to meet the basic requirements of simulation accuracy and speed, which achieves a higher dynamic physical simulation efficiency.  
      关键词:isogeometric analysis (IGA);finite element method (FEM);elastic mechanics;physically based simulation;cloth simulation   
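      The dynamics are advanced with the implicit Newmark method. The routine below is a textbook Newmark-beta step for the linear system M a + C v + K u = f, written with dense NumPy matrices; in the paper the damping enters as a coefficient on control-point velocities and the stiffness comes from the IGA-Galerkin assembly, so the generic M, C, and K here are placeholders rather than the authors' assembled system.

import numpy as np

def newmark_step(M, C, K, f_next, u, v, a, dt, beta=0.25, gamma=0.5):
    """One implicit Newmark step for the linear system M a + C v + K u = f.
    Returns the displacement, velocity and acceleration at the next time step."""
    a0 = 1.0 / (beta * dt * dt)
    a1 = gamma / (beta * dt)
    a2 = 1.0 / (beta * dt)
    a3 = 1.0 / (2.0 * beta) - 1.0
    a4 = gamma / beta - 1.0
    a5 = dt * (gamma / (2.0 * beta) - 1.0)
    K_eff = K + a0 * M + a1 * C
    rhs = f_next + M @ (a0 * u + a2 * v + a3 * a) + C @ (a1 * u + a4 * v + a5 * a)
    u_new = np.linalg.solve(K_eff, rhs)
    a_new = a0 * (u_new - u) - a2 * v - a3 * a
    v_new = v + dt * ((1.0 - gamma) * a + gamma * a_new)
    return u_new, v_new, a_new

# example: a single damped oscillator, one step of size 0.01 s
M, C, K = np.array([[1.0]]), np.array([[0.1]]), np.array([[100.0]])
u, v, a = newmark_step(M, C, K, np.array([0.0]), np.array([1.0]),
                       np.array([0.0]), np.array([-100.0]), 0.01)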

      Medical Image Processing

    • Zhao Xiaoming,Liao Yuehui,Zhang Shiqing,Fang Jiangxiong,He Xiaxia,Wang Guoyu,Lu Hongsheng
      Vol. 29, Issue 1, Pages: 256-267(2024) DOI: 10.11834/jig.230092
      Method of classifying benign and malignant breast tumors by DCE-MRI incorporating local and global features
      摘要:ObjectiveAmong women in the United States, breast cancer (BC) is the most frequently detected type of cancer, except for nonmelanoma skin cancer, and it is the second-highest cause of cancer-related deaths in women, following lung cancer. Breast cancer cases have been on the rise in the past few years, but the number of deaths caused by breast cancer has either remained steady or decreased. This outcome could be due to improved early detection techniques and more effective treatment options. Magnetic resonance imaging (MRI), especially dynamic contrast-enhanced (DCE)-MRI, has shown promising results in the screening of women with a high risk of breast cancer and in determining the stage of breast cancer in newly diagnosed patients. As a result, MRI, especially DCE-MRI, is becoming increasingly recognized as a valuable adjunct diagnostic tool for the timely detection of breast cancer. With the development of artificial intelligence, many deep learning models based on convolutional neural network (CNN) have been widely used in medical image analysis such as VGG and ResNet. These models can automatically extract deep features from images, eliminating the need for hand-crafted feature extraction and saving much time and effort. However, CNN cannot obtain global information, and global information of medical images is very useful for the diagnosis of breast tumors. To acquire global information, the vision Transformer (ViT) has been proposed and achieved magnificent results in computer vision tasks. ViT uses convolution operation to separate the entire input image into many small image patches. Then, ViT can simultaneously process these image patches by multihead self-attention layers and capture global information in different regions of the entire input image. However, ViT inevitably loses local information while capturing global information. To integrate the advantages of CNN and ViT, studies have been proposed to combine the advantages of CNN and ViT to obtain more comprehensive feature representations for achieving better performance in breast tumor diagnosis tasks.MethodBased on the above observations and inspired by integrating the CNN and ViT, a novel cross-attention fusion network is proposed based on CNN and ViT, which can simultaneously extract local detail information from CNN and global information from ViT. Then, a nonlocal block is used to fuse this information to classify breast tumor DCE-MR images. The model structure mainly contains three parts: local CNN and global ViT branches, feature coupling unit (FCU), and cross-attention fusion. The CNN subnetwork uses SENet for capturing local information, and the ViT subnetwork captures global information. For the extracted feature maps from these two branches, their feature dimensions are usually different. To address this issue, an FCU is adopted to eliminate feature dimension misalignment between these two branches. Finally, the nonlocal block is used to compute the correspondences on the two different inputs. The former two stages (stage-1 and stage-2) of SENet50 as our local CNN subnetwork and a 7-layer ViT (ViT-7) as our global subnetwork are adopted. Each stage in SENet50 is composed of some residual blocks and SEblocks. Each residual block contains a 1×1 convolution layer, a 3×3 convolution layer, and a 1×1 convolution layer. Each SEblocks contains a global average pooling layer, two fully connected (FC) layers, and a sigmoid activation function. 
Here, it is separately set to be 3 in stage-1 and 4 in stage-2 for the number of residual blocks and SEblocks. The 7-layer ViT contains seven encoder layers, which include two LayerNorms, a multihead self-attention module, and a simple multi-layer perception (MLP) block. The FCU contains a 1 × 1 convolution, a BatchNorm layer, and a nearest neighbor Interpolation. The nonlocal block consists of four 1 × 1 convolutions and a softmax function.ResultThe model performance is compared with the five other deep learning models such as VGG16, ResNet50, SENet50, ViT and Swin-S (swin-Transformer-small), and two sets of experiments that use different breast tumor DCE-MRI sequences are conducted to evaluate the robustness and generalization of the model. The quantitative evaluation metrics contain accuracy and area under the receiver operating characteristic (ROC) curve (AUC). Compared with VGG16 and ResNet50 in two sets of experiments, the accuracy increases by 3.7%, 3.6% and AUC increases by 0.045, 0.035 on average, respectively. Compared with SENet50 and ViT-7 in two sets of experiments, the accuracy increases by 3.2% and 1.1%, and AUC increases by 0.035 and 0.025 on average, respectively. Compared with Swin-S in two sets of experiments, the accuracy increases by 3.0% and 2.6%, and AUC increases by 0.05 and 0.03. In addition, the class activation map of learned feature representations of models is generated to increase the interpretability of the models. Finally, a series of ablation experiments is conducted to prove the effectiveness of our proposed method. Specially, different fusion methods such as feature- and decision-level fusion are compared with our cross-attention fusion module. Compared with the feature-level fusion method in two sets of experiments, the accuracy increases by 1.6% and 1.3%, and AUC increases by 0.03 and 0.02. Compared with the decision-level fusion method in two sets of experiments, the accuracy increases by 0.7% and 1.8%, and AUC increases by 0.02 and 0.04. In the end, comparative experiments with three recent methods such as RegNet, ConvNext, and MobileViT are also performed. Experimental results fully demonstrate the effectiveness of our method in the breast tumor DCE-MR image classification task.ConclusionIn this paper, a novel cross-attention fusion network based on local CNN and global ViT (LG-CAFN) is proposed for the benign and malignant tumor classification of breast DCE-MR images. Extensive experiments demonstrate the superior performance of our method compared with several state-of-the-art methods. Although the LG-CAFN model is only used for the diagnosis of breast tumor DCE-MR images, this approach can be very easily transferred to other medical image diagnostic tasks. Therefore, in future work, our approach will be extended to other medical image diagnostic tasks, such as breast ultrasound images and breast CT images. In addition, automatic segmentation tasks for breast DCE-MR images will be explored to analyze breast DCE-MR images more comprehensively and to help radiologists make more accurate diagnoses.  
      关键词:breast tumor;dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI);vision Transformer (ViT);convolutional neural network (CNN);attention fusion   
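      The feature coupling unit (FCU) is described as a 1 × 1 convolution, BatchNorm, and nearest-neighbor interpolation used to align CNN feature maps with ViT tokens. A minimal PyTorch sketch of such a unit follows; the channel counts, the token-grid size, the flattening order, and the absence of a class token are assumptions for illustration, not the exact LG-CAFN module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureCouplingUnit(nn.Module):
    """Align a CNN feature map (B, C, H, W) with ViT patch tokens (B, N, D):
    a 1 x 1 convolution changes the channel count, BatchNorm regularizes, and
    nearest-neighbour interpolation matches the token-grid resolution before
    flattening (any class token is ignored here)."""
    def __init__(self, in_channels, embed_dim, token_hw):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=1)
        self.bn = nn.BatchNorm2d(embed_dim)
        self.token_hw = token_hw

    def forward(self, x):
        x = self.bn(self.proj(x))
        x = F.interpolate(x, size=self.token_hw, mode="nearest")
        return x.flatten(2).transpose(1, 2)   # (B, h*w, embed_dim)

tokens = FeatureCouplingUnit(256, 384, (14, 14))(torch.randn(2, 256, 28, 28))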
    • Jiang Ting,Li Xiaoning
      Vol. 29, Issue 1, Pages: 268-279(2024) DOI: 10.11834/jig.221032
      Segmentation of abdominal CT and cardiac MR images with multi scale visual attention
      摘要:Objective: Medical image segmentation is one of the important steps in computer-aided diagnosis and surgery planning. However, due to the complex, diverse structure of various human organs, blurred tissue edges, varying sizes, and other problems, the segmentation performance is poor and the segmentation effect needs to be further improved, while more accurate segmentation can more effectively help doctors carry out treatment and provide advice. Recently, deep-learning-based methods have become a hot spot in medical image segmentation research. The vision Transformer (ViT), which has achieved great success in the field of natural language processing, has also flourished in the field of computer vision; therefore, it is favored by medical image segmentation researchers. However, current medical image segmentation networks based on ViT flatten image features into 1D sequences, ignoring the 2D structure of images and the connections within it. Moreover, the quadratic computational complexity of the multihead self-attention (MHSA) mechanism of ViT increases the required computational overhead. Method: To address the above problems, this paper proposes MSVA-TransUNet, a U-shaped network structure with a Transformer as the backbone network based on multiscale visual attention, an attention mechanism implemented by multiple strip convolutions. The structure is similar to the multihead attention mechanism in that it uses convolutional operations to obtain long-distance dependencies. First, the network uses convolution kernels of different sizes to extract features at dissimilar scales: a pair of strip convolution operations approximates a large kernel convolution, and strip convolutions of dissimilar sizes approximate diverse large kernel convolutions, which can capture local information using convolution, while large convolution kernels can also learn the long-distance dependence of images. Second, strip convolution is a lightweight convolution, which can remarkably reduce the number of parameters and floating-point operations of the network and lower the required computational overhead, because the computational overhead of convolution is much smaller than the overhead required by the quadratic computational complexity of multihead attention. Further, it avoids converting the image into a 1D sequence for input to the vision Transformer and makes full use of the 2D structure of the image to learn image features. Finally, replacing the first patch embedding in the encoding stage with a convolution stem avoids directly converting low channel counts to high channel counts, which runs counter to the typical structure of convolutional neural networks (CNNs), while retaining the structure of the patch embeddings elsewhere. Result: Experimental results on the abdominal multiorgan segmentation dataset (mainly containing eight organs) and the heart segmentation dataset (comprising three parts of the heart) show the segmentation accuracy of the proposed network is improved compared with the baseline model. The average Dice on the abdominal multiorgan segmentation dataset is improved by 3.74%, and the average Dice on the heart segmentation dataset is improved by 1.58%. The floating-point operations and number of parameters are reduced compared with the MHSA mechanism and the large kernel convolution.
The floating-point operations of the multiscale visual attention mechanism are about 1/278 of those of the multihead self-attention mechanism, and the network has 15.31 M parameters, about 1/6.88 of TransUNet's. Conclusion Experimental results show that the proposed network is comparable to, or even exceeds, current state-of-the-art networks. The multiscale visual attention mechanism is used instead of multihead self-attention; it can likewise capture long-distance relationships and extract long-distance image features, and it improves segmentation performance while reducing computational overhead, so the proposed network exhibits clear advantages. However, because some organs are small and lie in particular locations, the network's feature learning ability for these organs is insufficient, and their segmentation accuracy still needs further improvement; we will continue to study how to improve segmentation performance for these organs. The code of this paper will be open-sourced soon: https://github.com/BeautySilly/VA-TransUNet.
      关键词:medical image segmentation;visual attention;Transformer;attention mechanism;abdominal multi-organ segmentation;cardiac segmentation   
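To make the multiscale visual attention idea concrete, below is a minimal PyTorch sketch of an attention block built from pairs of strip (1×k and k×1) convolutions, as described in the abstract. The module name, the kernel sizes (7, 11, 21), and the gating-by-multiplication design are illustrative assumptions, not the authors' released MSVA-TransUNet code.

```python
# Minimal sketch of multiscale visual attention via strip convolutions.
# Kernel sizes and gating design are assumptions for illustration only.
import torch
import torch.nn as nn


class MultiScaleVisualAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Depthwise convolution gathers local context before the multiscale branches.
        self.local = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # Each branch approximates a large k x k kernel with a 1 x k plus k x 1 pair,
        # keeping the cost O(k) in weights instead of O(k^2).
        self.branches = nn.ModuleList()
        for k in (7, 11, 21):  # assumed kernel sizes for the different scales
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
            ))
        # 1x1 convolution mixes channels and produces the attention map.
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.local(x)
        attn = attn + sum(branch(attn) for branch in self.branches)
        attn = self.proj(attn)
        # The attention map acts as an element-wise gate, playing the role that
        # the softmax-weighted sum plays in multihead self-attention.
        return attn * x


if __name__ == "__main__":
    x = torch.randn(1, 64, 56, 56)
    print(MultiScaleVisualAttention(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```

Because every convolution here is depthwise and separable into strips, the cost grows linearly with the kernel length, which is the source of the parameter and FLOP savings claimed relative to quadratic-complexity self-attention.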
      发布时间:2024-01-16

      Remote Sensing Image Processing

    • Tu Kun,Xiong Fengchao,Fu Guanyiman,Lu Jianfeng
      Vol. 29, Issue 1, Pages: 280-292(2024) DOI: 10.11834/jig.221109
      Multitask hyperspectral image convolutional sparse coding-denoising network
摘要:Objective Hyperspectral images (HSIs) are contaminated by noise arising from the imaging mechanism, equipment errors, and the imaging environment. Because sensors have different sensitivities at different wavelengths, the noise intensities among bands differ; that is, the noise is spectrally non-independent and non-identically distributed. Noise interference greatly limits the interpretation and application of HSIs, so HSI denoising is an indispensable preprocessing step for improving their utility. Sparse-representation (SR)-based methods assume that a clean HSI is structured and can be linearly represented by a few atoms in a dictionary, whereas structureless random noise cannot. However, most SR-based methods follow a pipeline that breaks the complete HSI into many overlapping small local patches, sparsely represents each patch independently, and averages the overlapping pixels between patches to recover the HSI globally. Such a "local-global" denoising mechanism ignores the dependencies between overlapping patches, reducing denoising effectiveness and producing visual defects. In contrast, convolutional sparse coding (CSC) employs convolution kernels as atoms and, thanks to the shift-invariant property of the convolution operator, can represent the image without patch division, so the spatial relationships between different patches are naturally retained. Inspired by this, this paper introduces a multitask convolutional sparse coding network (MTCSC-Net) for HSI denoising. Method In this paper, the denoising of an individual band is regarded as a single task, and the CSC model is used to describe the local spatial structural correlation within each band; the denoising of all bands is regarded as a multitask problem. All bands are connected by sharing the sparse coding coefficients, which depicts the global spectral correlation between bands and forms a multitask convolutional sparse coding (MTCSC) model. The MTCSC model thus jointly models the spatial-spectral relationships of HSIs. Moreover, because the MTCSC model takes the HSI as a whole, it naturally retains the spatial relationships between pixels and therefore has a strong denoising ability. Drawing on the powerful learning ability of deep learning, this paper transforms the iterative optimization of the MTCSC model into an end-to-end learnable deep neural network, MTCSC-Net, via the deep unfolding technique to further improve denoising ability and efficiency. Result Our method is evaluated on the ICVL and CAVE datasets. In both experiments, different levels of Gaussian noise are added to clean HSIs to produce noisy-clean pairs. Besides the synthetic experiments, MTCSC-Net is tested on the real-world HYDICE Urban dataset. Eight methods are selected for comparison to verify the effectiveness of the proposed denoising method. Experimental results show that the peak signal-to-noise ratio (PSNR) is improved by 1.38 dB on the CAVE dataset and by 0.64 dB on the ICVL dataset compared with the traditional patch-based SR method. The visual results show that MTCSC-Net produces cleaner spatial images and more accurate spectral reflectance that better matches the reference. Conclusion The MTCSC-Net proposed in this paper can effectively utilize the spatial-spectral correlation of HSIs and has a strong denoising ability.
      关键词:hyperspectral image (HSI);image denoising;convolutional sparse coding (CSC);multitask learning;deep unfolding   
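To illustrate the multitask CSC idea, the following is a minimal PyTorch sketch of a deep-unfolded (learned-ISTA-style) convolutional sparse coding network in which all bands share one set of sparse codes. The class name, layer count, kernel size, and untied encoder/decoder convolutions are illustrative assumptions, not the authors' actual MTCSC-Net implementation.

```python
# Minimal sketch of deep-unfolded multitask convolutional sparse coding for
# HSI denoising. All hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


def soft_threshold(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # Proximal operator of the L1 norm used by (learned) ISTA.
    return torch.sign(x) * F.relu(torch.abs(x) - theta)


class MTCSCNetSketch(nn.Module):
    def __init__(self, bands: int, atoms: int = 64, kernel: int = 3, iters: int = 4):
        super().__init__()
        pad = kernel // 2
        # decode plays the role of the convolutional dictionary D (shared codes -> bands);
        # encode approximates its adjoint D^T with untied learnable weights.
        self.decode = nn.Conv2d(atoms, bands, kernel, padding=pad, bias=False)
        self.encode = nn.Conv2d(bands, atoms, kernel, padding=pad, bias=False)
        self.theta = nn.Parameter(torch.full((iters, 1, atoms, 1, 1), 1e-2))
        self.step = nn.Parameter(torch.ones(iters))
        self.iters = iters

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        # noisy: (batch, bands, H, W); the codes z are shared by all bands,
        # which is what couples the per-band denoising tasks.
        z = soft_threshold(self.encode(noisy), self.theta[0])
        for t in range(1, self.iters):
            residual = self.decode(z) - noisy          # D z - Y
            grad = self.encode(residual)               # ~ D^T (D z - Y)
            z = soft_threshold(z - self.step[t] * grad, self.theta[t])
        return self.decode(z)                          # denoised HSI


if __name__ == "__main__":
    hsi = torch.randn(1, 31, 64, 64)                   # e.g., a 31-band CAVE-style patch
    print(MTCSCNetSketch(bands=31)(hsi).shape)         # torch.Size([1, 31, 64, 64])
```

Because the whole image passes through shift-invariant convolutions rather than being cut into patches, no overlap averaging is needed, which is the property the abstract contrasts with patch-based sparse representation.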
      发布时间:2024-01-16