Text self-training and adversarial learning-relevant domain adaptive industrial scene text detection
2024, Vol. 29, No. 10, Pages: 3090-3103
Print publication date: 2024-10-16
DOI: 10.11834/jig.230519
Lyu Xueqiang, Quan Weijie, Han Jing, Chen Yuzhong, Cai Zangtai. 2024. Text self-training and adversarial learning-relevant domain adaptive industrial scene text detection. Journal of Image and Graphics, 29(10):3090-3103
Objective
Detecting text in industrial scenes quickly can improve production efficiency and reduce costs. However, data annotation is time consuming and labor intensive, so little annotation information is available. To address the low-quality pseudo labels and large domain gap that current methods suffer from when applied to industrial data, this paper proposes a domain adaptive industrial scene text detection method that combines text self-training and adversarial learning.
Method
First, to address the low quality of pseudo labels, a teacher-student framework is adopted for text self-training: the teacher and student models apply data augmentation and mutual learning to mitigate domain shift and improve pseudo-label quality. Second, to address the domain gap, image-level and instance-level adversarial learning modules are proposed to align the feature distributions of the source and target domains so that the network learns domain-invariant features. Finally, consistency regularization between the two adversarial learning modules further narrows the domain gap and improves the model's domain adaptability.
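As a concrete illustration of the teacher-student update, the following is a minimal PyTorch-style sketch of pseudo-label self-training with an exponential moving average (EMA) teacher; the function names, momentum value, and training-loop details are illustrative assumptions rather than the paper's released implementation.

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               momentum: float = 0.999):  # momentum value is an assumption
    """Update the teacher as an exponential moving average of the student."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)

# Sketch of one self-training iteration:
# teacher = copy.deepcopy(student)           # initialize teacher from student
# pseudo = teacher(weak_aug(target_images))  # teacher labels target-domain data
# loss = det_loss(student(source_images), source_labels) \
#      + det_loss(student(strong_aug(target_images)), pseudo)
# loss.backward(); optimizer.step()
# ema_update(teacher, student)               # teacher slowly tracks the student
```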
Result
Experiments show that the proposed method reaches a precision, recall, and F1 score of 96.2%, 95.0%, and 95.6% on the industrial nameplate dataset, exceeding the baseline model by 10%, 15.3%, and 12.8%, respectively. The method also performs well on the ICDAR15 and MSRA-TD500 datasets, improving F1 by 0.9% and 3.1% over current state-of-the-art methods. In addition, after the method is applied to the EAST (efficient and accurate scene text detector) model, the three metrics on the nameplate dataset improve by 5%, 11.8%, and 9.5%, respectively.
Conclusion
The proposed method successfully narrows the gap between source-domain and target-domain data, markedly improves the model's generalization ability, and transfers well to other detectors, while adding no computational cost at the inference stage.
Objective
The surfaces of industrial equipment record important information, such as equipment model, specifications, and functions, which is crucial for equipment management and maintenance. Traditional information collection relies on workers taking photos and keeping records, which is inefficient and hardly meets current high-efficiency, low-cost production requirements. Using scene text detection technology to detect text in industrial scenes automatically can improve production efficiency and cost effectiveness, which is crucial for industrial intelligence and automation. The success of scene text detection algorithms relies heavily on the availability of large-scale, high-quality annotated data. However, in industrial scenarios, data collection and annotation are time consuming and labor intensive, resulting in small amounts of data with no annotation information and severely limiting model performance. Furthermore, substantial domain gaps exist between the "source domain" (public data) and the "target domain" (industrial scene data), making it difficult for models trained on public datasets to generalize directly to industrial scene text detection tasks. Therefore, we focus on domain adaptive scene text detection algorithms. However, when applied to industrial scene text detection, existing methods encounter the following problems: 1) Image translation methods achieve domain adaptation by generating images similar to the target domain, but they focus on adapting low-frequency appearance information and are not effective for text detection tasks. 2) The quality of pseudo labels generated by self-training methods is low and cannot be adaptively improved during training, limiting the model's domain adaptability. 3) Adversarial feature alignment methods disregard the influence of background noise and cannot effectively mitigate domain gaps. To address these issues, we propose a domain adaptive industrial scene text detection method called DA-DB++, which stands for domain-adaptive differentiable binarization, based on text self-training and adversarial learning.
Method
In this study, we address the issues of low-quality pseudo labels and domain gaps. First, we introduce a teacher-student self-training framework. Data augmentation and mutual learning between the teacher and student models enhance the robustness of the model, reduce domain bias, and gradually produce high-quality pseudo labels during training. Specifically, the teacher model generates pseudo labels for target-domain data, while the student model trains on source-domain data and the pseudo labels. The teacher model is updated with the exponential moving average of the student model. Second, we propose image-level and instance-level adversarial learning modules in the student model to address the large domain gap. These modules align the feature distributions of the source and target domains, enabling the network to learn domain-invariant features. Specifically, an image-level alignment module is added after the feature extraction network, and a coordinate attention mechanism aggregates features along the horizontal and vertical spatial directions, improving the extraction of global-level features. This process reduces shifts caused by global image differences, such as image style and shape. Aligning high-level semantic features helps the model learn better feature representations, effectively reduces domain gaps, and improves the model's generalization ability. Instance-level alignment is implemented by using text labels for mask filtering, which forces the network to focus on text areas and suppresses interference from background noise. Finally, consistency regularization is imposed between the two adversarial learning modules to further alleviate domain gaps and improve the model's domain adaptability.
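To make the alignment modules concrete, below is a minimal PyTorch-style sketch of gradient-reversal-based adversarial alignment with instance-level mask filtering and a consistency term. The module names, discriminator architecture, and loss formulations are illustrative assumptions, not the paper's implementation; in particular, the coordinate attention used in the paper's image-level module is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer (Ganin and Lempitsky, 2015): identity in the
    forward pass, negated and scaled gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lamb * grad_out, None

class DomainDiscriminator(nn.Module):
    """Small per-location domain classifier placed after the backbone."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 256, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, 1))

    def forward(self, feat, lamb: float = 1.0):
        return self.net(GradReverse.apply(feat, lamb))

def adversarial_losses(d_img, d_ins, feat, domain_label, text_mask):
    """domain_label: 0 for source, 1 for target. text_mask (N, 1, H, W) comes
    from ground-truth boxes on the source domain and pseudo labels on the
    target domain."""
    # Image-level alignment over the whole feature map.
    logits_img = d_img(feat)
    dom = torch.full_like(logits_img, float(domain_label))
    loss_img = F.binary_cross_entropy_with_logits(logits_img, dom)

    # Instance-level alignment: mask filtering restricts the loss to text
    # regions, suppressing background noise.
    logits_ins = d_ins(feat)
    mask = F.interpolate(text_mask, size=logits_ins.shape[-2:], mode="nearest")
    loss_ins = F.binary_cross_entropy_with_logits(
        logits_ins, torch.full_like(logits_ins, float(domain_label)),
        weight=mask, reduction="sum") / mask.sum().clamp(min=1.0)

    # Consistency regularization: the two discriminators should agree on the
    # domain of each text region.
    loss_cons = F.l1_loss(torch.sigmoid(logits_img) * mask,
                          torch.sigmoid(logits_ins) * mask)
    return loss_img, loss_ins, loss_cons
```

Because of the gradient reversal layer, minimizing the discriminator losses simultaneously pushes the backbone toward features the discriminators cannot separate, which is the domain-invariant alignment effect described above.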
Result
We conducted experiments on the industrial nameplate dataset and public datasets and compared our method with other domain adaptive text detection methods to verify its effectiveness and robustness. The experiments show that each module of the proposed method contributes to overall performance to varying degrees. With ICDAR2013 as the source domain and the nameplate dataset as the target domain, our method attains precision, recall, and F1 values of 96.2%, 95.0%, and 95.6%, respectively, which are 10%, 15.3%, and 12.8% higher than those of the baseline model DBNet++. This result indicates that our method alleviates domain gaps and shifts, generates high-quality pseudo labels, and improves the model's domain adaptability. The method also performs well on the ICDAR15 and MSRA-TD500 datasets, with F1 values increased by 0.9% and 3.1%, respectively, compared with state-of-the-art methods. In addition, applying our method to the efficient and accurate scene text detector (EAST) model increases precision, recall, and F1 on the nameplate dataset by 5%, 11.8%, and 9.5%, respectively.
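The F1 values above are the standard harmonic mean of precision and recall; a quick check in Python reproduces the reported nameplate score:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (values in percent)."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(96.2, 95.0), 1))  # 95.6, matching the reported F1 value
```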
Conclusion
In this study, we propose a domain adaptive industrial scene text detection method to address low-quality pseudo labels and the domain gap between source and target domains, improving the model's adaptability on the target dataset. The experimental results and analysis indicate that the proposed method remarkably enhances the domain adaptability of the DBNet++ text detection model and achieves state-of-the-art results in domain adaptation tasks for industrial nameplate and public text detection, verifying its effectiveness. Experiments on the EAST model further demonstrate the generality of the proposed method. Moreover, the method adds no computational cost or time overhead at the inference stage.
Keywords: scene text detection; domain adaptation; text self-training; feature adversarial learning; consistency regularization
References
Arruda V F, Paixao T M, Berriel R F, De Souza A F, Badue C, Sebe N and Oliveira-Santos T. 2019. Cross-domain car detection using unsupervised image-to-image translation: from day to night//Proceedings of 2019 International Joint Conference on Neural Networks. Budapest, Hungary: IEEE: 1-8 [DOI: 10.1109/IJCNN.2019.8852008]
Chen Y D, Wang W, Zhou Y, Yang F, Yang D B and Wang W P. 2021. Self-training for domain adaptive scene text detection//Proceedings of the 25th International Conference on Pattern Recognition. Milan, Italy: IEEE: 850-857 [DOI: 10.1109/ICPR48806.2021.9412558]
Chen Y H, Li W, Sakaridis C, Dai D X and Van Gool L. 2018. Domain adaptive faster R-CNN for object detection in the wild//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 3339-3348 [DOI: 10.1109/CVPR.2018.00352]
Deng D, Liu H F, Li X L and Cai D. 2018. PixelLink: detecting scene text via instance segmentation//Proceedings of the 32nd AAAI Conference on Artificial Intelligence. New Orleans, USA: AAAI: 6773-6780 [DOI: 10.1609/aaai.v32i1.12269]
Deng J H, Li W, Chen Y H and Duan L X. 2021. Unbiased mean teacher for cross-domain object detection//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 4089-4099 [DOI: 10.1109/CVPR46437.2021.00408]
Ganin Y and Lempitsky V. 2015. Unsupervised domain adaptation by backpropagation//Proceedings of the 32nd International Conference on Machine Learning. Lille, France: PMLR: 1180-1189
Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A and Bengio Y. 2014. Generative adversarial nets//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: NIPS: 2672-2680
Guo M Q, Xie H W and Zhang X L. 2022. Character detection and recognition of sewage treatment equipment nameplate. Computer Engineering and Design, 43(10): 2904-2910 [DOI: 10.16208/j.issn1000-7024.2022.10.028]
Hou Q B, Zhou D Q and Feng J S. 2021. Coordinate attention for efficient mobile network design//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 13708-13717 [DOI: 10.1109/CVPR46437.2021.01350]
Hu L Q, Kan M N, Shan S G and Chen X L. 2018. Duplex generative adversarial network for unsupervised domain adaptation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1498-1507 [DOI: 10.1109/CVPR.2018.00162]
Karatzas D, Gomez-Bigorda L, Nicolaou A, Ghosh S, Bagdanov A, Iwamura M, Matas J, Neumann L, Chandrasekhar V R, Lu S J, Shafait F, Uchida S and Valveny E. 2015. ICDAR 2015 competition on robust reading//Proceedings of the 13th International Conference on Document Analysis and Recognition. Tunis, Tunisia: IEEE: 1156-1160 [DOI: 10.1109/ICDAR.2015.7333942]
Karatzas D, Shafait F, Uchida S, Iwamura M, Bigorda L G I, Mestre S R, Mas J, Mota D F, Almazàn J A and de las Heras L P. 2013. ICDAR 2013 robust reading competition//Proceedings of the 12th International Conference on Document Analysis and Recognition. Washington, USA: IEEE: 1484-1493 [DOI: 10.1109/ICDAR.2013.221]
Li Y J, Dai X L, Ma C Y, Liu Y C, Chen K, Wu B C, He Z J, Kitani K and Vajda P. 2022. Cross-domain adaptive teacher for object detection//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 7571-7580 [DOI: 10.1109/CVPR52688.2022.00743]
Liao M H, Shi B G and Bai X. 2018. TextBoxes++: a single-shot oriented scene text detector. IEEE Transactions on Image Processing, 27(8): 3676-3690 [DOI: 10.1109/TIP.2018.2825107]
Liao M H, Shi B G, Bai X, Wang X G and Liu W Y. 2017. TextBoxes: a fast text detector with a single deep neural network//Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, USA: AAAI: 4161-4167 [DOI: 10.1609/aaai.v31i1.11196]
Liao M H, Wan Z Y, Yao C, Chen K and Bai X. 2020. Real-time scene text detection with differentiable binarization//Proceedings of the 34th AAAI Conference on Artificial Intelligence. New York, USA: AAAI: 11474-11481 [DOI: 10.1609/aaai.v34i07.6812]
Liao M H, Zou Z S, Wan Z Y, Yao C and Bai X. 2023. Real-time scene text detection with differentiable binarization and adaptive scale fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1): 919-931 [DOI: 10.1109/TPAMI.2022.3155612]
Liu C Y, Chen X X, Luo C J, Jin L W, Xue Y and Liu Y L. 2021. Deep learning methods for scene text detection and recognition. Journal of Image and Graphics, 26(6): 1330-1367 [DOI: 10.11834/jig.210044]
Maaten L and Hinton G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86): 2579-2605
Milletari F, Navab N and Ahmadi S A. 2016. V-Net: fully convolutional neural networks for volumetric medical image segmentation//Proceedings of the 4th International Conference on 3D Vision. Stanford, USA: IEEE: 565-571 [DOI: 10.1109/3DV.2016.79]
Oza P, Sindagi V A, Sharmini V V and Patel V M. 2023. Unsupervised domain adaptation of object detectors: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(6): 4018-4040 [DOI: 10.1109/TPAMI.2022.3217046]
Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G and Sutskever I. 2021. Learning transferable visual models from natural language supervision//Proceedings of the 38th International Conference on Machine Learning. [s.l.]: PMLR: 8748-8763
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Saito K, Ushiku Y, Harada T and Saenko K. 2019. Strong-weak distribution alignment for adaptive object detection//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 6949-6958 [DOI: 10.1109/CVPR.2019.00712]
Tarvainen A and Valpola H. 2017. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 1196-1205
Tian Z C, Xue C H, Zhang J Y and Lu S J. 2022. Domain adaptive scene text detection via subcategorization [EB/OL]. [2022-12-01]. https://arxiv.org/pdf/2212.00377.pdf
Wang D L, Kang B and Zhu R. 2023. Text detection method for electrical equipment nameplates based on deep learning. Journal of Graphics, 44(4): 691-698 [DOI: 10.11996/JG.j.2095-302X.2023040691]
Wang W H, Xie E Z, Li X, Hou W B, Lu T, Yu G and Shao S. 2019. Shape robust text detection with progressive scale expansion network//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 9328-9337 [DOI: 10.1109/CVPR.2019.00956]
Wang Z X, Xie H T, Wang Y X and Zhang Y D. 2023. Hierarchical semantics-fused scene text detection. Journal of Image and Graphics, 28(8): 2343-2355 [DOI: 10.11834/jig.220902]
Wu W J, Lu N, Xie E Z, Wang Y X, Yu W W, Yang C and Zhou H. 2021. Synthetic-to-real unsupervised domain adaptation for scene text detection in the wild//Proceedings of the 15th Asian Conference on Computer Vision. Kyoto, Japan: Springer: 289-303 [DOI: 10.1007/978-3-030-69535-4_18]
Wu Y H and Sang N. 2023. Consistency constraints and label optimization-relevant domain-unsupervised adaptive pedestrians' re-identification. Journal of Image and Graphics, 28(5): 1372-1383 [DOI: 10.11834/jig.220618]
Yao C, Bai X, Liu W Y, Ma Y and Tu Z W. 2012. Detecting texts of arbitrary orientations in natural images//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 1083-1090 [DOI: 10.1109/CVPR.2012.6247787]
Yi Y H, He J J, Lu L Q and Tang Z W. 2020. Association of text and other objects for text detection with natural scene images. Journal of Image and Graphics, 25(1): 126-135 [DOI: 10.11834/jig.190179]
Yu W W, Liu Y L, Hua W, Jiang D Q, Ren B and Bai X. 2023. Turning a CLIP model into a scene text detector//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 6978-6988 [DOI: 10.1109/CVPR52729.2023.00674]
Zhai X H, Sun K, Zhao J F, Sun Y L, Xing Y, Guo K X and Wang H Y. 2023. Nameplate detection and recognition of smart meter communication module based on ResSE-SegNet. Journal of Harbin University of Science and Technology, 28(2): 136-144 [DOI: 10.15938/j.jhust.2023.02.016]
Zhan F N, Xue C H and Lu S J. 2019. GA-DAN: geometry-aware domain adaptation network for scene text detection and recognition//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 9104-9114 [DOI: 10.1109/ICCV.2019.00920]
Zhang D, Li J J, Xiong L, Lin L, Ye M and Yang S M. 2019. Cycle-consistent domain adaptive faster RCNN. IEEE Access, 7: 123903-123911 [DOI: 10.1109/ACCESS.2019.2938837]
Zhou X Y, Yao C, Wen H, Wang Y Z, Zhou S C, He W R and Liang J J. 2017. EAST: an efficient and accurate scene text detector//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2642-2651 [DOI: 10.1109/CVPR.2017.283]
Zhu J Y, Park T, Isola P and Efros A A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2242-2251 [DOI: 10.1109/ICCV.2017.244]