IHCCD: 非规范手写汉字识别数据集
IHCCD: dataset for identification of irregular handwritten Chinese characters
- 2024年29卷第11期 页码:3345-3356
纸质出版日期: 2024-11-16
DOI: 10.11834/jig.230047
季佳美, 邵允学, 季倓正. 2024. IHCCD: 非规范手写汉字识别数据集. 中国图象图形学报, 29(11):3345-3356
Ji Jiamei, Shao Yunxue, Ji Tanzheng. 2024. IHCCD: dataset for identification of irregular handwritten Chinese characters. Journal of Image and Graphics, 29(11):3345-3356
目的
随着深度学习技术的快速发展,规范手写汉字识别(handwritten Chinese character recognition, HCCR)任务已经取得突破性进展,但对非规范书写汉字识别的研究仍处于萌芽阶段。受书法流派和书写习惯等因素影响,手写汉字常常与打印字体差异显著,导致同类别文字的整体结构差异非常大,基于现有数据集训练得到的识别模型无法准确识别非规范书写的汉字。
方法
为了推动非规范书写汉字识别的研究工作,本文制作了首套非规范书写的汉字数据集(irregular handwritten Chinese character dataset, IHCCD),目前共包含3 755个类别,每个类别有30幅样本。本文还给出了经典深度学习模型ResNet、CBAM-ResNet、Vision Transformer和Swin Transformer在该数据集上的基准性能。
结果
实验结果表明,虽然以上经典网络模型在规范书写的CASIA-HWDB1.1数据集上能够取得良好性能,其中Swin Transformer在CASIA-HWDB1.1数据集上最高精度达到了95.31%,但是利用CASIA-HWDB1.1训练集训练得到的网络模型,在IHCCD测试集上的识别结果较差,最高精度也只能达到30.20%。在加入IHCCD训练集后,所有的经典模型在IHCCD测试集上的识别性能均得到了较大提升,最高精度能达到89.89%,这表明IHCCD数据集对非规范书写汉字识别具有研究意义。
结论
现有OCR模型还存在局限性,本文收集的IHCCD数据集能够有效增强识别模型的泛化性能。该数据集下载链接: https://pan.baidu.com/s/1PtcfWj3yUSz68o2ZzvPJOQ?pwd=66Y7。
Objective
With the rapid development of deep learning technology, handwritten Chinese character recognition (HCCR) has made breakthrough progress. Early text recognition research focused primarily on English characters and digits, but as artificial intelligence techniques matured, many researchers turned to Chinese character recognition. In recent years, Chinese character recognition has been widely applied in scenarios such as bank bill recognition, mail sorting, and office automation. Chinese is one of the most widely used languages in the world, carries rich information, and is an important medium for people's communication, so research on Chinese character recognition has crucial value. Despite these advances, however, the recognition of irregular handwritten Chinese characters remains a challenging task. Handwritten Chinese characters are often influenced by diverse calligraphic styles and individual writing habits, leading to notable deviations from regular printed fonts; these variations can produce considerable differences in the overall structure of characters within the same category. Recognition models trained on existing regular datasets may therefore fail to accurately identify irregularly handwritten Chinese characters encountered in real-world scenarios. For example, when a picture containing sensitive words is sent on WeChat, a text recognition engine can accurately identify and filter those words if they are written regularly. However, some people deliberately write irregularly to evade the recognition engine and thereby circumvent regulation, so the engine cannot recognize these words.
Therefore, the research on the recognition of irregular handwritten Chinese characters is of considerable importance and can be applied in the fields of information security and filtering.
Method
Irregular handwritten Chinese characters in the dataset can be classified into the following types: missing strokes or wrong stroke order; improper connection or separation of strokes; maliciously enlarged or shrunken radicals; serious distortion of the character shape; tilted character forms; and excessive horizontal or vertical amplitudes that misplace the entire spatial structure of a character, easily leading to ambiguity and misinterpretation. To promote research on the recognition of irregular handwritten Chinese characters, this paper collects the first irregular handwritten Chinese character dataset (IHCCD), which currently contains 3 755 categories with 30 samples per category. In the experiments, the first 20 samples of each category were used as training samples and the remaining 10 as test samples. IHCCD was produced by different writers who handwrote irregularly on A4 printing paper; a scanner served as the input device to convert the handwritten character samples into digital images. During collection, the writers did not need to follow the regular Chinese character stroke order: they could freely adjust stroke thickness, length, and position; arbitrarily enlarge or reduce radicals; and change the tilt of the characters, producing distorted shapes and misaligned spatial structures that bypass current text recognition engines. A series of image processing techniques, including image skew correction, single-character segmentation, Otsu binarization, and character normalization, was then applied to the collected samples to construct the IHCCD dataset.
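The last two preprocessing steps named above, Otsu binarization followed by character normalization, might be sketched as below. This is a minimal NumPy illustration under stated assumptions (ink darker than paper, nearest-neighbor rescaling, a 64 × 64 output size), not the authors' actual pipeline.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Otsu's method: pick the threshold maximizing between-class variance
    of an 8-bit grayscale histogram."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    total = hist.sum()
    cum_w = np.cumsum(hist)                    # pixel count at or below t
    cum_m = np.cumsum(hist * np.arange(256))   # intensity mass at or below t
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0, w1 = cum_w[t], total - cum_w[t]
        if w0 == 0 or w1 == 0:
            continue
        m0 = cum_m[t] / w0
        m1 = (cum_m[-1] - cum_m[t]) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize_and_normalize(gray: np.ndarray, size: int = 64) -> np.ndarray:
    """Binarize with Otsu, crop to the character's bounding box, and rescale
    to a fixed size by nearest-neighbor sampling (assumes dark ink on paper)."""
    t = otsu_threshold(gray)
    binary = (gray <= t).astype(np.uint8)
    ys, xs = np.nonzero(binary)
    if len(ys) == 0:                           # blank page: no character found
        return np.zeros((size, size), dtype=np.uint8)
    crop = binary[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = crop.shape
    rows = np.arange(size) * h // size         # nearest-neighbor row indices
    cols = np.arange(size) * w // size
    return crop[rows][:, cols]
```

In practice a library routine (e.g. OpenCV's Otsu flag) would replace the hand-rolled threshold; the sketch only shows how the two steps compose before samples enter the dataset.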
Result
In this paper, detailed experiments were conducted on the IHCCD and CASIA-HWDB1.1 datasets to compare the recognition performance of classical network models, such as ResNet, CBAM-ResNet, Vision Transformer, and Swin Transformer, under different experimental settings. The results show that although these classical models achieve good performance on the regularly written CASIA-HWDB1.1 dataset, with Swin Transformer reaching the highest accuracy of 95.31%, the models trained only on the CASIA-HWDB1.1 training set perform poorly on the IHCCD test set, where the highest accuracy reaches only 30.20%. After the IHCCD training set is added, the recognition performance of all classical models on the IHCCD test set improves markedly, with the highest accuracy reaching 89.89%, showing that the IHCCD dataset is crucial for the study of irregular handwritten Chinese character recognition.
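The accuracies reported above are top-1 classification rates; computing them for any of the baseline models could be sketched as follows (a hypothetical helper, assuming model outputs as a logits matrix and labels as integer class IDs):

```python
import numpy as np

def top1_accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of samples whose highest-scoring class matches the label."""
    preds = logits.argmax(axis=1)
    return float((preds == labels).mean())

# toy check: 3 samples, 4 classes, two correct predictions
logits = np.array([[0.1, 0.9, 0.0, 0.0],
                   [0.8, 0.1, 0.05, 0.05],
                   [0.2, 0.2, 0.5, 0.1]])
labels = np.array([1, 0, 3])
print(round(top1_accuracy(logits, labels), 4))  # → 0.6667
```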
Conclusion
The existing optical character recognition (OCR) models still have limitations, and the dataset collected in this paper can effectively enhance the generalization performance of recognition models. However, even for the best-performing Swin Transformer model, a large gap remains between the recognition accuracy on irregularly written Chinese characters and that on regularly written ones, which calls for further in-depth study of this problem. Link to download this dataset:
https://pan.baidu.com/s/1PtcfWj3yUSz68o2ZzvPJOQ?pwd=66Y7.
非规范书写;手写汉字识别(HCCR);IHCCD数据集;深度学习;经典分类模型
irregular writing; handwritten Chinese character recognition (HCCR); IHCCD dataset; deep learning; classical classification model
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2020. An image is worth 16 × 16 words: Transformers for image recognition at scale [EB/OL]. [2023-12-21]. https://arxiv.org/pdf/2010.11929v1.pdf
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
He P S, Li W C, Zhang J Y, Wang H X and Jiang X H. 2022. Overview of passive forensics and anti-forensics techniques for GAN-generated image. Journal of Image and Graphics, 27(1): 88-110
何沛松, 李伟创, 张婧媛, 王宏霞, 蒋兴浩. 2022. 面向GAN生成图像的被动取证及反取证技术综述. 中国图象图形学报, 27(1): 88-110 [DOI: 10.11834/jig.210430]
He X, Zhou Y, Zhao J Q, Zhang D, Yao R and Xue Y. 2022. Swin Transformer embedding UNet for remote sensing image semantic segmentation. IEEE Transactions on Geoscience and Remote Sensing, 60: #4408715 [DOI: 10.1109/TGRS.2022.3144165]
Hou Q B, Zhou D Q and Feng J S. 2021. Coordinate attention for efficient mobile network design//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 13708-13717 [DOI: 10.1109/CVPR46437.2021.01350]
Loshchilov I and Hutter F. 2016. SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 [DOI: 10.48550/arXiv.1608.03983]
Kang L, Riba P, Rusiñol M, Fornés A and Villegas M. 2022. Pay attention to what you read: non-recurrent handwritten text-line recognition [EB/OL]. [2023-12-21]. https://arxiv.org/pdf/2005.13044.pdf
Kingma D P and Ba J. 2015. Adam: a method for stochastic optimization//Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA: ICLR: 1-13 [DOI: 10.48550/arXiv.1412.6980]
Liu C L, Yin F, Wang D H and Wang Q F. 2013. Online and offline handwritten Chinese character recognition: benchmarking on new databases. Pattern Recognition, 46(1): 155-162 [DOI: 10.1016/j.patcog.2012.06.021]
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, Lin S and Guo B N. 2021. Swin Transformer: hierarchical vision Transformer using shifted windows//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 10012-10022 [DOI: 10.1109/ICCV48922.2021.00986]
Melnyk P, You Z Q and Li K Q. 2020. A high-performance CNN method for offline handwritten Chinese character recognition and visualization. Soft Computing, 24(11): 7977-7987 [DOI: 10.1007/s00500-019-04083-3]
Nasr G E, Badr E A and Joun C. 2002. Cross entropy error function in neural networks: forecasting gasoline demand//Proceedings of the 15th International Florida Artificial Intelligence Research Society Conference (FLAIRS). Florida, USA: AAAI: 381-384 [DOI: 10.5555/646815.708603]
Otsu N. 1979. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1): 62-66 [DOI: 10.1109/TSMC.1979.4310076]
Qiao L, Li Z S, Cheng Z Z and Li X. 2023. SCID: a Chinese characters invoice-scanned dataset in relevant to key information extraction derived of visually-rich document images. Journal of Image and Graphics, 28(8): 2298-2313
乔梁, 李再升, 程战战, 李玺. 2023. SCID: 用于富含视觉信息文档图像中信息提取任务的扫描中文票据数据集. 中国图象图形学报, 28(8): 2298-2313 [DOI: 10.11834/jig.220911]
Sermanet P, Chintala S and LeCun Y. 2012. Convolutional neural networks applied to house numbers digit classification [EB/OL]. [2023-01-17]. https://arxiv.org/pdf/1204.3968.pdf
Shao Y X and Liu C L. 2020. Teaching machines to write like humans using L-attributed grammar. Engineering Applications of Artificial Intelligence, 90: #103489 [DOI: 10.1016/j.engappai.2020.103489]
Shao Y X, Wang C H and Xiao B H. 2013. Visual word density-based nonlinear shape normalization method for handwritten Chinese character recognition. International Journal on Document Analysis and Recognition (IJDAR), 16(4): 387-397 [DOI: 10.1007/s10032-012-0198-4]
Shuai X, Wang X, Wang W, Yuan X and Xu X. 2022. SAM: self attention mechanism for scene text recognition based on swin Transformer//Jónsson B Þ, Gurrin C, Tran M T, Dang-Nguyen D T, Hu A M C, Thanh B H T and Huet B, eds. MultiMedia Modeling. Cham, Switzerland: Springer [DOI: 10.1007/978-3-030-98358-1_35]
Su T H, Zhang T W and Guan D J. 2007. Corpus-based HIT-MW database for offline recognition of general-purpose Chinese handwritten text. International Journal of Document Analysis and Recognition (IJDAR), 10(1): 27-38 [DOI: 10.1007/s10032-006-0037-6]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. California, USA: Curran Associates Inc.: 6000-6010 [DOI: 10.5555/3295222.3295349]
Wang Y T, Yang Y J, Chen H Y, Zheng H and Chang H Y. 2022. End-to-end handwritten Chinese paragraph text recognition using residual attention networks. Intelligent Automation and Soft Computing, 34(1): 371-388 [DOI: 10.32604/iasc.2022.027146]
Woo S, Park J, Lee J and Kweon I S. 2018. CBAM: convolutional block attention module [EB/OL]. [2023-12-21]. https://arxiv.org/pdf/1807.06521.pdf
Wu Z F, Shen C H and van den Hengel A. 2019. Wider or deeper: revisiting the ResNet model for visual recognition [EB/OL]. [2023-12-21]. https://arxiv.org/pdf/1611.10080.pdf
Yin F, Wang Q F, Zhang X Y and Liu C L. 2013. ICDAR 2013 Chinese handwriting recognition competition//Proceedings of the 12th International Conference on Document Analysis and Recognition. Washington, USA: IEEE: 1464-1470 [DOI: 10.1109/ICDAR.2013.218]
Zhan H J, Lyu S and Lu Y. 2022. Improving offline handwritten Chinese text recognition with glyph-semanteme fusion embedding. International Journal of Machine Learning and Cybernetics, 13(2): 485-496 [DOI: 10.1007/s13042-021-01420-7]
Zhang H G, Guo J, Chen G and Li C G. 2009. HCL2000-a large-scale handwritten Chinese character database for handwritten character recognition//Proceedings of the 10th International Conference on Document Analysis and Recognition. Barcelona, Spain: IEEE: 286-289 [DOI: 10.1109/ICDAR.2009.15]
Zhuang Y, Liu Q, Qiu C J, Wang C, Ya F, Sabbir A and Yan J Q. 2021. A handwritten Chinese character recognition based on convolutional neural network and median filtering. Journal of Physics: Conference Series, 1820(1): #012162 [DOI: 10.1088/1742-6596/1820/1/012162]