Chinese Text Typo Detection and Automatic Correction

机器学习AI算法工程


2020-11-11 15:33


How to use:

Run in the terminal: python Autochecker4Chinese.py
You will get the following result:

Getting the code and run instructions:

Follow the WeChat official account datayx and reply 纠错 ("error correction") to get them.

1. Make a detector

Construct a dict for detecting misspelled Chinese phrases: each key is a Chinese phrase, and each value is that phrase's frequency in the corpus.
You can build this dict by collecting a corpus from the internet, or you can take the easier route and load dicts that others have already created. Here we take the second route and construct the dict from a file.
The detector works as follows: any phrase that does not appear in this dict is flagged as a misspelled phrase.
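
The detection step can be sketched as follows. This is a minimal illustration, not the code from Autochecker4Chinese.py: the file format (one "phrase frequency" pair per line) and the names load_word_freq and is_misspelled are assumptions made for the example.

```python
from pathlib import Path
import tempfile


def load_word_freq(path):
    """Load a phrase -> frequency dict from a text file.

    Assumed (hypothetical) format: one entry per line, "phrase frequency",
    separated by whitespace. Malformed lines are skipped.
    """
    word_freq = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue
            try:
                word_freq[parts[0]] = int(parts[1])
            except ValueError:
                continue
    return word_freq


def is_misspelled(phrase, word_freq):
    """A phrase counts as misspelled when it is absent from the dict."""
    return phrase not in word_freq


# Demo with a tiny throwaway dict file standing in for a real dictionary.
with tempfile.TemporaryDirectory() as tmp:
    dict_path = Path(tmp) / "toy_dict.txt"
    dict_path.write_text("机器 120\n学习 300\n机器学习 80\n", encoding="utf-8")
    word_freq = load_word_freq(dict_path)

print(is_misspelled("学习", word_freq))  # False
print(is_misspelled("学系", word_freq))  # True
```

Because detection is a plain membership test, its quality depends entirely on dictionary coverage: a correct but rare phrase that is missing from the dict will also be flagged.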

2. Make an autocorrector

Make an autocorrector for misspelled phrases: we use edit distance to build a list of correction candidates for the misspelled phrase.
We sort the candidate list by the likelihood of each candidate being the correct phrase, based on the following rules:

If the candidate's pinyin exactly matches the misspelled phrase's pinyin, we put the candidate in the first tier, meaning it is among the most likely phrases to be selected.
Otherwise, if the pinyin of the candidate's first character matches the pinyin of the misspelled phrase's first character, we put the candidate in the second tier.
Otherwise, we put the candidate in the third tier.
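
The candidate generation and the three ranking rules above can be sketched like this. It is a minimal illustration under stated assumptions, not the original implementation: candidates are limited to edit distance 1 (a swap of adjacent characters counts as one edit), the replacement alphabet is taken from the characters in the dict, ties within a tier are broken by frequency, and TOY_PINYIN is a hypothetical hand-written table. In practice a pinyin library (for example `lazy_pinyin` from pypinyin) could be passed as `to_pinyin` instead.

```python
def edits1(phrase, alphabet):
    """All strings one edit away: delete, transpose, replace, insert."""
    splits = [(phrase[:i], phrase[i:]) for i in range(len(phrase) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def known_candidates(phrase, word_freq):
    """Keep only the edits that are real phrases in the dict."""
    # Assumption: the alphabet is every character that occurs in the dict.
    alphabet = {ch for word in word_freq for ch in word}
    return {c for c in edits1(phrase, alphabet) if c in word_freq and c != phrase}


def rank_candidates(phrase, candidates, word_freq, to_pinyin):
    """Sort by pinyin tier (0, 1, 2), then by descending frequency."""
    target = to_pinyin(phrase)

    def tier(cand):
        py = to_pinyin(cand)
        if py == target:          # rule 1: full pinyin match
            return 0
        if py[:1] == target[:1]:  # rule 2: first character's pinyin matches
            return 1
        return 2                  # rule 3: everything else

    return sorted(candidates, key=lambda c: (tier(c), -word_freq.get(c, 0)))


def auto_correct(phrase, word_freq, to_pinyin):
    """Return the top-ranked candidate, or the phrase itself if none exist."""
    ranked = rank_candidates(phrase, known_candidates(phrase, word_freq),
                             word_freq, to_pinyin)
    return ranked[0] if ranked else phrase


# Hypothetical toy data for the demo.
TOY_PINYIN = {"学": "xue", "习": "xi", "系": "xi",
              "校": "xiao", "生": "sheng", "联": "lian"}


def toy_pinyin(phrase):
    return [TOY_PINYIN.get(ch, ch) for ch in phrase]


word_freq = {"学习": 300, "学校": 200, "学生": 150, "联系": 60}
ranked = rank_candidates("学系", known_candidates("学系", word_freq),
                         word_freq, toy_pinyin)
print(ranked)                                       # ['学习', '学校', '学生', '联系']
print(auto_correct("学系", word_freq, toy_pinyin))  # 学习
```

Here 学习 lands in the first tier because its pinyin (xue xi) matches 学系 exactly, 学校 and 学生 share only the first character's pinyin, and 联系 matches neither rule.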

3. Correct the misspelled phrases in a sentence

For any given sentence, use jieba to do the segmentation.
Once segmentation is done, take the segment list and check whether each phrase in the list exists in the word_freq dict; if it does not, it is a misspelled phrase.
Use the auto_correct function to correct the misspelled phrase.
Output the corrected sentence.
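
The sentence-level loop can be sketched as below. This is a minimal illustration rather than the original script: the segment list is hand-written so the example runs without jieba (in practice it could come from `jieba.lcut(sentence)`), skipping punctuation is an assumption, and toy_auto_correct is a hypothetical stand-in for the real auto_correct function.

```python
import string

# Assumption: punctuation is passed through rather than looked up in the dict.
PUNCTUATION = set(string.punctuation) | set("，。！？、；：“”（）《》")


def correct_sentence(segments, word_freq, auto_correct):
    """Rebuild a sentence, correcting segments that are missing from word_freq."""
    corrected = []
    for seg in segments:
        if seg.strip() == "" or seg in PUNCTUATION or seg in word_freq:
            corrected.append(seg)                # known phrase or punctuation
        else:
            corrected.append(auto_correct(seg))  # misspelled phrase
    return "".join(corrected)


# Hypothetical toy data for the demo.
word_freq = {"我": 500, "喜欢": 200, "学习": 300}
fixes = {"学系": "学习"}


def toy_auto_correct(phrase):
    """Stand-in corrector: look the phrase up in a fixed table."""
    return fixes.get(phrase, phrase)


# Hand-written stand-in for a segmenter's output.
segments = ["我", "喜欢", "学系", "。"]
print(correct_sentence(segments, word_freq, toy_auto_correct))  # 我喜欢学习。
```

Since correction works per segment, the result depends on how the segmenter splits the sentence: a typo that causes a phrase to be split into known single characters would pass the dict check and go uncorrected.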
