juniversalchardet: an encoding detection tool

juniversalchardet is a Java port of "universalchardet", Mozilla's encoding detector library.

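Many years ago Mozilla built an excellent encoding detector called chardet (Java port: jchardet). It later released universalchardet, which uses a better algorithm and is used for automatic encoding detection in Firefox. juniversalchardet is also bundled in tika-app-1.*.jar, the distribution package of the Apache content extraction project Tika, from version 1.2 onward.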

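Note: detection can fail on short texts of only a few bytes. This is probably a limitation of the algorithm itself. On somewhat larger texts the accuracy is very high, and noticeably higher than that of chardet.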
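To illustrate why a few bytes are hard to classify, the sketch below uses only the JDK (not juniversalchardet) and strictly decodes the same two bytes under several charsets. The class name `ShortTextAmbiguity` and the helper `decodeStrict` are hypothetical, and the sketch assumes the JVM ships the GB18030, windows-1252, and KOI8-R charsets.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class ShortTextAmbiguity {

    // Strictly decodes the bytes; returns null if they are not valid in the charset.
    static String decodeStrict(byte[] data, String charsetName) {
        try {
            return Charset.forName(charsetName)
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Two bytes that encode one Chinese character in GB18030.
        byte[] data = {(byte) 0xC4, (byte) 0xE3};
        for (String name : new String[] {"GB18030", "windows-1252", "KOI8-R", "UTF-8"}) {
            String text = decodeStrict(data, name);
            System.out.println(name + " -> " + (text == null ? "(invalid)" : text));
        }
    }
}
```

The two bytes are valid in three of the four charsets here, so a detector can only fall back on statistics, and two bytes give it very little to work with.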

Detectable encodings

Chinese
- ISO-2022-CN
- BIG5
- EUC-TW
- GB18030
- HZ-GB-2312

Cyrillic
- ISO-8859-5
- KOI8-R
- WINDOWS-1251
- MACCYRILLIC
- IBM866
- IBM855

Greek
- ISO-8859-7
- WINDOWS-1253

Hebrew
- ISO-8859-8
- WINDOWS-1255

Japanese
- ISO-2022-JP
- SHIFT_JIS
- EUC-JP

Korean
- ISO-2022-KR
- EUC-KR

Unicode
- UTF-8
- UTF-16BE / UTF-16LE
- UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-3412 / X-ISO-10646-UCS-4-2143

Others
- WINDOWS-1252

Sample code



import java.io.FileInputStream;
import java.io.IOException;

import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
  public static void main(String[] args) throws IOException {
    byte[] buf = new byte[4096];
    String fileName = args[0];

    // (1) Construct a detector; passing null registers no listener.
    UniversalDetector detector = new UniversalDetector(null);

    // (2) Feed data until the detector is done or the input ends.
    //     try-with-resources closes the stream afterwards.
    try (FileInputStream fis = new FileInputStream(fileName)) {
      int nread;
      while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
      }
    }

    // (3) Signal the end of the data.
    detector.dataEnd();

    // (4) Read the result; null means no encoding was detected.
    String encoding = detector.getDetectedCharset();
    if (encoding != null) {
      System.out.println("Detected encoding = " + encoding);
    } else {
      System.out.println("No encoding detected.");
    }

    // (5) Reset the detector before reusing it.
    detector.reset();
  }
}
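
The detector returns only a name, and the JVM may not support every name in the list above. The following sketch, using only the JDK, shows one way to turn such a name into a `Charset` with a fallback before decoding. The class `DetectedCharsetDecoding`, the helpers `toCharset` and `decode`, and the choice of UTF-8 as the fallback are assumptions for illustration, not part of juniversalchardet.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

public class DetectedCharsetDecoding {

    // Maps a detected name to a Charset; returns the fallback when the name
    // is null, syntactically illegal, or not supported by this JVM.
    static Charset toCharset(String detectedName, Charset fallback) {
        if (detectedName == null) {
            return fallback;
        }
        try {
            return Charset.isSupported(detectedName)
                    ? Charset.forName(detectedName)
                    : fallback;
        } catch (IllegalCharsetNameException e) {
            return fallback;
        }
    }

    // Decodes the bytes with the detected charset, falling back to UTF-8 (an assumption).
    static String decode(byte[] data, String detectedName) {
        return new String(data, toCharset(detectedName, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] utf16le = {0x68, 0x00, 0x69, 0x00}; // "hi" in UTF-16LE
        System.out.println(decode(utf16le, "UTF-16LE"));
        System.out.println(toCharset("X-ISO-10646-UCS-4-3412", StandardCharsets.UTF_8));
    }
}
```

In the sample above, the string returned by `getDetectedCharset()` would be passed as `detectedName`.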