Tesseract
Tesseract, an open-source OCR engine
tesseract-ocr / tesseract
https://github.com/tesseract-ocr/tesseract
Repository of trained language models
tesseract-ocr / tessdata
https://github.com/tesseract-ocr/tessdata
Tesseract
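psm: page segmentation mode
Use the --psm option to set the page segmentation mode, for example --psm 7.
--psm 7 suits a single line of text, such as license plate recognition.
--psm 8 suits recognizing a single word.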
tesseract --help-psm
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.
Tesseract Page Segmentation Modes (PSMs) Explained: How to Improve Your OCR Accuracy
https://pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/
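oem: OCR engine mode
Use the --oem option to set the engine mode, for example --oem 1.
0 Legacy engine
1 LSTM neural network engine
2 Legacy + LSTM
3 Default
The models in tessdata_best and tessdata_fast support only the LSTM engine (--oem 1), not the legacy --oem 0 mode. With tess4j, passing --oem 0 together with one of these newer models crashes the process outright (ERROR).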
tesseract --help-oem
OCR Engine modes:
0 Legacy engine only.
1 Neural nets LSTM engine only.
2 Legacy + LSTM engines.
3 Default, based on what is available.
The three official repositories of trained models
Traineddata Files for Version 4.00 +
https://tesseract-ocr.github.io/tessdoc/Data-Files.html
tessdata_fast: fastest, integer models
https://github.com/tesseract-ocr/tessdata_fast
tessdata_best: most accurate, float models
https://github.com/tesseract-ocr/tessdata_best
tessdata: the older models, which still work with the legacy engine
https://github.com/tesseract-ocr/tessdata
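In practice I found that the models in the tessdata repository are the largest and gave the best results, even better than those in tessdata_best.
Again, the tessdata_best and tessdata_fast models support only the LSTM engine (--oem 1). Passing --oem 0 with them through tess4j crashes the process outright (ERROR).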
Running OCR with the tesseract command
1. Download a trained language model
https://tesseract-ocr.github.io/tessdoc/Data-Files.html
Download the Chinese model chi_sim.traineddata and put it in /usr/share/tesseract/4/tessdata, or put it in any directory and pass that directory when running the command:
tesseract --tessdata-dir / tesseract-test.png outfile -l chi_sim
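- --tessdata-dir sets the directory containing the language model files (default /usr/share/tesseract/4/tessdata)
- tesseract-test.png is the input image
- outfile is the output base name; the command writes outfile.txt when it finishes
- -l chi_sim selects the language
The results were good and the accuracy was high.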
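The options above can also be assembled from Java before handing them to a process launcher. This is a minimal sketch, assuming the tesseract binary is on the PATH. The class and method names (TesseractCommand, build) are hypothetical and belong to neither Tesseract nor tess4j. It only builds the argument list; the commented-out lines show how it could be run.

```java
import java.util.ArrayList;
import java.util.List;

final class TesseractCommand {

    // Builds: tesseract --tessdata-dir <dir> <image> <outBase> -l <lang> [--psm N] [--oem N]
    // psm and oem may be null to keep Tesseract's defaults (psm 3, oem 3).
    static List<String> build(String tessdataDir, String image, String outBase,
                              String lang, Integer psm, Integer oem) {
        if (psm != null && (psm < 0 || psm > 13)) {
            throw new IllegalArgumentException("psm must be between 0 and 13: " + psm);
        }
        if (oem != null && (oem < 0 || oem > 3)) {
            throw new IllegalArgumentException("oem must be between 0 and 3: " + oem);
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add("--tessdata-dir");
        cmd.add(tessdataDir);
        cmd.add(image);
        cmd.add(outBase);
        cmd.add("-l");
        cmd.add(lang);
        if (psm != null) {
            cmd.add("--psm");
            cmd.add(String.valueOf(psm));
        }
        if (oem != null) {
            cmd.add("--oem");
            cmd.add(String.valueOf(oem));
        }
        return cmd;
    }

    public static void main(String[] args) {
        // Same arguments as the command above, plus single-line mode and the LSTM engine
        List<String> cmd = build("/", "tesseract-test.png", "outfile", "chi_sim", 7, 1);
        System.out.println(String.join(" ", cmd));
        // To actually run it (requires tesseract to be installed):
        // Process p = new ProcessBuilder(cmd).inheritIO().start();
        // int exitCode = p.waitFor();
    }
}
```

Passing the arguments as a list rather than one concatenated string avoids shell quoting problems with file names that contain spaces.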
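Tesseract performance
With the more accurate tessdata_best model, recognizing a long passage of Chinese is slow: it takes nearly 20 seconds to produce a result.
Even forty to fifty characters of mixed Chinese and English take about 10 seconds.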
# time tesseract --tessdata-dir / tesseract-test.png outfile -l chi_sim
Tesseract Open Source OCR Engine v4.1.3 with Leptonica
real 0m18.322s
user 0m41.052s
sys 0m0.276s
Switching to the faster tessdata_fast model is somewhat quicker, and the results are not much worse:
# time tesseract --tessdata-dir / tesseract-test.png outfile -l chi_sim
Tesseract Open Source OCR Engine v4.1.3 with Leptonica
real 0m14.212s
user 0m43.986s
sys 0m0.107s
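Tesseract best practices
Using Java Graphics2D, I fill a 100 x 40 black rectangle in the bottom-left corner of the image and draw a string of white digits on it.
With Tesseract 4.1.3, the following settings recognized these white-on-black digits with 100% accuracy in my tests:
- Use the legacy eng language model
- Set oem to 0, i.e. legacy mode
- Leave psm at its default
- Pass -c tessedit_char_whitelist=0123456789 to restrict the whitelist to digits
Oddly, on Tesseract 4.1.3 the legacy eng model with oem=0 works much better than the best eng model with oem=1.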
Installing Tesseract on CentOS 7
Installing Tesseract 4.1.3 with yum on CentOS 7
https://tesseract-ocr.github.io/tessdoc/Installation.html
https://tesseract-ocr.github.io/tessdoc/InstallationOpenSuse.html
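After installing according to the official documentation, yum reported an error:
Public key for leptonica-1.76.0-2.5.x86_64.rpm is not installed
Following this issue:
Public key for tesseract-4.00~git2686-1.1.x86_64.rpm is not installed
https://github.com/tesseract-ocr/tesseract/issues/1749
add --nogpgcheck to skip the GPG key check: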
sudo yum-config-manager --add-repo http://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/RHEL_7/
sudo yum update -y
sudo yum install tesseract -y --nogpgcheck
Check the version:
[centos@lightsail lib64]$ tesseract -v
tesseract 4.1.3
leptonica-1.76.0
libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libwebp 0.3.0
Found AVX2
Found AVX
Found FMA
Found SSE
Alexander_Pozdnyakov also provides a yum repository for tesseract5, but it requires CentOS 8.
https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov:/tesseract5/
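Before installing, I backed up /usr/lib64 to /matt/lib64/, then ran diff afterwards to see which libraries had been added. The goal was to find out which libs tesseract needs, so they can later be packaged into the SpringBoot image for offline use.
When I repeated the comparison on a different Linux release, libpng15.so.15 and libpng15.so.15.13.0 showed up as well.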
# diff -r /usr/lib64/ /matt/lib64/
Only in /usr/lib64/: libgomp.so.1
Only in /usr/lib64/: libgomp.so.1.0.0
Only in /usr/lib64/: libjbig85.so.2.0
Only in /usr/lib64/: libjbig.so.2.0
Only in /usr/lib64/: libjpeg.so.62
Only in /usr/lib64/: libjpeg.so.62.1.0
Only in /usr/lib64/: liblept.so.5
Only in /usr/lib64/: liblept.so.5.0.3
Only in /usr/lib64/: libtesseract.so.4
Only in /usr/lib64/: libtesseract.so.4.0.1
Only in /usr/lib64/: libtiff.so.5
Only in /usr/lib64/: libtiff.so.5.2.0
Only in /usr/lib64/: libtiffxx.so.5
Only in /usr/lib64/: libtiffxx.so.5.2.0
Only in /usr/lib64/: libwebpmux.so.0
Only in /usr/lib64/: libwebpmux.so.0.0.0
Only in /usr/lib64/: libwebp.so.4
Only in /usr/lib64/: libwebp.so.4.0.2
As of 2023-07-01 the latest version installed this way is still tesseract 4.1.3, not 5.x, but it works fine together with tess4j-5.7.0.
Building and installing Tesseract 5.2.0 on CentOS 7
1. Install the build tools
yum install -y gcc gcc-c++ make autoconf automake libtool libjpeg libpng libtiff zlib libjpeg-devel libpng-devel libtiff-devel zlib-devel
2. Upgrade to gcc 8 (building Tesseract 5 requires C++17)
yum install -y centos-release-scl
yum install -y devtoolset-8-gcc*
mv /usr/bin/gcc /usr/bin/gcc-4.8.5
ln -s /opt/rh/devtoolset-8/root/bin/gcc /usr/bin/gcc
mv /usr/bin/g++ /usr/bin/g++-4.8.5
ln -s /opt/rh/devtoolset-8/root/bin/g++ /usr/bin/g++
gcc --version
gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
g++ --version
g++ (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
3. Install leptonica (Tesseract depends on leptonica for image processing)
wget http://www.leptonica.org/source/leptonica-1.82.0.tar.gz
tar zxf leptonica-1.82.0.tar.gz
cd leptonica-1.82.0/
./configure && make && make install
Edit /etc/profile and add these environment variables:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LIBLEPT_HEADERSDIR=/usr/local/include
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
source /etc/profile
4. Build and install tesseract 5.2.0
wget https://github.com/tesseract-ocr/tesseract/archive/refs/tags/5.2.0.tar.gz
tar xvf 5.2.0.tar.gz
cd tesseract-5.2.0
./autogen.sh
./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
make && make install
When it finishes, it prints:
Libraries have been installed in:
/usr/local/lib
If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
- add LIBDIR to the `LD_LIBRARY_PATH' environment variable
during execution
- add LIBDIR to the `LD_RUN_PATH' environment variable
during linking
- use the `-Wl,-rpath -Wl,LIBDIR' linker flag
- have your system administrator add LIBDIR to `/etc/ld.so.conf'
Then check the version:
# tesseract -v
tesseract 5.2.0
leptonica-1.82.0
libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7
Found SSE4.1
Found OpenMP 201511
5. Download a language model
cd /usr/local/share/tessdata/
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
This downloads the legacy English model; the fast or best model can be used instead.
6. Test
tesseract 25.png output
After it runs, output.txt contains the OCR result.
https://www.jianshu.com/p/edfabeaf6ba8
http://www.nanstar.top/p/wiki_1649411481701
https://gist.github.com/zhuth/b75dd8440abb0771e510efa1f410086e
Using tess4j with SpringBoot on CentOS 7
To use tess4j on Linux, tesseract must be installed first. Otherwise OCR fails because the following libs cannot be found:
java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract':
libgomp.so.1: cannot open shared object file: No such file or directory
libgomp.so.1: cannot open shared object file: No such file or directory
Native library (linux-x86-64/libtesseract.so) not found in resource path ([jar:file:/blog-server.jar!/BOOT-INF/classes!/,
java.lang.UnsatisfiedLinkError: Error loading shared library liblept.so.5: No such file or directory (needed by /root/.cache/JNA/temp/jna4202543007498402592.tmp)
java.lang.UnsatisfiedLinkError: libgomp.so.1: cannot open shared object file: No such file or directory
java.lang.UnsatisfiedLinkError: libtiff.so.5: cannot open shared object file: No such file or directory
java.lang.UnsatisfiedLinkError: libjpeg.so.62: cannot open shared object file: No such file or directory
java.lang.UnsatisfiedLinkError: libwebp.so.4: cannot open shared object file: No such file or directory
java.lang.UnsatisfiedLinkError: libjbig.so.2.0: cannot open shared object file: No such file or directory
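If the Linux host is offline, Docker can be used to simulate a CentOS 7 environment locally: start a SpringBoot + CentOS 7 image, install tesseract inside the container, then copy the libs out. Mind the architecture: on an M1 Mac the default may be arm64/aarch64.
1. Install tesseract with yum. The installed version is 4.1.3, which I tested to work with tess4j-5.7.0.
2. Copy the relevant lib files out separately. They are listed below; I found them by comparing /usr/lib64 before and after installing tesseract. Run the copy from inside /usr/lib64, because ls prints bare file names.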
# ls /usr/lib64/ |egrep "libgomp|libjbig|libjpeg|liblept|libtess|libtiff|libwebpmux|libwebp" |xargs -i cp {} /tesseract-lib-4.1.3
# ls /tesseract-lib-4.1.3
libgomp.so.1 libjbig85.so.2.0 libjpeg.so.62 liblept.so.5 libtesseract.so.4 libtiff.so.5 libtiffxx.so.5 libwebpmux.so.0 libwebp.so.4
libgomp.so.1.0.0 libjbig.so.2.0 libjpeg.so.62.1.0 liblept.so.5.0.3 libtesseract.so.4.0.1 libtiff.so.5.2.0 libtiffxx.so.5.2.0 libwebpmux.so.0.0.0 libwebp.so.4.0.2
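3. Put the lib files above into the Maven project's /src/main/resources/linux-x86-64 directory. (Note: only on amd64/x86_64 does tess4j look for libs in the linux-x86-64 subdirectory of the classpath; other architectures use a different directory.)
Alternatively, since I deploy the SpringBoot service with Docker on a centos7 base image, I package these libs directly into /usr/lib64 in the container and set the LD_LIBRARY_PATH environment variable so that /usr/lib64 is also searched for libs.
The key part of the Dockerfile: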
ADD devops/tesseract-lib-4.1.3/* /usr/lib64/
# tesseract lib directory
ENV LD_LIBRARY_PATH $LD_LIBRARY_PATH:/usr/lib64
After that, start the SpringBoot service and tess4j can run OCR normally.
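The architecture note in step 3 can be checked before packaging. tess4j loads its native library through JNA, and the sketch below approximates how a Linux os.arch value maps to the classpath subdirectory. It is a simplified illustration written for this article, not JNA's actual code, and it covers only the two architectures mentioned here; NativeLibDir and linuxResourceDir are hypothetical names. In a real project, JNA's Platform.RESOURCE_PREFIX constant gives the authoritative value.

```java
import java.util.Locale;

final class NativeLibDir {

    // Simplified, Linux-only approximation of the resource directory JNA searches.
    // Linux JVMs report 64-bit ARM as "aarch64".
    static String linuxResourceDir(String osArch) {
        String arch = osArch.toLowerCase(Locale.ROOT).trim();
        switch (arch) {
            case "amd64":
            case "x86_64":
                return "linux-x86-64";
            case "aarch64":
                return "linux-aarch64";
            default:
                throw new IllegalArgumentException(
                        "Architecture not covered by this sketch: " + osArch);
        }
    }

    public static void main(String[] args) {
        String arch = System.getProperty("os.arch");
        try {
            System.out.println("os.arch=" + arch
                    + " -> src/main/resources/" + linuxResourceDir(arch));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Running this inside the target container shows quickly whether libs copied from an x86_64 machine would land in a directory that an aarch64 JVM never searches.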
Installing and deploying a tess4j project on Linux (CentOS 7 as an example)
https://blog.csdn.net/makang110/article/details/122623811
How to support OCR with tess4j in a Linux environment
https://www.jianshu.com/p/134a09c5af9e
Deploying tesseract-ocr on Linux to support tess4j
https://blog.csdn.net/dhx20022889/article/details/122939939
Tess4J 4.0.2 on Linux in practice [solved: Tess4J - Native library (linux-x86-64/libtesseract.so) not found in resource path]
https://www.cnblogs.com/socketqiang/p/10960800.html
Dependency problems and unsuccessful attempts at using Tess4j on Linux
https://blog.desmondcobb.org/archives/671
Using tess4j on an M1 Mac
1. Install
brew install tesseract
tesseract -v should print the installed version.
brew list tesseract shows the install path:
/opt/homebrew/Cellar/tesseract/5.2.0/lib
2. Copy libtesseract.5.dylib into the Java project's resources folder and rename it to libtesseract.dylib. (Do not copy libtesseract.dylib from the lib directory directly; it is only a symlink.)
Without libtesseract.dylib you get this error:
java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract':
dlopen(libtesseract.dylib, 0x0009): tried: '/Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home/bin/./libtesseract.dylib' (no such file), 'libtesseract.dylib' (no such file), '/usr/local/lib/libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file), '/Users/xxx/git/my/spring-boot-masikkk/common/libtesseract.dylib' (no such file), '/usr/local/lib/libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file)
3. You can use the default language model, or download a trained one such as the English eng.traineddata, put it in any directory, and point to that directory with setDatapath.
Tess4J
nguyenq / tess4j
https://github.com/nguyenq/tess4j
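The Tess4J jar bundles English model files
The tess4j jar pulled in by Maven contains a tessdata directory with the trained eng.traineddata and osd.traineddata models. The following code selects the models bundled in the jar:
instance.setDatapath(LoadLibs.extractTessResources("tessdata").getAbsolutePath()); // use the bundled models if you have none of your own
Tess4J usage example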
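import java.awt.Rectangle;
import java.io.File;
import java.io.InputStream;
import lombok.SneakyThrows;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.util.LoadLibs;
import org.apache.commons.io.FileUtils;
import org.junit.Test; // JUnit 4 assumed; with JUnit 5 use org.junit.jupiter.api.Test

// Method of a test class annotated with Lombok's @Slf4j, which provides the log field
@Test
@SneakyThrows
public void testTess4j() {
    // Load the image from the classpath and copy it to a temp file
    InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("bg_night.png");
    File imageFile = File.createTempFile("temp", ".png");
    FileUtils.copyInputStreamToFile(inputStream, imageFile);

    // Create the Tesseract instance, then set the language and the model directory
    ITesseract instance = new Tesseract(); // JNA Interface Mapping
    // instance.setLanguage("chi_sim"); // Chinese
    instance.setLanguage("eng"); // English
    // Without models of your own, use eng.traineddata and osd.traineddata bundled in the tess4j jar
    instance.setDatapath(LoadLibs.extractTessResources("tessdata").getAbsolutePath());
    // Or point to a directory containing your own trained models
    // instance.setDatapath("/Users/masi/git/my/spring-boot-masikkk/common/src/test/resources");

    // OCR on the whole image
    long ts = System.currentTimeMillis();
    String result = instance.doOCR(imageFile);
    log.info("result: {}, elapsed ms: {}", result, System.currentTimeMillis() - ts);

    // OCR on a region: x and y are measured from the top-left corner, width and height extend from (x, y)
    Rectangle rect = new Rectangle(8, 604, 59, 18);
    ts = System.currentTimeMillis();
    result = instance.doOCR(imageFile, rect);
    log.info("result: {}, elapsed ms: {}", result, System.currentTimeMillis() - ts);

    // Set a character whitelist so that only digits are detected
    ts = System.currentTimeMillis();
    instance.setVariable("tessedit_char_whitelist", "0123456789");
    result = instance.doOCR(imageFile, rect);
    log.info("result: {}, elapsed ms: {}", result, System.currentTimeMillis() - ts);
}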
https://juejin.cn/post/7066642049537146893
https://www.cnblogs.com/pejsidney/p/9487881.html
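tessedit_char_whitelist: setting a character whitelist
If you know which fixed set of characters the image contains, set the tessedit_char_whitelist whitelist to make the results more accurate.
For example, to detect digits only:
instance.setVariable("tessedit_char_whitelist", "0123456789");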
tess4j Set only to identify numbers and letters
https://stackoverflow.com/questions/42430384/tess4j-set-only-to-identify-numbers-and-letters
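Tess4J does not support concurrent multi-threaded access to one instance
Problem:
Creating one global ITesseract instance = new Tesseract() and then running OCR on it from several threads at once triggers a low-level C++ error, for example: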
static_cast<unsigned>(id) < this->size():Error:Assert failed:in file src/ccutil/unicharset.cpp, line 283
Solution:
Create a new Tesseract() instance for each OCR call.
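The fix can be sketched without tess4j on the classpath. FakeEngine below is a hypothetical stand-in for net.sourceforge.tess4j.Tesseract, and PerCallOcr is a name invented for this example. In real code the supplier would create a Tesseract and configure it with setDatapath and setLanguage. The point is only the shape: each call gets its own engine, so no instance is ever touched by two threads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

final class PerCallOcr {

    // Hypothetical stand-in for net.sourceforge.tess4j.Tesseract
    static final class FakeEngine {
        String doOCR(String image) {
            return "text-of-" + image;
        }
    }

    private final Supplier<FakeEngine> factory;
    // Records every engine created, only so the demo can show that none is shared
    final Set<FakeEngine> created = ConcurrentHashMap.newKeySet();

    PerCallOcr(Supplier<FakeEngine> factory) {
        this.factory = factory;
    }

    // Each call builds its own engine instead of reusing a global one
    String recognize(String image) {
        FakeEngine engine = factory.get();
        created.add(engine);
        return engine.doOCR(image);
    }

    // Submits one task per image and returns the results in input order
    static List<String> runConcurrently(PerCallOcr ocr, List<String> images, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String image : images) {
                futures.add(pool.submit(() -> ocr.recognize(image)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("OCR task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        PerCallOcr ocr = new PerCallOcr(FakeEngine::new);
        List<String> out = runConcurrently(ocr, Arrays.asList("a.png", "b.png", "c.png"), 3);
        System.out.println(out + ", engines created: " + ocr.created.size());
    }
}
```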
NPE during concurrent thread access of a single tess4j instance
https://stackoverflow.com/questions/28954476/npe-during-concurrent-thread-access-of-a-single-tess4j-instance
Tess4j on Windows 64-bit: exception on multiple threads
https://stackoverflow.com/questions/24799038/tess4j-on-windows-64-bit-exception-on-multiple-threads
How to use multi thread in tess4j
https://github.com/nguyenq/tess4j/issues/46
Multi threading / parallel processing - Java 8 - JVM 64 bit - Tess4J 1.3.0 / 1.4.1
https://sourceforge.net/p/tess4j/discussion/1202293/thread/4562eccb/
Errors caused by a JNA version conflict
instance.doOCR fails with:
java.lang.NoSuchMethodError: com.sun.jna.Native.load(Ljava/lang/String;Ljava/lang/Class;)Lcom/sun/jna/Library;
at net.sourceforge.tess4j.util.LoadLibs.getTessAPIInstance(LoadLibs.java:83)
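Cause:
A JNA version conflict. The Dependency Analyzer plugin showed two JNA versions being pulled in: JNA-4.5.1 from elasticsearch-7.6.2 and JNA-5.13.0 from tess4j-5.7.0. Native.load exists only in JNA 5.x, so when the 4.5.1 jar wins, the method is missing at runtime.
Fix:
Exclude the JNA dependency brought in by elasticsearch-7.6.2.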
https://github.com/testcontainers/testcontainers-java/issues/3734
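A model and engine mode mismatch crashes the Java process outright (Error)
For example, using the eng best model with ocrEngineMode set to 0 produces the ERROR below and the process crashes: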
Error: Tesseract (legacy) engine requested, but components are not present in /root/apps/da/tesseract-traineddata/eng_best.traineddata!!
Failed loading language 'eng_best'
Tesseract couldn't load any languages!
Warning: Invalid resolution 0 dpi. Using 70 instead.
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f9c06c43737, pid=7, tid=0x00007f9c0e8c9700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_202-b08) (build 1.8.0_202-b08)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.202-b08 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libtesseract.so.4.0.1+0xc9737] tesseract::Tesseract::recog_all_words(PAGE_RES*, ETEXT_DESC*, TBOX const*, char const*, int)+0x637
#
# Core dump written. Default location: /root/apps/da/core or core.7
#
# An error report file with more information is saved as:
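Because the mismatch takes down the whole JVM instead of throwing a Java exception, it may be worth validating the combination before calling into Tesseract. Below is a minimal sketch of such a guard. OemGuard and ModelKind are hypothetical names invented for this example. Rejecting oem 2 for best/fast models is my assumption (mode 2 also needs the legacy components); only oem 0 was observed to crash in this article.

```java
final class OemGuard {

    // Which official repository the traineddata file came from
    enum ModelKind { TESSDATA, TESSDATA_BEST, TESSDATA_FAST }

    // Returns true if the engine mode can be used with the given model kind
    static boolean isSupported(ModelKind kind, int oem) {
        if (oem < 0 || oem > 3) {
            return false; // not a valid --oem value
        }
        boolean lstmOnly = kind == ModelKind.TESSDATA_BEST || kind == ModelKind.TESSDATA_FAST;
        boolean needsLegacy = oem == 0 || oem == 2; // oem 2 is an assumption, see above
        return !(lstmOnly && needsLegacy);
    }

    // Fails with a normal Java exception instead of letting native code crash
    static void check(ModelKind kind, int oem) {
        if (!isSupported(kind, oem)) {
            throw new IllegalArgumentException(
                    "Engine mode " + oem + " is not usable with " + kind + " models");
        }
    }

    public static void main(String[] args) {
        check(ModelKind.TESSDATA, 0); // legacy model with legacy engine: accepted
        try {
            check(ModelKind.TESSDATA_BEST, 0); // the combination that crashed above
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Calling check before setting the engine mode on the tess4j instance turns the fatal SIGSEGV into an exception the application can log and handle.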