Article from: https://www.cnblogs.com/zhangxinqi/p/9297292.html

OCR (Optical Character Recognition) is the process of scanning characters and translating them into electronic text based on their shapes. Graphic verification codes (CAPTCHAs) are usually a few irregular characters with some distortion applied. OCR can convert such an image into electronic text, and the extracted result can then be submitted to the server, so the recognition step is automated.

tesserocr and pytesseract are OCR libraries for Python. tesserocr is a Python API wrapper around Tesseract, and pytesseract is a wrapper for Google's Tesseract-OCR engine. Both are built on Tesseract, so Tesseract must be installed before installing tesserocr.

1、Install Tesseract, tesserocr, pytesseract

(1)Installation on Windows

Download tesseract:https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.0.0-beta.1.20180414.exe

Double-click the installer to run it. You can check the "Additional language data (download)" option to install the language packs that OCR recognition supports, but downloading them this way is very slow. Alternatively, download the language data as a zip archive from https://github.com/tesseract-ocr/tessdata, unzip it, and copy the files from tessdata-master into the tessdata directory under the Tesseract installation directory, C:\Program Files (x86)\Tesseract-OCR\tessdata. Finally, add C:\Program Files (x86)\Tesseract-OCR to the PATH environment variable.

Before testing, understand the command format of Tesseract.

 tesseract imagename outputbase [-l lang]

imagename specifies the image file, outputbase specifies the output file name (Tesseract appends .txt to it), and -l specifies the recognition language.
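The command format above can also be assembled from Python. The following is a minimal sketch, assuming the tesseract executable is on PATH; the helper names build_tesseract_command and run_tesseract are hypothetical and are not part of Tesseract or of any library discussed here.

```python
import subprocess


def build_tesseract_command(imagename, outputbase, lang=None):
    """Build the argument list for: tesseract imagename outputbase [-l lang]."""
    command = ["tesseract", imagename, outputbase]
    if lang is not None:
        command += ["-l", lang]
    return command


def run_tesseract(imagename, outputbase, lang=None):
    """Run Tesseract and return the text it wrote to outputbase + '.txt'.

    Assumes the tesseract executable can be found on PATH.
    """
    subprocess.run(build_tesseract_command(imagename, outputbase, lang), check=True)
    with open(outputbase + ".txt", encoding="utf-8") as handle:
        return handle.read()


# Only builds the command; nothing is executed here.
print(build_tesseract_command("image.png", "result", lang="eng"))
# → ['tesseract', 'image.png', 'result', '-l', 'eng']
```

Passing the arguments as a list avoids shell quoting problems when the image path contains spaces.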

#List the installed language packs
tesseract --list-langs

#Show help
tesseract --help
tesseract --help-extra
tesseract --version

Test:

#Count the installed language packs: 168 are installed
C:\Users\Administrator.DESKTOP-6JT7D2H>tesseract --list-langs | find /c /v ""
168

#Test with an image; the string is recognized successfully
tesseract image.png result -l eng && type result.txt
Python3WebSpider

Because tesserocr has various compatibility problems on Windows and does not work well with PyCharm virtual environments, we install the pytesseract module on Windows instead. If you still want to install tesserocr, use a .whl file or install it with conda.

pip install pytesseract

If pytesseract cannot find the Tesseract executable, which generally happens in a virtual environment, add the directory containing the Tesseract-OCR executable tesseract.exe to the Windows PATH environment variable, or edit pytesseract.py and set the "tesseract_cmd" field to the full path of tesseract.exe.
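One way to check where the executable is before touching any configuration is sketched below. It is only an illustration: the function find_tesseract is hypothetical, and the candidate paths are assumptions (the Windows path is the default install location mentioned in this article, and /usr/bin/tesseract is a guess at a typical Linux location).

```python
import os
import shutil


def find_tesseract(candidates=(), name="tesseract"):
    """Return a path to the Tesseract executable, or None if it is not found.

    PATH is searched first, then each explicit candidate path in order.
    """
    found = shutil.which(name)
    if found:
        return found
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


# Candidate locations are assumptions; adjust them to your installation.
CANDIDATES = [
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
    "/usr/bin/tesseract",
]

print(find_tesseract(CANDIDATES))
```

If a path is returned, it can be assigned to pytesseract.pytesseract.tesseract_cmd instead of editing pytesseract.py by hand.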

Test recognition function:

import pytesseract
from PIL import Image

im=Image.open('image.png')
print(pytesseract.image_to_string(im))

(2)Installation on Linux

In Ubuntu, Debian and Deepin systems, the installation commands are as follows:

#Install Tesseract
sudo apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev

#Install the language packs
git clone https://github.com/tesseract-ocr/tessdata.git
sudo mv tessdata/* /usr/share/tesseract-ocr/tessdata

#Install tesserocr
pip3 install tesserocr

#Install pytesseract
pip3 install pytesseract

On CentOS and Red Hat systems, the installation commands are as follows:

#Install Tesseract
yum install -y tesseract

#Install the language packs
git clone https://github.com/tesseract-ocr/tessdata.git
mv tessdata/* /usr/share/tesseract/tessdata

#Install tesserocr
pip3 install tesserocr

#Install pytesseract
pip3 install pytesseract

Test the installation:

In [1]: import tesserocr
In [2]: from PIL import Image
In [3]: im=Image.open('image.png')
In [4]: tesserocr.image_to_text(im)
Out[4]: 'Python3WebSpider\n\n'
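Before running a recognition test like the one above, it can help to confirm which of the two wrappers is importable in the current environment. The following is a small sketch using only the standard library; the function name available_ocr_modules is hypothetical.

```python
import importlib.util


def available_ocr_modules(names=("tesserocr", "pytesseract")):
    """Map each module name to True if it can be imported in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


print(available_ocr_modules())
```

This only checks that the Python packages are installed; it does not verify that the Tesseract engine itself is present.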

tesserocr installation reference: https://github.com/sirfz/tesserocr

pytesseract installation reference: https://github.com/madmaze/pytesseract

tesseract installation reference: https://github.com/tesseract-ocr/tesseract/wiki

2、Using the tesserocr and pytesseract modules

(1)Using tesserocr

#Recognize the text in an image file
In [7]: tesserocr.file_to_text('image.png')
Out[7]: 'Python3WebSpider\n\n'

#View the language packs installed for Tesseract
In [8]: tesserocr.get_languages()
Out[8]: ('/usr/share/tesseract/tessdata/', ['eng'])

#Recognize the text in an image object
In [9]: tesserocr.image_to_text(im)
Out[9]: 'Python3WebSpider\n\n'

#View version information
In [10]: tesserocr.tesseract_version()
Out[10]: 'tesseract 3.04.00\n leptonica-1.72\n  libgif 4.1.6(?) : libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libwebp 0.3.0\n'

(2)Using pytesseract

Functions:

  • get_tesseract_version  Returns the version of Tesseract installed on the system.
  • image_to_string  Returns the result of a Tesseract OCR run on the image as a string.
  • image_to_boxes  Returns the recognized characters and their bounding boxes.
  • image_to_data  Returns box boundaries, confidences, and other information. Requires Tesseract 3.05+. For more information, see the Tesseract TSV documentation.
  • image_to_osd  Returns information about orientation and script detection.

Parameters:

image_to_data(image, lang=None, config='', nice=0, output_type=Output.STRING)

  • image Object  Image object
  • lang String  Tesseract language code string
  • config String  Any additional configuration flags as a string, for example: config='--psm 6'
  • nice Integer  Modifies the processor priority of the Tesseract run. Not supported on Windows. Nice adjusts the niceness of Unix-like processes.
  • output_type  Class attribute specifying the type of the output; defaults to string. For the full list of supported types, see the definition of the pytesseract.Output class.

from PIL import Image
import pytesseract

#If the Tesseract executable is not in PATH, specify its full path.
pytesseract.pytesseract.tesseract_cmd=r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'

#Print the text recognized in the image
print(pytesseract.image_to_string(Image.open('test.png')))

#Specify the recognition language, eng for English
print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='eng'))

#Get the bounding boxes of the recognized characters
print(pytesseract.image_to_boxes(Image.open('test.png')))

#Get detailed data including bounding boxes, confidences, line and page numbers.
print(pytesseract.image_to_data(Image.open('test.png')))

#Get orientation and script detection information
print(pytesseract.image_to_osd(Image.open('test.png')))
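With the default string output, image_to_data returns tab-separated text with a header row. The sketch below shows one way to pull out the recognized words and their confidences using only the standard library. The function name words_with_confidence is hypothetical, and SAMPLE_TSV is made-up data in the Tesseract TSV column layout, not real output from any image in this article.

```python
import csv
import io


def words_with_confidence(tsv_text, min_conf=0.0):
    """Return (text, confidence) pairs for non-empty words at or above min_conf."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        # Structural rows (page, block, line) carry no text and a conf of -1.
        if text and conf >= min_conf:
            words.append((text, conf))
    return words


# Made-up sample in the TSV layout; real values depend on the image.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t200\t50\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t12\t180\t30\t91\tPython3WebSpider\n"
)

print(words_with_confidence(SAMPLE_TSV, min_conf=60))
# → [('Python3WebSpider', 91.0)]
```

In real use the first argument would be the string returned by pytesseract.image_to_data. Filtering on confidence is a simple way to discard words the engine was unsure about.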

3、Simple application of image recognition

When recognizing verification code images, you usually need to convert the image to grayscale and binarize it to improve recognition accuracy. The following is a simple example of processing and recognizing a verification code image. Its recognition rate is only about thirty percent, so in practice another way of getting past the verification code is needed.

from PIL import Image
import pytesseract

im = Image.open('66.png')
#Binarize the image; takes the image and a threshold
def erzhihua(image,threshold):
    ''':type image:Image.Image'''
    image=image.convert('L')
    table=[]
    for i in range(256):
        if i <  threshold:
            table.append(0)
        else:
            table.append(1)
    return image.point(table,'1')


image=erzhihua(im,127)
image.show()

result=pytesseract.image_to_string(image,lang='eng')
print(result)
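The lookup table that erzhihua builds can be looked at in isolation, without Pillow. The sketch below uses hypothetical helper names (build_threshold_table, binarize_pixels) and applies the table to a plain list of gray levels to show which values end up black (0) and which end up white (1).

```python
def build_threshold_table(threshold):
    """Return a 256-entry lookup table: 0 below the threshold, 1 otherwise."""
    return [0 if level < threshold else 1 for level in range(256)]


def binarize_pixels(pixels, threshold):
    """Map each gray level (0-255) through the threshold table."""
    table = build_threshold_table(threshold)
    return [table[p] for p in pixels]


print(binarize_pixels([0, 126, 127, 255], 127))
# → [0, 0, 1, 1]
```

Because the threshold decides which strokes survive, trying a few values around 127 on a sample image is a cheap way to see whether recognition improves.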

Simulated login with automatic verification code recognition:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2018/7/13 8:58
# @Author  : Py.qi
# @File    : login.py
# @Software: PyCharm
from selenium import webdriver
from selenium.common.exceptions import TimeoutException,WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.remote.webelement import WebElement
from io import BytesIO
from PIL import Image
import pytesseract
import time

user='zhang'
password='123'
url='http://10.0.0.200'
driver=webdriver.Chrome()
wait=WebDriverWait(driver,10)

#Recognize the verification code
def acker(content):
    im_erzhihua=erzhihua(content,127)
    result=pytesseract.image_to_string(im_erzhihua,lang='eng')
    #Strip surrounding whitespace; the OCR result ends with newlines
    return result.strip()

#Binarize the verification code image
def erzhihua(image,threshold):
    ''':type image:Image.Image'''
    image=image.convert('L')
    table=[]
    for i in range(256):
        if i <  threshold:
            table.append(0)
        else:
            table.append(1)
    return image.point(table,'1')

#Automatic login
def login():
    try:
        driver.get(url)
        #Get the user input box
        input=wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,'#loginname'))) #type:WebElement
        input.clear()
        #Send username
        input.send_keys(user)
        #Get the password box
        inpass=wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,'#password'))) #type:WebElement
        inpass.clear()
        #Send a password
        inpass.send_keys(password)
        #Get the verification code input box
        yanzheng=wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,'#code'))) #type:WebElement
        #Get the location of the verification code image on the page
        codeimg=wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,'#codeImg'))) #type:WebElement
        image_location = codeimg.location
        #Take a screenshot of the page and crop the verification code region
        image=driver.get_screenshot_as_png()
        im=Image.open(BytesIO(image))
        #crop takes (left, upper, right, lower); 488 and 473 are hard-coded for this page
        imag_code=im.crop((image_location['x'],image_location['y'],488,473))
        #Enter the verification code and log in
        yanzheng.clear()
        yanzheng.send_keys(acker(imag_code))
        time.sleep(2)
        yanzheng.send_keys(Keys.ENTER)
    except TimeoutException as e:
        print('timeout:',e)
    except WebDriverException as e:
        print('webdriver error:',e)

if __name__ == '__main__':
    login()
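The crop box in the script hard-codes its right and bottom edges. A sketch of how the box could instead be derived from the element's location and size (the dictionaries Selenium exposes as WebElement.location and WebElement.size) is shown below, together with a small cleanup step for the OCR result. Both helper names are hypothetical. The scale parameter is an assumption for displays where screenshot pixels differ from page coordinates, and the alphanumeric-only filter assumes the verification code contains only letters and digits.

```python
import re


def crop_box(location, size, scale=1.0):
    """Return (left, upper, right, lower) for Image.crop.

    location has keys 'x' and 'y'; size has keys 'width' and 'height'.
    scale is an assumed screenshot-to-page pixel ratio (1.0 means no scaling).
    """
    left = location["x"] * scale
    top = location["y"] * scale
    right = left + size["width"] * scale
    bottom = top + size["height"] * scale
    return tuple(int(round(value)) for value in (left, top, right, bottom))


def clean_captcha_text(raw):
    """Drop everything except ASCII letters and digits from the OCR result."""
    return re.sub(r"[^0-9A-Za-z]", "", raw)


# Illustrative values only; real ones come from the located element.
print(crop_box({"x": 400, "y": 440}, {"width": 88, "height": 33}))
# → (400, 440, 488, 473)
print(clean_captcha_text("Ab 3d\n\n"))
# → Ab3d
```

In the login script, the result of crop_box(codeimg.location, codeimg.size) could be passed to im.crop, and clean_captcha_text could wrap the value returned by acker before it is typed into the input box.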

Reference links:

tesserocr GitHub: https://github.com/sirfz/tesserocr

tesserocr PyPI: https://pypi.python.org/pypi/tesserocr

pytesseract GitHub: https://github.com/madmaze/pytesseract

pytesseract PyPI: https://pypi.org/project/pytesseract/

tesseract download address: http://digi.bib.uni-mannheim.de/tesseract

tesseract GitHub: https://github.com/tesseract-ocr/tesseract

tesseract language packs: https://github.com/tesseract-ocr/tessdata

tesseract documentation: https://github.com/tesseract-ocr/tesseract/wiki/Documentation
