How to apply OCR to recognize text from any image using Python

Recently an intriguing thought struck me. With all the automation and technological advances we read about every day, I started to wonder whether it is possible to read text from any image electronically. I turned to the omniscient Google for answers and voila! I realized that this thought had hit me several decades late. To begin with, I came across terms like Computer Vision and Optical Character Recognition. Nonetheless, I decided to step into this new world of Computer Vision and try it out myself. My aim in this post is to give you an understanding of Computer Vision and Optical Character Recognition (OCR), and to show how we can extract text from an image using Python.

What is Computer Vision:

Computer Vision is a field of computer science wherein we want to enable computers to identify and process images and objects just as we humans do. It also includes providing useful results based on the observation. Let me elaborate with some practical examples.

  • One of the most prominent application fields is medical computer vision, where information extracted from image data helps diagnose a patient. Examples include the detection of tumors, arteriosclerosis or other malignant changes.
  • A text scanner is another widely used computer vision-based application. With it, we can read the text in an image using Optical Character Recognition, display that text on a screen and perform any further operation or task we need.

What is Optical Character Recognition (OCR):

OCR is a subfield of Computer Vision. As the name suggests, it is a technique for recognizing text inside a digital image of a physical document, such as a scanned page, so that the text can be converted into an editable form, for example a word-processing document.

Having explained these terms, we will now select two images to read text from. At times we need to read an image from a URL, so the first image I have picked is one that is available at a URL. The second is a plain-text image that I found through Google and saved as a file. I have shared both below.

[Image: Optical Character Recognition] [Image: Text in image]

Let us move to Python and the code to pull out text from these images. The code provided is written in Google Colab.

First, we need to install and/or import required libraries. Let us briefly look at the libraries used.

Libraries used:

Pytesseract – Python-tesseract is an optical character recognition (OCR) tool for Python. It is a wrapper around the Tesseract OCR engine and can recognize and read text embedded in images.

IO – The io module provides Python’s main facilities for dealing with various types of I/O.

BytesIO – BytesIO is an in-memory binary stream from the io module. Binary I/O (also called buffered I/O) expects bytes-like objects and produces bytes objects. No encoding, decoding, or newline translation is performed. This category of streams is used for all kinds of non-text data, such as the bytes of an image.

Requests – Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. We use it here to download the content behind a URL, in our case the raw bytes of an image.

PIL – The Python Imaging Library, nowadays installed through its maintained fork Pillow, is a free library for the Python programming language that adds support for opening, manipulating, and saving many different image file formats.
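To make the role of BytesIO more concrete, here is a minimal sketch that uses only the standard library. It wraps a few bytes in an in-memory stream and reads them back the way one would read a file. The byte string is made-up sample data (the 8-byte PNG signature followed by a placeholder), not a real image.

```python
from io import BytesIO

# Made-up sample data: the 8-byte PNG signature plus placeholder bytes.
# This is NOT a complete image; it only stands in for downloaded content.
raw_bytes = b"\x89PNG\r\n\x1a\n" + b"placeholder"

# BytesIO wraps the bytes in an object that behaves like an open binary file
stream = BytesIO(raw_bytes)

signature = stream.read(8)   # read the first 8 bytes, as from a file
position = stream.tell()     # the stream remembers how far we have read
stream.seek(0)               # rewind to the start, again like a file
everything = stream.read()   # read all bytes from the current position

print(signature)
print(position)
print(everything == raw_bytes)
```

Pillow's Image.open accepts a file-like object of this kind, which is why downloaded bytes can be opened as an image without first saving them to disk.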

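The installation and import code is as follows:

# Install the Tesseract engine and its Python wrapper (Colab shell commands)
!sudo apt install tesseract-ocr
!pip install pytesseract

import pytesseract

# Pillow exposes Image under the PIL package; fall back to the legacy import
try:
    from PIL import Image
except ImportError:
    import Image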

Working on the first image:

Let us pick our first image, the one behind a URL. We need to send a request to the URL and receive its content with the requests.get command. Since the content is an image, we wrap the received bytes in BytesIO, open them with Image.open and store the image in a variable. Lastly, we read the text in the image using the image_to_string command of pytesseract and print the result. The commands are provided below.


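import requests
from io import BytesIO

# Download the image; response.content holds the raw bytes
url = "https://i.oodleimg.com/item/5528750302u_0x424x360f?1570301166"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

# Wrap the bytes in a file-like object so that Pillow can open them
img1 = Image.open(BytesIO(response.content))

# Run OCR on the image and print the recognized text
extractedInformation = pytesseract.image_to_string(img1)
print(extractedInformation)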

The output generated looks like this.

[Image: output file]
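A download does not always return an image: a URL may have expired, and the server may send an error page instead. As a hedged sketch, the hypothetical helper below (sniff_image_format is my own name, not part of requests or Pillow) checks the first bytes of the downloaded content against the well-known signatures of PNG, JPEG and GIF files before the content is handed to OCR.

```python
# Well-known leading bytes ("magic numbers") of some common image formats
IMAGE_SIGNATURES = {
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpeg": (b"\xff\xd8\xff",),
    "gif": (b"GIF87a", b"GIF89a"),
}


def sniff_image_format(data):
    """Return 'png', 'jpeg' or 'gif' if data starts with a known signature, else None."""
    for name, signatures in IMAGE_SIGNATURES.items():
        # bytes.startswith accepts a tuple and matches any of its entries
        if data.startswith(signatures):
            return name
    return None


# Invented sample inputs, standing in for response.content
print(sniff_image_format(b"\x89PNG\r\n\x1a\n" + b"rest of file"))  # png
print(sniff_image_format(b"<!DOCTYPE html><html>Not found</html>"))  # None
```

With the download code shown earlier, one could call sniff_image_format(response.content) and skip the OCR step whenever it returns None. Only three formats are covered here, so a None result means "not recognized by this sketch" rather than "definitely not an image".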

Working on the second image:

Now let us pick the second image. It is a file and not a URL, so we no longer need requests or BytesIO and can pass the file name to Image.open directly, as shown below.


# Open the local image file, run OCR on it and print the recognized text
img2 = Image.open('Text_paragraph.png')
extractedInformation = pytesseract.image_to_string(img2)
print(extractedInformation)

The output generated looks like this.

[Image: output file]
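The string returned by image_to_string can carry extra whitespace, such as blank lines or a trailing form-feed character, depending on the Tesseract version. The sketch below tidies such a string using only the standard library. clean_ocr_text is a hypothetical helper name, and the sample string is invented for illustration; it is not the actual output for the images above.

```python
def clean_ocr_text(text):
    """Strip surrounding whitespace from each line and drop empty lines."""
    # splitlines() also treats the form-feed character (\x0c) as a line break
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# Invented sample standing in for the value of extractedInformation
sample = "  First line of text \n\n\nSecond line\n\x0c"
print(clean_ocr_text(sample))
```

Calling clean_ocr_text(extractedInformation) before printing keeps the recognized lines and removes the padding around them.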

Shortcomings of OCR:

One popular application of OCR is handwriting recognition. Converting handwritten text into digital text greatly increases its uniformity and usability. Having said that, OCR has the disadvantage of not being 100% accurate. As you might have already noticed, it could not correctly recognize the website address in the first image. There are ways to improve the accuracy of text recognition, but they are outside the scope of this article.
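One simple way to put a number on that inaccuracy is to compare the OCR output with the text we know is in the image. The sketch below uses difflib from the standard library. The two strings are invented examples (not the actual output shown earlier), and ocr_similarity is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def ocr_similarity(expected, recognized):
    """Return a ratio from 0.0 (nothing in common) to 1.0 (identical strings)."""
    return SequenceMatcher(None, expected, recognized).ratio()


# Invented example: the text we expect and a plausible OCR misreading of it
expected = "www.example.com"
recognized = "wvw.examp1e.corn"

print(round(ocr_similarity(expected, recognized), 2))
print(ocr_similarity(expected, expected))  # 1.0
```

A ratio close to 1.0 means the recognized text is nearly identical to the expected text. The ratio is only a rough character-level measure, but it is enough to compare two runs on the same image.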

With this, we would like to encourage you to try this interesting concept of Optical Character Recognition from the powerful world of Computer Vision and post your comments and remarks.
