Rounak Jain Dec 05, 2019
Recently an intriguing thought struck me. With all the automation and technological advancement that we read about every day, I started to wonder whether it is possible to electronically read text from any image. I turned to the omniscient Google for answers and voila! Thanks to Google, I realized that this thought had hit me several decades late. I came across terms like Computer Vision and Optical Character Recognition, to begin with. Nonetheless, I decided to step into this new world of Computer Vision and try it out myself. My aim in this post is to give you an understanding of Computer Vision and Optical Character Recognition (OCR), and to show how we can extract text from an image using Python.
Computer Vision is a field of computer science wherein we want to enable computers to identify and process images and objects just as we humans do. It also includes providing useful results based on the observation. Let me elaborate with some practical examples.
OCR is a subfield of Computer Vision. As the name suggests, it is a technique for recognizing text inside a digital image of a physical document, such as a scanned page. The recognized text can then be converted into an editable word-processing document.
Having explained these terms, I will now select two images to read text from. At times we need to read an image from a URL, so the first image I picked is one available at a URL. The other is a plain-text image I found through Google. I have shared both below.
Let us move to Python and the code to pull out text from these images. The code provided is written in Google Colab.
First, we need to install and/or import required libraries. Let us briefly look at the libraries used.
Pytesseract – Python-tesseract is an optical character recognition (OCR) tool for Python. It is a wrapper around the Tesseract OCR engine and can recognize and read text embedded in images.
IO – The io module provides Python’s main facilities for dealing with various types of I/O.
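BytesIO – A class in the io module that keeps binary data in an in-memory buffer and lets us treat it like a file. Binary I/O (also called buffered I/O) expects bytes-like objects and produces bytes objects. No encoding, decoding, or newline translation is performed. This category of streams is used for all kinds of non-text data, such as images.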
Requests – Requests is an HTTP library that lets you send HTTP/1.1 requests with very little code. It can fetch whatever a URL serves, whether that is the HTML of a web page or, as in our case, the bytes of an image.
PIL – The Python Imaging Library, which today is installed through its maintained fork, Pillow. It is a free library for the Python programming language that adds support for opening, manipulating, and saving many different image file formats.
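To make the BytesIO entry above a little more concrete, here is a minimal sketch using only the standard library. The `payload` value is a made-up stand-in for bytes that would normally come from a download; it is not data from this article.

```python
from io import BytesIO

# Hypothetical payload standing in for bytes downloaded from a URL.
payload = b"first line\nsecond line\n"

# Wrap the raw bytes in an in-memory, file-like object.
stream = BytesIO(payload)

# The stream can now be read like a file opened in binary mode.
first = stream.readline()
rest = stream.read()

# Rewind to the start and read everything again.
stream.seek(0)
everything = stream.read()

print(first, rest, everything == payload)
```

This file-like behaviour is why downloaded image bytes can be handed to a function that expects a file without first saving them to disk.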
The code is as follows:
!sudo apt install tesseract-ocr
!pip install pytesseract
import pytesseract
try:
    from PIL import Image
except ImportError:
    import Image
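Pytesseract relies on the Tesseract program itself being installed, which is why the setup needs an apt step in addition to pip. If you run the code outside Colab, a quick check like the one below may save some confusion. It is only a sketch: the helper name `tesseract_available` is my own, and it assumes the executable is called `tesseract` and is expected on the system PATH.

```python
import shutil


def tesseract_available(command="tesseract"):
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(command) is not None


if not tesseract_available():
    # Assumption: on Debian/Ubuntu the package is named tesseract-ocr.
    print("Tesseract was not found on PATH; install it before running OCR.")
```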
Let us pick our first image, the one behind a URL. We need to send a request to the URL and receive its content using the requests.get command. Since the content is an image, we wrap the downloaded bytes in BytesIO, open them with Image.open, and store the image in a variable. Lastly, we read the text in the image using the image_to_string command of pytesseract and print the result. The commands are provided below.
import requests
from io import BytesIO

url = "https://i.oodleimg.com/item/5528750302u_0x424x360f?1570301166"
# Download the image; response.content holds its raw bytes
response = requests.get(url)
# Wrap the bytes in a file-like object so that PIL can open them
img1 = Image.open(BytesIO(response.content))
# Run OCR on the image and print the recognized text
extractedInformation = pytesseract.image_to_string(img1)
print(extractedInformation)
The output generated looks like this.
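One caveat about the download step: it assumes the URL is reachable and really serves an image. A slightly more defensive variant might look like the sketch below. The function names `is_image_response` and `extract_text_from_url` and the 10-second timeout are my own choices for illustration, not part of any library.

```python
from io import BytesIO


def is_image_response(status_code, content_type):
    """Return True when an HTTP response plausibly carries image bytes."""
    return status_code == 200 and content_type.lower().startswith("image/")


def extract_text_from_url(url, timeout=10):
    """Download an image from `url` and return the text OCR finds in it."""
    # Imported here so the helper above can be used without these packages.
    import requests
    import pytesseract
    from PIL import Image

    response = requests.get(url, timeout=timeout)
    content_type = response.headers.get("Content-Type", "")
    if not is_image_response(response.status_code, content_type):
        raise ValueError(
            f"Expected an image, got status {response.status_code} "
            f"with content type {content_type!r}"
        )
    image = Image.open(BytesIO(response.content))
    return pytesseract.image_to_string(image)
```

Calling `extract_text_from_url(url)` with the URL used earlier should behave like the original commands. If the link has expired or returns an HTML page, it raises an error instead of passing bad data to the OCR step.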
Now let us pick the second image. It is a file and not a URL, so we can open it directly without requests or BytesIO. The commands change accordingly, as shown below.
# Open the image file directly from disk
img2 = Image.open('Text_paragraph.png')
extractedInformation = pytesseract.image_to_string(img2)
print(extractedInformation)
The output generated looks like this.
One popular application of OCR is handwriting recognition. Converting handwritten text into digital text greatly increases its uniformity and usability. That said, OCR has the disadvantage of not being 100% accurate. As you might have already noticed, it could not correctly recognize the website in the first image. There are ways to improve the accuracy of text recognition, but they are beyond the scope of this article.
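If you want a rough number for how close the OCR output is to the text you expected, the standard library's difflib module can compare the two strings. This is only a sketch: the `similarity` helper and the sample strings are hypothetical and are not the actual output from the images above.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Collapse runs of whitespace and lowercase the text before comparing."""
    return " ".join(text.split()).lower()


def similarity(ocr_text, expected_text):
    """Return a ratio between 0.0 and 1.0; 1.0 means the texts match."""
    matcher = SequenceMatcher(None, normalize(ocr_text), normalize(expected_text))
    return matcher.ratio()


# Hypothetical example: the digit "1" misread in place of the letter "l".
score = similarity("www.examp1e.com", "www.example.com")
print(round(score, 3))
```

A score just below 1.0, as in this example, points to a small misread of the kind OCR often makes with similar-looking characters.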
With this, we would like to encourage you to try this interesting concept of Optical Character Recognition from the powerful world of Computer Vision and post your comments and remarks.