
Top 5 Python Libraries for Extracting Text from Images

Source: Eugenia Anello - 1501-01-01T04:59:59.999Z


Understand and master OCR tools for text localization and recognition

Eugenia Anello

Towards Data Science

Photo by Anna Sullivan on Unsplash

Optical Character Recognition (OCR) is an old but still challenging problem that involves detecting and recognizing text in unstructured data, including images and PDF documents. It has cool applications in banking, e-commerce and content moderation on social media.

But as with every topic in data science, there is a huge amount of resources to sift through when you try to learn how to solve the OCR task. This is why I am writing this tutorial, which can help you get started.

In this article, I am going to show some Python libraries that let you quickly extract text from images without struggling too much. The explanation of each library is followed by a practical example. The dataset used is taken from Kaggle. To keep things simple, I am using just one image, of the film Rush.

Let’s get started!

Image from textOCR dataset. Source.

Table of contents:

  1. pytesseract
  2. EasyOCR
  3. Keras-OCR
  4. TrOCR
  5. docTR

1. pytesseract

pytesseract is one of the most popular Python libraries for optical character recognition. It uses Google’s Tesseract-OCR Engine to extract text from images. Multiple languages are supported; check here to see whether yours is one of them. You need just a few lines of code to convert the image into text:

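# installation (notebook syntax; drop the leading "!" in a terminal)
!sudo apt install tesseract-ocr
!pip install pytesseract

import pytesseract
# Output, Image and cv2 are imported here but not needed for this snippet
from pytesseract import Output
from PIL import Image
import cv2

# image_to_string accepts a file path, a PIL image or a NumPy array
img_path1 = '00b5b88720f35a22.jpg'
text = pytesseract.image_to_string(img_path1, lang='eng')
print(text)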
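
Before passing a value to lang, it can help to confirm that the language data is actually installed. The sketch below is one possible way to do that, assuming a pytesseract version that provides get_languages and a tesseract binary on the PATH. The helpers installed_languages and pick_language are hypothetical names introduced for illustration; they are not part of the library.

```python
import shutil


def installed_languages():
    """Return the language codes Tesseract reports, or [] if it is unavailable."""
    # Assumption: the tesseract binary is discoverable on the PATH.
    if shutil.which("tesseract") is None:
        return []
    try:
        import pytesseract  # imported lazily so the helper degrades gracefully
    except ImportError:
        return []
    return pytesseract.get_languages(config="")


def pick_language(wanted, available, fallback="eng"):
    """Use the wanted language code if it is installed, else the fallback."""
    return wanted if wanted in available else fallback


# Hypothetical usage: prefer Italian, fall back to English if it is missing.
langs = installed_languages()
lang = pick_language("ita", langs)
print("Installed languages:", langs)
print("Language passed to image_to_string:", lang)
```

The chosen code can then be passed as the lang argument of image_to_string. Back to the example: running image_to_string on the Rush image with lang='eng' prints the text it recognized.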

This is the output: