We often need to read the text in an image or a scanned document, and Python can help with that. In this blog post, I will show you how to use Python to extract text from images.
Python has some great libraries for working with images and extracting text from them. Two popular ones are Pillow (the maintained fork of PIL, the Python Imaging Library), which opens and manipulates images, and pytesseract (python-tesseract), a wrapper around the Tesseract OCR engine that does the actual text extraction. I will be using both in this blog post because they are easy to install and get started with.
First, we need to install the pytesseract and Pillow libraries. We can do this using pip:
$ pip install pytesseract
$ pip install pillow
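Next, we need to install the Tesseract OCR engine itself, because pytesseract is only a wrapper that calls the tesseract executable. Installers and packages are listed in the Tesseract documentation. Be sure to download the version that is appropriate for your operating system.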
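Before going further, it can help to confirm that Python can actually find the executable. The following is a minimal sketch, assuming the binary is named `tesseract` and is expected on the PATH. `find_tesseract` is a hypothetical helper name used for illustration and is not part of pytesseract.

```python
import shutil


def find_tesseract(name="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    # Assumption: the executable is called "tesseract" on this system
    return shutil.which(name)


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("tesseract was not found on PATH; install it or set its location manually.")
else:
    print(f"Found tesseract at: {tesseract_path}")
```

If the executable is not on the PATH, pytesseract can be pointed at it by assigning the full path to `pytesseract.pytesseract.tesseract_cmd`.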
Once we have installed the library and downloaded the tesseract executable, we can start writing our code. We will start by reading in the image:
from PIL import Image
import pytesseract

img = Image.open('sample-image.jpg')
text_from_image = pytesseract.image_to_string(img, lang="eng")
The image_to_string function returns a string that contains all the text Tesseract recognized in the image.
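The returned string often carries blank lines and stray whitespace, and it may end with a form feed character. A small cleanup step can make it easier to work with. This is only a sketch: `clean_ocr_text` is a hypothetical helper, and the sample string is made up to stand in for real OCR output.

```python
def clean_ocr_text(raw_text):
    """Strip surrounding whitespace from each line and drop empty lines."""
    lines = (line.strip() for line in raw_text.splitlines())
    return "\n".join(line for line in lines if line)


# Made-up sample standing in for the string returned by image_to_string
sample = "  Hello World  \n\n  second line\n\x0c"
print(clean_ocr_text(sample))
```

Whether this kind of cleanup is appropriate depends on the document: if blank lines mark paragraph breaks you care about, you may prefer to keep them.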
Now we can print out the text extracted from the image:
print(text_from_image)
A Full Python Script for Text Extraction from Images
import os
from pathlib import Path

from PIL import Image
import pytesseract as pt

# os.path.join adds the separator, so no trailing backslash is needed
current_location = os.getcwd()


def extract():
    """
    Function for extracting text from images.
    Additionally, it saves the extracted text as a .txt file for each image.
    """
    image_location = input("Enter the Folder name containing Images: ")
    image_path = os.path.join(current_location, image_location)

    # Enter the name of the folder which will contain the respective txt files
    destination = input("Enter your desired output location: ")
    destination_path = os.path.join(current_location, destination)
    # Create the output folder if it does not exist yet
    os.makedirs(destination_path, exist_ok=True)

    # Path to Tesseract
    tesseract_path = input("Enter the Path to Tesseract (leave empty to use PATH): ")
    print(
        '\nNOTE: '
        'It is preferable to setup the PATH variable to Tesseract, see README. \n'
    )
    # e.g. r'C:\Program Files\Tesseract-OCR\tesseract'
    # Only override the default when a path was actually entered
    if tesseract_path:
        pt.pytesseract.tesseract_cmd = tesseract_path

    # Iterating over the images in the folder
    for image_name in os.listdir(image_path):
        # Join the path and image name to obtain the absolute path
        input_path = os.path.join(image_path, image_name)
        img = Image.open(input_path)

        # OCR
        text = pt.image_to_string(img, lang="eng")

        # Removing the extension
        img_file = Path(input_path).stem
        print(img_file)

        # The output text file
        text_file = img_file + ".txt"
        output_path = os.path.join(destination_path, text_file)

        # Saving the text for every image in a separate .txt file
        with open(output_path, "w", encoding="utf-8") as file:
            file.write(text)


if __name__ == '__main__':
    extract()
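The script above passes every file in the folder to Image.open, so a stray non-image file would stop the run. One possible refinement is to filter by file extension first. The sketch below assumes a fixed set of extensions; `IMAGE_EXTENSIONS`, `list_images`, and `output_path_for` are hypothetical names chosen for illustration, not part of Pillow or pytesseract.

```python
from pathlib import Path

# Assumed set of extensions; adjust it to the formats you actually use
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}


def list_images(folder):
    """Return the image files in `folder`, sorted by name, skipping everything else."""
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


def output_path_for(image_path, destination):
    """Build the .txt path that mirrors the image's name inside `destination`."""
    return Path(destination) / (Path(image_path).stem + ".txt")


# Example: show which files in the current folder would be processed
for image in list_images("."):
    print(image.name, "->", output_path_for(image, "output"))
```

In the full script, the loop over os.listdir could iterate over `list_images(image_path)` instead, and each resulting path can be passed to Image.open as before.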
Read another example of extracting text from an image in this post.
That’s it! You have now learned how to extract text from images using Python. I hope you found this blog post helpful. If you have any questions or comments, please feel free to leave them below. Thanks for reading!