How to Convert PDF to CSV in Python with the Tabula Library (Source Code, Python Script)


If you’re working with data that’s stored in a PDF, you know how frustrating it can be to try and extract it. Luckily, there’s a tool that can help. In this blog post, we’ll show you how to use the Python Tabula library to easily convert PDFs into CSVs.

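First, we’ll need to install tabula-py, the Python wrapper around tabula-java. Because it calls tabula-java under the hood, a Java runtime must also be installed on your machine. We can install the Python package using pip: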

pip install tabula-py==1.0.3
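
Since tabula-py relies on Java, it can help to confirm that a `java` executable is reachable before running any conversion. The snippet below is a minimal sketch of such a check using only the standard library; the helper name `java_available` is our own and is not part of tabula-py.

```python
import shutil
import subprocess


def java_available():
    """Return True if a `java` executable is on PATH and runs without error."""
    java = shutil.which("java")
    if java is None:
        return False
    try:
        # `java -version` exits with status 0 on a working installation
        subprocess.run([java, "-version"], check=True, capture_output=True)
    except (OSError, subprocess.CalledProcessError):
        return False
    return True


if __name__ == "__main__":
    if java_available():
        print("Java found")
    else:
        print("Java not found; install a Java runtime before using tabula-py")
```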

Once Tabula is installed, we can start extracting data from our PDF. In this example, we’ll be working with this sample PDF. The first step is to import Tabula into our Python script:

from tabula import convert_into

Next, use the convert_into() function to convert the PDF to a CSV file:

table_file = r"Actual_Path_to_PDF"
output_csv = r"Destination_Directory/file.csv"

# convert_into() writes the CSV to disk and returns None, so there is nothing to assign
convert_into(table_file, output_csv, output_format="csv", lattice=True, stream=False, pages="all")
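
The lattice and stream options select how Tabula finds cell boundaries: lattice mode relies on the ruling lines drawn between cells, while stream mode relies on the whitespace between columns. As a rough sketch, the choice could be wrapped in a small helper. The names `build_options` and `convert` are hypothetical, and `tabula` is imported lazily so that the option helper can be used on its own.

```python
def build_options(ruled_table=True, pages="all"):
    """Return keyword arguments for tabula's convert_into().

    ruled_table=True  -> lattice mode (tables with visible cell borders)
    ruled_table=False -> stream mode (columns separated by whitespace)
    """
    return {
        "output_format": "csv",
        "lattice": ruled_table,
        "stream": not ruled_table,
        "pages": pages,
    }


def convert(pdf_path, csv_path, ruled_table=True, pages="all"):
    """Convert one PDF to CSV using the options chosen above."""
    from tabula import convert_into  # lazy import: requires tabula-py and Java

    convert_into(pdf_path, csv_path, **build_options(ruled_table, pages))
```

If lattice mode produces an empty or badly split CSV for a particular PDF, calling `convert(pdf_path, csv_path, ruled_table=False)` is a reasonable second attempt.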

Here is a full Python script that converts a PDF into a CSV:

import os

import tabula  # simple wrapper for tabula-java, read tables from PDF into csv

print("[-+-] starting pdf_csv.py...")
print("[-+-] import a pdf and convert it to a csv")

# -----------------------------------------------------------------------------
print("[-+-] importing required packages for pdf_csv.py...")
# from modules.defaults import df  # local module
print("[-+-] pdf_csv.py packages imported! \n")
# -----------------------------------------------------------------------------


# -----------------------------------------------------------------------------
def pdf_csv():  # convert pdf to csv
    print("[-+-] default filenames:")
    filename = "sample1"
    pdf = filename + ".pdf"
    csv = filename + ".csv"
    print(pdf)
    print(csv + "\n")

    print("[-+-] default directory:")
    print("[-+-] (based on current working directory of python file)")
    defaultdir = os.getcwd()
    print(defaultdir + "\n")

    print("[-+-] default file paths:")
    pdf_path = os.path.join(defaultdir, pdf)
    csv_path = os.path.join(defaultdir, csv)
    print(pdf_path)
    print(csv_path + "\n")

    print("[-+-] looking for default pdf...")
    if os.path.exists(pdf_path):  # check if the default pdf exists
        print("[-+-] pdf found: " + pdf + "\n")
        pdf_flag = True
    else:
        print("[-+-] looking for another pdf...")
        arr_pdf = [name for name in os.listdir(defaultdir) if name.endswith(".pdf")]
        if len(arr_pdf) == 1:  # there has to be only 1 pdf in the directory
            print("[-+-] pdf found: " + arr_pdf[0] + "\n")
            pdf_path = os.path.join(defaultdir, arr_pdf[0])
            pdf_flag = True
        elif len(arr_pdf) > 1:  # there is more than 1 pdf in the directory
            print("[-+-] more than 1 pdf found, exiting script!")
            pdf_flag = False  # TODO add option to select from available pdfs
        else:
            print("[-+-] pdf cannot be found, exiting script!")
            pdf_flag = False

    if pdf_flag:
        # check if csv exists at the default file path
        # if csv does not exist create a blank file at the default path
        try:
            print("[-+-] looking for default csv...")
            with open(csv_path, "r"):  # opened only to test for existence
                pass
            print("[-+-] csv found: " + csv + "\n")
        except IOError:
            print("[-+-] did not find csv at default file path!")
            print("[-+-] creating a blank csv file: " + csv + "... \n")
            with open(csv_path, "w"):  # create the empty file and close it
                pass

        print("[-+-] converting pdf to csv...")
        # print("[-+-] pdf to csv conversion suppressed! \n")
        try:
            tabula.convert_into(pdf_path, csv_path, output_format="csv", pages="all")
            print("[-+-] pdf to csv conversion complete!\n")
            # report the output location only when the conversion succeeded
            print("[-+-] converted csv file can be found here: " + csv_path + "\n")
        except IOError:
            print("[-+-] pdf to csv conversion failed!")

    print("[-+-] finished pdf_csv.py successfully!")
# -----------------------------------------------------------------------------


# -----------------------------------------------------------------------------
pdf_csv()  # run the program
# -----------------------------------------------------------------------------
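
The script above stops when it finds more than one PDF in the working directory and leaves that case as a TODO. One possible way to handle it, sketched here under our own assumptions, is to convert every PDF in turn and name each CSV after its source file. The function names `plan_conversions` and `convert_directory` are hypothetical; only `tabula.convert_into` comes from the library, called with the same arguments as in the script.

```python
import os


def plan_conversions(directory):
    """Pair every PDF in `directory` with a CSV path of the same base name."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        base, ext = os.path.splitext(name)
        if ext.lower() == ".pdf":
            pdf_path = os.path.join(directory, name)
            csv_path = os.path.join(directory, base + ".csv")
            pairs.append((pdf_path, csv_path))
    return pairs


def convert_directory(directory):
    """Convert each PDF found by plan_conversions() into its own CSV file."""
    import tabula  # lazy import: requires tabula-py and a Java runtime

    for pdf_path, csv_path in plan_conversions(directory):
        tabula.convert_into(pdf_path, csv_path, output_format="csv", pages="all")
        print("[-+-] converted " + pdf_path + " -> " + csv_path)
```

Calling `convert_directory(os.getcwd())` would then process the same directory the original script inspects.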

Download the full source code here.
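
Once the script has run, it is worth reading back the first few rows of the CSV to confirm that the table was extracted as expected. Here is a minimal sketch using Python's built-in csv module; the helper name `preview_csv` is our own, and the example assumes the output file is UTF-8 encoded.

```python
import csv
import itertools


def preview_csv(csv_path, max_rows=5):
    """Return up to `max_rows` rows from the CSV, each as a list of strings."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(itertools.islice(csv.reader(handle), max_rows))
```

For example, `for row in preview_csv("sample1.csv"): print(row)` prints the first five rows of the file produced by the script above.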

That’s all there is to it! With just a few lines of code, we were able to easily extract data from a PDF and convert it into a CSV file. The Tabula library is an incredibly powerful tool that can save you hours of manual work when dealing with data stored in PDFs.

Andy Avery

I really enjoy helping people with their tech problems to make life easier, and that’s what I’ve been doing professionally for the past decade.
