Redacting a PDF in Python – The Why and How


Redacting a document simply means blacking out or obscuring information from the document. For PDFs, this redaction is usually done for security reasons – to hide information that should not be seen by everyone. Redacting a PDF using Python is not as difficult as it may seem at first. In fact, there are only a few lines of code needed to get the job done. Let’s take a look at how to do it.

1) Why Redact a PDF?

There are many reasons why you might need to redact information from a PDF. Perhaps you’re sharing sensitive information with someone and you don’t want them to see certain parts of the document. Or maybe you’re releasing a document to the public but there are parts of it that should remain private (think social security numbers, bank account information, etc.). Whatever the reason, redacting a PDF is a relatively simple process that can be accomplished using Python.

2) How to Redact a PDF

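First things first – you’ll need to have the PyPDF2 library installed. The class and method names used in this article (PdfFileReader, PdfFileWriter, getPage, and so on) belong to the older PyPDF2 API and were removed in PyPDF2 3.0, so install a release below 3.0 by running the following command in your terminal:

pip install "PyPDF2<3.0"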

Next, open up your favorite text editor and create a new file called “redact_pdf.py”. We’ll be doing all of our work in this file.

The first thing we need to do is import the PyPDF2 library and rename it to “pdf”. This will make our code a bit cleaner and easier to read later on.

import PyPDF2 as pdf

Now we need to open the PDF that we want to redact. We’ll do this using the “open” function built into Python.

file = open('path/to/file', 'rb') # read binary mode
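As a small aside on binary mode, here is a minimal sketch that opens a file with a “with” block (so it is closed automatically) and checks whether it starts with the PDF header bytes. The helper name “looks_like_pdf” and the temporary demo file are assumptions made for illustration; they are not part of PyPDF2.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the PDF header bytes b'%PDF-'."""
    # 'rb' reads raw bytes, so nothing is decoded or altered on the way in
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


# Demo with a throwaway file that only contains a PDF-style header line
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n")

print(looks_like_pdf(tmp.name))
os.remove(tmp.name)
```

A check like this can catch a wrong path or a mislabeled file before the reader object is ever created.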

Once we have our file open, we’re going to create what’s called a “PdfFileReader” object. This object will allow us to access all of the content within our PDF.

reader = pdf.PdfFileReader(file)

Now that we have our reader object, we can loop through each page of our document and print out its contents. Try running the following code:

for page in range(reader.numPages): # loop through each page
    print(reader.getPage(page).extractText()) # extract text from each page

If everything worked correctly, you should see the text from each page printed out in your terminal window. Nice work! Now that we can see all of the text from our document, let’s go ahead and start redacting it.
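Before covering anything, it helps to know where the sensitive text actually is. The sketch below scans a list of page texts for values that look like US Social Security numbers. The function name “find_ssn_like” and the pattern itself are assumptions for illustration, and the sample strings stand in for whatever the extraction loop above prints.

```python
import re

# Assumed pattern: three digits, two digits, four digits, separated by hyphens
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_like(page_texts):
    """Return (page_index, matched_string) pairs for every SSN-like value."""
    hits = []
    for index, text in enumerate(page_texts):
        for match in SSN_LIKE.finditer(text):
            hits.append((index, match.group()))
    return hits


# Stand-in for text extracted from a two-page document
sample_pages = ["Name: Jane Doe", "SSN: 123-45-6789, phone 555-0100"]
print(find_ssn_like(sample_pages))  # → [(1, '123-45-6789')]
```

In practice, the list of page texts could be built by collecting the extracted text from each page instead of printing it.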

The first thing we’re going to do is create a new “PdfFileWriter” object. This object collects pages and lets us write them (in this case, the redacted pages) out to a new PDF file.

writer = pdf.PdfFileWriter()

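From here, we’re going to loop through each page of our document again (just like we did above). PyPDF2 has no “addText” function and cannot draw on a page by itself, so instead of printing out the text from each page, we’ll stamp a prepared overlay on top of it using the “mergePage” method. The overlay is assumed to be a one-page PDF, created beforehand in another tool, with black boxes positioned over the sensitive areas; the path below is a placeholder.

overlay_file = open('path/to/overlay', 'rb') # one-page PDF containing the black boxes
overlay = pdf.PdfFileReader(overlay_file).getPage(0)
for page_number in range(reader.numPages): # loop through each page
    page = reader.getPage(page_number)
    page.mergePage(overlay) # draw the overlay on top of the page content
    writer.addPage(page) # append the covered page to the writer
with open('path/to/new/file', 'wb') as output:
    writer.write(output) # write everything in memory out as a new file

And there you have it! Keep in mind that an overlay only covers the text visually; the original text is still stored inside the page, so check the output before sharing it.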

You’ve successfully redacted your first PDF using Python!
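Covering text is not the same as removing it: a viewer may hide the words while a text extractor still finds them. One way to check is to run text extraction on the output file and search for the sensitive terms. Here is a minimal sketch, assuming the page texts of the output have already been extracted into a list of strings; “leaked_terms” is a hypothetical helper, not a PyPDF2 function.

```python
def leaked_terms(page_texts, terms):
    """Return the sensitive terms (sorted) that still appear in any page's text."""
    found = set()
    for text in page_texts:
        lowered = text.lower()
        for term in terms:
            # Case-insensitive substring check
            if term.lower() in lowered:
                found.add(term)
    return sorted(found)


# Stand-in for text extracted from the "redacted" output file
pages_after = ["Account holder: [hidden]", "Balance for Jane Doe: 1,200"]
print(leaked_terms(pages_after, ["Jane Doe", "123-45-6789"]))  # → ['Jane Doe']
```

An empty list suggests that the terms are no longer extractable; anything else means the text is still sitting in the file.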

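3) Another Approach: Full Python Source Code

The script below takes a different route: it converts each page to an image, uses Tesseract OCR (through pytesseract) to locate the words in the “words” list, paints a filled black rectangle over each match with OpenCV, and then rebuilds a PDF from the edited images with img2pdf.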

# -*- coding: utf-8 -*-
"""
Created on Thu Aug 6 11:19:55 2020

@author: 1589928
"""
import os

import cv2
import img2pdf
import pytesseract
from pytesseract import Output
from pdf2image import convert_from_path

# words = ['Protective','Life','Insurance','legal','provide','service','complete','Beneficiary','requirements','form','please']
words = ['bookmarks', 'output', 'Sample', 'four', 'eight']


def redact(pdf_name):
    pdf_name_no_ext = os.path.splitext(pdf_name)[0]

    # convert pdf to images (200 dpi), remembering the file names in page order
    pages = convert_from_path(pdf_name, 200)
    img_files = []
    for i in range(len(pages)):
        img_name = pdf_name_no_ext + '_' + str(i) + '.jpg'
        pages[i].save(img_name, 'JPEG')
        img_files.append(img_name)

    # read images and redact words
    for i in img_files:
        print(i)
        img = cv2.imread(i)
        data = pytesseract.image_to_data(img, output_type=Output.DICT)
        data_size = len(data['level'])
        for j in range(data_size):
            if data['text'][j] in words:
                (x, y, w, h) = (data['left'][j], data['top'][j], data['width'][j], data['height'][j])
                # thickness -1 fills the rectangle, painting over the word
                cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 0), -1)
        cv2.imwrite(str(i), img)

    # create new pdf (make sure the output folder exists first)
    save_path = 'redacted_pdf/'
    os.makedirs(save_path, exist_ok=True)
    with open(os.path.join(save_path, pdf_name), "wb") as f:
        f.write(img2pdf.convert([image for image in img_files]))

    # delete images
    for image in img_files:
        os.remove(image)


pdf_name = 'sample.pdf'
redact(pdf_name)
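The script above only blacks out a word when the OCR token matches an entry in “words” exactly, so “Sample,” (with a trailing comma) or “FOUR” would slip through. The sketch below shows one possible way to make the match more forgiving by ignoring case and surrounding punctuation. The helper name “boxes_to_redact” is an assumption, and the dictionary is a hand-written stand-in with the same keys that pytesseract’s image_to_data returns in DICT mode.

```python
import string


def boxes_to_redact(data, words):
    """Return (x, y, w, h) boxes for OCR tokens that match any word,
    ignoring case and leading/trailing punctuation."""
    targets = {word.lower() for word in words}
    boxes = []
    for j, token in enumerate(data["text"]):
        normalized = token.strip(string.punctuation + string.whitespace).lower()
        if normalized and normalized in targets:
            boxes.append((data["left"][j], data["top"][j],
                          data["width"][j], data["height"][j]))
    return boxes


# Hand-written stand-in for OCR output (not produced by a real OCR run)
fake_data = {
    "text": ["", "Sample,", "page", "FOUR"],
    "left": [0, 10, 80, 140],
    "top": [0, 20, 20, 20],
    "width": [0, 60, 50, 55],
    "height": [0, 18, 18, 18],
}
print(boxes_to_redact(fake_data, ["Sample", "four"]))
# → [(10, 20, 60, 18), (140, 20, 55, 18)]
```

Each returned box could then be passed to cv2.rectangle in the same way the script does for its exact matches.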


Conclusion:

There are many reasons why you might need to redact information from a PDF – whether it’s for security reasons or simply because there is information that should not be seen by everyone. Thankfully, redacting a PDF using Python is not as difficult as it may seem at first thanks to the PyPDF2 library. With just a few lines of code, you can easily obscure or black out sensitive information from any given document!

Andy Avery

I really enjoy helping people with their tech problems to make life easier, and that’s what I’ve been doing professionally for the past decade.
