How to Repair a PDF Using Python (Step-by-Step Guide)

If you’ve ever had a PDF that refuses to open, shows garbled text, or throws errors, you know the frustration. Maybe the file was corrupted during a download, by a crash, or by a virus. Instead of relying on online tools that may compromise your privacy, you can try to fix it yourself with Python. This tutorial is for anyone comfortable with basic Python scripting who wants to take control of their PDF repair process.


By the end, you’ll have a working Python script that can detect corrupt PDFs, attempt to repair the internal structure, extract recoverable text, and regenerate a clean PDF. We’ll use two pip-installable libraries, pypdf and PyMuPDF, so there is nothing heavy to set up. Let’s dive in.


What You’ll Need


  • Python 3.7 or later installed on your system (get it from python.org)
  • A corrupted PDF file to test with (make a backup!)
  • Terminal or command prompt access
  • Basic knowledge of Python – installing packages and running scripts
  • An IDE or text editor (VS Code, PyCharm, or even Notepad works)



Step 1: Install Required Libraries


Open your terminal and run the following command to install the two main libraries we’ll use: pypdf (for reading/writing PDFs) and PyMuPDF (for robust parsing and text extraction).


pip install pypdf PyMuPDF


If you prefer a lighter option, you can also install pdfminer.six (though we’ll stick with the first two). The pypdf library handles basic repair like rebuilding the cross-reference table, while PyMuPDF (fitz) can salvage content from even badly damaged files.
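
Before moving on, it can help to confirm that both packages are importable from the interpreter you plan to use. The sketch below relies only on the standard library. The helper name missing_modules is made up for this example, and it assumes PyMuPDF is importable as fitz, as in the rest of this guide.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# PyMuPDF is installed under the name "PyMuPDF" but imported as "fitz"
missing = missing_modules(["pypdf", "fitz"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("pypdf and PyMuPDF are ready to use")
```

If anything is reported as missing, rerun the pip command above inside the same environment that runs your script.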



Step 2: Try to Open the PDF with pypdf


Create a new Python script (e.g., repair_pdf.py) and add the following code. It attempts to open the corrupted file and report any errors.


from pypdf import PdfReader

try:
    reader = PdfReader('corrupted.pdf')
    print('PDF loaded successfully.')
    print(f'Number of pages: {len(reader.pages)}')
except Exception as e:
    print(f'Failed to open PDF: {e}')


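Run the script. If you get an error such as ‘Cannot find xref table’ (the exact wording varies between versions), the file’s cross-reference table is broken. That table is the index that tells a PDF reader where each object sits in the file, so damage to it can make an otherwise intact document unreadable. It is a common problem, and one that pypdf can sometimes rebuild.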
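
To see what kind of damage you are dealing with before reaching for a library, you can look for the basic structural markers in the raw bytes. This is a rough sketch using only the standard library; the function names are hypothetical, and a file that passes these checks can still be corrupt in other ways.

```python
def inspect_pdf_bytes(data):
    """Report which basic structural markers are present in raw PDF bytes."""
    tail = data[-2048:]  # the trailer markers normally sit near the end of the file
    return {
        "has_header": data[:1024].find(b"%PDF-") != -1,
        "has_startxref": b"startxref" in tail,
        "has_eof_marker": b"%%EOF" in tail,
    }


def inspect_pdf_file(path):
    """Read a file from disk and run the same marker checks on its contents."""
    with open(path, "rb") as f:
        return inspect_pdf_bytes(f.read())
```

Calling inspect_pdf_file('corrupted.pdf') returns a small dictionary of booleans. A missing %%EOF or startxref marker usually points to a truncated file, such as an interrupted download, while a missing header suggests the file is not a PDF at all or has junk at the start.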



Step 3: Force Repair with pypdf


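pypdf’s strict parameter controls how it reacts to malformed data. With strict=True, problems raise exceptions; with strict=False, pypdf logs a warning and tries to recover, for example by rebuilding a damaged cross-reference table. Current pypdf releases already default to strict=False, so Step 2 may have benefited from it, but passing it explicitly documents the intent and covers older PyPDF2 releases where strict=True was the default: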


from pypdf import PdfReader

reader = PdfReader('corrupted.pdf', strict=False)
print('Loaded with strict=False')
print(f'Pages: {len(reader.pages)}')


If this succeeds, you can then write a new clean PDF using PdfWriter:


from pypdf import PdfWriter

# 'reader' is the PdfReader created in the previous snippet
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
with open('repaired.pdf', 'wb') as f:
    writer.write(f)
print('Written repaired.pdf')


This process often works for simple corruption. For more stubborn files, proceed to the next step.
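
When only a few pages are damaged, copying the pages one at a time lets you keep the readable ones instead of losing the whole file to a single exception. The following is a minimal sketch of that idea, not a pypdf feature: copy_readable_pages and salvage_pages are names invented here, and if the page tree itself is unreadable, len(reader.pages) will still fail.

```python
def copy_readable_pages(reader, writer):
    """Copy each page that can be read into writer; return the indexes that failed."""
    skipped = []
    for index in range(len(reader.pages)):
        try:
            writer.add_page(reader.pages[index])
        except Exception as exc:
            print(f"Skipping page {index + 1}: {exc}")
            skipped.append(index)
    return skipped


def salvage_pages(input_path, output_path):
    """Write a new PDF containing only the pages pypdf managed to read."""
    # Imported here so copy_readable_pages stays usable on its own
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(input_path, strict=False)
    writer = PdfWriter()
    skipped = copy_readable_pages(reader, writer)
    with open(output_path, "wb") as f:
        writer.write(f)
    return skipped
```

salvage_pages('corrupted.pdf', 'partial.pdf') would return the zero-based indexes of the pages it had to leave out, which tells you how much of the document survived.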



Step 4: Extract Text and Rebuild with PyMuPDF


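PyMuPDF (imported as fitz; recent releases also accept import pymupdf) is tolerant of damaged files. The underlying MuPDF library tries to repair the structure as it opens the document, so text can often be extracted even when pypdf gives up. If pypdf failed, try this approach: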


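import fitz  # PyMuPDF

doc = fitz.open('corrupted.pdf')  # MuPDF attempts a repair while opening
print(f'Pages: {doc.page_count}')

new_doc = fitz.open()  # empty document
for page_num in range(doc.page_count):
    page = doc[page_num]
    # Create a blank page with the same dimensions as the original
    new_page = new_doc.new_page(width=page.rect.width, height=page.rect.height)
    # Write the extracted text starting one inch from the top-left corner
    new_page.insert_text((72, 72), page.get_text())
new_doc.save('repaired_pymupdf.pdf')
new_doc.close()
doc.close()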


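This creates a brand-new PDF containing only the recoverable text. Formatting and images are lost, and because insert_text() does not wrap lines or flow text onto extra pages, long passages can run past the page edges. The content itself is still recovered by get_text(), so you can also write it to a plain text file if layout does not matter.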
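
If PyMuPDF can open the file at all, a less destructive option is to let it write the whole document back out, which keeps images and layout wherever they survived. Below is a hedged sketch of that approach using Document.save() with its garbage, deflate, and clean options. The function name resave_pdf and its opener parameter are assumptions made for this example; opener exists only so the function can be exercised without a real PDF.

```python
def resave_pdf(input_path, output_path, opener=None):
    """Open a PDF with PyMuPDF and write a cleaned copy; return the page count."""
    if opener is None:
        import fitz  # PyMuPDF; imported lazily so a different opener can be passed in
        opener = fitz.open
    doc = opener(input_path)
    try:
        page_count = doc.page_count
        # garbage=4 drops unused and duplicate objects, deflate compresses streams,
        # clean tidies content streams; the file is written from scratch
        doc.save(output_path, garbage=4, deflate=True, clean=True)
    finally:
        doc.close()
    return page_count
```

A call such as resave_pdf('corrupted.pdf', 'resaved.pdf') is worth trying before the text-only rebuild, since it only falls short when the page content itself is damaged.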



Step 5: Automate the Process with a Function


Wrap everything into a reusable function that tries pypdf first, then falls back to PyMuPDF:


from pypdf import PdfReader, PdfWriter
import fitz  # PyMuPDF


def repair_pdf(input_path, output_path):
    # First attempt: let pypdf re-read the file leniently and write a fresh copy
    try:
        reader = PdfReader(input_path, strict=False)
        writer = PdfWriter()
        for page in reader.pages:
            writer.add_page(page)
        with open(output_path, 'wb') as f:
            writer.write(f)
        print('Repaired with pypdf')
        return
    except Exception as e:
        # Report why pypdf gave up instead of silently swallowing the error
        print(f'pypdf could not repair the file: {e}')

    # Fallback: rebuild a text-only PDF with PyMuPDF
    try:
        doc = fitz.open(input_path)
        new_doc = fitz.open()
        for page in doc:
            new_page = new_doc.new_page(width=page.rect.width, height=page.rect.height)
            new_page.insert_text((72, 72), page.get_text())
        new_doc.save(output_path)
        new_doc.close()
        doc.close()
        print('Repaired with PyMuPDF (text only)')
    except Exception as e:
        print(f'Repair failed: {e}')


repair_pdf('corrupted.pdf', 'fixed.pdf')


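This script handles two common cases: structural damage that pypdf can work around, and files from which only the text can be salvaged. Test it on several corrupted files to see how far it gets with yours.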
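
To try a repair function on a whole folder of files, a small wrapper around pathlib is enough. This is a sketch under a few assumptions: repair_folder is a hypothetical helper, it accepts any function with the (input_path, output_path) signature used above, and it writes each result next to the original with a _fixed suffix.

```python
from pathlib import Path


def repair_folder(folder, repair_func, suffix="_fixed"):
    """Run repair_func(input_path, output_path) on every .pdf file in folder."""
    outputs = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        if pdf_path.stem.endswith(suffix):
            continue  # skip files produced by an earlier run
        out_path = pdf_path.with_name(f"{pdf_path.stem}{suffix}.pdf")
        repair_func(str(pdf_path), str(out_path))
        outputs.append(out_path)
    return outputs
```

With the function from this step, the call would look like repair_folder('damaged_pdfs', repair_pdf), where damaged_pdfs is a placeholder folder name. The originals are never overwritten, so a failed repair cannot make things worse.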

Common Pitfalls


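  • **Missing dependencies** – If pypdf or PyMuPDF is not installed, the script will raise ImportError. Install both with pip, ideally inside a virtual environment so they do not clash with other projects.
  • **Output file not opening** – If the corruption is severe, the repaired PDF may still be broken. Open the output and check its page count before relying on it; if it still fails, try the quick pdf repair methods online or a different tool.
  • **Text extraction loses images** – PyMuPDF’s get_text() retrieves only text. Embedded images can often be pulled out separately with PyMuPDF’s get_images() and extract_image(); pdfminer.six is an alternative extractor, and OCR can recover text from pages that are only scanned images. A complete PDF corruption fix usually means combining several tools.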
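
For the image pitfall in particular, PyMuPDF can list the images referenced by each page and hand back their raw bytes. The sketch below assumes doc is a document already opened with fitz.open(); save_page_images and the output file naming are made up for this example, and a badly damaged image may still fail to extract.

```python
from pathlib import Path


def save_page_images(doc, out_dir):
    """Write every embedded image in an open PyMuPDF document to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for page_index, page in enumerate(doc):
        for img_index, img in enumerate(page.get_images(full=True)):
            xref = img[0]  # first item of each entry is the image's xref number
            info = doc.extract_image(xref)  # dict with the raw bytes and a file extension
            target = out_dir / f"page{page_index + 1}_img{img_index + 1}.{info['ext']}"
            target.write_bytes(info["image"])
            saved.append(target)
    return saved
```

The function returns the paths it wrote, so you can check at a glance how many images were recovered from each page.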

Where to Next


You’ve now built a Python PDF repair tool. To go further, explore advanced recovery techniques such as parsing raw PDF streams, or, if you just need a quick fix from an online service, check out our guide to repair pdf instantly. You can also learn how to handle bulk repairs with batch processing. Happy coding!
