How to Use OCR to Extract Text from PDF Images with Python

Learn how to use pdfRest OCR PDF and Extract Text API Tools with Python to extract all text from a PDF

Why Use OCR to Extract Text from PDF with Python?

The pdfRest OCR PDF API Tool uses Optical Character Recognition (OCR) to convert scanned documents into PDFs with searchable, extractable text. This tutorial shows how to chain it with the Extract Text API Tool in Python, so that one script can pull both machine-readable and image-based text out of a PDF.

Imagine you have a large number of scanned documents, such as invoices or historical records, and you need the text they contain. OCR first makes the text in the scanned images machine-readable, and the extraction step then returns that text, so both stages can run back to back in a single automated workflow.

PDF OCR Text Extraction with Python Code Example

from requests_toolbelt import MultipartEncoder
import requests

api_key = 'xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here

ocr_endpoint_url = 'https://api.pdfrest.com/pdf-with-ocr-text'
mp_encoder_pdf = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file.pdf', 'rb'), 'application/pdf'),
        'output': 'example_pdf-with-ocr-text_out',
    }
)

image_headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_pdf.content_type,
    'Api-Key': api_key
}

print("Sending POST request to OCR endpoint...")
response = requests.post(ocr_endpoint_url, data=mp_encoder_pdf, headers=image_headers)

print("Response status code: " + str(response.status_code))

if response.ok:
    response_json = response.json()
    ocr_pdf_id = response_json["outputId"]
    print("Got the output ID: " + ocr_pdf_id)

    extract_endpoint_url = 'https://api.pdfrest.com/extracted-text'

    mp_encoder_extract_text = MultipartEncoder(
        fields={
            'id': ocr_pdf_id
        }
    )

    extract_text_headers = {
        'Accept': 'application/json',
        'Content-Type': mp_encoder_extract_text.content_type,
        'Api-Key': api_key
    }

    print("Sending POST request to extract text endpoint...")
    extract_response = requests.post(extract_endpoint_url, data=mp_encoder_extract_text, headers=extract_text_headers)

    print("Response status code: " + str(extract_response.status_code))

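    if extract_response.ok:
        extract_json = extract_response.json()
        print(extract_json["fullText"])
    else:
        # Extraction failed: show the error body returned by the API
        print(extract_response.text)
else:
    # OCR request failed: show the error body returned by the API
    print(response.text)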

Source: GitHub Repository

Breaking Down the Code

The provided code demonstrates how to use the pdfRest OCR PDF API Tool to convert a scanned document into a PDF with searchable text and then extract that text. Here is a detailed breakdown of how the code works:

from requests_toolbelt import MultipartEncoder
import requests

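These lines import the necessary libraries. requests_toolbelt provides MultipartEncoder, which builds multipart form data, and requests sends the HTTP requests. Both are third-party packages and can be installed with pip install requests requests-toolbelt.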

api_key = 'xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' # place your api key here

Replace the placeholder with your actual API key from pdfRest.
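To keep the key out of the source file, one option is to read it from an environment variable. The sketch below is an illustration rather than part of the pdfRest sample; the function name load_api_key and the variable name PDFREST_API_KEY are assumptions chosen for this example.

```python
import os


def load_api_key(env_var="PDFREST_API_KEY"):
    """Return the API key stored in an environment variable.

    The default variable name is a hypothetical choice for this example,
    not something pdfRest requires.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your pdfRest API key"
        )
    return key
```

With the variable set in your shell, api_key = load_api_key() can replace the hard-coded placeholder.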

ocr_endpoint_url = 'https://api.pdfrest.com/pdf-with-ocr-text'
mp_encoder_pdf = MultipartEncoder(
    fields={
        'file': ('file_name.pdf', open('/path/to/file.pdf', 'rb'), 'application/pdf'),
        'output': 'example_pdf-with-ocr-text_out',
    }
)

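This sets the OCR endpoint URL and prepares the multipart form data. The fields dictionary holds the PDF to upload, given as a (file name, file object, MIME type) tuple, and an output value used to name the processed file. Replace /path/to/file.pdf with the path to your own scanned PDF.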

image_headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_pdf.content_type,
    'Api-Key': api_key
}

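These headers ask the API to respond with JSON (Accept), declare the request body as multipart form data (Content-Type, taken from the encoder so that it carries the matching boundary string), and authenticate the request (Api-Key). Despite its name, image_headers is used here for a PDF upload.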
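Because the tutorial builds the same three headers for both requests, they could be produced by a small helper. This is a minimal sketch; build_headers is a hypothetical name introduced for illustration and does not appear in the pdfRest sample.

```python
def build_headers(api_key, content_type):
    """Build the headers used by both pdfRest requests in this tutorial."""
    return {
        "Accept": "application/json",   # ask for a JSON response
        "Content-Type": content_type,   # pass encoder.content_type here
        "Api-Key": api_key,             # authenticates the request
    }
```

For example, build_headers(api_key, mp_encoder_pdf.content_type) yields the same dictionary as image_headers.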

print("Sending POST request to OCR endpoint...")
response = requests.post(ocr_endpoint_url, data=mp_encoder_pdf, headers=image_headers)

This sends a POST request to the OCR endpoint with the prepared data and headers.
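The upload steps above leave the file handle open and set no timeout. As a hedged variation, they could be wrapped in a function that closes the file with a with-block. The function name run_ocr and the 300-second timeout are assumptions made for this sketch, and the imports are placed inside the function only so that the definition loads even where the two packages are not installed.

```python
from pathlib import Path

OCR_ENDPOINT_URL = "https://api.pdfrest.com/pdf-with-ocr-text"


def run_ocr(pdf_path, api_key, output_name="example_pdf-with-ocr-text_out",
            timeout=300):
    """Upload one PDF to the OCR endpoint and return the requests.Response.

    The timeout value (in seconds) is an arbitrary choice for this example.
    """
    # Imported here so that defining the function does not need the packages
    import requests
    from requests_toolbelt import MultipartEncoder

    pdf_path = Path(pdf_path)
    # The with-block closes the file once the upload has finished
    with pdf_path.open("rb") as pdf_file:
        encoder = MultipartEncoder(
            fields={
                "file": (pdf_path.name, pdf_file, "application/pdf"),
                "output": output_name,
            }
        )
        headers = {
            "Accept": "application/json",
            "Content-Type": encoder.content_type,
            "Api-Key": api_key,
        }
        # requests reads the encoder (and so the file) while sending,
        # which is why the call stays inside the with-block
        return requests.post(
            OCR_ENDPOINT_URL, data=encoder, headers=headers, timeout=timeout
        )
```

The returned object can be checked with response.ok exactly as in the tutorial code.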

if response.ok:
    response_json = response.json()
    ocr_pdf_id = response_json["outputId"]
    print("Got the output ID: " + ocr_pdf_id)

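If the request succeeds (response.ok is true for status codes below 400), the code parses the JSON body and reads outputId, the ID that identifies the OCR-processed PDF on the pdfRest server.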
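A successful status code does not by itself guarantee that the key is present, so a slightly more defensive read is possible. The helper below is a sketch with a hypothetical name, get_output_id; it assumes only what the tutorial code shows, namely that outputId is a string in the response JSON.

```python
def get_output_id(response_json):
    """Return the outputId from an OCR response, or raise a clear error."""
    output_id = response_json.get("outputId")
    if not isinstance(output_id, str) or not output_id:
        raise ValueError(f"No outputId in response: {response_json!r}")
    return output_id
```

Calling get_output_id(response.json()) then replaces the direct dictionary lookup.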

extract_endpoint_url = 'https://api.pdfrest.com/extracted-text'
mp_encoder_extract_text = MultipartEncoder(
    fields={
        'id': ocr_pdf_id
    }
)

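This sets the extraction endpoint URL and prepares the multipart form data. Passing the output ID in the id field points the Extract Text tool at the OCR-processed PDF from the previous step, so the file does not need to be uploaded again. In the full listing, this step and the ones that follow are indented inside the if response.ok: block.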

extract_text_headers = {
    'Accept': 'application/json',
    'Content-Type': mp_encoder_extract_text.content_type,
    'Api-Key': api_key
}

These headers specify the content type, accept type, and API key for the extraction request.

print("Sending POST request to extract text endpoint...")
extract_response = requests.post(extract_endpoint_url, data=mp_encoder_extract_text, headers=extract_text_headers)

This sends a POST request to the extraction endpoint with the prepared data and headers.

if extract_response.ok:
    extract_json = extract_response.json()
    print(extract_json["fullText"])

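If the extraction request succeeds, the code prints the fullText value from the response JSON, which holds the extracted text. In the full listing, each if has a matching else branch that prints the raw response body, which helps when diagnosing a failed request.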
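Printing works for a quick check, but for a batch of documents the text would more likely be written to disk. The sketch below assumes, as the print call above implies, that fullText is a plain string; save_full_text and the destination path are hypothetical names for this example.

```python
from pathlib import Path


def save_full_text(extract_json, destination):
    """Write the extracted text to a UTF-8 file and return its character count."""
    text = extract_json["fullText"]
    Path(destination).write_text(text, encoding="utf-8")
    return len(text)
```

For example, save_full_text(extract_json, "invoice_0001.txt") stores the text next to the script.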

Beyond the Tutorial

In this tutorial, you learned how to use the pdfRest OCR PDF API Tool to convert a scanned document into a searchable PDF and then extract its text using Python. The same two-request pattern, an OCR upload followed by extraction by ID, can be applied to any scanned PDF in your document management and data extraction workflows.

To explore more functionalities, you can demo all of the pdfRest API Tools in the API Lab. For detailed information on each endpoint and parameter, refer to the API Reference Guide.
