How to Use OCR to Extract Text from PDF Images in .NET with C#

Learn how to extract all text from a PDF using the pdfRest OCR PDF and Extract Text API Tools with C#.

Why Use OCR to Extract Text from PDF with C#?

The pdfRest OCR PDF API Tool converts scanned documents and images into searchable and extractable text. This tutorial shows you how to send API calls to the OCR PDF and Extract Text API Tools with C#, so you can integrate OCR functionality into your own C# applications.

Imagine you have a large collection of scanned documents that you need to search through for specific information. Manually transcribing these documents would be time-consuming and error-prone. By using the OCR PDF and Extract Text API Tools, you can automate the process of extracting text from these documents, making them searchable and editable, thus saving time and improving accuracy.

PDF OCR Text Extraction with C# Code Example

using Newtonsoft.Json.Linq;
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class OcrWithExtractText
{
    private static readonly string apiKey = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"; // Your API key here

    static async Task Main(string[] args)
    {
        using (var httpClient = new HttpClient { BaseAddress = new Uri("https://api.pdfrest.com") })
        {
            // Upload PDF for OCR
            using var ocrRequest = new HttpRequestMessage(HttpMethod.Post, "pdf-with-ocr-text");

            ocrRequest.Headers.TryAddWithoutValidation("Api-Key", apiKey);
            ocrRequest.Headers.Accept.Add(new System.Net.Http.Headers.MediaTypeWithQualityHeaderValue("application/json"));
            var ocrMultipartContent = new MultipartFormDataContent();

            var pdfByteArray = File.ReadAllBytes("/path/to/file.pdf");
            var pdfByteArrayContent = new ByteArrayContent(pdfByteArray);
            ocrMultipartContent.Add(pdfByteArrayContent, "file", "file.pdf");
            pdfByteArrayContent.Headers.TryAddWithoutValidation("Content-Type", "application/pdf");

            ocrRequest.Content = ocrMultipartContent;
            var ocrResponse = await httpClient.SendAsync(ocrRequest);

            var ocrResult = await ocrResponse.Content.ReadAsStringAsync();
            Console.WriteLine("OCR response received.");
            Console.WriteLine(ocrResult);

            dynamic ocrResponseData = JObject.Parse(ocrResult);
            string ocrPDFID = ocrResponseData.outputId;

            // Extract text from OCR'd PDF
            using var extractTextRequest = new HttpRequestMessage(HttpMethod.Post, "extracted-text");

            extractTextRequest.Headers.TryAddWithoutValidation("Api-Key", apiKey);
            extractTextRequest.Headers.Accept.Add(new System.Net.Http.Headers.MediaTypeWithQualityHeaderValue("application/json"));
            var extractTextMultipartContent = new MultipartFormDataContent();

            var byteArrayOption = new ByteArrayContent(Encoding.UTF8.GetBytes(ocrPDFID));
            extractTextMultipartContent.Add(byteArrayOption, "id");

            extractTextRequest.Content = extractTextMultipartContent;
            var extractTextResponse = await httpClient.SendAsync(extractTextRequest);

            var extractTextResult = await extractTextResponse.Content.ReadAsStringAsync();
            Console.WriteLine("Extract text response received.");
            Console.WriteLine(extractTextResult);

            dynamic extractTextResponseData = JObject.Parse(extractTextResult);
            string fullText = extractTextResponseData.fullText;

            Console.WriteLine("Extracted text:");
            Console.WriteLine(fullText);
        }
    }
}

Source: GitHub

Breaking Down the Code

Let's break down the provided code to understand how it works:

private static readonly string apiKey = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"; // Your API key here

This line defines the API key used to authenticate your requests to the pdfRest API. Replace the placeholder value with your own key.
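Hard-coding the key is convenient for a demo, but a key kept out of source code is harder to leak. As a minimal sketch, the key could instead be read from an environment variable. The variable name PDFREST_API_KEY and the ApiKeyConfig class are hypothetical names chosen for this example; they are not part of pdfRest.

```csharp
using System;

static class ApiKeyConfig
{
    // Assumed variable name for this sketch; use whatever name fits your setup.
    public const string EnvironmentVariableName = "PDFREST_API_KEY";

    // Returns the API key from the environment, or throws with a clear message
    // when the variable is missing or blank.
    public static string Load()
    {
        var key = Environment.GetEnvironmentVariable(EnvironmentVariableName);
        if (string.IsNullOrWhiteSpace(key))
        {
            throw new InvalidOperationException(
                $"Set the {EnvironmentVariableName} environment variable to your pdfRest API key.");
        }
        return key.Trim();
    }
}
```

The apiKey field in the tutorial could then be initialized with ApiKeyConfig.Load() instead of a string literal.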

using (var httpClient = new HttpClient { BaseAddress = new Uri("https://api.pdfrest.com") })

Here, we create an instance of HttpClient with the base address set to the pdfRest API URL, so the endpoint names used in the later requests are resolved relative to it.

using var ocrRequest = new HttpRequestMessage(HttpMethod.Post, "pdf-with-ocr-text");

This line initializes a new HTTP POST request to the pdf-with-ocr-text endpoint.

ocrRequest.Headers.TryAddWithoutValidation("Api-Key", apiKey);
ocrRequest.Headers.Accept.Add(new System.Net.Http.Headers.MediaTypeWithQualityHeaderValue("application/json"));

These lines add the necessary headers to the request, including the API key and the expected response format (JSON).
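Both requests in this tutorial carry the same two headers, so one option is to set them once on the client through HttpClient.DefaultRequestHeaders. The sketch below assumes that approach suits your application; PdfRestClientFactory is a hypothetical helper name, not something pdfRest provides.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;

static class PdfRestClientFactory
{
    // Builds an HttpClient that sends the Api-Key and Accept headers
    // with every request made through it.
    public static HttpClient Create(string apiKey)
    {
        var client = new HttpClient { BaseAddress = new Uri("https://api.pdfrest.com") };
        client.DefaultRequestHeaders.TryAddWithoutValidation("Api-Key", apiKey);
        client.DefaultRequestHeaders.Accept.Add(
            new MediaTypeWithQualityHeaderValue("application/json"));
        return client;
    }
}
```

With a client built this way, the per-request header lines become unnecessary, and the same instance can be reused for the OCR and Extract Text calls.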

var pdfByteArray = File.ReadAllBytes("/path/to/file.pdf");
var pdfByteArrayContent = new ByteArrayContent(pdfByteArray);
ocrMultipartContent.Add(pdfByteArrayContent, "file", "file.pdf");
pdfByteArrayContent.Headers.TryAddWithoutValidation("Content-Type", "application/pdf");

Here, we read the PDF file into a byte array and add it to the request content as multipart form data under the field name file. The content type of that part is set to application/pdf.
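Reading the whole file into memory is fine for small PDFs. For large scans, a possible variation is to wrap a stream in StreamContent so the file is read as the request is sent. This is a sketch of that alternative, not the tutorial's method; PdfUpload and BuildFileForm are hypothetical names.

```csharp
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;

static class PdfUpload
{
    // Wraps a PDF stream in a multipart form under the "file" field,
    // mirroring the field name and content type used in the tutorial.
    public static MultipartFormDataContent BuildFileForm(Stream pdfStream, string fileName)
    {
        var fileContent = new StreamContent(pdfStream);
        fileContent.Headers.ContentType = new MediaTypeHeaderValue("application/pdf");

        var form = new MultipartFormDataContent();
        form.Add(fileContent, "file", fileName);
        return form;
    }
}
```

A caller might pass File.OpenRead("/path/to/file.pdf") as the stream and assign the returned form to ocrRequest.Content. Disposing the request also disposes the form and the underlying stream.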

var ocrResponse = await httpClient.SendAsync(ocrRequest);
var ocrResult = await ocrResponse.Content.ReadAsStringAsync();

We send the OCR request and read the response content as a string.
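The tutorial code reads the body without looking at the HTTP status, so a failed call (for example, one rejected because of a bad key) would only show up when the JSON is parsed. A small helper like the following could make such failures explicit. ResponseReader is a hypothetical name, and the error message format is an assumption of this sketch.

```csharp
using System.Net.Http;
using System.Threading.Tasks;

static class ResponseReader
{
    // Reads the response body and throws if the status code is not 2xx,
    // including the body in the exception message to aid debugging.
    public static async Task<string> ReadBodyOrThrowAsync(HttpResponseMessage response)
    {
        var body = await response.Content.ReadAsStringAsync();
        if (!response.IsSuccessStatusCode)
        {
            throw new HttpRequestException(
                $"Request failed with status {(int)response.StatusCode}: {body}");
        }
        return body;
    }
}
```

In the tutorial code, the call would look like var ocrResult = await ResponseReader.ReadBodyOrThrowAsync(ocrResponse); and the same helper applies to the Extract Text response.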

dynamic ocrResponseData = JObject.Parse(ocrResult);
string ocrPDFID = ocrResponseData.outputId;

The response is parsed as a JSON object, and the output ID of the OCR'd PDF is extracted.
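If the response body does not contain the expected field, for instance because it is an error message, the problem tends to surface later and away from its cause. As a hedged alternative to dynamic access, the field can be read with Newtonsoft.Json's typed Value<T> accessor and checked immediately. PdfRestJson and GetRequiredString are hypothetical names introduced for this sketch.

```csharp
using System;
using Newtonsoft.Json.Linq;

static class PdfRestJson
{
    // Parses a JSON object and returns the named string field,
    // throwing a descriptive error when the field is absent or empty.
    public static string GetRequiredString(string json, string fieldName)
    {
        var obj = JObject.Parse(json);
        var value = obj.Value<string>(fieldName);
        if (string.IsNullOrEmpty(value))
        {
            throw new InvalidOperationException(
                $"Response did not contain '{fieldName}': {json}");
        }
        return value;
    }
}
```

For the OCR step this would be called as PdfRestJson.GetRequiredString(ocrResult, "outputId"), and for the extraction step with "fullText".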

using var extractTextRequest = new HttpRequestMessage(HttpMethod.Post, "extracted-text");

We initialize a new HTTP POST request to the extracted-text endpoint for extracting text from the OCR'd PDF.

var byteArrayOption = new ByteArrayContent(Encoding.UTF8.GetBytes(ocrPDFID));
extractTextMultipartContent.Add(byteArrayOption, "id");

The OCR'd PDF ID is added to the request content as multipart form data under the field name id.

var extractTextResponse = await httpClient.SendAsync(extractTextRequest);
var extractTextResult = await extractTextResponse.Content.ReadAsStringAsync();

We send the extract text request and read the response content as a string.

dynamic extractTextResponseData = JObject.Parse(extractTextResult);
string fullText = extractTextResponseData.fullText;

The response is parsed as a JSON object, and the extracted text is retrieved from its fullText field.

Beyond the Tutorial

In this tutorial, we demonstrated how to use the pdfRest OCR PDF API tool with the Extract Text API tool to extract image-based text from a PDF using C#. By following these steps, you can integrate OCR functionality into your own C# applications.

To explore more functionalities, you can demo all of the pdfRest API Tools in the API Lab. For detailed information, refer to the API Reference Guide.
