PDF OCR

Extract text from PDF files using OCR (Optical Character Recognition)

Overview

This node performs Optical Character Recognition (OCR) on PDF files provided as binary data. It extracts text content from each page of the PDF by rendering pages into images and then applying OCR to recognize the text. This is useful when you need to convert scanned PDFs or image-based PDFs into searchable and editable text.

Common scenarios include:

  • Extracting text from scanned invoices, receipts, or contracts.
  • Automating data entry by converting PDF documents into machine-readable text.
  • Processing multi-language PDF documents for text analysis or translation.

For example, a user can input a PDF containing scanned pages in English, select English as the OCR language, and receive the extracted text either combined from all pages or separated per page.
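The render-then-recognize flow described above can be sketched as follows. This is a minimal illustration, not the node's actual source: `RenderPage`, `RecognizeText`, and `ocrPdf` are hypothetical names, and the two steps are injected as functions so the sketch runs without pdfjs-dist or tesseract.js installed.

```typescript
// Hypothetical step signatures. In the real node these would be backed by
// pdfjs-dist (page rendering) and tesseract.js (text recognition).
type RenderPage = (pdf: Uint8Array, pageNumber: number, scale: number) => Promise<Uint8Array>;
type RecognizeText = (image: Uint8Array, language: string) => Promise<string>;

interface PageText {
  pageNumber: number;
  text: string;
}

// Render each page to an image, then run OCR on that image, one page at a time.
async function ocrPdf(
  pdf: Uint8Array,
  totalPages: number,
  language: string,
  scale: number,
  renderPage: RenderPage,
  recognizeText: RecognizeText,
): Promise<PageText[]> {
  const pages: PageText[] = [];
  for (let pageNumber = 1; pageNumber <= totalPages; pageNumber++) {
    const image = await renderPage(pdf, pageNumber, scale);
    const text = await recognizeText(image, language);
    pages.push({ pageNumber, text });
  }
  return pages;
}

// Stub implementations standing in for the real libraries.
const stubRender: RenderPage = async (_pdf, pageNumber) => new Uint8Array([pageNumber]);
const stubRecognize: RecognizeText = async (image, language) =>
  `page ${image[0]} (${language})`;

ocrPdf(new Uint8Array(), 2, "eng", 2, stubRender, stubRecognize).then((pages) => {
  console.log(pages);
});
```

Processing pages sequentially keeps only one rendered image in memory at a time, which matters for large documents rendered at high scale factors.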

Properties

The node exposes the following properties:

  • Input Binary Property:
    Name of the binary property that contains the PDF file to be processed.

  • Language:
    Language to use for OCR text recognition. Options: Chinese Simplified, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish.

  • Output Format:
    How to format the extracted text output. Options: Combined Text (all pages concatenated), Per Page (array of page texts), Detailed (includes combined text, per page array, and metadata).

  • Scale:
    Scale factor for rendering PDF pages before OCR. Higher values improve quality but slow processing.
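As an illustration of how the Language options might translate into the language code reported in the output, the sketch below maps each option to the standard Tesseract language code. The mapping itself is an assumption (the node may use different codes), and `LANGUAGE_CODES`, `toLanguageCode`, and `isOcrLanguage` are hypothetical names.

```typescript
// Assumed mapping from the node's Language options to Tesseract language codes.
const LANGUAGE_CODES = {
  "Chinese Simplified": "chi_sim",
  English: "eng",
  French: "fra",
  German: "deu",
  Italian: "ita",
  Japanese: "jpn",
  Korean: "kor",
  Portuguese: "por",
  Russian: "rus",
  Spanish: "spa",
} as const;

type OcrLanguage = keyof typeof LANGUAGE_CODES;

// Look up the code for a known language option.
function toLanguageCode(language: OcrLanguage): string {
  return LANGUAGE_CODES[language];
}

// Type guard for validating free-form input before the lookup.
function isOcrLanguage(value: string): value is OcrLanguage {
  return Object.prototype.hasOwnProperty.call(LANGUAGE_CODES, value);
}

console.log(toLanguageCode("German"));
console.log(isOcrLanguage("Klingon"));
```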

Output

The node outputs JSON data with the recognized text and related information. The structure depends on the selected output format:

  • Combined Text:

    {
      "text": "Full extracted text from all pages concatenated",
      "totalPages": <number_of_pages>,
      "language": "<selected_language_code>"
    }
    
  • Per Page:

    {
      "pages": [
        { "pageNumber": 1, "text": "Text from page 1" },
        { "pageNumber": 2, "text": "Text from page 2" },
        ...
      ],
      "totalPages": <number_of_pages>,
      "language": "<selected_language_code>"
    }
    
  • Detailed:

    {
      "text": "Full extracted text from all pages concatenated",
      "pages": [
        { "pageNumber": 1, "text": "Text from page 1" },
        { "pageNumber": 2, "text": "Text from page 2" },
        ...
      ],
      "metadata": {
        "totalPages": <number_of_pages>,
        "language": "<selected_language_code>",
        "scale": <scale_factor_used>
      }
    }
    

Any binary data on the input item is passed through unchanged alongside the JSON output.

The node does not generate new binary data; it reads the PDF from the binary input and returns the recognized text as JSON.
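The three output shapes can be derived from the same list of per-page results. The sketch below shows one way to do that. The format identifiers (`"combined"`, `"perPage"`, `"detailed"`), the function name `formatOutput`, and the blank-line separator used to join pages are assumptions made for illustration; the node may join pages differently.

```typescript
type OutputFormat = "combined" | "perPage" | "detailed";

interface OcrPage {
  pageNumber: number;
  text: string;
}

// Build the JSON payload for the selected output format.
function formatOutput(
  pages: OcrPage[],
  format: OutputFormat,
  language: string,
  scale: number,
): Record<string, unknown> {
  const totalPages = pages.length;
  // Assumption: page texts are joined with a blank line between pages.
  const text = pages.map((page) => page.text).join("\n\n");

  if (format === "combined") {
    return { text, totalPages, language };
  }
  if (format === "perPage") {
    return { pages, totalPages, language };
  }
  // "detailed": combined text, per-page array, and metadata.
  return { text, pages, metadata: { totalPages, language, scale } };
}

const samplePages: OcrPage[] = [
  { pageNumber: 1, text: "Text from page 1" },
  { pageNumber: 2, text: "Text from page 2" },
];

console.log(JSON.stringify(formatOutput(samplePages, "detailed", "eng", 2), null, 2));
```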

Dependencies

  • Uses the pdfjs-dist library to parse and render PDF pages.
  • Uses tesseract.js for performing OCR on rendered page images.
  • Requires an environment capable of running these libraries (Node.js environment with canvas support).
  • No external API keys are needed since OCR is performed locally via tesseract.js.

Troubleshooting

  • No binary data found for property "X":
    Ensure the input item contains binary data under the specified property name. Check that the previous node outputs the PDF file correctly.

  • OCR quality issues or missing text:
    Increase the Scale property so that pages are rendered at a higher resolution before OCR. Expect slower processing as a result.

  • Unsupported PDF or corrupted file errors:
    Verify the PDF file integrity and compatibility. Some PDFs with unusual encoding or encryption may fail to render.

  • Language not supported or incorrect text recognition:
    Check that the Language property matches the language of the document. A mismatched language reduces OCR accuracy.

  • Performance considerations:
    Large PDFs with many pages and high scale factors will increase processing time and memory usage.
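To get a feel for the trade-off between scale and resource usage, the rough estimate below computes the rendered bitmap size for a single page. It assumes page dimensions in PDF points (1/72 inch), a render size of points times scale, and an uncompressed RGBA bitmap at 4 bytes per pixel. `estimateRender` is a hypothetical helper, and actual memory use in the node will differ.

```typescript
interface RenderEstimate {
  widthPx: number;
  heightPx: number;
  bytes: number;
}

// Estimate the bitmap size of one rendered page.
// Assumes 4 bytes per pixel (RGBA) and rounds pixel dimensions up.
function estimateRender(widthPt: number, heightPt: number, scale: number): RenderEstimate {
  const widthPx = Math.ceil(widthPt * scale);
  const heightPx = Math.ceil(heightPt * scale);
  return { widthPx, heightPx, bytes: widthPx * heightPx * 4 };
}

// US Letter page: 612 x 792 points.
for (const scale of [1, 2, 3]) {
  const { widthPx, heightPx, bytes } = estimateRender(612, 792, scale);
  const mib = (bytes / 1024 / 1024).toFixed(1);
  console.log(`scale ${scale}: ${widthPx}x${heightPx} px, ~${mib} MiB per page`);
}
```

Because both dimensions grow with the scale factor, doubling the scale roughly quadruples the pixel count per page, so small increases are usually worth trying first.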
