PDF & Excel Processor

Process PDF and Excel files

Overview

The PDF & Excel Processor node extracts text from PDF files using OCR (Optical Character Recognition). This is particularly useful for scanned or image-based PDFs, where standard text extraction fails. The node can also process Excel files, but this page covers only the PDF resource with the OCR text extraction operation.

Common scenarios:

  • Extracting searchable text from scanned documents for archiving or further automation.
  • Converting invoices, contracts, or forms in image-based PDFs into machine-readable text.
  • Automating data entry by extracting content from non-editable PDF files.

Practical example:
A user uploads a scanned contract as a PDF. This node extracts the text using OCR, making it available for downstream processing such as keyword search, compliance checks, or database storage.

Properties

Name | Type | Meaning
File Type | options | Selects the type of file to process. For this operation, should be set to "PDF".
Binary Property | string | The name of the binary property containing the file data (e.g., "data").
OCR Language | options | Specifies the language to use for OCR, improving recognition accuracy for that language.
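
As a rough illustration of how these three properties fit together, the sketch below checks a parameter set before a run. The identifiers `fileType`, `binaryPropertyName`, and `ocrLanguage` are hypothetical stand-ins for the node's internal parameter names, which this page does not list, and the language value "eng" is an assumed example rather than a documented option.

```typescript
// Hypothetical parameter names; the node's internal identifiers are not documented here.
interface OcrParams {
  fileType: string;           // "File Type" property
  binaryPropertyName: string; // "Binary Property" property
  ocrLanguage: string;        // "OCR Language" property
}

// Returns a list of human-readable problems; an empty list means the
// parameters look consistent with the OCR text extraction operation.
function checkOcrParams(params: OcrParams): string[] {
  const problems: string[] = [];
  if (params.fileType !== "PDF") {
    problems.push('File Type should be "PDF" for OCR extraction');
  }
  if (params.binaryPropertyName.trim() === "") {
    problems.push("Binary Property must not be empty");
  }
  if (params.ocrLanguage.trim() === "") {
    problems.push("OCR Language must be selected");
  }
  return problems;
}

// Assumed example values: "data" comes from the table above, "eng" is a guess.
const exampleParams: OcrParams = {
  fileType: "PDF",
  binaryPropertyName: "data",
  ocrLanguage: "eng",
};
```

Calling `checkOcrParams(exampleParams)` returns an empty list, while a parameter set with File Type set to anything other than "PDF" yields one problem entry.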

Output

The node outputs an object with the following structure in the json field:

{
  "pdfResults": {
    // ...extracted data from the processor,
    "operation": "extractTextWithOCR",
    "success": true,
    "timestamp": "2024-06-01T12:34:56.789Z"
  }
}
  • pdfResults: Contains the results of the OCR extraction, including:
    • The extracted text and any additional metadata provided by the processor.
    • The operation performed ("extractTextWithOCR").
    • A success flag (true if successful).
    • A timestamp indicating when the operation was performed.
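
The structure above can be described with a small TypeScript type, which may help when consuming the output in a downstream code step. This is a sketch only: the names `PdfResults`, `NodeOutput`, and `isSuccessfulOcr` are made up for illustration, and the fields carrying the extracted text are left open because this page does not name them.

```typescript
// Illustrative types based on the documented output; not taken from the node's source.
interface PdfResults {
  operation: string;
  success: boolean;
  timestamp: string;
  // Extracted text and metadata; the exact field names are not documented here.
  [key: string]: unknown;
}

interface NodeOutput {
  pdfResults?: PdfResults;
}

// True when the output reports a successful OCR text extraction.
function isSuccessfulOcr(output: NodeOutput): boolean {
  const results = output.pdfResults;
  return (
    results !== undefined &&
    results.operation === "extractTextWithOCR" &&
    results.success === true
  );
}

// Sample mirroring the documented structure.
const sampleOutput: NodeOutput = {
  pdfResults: {
    operation: "extractTextWithOCR",
    success: true,
    timestamp: "2024-06-01T12:34:56.789Z",
  },
};
```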

If an error occurs and "Continue On Fail" is enabled, the output will include an error field with the error message.
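
When "Continue On Fail" is enabled, a downstream step may want to separate failed items from successful ones. The sketch below assumes the error message sits in a top-level `error` field of each item's json object; the exact placement is not specified on this page, and `OutputItem` and `splitByOutcome` are hypothetical names.

```typescript
// Assumed item shape: the error field is taken to be a string at the top level of json.
interface OutputItem {
  json: {
    pdfResults?: { success?: boolean };
    error?: string;
  };
}

// Partitions items into those that carry an error message and those that do not.
function splitByOutcome(items: OutputItem[]): { ok: OutputItem[]; failed: OutputItem[] } {
  const ok: OutputItem[] = [];
  const failed: OutputItem[] = [];
  for (const item of items) {
    if (typeof item.json.error === "string") {
      failed.push(item);
    } else {
      ok.push(item);
    }
  }
  return { ok, failed };
}
```

Failed items could then be routed to a notification or retry branch while successful items continue through the workflow.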

Binary data is preserved and passed through unchanged.

Dependencies

  • External Libraries/Services: The node relies on internal processor modules (e.g., ProcessorFactory) for handling PDF and OCR operations.
  • No API keys or environment variables appear to be required for this operation, based on static analysis of the node.

Troubleshooting

Common issues:

  • Missing binary data: If the input item does not contain binary data, the node will throw an error:
    "No binary data found"
  • Incorrect binary property name: If the specified binary property does not exist, you will see:
    "Binary property '<name>' not found"
  • Corrupted or missing data: If the binary data is invalid or missing, the error will be:
    "Binary data in property '<name>' is invalid or missing data content"
  • Buffer creation failure: If the binary data cannot be converted to a buffer, the error will mention:
    "Failed to create buffer from binary data: <error message>"

How to resolve:

  • Ensure the input item contains valid binary data under the correct property name.
  • Double-check the "Binary Property" value matches the property holding your file data.
  • Verify the uploaded file is a valid PDF and not corrupted.
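
To make the order of the four errors under "Common issues" easier to follow, here is a minimal sketch of checks that would produce the same messages. It is not the node's actual implementation: the `InputItem` and `BinaryEntry` shapes are assumptions modeled on items that carry base64-encoded binary data, and the decoder is passed in so the sketch does not depend on a particular runtime.

```typescript
// Hypothetical shapes for an input item carrying base64-encoded binary data.
interface BinaryEntry {
  data?: string; // base64-encoded file content
  mimeType?: string;
}

interface InputItem {
  json: Record<string, unknown>;
  binary?: Record<string, BinaryEntry>;
}

// The decoder is injected; in Node.js it could be
// (b64) => Buffer.from(b64, "base64").
type Decoder = (base64: string) => Uint8Array;

// Runs the checks in the same order as the documented error messages.
function readBinary(item: InputItem, propertyName: string, decode: Decoder): Uint8Array {
  if (!item.binary) {
    throw new Error("No binary data found");
  }
  const entry = item.binary[propertyName];
  if (!entry) {
    throw new Error(`Binary property '${propertyName}' not found`);
  }
  if (!entry.data) {
    throw new Error(
      `Binary data in property '${propertyName}' is invalid or missing data content`,
    );
  }
  try {
    return decode(entry.data);
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    throw new Error(`Failed to create buffer from binary data: ${message}`);
  }
}
```

Reading the checks top to bottom shows why fixing the "Binary Property" name is usually the first thing to try: a wrong name fails at the second check, before the file content is examined at all.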
