PDF & Excel Processor

Process PDF and Excel files

Overview

The PDF & Excel Processor node extracts text from PDF files (with an optional OCR fallback) and processes Excel files within n8n workflows. With the "Default" resource and the "Extract Text" operation, the node extracts text content from PDF documents. If standard extraction fails, it can optionally fall back to Optical Character Recognition (OCR), which makes it usable for both text-based and image-based (scanned) PDFs.

Common scenarios:

  • Extracting readable text from uploaded or received PDF files.
  • Automating document processing pipelines where text needs to be parsed from PDFs for further analysis or storage.
  • Handling scanned documents by enabling OCR fallback.

Practical examples:

  • Automatically extracting invoice data from PDF attachments in emails.
  • Processing scanned contracts or forms by extracting their text content for archiving or review.

Properties

Name            | Type    | Meaning
File Type       | options | Type of file to process. For this operation, set it to "PDF".
Binary Property | string  | Name of the binary property containing the file data.
Fallback to OCR | boolean | Whether to fall back to OCR if standard text extraction fails (PDF only).
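
To illustrate what the "Fallback to OCR" option implies, here is a minimal TypeScript sketch of a try-standard-then-OCR flow. It is not the node's actual implementation: the Extractor type, the function name extractWithOptionalOcr, and the rule that blank output counts as a failed extraction are all assumptions made for illustration.

```typescript
// Hypothetical extractor signature: takes raw PDF bytes, returns extracted text.
type Extractor = (pdf: Uint8Array) => string;

interface ExtractionOutcome {
  text: string;
  usedOcr: boolean;
}

function extractWithOptionalOcr(
  pdf: Uint8Array,
  standard: Extractor,
  ocr: Extractor,
  fallbackToOcr: boolean,
): ExtractionOutcome {
  try {
    const text = standard(pdf);
    // Assumption: blank output is treated as a failed standard extraction.
    if (text.trim().length > 0 || !fallbackToOcr) {
      return { text, usedOcr: false };
    }
  } catch (err) {
    // Without the fallback, the original failure is surfaced to the caller.
    if (!fallbackToOcr) throw err;
  }
  // Reached only when standard extraction failed and the fallback is enabled.
  return { text: ocr(pdf), usedOcr: true };
}
```

With the fallback disabled, a failing standard extractor propagates its error; with it enabled, the OCR extractor is tried exactly once.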

Output

The output is a JSON object with the following structure:

{
  "pdfResults": {
    // ...extracted text and/or metadata fields depending on processor,
    "operation": "extractText",
    "success": true,
    "timestamp": "2024-06-01T12:34:56.789Z"
  }
}

  • The pdfResults field contains:
    • The extracted text (field names depend on the underlying processor).
    • The operation performed (extractText).
    • A success flag.
    • A timestamp of when the operation was performed.

If an error occurs and "Continue On Fail" is enabled, the output will instead contain:

{
  "error": "Error message here"
}

Binary data is preserved and passed through unchanged.
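
A downstream step may want to tell the success shape apart from the "Continue On Fail" error shape. The following TypeScript sketch models both shapes as described in this section; the type names and the summarizeOutput helper are hypothetical, and the extracted-text fields are left open because their names depend on the underlying processor.

```typescript
// Shapes taken from the Output section above. Extra fields (the extracted text
// and any metadata) are left open because their names are processor-dependent.
interface PdfResults {
  operation: string;
  success: boolean;
  timestamp: string;
  [field: string]: unknown;
}

type NodeOutput = { pdfResults: PdfResults } | { error: string };

// Hypothetical helper: produce a one-line summary of either output shape.
function summarizeOutput(output: NodeOutput): string {
  if ("error" in output) {
    return `failed: ${output.error}`;
  }
  const { operation, success, timestamp } = output.pdfResults;
  return `${operation} ${success ? "succeeded" : "did not succeed"} at ${timestamp}`;
}
```

Checking for the error key first keeps the success branch free of optional-field handling.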


Dependencies

  • No external API keys are required.
  • The node relies on internal processors for PDF and Excel handling.
  • No special n8n configuration or environment variables are needed for basic usage.

Troubleshooting

Common issues:

  • "No binary data found": Ensure that the input item contains a binary property with the specified name.
  • "Binary property 'X' not found": Double-check that the "Binary Property" value matches the actual property name in your input.
  • "Binary data in property 'X' is invalid or missing data content": The binary property must include valid base64-encoded data.
  • "Failed to create buffer from binary data": The binary data may be corrupted or improperly encoded.

How to resolve:

  • Verify that the previous node outputs a binary file under the correct property name.
  • Make sure the file type is supported (PDF for this operation).
  • Enable "Fallback to OCR" when the PDF is image-based (for example, a scan) or its text cannot be extracted by the standard method.
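
The checks above can be sketched as a small pre-flight function, for example to run in a step before this node. This is only an illustration that mirrors the documented messages, not the node's own validation code. The InputItem and BinaryEntry types are a reduced, assumed view of an n8n item, the diagnoseBinaryInput name is hypothetical, and the base64 test assumes binary data is held in memory as a padded base64 string.

```typescript
// Reduced, assumed view of an n8n item: JSON payload plus optional binary properties.
interface BinaryEntry {
  data?: string; // base64-encoded file content (assumed in-memory binary mode)
  mimeType?: string;
  fileName?: string;
}

interface InputItem {
  json: Record<string, unknown>;
  binary?: Record<string, BinaryEntry>;
}

// Returns a message mirroring the ones listed above, or null if the item looks usable.
function diagnoseBinaryInput(item: InputItem, propertyName: string): string | null {
  if (!item.binary || Object.keys(item.binary).length === 0) {
    return "No binary data found";
  }
  const entry = item.binary[propertyName];
  if (!entry) {
    return `Binary property '${propertyName}' not found`;
  }
  const data = entry.data ?? "";
  // Padded base64: non-empty, length divisible by 4, only base64 alphabet characters.
  const looksLikeBase64 =
    data.length > 0 && data.length % 4 === 0 && /^[A-Za-z0-9+/]+={0,2}$/.test(data);
  if (!looksLikeBase64) {
    return `Binary data in property '${propertyName}' is invalid or missing data content`;
  }
  return null;
}
```

A null result only means the item passes these structural checks; a corrupted but well-formed base64 payload could still fail later when the file is parsed.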
