Overview
The PDF & Excel Processor node extracts text from PDF files using OCR (Optical Character Recognition). This is particularly useful for scanned or image-based PDFs, where standard text extraction fails. The node can also process Excel files, but this page covers only the operation that extracts text from PDFs with OCR.
Common scenarios:
- Extracting searchable text from scanned documents for archiving or further automation.
- Converting invoices, contracts, or forms in image-based PDFs into machine-readable text.
- Automating data entry by extracting content from non-editable PDF files.
Practical example:
A user uploads a scanned contract as a PDF. This node extracts the text using OCR, making it available for downstream processing such as keyword search, compliance checks, or database storage.
Properties
| Name | Type | Meaning |
|---|---|---|
| File Type | options | Selects the type of file to process. For this operation, set it to "PDF". |
| Binary Property | string | The name of the binary property containing the file data (e.g., "data"). |
| OCR Language | options | Specifies the language to use for OCR, improving recognition accuracy for that language. |
Output
The node outputs an object with the following structure in the json field:
{
  "pdfResults": {
    // ...extracted data from the processor,
    "operation": "extractTextWithOCR",
    "success": true,
    "timestamp": "2024-06-01T12:34:56.789Z"
  }
}
- pdfResults: Contains the results of the OCR extraction, including:
  - The extracted text and any additional metadata provided by the processor.
  - The operation performed ("extractTextWithOCR").
  - A success flag (true if successful).
  - A timestamp indicating when the operation was performed.
If an error occurs and "Continue On Fail" is enabled, the output will include an error field with the error message.
Binary data is preserved and passed through unchanged.
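A downstream step usually needs to tell a successful extraction from a failed one. The following is a minimal TypeScript sketch of that check, not the node's own code. The names `PdfResults`, `OcrNodeJson`, and `readOcrOutput` are hypothetical, and the sketch assumes the error field appears at the top level of the json object when "Continue On Fail" is enabled. The field that holds the extracted text is not specified here, so processor-provided fields are left untyped.

```typescript
// Shape of the node's json output as documented above. Fields other than
// operation, success, and timestamp come from the processor and are not
// specified in this document, so they are typed as unknown.
interface PdfResults {
  operation: string;
  success: boolean;
  timestamp: string;
  [key: string]: unknown;
}

interface OcrNodeJson {
  pdfResults?: PdfResults;
  // Assumption: with "Continue On Fail" enabled, the message sits at the top level.
  error?: string;
}

type OcrOutcome =
  | { ok: true; results: PdfResults }
  | { ok: false; reason: string };

// Classify one output item so later steps can branch on success or failure.
function readOcrOutput(json: OcrNodeJson): OcrOutcome {
  if (typeof json.error === "string") {
    return { ok: false, reason: json.error };
  }
  const results = json.pdfResults;
  if (!results || results.success !== true) {
    return { ok: false, reason: "pdfResults missing or success flag not true" };
  }
  return { ok: true, results };
}

// Usage: inspect a single output item.
const outcome = readOcrOutput({
  pdfResults: {
    operation: "extractTextWithOCR",
    success: true,
    timestamp: "2024-06-01T12:34:56.789Z",
  },
});
console.log(outcome.ok ? outcome.results.operation : outcome.reason);
```

Returning a tagged result instead of throwing keeps the check usable in workflows where failed items should be routed to a separate branch rather than stopping the run.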
Dependencies
- External Libraries/Services: The node relies on internal processor modules (e.g., ProcessorFactory) for handling PDF and OCR operations.
- No explicit API keys or environment variables are required for this operation based on static analysis.
Troubleshooting
Common issues:
- Missing binary data: If the input item does not contain binary data, the node will throw an error: "No binary data found"
- Incorrect binary property name: If the specified binary property does not exist, you will see: "Binary property '<name>' not found"
- Corrupted or missing data: If the binary data is invalid or missing, the error will be: "Binary data in property '<name>' is invalid or missing data content"
- Buffer creation failure: If the binary data cannot be converted to a buffer, the error will mention: "Failed to create buffer from binary data: <error message>"
How to resolve:
- Ensure the input item contains valid binary data under the correct property name.
- Double-check the "Binary Property" value matches the property holding your file data.
- Verify the uploaded file is a valid PDF and not corrupted.
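The first three errors above can be caught before an item reaches the node. The sketch below shows one way to run the same checks in TypeScript. It is an illustration under assumptions, not the node's implementation: `BinaryEntry` and `InputItem` are simplified stand-ins for n8n's item types, `findBinaryProblem` is a hypothetical helper, and the buffer conversion step is left out.

```typescript
// Simplified stand-ins for an n8n input item; the real types are richer.
interface BinaryEntry {
  data?: string; // base64-encoded file content
  mimeType?: string;
}

interface InputItem {
  binary?: Record<string, BinaryEntry | undefined>;
}

// Hypothetical helper: returns the matching error message from the list
// above, or null if the item passes all three checks.
function findBinaryProblem(item: InputItem, propertyName: string): string | null {
  // The item carries no binary data at all.
  if (!item.binary) {
    return "No binary data found";
  }
  // The "Binary Property" value does not match any key on the item.
  const entry = item.binary[propertyName];
  if (!entry) {
    return `Binary property '${propertyName}' not found`;
  }
  // The property exists but holds no usable content.
  if (typeof entry.data !== "string" || entry.data.length === 0) {
    return `Binary data in property '${propertyName}' is invalid or missing data content`;
  }
  return null;
}

// Usage: the file was stored under "attachment", but the node is set to "data".
const problem = findBinaryProblem(
  { binary: { attachment: { data: "JVBERi0xLjQ=", mimeType: "application/pdf" } } },
  "data",
);
console.log(problem ?? "input looks valid");
```

Running a check like this in a preceding step makes it clear whether a failure comes from the workflow wiring (wrong property name) or from the file itself.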
Links and References
- n8n Documentation: Working with Binary Data
- n8n Community Forum
- Tesseract OCR Languages (for supported OCR languages)