Overview
The PDF Page Extract node extracts text from PDF files page by page. It reads binary PDF data and outputs an array in which each element holds the text extracted from one page. Optionally, it can also include the raw text of the whole PDF as a single string, along with the document's metadata.
This node is useful in scenarios where you need to analyze or process PDF documents by their individual pages, such as extracting page-wise summaries, indexing content for search, or feeding page-specific data into further workflows.
Practical examples:
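- Extracting text from reports or invoices page by page for automated data extraction (scanned documents need an embedded text layer, since pdf-parse reads existing text and does not perform OCR).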
- Splitting large PDF documents into page-level text chunks for natural language processing tasks.
- Retrieving metadata like author or creation date alongside page content for document management systems.
Properties
| Name | Meaning |
|---|---|
| Binary Property | Name of the binary property that contains the PDF file data (default: "data"). |
| Include Raw Text | Whether to include the complete raw text of the entire PDF in addition to the pages array. |
| Include Metadata | Whether to include PDF metadata (such as info and metadata fields) in the output JSON. |
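
As a rough illustration of how the two toggles could shape the result, the sketch below builds the output JSON from an already-parsed PDF. It is not the node's actual source: `ExtractOptions`, `ParsedPdf`, and `buildJson` are hypothetical names, and only the output field names and the "data" default come from this page.

```typescript
// Hypothetical option names mirroring the three properties in the table.
interface ExtractOptions {
  binaryProperty: string;   // "Binary Property" (default: "data"); selects the input, unused below
  includeRawText: boolean;  // "Include Raw Text"
  includeMetadata: boolean; // "Include Metadata"
}

// Hypothetical shape of a PDF that has already been parsed.
interface ParsedPdf {
  pages: string[];
  text: string;
  info: Record<string, unknown>;
  metadata: Record<string, unknown> | null;
}

function buildJson(
  parsed: ParsedPdf,
  options: ExtractOptions,
  filename?: string,
): Record<string, unknown> {
  const json: Record<string, unknown> = {
    totalPages: parsed.pages.length,
    pages: parsed.pages,
  };
  // The filename is only reported when one is available.
  if (filename !== undefined) json["filename"] = filename;
  // Optional fields are added only when the matching toggle is on.
  if (options.includeRawText) json["text"] = parsed.text;
  if (options.includeMetadata) {
    json["info"] = parsed.info;
    json["metadata"] = parsed.metadata;
  }
  return json;
}
```

With both toggles off, the result carries only `totalPages`, `pages`, and (when known) `filename`, which keeps items small when only page text is needed.
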
Output
The node outputs an array of items, each containing:
- `json`: An object with the following structure:
  - `filename`: The original filename of the PDF (if available).
  - `totalPages`: Total number of pages in the PDF.
  - `pages`: An array of strings, each string representing the extracted text of a single page.
  - `text` (optional): The full raw text of the entire PDF if "Include Raw Text" is enabled.
  - `metadata` and `info` (optional): Metadata objects extracted from the PDF if "Include Metadata" is enabled.
- `binary`: The original binary data of the PDF passed through unchanged.
If an error occurs during processing and the node is configured to continue on failure, the output item will contain an error field in json describing the issue.
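
The output structure described above can be written down as a type, which makes downstream handling easier to reason about. The following is a minimal sketch, not part of the node: `PdfPageExtractJson`, `PageChunk`, and `toPageChunks` are hypothetical names, and treating `error` as a string is an assumption. It turns one output item into page-level chunks, for example before indexing or NLP processing.

```typescript
// Field names follow the Output section; the type names are hypothetical.
interface PdfPageExtractJson {
  filename?: string;
  totalPages: number;
  pages: string[];
  text?: string;                              // only with "Include Raw Text"
  metadata?: Record<string, unknown> | null;  // only with "Include Metadata"
  info?: Record<string, unknown>;             // only with "Include Metadata"
  error?: string;                             // assumed to be a string message
}

interface PageChunk {
  filename: string;
  pageNumber: number; // 1-based
  text: string;
}

function toPageChunks(json: PdfPageExtractJson): PageChunk[] {
  // Items that failed with "continue on fail" carry an error and no usable pages.
  if (json.error !== undefined) return [];
  const filename = json.filename ?? "unknown.pdf";
  return json.pages
    .map((text, index) => ({ filename, pageNumber: index + 1, text: text.trim() }))
    // Drop pages with no extractable text, keeping original page numbers.
    .filter((chunk) => chunk.text.length > 0);
}
```

Page numbers are assigned before blank pages are filtered out, so each chunk still points at its original page in the PDF.
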
Dependencies
- Requires the external library `pdf-parse` for parsing PDF content.
- The node expects valid PDF binary data provided via a specified binary property.
- No additional API keys or external services are required.
Troubleshooting
- `No binary data exists on item!`: The input item does not contain any binary data. Ensure that the input to this node includes binary PDF data.
- `Binary data property "[name]" does not exist on item!`: The specified binary property name does not match any binary data property on the input item. Verify that the property name matches exactly.
- `PDF binary data is empty after decoding!`: The binary data was found but is empty after decoding. Confirm that the binary data is correctly loaded and not corrupted.
- If the node fails unexpectedly, check that the input data is a valid PDF file and that the binary property is correctly set.
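
The three error conditions can be pictured as a sequence of checks on the input item. The sketch below is an approximation for debugging, not the node's implementation: `InputItem`, `BinaryEntry`, `base64ByteLength`, and `checkPdfInput` are hypothetical names, and it assumes the binary payload is held as a base64 string in a `data` field.

```typescript
// Hypothetical item shapes; the base64 `data` field is an assumption.
interface BinaryEntry {
  data: string; // assumed: base64-encoded file content
  fileName?: string;
}

interface InputItem {
  json: Record<string, unknown>;
  binary?: Record<string, BinaryEntry>;
}

// Number of bytes a base64 string decodes to, computed without decoding it.
function base64ByteLength(b64: string): number {
  const clean = b64.replace(/\s+/g, "");
  let padding = 0;
  if (clean.charAt(clean.length - 1) === "=") padding += 1;
  if (clean.charAt(clean.length - 2) === "=") padding += 1;
  return Math.max(0, Math.floor((clean.length * 3) / 4) - padding);
}

// Returns the matching error message from the list above, or null if the input looks usable.
function checkPdfInput(item: InputItem, propertyName = "data"): string | null {
  if (item.binary === undefined) {
    return "No binary data exists on item!";
  }
  const entry = item.binary[propertyName];
  if (entry === undefined) {
    return `Binary data property "${propertyName}" does not exist on item!`;
  }
  if (base64ByteLength(entry.data) === 0) {
    return "PDF binary data is empty after decoding!";
  }
  return null;
}
```

A `null` result only means the item passes these three checks. It does not confirm that the bytes form a valid PDF.
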
Links and References
- pdf-parse GitHub repository – underlying library used for PDF text extraction.
- n8n Documentation – general guidance on working with binary data and custom nodes.