PDF Parse

Parse PDF files and extract text content with enhanced AI-friendly formatting

Actions

- Parse PDF
- Convert to Image

Overview

This node parses PDF files to extract text content or converts PDF pages into images. It supports input PDFs either from binary data within the workflow or by fetching them from a URL. The node offers advanced text formatting options tailored for AI-friendly output, preserving layout and structure in various styles. Additionally, it can convert PDF pages into image files (PNG or JPEG) with configurable resolution and dimensions.

Common scenarios:

- Extracting clean, structured text from invoices, reports, or contracts for further processing.
- Converting PDF pages into images for preview thumbnails or embedding in other documents.
- Fetching PDFs from external URLs and parsing their content automatically.
- Preparing PDF text for natural language processing or AI analysis with customizable formatting.

Practical examples:

- Automatically extracting purchase order details from supplier PDFs.
- Generating image previews of PDF manuals for display in a web app.
- Parsing multi-page PDFs and splitting extracted text by page for granular processing.
- Downloading and converting PDF brochures from URLs into optimized JPEG images.

Properties

Name	Meaning
PDF Source	Source of the PDF file to parse: either Binary Data (from a binary property in the workflow) or URL (fetch PDF from a web address).
Binary Property	Name of the binary property containing the PDF file (required if source is Binary Data).
URL	URL of the PDF file to parse (required if source is URL).
Output Property Name	Property name where the extracted content or image info will be stored in the output JSON.
Additional Options	Collection of optional settings:
- Max Pages	Maximum number of pages to parse or convert (0 means all pages).
- Page Range Start	Starting page number (1-based) for parsing or conversion.
- Page Range End	Ending page number (0 means last page).
- Text Formatting	Style of text formatting for extracted content (only for parsing): • Raw (best for AI) • Smart Layout • Minimal Cleanup • Visual Layout • Structured • Compact
- Include Metadata	Whether to include PDF metadata (like author, creation date) in the output.
- Split by Pages	Whether to split the extracted text output into an array of page texts instead of one combined string.
- Version	PDF.js version used for parsing (internal detail, default "v1.10.100").
- Image Format	Output image format when converting PDF pages: PNG (with transparency) or JPEG (smaller size).
- DPI (Resolution)	Dots per inch for image conversion; higher values yield better quality but larger files (range 72–600).
- Width	Width in pixels for output images (0 means auto based on DPI).
- Height	Height in pixels for output images (0 means auto based on DPI).
- Preserve Aspect Ratio	Maintain original aspect ratio when resizing images.
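
The interaction between Max Pages, Page Range Start, and Page Range End can be illustrated with a minimal sketch. The helper name `resolvePages`, the `PageOptions` shape, and the exact precedence between the three options are assumptions made for illustration; the node's own implementation may differ.

```typescript
// Hypothetical helper, not taken from the node's source code.
interface PageOptions {
  maxPages: number;       // 0 = no limit
  pageRangeStart: number; // 1-based
  pageRangeEnd: number;   // 0 = last page
}

function resolvePages(totalPages: number, opts: PageOptions): number[] {
  // Clamp the start to the first page and the end to the last page.
  const start = Math.max(1, opts.pageRangeStart);
  const end = opts.pageRangeEnd === 0 ? totalPages : Math.min(opts.pageRangeEnd, totalPages);

  const pages: number[] = [];
  for (let page = start; page <= end; page++) {
    // Stop once the Max Pages limit is reached (0 disables the limit).
    if (opts.maxPages > 0 && pages.length >= opts.maxPages) break;
    pages.push(page);
  }
  return pages;
}

// A 10-page document, starting at page 3, reading to the end, capped at 4 pages.
const selected = resolvePages(10, { maxPages: 4, pageRangeStart: 3, pageRangeEnd: 0 });
console.log(selected.join(", ")); // 3, 4, 5, 6
```

In this sketch a start page beyond the end of the document yields an empty list rather than an error.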

Output

The node outputs one item for each input item processed.

For Parse PDF operation:
- The output JSON contains the extracted text under the specified output property name.
- If Split by Pages is enabled, this property holds an array of strings, each representing one page's text.
- PDF metadata is included when Include Metadata is enabled.
- Additional statistics like total text length, word count, and page count are included.
- Binary data is passed through unchanged.
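
As a rough sketch of the Parse PDF output described above, the following builds a JSON object from per-page text. The statistics field names (`textLength`, `wordCount`, `pageCount`), the page separator, and the helper name `buildParseOutput` are assumptions; only the user-chosen output property and the Split by Pages behavior come from the description.

```typescript
// Hypothetical output builder; field names other than the output property are assumed.
interface ParseOutput {
  [key: string]: unknown;
}

function buildParseOutput(pageTexts: string[], outputProperty: string, splitByPages: boolean): ParseOutput {
  // Join pages with a blank line between them (the separator is an assumption).
  const combined = pageTexts.join("\n\n");
  const words = combined.trim().split(/\s+/).filter((w) => w.length > 0);

  const output: ParseOutput = {
    textLength: combined.length,
    wordCount: words.length,
    pageCount: pageTexts.length,
  };
  // Split by Pages: an array of page strings instead of one combined string.
  output[outputProperty] = splitByPages ? pageTexts : combined;
  return output;
}

const item = buildParseOutput(["Invoice 42", "Total due: 100 EUR"], "text", true);
console.log(JSON.stringify(item, null, 2));
```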

For Convert to Image operation:
- The output JSON contains an array of objects, each describing one converted page image with properties such as page number, width, height, size, format, and the binary property name holding the image.
- The binary data contains base64-encoded image buffers keyed by generated binary property names (e.g., image_page_1).
- Images are provided in the chosen format (PNG or JPEG).
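
To show how DPI, Width, Height, and Preserve Aspect Ratio could determine the reported image dimensions, here is a minimal sketch. It relies on the default PDF unit of 72 points per inch; the helper names `targetSize` and `binaryKey`, and the fit-inside-the-box behavior when the aspect ratio is preserved, are assumptions rather than the node's confirmed logic.

```typescript
// Hypothetical helpers, not taken from the node's source code.
interface ImageOptions {
  dpi: number;                  // 72 to 600
  width: number;                // 0 = derive from DPI
  height: number;               // 0 = derive from DPI
  preserveAspectRatio: boolean;
}

interface Size {
  width: number;
  height: number;
}

function targetSize(pageWidthPt: number, pageHeightPt: number, opts: ImageOptions): Size {
  // PDF page sizes default to 72 points per inch, so pixels = points * dpi / 72.
  const scale = opts.dpi / 72;
  const auto: Size = {
    width: Math.round(pageWidthPt * scale),
    height: Math.round(pageHeightPt * scale),
  };
  if (opts.width === 0 && opts.height === 0) return auto;

  if (!opts.preserveAspectRatio) {
    // Use each explicit dimension as given; fall back to the DPI-based value for 0.
    return { width: opts.width || auto.width, height: opts.height || auto.height };
  }
  // Assumed behavior: fit inside the requested box while keeping the page proportions.
  const ratioW = opts.width > 0 ? opts.width / auto.width : Infinity;
  const ratioH = opts.height > 0 ? opts.height / auto.height : Infinity;
  const ratio = Math.min(ratioW, ratioH);
  return { width: Math.round(auto.width * ratio), height: Math.round(auto.height * ratio) };
}

// Binary property name following the image_page_1 pattern mentioned above.
function binaryKey(page: number): string {
  return "image_page_" + page;
}

// US Letter page (612 x 792 points) at 150 DPI, then limited to 800 pixels wide.
const letter = targetSize(612, 792, { dpi: 150, width: 0, height: 0, preserveAspectRatio: true });
const thumb = targetSize(612, 792, { dpi: 150, width: 800, height: 0, preserveAspectRatio: true });
console.log(binaryKey(1), letter, thumb);
```

Here `letter` comes out at 1275 x 1650 pixels and `thumb` at 800 x 1035 pixels.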

Dependencies

- Uses the pdf-parse library for PDF text extraction.
- Uses the pdf2pic library for converting PDF pages to images.
- Requires network access if fetching PDFs from URLs.
- No special environment variables are needed. Standard n8n credentials for HTTP requests apply where relevant.

Troubleshooting

- Error: "URL is required when source is set to URL"
  Ensure the URL property is filled when selecting URL as the PDF source.
- Error: "Invalid URL format"
  Verify the URL is correctly formatted and accessible.
- Error: "Failed to fetch PDF from URL"
  Check network connectivity and that the URL points to a valid PDF file.
- Error: "PDF file is empty or could not be read"
  Confirm the binary data or fetched file is a valid, non-empty PDF.
- Error: "File does not appear to be a valid PDF"
  The file header does not match the PDF signature; verify the input file is a proper PDF.
- Image conversion failures
  May occur if requested page numbers exceed the document length or due to unsupported PDF features; check the page range and PDF integrity.
- Performance considerations
  Parsing large PDFs or converting many pages at high DPI may consume significant memory and time.
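
The input checks behind several of the errors above can be approximated with a short sketch. The function names `validateUrl` and `assertLooksLikePdf` are hypothetical, and checking the `%PDF-` signature at the very first byte is an assumption about how the header test works; the error messages are the ones listed in this section.

```typescript
// Hypothetical validators that reuse the error messages documented above.
function validateUrl(url: string): void {
  if (url.trim() === "") {
    throw new Error("URL is required when source is set to URL");
  }
  try {
    new URL(url); // throws a TypeError when the string is not an absolute URL
  } catch {
    throw new Error("Invalid URL format");
  }
}

const PDF_SIGNATURE = [0x25, 0x50, 0x44, 0x46, 0x2d]; // the bytes of "%PDF-"

function assertLooksLikePdf(data: Uint8Array): void {
  if (data.length === 0) {
    throw new Error("PDF file is empty or could not be read");
  }
  // Compare the first five bytes with the signature (assumed to sit at offset 0).
  const matches = PDF_SIGNATURE.every((byte, i) => data[i] === byte);
  if (!matches) {
    throw new Error("File does not appear to be a valid PDF");
  }
}

// "%PDF-1.7" as raw bytes: both checks pass without throwing.
const sample = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x37]);
validateUrl("https://example.com/manual.pdf");
assertLooksLikePdf(sample);
```

Running checks like these before parsing or conversion keeps malformed input from reaching the heavier PDF libraries.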

Links and References

pdf-parse GitHub repository – PDF text extraction library used internally.
pdf2pic GitHub repository – Library for converting PDF pages to images.
PDF.js project – Underlying PDF rendering engine referenced by version property.