PDF Parse

Parse PDF files and extract text content with enhanced AI-friendly formatting

Actions

- Parse PDF
- Convert to Image

Overview

This node parses PDF files to extract their text content or convert PDF pages into images. It supports input PDFs from either binary data within the workflow or by fetching from a URL. The node offers multiple text formatting styles optimized for different use cases, including AI-friendly raw text and visually structured layouts. It can also split extracted text by pages and optionally include PDF metadata.

Common scenarios where this node is beneficial include:

- Extracting readable text from invoices, reports, or contracts stored as PDFs.
- Preparing PDF text for further processing with AI or text analysis tools.
- Converting PDF pages into images for preview or archival purposes.
- Fetching and parsing remote PDF documents dynamically via URLs.

Practical examples:

- Automatically extracting purchase order details from PDF attachments in emails.
- Converting multi-page PDF brochures into PNG images for web display.
- Parsing PDF reports (including scanned ones that carry an OCR text layer) and cleaning up the text layout for natural language processing.

Properties

Name	Meaning
Operation	Choose between "Parse PDF" (extract text content) or "Convert to Image" (convert PDF pages to PNG or JPEG images).
PDF Source	Select the source of the PDF file: "Binary Data" (from workflow binary property) or "URL" (fetch PDF from a web address).
Binary Property	Name of the binary property containing the PDF file (required if source is "Binary Data").
URL	URL of the PDF file to parse (required if source is "URL").
Output Property Name	Name of the property where the extracted content or image info will be stored in the output JSON.
Additional Options	Collection of optional settings:
- Max Pages	Maximum number of pages to parse or convert (0 means all pages).
- Page Range Start	Starting page number (1-based) for parsing or conversion.
- Page Range End	Ending page number (0 means last page).
- Text Formatting	Style of text formatting for extracted content when parsing: • Raw (best for AI) • Smart Layout • Minimal Cleanup • Visual Layout • Structured • Compact
- Include Metadata	Whether to include PDF metadata (like author, creation date) in the output.
- Split by Pages	Whether to split the extracted text output into an array of page texts instead of a single string.
- Version	PDF.js library version used for parsing (default "v1.10.100").
- Image Format	(For conversion operation) Output image format: PNG (with transparency) or JPEG (smaller size).
- DPI (Resolution)	(For conversion) Dots per inch for image quality; higher values produce better quality but larger files (range 72–600).
- Width	(For conversion) Image width in pixels; 0 means auto based on DPI.
- Height	(For conversion) Image height in pixels; 0 means auto based on DPI.
- Preserve Aspect Ratio	(For conversion) Maintain original aspect ratio when resizing images.
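
The table does not spell out how Max Pages, Page Range Start, and Page Range End interact. The TypeScript sketch below shows one plausible reading, assuming the range is resolved first and Max Pages then caps how many pages are taken from it. `PageOptions` and `resolvePages` are hypothetical names used for illustration, not part of the node's API.

```typescript
// Hypothetical option shape mirroring the three page-related settings.
interface PageOptions {
  maxPages: number;       // 0 means all pages
  pageRangeStart: number; // 1-based
  pageRangeEnd: number;   // 0 means last page
}

// Returns the 1-based page numbers that would be processed.
// Assumption: the range is resolved first, then capped by maxPages.
function resolvePages(totalPages: number, opts: PageOptions): number[] {
  const start = Math.max(1, opts.pageRangeStart);
  const end =
    opts.pageRangeEnd === 0 ? totalPages : Math.min(opts.pageRangeEnd, totalPages);

  const pages: number[] = [];
  for (let p = start; p <= end; p++) {
    if (opts.maxPages > 0 && pages.length >= opts.maxPages) break;
    pages.push(p);
  }
  return pages;
}

// Example: a 10-page PDF, starting at page 2, open-ended range, at most 3 pages.
const selectedPages = resolvePages(10, { maxPages: 3, pageRangeStart: 2, pageRangeEnd: 0 });
```

Under these assumptions, a start page beyond the end of the document simply yields an empty list rather than an error.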

Output

For Parse PDF operation:
- Outputs JSON with the extracted text stored under the specified output property name.
- If "Split by Pages" is enabled, the output property contains an array of strings, each representing one page's text.
- Optionally includes PDF metadata under pdfMetadata if enabled.
- Also includes summary stats like total pages, text length, and word count under numPages and pdfStats.
- Binary data from input is preserved in output.
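
As a rough illustration of the output shape described above, the following TypeScript sketch assembles a result object from per-page text. It is an assumption-laden sketch, not the node's implementation: `buildParseOutput`, the `textLength` and `wordCount` field names inside `pdfStats`, and the blank-line separator used when pages are joined are all hypothetical.

```typescript
// Field names inside pdfStats are assumptions; the source only names the container.
interface PdfStats {
  textLength: number;
  wordCount: number;
}

type ParseOutput = Record<string, unknown> & { numPages: number; pdfStats: PdfStats };

// Builds the JSON payload for the Parse PDF operation from per-page text.
function buildParseOutput(
  pageTexts: string[],
  outputProperty: string,
  splitByPages: boolean,
): ParseOutput {
  // Assumption: pages are joined with a blank line when not split.
  const fullText = pageTexts.join("\n\n");
  const words = fullText.split(/\s+/).filter((w) => w.length > 0);

  return {
    // Array of page strings when "Split by Pages" is on, one string otherwise.
    [outputProperty]: splitByPages ? pageTexts : fullText,
    numPages: pageTexts.length,
    pdfStats: { textLength: fullText.length, wordCount: words.length },
  };
}

// Example: two pages stored under the output property "text", split by page.
const sampleOutput = buildParseOutput(["Invoice 42", "Total: 10 EUR"], "text", true);
```

Downstream nodes would then read either a string or an array from the chosen output property, depending on whether "Split by Pages" is enabled.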
For Convert to Image operation:
- Outputs JSON with an array of objects describing each converted page image, including page number, dimensions, file size, format, and reference to binary data property.
- Binary data properties contain base64-encoded image data with appropriate MIME types and filenames.
- Each image corresponds to one PDF page converted according to specified options.
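
The dimensions reported for each image follow from the DPI, Width, Height, and Preserve Aspect Ratio options. PDF page sizes are expressed in points (1/72 inch), so an automatic size is the page size in inches multiplied by the DPI. The TypeScript sketch below works through that arithmetic. `resolveImageSize` and its option names are hypothetical, and the fit-inside behavior when the aspect ratio is preserved is an assumption rather than documented node behavior.

```typescript
// Hypothetical option shape mirroring the conversion settings.
interface ImageSizeOptions {
  dpi: number;                  // documented range: 72 to 600
  width: number;                // pixels; 0 means auto
  height: number;               // pixels; 0 means auto
  preserveAspectRatio: boolean;
}

interface PixelSize {
  width: number;
  height: number;
}

const POINTS_PER_INCH = 72; // 1 PDF point = 1/72 inch

function resolveImageSize(
  pageWidthPt: number,
  pageHeightPt: number,
  opts: ImageSizeOptions,
): PixelSize {
  // Clamp DPI to the documented range.
  const dpi = Math.min(600, Math.max(72, opts.dpi));
  const autoWidth = Math.round((pageWidthPt / POINTS_PER_INCH) * dpi);
  const autoHeight = Math.round((pageHeightPt / POINTS_PER_INCH) * dpi);

  // Both 0: size is derived purely from DPI.
  if (opts.width === 0 && opts.height === 0) {
    return { width: autoWidth, height: autoHeight };
  }

  // Ratio not preserved: each explicit value is used as-is, 0 falls back to auto.
  if (!opts.preserveAspectRatio) {
    return { width: opts.width || autoWidth, height: opts.height || autoHeight };
  }

  // Assumption: with the ratio preserved, scale to fit the explicit dimension(s).
  const scaleW = opts.width > 0 ? opts.width / autoWidth : Infinity;
  const scaleH = opts.height > 0 ? opts.height / autoHeight : Infinity;
  const scale = Math.min(scaleW, scaleH);
  return { width: Math.round(autoWidth * scale), height: Math.round(autoHeight * scale) };
}

// US Letter is 612 x 792 points (8.5 x 11 inches).
const letterAt300 = resolveImageSize(612, 792, {
  dpi: 300, width: 0, height: 0, preserveAspectRatio: true,
});
```

For a US Letter page at 300 DPI this gives 2550 x 3300 pixels, which shows why higher DPI values produce noticeably larger files.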

Dependencies

- Uses the pdf-parse library for extracting text content from PDFs.
- Uses the pdf2pic library for converting PDF pages to images.
- Requires network access if fetching PDFs from URLs.
- Needs no special environment variables. If the URL source is used, only standard n8n credentials for HTTP requests apply.

Troubleshooting

Common issues:
- Empty or unreadable PDF input: Ensure the binary property or URL points to a valid PDF file.
- Invalid URL format or inaccessible URL: Verify the URL is correct and reachable.
- File not recognized as PDF: The input data must begin with the "%PDF" header.
- Conversion failures for specific pages: May occur due to corrupted pages or unsupported content; these pages are skipped with warnings.
Error messages:
- "URL is required when source is set to URL": Provide a valid URL.
- "Invalid URL format": Check URL syntax.
- "Failed to fetch PDF from URL": Network or permission issue accessing the URL.
- "PDF file is empty or could not be read": Input PDF data is missing or corrupted.
- "File does not appear to be a valid PDF": Input file is not a proper PDF document.
- "Image conversion failed": Problem during image generation; check parameters and PDF integrity.
To resolve errors, verify inputs, ensure network connectivity, and confirm PDF validity before running the node.
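
To confirm PDF validity before the node runs, the empty-input and "%PDF" header checks described above can be approximated in a few lines. This TypeScript sketch is a pre-flight check under stated assumptions, not the node's own validation code: `validatePdfBytes` is a hypothetical helper, and it reuses two of the error strings listed above only for readability.

```typescript
// ASCII bytes for "%PDF", the header every PDF file starts with.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46];

// Returns null when the bytes look like a PDF, otherwise an error message.
function validatePdfBytes(data: Uint8Array): string | null {
  if (data.length === 0) {
    return "PDF file is empty or could not be read";
  }
  const hasHeader =
    data.length >= PDF_MAGIC.length && PDF_MAGIC.every((byte, i) => data[i] === byte);
  return hasHeader ? null : "File does not appear to be a valid PDF";
}

// "%PDF-1.7" as raw bytes: passes the header check.
const pdfLike = Uint8Array.from([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x37]);
const pdfCheck = validatePdfBytes(pdfLike);

// "<htm": the start of an HTML error page returned instead of a PDF.
const htmlLike = Uint8Array.from([0x3c, 0x68, 0x74, 0x6d]);
const htmlCheck = validatePdfBytes(htmlLike);
```

A header check like this only rules out obviously wrong inputs, such as an HTML error page fetched from a URL. A file can pass it and still be corrupted further in.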