PDF Parse

Parse PDF files and extract text content with enhanced AI-friendly formatting

Actions

- Parse PDF
- Convert to Image

Overview

This node processes PDF files either by extracting text content or converting PDF pages into images. It supports input PDFs from binary data or URLs. The "Convert to Image" operation transforms each page of the PDF into an image file (PNG or JPEG) with configurable resolution and dimensions. This is useful for workflows that require visual representations of PDF pages, such as generating thumbnails, previews, or embedding images in reports.

Practical examples:

- Converting a multi-page PDF invoice into individual PNG images for display on a website.
- Rendering PDF pages as images to send as email attachments.
- Creating thumbnails of PDF documents stored in cloud storage.

Properties

Name	Meaning
PDF Source	Source of the PDF file: either "Binary Data" (from a binary property) or "URL" (download from a URL).
Binary Property	Name of the binary property containing the PDF file (required if source is Binary Data).
URL	URL of the PDF file to parse (required if source is URL).
Output Property Name	Property name where the output (image info array) will be stored.
Additional Options	Collection of optional settings:
- Max Pages	Maximum number of pages to convert (0 means all pages).
- Page Range Start	Starting page number (1-based) for conversion.
- Page Range End	Ending page number (0 means last page).
- Image Format	Output image format: PNG (supports transparency) or JPEG (smaller size).
- DPI (Resolution)	Dots per inch for image quality; higher values yield better quality but larger files (72–600).
- Width	Image width in pixels; 0 means auto-calculated based on DPI.
- Height	Image height in pixels; 0 means auto-calculated based on DPI.
- Preserve Aspect Ratio	Whether to maintain the original aspect ratio when resizing images.
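
The table does not spell out how the paging options combine or how an auto-calculated size is derived. The following is a minimal sketch, not the node's actual code. It assumes the page range is applied first and Max Pages then caps the result, and it relies on the PDF convention of 72 points per inch. `resolvePages` and `autoSize` are hypothetical helper names.

```typescript
// Hypothetical helper: list the 1-based page numbers to convert.
// Assumption: the page range is applied first, then Max Pages caps the result.
function resolvePages(
  totalPages: number,
  start: number = 1,    // Page Range Start (1-based)
  end: number = 0,      // Page Range End (0 = last page)
  maxPages: number = 0  // Max Pages (0 = all pages)
): number[] {
  const first = Math.max(1, start);
  const last = end === 0 ? totalPages : Math.min(end, totalPages);
  const pages: number[] = [];
  for (let p = first; p <= last; p++) {
    if (maxPages > 0 && pages.length >= maxPages) break;
    pages.push(p);
  }
  return pages;
}

// Hypothetical helper: pixel size when Width and Height are both 0.
// PDF page sizes are expressed in points, at 72 points per inch.
function autoSize(
  widthPt: number,
  heightPt: number,
  dpi: number
): { width: number; height: number } {
  const scale = dpi / 72;
  return {
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}
```

Under these assumptions, a 10-page PDF with Page Range Start 3, Page Range End 0, and Max Pages 4 yields pages 3 through 6, and a US Letter page (612 × 792 points) at 150 DPI comes out at 1275 × 1650 pixels.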

Output

The node outputs JSON data with the specified output property containing an array of objects, each representing a converted PDF page image. Each object includes:

- page: The page number.
- width: Width of the generated image in pixels.
- height: Height of the generated image in pixels.
- size: Size of the image data in bytes.
- format: Image format used ("png" or "jpeg").
- binaryProperty: The name of the binary property holding the base64-encoded image data.

The binary data for each image page is included in the output under separate binary properties named like image_page_1, image_page_2, etc., with appropriate MIME types (image/png or image/jpeg) and file extensions.
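
For downstream code that consumes this output, the fields above can be written down as a type. This is a sketch based only on the field list and naming pattern described here. The interface name `PageImageInfo` and the helpers `binaryKeyFor`, `mimeTypeFor`, and `totalBytes` are hypothetical and not exported by the node.

```typescript
// Shape of one entry in the output array, taken from the field list above.
interface PageImageInfo {
  page: number;
  width: number;           // pixels
  height: number;          // pixels
  size: number;            // bytes
  format: "png" | "jpeg";
  binaryProperty: string;  // e.g. "image_page_1"
}

// Hypothetical helper: binary property name for a page, following the
// image_page_1, image_page_2, ... pattern described above.
function binaryKeyFor(page: number): string {
  return `image_page_${page}`;
}

// Hypothetical helper: MIME type that goes with each format.
function mimeTypeFor(info: PageImageInfo): string {
  return info.format === "png" ? "image/png" : "image/jpeg";
}

// Hypothetical helper: total image payload across all converted pages.
function totalBytes(infos: PageImageInfo[]): number {
  return infos.reduce((sum, info) => sum + info.size, 0);
}
```

A later step in the workflow could use `binaryProperty` from each entry to look up the matching binary data, for example to attach every page image to an email.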

Dependencies

- Uses the pdf2pic library to convert PDF pages to images.
- Requires access to the PDF file either as binary data or via a valid URL.
- No special environment variables are needed. When fetching from URLs, only the standard n8n credentials for HTTP requests apply.

Troubleshooting

- Invalid or empty PDF file: The node checks whether the input buffer starts with %PDF. If it does not, the node throws an error indicating that the file is not a valid PDF.
- Failed to fetch PDF from URL: If the URL is invalid or unreachable, an error is thrown. Ensure the URL is correct and accessible.
- Image conversion failures: An individual page conversion may fail without raising an error; only a warning is logged. Check the node execution logs for details.
- Incorrect binary property name: When using binary input, ensure the binary property name matches the actual property containing the PDF data.
- Unsupported image format or DPI values: Use only supported formats (PNG, JPEG) and a DPI between 72 and 600.
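
Two of the checks above are easy to reproduce before data reaches the node, for example in a preceding code step. The sketch below shows one way to do that; it is not the node's own implementation, and `looksLikePdf` and `isSupportedDpi` are hypothetical names.

```typescript
// "%PDF" as ASCII byte values.
const PDF_MAGIC: number[] = [0x25, 0x50, 0x44, 0x46];

// Returns true when the data begins with the %PDF signature.
function looksLikePdf(data: Uint8Array): boolean {
  if (data.length < PDF_MAGIC.length) return false;
  return PDF_MAGIC.every((byte, i) => data[i] === byte);
}

// Returns true when the DPI is inside the documented 72 to 600 range.
// NaN fails both comparisons, so it is rejected as well.
function isSupportedDpi(dpi: number): boolean {
  return dpi >= 72 && dpi <= 600;
}
```

An empty buffer or an HTML error page returned by a URL fails `looksLikePdf`, which matches the "Invalid or empty PDF file" case described above.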

Links and References

- pdf2pic GitHub repository: Library used for PDF-to-image conversion.
- PDF.js project: Underlying PDF parsing technology referenced in the code.
- n8n Documentation: General guidance on creating and using custom nodes.