Overview
This node parses PDF files to extract text content or converts PDF pages into images. It supports input PDFs either from binary data within the workflow or by fetching them from a URL. The node offers advanced text formatting options tailored for AI-friendly output, preserving layout and structure in various styles. Additionally, it can convert PDF pages into image files (PNG or JPEG) with configurable resolution and dimensions.
Common scenarios:
- Extracting clean, structured text from invoices, reports, or contracts for further processing.
- Converting PDF pages into images for preview thumbnails or embedding in other documents.
- Fetching PDFs from external URLs and parsing their content automatically.
- Preparing PDF text for natural language processing or AI analysis with customizable formatting.
Practical examples:
- Automatically extracting purchase order details from supplier PDFs.
- Generating image previews of PDF manuals for display in a web app.
- Parsing multi-page PDFs and splitting extracted text by page for granular processing.
- Downloading and converting PDF brochures from URLs into optimized JPEG images.
Properties
| Name | Meaning |
|---|---|
| PDF Source | Source of the PDF file to parse: either Binary Data (from a binary property in the workflow) or URL (fetch PDF from a web address). |
| Binary Property | Name of the binary property containing the PDF file (required if source is Binary Data). |
| URL | URL of the PDF file to parse (required if source is URL). |
| Output Property Name | Property name where the extracted content or image info will be stored in the output JSON. |
| Additional Options | Collection of optional settings: |
| - Max Pages | Maximum number of pages to parse or convert (0 means all pages). |
| - Page Range Start | Starting page number (1-based) for parsing or conversion. |
| - Page Range End | Ending page number (0 means last page). |
| - Text Formatting | Style of text formatting for extracted content (only for parsing): • Raw (best for AI) • Smart Layout • Minimal Cleanup • Visual Layout • Structured • Compact |
| - Include Metadata | Whether to include PDF metadata (like author, creation date) in the output. |
| - Split by Pages | Whether to split the extracted text output into an array of page texts instead of one combined string. |
| - Version | PDF.js version used for parsing (internal detail, default "v1.10.100"). |
| - Image Format | Output image format when converting PDF pages: PNG (with transparency) or JPEG (smaller size). |
| - DPI (Resolution) | Dots per inch for image conversion; higher values yield better quality but larger files (range 72–600). |
| - Width | Width in pixels for output images (0 means auto based on DPI). |
| - Height | Height in pixels for output images (0 means auto based on DPI). |
| - Preserve Aspect Ratio | Maintain original aspect ratio when resizing images. |
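
The three page-selection options (Max Pages, Page Range Start, Page Range End) interact, and the table does not spell out how. The sketch below shows one plausible reading, not the node's actual code. It assumes the range is applied first, Max Pages then caps the result, and an end page past the document length is clamped. The function name `resolve_page_range` is hypothetical.

```python
from typing import List


def resolve_page_range(total_pages: int, start: int = 1, end: int = 0,
                       max_pages: int = 0) -> List[int]:
    """Return the 1-based page numbers selected by the three options.

    Assumptions (not confirmed by the node's source):
    - end == 0 means "through the last page"
    - max_pages == 0 means "no cap"
    - the range is applied before the max_pages cap
    """
    if total_pages <= 0:
        return []
    first = max(start, 1)
    last = total_pages if end == 0 else min(end, total_pages)
    pages = list(range(first, last + 1))  # empty when first > last
    if max_pages > 0:
        pages = pages[:max_pages]
    return pages


print(resolve_page_range(10, start=3, end=0, max_pages=2))
```

With a 10-page document, a start page of 3, and Max Pages set to 2, this reading selects pages 3 and 4.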
Output
The node outputs one item for each input item processed.
For the Parse PDF operation:
- The output JSON contains the extracted text under the specified output property name.
- If Split by Pages is enabled, this property holds an array of strings, each representing one page's text.
- Optionally includes PDF metadata if enabled.
- Additional statistics like total text length, word count, and page count are included.
- Binary data is passed through unchanged.
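
As a rough illustration of the shape described above, the following sketch assembles a Parse PDF result from a list of per-page strings. The key names `stats`, `textLength`, `wordCount`, `pageCount`, and `metadata`, the default property name `text`, and the blank-line page separator are all assumptions made for the example; the node's real field names may differ.

```python
from typing import Any, Dict, List, Optional


def build_parse_output(page_texts: List[str],
                       output_property: str = "text",
                       split_by_pages: bool = False,
                       metadata: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a JSON-like dict resembling the Parse PDF output (hypothetical keys)."""
    combined = "\n\n".join(page_texts)  # assumed page separator
    result: Dict[str, Any] = {
        # Array of page strings when splitting, otherwise one combined string.
        output_property: list(page_texts) if split_by_pages else combined,
        "stats": {
            "textLength": len(combined),
            "wordCount": len(combined.split()),
            "pageCount": len(page_texts),
        },
    }
    if metadata is not None:  # only present when Include Metadata is enabled
        result["metadata"] = metadata
    return result


print(build_parse_output(["Hello world", "Second page"], split_by_pages=True))
```

Downstream nodes that read the output property should handle both a string and an array of strings, depending on whether Split by Pages is enabled.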
For the Convert to Image operation:
- The output JSON contains an array of objects, each describing one converted page image with properties such as page number, width, height, size, format, and the binary property name holding the image.
- The binary data contains base64-encoded image buffers keyed by generated binary property names (e.g., image_page_1).
- Images are provided in the chosen format (PNG or JPEG).
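
The width and height reported for each image depend on the DPI, Width, Height, and Preserve Aspect Ratio options. The sketch below estimates them, relying on the fact that PDF page sizes are expressed in points of 1/72 inch by default. How the node or pdf2pic resolves explicit Width and Height values is not documented here, so the fitting logic and the function name `target_size` are assumptions.

```python
from typing import Tuple


def target_size(page_width_pt: float, page_height_pt: float, dpi: int,
                width: int = 0, height: int = 0,
                preserve_aspect_ratio: bool = True) -> Tuple[int, int]:
    """Estimate output pixel dimensions for one page (illustrative only)."""
    if not 72 <= dpi <= 600:
        raise ValueError("DPI must be between 72 and 600")
    # 1 point = 1/72 inch, so pixels = points * dpi / 72.
    auto_w = round(page_width_pt * dpi / 72)
    auto_h = round(page_height_pt * dpi / 72)
    if width == 0 and height == 0:
        return auto_w, auto_h  # both "auto": size follows DPI alone
    if preserve_aspect_ratio:
        if width and height:
            # Assumed behavior: fit inside the requested box.
            scale = min(width / auto_w, height / auto_h)
        else:
            scale = width / auto_w if width else height / auto_h
        return round(auto_w * scale), round(auto_h * scale)
    # Without aspect-ratio preservation, take each value as given.
    return width or auto_w, height or auto_h


# US Letter (612 x 792 pt) rendered at 150 DPI
print(target_size(612, 792, dpi=150))  # (1275, 1650)
```

Since pixel count grows with the square of the DPI, doubling the resolution roughly quadruples the uncompressed image size.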
Dependencies
- Uses the pdf-parse library for PDF text extraction.
- Uses the pdf2pic library for converting PDF pages to images.
- Requires network access if fetching PDFs from URLs.
- No special environment variables are needed. Standard n8n credentials for HTTP requests apply only if the PDF URL requires them.
Troubleshooting
- Error: "URL is required when source is set to URL"
  Ensure the URL property is filled in when URL is selected as the PDF source.
- Error: "Invalid URL format"
  Verify that the URL is correctly formatted and accessible.
- Error: "Failed to fetch PDF from URL"
  Check network connectivity and that the URL points to a valid PDF file.
- Error: "PDF file is empty or could not be read"
  Confirm that the binary data or fetched file is a valid, non-empty PDF.
- Error: "File does not appear to be a valid PDF"
  The file header does not match the PDF signature; verify that the input file is a proper PDF.
- Image conversion failures
  These may occur if the requested page numbers exceed the document length or if the PDF uses unsupported features; check the page range and PDF integrity.
- Performance considerations
  Parsing large PDFs or converting many pages at high DPI may consume significant memory and time.
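
Several of these errors can be caught before the node runs, for example in a preceding code step. The sketch below mirrors the messages above but is not the node's implementation: the helper names `check_url` and `check_pdf_bytes` are hypothetical, and the checks are assumptions about what such validation might look like. The signature test relies on the fact that PDF files begin with the bytes `%PDF-`.

```python
from urllib.parse import urlparse


def check_url(url: str) -> None:
    """Raise ValueError for a missing or malformed http(s) URL."""
    if not url:
        raise ValueError("URL is required when source is set to URL")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("Invalid URL format")


def check_pdf_bytes(data: bytes) -> None:
    """Raise ValueError for empty data or data lacking the PDF signature."""
    if not data:
        raise ValueError("PDF file is empty or could not be read")
    if not data.startswith(b"%PDF-"):
        raise ValueError("File does not appear to be a valid PDF")


check_url("https://example.com/brochure.pdf")  # passes silently
check_pdf_bytes(b"%PDF-1.7\n...")               # passes silently
```

A common cause of the signature error is a URL that returns an HTML error or login page instead of the PDF itself, so inspecting the first few bytes of the download is a quick diagnostic.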
Links and References
- pdf-parse GitHub repository – PDF text extraction library used internally.
- pdf2pic GitHub repository – Library for converting PDF pages to images.
- PDF.js project – Underlying PDF rendering engine referenced by version property.