pdf-extractor

n8n community node to extract text from password-protected PDFs - no external dependencies required

Package Information

Downloads: 41 weekly / 56 monthly

Latest Version: 1.2.0

Author: NAAI Studio

Available Nodes

PDF Extractor

Extract text from password-protected PDFs. No external dependencies required - works out of the box in n8n Docker.

Documentation

n8n-nodes-pdf-extractor

This is an n8n community node that extracts text from password-protected PDFs using the qpdf and pdftotext command-line tools.

This node was created to solve the known crashing issue with the built-in "Extract from File" PDF node.

n8n is a fair-code licensed workflow automation platform.

Features

✅ Extract text from password-protected PDFs
✅ Decrypt PDFs and return as binary for further processing
✅ No crashes - uses battle-tested command-line tools instead of buggy JavaScript libraries
✅ Layout preservation - maintains original text positioning
✅ Page range selection - extract specific pages only
✅ Multiple encodings - UTF-8, Latin1, ASCII7

Prerequisites

Before using this node, you must install the required tools in your n8n container:

docker exec -u root n8n apk add --no-cache qpdf poppler-utils

For persistent installation, add this to your Docker Compose file:

services:
  n8n:
    image: n8nio/n8n:latest
    # ... other config
    entrypoint: /bin/sh
    command:
      - -c
      - |
        apk add --no-cache qpdf poppler-utils
        exec tini -- /docker-entrypoint.sh

Installation

Via n8n UI (Recommended)

1. Go to Settings → Community Nodes
2. Click Install
3. Enter: n8n-nodes-pdf-extractor
4. Click Install

Via npm

cd ~/.n8n/nodes
npm install n8n-nodes-pdf-extractor

Operations

Extract Text

Extracts text content from a PDF file.

Parameters:

Binary Property: Name of the binary property containing the PDF (default: data)
Password: Password to decrypt the PDF (leave empty if not encrypted)

Options:

Layout Mode: Maintain original text layout (default: true)
Page Range: Extract specific pages (e.g., "1-5" or "1,3,5")
Output Property: JSON property name for extracted text (default: text)
Encoding: Text encoding (UTF-8, Latin1, ASCII7)
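
As a rough illustration of how these options could translate to the underlying tool, the sketch below maps them onto pdftotext flags (-layout, -f, -l, -enc). The helper name extract_text and its argument order are assumptions made for this example; the node's actual invocation is not documented here. It assumes the PDF is already decrypted and handles only a contiguous range such as "1-5".

```shell
# Sketch only: extract_text is a hypothetical helper, not the node's own code.
# usage: extract_text <pdf> [page-range] [layout:true|false] [encoding]
extract_text() {
  input=$1
  range=${2:-}
  layout=${3:-true}
  encoding=${4:-UTF-8}

  # Build the argument list step by step.
  set -- pdftotext
  if [ "$layout" = "true" ]; then
    set -- "$@" -layout            # keep the original text positioning
  fi
  if [ -n "$range" ]; then
    # "1-5" gives first=1, last=5; a single page "3" gives first=3, last=3
    set -- "$@" -f "${range%-*}" -l "${range#*-}"
  fi
  set -- "$@" -enc "$encoding" "$input" -   # "-" writes the text to stdout
  "$@"
}

# Example: extract_text statement.pdf 1-5 true UTF-8 > statement.txt
```

A non-contiguous list such as "1,3,5" does not fit a single -f/-l pair, so it would need one call per page or a page selection step beforehand.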

Decrypt Only

Decrypts a password-protected PDF and returns it as a binary file for further processing.
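
For reference, decrypting a PDF by hand with qpdf looks roughly like the sketch below. The helper name decrypt_pdf and the file names are placeholders for this example; whether the node passes exactly these arguments is an assumption.

```shell
# Sketch only: decrypt_pdf is a hypothetical helper, not the node's own code.
# usage: decrypt_pdf <encrypted.pdf> <decrypted.pdf> <password>
decrypt_pdf() {
  # --decrypt writes an unencrypted copy; --password supplies the password
  qpdf --password="$3" --decrypt "$1" "$2"
}

# Example: decrypt_pdf statement.pdf statement-decrypted.pdf 'my-password'
```

Note that a password passed on the command line is visible in the process list while qpdf runs.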

Example Usage

Extract text from bank statement

[Gmail Trigger] → [PDF Extractor] → [AI/LLM] → [Google Sheets]

1. Gmail Trigger receives email with PDF attachment
2. PDF Extractor extracts text with password
3. AI extracts structured data
4. Save to Google Sheets

Why This Node?

The built-in n8n "Extract from File" node uses the pdf-parse JavaScript library, which:

❌ Crashes n8n container with certain PDF encryption types
❌ Causes "SIGILL" errors on Alpine Linux
❌ Has memory issues with large PDFs

This node uses:

✅ qpdf - Industry-standard PDF manipulation tool
✅ pdftotext (poppler-utils) - Robust text extraction from PDFs

Troubleshooting

"Required tools not found"

Install the required tools:

docker exec -u root n8n apk add --no-cache qpdf poppler-utils
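
To see which tool is missing, a check along these lines can be run in a shell inside the container. The helper name missing_tools is made up for this example; command -v is the standard POSIX way to test whether a command is on the PATH.

```shell
# Sketch only: missing_tools is a hypothetical helper.
# Prints the name of every listed tool that is not on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example: missing_tools qpdf pdftotext
# No output means both tools were found.
```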

"Invalid password for PDF file"

Check that the password is correct. A PDF can have separate user and owner passwords, so if one is rejected, try the other.
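
A password can also be tested outside n8n with qpdf's --requires-password check. The sketch below assumes the exit codes described in the qpdf manual (0: a different password is required, 2: file is not encrypted, 3: encrypted and the supplied password works) and a qpdf version recent enough to include the option; password_status is a hypothetical helper name.

```shell
# Sketch only: password_status is a hypothetical helper.
# usage: password_status <pdf> <password>
password_status() {
  qpdf --password="$2" --requires-password "$1"
  # Exit codes as described in the qpdf manual (an assumption to verify
  # against your installed version).
  case $? in
    0) echo "wrong or missing password" ;;
    2) echo "not encrypted" ;;
    3) echo "password accepted" ;;
    *) echo "unexpected qpdf exit status" ;;
  esac
}

# Example: password_status statement.pdf 'my-password'
```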

Empty text output

The PDF might be scanned or image-based. This node extracts only the embedded text layer, so scanned PDFs need to go through an OCR tool first.
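
One way to confirm this is to check whether pdftotext returns anything other than whitespace for an unencrypted copy of the file. This is a minimal sketch; has_text_layer is a hypothetical helper name, and the check is a heuristic rather than a definitive test.

```shell
# Sketch only: has_text_layer is a hypothetical helper.
# Succeeds if pdftotext yields at least one non-whitespace character.
# usage: has_text_layer <pdf>
has_text_layer() {
  pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Example:
# if has_text_layer statement.pdf; then
#   echo "text layer found"
# else
#   echo "no text layer, OCR is probably needed"
# fi
```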

License

MIT