pdf-extractor

n8n community node to extract text from password-protected PDFs - no external dependencies beyond the qpdf and pdftotext command-line tools

Package Information

Downloads: 41 weekly / 56 monthly
Latest Version: 1.2.0
Author: NAAI Studio

Documentation

n8n-nodes-pdf-extractor

This is an n8n community node that extracts text from password-protected PDFs reliably using qpdf and pdftotext command-line tools.

This node was created to solve the known crashing issue with the built-in "Extract from File" PDF node.

n8n is a fair-code licensed workflow automation platform.

Features

  • Extract text from password-protected PDFs
  • Decrypt PDFs and return as binary for further processing
  • No crashes - uses battle-tested command-line tools instead of buggy JavaScript libraries
  • Layout preservation - maintains original text positioning
  • Page range selection - extract specific pages only
  • Multiple encodings - UTF-8, Latin1, ASCII7

Prerequisites

Before using this node, you must install the required tools in your n8n container:

docker exec -u root n8n apk add --no-cache qpdf poppler-utils
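
To confirm that both tools are on the PATH after installation, a small check like the following can be run in the container's shell. It is a minimal sketch: `missing_tools` is a hypothetical helper name, not something the node provides.

```shell
# Print the name of every listed tool that is NOT found on PATH.
# `missing_tools` is a hypothetical helper, not part of the node.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage inside the container:
#   missing_tools qpdf pdftotext
```

If the call prints nothing, both tools were found; otherwise it lists the ones that still need installing.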

Packages installed with docker exec are lost when the container is recreated. For a persistent installation, add this to your Docker Compose file:

services:
  n8n:
    image: n8nio/n8n:latest
    # ... other config
    entrypoint: /bin/sh
    command:
      - -c
      - |
        apk add --no-cache qpdf poppler-utils
        exec tini -- /docker-entrypoint.sh

Installation

Via n8n UI (Recommended)

  1. Go to Settings → Community Nodes
  2. Click Install
  3. Enter: n8n-nodes-pdf-extractor
  4. Click Install

Via npm

cd ~/.n8n/nodes
npm install n8n-nodes-pdf-extractor

Operations

Extract Text

Extracts text content from a PDF file.

Parameters:

  • Binary Property: Name of the binary property containing the PDF (default: data)
  • Password: Password to decrypt the PDF (leave empty if not encrypted)

Options:

  • Layout Mode: Maintain original text layout (default: true)
  • Page Range: Extract specific pages (e.g., "1-5" or "1,3,5")
  • Output Property: JSON property name for extracted text (default: text)
  • Encoding: Text encoding (UTF-8, Latin1, ASCII7)
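
As an illustration of how these options could translate to a pdftotext command line, here is a minimal sketch. It assumes the PDF has already been decrypted. The helper name `pdf_to_text`, its argument order, and the `RUNNER` dry-run variable are all hypothetical, and the node's actual implementation may differ. pdftotext's `-f` and `-l` flags only express a contiguous range, so a list such as "1,3,5" would need an extra step (for example qpdf's `--pages` option) that is not shown here.

```shell
# Hypothetical helper: run pdftotext on an already-decrypted PDF.
#   $1 = input PDF
#   $2 = layout mode, "true" or "false" (default: true)
#   $3 = encoding: UTF-8, Latin1 or ASCII7 (default: UTF-8)
#   $4 = first page, $5 = last page (both optional)
# Set RUNNER=echo to print the command instead of running it.
pdf_to_text() {
  input=$1 layout=${2:-true} encoding=${3:-UTF-8} first=${4:-} last=${5:-}
  set --                                    # rebuild "$@" as the flag list
  if [ "$layout" = "true" ]; then set -- "$@" -layout; fi
  if [ -n "$first" ]; then set -- "$@" -f "$first"; fi
  if [ -n "$last" ]; then set -- "$@" -l "$last"; fi
  set -- "$@" -enc "$encoding" "$input" -   # "-" sends the text to stdout
  ${RUNNER:-} pdftotext "$@"
}

# Usage: pages 1-5 of statement.pdf, layout preserved, UTF-8 output
#   pdf_to_text statement.pdf true UTF-8 1 5
```

With `RUNNER=echo` set, the same call only prints the command it would run, which is a convenient way to inspect the flags.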

Decrypt Only

Decrypts a password-protected PDF and returns it as a binary file for further processing.
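
Outside n8n, the equivalent decryption step can be sketched with qpdf directly. This is an assumption about what such a step looks like, not the node's confirmed implementation; `decrypt_pdf` and the `RUNNER` dry-run variable are hypothetical names.

```shell
# Hypothetical helper: write a decrypted copy of a PDF using qpdf.
#   $1 = input PDF, $2 = output PDF, $3 = password
# Set RUNNER=echo to print the command instead of running it.
decrypt_pdf() {
  rc=0
  ${RUNNER:-} qpdf --password="$3" --decrypt "$1" "$2" || rc=$?
  # qpdf exits 0 on success, 3 on success with warnings, 2 on errors,
  # so a 3 is treated as success here.
  if [ "$rc" -eq 3 ]; then
    return 0
  fi
  return "$rc"
}

# Usage:
#   decrypt_pdf statement.pdf statement-decrypted.pdf 'secret'
```

Note that a password passed on the command line is visible in the process list while qpdf runs.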

Example Usage

Extract text from bank statement

[Gmail Trigger] → [PDF Extractor] → [AI/LLM] → [Google Sheets]
  1. Gmail Trigger receives an email with a PDF attachment
  2. PDF Extractor decrypts the PDF with the supplied password and extracts the text
  3. AI/LLM node extracts structured data from the text
  4. Google Sheets node saves the result

Why This Node?

The built-in n8n "Extract from File" node uses the pdf-parse JavaScript library, which:

  • ❌ Crashes n8n container with certain PDF encryption types
  • ❌ Causes "SIGILL" errors on Alpine Linux
  • ❌ Has memory issues with large PDFs

This node uses:

  • qpdf - Industry-standard PDF manipulation tool
  • pdftotext (poppler-utils) - Robust text extraction from PDFs

Troubleshooting

"Required tools not found"

Install the required tools:

docker exec -u root n8n apk add --no-cache qpdf poppler-utils

"Invalid password for PDF file"

Check that the password is correct. A PDF can have both a user password (needed to open it) and an owner password (which controls permissions); if one is rejected, try the other.
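
To test a password outside n8n, recent qpdf versions offer a `--requires-password` check that reports the result through its exit code. The sketch below is illustrative only; `describe_password_status` and `check_pdf_password` are hypothetical helper names.

```shell
# Translate the exit code of `qpdf --requires-password` into a message.
# qpdf also uses exit code 2 for general errors, so make sure the file
# exists and is readable before trusting a "not encrypted" result.
describe_password_status() {
  case $1 in
    0) echo "a password other than the one supplied is required" ;;
    2) echo "the file is not encrypted" ;;
    3) echo "the file is encrypted and the supplied password works" ;;
    *) echo "unexpected qpdf exit code: $1" ;;
  esac
}

# Hypothetical helper: $1 = PDF path, $2 = password to test.
check_pdf_password() {
  rc=0
  qpdf --requires-password --password="$2" "$1" || rc=$?
  describe_password_status "$rc"
}

# Usage:
#   check_pdf_password statement.pdf 'secret'
```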

Empty text output

The PDF might be scanned/image-based. This node extracts text layers only. For scanned PDFs, use OCR tools.
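
A quick way to check for a text layer from the command line is to see whether pdftotext produces anything other than whitespace. The following is a minimal sketch; `has_text` is a hypothetical helper name.

```shell
# Succeed when stdin contains at least one non-whitespace character.
# pdftotext separates pages with form feeds, which count as whitespace
# here, so an image-only PDF is reported as having no text.
has_text() {
  grep -q '[^[:space:]]'
}

# Usage on an already-decrypted PDF:
#   pdftotext decrypted.pdf - | has_text || echo "No text layer found; the PDF may need OCR"
```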

License

MIT