pdf-to-csv

n8n community node to convert PDF documents to CSV format with advanced structure detection, smart extraction method selection, and enhanced table preservation for ERP documents

Package Information

Downloads: 409 weekly / 409 monthly

Latest Version: 2.7.0

Author: jkong0221

Available Nodes

PDF to CSV

Convert PDF documents to CSV format using coordinate-based extraction or OCR

Documentation

n8n-nodes-pdf-to-csv

An n8n community node for converting PDF documents to CSV format. This node provides robust PDF parsing capabilities with coordinate-based extraction and OCR support for image-based PDFs.

Features

📄 Convert PDF documents to CSV format
🔗 Support for both binary data and URL inputs
🎯 Three parsing methods: Coordinate-based, OCR, and Auto-detect
📊 Multiple output formats (CSV string, JSON array, binary data, Excel)
🔍 OCR support for scanned/image-based PDFs using Tesseract.js
⚡ Fast coordinate-based extraction for native PDF text
🤖 Auto-detect with intelligent fallback from coordinate to OCR
⚙️ Configurable CSV delimiters and headers
🔧 Built-in error handling and validation

Installation

Community Nodes (Recommended)

Go to Settings > Community Nodes in your n8n instance
Select Install
Enter n8n-nodes-pdf-to-csv
Agree to the risks and select Install

Manual Installation

Clone this repository or download the source code
Install dependencies:
```
pnpm install
```
Build the node:
```
pnpm build
```

Link the node to your n8n installation:

```
pnpm link
cd ~/.n8n/custom
pnpm link n8n-nodes-pdf-to-csv
```

Restart your n8n instance

Docker Installation

If you're using n8n with Docker, you can install this node as follows:

Create a Dockerfile extending the n8n image:

```
FROM n8nio/n8n
USER root
RUN npm install -g n8n-nodes-pdf-to-csv
USER node
```

Build and run your custom image:

```
docker build -t n8n-custom .
docker run -it --rm --name n8n -p 5678:5678 n8n-custom
```

Usage

Basic PDF to CSV Conversion

Add the PDF to CSV node to your workflow
Configure the input type:
- Binary Data: Use when PDF comes from a previous node (e.g., HTTP Request, Google Drive)
- URL: Provide a direct URL to the PDF file
Choose parsing method:
- Auto-Detect: Tries coordinate-based first, falls back to OCR if needed (recommended)
- Coordinate-Based (Fast): Uses coordinate analysis for native PDF text (fastest)
- OCR (Image-Based): Uses Tesseract.js for scanned/image PDFs (slower but works with any PDF)
Configure output format:
- CSV String: Returns formatted CSV text
- JSON Array: Returns structured JSON data
- Binary Data: Returns downloadable CSV file

Input Configuration

Binary Data Input

{
  "inputType": "binaryData",
  "binaryPropertyName": "data"
}

URL Input

{
  "inputType": "url",
  "pdfUrl": "https://example.com/document.pdf"
}
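
As a rough illustration, the two input shapes above can be modeled as a discriminated union and checked before any download or parsing starts. The `PdfInput` type and the `describeSource` function are hypothetical names for this sketch, not part of the node's API.

```typescript
// Hypothetical types mirroring the two input configurations shown above.
type PdfInput =
  | { inputType: "binaryData"; binaryPropertyName: string }
  | { inputType: "url"; pdfUrl: string };

// Validate an input configuration and describe where the PDF will come from.
// Throws on an empty property name or on a URL that is not http(s).
function describeSource(input: PdfInput): string {
  if (input.inputType === "binaryData") {
    if (input.binaryPropertyName.trim() === "") {
      throw new Error("binaryPropertyName must not be empty");
    }
    return `binary property "${input.binaryPropertyName}"`;
  }
  // Narrowed to the URL variant here; only a simple scheme check is performed.
  if (!/^https?:\/\//i.test(input.pdfUrl)) {
    throw new Error(`pdfUrl must start with http:// or https://, got: ${input.pdfUrl}`);
  }
  return `URL ${input.pdfUrl}`;
}
```

Checking the configuration up front means a typo in `pdfUrl` or a missing binary property fails fast, before any parsing work is attempted.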

Parsing Methods

Auto-Detect (Recommended)

Intelligent method that tries coordinate-based extraction first, then falls back to OCR if needed. Best for:

Unknown PDF types
Mixed document collections
When you want the fastest method that works
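
The fallback behaviour described above could look roughly like the following. This is a sketch under assumptions: the two extractors are passed in as plain synchronous functions (real extractors would most likely be asynchronous), and `hasText` is a hypothetical heuristic, since the node's actual fallback criteria are not documented here.

```typescript
type Rows = string[][];
type Extractor = () => Rows;

// Heuristic (an assumption): a result is usable if any cell holds non-whitespace text.
function hasText(rows: Rows): boolean {
  return rows.some((row) => row.some((cell) => cell.trim() !== ""));
}

// Try the fast coordinate extractor first; use OCR if it throws or finds no text.
function autoDetect(
  coordinate: Extractor,
  ocr: Extractor,
): { method: "coordinate" | "ocr"; rows: Rows } {
  try {
    const rows = coordinate();
    if (hasText(rows)) {
      return { method: "coordinate", rows };
    }
  } catch (_err) {
    // Coordinate extraction failed outright; fall through to OCR.
  }
  return { method: "ocr", rows: ocr() };
}
```

Returning the method alongside the rows lets a downstream node log which path was taken, which helps when tuning a workflow for a mixed document collection.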

Coordinate-Based (Fast)

Uses coordinate analysis to detect table structure from native PDF text. Best for:

PDFs created digitally (not scanned)
Documents with clear table structures
When speed is important
Most business reports and invoices
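
To make "coordinate analysis" concrete, here is a minimal sketch of one common approach: group text fragments that share a baseline into rows, then order each row from left to right. The `TextItem` shape, the `groupIntoRows` name, and the `yTolerance` default are assumptions for illustration; the node's own algorithm may differ.

```typescript
// A positioned text fragment, as a PDF text extractor might report it.
// The field names are assumptions for this sketch.
interface TextItem {
  text: string;
  x: number; // left edge of the fragment
  y: number; // baseline of the fragment
}

// Group fragments into rows by baseline, then order each row left to right.
// `yTolerance` absorbs small baseline jitter between cells of the same row.
function groupIntoRows(items: TextItem[], yTolerance = 2): string[][] {
  // PDF user space has its origin at the bottom-left, so larger y is higher on the page.
  const sorted = [...items].sort((a, b) => b.y - a.y);
  const rows: TextItem[][] = [];
  let rowY = Number.NaN; // baseline of the first fragment in the current row
  for (const item of sorted) {
    const current = rows[rows.length - 1];
    if (current !== undefined && Math.abs(item.y - rowY) <= yTolerance) {
      current.push(item);
    } else {
      rows.push([item]);
      rowY = item.y;
    }
  }
  return rows.map((row) => row.sort((a, b) => a.x - b.x).map((item) => item.text));
}

const table = groupIntoRows([
  { text: "Qty", x: 100, y: 700 },
  { text: "Item", x: 10, y: 700.5 },
  { text: "2", x: 100, y: 680 },
  { text: "Widget", x: 10, y: 681 },
]);
// table is [["Item", "Qty"], ["Widget", "2"]]
```

This only recovers rows. Aligning cells into consistent columns across rows (for example, by clustering x positions) is a separate step that the sketch leaves out.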

OCR (Image-Based)

Uses Tesseract.js optical character recognition to extract text from PDF images. Best for:

Scanned documents
Image-based PDFs
PDFs where coordinate extraction fails
Documents with complex layouts or mixed content

Note: OCR is slower than coordinate-based extraction, but it works with any PDF, including those that have no embedded text.

Example Workflow

{
  "nodes": [
    {
      "parameters": {
        "url": "https://example.com/report.pdf",
        "responseFormat": "file",
        "options": {}
      },
      "type": "n8n-nodes-base.httpRequest",
      "typeVersion": 1,
      "position": [250, 300],
      "id": "http-request",
      "name": "Download PDF"
    },
    {
      "parameters": {
        "operation": "convert",
        "inputType": "binaryData",
        "binaryPropertyName": "data",
        "parsingMethod": "auto",
        "outputFormat": "csvString",
        "csvDelimiter": ",",
        "includeHeaders": true,
        "skipEmptyLines": true
      },
      "type": "n8n-nodes-pdf-to-csv.pdfToCsv",
      "typeVersion": 1,
      "position": [450, 300],
      "id": "pdf-to-csv",
      "name": "PDF to CSV"
    }
  ],
  "connections": {
    "Download PDF": {
      "main": [
        [
          { "node": "PDF to CSV", "type": "main", "index": 0 }
        ]
      ]
    }
  }
}

Configuration Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `inputType` | Options | `binaryData` | Source of PDF file (binaryData/url) |
| `binaryPropertyName` | String | `data` | Name of binary property containing PDF |
| `pdfUrl` | String | - | URL of PDF file to convert |
| `parsingMethod` | Options | `auto` | Method for parsing PDF content (auto/coordinate/ocr) |
| `csvDelimiter` | String | `,` | Delimiter for CSV output |
| `includeHeaders` | Boolean | `true` | Treat first row as headers |
| `skipEmptyLines` | Boolean | `true` | Skip empty lines in PDF |
| `outputFormat` | Options | `csvString` | Format of output data |
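
The sketch below shows how `csvDelimiter`, `skipEmptyLines`, and `includeHeaders` might shape the output once rows have been extracted. It is not the node's implementation. The function names are hypothetical, the quoting follows RFC 4180 conventions, and the reading of `includeHeaders` (the first row supplies the keys of the JSON array output) is an assumption based on the description in the table.

```typescript
// Quote a field when it contains the delimiter, a double quote, or a line break.
function escapeField(field: string, delimiter: string): string {
  const needsQuotes = field.includes(delimiter) || /["\r\n]/.test(field);
  return needsQuotes ? `"${field.replace(/"/g, '""')}"` : field;
}

// Serialize rows to a CSV string, optionally dropping rows whose cells are all blank.
function toCsv(
  rows: string[][],
  options: { csvDelimiter: string; skipEmptyLines: boolean },
): string {
  const kept = options.skipEmptyLines
    ? rows.filter((row) => row.some((cell) => cell.trim() !== ""))
    : rows;
  return kept
    .map((row) => row.map((cell) => escapeField(cell, options.csvDelimiter)).join(options.csvDelimiter))
    .join("\n");
}

// Convert rows to an array of records. With includeHeaders, the first row supplies
// the keys; otherwise generic keys (column1, column2, ...) are generated.
// Cells beyond the number of keys are dropped; missing cells become "".
function toJsonArray(rows: string[][], includeHeaders: boolean): Record<string, string>[] {
  const first = rows[0];
  if (first === undefined) return [];
  const keys = includeHeaders ? first : first.map((_cell, i) => `column${i + 1}`);
  const body = includeHeaders ? rows.slice(1) : rows;
  return body.map((row) => {
    const record: Record<string, string> = {};
    keys.forEach((key, i) => {
      record[key] = row[i] ?? "";
    });
    return record;
  });
}
```

For example, `toCsv([["Item", "Price"], ["", ""], ["Widget, large", "9.50"]], { csvDelimiter: ",", skipEmptyLines: true })` drops the blank row and quotes the field that contains a comma.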

Supported File Types

PDF documents (.pdf)
Password-protected PDFs are not currently supported

Error Handling

The node includes comprehensive error handling for:

Invalid PDF files
Network errors when fetching URLs
Parsing failures
Memory limitations for large files

Errors can be handled using n8n's built-in error handling mechanisms.
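
Two of the failure modes above, invalid files and oversized files, can be caught cheaply before parsing begins. The following is a minimal sketch of such a pre-check, assuming the PDF bytes are already in memory; `checkPdfBytes` and its `maxBytes` parameter are hypothetical and not part of the node.

```typescript
// The five ASCII bytes "%PDF-" that open a PDF file header.
const PDF_SIGNATURE = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Return a human-readable problem description, or null when the bytes pass both checks.
// `maxBytes` is a limit chosen by the caller; the node documents no fixed size cap.
function checkPdfBytes(bytes: Uint8Array, maxBytes: number): string | null {
  if (bytes.length > maxBytes) {
    return `file is ${bytes.length} bytes, above the ${maxBytes}-byte limit`;
  }
  // Strict check at offset 0; some PDF readers also tolerate a few leading junk bytes.
  const hasSignature =
    bytes.length >= PDF_SIGNATURE.length &&
    PDF_SIGNATURE.every((byte, i) => bytes[i] === byte);
  return hasSignature ? null : "missing %PDF- signature, probably not a PDF";
}
```

A workflow could run a check like this right after the download step and route failures to an error branch, so that an HTML error page served in place of a PDF is reported clearly and does not surface later as a parsing failure.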

Limitations

Large PDF files may consume significant memory
Complex PDF layouts may not parse perfectly with auto-detection
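Scanned (image-only) PDFs contain no extractable text, so they must be processed with the OCR method, either selected directly or reached through the Auto-Detect fallback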
Password-protected PDFs are not supported

Development

Prerequisites

Node.js 18.10 or higher
pnpm 7.18 or higher

Setup

git clone https://github.com/your-username/n8n-nodes-pdf-to-csv.git
cd n8n-nodes-pdf-to-csv
pnpm install

Build

pnpm build

Development Mode

pnpm dev

Linting

pnpm lint
pnpm lintfix

Testing

pnpm test

Contributing

Fork the repository
Create a feature branch: git checkout -b feature/amazing-feature
Make your changes
Run tests and linting: pnpm test && pnpm lint
Commit your changes: git commit -m 'Add amazing feature'
Push to the branch: git push origin feature/amazing-feature
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

📧 Email: your.email@example.com
🐛 Issues: GitHub Issues
💬 Discussions: GitHub Discussions

Changelog

v1.0.0

Initial release
Basic PDF to CSV conversion
Multiple parsing methods
Flexible output formats
Comprehensive error handling
