n8ntools-document-processor

N8N Tools - Document Processor: Process and analyze documents with OCR, text extraction, and format conversion

Package Information

Released: 9/16/2025
Downloads: 34 weekly / 202 monthly
Latest Version: 4.4.2
Author: N8N Tools

Documentation

N8N Tools - Document Processor

License: MIT

This N8N community node processes and analyzes documents with OCR, text extraction, and format conversion. Processing runs through the N8N Tools platform, so an N8N Tools account and API key are required.

✨ Features

  • 📄 Text Extraction: Extract text from various document formats
  • 🔍 OCR Processing: Extract text from images and scanned documents
  • 🔄 Format Conversion: Convert between PDF, DOCX, TXT, HTML, MD, RTF
  • 📊 Metadata Extraction: Get document properties and information
  • ✂️ Page Splitting: Split documents into individual pages
  • 🔗 Document Merging: Combine multiple documents
  • 🌍 Multi-language OCR: Support for Portuguese, English, Spanish, French, German
  • 💰 Cost Tracking: Usage monitoring and budget controls

🚀 Quick Start

Installation

Install this node in your N8N instance:

Via Community Nodes (Recommended)

  1. Go to Settings > Community Nodes in your N8N interface
  2. Click Install a community node
  3. Enter n8n-nodes-n8ntools-document-processor
  4. Click Install

Via npm

npm install n8n-nodes-n8ntools-document-processor

Setup Credentials

  1. Sign up at N8N Tools and get your API key
  2. In N8N, create a new N8N Tools API credential
  3. Enter your API URL: https://api.n8ntools.io
  4. Enter your API key

📖 Usage

Supported Operations

| Operation        | Description                 | Input               | Output                        |
|------------------|-----------------------------|---------------------|-------------------------------|
| Extract Text     | Extract text content        | PDF, DOCX, DOC, RTF | Plain text                    |
| Extract Metadata | Get document properties     | Any document        | JSON metadata                 |
| Convert Format   | Change document format      | Various formats     | PDF, DOCX, TXT, HTML, MD, RTF |
| Split Pages      | Split into individual pages | PDF, DOCX           | ZIP with pages                |
| Merge Documents  | Combine multiple documents  | Multiple files      | Single document               |
| OCR Processing   | Extract text from images    | PDF, images         | Text with OCR                 |

Example Workflow

[File Trigger] → [N8N Tools Document Processor] → [Extract Data] → [Database/Email]

Configuration Example

Invoice Text Extraction:

{
  "operation": "extractText",
  "inputSource": "binaryData",
  "binaryPropertyName": "data",
  "advancedOptions": {
    "extractImages": true,
    "extractTables": true,
    "preserveFormatting": true
  }
}
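
If you build this configuration in a script rather than in the node UI, the fields above can be captured in a type. The following is a minimal sketch based only on the field names shown in this README. The `AdvancedOptions` and `ExtractTextParams` types and the `buildExtractTextParams` helper are hypothetical and are not exported by the package.

```typescript
// Hypothetical types mirroring the field names used in the example above.
interface AdvancedOptions {
  extractImages?: boolean;
  extractTables?: boolean;
  preserveFormatting?: boolean;
  password?: string;
}

interface ExtractTextParams {
  operation: "extractText";
  inputSource: "binaryData";
  binaryPropertyName: string;
  advancedOptions: AdvancedOptions;
}

// Builds an extractText configuration; "data" is the documented default
// binary property name.
function buildExtractTextParams(
  binaryPropertyName = "data",
  options: AdvancedOptions = {},
): ExtractTextParams {
  return {
    operation: "extractText",
    inputSource: "binaryData",
    binaryPropertyName,
    advancedOptions: { ...options },
  };
}
```

Calling `buildExtractTextParams(undefined, { extractTables: true })` yields a configuration that reads from the `data` binary property with table extraction enabled.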

⚙️ Node Parameters

Input Configuration

  • Input Source: Binary Data, File URL, or Base64
  • Binary Property: Name of binary property (default: "data")
  • File URL: Direct URL to document file
  • Base64 Data: Base64 encoded document content

Operation-Specific Options

Format Conversion

  • Target Format: PDF, DOCX, TXT, HTML, MD, RTF

Page Splitting

  • Page Range: Specific pages (e.g., "1-5") or "all"
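
To validate a page range before sending it to the node, a small parser can help. This sketch assumes the two forms named above ("1-5" and "all") plus a single page number such as "3", which is an assumption and not documented here. `parsePageRange` is a hypothetical helper, not part of the package.

```typescript
// Parses "all", "N", or "N-M" into a list of 1-based page numbers.
// `totalPages` is needed to expand "all" and to bound the range.
function parsePageRange(range: string, totalPages: number): number[] {
  const trimmed = range.trim().toLowerCase();
  const all = Array.from({ length: totalPages }, (_, i) => i + 1);
  if (trimmed === "all") return all;

  const match = /^(\d+)(?:-(\d+))?$/.exec(trimmed);
  if (!match) throw new Error(`Invalid page range: "${range}"`);

  const start = Number(match[1]);
  const end = match[2] !== undefined ? Number(match[2]) : start;
  if (start < 1 || end < start || end > totalPages) {
    throw new Error(`Page range "${range}" is outside 1-${totalPages}`);
  }
  return all.slice(start - 1, end);
}
```

Rejecting reversed or out-of-bounds ranges locally avoids spending a credit on a request that cannot succeed.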

OCR Processing

  • Language: Portuguese, English, Spanish, French, German, Auto-detect

Advanced Options

  • Extract Images: Include images from document
  • Extract Tables: Parse table data
  • Preserve Formatting: Maintain original formatting
  • Password: For password-protected documents

📤 Output Data

Text Extraction Result

{
  "text": "This is the extracted text content...",
  "wordCount": 1250,
  "pageCount": 3,
  "hasImages": true,
  "hasTables": true,
  "images": [
    {
      "page": 1,
      "base64": "iVBORw0KGgoAAAANSUhEUgAA...",
      "format": "png"
    }
  ],
  "tables": [
    {
      "page": 2,
      "rows": 5,
      "columns": 3,
      "data": [["Header1", "Header2", "Header3"], ...]
    }
  ],
  "success": true,
  "operation": "extractText",
  "creditsUsed": 2,
  "originalFilename": "invoice.pdf"
}
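
Downstream nodes usually want table rows as keyed objects rather than nested arrays. The sketch below converts one entry of `tables` into records, assuming the first row of `data` holds the column headers (the sample above suggests this, but it is not guaranteed). `ExtractedTable` and `tableToRecords` are hypothetical names.

```typescript
// Shape of one entry in the "tables" array shown above.
interface ExtractedTable {
  page: number;
  rows: number;
  columns: number;
  data: string[][];
}

// Treats the first row as headers and returns one object per remaining row.
// Missing cells are filled with an empty string.
function tableToRecords(table: ExtractedTable): Record<string, string>[] {
  const [headers, ...rows] = table.data;
  if (!headers) return [];
  return rows.map((row) => {
    const record: Record<string, string> = {};
    headers.forEach((header, i) => {
      record[header] = row[i] ?? "";
    });
    return record;
  });
}
```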

Format Conversion Result

Returns the converted document as binary data with metadata:

{
  "success": true,
  "operation": "convertFormat",
  "originalFilename": "document.pdf",
  "convertedFilename": "document.docx",
  "targetFormat": "docx",
  "creditsUsed": 1
}

Metadata Extraction Result

{
  "filename": "report.pdf",
  "fileSize": 2048000,
  "mimeType": "application/pdf",
  "pageCount": 15,
  "author": "John Doe",
  "title": "Annual Report 2024",
  "subject": "Company Performance",
  "keywords": ["business", "report", "annual"],
  "creationDate": "2024-01-15T10:30:00Z",
  "modificationDate": "2024-01-16T14:20:00Z",
  "hasPassword": false,
  "isEncrypted": false,
  "success": true
}

🔧 Supported File Formats

Input Formats

  • PDF: PDF documents (including password-protected)
  • Microsoft Word: DOCX, DOC
  • Text: TXT, RTF
  • Web: HTML, XML
  • Images: PNG, JPG, TIFF (for OCR)

Output Formats

  • PDF: Portable Document Format
  • DOCX: Microsoft Word (newer format)
  • TXT: Plain text
  • HTML: HyperText Markup Language
  • MD: Markdown
  • RTF: Rich Text Format

🔍 OCR Capabilities

Supported Languages

  • Portuguese (por): Optimized for Brazilian Portuguese
  • English (eng): US and UK English
  • Spanish (spa): Latin American and Iberian Spanish
  • French (fra): French language support
  • German (deu): German language support
  • Auto-detect (auto): Automatic language detection

OCR Example

{
  "operation": "ocrProcessing",
  "inputSource": "fileUrl",
  "fileUrl": "https://example.com/scanned-invoice.pdf",
  "ocrLanguage": "por",
  "advancedOptions": {
    "extractTables": true,
    "preserveFormatting": true
  }
}
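
Because the OCR language is passed as a short code, it can be worth checking the code before running the workflow. This sketch uses the six codes listed under Supported Languages. `OCR_LANGUAGES`, `isOcrLanguage`, and `buildOcrParams` are hypothetical helpers, not part of the package.

```typescript
// Language codes copied from the Supported Languages list.
const OCR_LANGUAGES = {
  por: "Portuguese",
  eng: "English",
  spa: "Spanish",
  fra: "French",
  deu: "German",
  auto: "Auto-detect",
} as const;

type OcrLanguage = keyof typeof OCR_LANGUAGES;

// Type guard: true only for one of the listed codes.
function isOcrLanguage(code: string): code is OcrLanguage {
  return Object.prototype.hasOwnProperty.call(OCR_LANGUAGES, code);
}

// Builds an OCR configuration for a file URL, rejecting unknown codes.
function buildOcrParams(fileUrl: string, language: string) {
  if (!isOcrLanguage(language)) {
    throw new Error(`Unsupported OCR language code: ${language}`);
  }
  return {
    operation: "ocrProcessing",
    inputSource: "fileUrl",
    fileUrl,
    ocrLanguage: language,
  };
}
```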

🛠️ Advanced Use Cases

Invoice Processing Pipeline

[Email Trigger] → [Download Attachment] → [Extract Text] → [Parse Data] → [Update CRM]

Document Classification

[File Upload] → [Extract Metadata] → [Classify Type] → [Route to Process]
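
The Classify Type step is not specified in this README. One simple option is a rule-based classifier over the metadata fields shown in the Metadata Extraction Result. The route names and keyword rules below are purely illustrative assumptions.

```typescript
// Subset of the metadata fields documented in this README.
interface DocumentMetadata {
  title?: string;
  keywords?: string[];
  isEncrypted: boolean;
}

// Hypothetical routing targets for the [Route to Process] step.
type Route = "needs-password" | "invoice" | "report" | "manual-review";

// Encrypted files are routed first; otherwise the title and keywords
// are searched for simple marker words.
function classifyDocument(meta: DocumentMetadata): Route {
  if (meta.isEncrypted) return "needs-password";
  const haystack = [meta.title ?? "", ...(meta.keywords ?? [])]
    .join(" ")
    .toLowerCase();
  if (haystack.includes("invoice")) return "invoice";
  if (haystack.includes("report")) return "report";
  return "manual-review";
}
```

In an N8N workflow, the returned route could drive a Switch node that sends each document to the matching branch.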

Bulk Document Conversion

[File Monitor] → [Document Processor] → [Convert to PDF] → [Archive]

Contract Analysis

[Document Input] → [Extract Text] → [Find Key Terms] → [Generate Summary]

📊 Processing Examples

Extract Contract Details

// Extract specific information from legal documents
{
  "operation": "extractText",
  "advancedOptions": {
    "extractTables": true,
    "preserveFormatting": true
  }
}
// Then use regex or NLP to find specific clauses
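
As one way to do the regex step, the sketch below splits the extracted `text` into sentences and keeps those that mention any of the given terms. `findClauses` is a hypothetical helper. The sentence splitting is deliberately naive and will break on abbreviations such as "Inc.".

```typescript
// Returns every sentence in `text` that contains at least one term
// (case-insensitive substring match).
function findClauses(text: string, terms: string[]): string[] {
  // A sentence is a run of non-terminator characters plus its punctuation.
  const sentences = text.match(/[^.!?]+[.!?]*/g) ?? [];
  const lowered = terms.map((t) => t.toLowerCase());
  return sentences
    .map((s) => s.trim())
    .filter((s) => lowered.some((t) => s.toLowerCase().includes(t)));
}
```

For example, searching for "terminate" in a contract returns only the sentences that describe termination, which can then be passed to a summarization or review step.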

Convert Legacy Documents

// Convert old DOC files to modern formats
{
  "operation": "convertFormat",
  "targetFormat": "docx"
}

Process Scanned Forms

// OCR processing for form data extraction
{
  "operation": "ocrProcessing",
  "ocrLanguage": "eng",
  "advancedOptions": {
    "extractTables": true // For form fields
  }
}

💸 Pricing & Limits

  • Text Extraction: 1 credit per document
  • Format Conversion: 1 credit per conversion
  • OCR Processing: 2 credits per document
  • Page Splitting: 1 credit per document
  • Document Merging: 1 credit per operation
  • File Size Limit: 100MB per document
  • Page Limit: 500 pages per document
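
The prices and limits above are enough to estimate a batch's cost and to pre-check files before upload. This is a minimal sketch: the numbers are copied from the list, the helper names are hypothetical, and treating 100MB as 100 × 1024 × 1024 bytes is an assumption.

```typescript
// Credits per operation, keyed by the names used in the list above.
const CREDITS_PER_OPERATION: Record<string, number> = {
  "Text Extraction": 1,
  "Format Conversion": 1,
  "OCR Processing": 2,
  "Page Splitting": 1,
  "Document Merging": 1,
};

const MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024; // assumes 1 MB = 1024 * 1024 bytes
const MAX_PAGES = 500;

// Estimated credits for running one operation `count` times.
function estimateCredits(operation: string, count: number): number {
  const cost = CREDITS_PER_OPERATION[operation];
  if (cost === undefined) {
    throw new Error(`No listed price for "${operation}"`);
  }
  return cost * count;
}

// True when a document fits within both documented limits.
function withinLimits(fileSizeBytes: number, pageCount: number): boolean {
  return fileSizeBytes <= MAX_FILE_SIZE_BYTES && pageCount <= MAX_PAGES;
}
```

For instance, running OCR on 40 scanned invoices would be estimated at 40 × 2 = 80 credits.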

🚨 Error Handling

Common errors and solutions:

// Password-protected document
{
  "error": "Document is password protected",
  "success": false,
  "suggestion": "Provide password in advancedOptions"
}

// Unsupported format
{
  "error": "Unsupported file format: .xyz",
  "success": false,
  "suggestion": "Check supported input formats"
}

// OCR language not detected
{
  "error": "Could not detect document language",
  "success": false,
  "suggestion": "Specify OCR language manually"
}
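
All three error payloads share the `success`, `error`, and `suggestion` fields, so a downstream step can branch on `success` and surface the suggestion. The sketch below assumes that shape holds for every failure. The type names and `describeFailure` are hypothetical.

```typescript
// Failure shape taken from the examples above.
interface ProcessorError {
  success: false;
  error: string;
  suggestion?: string;
}

// Successful results carry operation-specific fields, omitted here.
interface ProcessorSuccess {
  success: true;
}

type ProcessorResult = ProcessorSuccess | ProcessorError;

// Returns a readable message for failures, or null when the call succeeded.
function describeFailure(result: ProcessorResult): string | null {
  if (result.success) return null;
  return result.suggestion
    ? `${result.error} (${result.suggestion})`
    : result.error;
}
```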

Password-Protected Documents

To process a password-protected document, pass its password in advancedOptions:

{
  "advancedOptions": {
    "password": "your-document-password"
  }
}

🔄 Integration Examples

With PDF Generator

[Data] → [Generate PDF] → [Extract Text] → [Validate Content]

With Web Scraper

[Scrape URLs] → [Download PDFs] → [Process Documents] → [Store Data]

With Email

[Email Attachment] → [Process Document] → [Extract Key Info] → [Reply with Summary]

📋 Requirements

  • N8N version 0.174.0 or higher
  • N8N Tools account and API key
  • Node.js 18+ (for development)

📄 License

MIT License - see LICENSE file for details.


Part of the N8N Tools ecosystem
