n8ntools-document-processor

N8N Tools - Document Processor: Process and analyze documents with OCR, text extraction, and format conversion

Package Information

Downloads: 34 weekly / 202 monthly

Latest Version: 4.4.2

Author: N8N Tools

Available Nodes

N8N Tools - Document Processor

Process documents with OCR, text extraction, and AI analysis using the N8N Tools platform

Documentation

N8N Tools - Document Processor

Process and analyze documents with OCR, text extraction, and format conversion. This N8N community node runs each operation through the N8N Tools platform, so it needs an N8N Tools account and API key.

✨ Features

📄 Text Extraction: Extract text from various document formats
🔍 OCR Processing: Extract text from images and scanned documents
🔄 Format Conversion: Convert between PDF, DOCX, TXT, HTML, MD, RTF
📊 Metadata Extraction: Get document properties and information
✂️ Page Splitting: Split documents into individual pages
🔗 Document Merging: Combine multiple documents
🌍 Multi-language OCR: Support for Portuguese, English, Spanish, French, German
💰 Cost Tracking: Usage monitoring and budget controls

🚀 Quick Start

Installation

Install this node in your N8N instance:

Via Community Nodes (Recommended)

Go to Settings > Community Nodes in your N8N interface
Click Install a community node
Enter n8n-nodes-n8ntools-document-processor
Click Install

Via npm

npm install n8n-nodes-n8ntools-document-processor

Setup Credentials

Sign up at N8N Tools and get your API key
In N8N, create a new N8N Tools API credential
Enter your API URL: https://api.n8ntools.io
Enter your API key

📖 Usage

Supported Operations

Operation	Description	Input	Output
Extract Text	Extract text content	PDF, DOCX, DOC, RTF	Plain text
Extract Metadata	Get document properties	Any document	JSON metadata
Convert Format	Change document format	Various formats	PDF, DOCX, TXT, HTML, MD, RTF
Split Pages	Split into individual pages	PDF, DOCX	ZIP with pages
Merge Documents	Combine multiple documents	Multiple files	Single document
OCR Processing	Extract text from images	PDF, images	Text with OCR

Example Workflow

[File Trigger] → [N8N Tools Document Processor] → [Extract Data] → [Database/Email]

Configuration Example

Invoice Text Extraction:

{
  "operation": "extractText",
  "inputSource": "binaryData",
  "binaryPropertyName": "data",
  "advancedOptions": {
    "extractImages": true,
    "extractTables": true,
    "preserveFormatting": true
  }
}

⚙️ Node Parameters

Input Configuration

Input Source: Binary Data, File URL, or Base64
Binary Property: Name of binary property (default: "data")
File URL: Direct URL to document file
Base64 Data: Base64 encoded document content
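
If you prepare the Base64 input outside N8N, the encoding step can be sketched as below. This is a minimal illustration, not the node's own code: the helper names are made up for the example, and the "base64" source value and "base64Data" field name are assumptions, so check the node's parameter panel for the real ones.

```python
import base64


def encode_document(raw: bytes) -> str:
    """Return the Base64 text form of a document's raw bytes."""
    return base64.b64encode(raw).decode("ascii")


def build_base64_input(raw: bytes) -> dict:
    """Build the input part of the node parameters for Base64 content."""
    # ASSUMPTION: "base64" and "base64Data" are placeholder names; the
    # README only documents the "binaryData" and "fileUrl" spellings.
    return {
        "inputSource": "base64",
        "base64Data": encode_document(raw),
    }


params = build_base64_input(b"%PDF-1.7 sample bytes")
print(params["base64Data"])
```

Base64 grows the payload by roughly a third, so Binary Data or File URL is usually the lighter choice for large files.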

Operation-Specific Options

Format Conversion

Target Format: PDF, DOCX, TXT, HTML, MD, RTF

Page Splitting

Page Range: Specific pages (e.g., "1-5") or "all"
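
To check a Page Range value before sending a document, a small parser like the one below may help. It is a sketch under assumptions: ranges are treated as 1-based and inclusive, and a single number such as "4" is accepted as a one-page range. The node may interpret the string differently.

```python
def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Expand a page-range string into a list of 1-based page numbers."""
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(1, page_count + 1))

    # "1-5" splits into ("1", "-", "5"); "4" splits into ("4", "", "").
    start_text, separator, end_text = spec.partition("-")
    start = int(start_text)
    end = int(end_text) if separator else start

    if not 1 <= start <= end <= page_count:
        raise ValueError(f"page range {spec!r} is outside 1-{page_count}")
    return list(range(start, end + 1))


print(parse_page_range("1-5", page_count=10))
```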

OCR Processing

Language: Portuguese, English, Spanish, French, German, Auto-detect

Advanced Options

Extract Images: Include images from document
Extract Tables: Parse table data
Preserve Formatting: Maintain original formatting
Password: For password-protected documents

📤 Output Data

Text Extraction Result

{
  "text": "This is the extracted text content...",
  "wordCount": 1250,
  "pageCount": 3,
  "hasImages": true,
  "hasTables": true,
  "images": [
    {
      "page": 1,
      "base64": "iVBORw0KGgoAAAANSUhEUgAA...",
      "format": "png"
    }
  ],
  "tables": [
    {
      "page": 2,
      "rows": 5,
      "columns": 3,
      "data": [["Header1", "Header2", "Header3"], ...]
    }
  ],
  "success": true,
  "operation": "extractText",
  "creditsUsed": 2,
  "originalFilename": "invoice.pdf"
}
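
The entries in "tables" are easier to work with as row dictionaries. The sketch below assumes the first row of "data" holds the column headers, as the sample above suggests, and that cells arrive as strings. The sample values are invented for the example.

```python
def table_to_records(table: dict) -> list[dict]:
    """Turn one entry of the "tables" array into a list of row dictionaries."""
    rows = table.get("data", [])
    if not rows:
        return []
    # ASSUMPTION: the first row is the header row.
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


sample_table = {
    "page": 2,
    "rows": 3,
    "columns": 3,
    "data": [
        ["Item", "Qty", "Price"],
        ["Widget", "2", "9.50"],
        ["Gadget", "1", "24.00"],
    ],
}

records = table_to_records(sample_table)
print(records)
```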

Format Conversion Result

The node returns the converted document as binary data, together with this JSON metadata:

{
  "success": true,
  "operation": "convertFormat",
  "originalFilename": "document.pdf",
  "convertedFilename": "document.docx",
  "targetFormat": "docx",
  "creditsUsed": 1
}

Metadata Extraction Result

{
  "filename": "report.pdf",
  "fileSize": 2048000,
  "mimeType": "application/pdf",
  "pageCount": 15,
  "author": "John Doe",
  "title": "Annual Report 2024",
  "subject": "Company Performance",
  "keywords": ["business", "report", "annual"],
  "creationDate": "2024-01-15T10:30:00Z",
  "modificationDate": "2024-01-16T14:20:00Z",
  "hasPassword": false,
  "isEncrypted": false,
  "success": true
}
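
A downstream step can reduce the metadata to the few values a routing decision needs. The sketch below uses only field names from the sample above. The summary keys are made up for the example, and the size is reported in decimal megabytes.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as "2024-01-15T10:30:00Z"."""
    # Python versions before 3.11 reject the trailing "Z", so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_metadata(meta: dict) -> dict:
    """Reduce a metadata result to a few routing-friendly values."""
    created = parse_timestamp(meta["creationDate"])
    modified = parse_timestamp(meta["modificationDate"])
    return {
        "size_mb": round(meta["fileSize"] / 1_000_000, 2),
        "edited_after_creation": modified > created,
        "needs_password": meta["hasPassword"] or meta["isEncrypted"],
    }


sample_metadata = {
    "fileSize": 2048000,
    "creationDate": "2024-01-15T10:30:00Z",
    "modificationDate": "2024-01-16T14:20:00Z",
    "hasPassword": False,
    "isEncrypted": False,
}

summary = summarize_metadata(sample_metadata)
print(summary)
```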

🔧 Supported File Formats

Input Formats

PDF: PDF documents (including password-protected)
Microsoft Word: DOCX, DOC
Text: TXT, RTF
Web: HTML, XML
Images: PNG, JPG, TIFF (for OCR)
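
Checking the file extension before calling the node avoids spending a request on a file that would be rejected. The sketch below lists only the extensions named above. Whether variants such as .jpeg or .tif are also accepted is not stated here, so they are left out.

```python
from pathlib import PurePath

# Extensions taken from the Input Formats list; nothing else is assumed.
SUPPORTED_INPUT_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".txt", ".rtf",
    ".html", ".xml", ".png", ".jpg", ".tiff",
}


def is_supported_input(filename: str) -> bool:
    """Return True when the file extension appears in the input list."""
    return PurePath(filename).suffix.lower() in SUPPORTED_INPUT_EXTENSIONS


for name in ("invoice.PDF", "scan.tiff", "archive.xyz"):
    print(name, is_supported_input(name))
```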

Output Formats

PDF: Portable Document Format
DOCX: Microsoft Word (Office Open XML format)
TXT: Plain text
HTML: HyperText Markup Language
MD: Markdown
RTF: Rich Text Format

🔍 OCR Capabilities

Supported Languages

Portuguese (por): Optimized for Brazilian Portuguese
English (eng): US and UK English
Spanish (spa): Latin American and Iberian Spanish
French (fra): French language support
German (deu): German language support
Auto-detect (auto): Automatic language detection

OCR Example

{
  "operation": "ocrProcessing",
  "inputSource": "fileUrl",
  "fileUrl": "https://example.com/scanned-invoice.pdf",
  "ocrLanguage": "por",
  "advancedOptions": {
    "extractTables": true,
    "preserveFormatting": true
  }
}
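
When OCR parameters are generated by a script, validating the language code first catches typos early. The sketch below reuses the field names from the example above and the codes from the Supported Languages list. The helper itself is hypothetical.

```python
# Codes and names copied from the Supported Languages list.
OCR_LANGUAGES = {
    "por": "Portuguese",
    "eng": "English",
    "spa": "Spanish",
    "fra": "French",
    "deu": "German",
    "auto": "Auto-detect",
}


def build_ocr_params(file_url: str, language: str = "auto") -> dict:
    """Build OCR parameters, rejecting unknown language codes."""
    if language not in OCR_LANGUAGES:
        known = ", ".join(sorted(OCR_LANGUAGES))
        raise ValueError(f"unknown OCR language {language!r}; use one of: {known}")
    return {
        "operation": "ocrProcessing",
        "inputSource": "fileUrl",
        "fileUrl": file_url,
        "ocrLanguage": language,
    }


ocr_params = build_ocr_params("https://example.com/scanned-invoice.pdf", "por")
print(ocr_params)
```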

🛠️ Advanced Use Cases

Invoice Processing Pipeline

[Email Trigger] → [Download Attachment] → [Extract Text] → [Parse Data] → [Update CRM]

Document Classification

[File Upload] → [Extract Metadata] → [Classify Type] → [Route to Process]

Bulk Document Conversion

[File Monitor] → [Document Processor] → [Convert to PDF] → [Archive]

Contract Analysis

[Document Input] → [Extract Text] → [Find Key Terms] → [Generate Summary]

📊 Processing Examples

Extract Contract Details

// Extract specific information from legal documents
{
  "operation": "extractText",
  "advancedOptions": {
    "extractTables": true,
    "preserveFormatting": true
  }
}
// Then use regex or NLP to find specific clauses
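
The regex route mentioned above could look like the sketch below, which collects the sentences that mention each search term. The sentence splitting is deliberately naive, and the contract text and search terms are invented for the example. Real contracts usually need a more careful approach.

```python
import re


def find_clauses(text: str, terms: list[str]) -> dict[str, list[str]]:
    """Map each search term to the sentences that mention it."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = {}
    for term in terms:
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        hits[term] = [s for s in sentences if pattern.search(s)]
    return hits


extracted_text = (
    "This Agreement starts on 1 March 2024. "
    "Either party may request termination with 30 days notice. "
    "The governing law is that of Portugal."
)

clauses = find_clauses(extracted_text, ["termination", "governing law"])
print(clauses)
```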

Convert Legacy Documents

// Convert old DOC files to modern formats
{
  "operation": "convertFormat",
  "targetFormat": "docx"
}

Process Scanned Forms

// OCR processing for form data extraction
{
  "operation": "ocrProcessing",
  "ocrLanguage": "eng",
  "advancedOptions": {
    "extractTables": true // For form fields
  }
}

💸 Pricing & Limits

Text Extraction: 1 credit per document
Format Conversion: 1 credit per conversion
OCR Processing: 2 credits per document
Page Splitting: 1 credit per document
Document Merging: 1 credit per operation
File Size Limit: 100 MB per document
Page Limit: 500 pages per document
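
For budgeting, the prices above can be turned into a rough estimate. This is a sketch under assumptions: "splitPages" and "mergeDocuments" are guessed identifiers (only "extractText", "convertFormat", and "ocrProcessing" appear in the examples), and 100 MB is read as 100,000,000 bytes. The sample extraction result above reports "creditsUsed": 2, so the "creditsUsed" value in each result is the better record of actual usage.

```python
# Credits per operation, copied from the price list.
CREDITS_PER_OPERATION = {
    "extractText": 1,
    "convertFormat": 1,
    "ocrProcessing": 2,
    "splitPages": 1,       # ASSUMPTION: identifier not shown in the README
    "mergeDocuments": 1,   # ASSUMPTION: identifier not shown in the README
}

MAX_FILE_BYTES = 100_000_000  # ASSUMPTION: "100 MB" read as decimal megabytes
MAX_PAGES = 500


def estimate_credits(operations: list[str]) -> int:
    """Sum the listed credit price of each planned operation."""
    return sum(CREDITS_PER_OPERATION[op] for op in operations)


def within_limits(file_bytes: int, page_count: int) -> bool:
    """Check a document against the size and page limits."""
    return file_bytes <= MAX_FILE_BYTES and page_count <= MAX_PAGES


batch = ["ocrProcessing"] * 10 + ["convertFormat"] * 10
print(estimate_credits(batch))
print(within_limits(file_bytes=2_048_000, page_count=15))
```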

🚨 Error Handling

Common errors and solutions:

// Password-protected document
{
  "error": "Document is password protected",
  "success": false,
  "suggestion": "Provide password in advancedOptions"
}

// Unsupported format
{
  "error": "Unsupported file format: .xyz",
  "success": false,
  "suggestion": "Check supported input formats"
}

// OCR language not detected
{
  "error": "Could not detect document language",
  "success": false,
  "suggestion": "Specify OCR language manually"
}
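
A workflow can branch on these results by matching the error text. The sketch below is one way to do that. The returned action names are made up for the example, and matching on message text is fragile if the wording changes.

```python
def next_step(result: dict) -> str:
    """Suggest a follow-up action for a node result."""
    if result.get("success"):
        return "continue"

    error = result.get("error", "").lower()
    # Action names are hypothetical labels for workflow branches.
    if "password protected" in error:
        return "retry_with_password"
    if "unsupported file format" in error:
        return "skip_file"
    if "could not detect document language" in error:
        return "retry_with_explicit_language"
    return "report_error"


failed = {
    "error": "Document is password protected",
    "success": False,
    "suggestion": "Provide password in advancedOptions",
}
print(next_step(failed))
```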

Password-Protected Documents

{
  "advancedOptions": {
    "password": "your-document-password"
  }
}

🔄 Integration Examples

With PDF Generator

[Data] → [Generate PDF] → [Extract Text] → [Validate Content]

With Web Scraper

[Scrape URLs] → [Download PDFs] → [Process Documents] → [Store Data]

With Email

[Email Attachment] → [Process Document] → [Extract Key Info] → [Reply with Summary]

🔗 Related Packages

PDF Generator: Create PDFs from processed data
Web Scraper: Scrape documents from websites

📋 Requirements

N8N version 0.174.0 or higher
N8N Tools account and API key
Node.js 18+ (for development)

🆘 Support

📧 Email: support@n8ntools.io
📖 Documentation: docs.n8ntools.io
💬 Community: Discord
🐛 Issues: GitHub

📄 License

MIT License - see LICENSE file for details.
