Package Information
Downloads: 7,704 weekly / 27,134 monthly
Latest Version: 1.2.2
Author: mazix
Documentation
n8n Document Converter Node
n8n community node for converting documents to JSON/text. Supports 15+ formats with AI-friendly output.
Supported Formats
| Category | Formats | Details |
|---|---|---|
| Documents | DOCX, DOC, ODT, TXT, PDF | Text, HTML, or Markdown output for DOCX |
| Spreadsheets | XLSX, ODS, CSV | Multi-sheet parsing for XLSX/ODS and CSV |
| Presentations | PPTX, PPT, ODP | Text extraction |
| Web & Data | HTML, HTM, XML, JSON | Structure-aware parsing |
| E-commerce | YML (Yandex Market) | Specialized shop/offers/categories parsing |
Features
Core
- Automatic file type detection via magic bytes
- Strategy pattern: each format has its own processing pipeline
- DOCX output: plain text (default), HTML, or Markdown (GFM tables, headings, bold/italic)
- DOCX → Markdown output is well suited to AI/LLM/RAG pipelines
- XLSX multi-sheet processing with Excel-style column names (A, B, C...)
- JSON flattening for nested structures
- YML (Yandex Market) specialized parser
- usableAsTool: true for n8n AI Agent integration
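To illustrate what detection via magic bytes means, here is a minimal sketch that compares the first bytes of a buffer against three well-known signatures. It is not the node's implementation (the package delegates detection to the file-type library); `sniffContainer` and `SIGNATURES` are hypothetical names used only for this example.

```typescript
// Hypothetical helper for illustration only; the node itself uses the file-type library.
type Container = "zip" | "pdf" | "cfb" | "unknown";

const SIGNATURES: Array<{ kind: Exclude<Container, "unknown">; bytes: number[] }> = [
  // DOCX, XLSX, PPTX, ODT, ODS and ODP are all ZIP containers ("PK\x03\x04")
  { kind: "zip", bytes: [0x50, 0x4b, 0x03, 0x04] },
  // PDF files start with "%PDF"
  { kind: "pdf", bytes: [0x25, 0x50, 0x44, 0x46] },
  // Legacy DOC/PPT use the Compound File Binary (CFB) signature
  { kind: "cfb", bytes: [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] },
];

function sniffContainer(data: Uint8Array): Container {
  for (const { kind, bytes } of SIGNATURES) {
    // Match only when the buffer is long enough and every signature byte agrees
    if (data.length >= bytes.length && bytes.every((b, i) => data[i] === b)) {
      return kind;
    }
  }
  return "unknown";
}
```

A ZIP signature alone does not distinguish DOCX from XLSX or ODT; a real detector also has to look inside the archive, which is one reason to rely on a dedicated library.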
Reliability
- Concurrency control via promise pool (Set-based, no race conditions)
- Fallback chains: DOCX uses officeparser -> mammoth; DOC/PPT use a CFB signature check + officeparser
- File name sanitization (path traversal protection)
- Configurable file size limits (up to 100MB)
- Custom error classes with descriptive messages
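The Set-based promise pool mentioned above can be sketched roughly as follows. This is an assumption about the general technique, not a copy of promisePool.ts; `runPool` and its signature are hypothetical.

```typescript
// Hypothetical sketch of a Set-based promise pool; not the package's actual promisePool.ts.
async function runPool<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  const inFlight = new Set<Promise<void>>();

  for (let i = 0; i < items.length; i++) {
    // Wait for a free slot before starting the next task
    if (inFlight.size >= limit) {
      await Promise.race(inFlight);
    }
    const tracked: Promise<void> = worker(items[i], i)
      .then((value) => {
        results[i] = value; // keep results in input order
      })
      .finally(() => {
        inFlight.delete(tracked); // each task removes itself from the Set when it settles
      });
    inFlight.add(tracked);
  }

  await Promise.all(inFlight); // drain the remaining tasks
  return results;
}
```

Because each task deletes its own promise from the Set when it settles, there is no shared index or counter to get out of sync. In this sketch a rejected task rejects the whole pool; a production version might collect per-item errors instead.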
Installation
Via n8n UI (recommended)
Settings -> Community nodes -> Install
Package name: @mazix/n8n-nodes-converter-documents
Via CLI
cd ~/.n8n
npm install @mazix/n8n-nodes-converter-documents
# Restart n8n
Manual
git clone https://github.com/mazixs/n8n-node-converter-documents.git
cd n8n-node-converter-documents
npm install
npm run build
# Copy dist/ and package.json to ~/.n8n/custom-nodes/n8n-node-converter-documents/
Usage
- Add "Convert File to JSON" node to your workflow
- Connect a node that provides binary data (e.g., Read Binary File, HTTP Request)
- Configure parameters:
| Parameter | Default | Description |
|---|---|---|
| Binary Property | data | Name of the binary property with the file |
| Output Format (DOCX) | text | text, html, or markdown (GFM with tables) |
| Max File Size (MB) | 50 | File size limit |
| Max Concurrency | 4 | Parallel file processing |
Output Examples
Text document (DOCX, PDF, TXT, etc.)
{
"text": "Extracted document content...",
"metadata": {
"fileName": "report.docx",
"fileSize": 12345,
"fileType": "docx",
"processedAt": "2026-02-08T00:00:00.000Z"
}
}
DOCX with HTML output
{
"text": "<p>Introduction</p><table><tr><td>Name</td><td>Value</td></tr>...</table>",
"metadata": { "fileName": "data.docx", "fileType": "docx" }
}
DOCX with Markdown output
{
"text": "# Introduction\n\n| Name | Value |\n| --- | --- |\n| Item | 100 |\n\n**Bold text** and _italic_",
"metadata": { "fileName": "data.docx", "fileType": "docx" }
}
XLSX (multi-sheet)
{
"sheets": {
"Products": [
{ "A": "ID", "B": "Name", "C": "Price" },
{ "A": 1, "B": "Apple", "C": 100 }
],
"Orders": [
{ "A": "Order", "B": "Qty" },
{ "A": 101, "B": 5 }
]
},
"metadata": { "fileName": "data.xlsx", "fileType": "xlsx" }
}
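Downstream nodes often want rows keyed by header name rather than by column letter. The sketch below shows one way to do that in a Code node or external script, assuming (as in the example above) that the first row of each sheet holds the headers. `rowsToRecords` is a hypothetical helper, not part of the package.

```typescript
// Hypothetical post-processing helper; assumes the first row of the sheet contains headers.
type Cell = string | number | boolean | null;
type LetterRow = Record<string, Cell>;

function rowsToRecords(rows: LetterRow[]): Record<string, Cell>[] {
  const [header, ...body] = rows;
  if (!header) return []; // empty sheet

  return body.map((row) => {
    const record: Record<string, Cell> = {};
    for (const [column, title] of Object.entries(header)) {
      // Map e.g. column "B" to its header "Name"; missing cells become null
      record[String(title)] = row[column] ?? null;
    }
    return record;
  });
}

const products = rowsToRecords([
  { A: "ID", B: "Name", C: "Price" },
  { A: 1, B: "Apple", C: 100 },
]);
// products is [{ ID: 1, Name: "Apple", Price: 100 }]
```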
JSON (flattened)
{
"text": "{\n \"user.name\": \"John\",\n \"user.address.city\": \"London\"\n}",
"warning": "Multi-level JSON structure was converted to flat object"
}
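The dot-separated keys in this output come from flattening the nested structure. A minimal sketch of that idea follows; it is not the package's flatten.ts, and it assumes arrays are kept as leaf values, which the source does not specify.

```typescript
// Hypothetical flattening sketch; array handling is an assumption (arrays stay as leaf values).
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

function flatten(
  value: Json,
  prefix = "",
  out: Record<string, Json> = {},
): Record<string, Json> {
  if (value !== null && typeof value === "object" && !Array.isArray(value)) {
    // Recurse into plain objects, extending the dotted path
    for (const [key, child] of Object.entries(value)) {
      flatten(child, prefix ? `${prefix}.${key}` : key, out);
    }
  } else {
    // Primitives, null and arrays are stored under the accumulated path
    out[prefix] = value;
  }
  return out;
}

const flat = flatten({ user: { name: "John", address: { city: "London" } } });
// flat is { "user.name": "John", "user.address.city": "London" }
```

Note that the node returns the flattened object as a JSON string in the text field, so consumers need JSON.parse to get an object back.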
YML (Yandex Market)
{
"text": "{ \"shop\": { \"name\": \"MyShop\" }, \"currencies\": [...], \"categories\": [...], \"offers\": [...] }"
}
Architecture
Project Structure
src/
├── FileToJsonNode.node.ts   # Node class (~220 lines)
├── types.ts                 # Interfaces (JsonResult, StrategyFn, YML types)
├── errors.ts                # Custom error classes
├── helpers.ts               # extractViaOfficeParser, limitExcelSheet
├── strategies/
│   └── index.ts             # All format strategies
├── processors/
│   └── yml.ts               # Yandex Market YML processor
└── utils/
    ├── sanitize.ts          # File name sanitization
    ├── promisePool.ts       # Concurrency control (Set-based)
    ├── columns.ts           # numberToColumn (1→A, 27→AA)
    ├── flatten.ts           # JSON flattening
    └── index.ts             # Barrel export
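The numberToColumn mapping listed for columns.ts (1→A, 27→AA) is a bijective base-26 conversion. The function name comes from the tree above; the body below is a sketch of the standard technique and may differ from the actual source.

```typescript
// Sketch of a 1-based index to Excel-style column name conversion (bijective base 26).
function numberToColumn(n: number): string {
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError("column number must be a positive integer");
  }
  let name = "";
  while (n > 0) {
    const remainder = (n - 1) % 26; // 0 → "A", 25 → "Z"
    name = String.fromCharCode(65 + remainder) + name;
    n = Math.floor((n - 1) / 26); // shift to the next, more significant letter
  }
  return name;
}

// numberToColumn(1) returns "A", numberToColumn(26) returns "Z", numberToColumn(27) returns "AA"
```

The subtraction of 1 on each step is what distinguishes this from ordinary base 26: there is no zero digit, so Z is followed by AA rather than by a "B0"-style rollover.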
Processing Flow
Input binary → detect file type (magic bytes) → select strategy → process → output JSON
                                   │
          ┌────────────────────────┼────────────────────┐
          │                        │                    │
     Text formats             Spreadsheets           Special
   (DOCX, PDF, TXT,         (XLSX, CSV, ODS)       (XML, JSON,
    PPTX, HTML, ODT,                                YML, HTML)
    ODP, DOC, PPT)
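The "select strategy → process" step of this flow can be sketched as a lookup table from file type to handler. The type names JsonResult and StrategyFn appear in types.ts, but their shapes here, the two toy strategies, and the `convert` function are assumptions made for illustration.

```typescript
// Hypothetical shapes: the real JsonResult and StrategyFn in types.ts may differ.
type JsonResult = { text: string };
type StrategyFn = (data: Uint8Array) => Promise<JsonResult>;

const decode = (data: Uint8Array): string => new TextDecoder("utf-8").decode(data);

// One entry per format; each strategy owns its own processing pipeline
const strategies = new Map<string, StrategyFn>([
  ["txt", async (data) => ({ text: decode(data) })],
  [
    "json",
    async (data) => {
      const parsed: unknown = JSON.parse(decode(data));
      return { text: JSON.stringify(parsed, null, 2) };
    },
  ],
]);

async function convert(fileType: string, data: Uint8Array): Promise<JsonResult> {
  const strategy = strategies.get(fileType.toLowerCase());
  if (!strategy) {
    throw new Error(`Unsupported file type: ${fileType}`);
  }
  return strategy(data);
}
```

Adding a format then means registering one more entry in the map; the dispatch code in `convert` stays unchanged.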
Technology Stack
| Component | Library | Version |
|---|---|---|
| DOCX/PDF/PPTX/OD* | officeparser | ^6.0.4 |
| DOCX (HTML/MD) | mammoth | ^1.11.0 |
| HTML → Markdown | node-html-markdown | ^2.0.0 |
| XLSX | read-excel-file | ^6.0.3 |
| CSV | papaparse | ^5.5.3 |
| XML/YML | fast-xml-parser | ^5.3.4 |
| HTML | node-html-parser | ^7.0.2 |
| Encoding | chardet | ^2.1.1 |
| File type | file-type | 16.5.4 |
| n8n SDK | n8n-workflow | ^2.7.0 |
| Runtime | Node.js | 22.x |
| Language | TypeScript | 5.8 (strict) |
| Tests | Jest | 30.x |
Development
npm install # Install dependencies
npm run build # Compile TypeScript
npm test # Run 57 tests (13 suites)
npm run lint # ESLint check
npm run dev # Watch mode
npm run test:coverage # Coverage report
CI/CD
- CI (ci.yml): lint -> build -> test -> security audit on push/PR to main/develop
- Auto Release (auto-release.yml): creates GitHub Release + git tag on version bump in main
- npm publish: manual (npm publish --access public)
Limitations
| Limitation | Details |
|---|---|
| Legacy XLS | Binary Excel is not supported; convert to XLSX first |
| file-type | Pinned to v16.5.4 (last CJS version, v17+ is ESM-only) |
| Scanned PDFs | Image-based PDFs return empty text (no OCR) |
| Large files | PDF/XLSX load into RAM; use Max File Size to control |
License
MIT © mazix