Convert File to JSON (Enhanced)

DOCX / XML / YML / XLSX / CSV / PDF / TXT / PPTX / HTML → JSON|text (with Excel row/column preservation)

Overview

This node converts various file types from binary data into JSON or plain text representations. It supports formats such as DOCX, XML, YML, XLSX, CSV, PDF, TXT, PPTX, HTML, and others. The node is useful when you want to extract structured or textual content from files within an n8n workflow for further processing, analysis, or integration.

Common scenarios include:

Extracting tabular data from spreadsheet files (XLSX, CSV) into JSON objects.
Parsing Word documents (DOCX) or PDFs into plain text.
Converting XML or YML files into JSON format.
Processing HTML files to extract clean text content.
Handling multiple sheets in Excel files either grouped by file or output as separate items.

Practical example:

You receive invoices as XLSX files via email attachments. This node can convert each sheet of the invoice into JSON objects, optionally including original row numbers and sheet names, enabling automated data extraction and downstream processing like database insertion or API calls.

Properties

Name	Meaning
Binary Property	Name of the binary property that contains the file (default: "data").
Max File Size (MB)	Maximum allowed file size in megabytes (1 to 100 MB, default: 50).
Max Concurrency	Maximum number of files processed concurrently (1 to 10, default: 4).
Include Original Row Numbers	For Excel files, include the original row number from the source file in the `origRow` property (boolean).
Include File Name	Include the filename in each sheet object (boolean, default: true).
Include Sheet Name	Include the sheet name in each sheet object (boolean, default: true).
Output Sheets as Separate Items	Output each sheet as a separate workflow item instead of grouping by file. When enabled, text files (PDF, DOCX, etc.) are skipped (boolean, default: false).
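
To make the documented ranges and defaults concrete, here is a minimal Python sketch that mirrors the properties table as a validated settings object. The class and field names are hypothetical and not part of the node. The default for "Include Original Row Numbers" is not documented, so `False` is an assumption, as is the 1 MB = 1024 * 1024 bytes conversion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvertOptions:
    """Hypothetical mirror of the node's properties; names are illustrative only."""

    binary_property: str = "data"
    max_file_size_mb: int = 50  # documented range: 1 to 100
    max_concurrency: int = 4  # documented range: 1 to 10
    include_original_row_numbers: bool = False  # default not documented; assumed
    include_file_name: bool = True
    include_sheet_name: bool = True
    output_sheets_as_separate_items: bool = False

    def __post_init__(self) -> None:
        # Reject values outside the ranges listed in the properties table.
        if not self.binary_property:
            raise ValueError("binary_property must not be empty")
        if not 1 <= self.max_file_size_mb <= 100:
            raise ValueError("max_file_size_mb must be between 1 and 100")
        if not 1 <= self.max_concurrency <= 10:
            raise ValueError("max_concurrency must be between 1 and 10")

    @property
    def max_file_size_bytes(self) -> int:
        # Assumes 1 MB = 1024 * 1024 bytes; the node may define it differently.
        return self.max_file_size_mb * 1024 * 1024


defaults = ConvertOptions()
```

Constructing `ConvertOptions(max_concurrency=11)` raises a `ValueError`, which is a convenient way to catch out-of-range settings before a workflow runs.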

Output

The node outputs JSON data representing the contents of the input files:

For Excel files (XLSX, CSV), the output includes an object with sheets as keys. Each sheet contains:
- fileName (optional): The name of the file.
- sheetName (optional): The name of the sheet.
- data: An array of row objects where each key corresponds to a column letter (e.g., "A", "B", "C") and values are cell contents.
- If enabled, each row may also contain an origRow property indicating the original row number in the source file.
For text-based files (DOCX, PDF, TXT, HTML, XML, YML, JSON, PPTX, ODT, ODP, ODS), the output contains a text field holding either the extracted plain text or the JSON-stringified content.
When "Output Sheets as Separate Items" is enabled, each Excel sheet is output as a separate workflow item with properties:
- rows: Array of row objects as above.
- Optionally fileName and sheetName.
- Metadata fields: fileType, fileSize, processedAt.
When "Output Sheets as Separate Items" is disabled, all files are grouped into a single item containing:
- files: Array of parsed file JSON/text objects.
- totalFiles: Number of processed files.
- processedAt: Timestamp of processing.
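
Because rows are keyed by column letter rather than by header text, downstream steps often need to re-key them. The following Python sketch shows one way to do that for a single sheet object. It assumes the first row of `data` holds the headers, which the node does not guarantee, and the helper names `column_index` and `rows_to_records` are hypothetical.

```python
from typing import Any


def column_index(letters: str) -> int:
    """Convert a column letter such as "A" or "AA" to a 1-based index."""
    if not letters:
        raise ValueError("empty column letter")
    index = 0
    for ch in letters.upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"not a column letter: {letters!r}")
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def rows_to_records(sheet: dict[str, Any]) -> list[dict[str, Any]]:
    """Re-key rows by header text, assuming the first row contains the headers."""
    rows = sheet.get("data", [])
    if not rows:
        return []
    header_row, *body = rows
    # Map column letter -> header text, ignoring the optional origRow field.
    headers = {col: str(val) for col, val in header_row.items() if col != "origRow"}
    ordered = sorted(headers.items(), key=lambda kv: column_index(kv[0]))
    records = []
    for row in body:
        record = {name: row.get(col) for col, name in ordered}
        if "origRow" in row:
            # Keep the source row number so records can be traced back to the file.
            record["origRow"] = row["origRow"]
        records.append(record)
    return records


# Illustrative sheet object shaped like the description above (values are made up).
sample_sheet = {
    "fileName": "invoice.xlsx",
    "sheetName": "Items",
    "data": [
        {"A": "SKU", "B": "Qty", "origRow": 1},
        {"A": "X-100", "B": 3, "origRow": 2},
        {"A": "Y-200", "B": 1, "origRow": 3},
    ],
}
records = rows_to_records(sample_sheet)
```

With the sample above, `records[0]` is `{"SKU": "X-100", "Qty": 3, "origRow": 2}`. Cells missing from a row come back as `None` rather than raising an error.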

If the node encounters empty files or files without text content, it throws an error.

Dependencies

Requires only that the input item carries binary file data in the configured binary property.
Uses several external libraries bundled internally for parsing different file types:
- exceljs for Excel files.
- mammoth for DOCX text extraction.
- pdf-parse for PDF text extraction.
- xml2js for XML parsing.
- papaparse for CSV parsing.
- cheerio and sanitize-html for HTML text extraction.
- chardet and iconv-lite for character encoding detection and decoding.
No additional external service configuration is required beyond providing the binary file data.

Troubleshooting

File too large error: If the input file exceeds the configured maximum file size, the node throws a "File is too large" error. Increase the "Max File Size (MB)" property or reduce the file size.
Unsupported file type error: If the file extension or detected file type is not supported, the node throws an unsupported format error. Ensure the input files are among supported types (DOCX, XLSX, CSV, PDF, TXT, XML, YML, JSON, PPTX, HTML, ODT, ODP, ODS).
Empty file error: Files with no content or empty text after extraction cause an error. Verify the input files contain readable content.
Missing binary property error: If the specified binary property does not exist on the input item, the node throws an error. Confirm the correct binary property name is set.
Invalid file name error: File names containing path traversal characters or invalid characters will cause an error. Ensure filenames are sanitized and valid.
Concurrency issues: Setting very high concurrency might lead to resource exhaustion or slowdowns. Adjust "Max Concurrency" according to your system capabilities.
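
Several of these errors can be caught before a file reaches the node. The Python sketch below is a hypothetical pre-flight check, not the node's own validation logic. It assumes the file type is judged by extension only (the node may also inspect content), that 1 MB = 1024 * 1024 bytes, and that any `..`, slash, backslash, or NUL in a name should be rejected.

```python
# Extensions taken from the supported-types list above.
SUPPORTED_EXTENSIONS = {
    "docx", "xlsx", "csv", "pdf", "txt", "xml", "yml",
    "json", "pptx", "html", "odt", "odp", "ods",
}


def preflight(file_name: str, content: bytes, max_mb: int = 50) -> list[str]:
    """Return a list of likely problems; an empty list means none were found."""
    problems = []

    # Conservative name check: this also rejects harmless names such as "a..b.txt".
    if any(token in file_name for token in ("..", "/", "\\", "\x00")):
        problems.append("invalid file name")

    extension = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append("unsupported file type")

    if len(content) == 0:
        problems.append("empty file")
    elif len(content) > max_mb * 1024 * 1024:
        problems.append("file too large")

    return problems


ok = preflight("invoice.xlsx", b"abc")
bad_name = preflight("../etc/passwd", b"abc")
```

Returning a list instead of raising on the first problem lets a workflow report every issue with a file in one pass.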

Links and References

ExcelJS - Excel file parsing.
Mammoth.js - DOCX text extraction.
pdf-parse - PDF text extraction.
xml2js - XML parsing.
PapaParse - CSV parsing.
Cheerio - HTML parsing.
sanitize-html - HTML sanitization.

These resources provide more details about the underlying parsing technologies used by the node.
