Convert File to JSON (Enhanced)
Overview
This node converts various file types from binary data into JSON or plain text representations. It supports formats such as DOCX, XML, YML, XLSX, CSV, PDF, TXT, PPTX, HTML, and others. The node is useful when you want to extract structured or textual content from files within an n8n workflow for further processing, analysis, or integration.
Common scenarios include:
- Extracting data from spreadsheet files (XLSX, CSV) into JSON objects.
- Parsing Word documents (DOCX) or PDFs into plain text.
- Converting XML or YML files into JSON format.
- Processing HTML files to extract clean text content.
- Handling multiple sheets in Excel files either grouped by file or output as separate items.
Practical example:
- You receive invoices as XLSX files via email attachments. This node can convert each sheet of the invoice into JSON objects, optionally including original row numbers and sheet names, enabling automated data extraction and downstream processing like database insertion or API calls.
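Because rows are keyed by column letter rather than by header text, a downstream step usually has to re-key them before inserting into a database. The sketch below shows one way to do that, assuming the first row of the sheet holds the column headers. `rowsToRecords` and the sample invoice rows are hypothetical illustrations and not part of the node.

```typescript
// Row shape as described for this node: keys are column letters,
// plus an optional origRow when "Include Original Row Numbers" is on.
type LetterRow = Record<string, unknown>;

// Hypothetical helper (not part of the node): treats the first row as a
// header row and re-keys every following row by those header values.
function rowsToRecords(rows: LetterRow[]): Record<string, unknown>[] {
  const header = rows[0];
  if (header === undefined) return [];

  return rows.slice(1).map((row) => {
    const record: Record<string, unknown> = {};
    for (const [column, name] of Object.entries(header)) {
      if (column === "origRow") continue; // metadata, not a real column
      record[String(name)] = row[column] ?? null; // missing cells become null
    }
    // Carry the original row number through when the node emitted it.
    if ("origRow" in row) record["origRow"] = row["origRow"];
    return record;
  });
}

// Made-up sample data in the documented row shape.
const invoiceRows: LetterRow[] = [
  { A: "Item", B: "Qty", C: "Price", origRow: 1 },
  { A: "Widget", B: 2, C: 9.5, origRow: 2 },
  { A: "Gadget", B: 1, C: 20, origRow: 3 },
];

const invoiceRecords = rowsToRecords(invoiceRows);
```

With the sample rows above, `invoiceRecords` holds two objects keyed by `Item`, `Qty`, and `Price`, each still carrying its `origRow`.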
Properties
| Name | Meaning |
|---|---|
| Binary Property | Name of the binary property that contains the file (default: "data"). |
| Max File Size (MB) | Maximum allowed file size in megabytes (1 to 100 MB, default: 50). |
| Max Concurrency | Maximum number of files processed concurrently (1 to 10, default: 4). |
| Include Original Row Numbers | For Excel files, include the original row number from the source file in the origRow property (boolean). |
| Include File Name | Include the filename in each sheet object (boolean, default: true). |
| Include Sheet Name | Include the sheet name in each sheet object (boolean, default: true). |
| Output Sheets as Separate Items | Output each sheet as a separate workflow item instead of grouping by file. When enabled, text-based files (PDF, DOCX, etc.) are ignored (boolean, default: false). |
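To make the numeric limits in the table concrete, here is a minimal sketch of how such options could be normalized and how a concurrency cap can be enforced. It is not the node's actual implementation: the option keys (`binaryProperty`, `maxFileSizeMb`, `maxConcurrency`) and both helper functions are hypothetical names chosen for illustration. Only the defaults and ranges come from the table.

```typescript
// Hypothetical option keys; only the display names, defaults, and ranges
// are documented for the node.
interface ConvertOptions {
  binaryProperty: string;
  maxFileSizeMb: number;
  maxConcurrency: number;
}

const DEFAULTS: ConvertOptions = {
  binaryProperty: "data",
  maxFileSizeMb: 50,
  maxConcurrency: 4,
};

const clamp = (value: number, min: number, max: number): number =>
  Math.min(max, Math.max(min, value));

// Fill in defaults and force numeric options into the documented ranges.
function normalizeOptions(input: Partial<ConvertOptions>): ConvertOptions {
  return {
    binaryProperty: input.binaryProperty ?? DEFAULTS.binaryProperty,
    maxFileSizeMb: clamp(input.maxFileSizeMb ?? DEFAULTS.maxFileSizeMb, 1, 100),
    maxConcurrency: clamp(
      Math.floor(input.maxConcurrency ?? DEFAULTS.maxConcurrency),
      1,
      10,
    ),
  };
}

// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results keep the order of `items`.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  // Start up to `limit` runners; each one pulls the next unclaimed index.
  const runners = Array.from(
    { length: Math.min(limit, items.length) },
    async () => {
      while (next < items.length) {
        const index = next++;
        results[index] = await worker(items[index] as T, index);
      }
    },
  );
  await Promise.all(runners);
  return results;
}
```

A caller would pass `normalizeOptions(userInput).maxConcurrency` as `limit`, so that an out-of-range setting cannot start more than 10 parsers at once.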
Output
The node outputs JSON data representing the contents of the input files:
For Excel files (XLSX, CSV), the output includes an object with sheets as keys. Each sheet contains:
- fileName (optional): The name of the file.
- sheetName (optional): The name of the sheet.
- data: An array of row objects where each key corresponds to a column letter (e.g., "A", "B", "C") and values are cell contents.
- If enabled, each row may also contain an origRow property indicating the original row number in the source file.
For text-based files (DOCX, PDF, TXT, HTML, XML, YML, JSON, PPTX, ODT, ODP, ODS), the output contains a text field with extracted plain text or JSON stringified content.
When "Output Sheets as Separate Items" is enabled, each Excel sheet is output as a separate workflow item with properties:
- rows: Array of row objects as above.
- Optionally fileName and sheetName.
- Metadata fields: fileType, fileSize, processedAt.
When disabled, all files are grouped into a single item containing:
- files: Array of parsed file JSON/text objects.
- totalFiles: Number of processed files.
- processedAt: Timestamp of processing.
If the node encounters empty files or files without text content, it throws an error.
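The grouped output described above can be sketched as TypeScript types, which helps when writing downstream code. This is an interpretation of the description and not a published schema. It assumes a spreadsheet file is an object keyed by sheet name and that processedAt is a string timestamp. `summarize`, `isTextFile`, and the sample data are hypothetical.

```typescript
type Row = Record<string, unknown>; // column letters, plus optional origRow

interface SheetEntry {
  fileName?: string;
  sheetName?: string;
  data: Row[];
}

// Assumption: one object per spreadsheet file, keyed by sheet name.
type SpreadsheetFile = Record<string, SheetEntry>;

interface TextFile {
  text: string;
}

type ParsedFile = TextFile | SpreadsheetFile;

interface GroupedOutput {
  files: ParsedFile[];
  totalFiles: number;
  processedAt: string; // assumption: a string timestamp
}

// Text-based files carry a string `text` field; sheet entries are objects.
function isTextFile(file: ParsedFile): file is TextFile {
  return typeof (file as TextFile).text === "string";
}

// Hypothetical helper: split a grouped item into texts and per-sheet counts.
function summarize(output: GroupedOutput): {
  texts: string[];
  sheets: { sheet: string; rowCount: number }[];
} {
  const texts: string[] = [];
  const sheets: { sheet: string; rowCount: number }[] = [];
  for (const file of output.files) {
    if (isTextFile(file)) {
      texts.push(file.text);
      continue;
    }
    for (const [key, entry] of Object.entries(file)) {
      // Prefer the explicit sheetName; fall back to the object key.
      sheets.push({ sheet: entry.sheetName ?? key, rowCount: entry.data.length });
    }
  }
  return { texts, sheets };
}

// Made-up sample item in the documented grouped shape.
const sampleOutput: GroupedOutput = {
  files: [
    { text: "Quarterly report body" },
    {
      Invoices: {
        fileName: "q1.xlsx",
        sheetName: "Invoices",
        data: [{ A: "Item", B: "Qty" }, { A: "Widget", B: 2 }],
      },
      Notes: { data: [] },
    },
  ],
  totalFiles: 2,
  processedAt: "2024-01-01T00:00:00.000Z",
};

const sampleSummary = summarize(sampleOutput);
```

For the sample item, `sampleSummary` contains one text entry and two sheet entries: `Invoices` with 2 rows and `Notes` with 0 rows.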
Dependencies
- Requires access to binary file data within the workflow.
- Uses several external libraries bundled internally for parsing different file types:
  - exceljs for Excel files.
  - mammoth for DOCX text extraction.
  - pdf-parse for PDF text extraction.
  - xml2js for XML/YML parsing.
  - papaparse for CSV parsing.
  - cheerio and sanitize-html for HTML text extraction.
  - chardet and iconv-lite for character encoding detection and decoding.
- No additional external service configuration is required beyond providing the binary file data.
Troubleshooting
- File too large error: If the input file exceeds the configured maximum file size, the node will throw a "File is too large" error. Increase the "Max File Size" property or reduce the file size.
- Unsupported file type error: If the file extension or detected file type is not supported, the node throws an unsupported format error. Ensure the input files are among supported types (DOCX, XLSX, CSV, PDF, TXT, XML, YML, JSON, PPTX, HTML, ODT, ODP, ODS).
- Empty file error: Files with no content or empty text after extraction cause an error. Verify the input files contain readable content.
- Missing binary property error: If the specified binary property does not exist on the input item, the node throws an error. Confirm the correct binary property name is set.
- Invalid file name error: File names containing path traversal characters or invalid characters will cause an error. Ensure filenames are sanitized and valid.
- Concurrency issues: Setting very high concurrency might lead to resource exhaustion or slowdowns. Adjust "Max Concurrency" according to your system capabilities.
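Several of the errors above can be caught before a file ever reaches the node, for example in a preceding step. The following is a rough sketch of such pre-checks and not the node's own validation logic. `validateFile` and `FileInfo` are hypothetical, and apart from "File is too large" the message strings are illustrative. The exact characters the node rejects in file names are not documented, so the pattern below is an assumption.

```typescript
// Supported extensions as listed in the troubleshooting notes.
const SUPPORTED_EXTENSIONS = new Set([
  "docx", "xlsx", "csv", "pdf", "txt", "xml", "yml",
  "json", "pptx", "html", "odt", "odp", "ods",
]);

// Hypothetical description of an incoming file.
interface FileInfo {
  fileName: string;
  sizeBytes: number;
}

// Returns a list of problems; an empty list means the file passed all checks.
function validateFile(file: FileInfo, maxFileSizeMb: number): string[] {
  const problems: string[] = [];

  if (file.sizeBytes === 0) {
    problems.push("File is empty");
  }
  if (file.sizeBytes > maxFileSizeMb * 1024 * 1024) {
    problems.push("File is too large");
  }
  // Assumption: reject ".." sequences, path separators, and NUL bytes.
  if (file.fileName.includes("..") || /[\\\/\0]/.test(file.fileName)) {
    problems.push("Invalid file name");
  }

  const dot = file.fileName.lastIndexOf(".");
  const extension = dot >= 0 ? file.fileName.slice(dot + 1).toLowerCase() : "";
  if (!SUPPORTED_EXTENSIONS.has(extension)) {
    problems.push("Unsupported file type");
  }

  return problems;
}

const okFile = validateFile({ fileName: "invoice.xlsx", sizeBytes: 1024 }, 50);
const bigFile = validateFile({ fileName: "scan.pdf", sizeBytes: 60 * 1024 * 1024 }, 50);
const badName = validateFile({ fileName: "../etc/passwd", sizeBytes: 10 }, 50);
```

Returning a list instead of throwing on the first failure lets a workflow report every problem with a file in a single pass.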
Links and References
- ExcelJS GitHub - Used for Excel file parsing.
- Mammoth.js - DOCX text extraction.
- pdf-parse - PDF text extraction.
- xml2js - XML parsing.
- PapaParse - CSV parsing.
- Cheerio - HTML parsing.
- sanitize-html - HTML sanitization.
These resources provide more details about the underlying parsing technologies used by the node.