Docx Extractor

Extract DOCX files to other formats

Overview

The "Docx Extractor" node processes DOCX files provided as binary input and converts them into one or more text-based formats: HTML, plain text (TXT), Markdown, or all three simultaneously. This node is useful when you need to extract readable content from Word documents for further processing, analysis, or integration with other systems that consume text or HTML.

Common scenarios include:

Converting uploaded DOCX reports into HTML for web display.
Extracting raw text from DOCX files for indexing or search.
Generating Markdown versions of DOCX content for documentation workflows.
Producing multiple output formats at once for flexible downstream use.

Properties

Name	Meaning
Input Binary Field	The name of the input field containing the binary DOCX file data to be processed.
Output Format	The format to convert the DOCX file into. Options: HTML, TXT (plain text), Markdown, All (all three formats).

Output

The node outputs JSON objects with the following fields depending on the selected output format:

html: Contains the extracted content converted to HTML.
textContent: Contains the extracted raw text content.
markdown: Contains the extracted content converted to Markdown.

If "All" is selected, all three fields are present in the output JSON.

Each output item corresponds to exactly one input item and carries a paired-item reference that links it back to that input item.

No binary output is produced; the node only outputs textual representations of the DOCX content.
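
The mapping from the selected format to the output fields can be sketched in TypeScript as follows. This is a minimal illustration, not the node's actual source: the option values "html", "txt", "markdown", and "all" and the helper name selectOutputFields are assumptions. Only the field names html, textContent, and markdown come from the description above.

```typescript
// Hypothetical option values; the node's internal identifiers may differ.
type OutputFormat = "html" | "txt" | "markdown" | "all";

// Field names as documented in the Output section.
interface ExtractedContent {
  html: string;
  textContent: string;
  markdown: string;
}

// Keep only the fields that belong to the selected output format.
function selectOutputFields(
  format: OutputFormat,
  content: ExtractedContent,
): Partial<ExtractedContent> {
  switch (format) {
    case "html":
      return { html: content.html };
    case "txt":
      return { textContent: content.textContent };
    case "markdown":
      return { markdown: content.markdown };
    case "all":
      // "All" exposes all three fields on the same output item.
      return {
        html: content.html,
        textContent: content.textContent,
        markdown: content.markdown,
      };
  }
}
```

A downstream node should therefore check which fields are present rather than assume all three exist, unless "All" was selected.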

Dependencies

Uses the mammoth library to convert DOCX files to HTML and extract raw text.
Uses the turndown library to convert HTML to Markdown.
Requires the input binary data to have the MIME type application/vnd.openxmlformats-officedocument.wordprocessingml.document.
No external API keys or services are required.
The node expects the DOCX file to be provided as binary data in the specified input field.
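
The conversion chain implied by these dependencies can be sketched as below. The converters are injected so the sketch stays self-contained; the names DocxConverters, ExtractionResult, and extractDocx are hypothetical, and the node's real implementation may be organised differently. In practice, toHtml and toRawText would likely wrap mammoth's convertToHtml({ buffer }) and extractRawText({ buffer }), each of which resolves to an object whose value property holds the result, and htmlToMarkdown would wrap new TurndownService().turndown(html). The exact wording of the thrown message is an approximation.

```typescript
const DOCX_MIME =
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

// Hypothetical seam around the two libraries, so they can be swapped or mocked.
interface DocxConverters {
  toHtml(data: Uint8Array): Promise<string>;    // e.g. backed by mammoth
  toRawText(data: Uint8Array): Promise<string>; // e.g. backed by mammoth
  htmlToMarkdown(html: string): string;         // e.g. backed by turndown
}

interface ExtractionResult {
  html: string;
  textContent: string;
  markdown: string;
}

async function extractDocx(
  data: Uint8Array,
  mimeType: string,
  converters: DocxConverters,
): Promise<ExtractionResult> {
  // Reject anything that is not declared as a DOCX file.
  if (mimeType !== DOCX_MIME) {
    throw new Error(`Unsupported MIME type: ${mimeType}. Expected DOCX file.`);
  }
  const html = await converters.toHtml(data);
  const textContent = await converters.toRawText(data);
  // Markdown is derived from the HTML output, not from the DOCX directly.
  const markdown = converters.htmlToMarkdown(html);
  return { html, textContent, markdown };
}
```

Because Markdown is produced from the intermediate HTML, any formatting that is lost in the DOCX-to-HTML step will also be missing from the Markdown output.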

Troubleshooting

Error: No binary data found in field "X" for item Y
This occurs if the specified input field does not contain binary data for the given item. Ensure the input field name matches the actual binary property name and that the input contains valid binary data.
Error: Unsupported MIME type: ... Expected DOCX file.
The node only supports DOCX files. Verify that the input binary data has the correct MIME type (application/vnd.openxmlformats-officedocument.wordprocessingml.document). Other file types will cause this error.
Conversion errors or empty output
If the DOCX file is corrupted or uses unsupported features, the conversion libraries may fail or produce incomplete results. Try validating the DOCX file or testing with a simpler document.
Continue On Fail behavior
If enabled, the node will output error messages per item instead of stopping execution on the first failure.
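
The Continue On Fail behavior can be pictured as a per-item try/catch. The sketch below shows the general pattern only and is not taken from the node's code: processItems, ItemResult, and the error field name are hypothetical.

```typescript
// Hypothetical result shape: either the converted fields or an error message.
type ItemResult<O> = { json: O } | { json: { error: string } };

function processItems<I, O>(
  items: I[],
  convert: (item: I, index: number) => O,
  continueOnFail: boolean,
): ItemResult<O>[] {
  const results: ItemResult<O>[] = [];
  items.forEach((item, index) => {
    try {
      results.push({ json: convert(item, index) });
    } catch (err) {
      // Without Continue On Fail, the first failure aborts the whole run.
      if (!continueOnFail) throw err;
      // With it, the failure is recorded for this item and processing goes on.
      const message = err instanceof Error ? err.message : String(err);
      results.push({ json: { error: message } });
    }
  });
  return results;
}
```

With continueOnFail set to true, a batch containing one bad file still yields one output item per input, and the failed item carries the error message instead of converted content.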

Links and References

Mammoth.js GitHub Repository – Library used for DOCX to HTML/text extraction.
Turndown GitHub Repository – Library used for HTML to Markdown conversion.
