Overview
This node converts HTML content into a DOCX document using a local Pandoc executable. It accepts HTML entered directly or extracted from a JSON field. It is useful in document generation workflows that turn HTML into Word documents, such as reports, letters, or other formatted documents built from web content or other HTML sources.
Use Case Examples
- Convert a simple HTML snippet directly into a DOCX file for download or further processing.
- Extract HTML content from a JSON field in incoming data and convert it into a DOCX document.
- Embed local resources like images and CSS into the DOCX to produce standalone documents.
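The second use case, reading HTML from a JSON field, amounts to a path lookup followed by a type check. The sketch below is an illustration only, not the node's implementation: the helper name `get_html_from_item` and the dotted-path syntax (`body.html`) are assumptions.

```python
from typing import Any


def get_html_from_item(item: dict, field_path: str) -> str:
    """Walk a dotted path (hypothetical syntax) and return the HTML string."""
    value: Any = item
    for key in field_path.split("."):
        if not isinstance(value, dict) or key not in value:
            raise KeyError(f"JSON field '{field_path}' not found")
        value = value[key]
    # Reject anything that is not a string before handing it to a converter.
    if not isinstance(value, str):
        raise TypeError(f"JSON field '{field_path}' is not a string")
    return value


item = {"body": {"html": "<p>Hello</p>"}}
print(get_html_from_item(item, "body.html"))  # prints <p>Hello</p>
```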
Properties
| Name | Meaning |
|---|---|
| HTML Source | Selects the source of the HTML input, either direct HTML input or from a JSON field. |
| HTML | The direct HTML string to convert to DOCX, used if HTML Source is 'Direct HTML'. |
| JSON Field | The JSON field path to extract the HTML string from, used if HTML Source is 'JSON Field'. |
| Output Binary Property | The name of the binary property where the resulting DOCX file will be stored. |
| File Name | The file name to assign to the generated DOCX document. |
| Pandoc Path | The path to the Pandoc executable, either absolute or available in system PATH. |
| Embed Resources | Whether to embed local resources (images, CSS) into the DOCX to produce a standalone document. |
| Resource Paths | Directories to search for local resources referenced in the HTML. |
| Reference DOCX Source | Selects a reference DOCX template to use during conversion, either none, a filesystem path, or a built-in minimal template. |
| Reference DOCX Path | Absolute path to the reference DOCX template, used if Reference DOCX Source is 'Filesystem Path'. |
| Whitespace Policy | How to normalize whitespace during conversion, either collapsing or preserving line breaks. |
| Use Minimal Reference | Whether to prefer using a minimal reference DOCX template when converting. |
| Timeout | Timeout in seconds for the Pandoc process to complete. |
| Additional Pandoc Arguments | Additional command-line arguments to pass to Pandoc for advanced customization. |
| Clean Output Mode | Enable output cleanup to simplify the DOCX structure, making it easier to diff and maintain. |
| Punctuation Normalization | Normalizes common punctuation such as smart quotes and dashes. Options are off or conservative. |
| Sanitize via CommonMark | Roundtrip HTML through CommonMark to simplify and sanitize the structure. |
| Strip Formatting Except Bold/Italic | Remove all formatting except bold and italic, preserving paragraph styles and numbering. |
| Remove Bookmarks | Remove bookmark tags from the DOCX output. |
| Collapse Empty Runs/Paragraphs | Remove empty runs and paragraphs produced by the conversion process. |
| Ensure xml:space="preserve" on Text | Ensure that text elements retain spacing by setting xml:space="preserve". |
| Normalization Profile (JSON) | Advanced JSON configuration for normalization profile to override defaults. |
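To make the relationship between these properties and Pandoc more concrete, the following sketch assembles a Pandoc command line from a few of them. The function `build_pandoc_args` and its parameter names are hypothetical, and the mapping from properties to flags is an assumption about how such a node could work. The flags themselves (`--from`, `--to`, `--output`, `--embed-resources`, `--resource-path`, `--reference-doc`) are standard Pandoc options.

```python
import os
from typing import List, Optional, Sequence


def build_pandoc_args(
    output_path: str,
    pandoc_path: str = "pandoc",
    embed_resources: bool = False,
    resource_paths: Sequence[str] = (),
    reference_docx: Optional[str] = None,
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Assemble an argument list for an HTML-to-DOCX conversion (assumed mapping)."""
    args = [pandoc_path, "--from=html", "--to=docx", f"--output={output_path}"]
    if embed_resources:
        args.append("--embed-resources")
    if resource_paths:
        # Pandoc separates search directories with the platform path separator.
        args.append("--resource-path=" + os.pathsep.join(resource_paths))
    if reference_docx:
        args.append(f"--reference-doc={reference_docx}")
    # Extra arguments go last so they can refine the options above.
    args.extend(extra_args)
    return args


print(build_pandoc_args("report.docx", reference_docx="/templates/ref.docx"))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.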
Output
Binary
The node outputs the generated DOCX file as binary data under the specified binary property name.
JSON
- error - Error message if the conversion fails for an item.
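As a rough illustration of the two output shapes, the sketch below builds a success item that holds the DOCX under a binary property and a failure item that carries `error`. The dictionary layout (`json`/`binary` keys, base64-encoded `data`, `mimeType`, `fileName`) and the helper names are assumptions made for illustration. Only the binary property name, the file name, and the `error` field come from the description above.

```python
import base64

# Registered MIME type for .docx files.
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def make_success_item(docx_bytes: bytes, binary_property: str = "data",
                      file_name: str = "document.docx") -> dict:
    """Wrap DOCX bytes under the chosen binary property (assumed layout)."""
    return {
        "json": {},
        "binary": {
            binary_property: {
                "data": base64.b64encode(docx_bytes).decode("ascii"),
                "mimeType": DOCX_MIME,
                "fileName": file_name,
            }
        },
    }


def make_error_item(message: str) -> dict:
    """Represent a failed conversion for a single item (assumed layout)."""
    return {"json": {"error": message}}


ok = make_success_item(b"PK\x03\x04", binary_property="docx", file_name="report.docx")
print(ok["binary"]["docx"]["fileName"])  # prints report.docx
```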
Dependencies
- Requires a local Pandoc executable accessible via the specified path or system PATH.
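A quick way to confirm this dependency before running a workflow is to resolve the executable the same way the operating system would. The sketch below uses Python's standard `shutil.which`, which accepts either a bare command name looked up on PATH or an explicit path. The helper name `resolve_pandoc` is hypothetical.

```python
import shutil
from typing import Optional


def resolve_pandoc(pandoc_path: str = "pandoc") -> Optional[str]:
    """Return the resolved executable path, or None if it cannot be found."""
    return shutil.which(pandoc_path)


path = resolve_pandoc()
print(path if path else "Pandoc not found; install it or set Pandoc Path")
```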
Troubleshooting
- If Pandoc is not installed, or Pandoc Path points to the wrong location, the conversion fails when the node tries to run Pandoc. Verify the installation and the configured path.
- If the specified JSON field does not contain a string, the node throws an error indicating that the JSON field is not a string.
- Timeout errors occur when the Pandoc process runs longer than the Timeout value; increasing Timeout may resolve this.
- Resource embedding can fail if the installed Pandoc version does not support it; make sure the Pandoc version is compatible.
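The failure modes above can be told apart when a process is launched from a script. The sketch below is a minimal illustration with hypothetical helper names, not the node's own error handling. It parses the first line of `pandoc --version`, checks it against 2.19 (the release in which `--embed-resources` appeared, per the Pandoc changelog; treat the threshold as an assumption to verify), and maps a missing executable, a timeout, and a non-zero exit to separate messages.

```python
import re
import subprocess
from typing import Sequence, Tuple


def parse_pandoc_version(version_output: str) -> Tuple[int, ...]:
    """Parse the first line of `pandoc --version`, e.g. 'pandoc 3.1.2'."""
    match = re.match(r"pandoc(?:\.exe)?\s+(\d+(?:\.\d+)*)", version_output)
    if match is None:
        raise ValueError("unrecognized `pandoc --version` output")
    return tuple(int(part) for part in match.group(1).split("."))


def supports_embed_resources(version: Tuple[int, ...]) -> bool:
    # Assumed threshold: --embed-resources is available from pandoc 2.19 onward.
    return version >= (2, 19)


def run_with_timeout(args: Sequence[str], timeout_seconds: float) -> bytes:
    """Run a command and translate common failures into readable errors."""
    try:
        completed = subprocess.run(
            args, capture_output=True, timeout=timeout_seconds, check=True
        )
    except FileNotFoundError as exc:
        raise RuntimeError(f"executable not found: {args[0]}") from exc
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"process exceeded {timeout_seconds}s timeout") from exc
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(exc.stderr.decode(errors="replace").strip()) from exc
    return completed.stdout


print(parse_pandoc_version("pandoc 3.1.2"))  # prints (3, 1, 2)
```

In practice `run_with_timeout(["pandoc", "--version"], 10)` would supply the text for `parse_pandoc_version`, assuming Pandoc is on PATH.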
Links
- Pandoc Official Website - Official site for Pandoc, the document converter used by this node.