HTML to DOCX (Pandoc)

Convert HTML to DOCX using a local Pandoc executable.

Overview

This node converts HTML content into a DOCX document using a local Pandoc executable. The HTML can be entered directly or extracted from a JSON field. The node suits automated document-generation workflows that turn HTML into Word documents, such as reports, letters, or other formatted documents built from web content or other HTML sources.
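
The conversion described above amounts to piping HTML into Pandoc and asking for DOCX output. The following is a minimal sketch of an equivalent invocation, not the node's actual implementation. The function names `build_pandoc_command` and `html_to_docx` are hypothetical; the flags `--from`, `--to`, and `--output` are standard Pandoc options.

```python
import subprocess


def build_pandoc_command(output_path, pandoc_path="pandoc"):
    """Return the argv list for a basic HTML-to-DOCX conversion.

    Pandoc reads its input from stdin when no input file is given.
    """
    return [pandoc_path, "--from=html", "--to=docx", f"--output={output_path}"]


def html_to_docx(html, output_path, pandoc_path="pandoc", timeout=60):
    """Pipe an HTML string into Pandoc and write a DOCX file.

    Requires a Pandoc executable at `pandoc_path` or on the system PATH.
    """
    command = build_pandoc_command(output_path, pandoc_path)
    subprocess.run(
        command,
        input=html.encode("utf-8"),
        check=True,       # raise CalledProcessError on a non-zero exit code
        timeout=timeout,  # seconds before the process is killed
    )
    return output_path
```

With Pandoc installed, `html_to_docx("<h1>Report</h1><p>Hello</p>", "report.docx")` would write `report.docx` to the working directory.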

Use Case Examples

  1. Convert a simple HTML snippet directly into a DOCX file for download or further processing.
  2. Extract HTML content from a JSON field in incoming data and convert it into a DOCX document.
  3. Embed local resources like images and CSS into the DOCX to produce standalone documents.
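
The second use case, reading HTML from a JSON field, can be sketched as a lookup followed by a type check. This is an illustration only: the helper name `get_html_from_item` is hypothetical, and the dot-separated path syntax is an assumption, since the page does not specify how field paths are written.

```python
def get_html_from_item(item, field_path):
    """Follow a dot-separated path (assumed syntax, e.g. "body.html")
    into a nested dict and return the HTML string found there."""
    value = item
    for key in field_path.split("."):
        if not isinstance(value, dict) or key not in value:
            raise KeyError(f"JSON field '{field_path}' not found")
        value = value[key]
    # Only strings can be handed to the converter.
    if not isinstance(value, str):
        raise TypeError(f"JSON field '{field_path}' is not a string")
    return value
```

For example, `get_html_from_item({"body": {"html": "<p>Hi</p>"}}, "body.html")` yields the HTML string, while a field holding a number raises a `TypeError`.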

Properties

Name | Meaning
HTML Source | Selects the source of the HTML input, either direct HTML input or from a JSON field.
HTML | The direct HTML string to convert to DOCX, used if HTML Source is 'Direct HTML'.
JSON Field | The JSON field path to extract the HTML string from, used if HTML Source is 'JSON Field'.
Output Binary Property | The name of the binary property where the resulting DOCX file will be stored.
File Name | The file name to assign to the generated DOCX document.
Pandoc Path | The path to the Pandoc executable, either absolute or available in system PATH.
Embed Resources | Whether to embed local resources (images, CSS) into the DOCX to produce a standalone document.
Resource Paths | Directories to search for local resources referenced in the HTML.
Reference DOCX Source | Selects a reference DOCX template to use during conversion, either none, a filesystem path, or a built-in minimal template.
Reference DOCX Path | Absolute path to the reference DOCX template, used if Reference DOCX Source is 'Filesystem Path'.
Whitespace Policy | How to normalize whitespace during conversion, either collapsing or preserving line breaks.
Use Minimal Reference | Whether to prefer using a minimal reference DOCX template when converting.
Timeout | Timeout in seconds for the Pandoc process to complete.
Additional Pandoc Arguments | Additional command-line arguments to pass to Pandoc for advanced customization.
Clean Output Mode | Enable output cleanup to simplify the DOCX structure, making it easier to diff and maintain.
Punctuation Normalization | Normalize common punctuation such as smart quotes and dashes, with options off or conservative.
Sanitize via CommonMark | Roundtrip HTML through CommonMark to simplify and sanitize the structure.
Strip Formatting Except Bold/Italic | Remove all formatting except bold and italic, preserving paragraph styles and numbering.
Remove Bookmarks | Remove bookmark tags from the DOCX output.
Collapse Empty Runs/Paragraphs | Remove empty runs and paragraphs produced by the conversion process.
Ensure xml:space="preserve" on Text | Ensure that text elements retain spacing by setting xml:space="preserve".
Normalization Profile (JSON) | Advanced JSON configuration for normalization profile to override defaults.
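
Several of these properties correspond naturally to Pandoc command-line options. The sketch below shows one plausible mapping; it is an assumption about how such a node could translate its settings, not a description of its source code. The function `build_option_args` and its parameter names are hypothetical, while `--embed-resources`, `--resource-path`, and `--reference-doc` are existing Pandoc flags.

```python
import os
import shlex


def build_option_args(embed_resources=False, resource_paths=(),
                      reference_docx_path=None, additional_args=""):
    """Translate node-style settings into extra Pandoc arguments."""
    args = []
    if embed_resources:
        args.append("--embed-resources")
    if resource_paths:
        # Pandoc expects ':' between paths on Unix-like systems and ';'
        # on Windows, which is what os.pathsep provides.
        args.append("--resource-path=" + os.pathsep.join(resource_paths))
    if reference_docx_path:
        args.append(f"--reference-doc={reference_docx_path}")
    # Split the free-form argument string with POSIX shell quoting rules,
    # so quoted values containing spaces stay together.
    args.extend(shlex.split(additional_args))
    return args
```

Splitting the additional arguments with `shlex` rather than on whitespace keeps a value such as `--metadata='title:My Report'` as a single argument.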

Output

Binary

The node outputs the generated DOCX file as binary data under the specified binary property name.

JSON

  • error - Error message if the conversion fails for an item.
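
One way to picture the per-item `error` field is a loop that converts each item independently and records a failure message instead of stopping. This is a hedged sketch: the function `convert_items`, the `convert` callback, and the exact item layout are assumptions made for illustration, and the page does not say whether the node continues after a failed item.

```python
def convert_items(items, convert):
    """Apply `convert` to each item; collect DOCX bytes or an error message."""
    results = []
    for item in items:
        try:
            docx_bytes = convert(item)
            results.append({"binary": {"data": docx_bytes}})
        except Exception as exc:
            # Record the failure for this item and keep processing the rest.
            results.append({"json": {"error": str(exc)}})
    return results
```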

Dependencies

  • Requires a local Pandoc executable accessible via the specified path or system PATH.

Troubleshooting

  • If Pandoc is not installed, or Pandoc Path points to the wrong location, the node cannot start the conversion and reports an error. Check that the path is correct or that Pandoc is on the system PATH.
  • If the configured JSON Field does not contain a string, the node throws an error stating that the JSON field is not a string.
  • If the Pandoc process runs longer than the configured Timeout, a timeout error occurs; increasing the Timeout may resolve this.
  • Resource embedding may fail if the installed Pandoc version does not support it; make sure the Pandoc version is compatible.
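
The failure modes listed above can be checked outside the node with a few lines of Python. The sketch below is illustrative only and all function names are hypothetical. It assumes that `pandoc --version` prints a first line such as `pandoc 3.1.2`, and that `--embed-resources` is available from Pandoc 2.19 onward, which matches the Pandoc release notes to the best of my knowledge.

```python
import shutil
import subprocess


def parse_pandoc_version(version_output):
    """Extract a version tuple from the first line of `pandoc --version`."""
    first_line = version_output.splitlines()[0]  # e.g. "pandoc 3.1.2"
    version_text = first_line.split()[1]
    return tuple(int(part) for part in version_text.split("."))


def supports_embed_resources(version):
    """Assumption: --embed-resources exists in Pandoc 2.19 and later."""
    return version >= (2, 19)


def run_with_diagnostics(command, input_bytes=b"", timeout=60):
    """Run a command and turn common failures into readable error messages."""
    if shutil.which(command[0]) is None:
        return {"error": f"executable not found: {command[0]}"}
    try:
        completed = subprocess.run(
            command, input=input_bytes, capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return {"error": f"process exceeded the {timeout}s timeout"}
    if completed.returncode != 0:
        return {"error": completed.stderr.decode("utf-8", "replace").strip()}
    return {"stdout": completed.stdout}
```

Running `run_with_diagnostics(["pandoc", "--version"])` and passing the decoded output to `parse_pandoc_version` gives a quick check of both the installation and the version.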
