DOCX to Text Enhanced

Converts a DOCX file to text, with page-aware chunking for RAG workflows

Overview

This node converts DOCX files from binary input into text output with multiple formatting and structuring options. It supports three output modes:

  • Text Only (Legacy): Extracts plain text without any additional metadata or structure.
  • Enhanced with Metadata: Extracts text along with document metadata (title, author, page count, etc.) and a structural summary that lists headings and an estimated page count. Optionally includes an HTML representation to better preserve formatting.
  • RAG-Ready Chunks: Splits the document text into page-aware chunks suitable for Retrieval-Augmented Generation (RAG) systems, using the configured chunk size and overlap, and returns them together with the metadata and structure from the enhanced mode.

Typical use cases include:

  • Quickly extracting raw text from DOCX files for indexing or search.
  • Obtaining structured document data with metadata and headings for content analysis or display.
  • Preparing documents as chunked inputs for AI workflows that require context windows, such as question answering or summarization using RAG techniques.

Properties

Name | Meaning
Input Binary Field | The name of the binary input field containing the DOCX file to process.
Output Mode | The format of the output: Text Only (Legacy), Enhanced with Metadata, or RAG-Ready Chunks.
Destination Output Field | (Only for Text Only mode) The name of the JSON output field where the extracted plain text will be stored.
Include HTML | (For Enhanced and RAG modes) Whether to include an HTML version of the document for better structure preservation.
Chunk Size (Words) | (For RAG mode) Target number of words per chunk when splitting the document text.
Chunk Overlap (Words) | (For RAG mode) Number of overlapping words between consecutive chunks to maintain context continuity.
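
To illustrate how Chunk Size (Words) and Chunk Overlap (Words) interact, here is a minimal sketch of word-based chunking with overlap. It is not the node's actual implementation: the function name chunkByWords, the WordChunk type, and the choice to treat position as word offsets with an exclusive end are assumptions made for this example, and page and section tracking are left out.

```typescript
// Hypothetical shape, loosely modeled on the documented "chunks" entries.
interface WordChunk {
  content: string;
  // Word offsets into the text; "end" is exclusive (an assumption).
  position: { start: number; end: number };
  chunkIndex: number;
}

// Split text into chunks of at most `chunkSize` words, where each chunk
// repeats the last `chunkOverlap` words of the previous one.
function chunkByWords(
  text: string,
  chunkSize: number,
  chunkOverlap: number,
): WordChunk[] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new Error("chunkSize must be a positive integer");
  }
  if (
    !Number.isInteger(chunkOverlap) ||
    chunkOverlap < 0 ||
    chunkOverlap >= chunkSize
  ) {
    throw new Error("chunkOverlap must be an integer in [0, chunkSize)");
  }

  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = chunkSize - chunkOverlap; // how far each chunk start advances
  const chunks: WordChunk[] = [];

  for (let start = 0; start < words.length; start += step) {
    const end = Math.min(start + chunkSize, words.length);
    chunks.push({
      content: words.slice(start, end).join(" "),
      position: { start, end },
      chunkIndex: chunks.length,
    });
    // Stop once a chunk reaches the end of the text, so no trailing
    // chunk consisting only of overlap words is produced.
    if (end === words.length) break;
  }
  return chunks;
}
```

Under these assumptions, each chunk starts chunkSize minus chunkOverlap words after the previous one, so a 1,250-word text with a chunk size of 300 and an overlap of 50 produces 5 chunks. An overlap equal to or larger than the chunk size is rejected because the chunk start would never advance.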

Output

The node outputs JSON objects with different structures depending on the selected output mode:

  • Text Only (Legacy):

    {
      "<destinationOutputField>": "extracted plain text string"
    }
    

    A simple key-value pair where the value is the raw extracted text.

  • Enhanced with Metadata:

    {
      "text": "extracted plain text string",
      "html": "<optional HTML string if enabled>",
      "metadata": {
        "title": "document title",
        "author": "document author",
        "subject": "document subject",
        "description": "document description",
        "created": "creation date",
        "modified": "modification date",
        "wordCount": 1234,
        "pageCount": 10
      },
      "structure": {
        "headings": ["list", "of", "headings"],
        "sections": 3,
        "estimatedPages": 10
      }
    }
    

    Provides rich information about the document content and structure.

  • RAG-Ready Chunks:
    Extends the enhanced output by adding:

    {
      ...enhanced output fields...,
      "chunks": [
        {
          "content": "chunk text",
          "pageStart": 1,
          "pageEnd": 2,
          "section": "heading associated with chunk",
          "position": { "start": 0, "end": 300 },
          "chunkIndex": 0
        },
        ...
      ],
      "chunkingConfig": {
        "chunkSize": 300,
        "chunkOverlap": 50,
        "totalChunks": 5
      }
    }
    

    This splits the text into overlapping chunks aligned with pages and sections, useful for downstream AI processing.
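
As a rough sketch of how a downstream step might consume the RAG-Ready Chunks output, the TypeScript below types a subset of the documented fields and uses pageStart and pageEnd to pick the chunks that cover a given page. The interface names, the helper functions chunksForPage and citation, and the assumption that every field is always present are illustrative only and are not part of the node.

```typescript
// Hypothetical types covering a subset of the documented output fields.
interface DocxChunk {
  content: string;
  pageStart: number;
  pageEnd: number;
  section: string;
  position: { start: number; end: number };
  chunkIndex: number;
}

interface RagOutput {
  text: string;
  html?: string; // only present when Include HTML is enabled
  metadata: { title: string; author: string; wordCount: number; pageCount: number };
  structure: { headings: string[]; sections: number; estimatedPages: number };
  chunks: DocxChunk[];
  chunkingConfig: { chunkSize: number; chunkOverlap: number; totalChunks: number };
}

// Return every chunk whose page range includes `page`.
function chunksForPage(output: RagOutput, page: number): DocxChunk[] {
  return output.chunks.filter((c) => c.pageStart <= page && page <= c.pageEnd);
}

// Build a short human-readable source label for a chunk, for example
// to attach to an answer generated from that chunk.
function citation(output: RagOutput, chunk: DocxChunk): string {
  const title = output.metadata.title.trim() || "(untitled)";
  const pages =
    chunk.pageStart === chunk.pageEnd
      ? `p. ${chunk.pageStart}`
      : `pp. ${chunk.pageStart}-${chunk.pageEnd}`;
  return `${title}, ${pages}, section "${chunk.section}"`;
}
```

Because overlapping chunks can span a page boundary, a single page may be covered by more than one chunk, which is why chunksForPage returns a list.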

The node does not output binary data.

Dependencies

  • Uses the mammoth library to extract raw text and convert DOCX to HTML.
  • Uses jszip to read DOCX internal XML files for metadata extraction.
  • Uses cheerio to parse XML and HTML content for metadata and structure extraction.
  • Requires the input DOCX file to be provided as binary data in the specified input field.

No external API keys or services are required. All processing is done locally within the node.

Troubleshooting

  • No binary data found for field "X": Ensure the Input Binary Field value exactly matches the name of the binary field that holds the DOCX file. The node cannot proceed without valid binary input.
  • Error processing DOCX file: ...: May indicate a corrupted or unsupported DOCX file. Verify the file's integrity and format.
  • Unknown output mode: ...: Check that the output mode parameter is set to one of the supported values (textOnly, enhanced, ragChunks).
  • Large DOCX files may increase processing time; consider chunking or limiting input size if performance issues arise.
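
When the output mode is set programmatically (for example through an expression), validating it up front can make the "Unknown output mode" case easier to catch. The sketch below checks a value against the three mode names listed above. The function name parseOutputMode and the exact wording of the thrown message are assumptions for illustration, not the node's own code.

```typescript
// The three supported values named in the troubleshooting list.
const OUTPUT_MODES = ["textOnly", "enhanced", "ragChunks"] as const;
type OutputMode = (typeof OUTPUT_MODES)[number];

// Hypothetical helper: return the value as a typed mode, or throw
// if it is not one of the supported names (comparison is case-sensitive).
function parseOutputMode(value: string): OutputMode {
  const match = OUTPUT_MODES.find((mode) => mode === value);
  if (match === undefined) {
    throw new Error(`Unknown output mode: ${value}`);
  }
  return match;
}
```

Note that the display labels such as "RAG-Ready Chunks" are not accepted by this check; only the internal values are.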
