Read Doc

Read content from doc/docx files

Overview

The Read Doc node reads and extracts content from Microsoft Word files in .doc or .docx format. It is useful when an n8n workflow needs to process or analyze text stored in Word documents. For example, you can extract plain text for indexing, searching, or further text processing, or extract HTML-formatted content (.docx files only) to preserve styling and structure.

Common scenarios include:

Automating document content extraction for data ingestion.
Converting uploaded Word documents into plain text or HTML for display or storage.
Counting words and characters in documents as part of content analysis workflows.

Properties

Name	Meaning
Binary Property	Name of the binary property that contains the document file data to be read.
Output Format	Format of the extracted content. `Plain Text`: extract only the text content. `HTML`: extract content with HTML formatting (`.docx` files only).
Additional Options	Collection of extra options. Include Style Info (boolean): whether to include style information in the HTML output (applies only to `.docx` HTML output).

Output

The node outputs an array of items where each item contains:

json:
- content: The extracted document content as a string, either plain text or HTML depending on the selected output format.
- fileName: Original file name of the document.
- fileType: Either "docx" or "doc", indicating the document type.
- format: The output format used ("text" or "html").
- metadata: Additional metadata such as messages from the extraction library.
- wordCount: Number of words in the extracted content.
- characterCount: Number of characters in the extracted content.
binary: The original binary data from the input item is preserved if present.
pairedItem: Index linking back to the original input item.

No new binary data is created by this node; it only processes and outputs JSON content extracted from the input binary document.
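
The field list above can be sketched as a TypeScript type, together with one plausible way to derive the two counters. This is a minimal illustration, not the node's source: the names `ReadDocJson`, `countWords`, and `buildJson` are hypothetical, and the whitespace-based word count is an assumption, since the exact counting rule is not documented here.

```typescript
// Hypothetical shape of one output item's `json` field, based on the field list above.
interface ReadDocJson {
  content: string;
  fileName: string;
  fileType: "docx" | "doc";
  format: "text" | "html";
  metadata: Record<string, unknown>;
  wordCount: number;
  characterCount: number;
}

// Assumption: a word is a run of non-whitespace characters.
// The node itself may count words differently.
function countWords(content: string): number {
  const trimmed = content.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Assemble the JSON payload from extracted content and library messages.
function buildJson(
  content: string,
  fileName: string,
  fileType: "docx" | "doc",
  format: "text" | "html",
  messages: unknown[] = [],
): ReadDocJson {
  return {
    content,
    fileName,
    fileType,
    format,
    metadata: { messages },
    wordCount: countWords(content),
    characterCount: content.length,
  };
}
```

Note that with HTML output a count like this would include markup characters, so the counters are most meaningful for plain-text extraction.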

Dependencies

Uses the mammoth library to extract content from .docx files.
Uses the textract library (via promisified fromBufferWithName) to extract content from .doc files.
Requires the input document to be provided as binary data in the specified binary property.
No external API keys or services are required.
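
How the two libraries could be combined is sketched below, under stated assumptions. The extractors are injected through a hypothetical `Extractors` interface so the example runs without either package installed. In a real node, `docxToText` and `docxToHtml` would presumably wrap mammoth's `extractRawText` and `convertToHtml`, and `docToText` would wrap textract's callback-style `fromBufferWithName`. The function name `extract` and the error message are illustrative only.

```typescript
type FileType = "docx" | "doc";
type OutputFormat = "text" | "html";

// Result shape modeled on what mammoth resolves with: a value plus messages.
interface ExtractResult {
  value: string;
  messages: unknown[];
}

// Hypothetical injection points standing in for the real libraries.
interface Extractors {
  docxToText(data: Uint8Array): Promise<ExtractResult>;
  docxToHtml(data: Uint8Array): Promise<ExtractResult>;
  // Callback-style, like textract's fromBufferWithName(name, buffer, callback).
  docToText(
    name: string,
    data: Uint8Array,
    callback: (error: Error | null, text?: string) => void,
  ): void;
}

async function extract(
  extractors: Extractors,
  fileName: string,
  fileType: FileType,
  format: OutputFormat,
  data: Uint8Array,
): Promise<ExtractResult> {
  if (fileType === "docx") {
    return format === "html"
      ? extractors.docxToHtml(data)
      : extractors.docxToText(data);
  }
  // .doc files: only plain text is available.
  if (format === "html") {
    throw new Error("HTML output is only supported for .docx files");
  }
  // Wrap the callback-style API in a promise (a hand-rolled "promisify").
  const text = await new Promise<string>((resolve, reject) => {
    extractors.docToText(fileName, data, (error, result) => {
      if (error) reject(error);
      else resolve(result ?? "");
    });
  });
  return { value: text, messages: [] };
}
```

Injecting the extractors also makes the dispatch logic easy to test with fakes, independently of any real Word file.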

Troubleshooting

Unsupported file type error: Occurs if the input file is not .doc or .docx. Ensure the binary data corresponds to a supported Word document.
Failed to extract content errors: May happen if the document is corrupted or unreadable. Verify the integrity of the input file.
Missing binary data: If the specified binary property does not exist or is empty, the node will throw an error. Confirm the correct binary property name is set and that the input contains valid binary data.
HTML output limited to .docx: Selecting HTML output for .doc files will cause an unsupported operation error since HTML extraction is only implemented for .docx.

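If a problem persists, confirm that the Binary Property name matches the property on the incoming item, that the file name ends in .doc or .docx, and that the selected Output Format is supported for that file type.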
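
The failure cases listed in this section can be checked up front, before any extraction runs. The following is a rough sketch under assumptions: `BinaryEntry` is a simplified stand-in for an n8n binary entry, the helper names `detectFileType` and `preflight` are hypothetical, and detecting the type from the file extension is a guess at how the node decides, not a documented fact. The error messages are illustrative.

```typescript
// Simplified, hypothetical view of one binary entry on an input item.
interface BinaryEntry {
  fileName?: string;
  data: string; // base64-encoded payload
}
type BinaryMap = Record<string, BinaryEntry | undefined>;

// Assumption: the file type is derived from the file extension.
function detectFileType(fileName: string): "docx" | "doc" {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  if (ext === "docx" || ext === "doc") return ext;
  throw new Error(`Unsupported file type: "${fileName}"`);
}

// Run the checks that correspond to the errors described above.
function preflight(
  binary: BinaryMap | undefined,
  property: string,
  format: "text" | "html",
): "docx" | "doc" {
  const entry = binary?.[property];
  if (!entry || entry.data === "") {
    throw new Error(`No binary data found in property "${property}"`);
  }
  const fileType = detectFileType(entry.fileName ?? "");
  if (fileType === "doc" && format === "html") {
    throw new Error("HTML output is only supported for .docx files");
  }
  return fileType;
}
```

For example, `preflight(item.binary, "data", "html")` would throw for a file named `letter.doc` but return `"docx"` for `letter.docx`, assuming `item.binary.data` holds the file.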

Links and References

Mammoth.js GitHub Repository – Used for extracting .docx content.
Textract GitHub Repository – Used for extracting .doc content.