Overview
The Read Doc node reads and extracts content from Microsoft Word document files in .doc or .docx formats. It is useful when you want to process or analyze text data stored in Word documents within an n8n workflow. For example, you can extract plain text for indexing, searching, or further text processing, or extract HTML-formatted content (for .docx files) to preserve styling and structure.
Common scenarios include:
- Automating document content extraction for data ingestion.
- Converting uploaded Word documents into plain text or HTML for display or storage.
- Counting words and characters in documents as part of content analysis workflows.
Properties
| Name | Meaning |
|---|---|
| Binary Property | Name of the binary property that contains the document file data to be read. |
| Output Format | Format of the extracted content. Plain Text: extracts only the text content. HTML: extracts content with HTML formatting (.docx files only). |
| Additional Options | Collection of extra options. Include Style Info (boolean): whether to include style information in the HTML output (applies only to .docx HTML output). |
Output
The node outputs an array of items where each item contains:
- json:
  - content: The extracted document content as a string, either plain text or HTML depending on the selected output format.
  - fileName: Original file name of the document.
  - fileType: Either "docx" or "doc", indicating the document type.
  - format: The output format used ("text" or "html").
  - metadata: Additional metadata such as messages from the extraction library.
  - wordCount: Number of words in the extracted content.
  - characterCount: Number of characters in the extracted content.
- binary: The original binary data from the input item is preserved if present.
- pairedItem: Index linking back to the original input item.
No new binary data is created by this node; it only processes and outputs JSON content extracted from the input binary document.
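As a rough illustration of the json fields listed above, the following TypeScript sketch models the payload and fills in the two counters. The names ReadDocJson, countWords, countCharacters, and buildJson are hypothetical, and the counting rules (whitespace-separated words, string length for characters) are assumptions, since the node's exact rules are not stated here.

```typescript
// Shape of the `json` part of one output item, following the documented field list.
interface ReadDocJson {
  content: string;
  fileName: string;
  fileType: "docx" | "doc";
  format: "text" | "html";
  metadata: Record<string, unknown>;
  wordCount: number;
  characterCount: number;
}

// Assumption: a word is a run of non-whitespace characters.
function countWords(content: string): number {
  const trimmed = content.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Assumption: characters are counted as UTF-16 code units (String.length).
function countCharacters(content: string): number {
  return content.length;
}

// Hypothetical helper that assembles the json payload for one item.
function buildJson(
  content: string,
  fileName: string,
  fileType: "docx" | "doc",
  format: "text" | "html",
  messages: unknown[] = [],
): ReadDocJson {
  return {
    content,
    fileName,
    fileType,
    format,
    metadata: { messages },
    wordCount: countWords(content),
    characterCount: countCharacters(content),
  };
}
```

With HTML output, counters computed this way would include markup, so a workflow that needs prose-only counts may prefer the Plain Text format.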
Dependencies
- Uses the mammoth library to extract content from .docx files.
- Uses the textract library (via a promisified fromBufferWithName) to extract content from .doc files.
- Requires the input document to be provided as binary data in the specified binary property.
- No external API keys or services are required.
Troubleshooting
- Unsupported file type error: Occurs if the input file is not .doc or .docx. Ensure the binary data corresponds to a supported Word document.
- Failed to extract content errors: May happen if the document is corrupted or unreadable. Verify the integrity of the input file.
- Missing binary data: If the specified binary property does not exist or is empty, the node will throw an error. Confirm the correct binary property name is set and that the input contains valid binary data.
- HTML output limited to .docx: Selecting HTML output for .doc files causes an unsupported operation error because HTML extraction is only implemented for .docx.
To resolve issues, check the input binary data, file extensions, and node configuration properties carefully.
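The checks behind these errors can be approximated with a small pre-flight function, shown here as a sketch rather than the node's actual code. The names detectFileType and checkInput, the error messages, and the assumption that the file type is inferred from the file name extension are all hypothetical.

```typescript
type FileType = "docx" | "doc";
type OutputFormat = "text" | "html";

// Assumption: the type is inferred from the file name extension, case-insensitively.
function detectFileType(fileName: string): FileType {
  const lower = fileName.toLowerCase();
  if (lower.endsWith(".docx")) return "docx";
  if (lower.endsWith(".doc")) return "doc";
  throw new Error(`Unsupported file type: ${fileName}`);
}

// Hypothetical pre-flight check mirroring the troubleshooting list:
// missing binary data, unsupported extension, and HTML requested for .doc.
function checkInput(
  binary: Record<string, { fileName?: string; data?: string }> | undefined,
  propertyName: string,
  format: OutputFormat,
): FileType {
  const entry = binary?.[propertyName];
  if (!entry || !entry.data) {
    throw new Error(`No binary data found in property "${propertyName}"`);
  }
  const fileType = detectFileType(entry.fileName ?? "");
  if (fileType === "doc" && format === "html") {
    throw new Error("HTML output is only supported for .docx files");
  }
  return fileType;
}
```

Running a check like this in a preceding Code node, or simply inspecting the item's binary property and file name, can narrow down which of the three conditions triggered the failure.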
Links and References
- Mammoth.js GitHub Repository – Used for extracting .docx content.
- Textract GitHub Repository – Used for extracting .doc content.