PDF4me

Comprehensive PDF and document processing: generate barcodes, convert files, extract data, manipulate images, and automate workflows with the PDF4me API.

Overview

This node operation extracts text content from Word documents. It supports multiple input methods for the Word file, including binary data from a previous node, a base64 encoded string, or a URL pointing to the document. Users can specify page ranges to extract text from specific pages only and apply options such as removing comments, headers/footers, and accepting tracked changes.

This node is beneficial in scenarios where automated processing of Word documents is needed, such as extracting textual data for indexing, analysis, or integration with other systems without manual intervention. For example, it can be used to extract contract clauses from uploaded contracts, parse reports for key information, or convert Word documents into plain text for further processing.

Properties

Name: Meaning
Input Data Type: How the Word file is provided:
- Binary Data (from previous node)
- Base64 String (directly provide base64 encoded content)
- URL (link to the Word file)
Input Binary Field: Name of the binary property containing the Word file (default "data"). Used when Input Data Type is Binary Data.
Base64 Word Content: Base64 encoded content of the Word document. Used when Input Data Type is Base64 String.
Word URL: URL of the Word file to extract text from. Used when Input Data Type is URL.
Document Name: Name assigned to the document during processing (default "document.docx").
Start Page Number: First page to extract text from (default 1).
End Page Number: Last page to extract text from (default 3).
Extraction Options: Collection of boolean options to customize extraction:
- Remove Comments: Whether to remove comments from extracted text (default true)
- Remove Header/Footer: Whether to remove headers and footers (default true)
- Accept Changes: Whether to accept tracked changes (default true)
Advanced Options: A JSON string of additional profile settings that adjust custom properties of the API call, e.g., output format customization.
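
For the Base64 String input type, the document has to be encoded before it reaches the Base64 Word Content property. The following is a minimal Python sketch of that step, assuming the file is read outside the node. The helper names `encode_word_bytes` and `encode_word_file` are illustrative only and are not part of the node or the PDF4me API.

```python
import base64
from pathlib import Path


def encode_word_bytes(raw: bytes) -> str:
    """Return standard base64 text (single line, no whitespace) for raw bytes."""
    return base64.b64encode(raw).decode("ascii")


def encode_word_file(path: str) -> str:
    """Read a Word file from disk and return its base64 representation.

    The path is whatever local file you want to submit; nothing here
    is specific to the node itself.
    """
    return encode_word_bytes(Path(path).read_bytes())
```

The resulting string contains no line breaks or surrounding whitespace, which avoids the "extra characters" problem described under Troubleshooting.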

Output

The node outputs JSON data containing the extracted text from the specified pages of the Word document. The structure typically includes the plain text content extracted after applying the selected extraction options. If the document contains multiple pages, the output may include concatenated or segmented text according to the page range specified.

No binary output is produced by this operation; the focus is on textual content extraction.

Dependencies

  • Requires access to the PDF4me API, the external service that performs the Word document parsing and text extraction.
  • An API authentication token or key must be configured in n8n credentials to authorize requests to that service.
  • Network access is required when using the URL input type, so that the Word document can be fetched.

Troubleshooting

  • Common Issues:

    • Invalid or inaccessible URL when using URL input type results in failure to download the document.
    • Incorrect binary property name leads to missing input data errors.
    • Providing malformed base64 content causes decoding errors.
    • Specifying invalid page ranges (e.g., start page greater than end page) may cause unexpected results or errors.
  • Error Messages & Resolutions:

    • "Input binary property not found": Verify the binary field name matches the actual binary data property from the previous node.
    • "Failed to fetch document from URL": Check URL accessibility and correctness.
    • "Invalid base64 content": Ensure the base64 string is properly encoded without extra characters.
    • "Page range out of bounds": Adjust start and end page numbers within the document's actual page count.
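
Several of the issues above can be caught before the node runs, for example in a preceding step of the workflow. The sketch below is one possible pre-flight check in Python. The function name `check_inputs` and the input type labels `"url"` and `"base64"` are assumptions made for illustration, not identifiers used by the node. It cannot confirm that a URL is actually reachable or that the page range fits the document's real page count.

```python
import base64
import binascii
from urllib.parse import urlparse


def check_inputs(input_type, value, start_page=1, end_page=3):
    """Return a list of problems found; an empty list means no obvious issue."""
    problems = []
    if input_type == "url":
        parsed = urlparse(value)
        # Only the shape of the URL is checked, not whether it can be downloaded.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("URL must be an absolute http(s) address")
    elif input_type == "base64":
        try:
            # validate=True rejects characters outside the base64 alphabet.
            base64.b64decode(value, validate=True)
        except (binascii.Error, ValueError):
            problems.append("base64 content is malformed")
    # The upper bound depends on the document and cannot be checked here.
    if start_page < 1 or end_page < start_page:
        problems.append("page range must satisfy 1 <= start <= end")
    return problems
```

For example, `check_inputs("base64", content, start_page=5, end_page=2)` reports the reversed page range without sending anything to the external service.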

Links and References

  • PDF4me API Documentation — Reference for advanced profile options and API capabilities related to document processing.
