PDF4me

Comprehensive PDF and document processing: generate barcodes, convert files, extract data, manipulate images, and automate workflows with the PDF4me API

Overview

The Extract Text By Expression operation extracts text from PDF documents by applying a user-defined regular expression pattern. The PDF can be supplied in three ways: as binary data from a previous node, as a base64-encoded string, or via a URL pointing to the PDF file.

Typical use cases include:

  • Extracting specific information such as percentages, email addresses, or codes embedded in PDFs.
  • Automating data extraction from invoices, reports, or contracts where certain patterns are known.
  • Processing multi-page PDFs selectively by specifying page ranges.

For example, you could extract all email addresses from a PDF invoice or find all occurrences of percentage values within a report.
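As a rough illustration of the kind of patterns involved, the sketch below runs two regular expressions over a made-up string using Python's re module. The sample text and both patterns are assumptions for demonstration only, and the regex flavor used by the PDF4me service may differ from Python's, so test any pattern against the node itself before relying on it.

```python
import re

# Made-up text standing in for the text content of a PDF page.
sample_text = (
    "Invoice 2024-001. Contact billing@example.com or support@example.org. "
    "A discount of 12.5% applies; late fees are 3%."
)

# Illustrative patterns: deliberately simple, not a complete email validator.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
PERCENT_PATTERN = r"\d+(?:\.\d+)?%"

# findall returns every non-overlapping match, in order of appearance.
emails = re.findall(EMAIL_PATTERN, sample_text)
percentages = re.findall(PERCENT_PATTERN, sample_text)

print(emails)
print(percentages)
```

Either pattern string could then be tried in the Expression property of the node.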

Properties

Name | Meaning
Input Data Type | How the PDF file is provided for text extraction. Options: Binary Data (from a previous node), Base64 String (base64-encoded PDF content), URL (link to the PDF file).
Input Binary Field | The name of the binary property that contains the PDF file when Input Data Type is Binary Data. Usually "data".
Base64 PDF Content | The base64-encoded string representing the PDF document content. Used when Input Data Type is Base64 String.
PDF URL | The URL of the PDF file to extract text from. Used when Input Data Type is URL.
Document Name | The name assigned to the document during processing. Defaults to "document.pdf".
Expression | The regular expression pattern to search for within the PDF text. For example, a literal such as % or US, or an email pattern such as [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}.
Page Sequence | Specifies which pages to process. Examples: 1- for all pages starting from page 1; 1,2,3 for specific pages; 1-5 for a range of pages.
Advanced Options | Optional JSON string that specifies custom profiles or additional API options for processing, such as the output data format or other API-specific parameters. See https://dev.pdf4me.com/apiv2/documentation/ for details.
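To make the three Page Sequence forms concrete, here is a minimal sketch of how such a string could be expanded into page numbers. The helper expand_page_sequence is hypothetical and is not part of the node or the PDF4me API; the service interprets the string itself. Clamping open or oversized ranges to total_pages and rejecting malformed parts are assumptions about reasonable behavior, not documented rules.

```python
from typing import List


def expand_page_sequence(sequence: str, total_pages: int) -> List[int]:
    """Expand a string such as "1-", "1,2,3" or "1-5" into page numbers.

    Hypothetical helper for illustration; the real service may treat
    edge cases (whitespace, out-of-range pages) differently.
    """
    pages: List[int] = []
    for part in sequence.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)  # raises ValueError on "" or non-digits
            # An open range like "1-" runs to the last page (assumption).
            end = int(end_text) if end_text else total_pages
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"Invalid page range: {part!r}")
        # Clamp to the document length and skip pages already listed.
        for page in range(start, min(end, total_pages) + 1):
            if page not in pages:
                pages.append(page)
    return pages


print(expand_page_sequence("1-", 4))
print(expand_page_sequence("1,2,3", 10))
print(expand_page_sequence("1-5", 10))
```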

Output

The node outputs JSON data containing the extracted text snippets that match the provided regular expression from the specified pages of the PDF document.

  • The json field includes the matched text results.
  • If the expression matches more than once, all matches are included.
  • The node does not output binary data for this operation; it focuses on textual extraction results.

Dependencies

  • Requires access to the PDF4me API, a third-party PDF processing service.
  • An API key credential or authentication token must be configured in n8n to authorize requests to the external PDF service.
  • Internet access is needed if providing PDF files via URL.

Troubleshooting

  • Common Issues:

    • Incorrect regular expression syntax may cause no matches or errors.
    • Providing an invalid or inaccessible PDF URL will result in failure to fetch the document.
    • Specifying an incorrect binary property name when using binary input will cause the node to fail to locate the PDF file.
    • Page sequence strings must be correctly formatted; otherwise, the node might ignore pages or throw errors.
  • Error Messages and Resolutions:

    • "Failed to fetch PDF from URL": Check the URL accessibility and ensure it points directly to a valid PDF file.
    • "Invalid regular expression": Verify the regex pattern syntax.
    • "Binary property not found": Confirm the binary field name matches the actual binary data property from the previous node.
    • API authentication errors: Ensure the API key or credentials are properly set up in n8n.
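Several of the issues above can be caught locally before the node runs. The sketch below is one possible pre-flight check, not something the node provides: the function preflight_issues and its messages are hypothetical, the Page Sequence grammar is inferred only from the documented examples (1-, 1,2,3, 1-5), and a pattern that compiles in Python is not guaranteed to behave the same way in the service's regex engine.

```python
import re
from typing import List, Optional
from urllib.parse import urlparse

# Grammar inferred from the documented examples; the service may accept more.
PAGE_SEQUENCE_RE = re.compile(r"\d+(?:-\d*)?(?:,\d+(?:-\d*)?)*")


def preflight_issues(expression: str, page_sequence: str,
                     pdf_url: Optional[str] = None) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    issues: List[str] = []

    # Syntax check only, using Python's regex engine as a stand-in.
    try:
        re.compile(expression)
    except re.error as exc:
        issues.append(f"Expression does not compile: {exc}")

    if not PAGE_SEQUENCE_RE.fullmatch(page_sequence.strip()):
        issues.append(f"Page Sequence looks malformed: {page_sequence!r}")

    # Structural URL check only; it does not fetch the document.
    if pdf_url is not None:
        parsed = urlparse(pdf_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            issues.append(f"PDF URL is not an http(s) URL: {pdf_url!r}")

    return issues


print(preflight_issues(r"\d+%", "1-5", "https://example.com/report.pdf"))
print(preflight_issues("(", "1..5", "ftp://example.com/report.pdf"))
```

An empty list only means no obvious formatting problem was found; an unreachable URL or a rejected API key will still surface as an error from the node.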
