PDF4me

Comprehensive PDF and document processing: generate barcodes, convert files, extract data, manipulate images, and automate workflows with the PDF4ME API

Overview

The Extract Text By Expression operation extracts text from a PDF document that matches a user-provided regular expression pattern. The PDF can be supplied in three ways: as binary data from a previous node, as a base64-encoded string, or via a URL pointing to the PDF file.

Typical use cases include:

  • Extracting specific information such as percentages, email addresses, or custom patterns from PDF reports or invoices.
  • Automating data extraction workflows where only certain text snippets matching a pattern are needed.
  • Processing multi-page PDFs with control over which pages to scan for the text.

For example, you could extract all email addresses from a contract PDF by providing an appropriate regex pattern and specifying the page range to search.
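Before configuring the node, it can help to try a candidate pattern against sample text locally. The sketch below uses Python's built-in re module; the pattern, the helper name find_matches, and the sample text are illustrative assumptions, and the regex engine behind the PDF4me API may differ from Python's in some details.

```python
import re

# Illustrative pattern only (an assumption, not taken from the PDF4me docs);
# real-world email matching has many edge cases this does not cover.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"


def find_matches(pattern, text):
    """Return every non-overlapping match of pattern in text, in order."""
    return [m.group(0) for m in re.finditer(pattern, text)]


# Hypothetical text standing in for the content of a contract PDF.
sample_text = (
    "This agreement is between Alice (alice@example.com) "
    "and Bob (bob.smith@example.org). Discount: 15%."
)

emails = find_matches(EMAIL_PATTERN, sample_text)
print(emails)  # ['alice@example.com', 'bob.smith@example.org']
```

A pattern that behaves as expected here is a reasonable starting value for the Expression property, though the results should still be checked against the node's actual output.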

Properties

  • Input Data Type: How the PDF is provided:

    • Binary Data (from a previous node)
    • Base64 String (direct content)
    • URL (link to the PDF file)
  • Input Binary Field: The name of the binary property containing the PDF file when using the Binary Data input type (usually "data").
  • Base64 PDF Content: The base64-encoded string of the PDF document content when using the Base64 String input type.
  • PDF URL: The URL of the PDF file when using the URL input type.
  • Document Name: The name assigned to the document during processing (default "document.pdf").
  • Expression: The regular expression pattern used to search for matching text within the PDF. Examples: %, US, email@example.com.
  • Page Sequence: Specifies which pages to process:

    • "1-" means all pages starting from page 1
    • "1,2,3" means pages 1, 2, and 3
    • "1-5" means pages 1 through 5
  • Advanced Options: Optional JSON string specifying custom profiles or additional API options for processing, such as the output data format or other API-specific parameters. See https://dev.pdf4me.com/apiv2/documentation/ for details.
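To make the Page Sequence formats concrete, here is a minimal sketch that expands a sequence string into explicit page numbers. The function expand_page_sequence is hypothetical and runs locally; it is one reading of the three formats listed above, not the PDF4me implementation. Dropping pages beyond the end of the document is an assumption made for illustration.

```python
def expand_page_sequence(sequence, page_count):
    """Expand a page sequence string into sorted, unique 1-based page numbers.

    Hypothetical helper: mirrors the documented formats "1-", "1,2,3" and "1-5".
    """
    pages = set()
    for part in sequence.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            # An open-ended range such as "1-" runs to the last page.
            end = int(end_text) if end_text.strip() else page_count
        else:
            start = end = int(part)
        # Assumption: pages outside the document are silently dropped.
        pages.update(p for p in range(start, end + 1) if 1 <= p <= page_count)
    return sorted(pages)


print(expand_page_sequence("1-", 4))      # [1, 2, 3, 4]
print(expand_page_sequence("1,2,3", 10))  # [1, 2, 3]
print(expand_page_sequence("1-5", 3))     # [1, 2, 3]
```

Under this reading, a sequence that names only pages past the end of the document expands to an empty list, which is one way a request can come back with no matches.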

Output

The node outputs an array of JSON objects, each representing the extracted text results for the corresponding input item.

  • The main output field is json, which contains the extracted text snippets that matched the provided regular expression.
  • If the pattern matches more than once, all matches are included in the output.
  • The output does not include binary data; it focuses on textual extraction results.

Dependencies

  • Requires access to the external PDF4me API endpoints, which perform the extraction.
  • Needs API authentication (an API key or token) configured in n8n credentials.
  • Needs internet access to reach the API; a PDF provided via URL must also be accessible at that URL.

Troubleshooting

  • Common issues:

    • Providing an invalid or inaccessible PDF URL will cause failures.
    • Incorrect base64 encoding or corrupted binary data will result in errors.
    • Invalid regular expression syntax can cause the extraction to fail or return no results.
    • Specifying an incorrect page sequence (e.g., pages out of range) may lead to empty output.
  • Error messages:

    • Authentication errors usually indicate missing or invalid API credentials; network errors usually indicate that the API or the PDF URL could not be reached.
    • Parsing errors often point to malformed input data or unsupported PDF formats.
    • Regex-related errors suggest checking the pattern syntax.
  • Resolutions:

    • Verify the PDF source and ensure it is accessible and correctly formatted.
    • Test the regular expression independently before using it in the node.
    • Confirm API credentials and permissions are correctly set up in n8n.
    • Use the page sequence format as described to avoid invalid page references.
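Some of the checks above can be run locally before the node is executed. The sketch below assumes Python's standard library only; the helper names check_base64_pdf and check_expression are hypothetical. It checks that a base64 string decodes cleanly, that the decoded bytes begin with the %PDF- header that PDF files start with, and that the pattern compiles in Python's regex engine, which may not match the engine used by the API.

```python
import base64
import binascii
import re


def check_base64_pdf(content):
    """Return a problem description, or None if the string looks like a PDF."""
    try:
        # validate=True rejects characters outside the base64 alphabet.
        raw = base64.b64decode(content, validate=True)
    except (binascii.Error, ValueError):
        return "content is not valid base64"
    if not raw.startswith(b"%PDF-"):
        return "decoded content does not start with the %PDF- header"
    return None


def check_expression(pattern):
    """Return a problem description, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return "invalid regular expression: {}".format(exc)
    return None


# Hypothetical inputs for illustration.
good_pdf = base64.b64encode(b"%PDF-1.7\n% sample bytes").decode("ascii")
print(check_base64_pdf(good_pdf))         # None
print(check_base64_pdf("not base64!!"))   # content is not valid base64
print(check_expression(r"\d+%"))          # None
print(check_expression("(unclosed") is not None)  # True
```

These checks only catch malformed input early. A file that passes them can still be rejected by the API, for example if the PDF is corrupted further into the file.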
