Actions (80)
- Add Attachment To PDF
- Add Barcode To PDF
- Add Form Fields To PDF
- Add HTML Header Footer
- Add Image Stamp To PDF
- Add Image Watermark To Image
- Add Margin To PDF
- Add Page Number To PDF
- Add Text Stamp To PDF
- Add Text Watermark To Image
- AI-Invoice Parser
- AI-Process Contract
- AI-Process HealthCard
- Classify Document
- Compress Image
- Compress PDF
- Convert HTML To PDF
- Convert Image Format
- Convert JSON To Excel
- Convert Markdown To PDF
- Convert PDF To Editable PDF Using OCR
- Convert PDF To Excel
- Convert PDF To PowerPoint
- Convert PDF To Word
- Convert To PDF
- Convert URL to PDF
- Convert VISIO
- Convert Word to PDF Form
- Create Images From PDF
- Create PDF/A
- Create Swiss QR Bill
- Crop Image
- Delete Blank Pages From PDF
- Delete Unwanted Pages From PDF
- Disable Tracking Changes In Word
- Enable Tracking Changes In Word
- Extract Attachment From PDF
- Extract Form Data From PDF
- Extract Pages From PDF
- Extract Resources
- Extract Table From PDF
- Extract Text By Expression
- Extract Text From Word
- Fill PDF Form
- Find And Replace Text
- Flip Image
- Flatten PDF
- Generate Barcode
- Generate Document Single
- Generate Documents Multiple
- Get Document From Pdf4me
- Get Image Metadata
- Get PDF Metadata
- Get Tracking Changes In Word
- Image Extract Text
- Linearize PDF
- Merge Multiple PDFs
- Overlay PDFs
- Parse Document
- Protect PDF
- Read Barcode From Image
- Read Barcode From PDF
- Read SwissQR Code
- Remove EXIF Tags From Image
- Repair PDF Document
- Replace Text With Image
- Replace Text With Image In Word
- Resize Image
- Rotate Document
- Rotate Image
- Rotate Image By EXIF Data
- Rotate PDF Page
- Sign PDF
- Split PDF By Barcode
- Split PDF By Swiss QR
- Split PDF By Text
- Split PDF Regular
- Unlock PDF
- Update Hyperlinks Annotation
- Upload File To PDF4me
Overview
This node operation, Extract Text By Expression, extracts text from a PDF document based on a user-provided regular expression pattern. It supports multiple ways to provide the PDF input: as binary data from a previous node, as a base64-encoded string, or via a URL pointing to the PDF file.
Typical use cases include:
- Extracting specific information such as percentages, email addresses, or custom patterns from PDF reports or invoices.
- Automating data extraction workflows where only certain text snippets matching a pattern are needed.
- Processing multi-page PDFs with control over which pages to scan for the text.
For example, you could extract all email addresses from a contract PDF by providing an appropriate regex pattern and specifying the page range to search.
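A pattern like the one in this example can be tried out locally before it goes into the Expression field. The sketch below uses Python's `re` module on a sample string. The sample text, the `find_matches` helper, and the simplified email pattern are illustrative assumptions, and the regex engine behind the PDF4me API may differ from Python's in some syntax details.

```python
import re

# Simplified email pattern for illustration; it is not a full address validator.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

# Hypothetical text standing in for the content of a contract PDF.
sample_text = (
    "Contact the supplier at sales@example.com or the buyer at "
    "legal.team@example.org for questions about clause 4."
)


def find_matches(pattern: str, text: str) -> list[str]:
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)


matches = find_matches(EMAIL_PATTERN, sample_text)
print(matches)  # ['sales@example.com', 'legal.team@example.org']
```

Once the pattern returns the expected matches on representative text, the same pattern string can be pasted into the node.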
Properties
| Name | Meaning |
|---|---|
| Input Data Type | How the PDF is provided: • Binary Data (from previous node) • Base64 String (direct content) • URL (link to PDF file) |
| Input Binary Field | The name of the binary property containing the PDF file when using Binary Data input type (usually "data"). |
| Base64 PDF Content | The base64 encoded string of the PDF document content when using Base64 String input type. |
| PDF URL | The URL to the PDF file when using URL input type. |
| Document Name | The name assigned to the document during processing (default "document.pdf"). |
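| Expression | The regular expression pattern used to search for matching text within the PDF. Examples: %, US, email@example.com. In a regular expression an unescaped "." matches any character, so write "\." to match a literal dot. |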
| Page Sequence | Specifies which pages to process: • "1-" means all pages starting from page 1 • "1,2,3" means pages 1, 2, and 3 • "1-5" means pages 1 through 5 |
| Advanced Options | Optional JSON string that specifies custom profiles or additional API options for processing, for example the output data format or other API-specific parameters. See https://dev.pdf4me.com/apiv2/documentation/ for details. |
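The three Page Sequence formats in the table can be checked against a known page count before running the node. The following is a minimal local sketch, not the API's own parser: the function name `expand_page_sequence` is hypothetical, and dropping out-of-range pages is an assumption about how such input might be treated.

```python
def expand_page_sequence(sequence: str, page_count: int) -> list[int]:
    """Expand a sequence such as "1-", "1,2,3" or "1-5" into page numbers.

    Pages outside 1..page_count are dropped (an assumption; the API may
    handle them differently). Malformed parts raise ValueError.
    """
    pages: list[int] = []
    for part in sequence.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            # An open-ended range like "1-" runs to the last page.
            end = int(end_text) if end_text.strip() else page_count
        else:
            start = end = int(part)
        for page in range(start, end + 1):
            if 1 <= page <= page_count and page not in pages:
                pages.append(page)
    return pages


print(expand_page_sequence("1-", 4))     # [1, 2, 3, 4]
print(expand_page_sequence("1,2,3", 10)) # [1, 2, 3]
print(expand_page_sequence("1-5", 3))    # [1, 2, 3]
```

An empty result from this helper is a hint that the sequence points entirely outside the document.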
Output
The node outputs an array of JSON objects, each representing the extracted text results for the corresponding input item.
- The main output field is `json`, which contains the extracted text snippets that matched the provided regular expression.
- If multiple matches are found, they will be included in the output accordingly.
- The output does not include binary data; it focuses on textual extraction results.
Dependencies
- Requires access to the external PDF4me API, a third-party PDF processing service that performs the extraction.
- Needs proper API authentication configured in n8n credentials (an API key or token).
- Network access to the API is required for every call; when the PDF is provided via URL, that URL must also be reachable.
Troubleshooting
Common issues:
- Providing an invalid or inaccessible PDF URL will cause failures.
- Incorrect base64 encoding or corrupted binary data will result in errors.
- Invalid regular expression syntax can cause the extraction to fail or return no results.
- Specifying an incorrect page sequence (e.g., pages out of range) may lead to empty output.
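A base64 input can be sanity-checked locally before it is passed to the node. The sketch below assumes a typical PDF whose bytes begin with the `%PDF-` header; the helper name `decode_pdf_base64` and the stand-in bytes are hypothetical, and strict validation rejects strings that contain line breaks or other non-alphabet characters.

```python
import base64
import binascii


def decode_pdf_base64(content: str) -> bytes:
    """Decode a base64 string and check that it looks like a PDF.

    Raises ValueError if the string is not valid base64 or the decoded
    bytes do not start with the "%PDF-" header.
    """
    try:
        raw = base64.b64decode(content, validate=True)
    except binascii.Error as exc:
        raise ValueError(f"not valid base64: {exc}") from exc
    if not raw.startswith(b"%PDF-"):
        raise ValueError("decoded data does not start with the %PDF- header")
    return raw


# Hypothetical stand-in for real file bytes; only the header matters here.
fake_pdf = b"%PDF-1.7\n%%EOF\n"
encoded = base64.b64encode(fake_pdf).decode("ascii")
decoded = decode_pdf_base64(encoded)
print(len(decoded))
```

A check like this only catches encoding problems and obviously wrong file types; a file can pass it and still be corrupted further in.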
Error messages:
- Authentication errors usually indicate missing or invalid API credentials; network errors usually point to connectivity problems or an unreachable PDF URL.
- Parsing errors often point to malformed input data or unsupported PDF formats.
- Regex-related errors suggest checking the pattern syntax.
Resolutions:
- Verify the PDF source and ensure it is accessible and correctly formatted.
- Test the regular expression independently before using it in the node.
- Confirm API credentials and permissions are correctly set up in n8n.
- Use the page sequence format as described to avoid invalid page references.