Overview
The Extract Text By Expression operation searches the text of a PDF document with a user-defined regular expression and returns the matching text. The PDF can be supplied in three ways: as binary data from a previous node, as a base64-encoded string, or as a URL pointing to the PDF file.
Typical use cases include:
- Extracting specific information such as percentages, email addresses, or codes embedded in PDFs.
- Automating data extraction from invoices, reports, or contracts where certain patterns are known.
- Processing multi-page PDFs selectively by specifying page ranges.
For example, you could extract all email addresses from a PDF invoice or find all occurrences of percentage values within a report.
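Before a pattern goes into the node, it can help to prototype it locally against sample text. The following is a minimal sketch in Python, not part of the node itself: the sample text and both patterns are made up for illustration, and the regex dialect used by the PDF4me service may differ from Python's `re` module, so treat this as a check on the pattern's logic only.

```python
import re

# Sample text standing in for the text layer of a PDF (invented for illustration).
sample_text = (
    "Invoice 2024-001. Contact billing@example.com or sales@example.org. "
    "VAT 7.7% applies; early payment discount 2%."
)

# Hypothetical patterns; adjust them to the documents you actually process.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
PERCENT_PATTERN = r"\d+(?:\.\d+)?%"

# findall returns every non-overlapping match, in order of appearance.
emails = re.findall(EMAIL_PATTERN, sample_text)
percentages = re.findall(PERCENT_PATTERN, sample_text)

print(emails)
print(percentages)
```

Once a pattern behaves as intended on sample text, it can be pasted into the Expression property.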
Properties
| Name | Meaning |
|---|---|
| Input Data Type | How the PDF file is provided for text extraction. Options: • Binary Data (from previous node) • Base64 String (base64 encoded PDF content) • URL (link to the PDF file) |
| Input Binary Field | The name of the binary property containing the PDF file when using Binary Data input type. Usually "data". |
| Base64 PDF Content | The base64 encoded string representing the PDF document content, used when Input Data Type is Base64 String. |
| PDF URL | The URL to the PDF file to extract text from, used when Input Data Type is URL. |
| Document Name | The name assigned to the document during processing. Defaults to "document.pdf". |
| Expression | The regular expression pattern to search for within the PDF text. For example, `%`, `US`, or a pattern that matches email addresses. |
| Page Sequence | Specifies which pages to process. Examples: • 1- for all pages starting from page 1 • 1,2,3 for specific pages • 1-5 for a range of pages |
| Advanced Options | Optional JSON string to specify custom profiles or additional API options for processing. For instance, setting output data format or other API-specific parameters. See https://dev.pdf4me.com/apiv2/documentation/ for details. |
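For the Base64 String input type, the PDF content has to be encoded before it reaches the node. The sketch below shows one way to produce such a string in Python; the helper names `pdf_to_base64` and `pdf_file_to_base64` are hypothetical and only illustrate the encoding step, not anything the node provides.

```python
import base64
from pathlib import Path


def pdf_to_base64(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as an ASCII base64 string."""
    return base64.b64encode(pdf_bytes).decode("ascii")


def pdf_file_to_base64(path: str) -> str:
    """Read a PDF file from disk and return its base64 representation."""
    return pdf_to_base64(Path(path).read_bytes())


# Tiny stand-in for real PDF content, used only to demonstrate the call.
encoded = pdf_to_base64(b"%PDF-1.4\n")
print(encoded)
```

The resulting string is what would go into the Base64 PDF Content property.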
Output
The node outputs JSON data containing the extracted text snippets that match the provided regular expression from the specified pages of the PDF document.
- The `json` field includes the matched text results.
- If multiple matches occur, they will be included accordingly.
- The node does not output binary data for this operation; it focuses on textual extraction results.
Dependencies
- Requires access to the PDF4me API, the third-party service that performs the PDF processing.
- An API key credential or authentication token must be configured in n8n to authorize requests to the external PDF service.
- Internet access is needed if providing PDF files via URL.
Troubleshooting
Common Issues:
- Incorrect regular expression syntax may cause no matches or errors.
- Providing an invalid or inaccessible PDF URL will result in failure to fetch the document.
- Specifying an incorrect binary property name when using binary input will cause the node to fail to locate the PDF file.
- Page sequence strings must be correctly formatted; otherwise, the node might ignore pages or throw errors.
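Two of the issues above, a malformed expression and a malformed page sequence, can be caught before the workflow runs. The following is a rough pre-flight check in Python, assuming that page sequences only take the forms shown in the Properties table (`1-`, `1,2,3`, `1-5`). The function names are hypothetical, and the service's own regex dialect and page-sequence parser may accept or reject inputs differently.

```python
import re
from typing import Optional

# Assumed grammar: comma-separated items, each "N", "N-" or "N-M".
PAGE_SEQUENCE = re.compile(r"\d+(-\d*)?(,\d+(-\d*)?)*")


def check_expression(pattern: str) -> Optional[str]:
    """Return an error message if the pattern does not compile, else None."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return f"Invalid regular expression: {exc}"
    return None


def check_page_sequence(sequence: str) -> bool:
    """Return True if the sequence fits the assumed page-sequence grammar."""
    return PAGE_SEQUENCE.fullmatch(sequence.replace(" ", "")) is not None


print(check_expression(r"\d+%"))       # compiles, so no message
print(check_expression("(unclosed"))   # reports the syntax problem
print(check_page_sequence("1-5"))
print(check_page_sequence("1--5"))
```

A check like this only screens for obvious formatting mistakes; a pattern that compiles can still match nothing in a given document.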
Error Messages and Resolutions:
- "Failed to fetch PDF from URL": Check the URL accessibility and ensure it points directly to a valid PDF file.
- "Invalid regular expression": Verify the regex pattern syntax.
- "Binary property not found": Confirm the binary field name matches the actual binary data property from the previous node.
- API authentication errors: Ensure the API key or credentials are properly set up in n8n.
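When a URL fails to fetch, one quick check is whether it returns a PDF at all rather than, say, an HTML landing page. The sketch below relies on the fact that PDF files carry a `%PDF-` signature near the start of the file. The helper names are hypothetical, and `fetch_head` needs network access, so it is defined but not called here.

```python
import urllib.request

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF signature appears within the first 1 KiB."""
    # Some producers emit a few stray bytes before the header,
    # so search the leading chunk instead of requiring offset 0.
    return PDF_MAGIC in data[:1024]


def fetch_head(url: str, size: int = 1024, timeout: float = 10.0) -> bytes:
    """Download only the first `size` bytes of the resource at `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read(size)


print(looks_like_pdf(b"%PDF-1.7\n"))
print(looks_like_pdf(b"<!DOCTYPE html><html>"))
```

In practice, `looks_like_pdf(fetch_head(url))` returning `False` suggests the link points to a viewer or download page instead of the file itself.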
Links and References
- PDF4me API Documentation: https://dev.pdf4me.com/apiv2/documentation/
- Regular Expressions Guide: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
- n8n Documentation on Credentials: https://docs.n8n.io/credentials/overview/