PDF4me

Comprehensive PDF and document processing: generate barcodes, convert files, extract data, manipulate images, and automate workflows with the PDF4ME API

Overview

The "Extract Resources" operation extracts resources such as text and images from PDF documents. The PDF can be supplied in three ways: as binary data from a previous node, as a base64-encoded string, or as a URL pointing to the file. Extraction options control whether text, images, or both are extracted, and advanced options set which pages to process and allow custom API profiles.

This node is beneficial in scenarios where automated processing of PDF content is required, such as extracting textual data for analysis, retrieving embedded images for reuse, or preparing document contents for further workflows. For example, it can be used to extract all images from a set of invoices or to pull out text from specific pages of a contract PDF.

Properties

Name: Meaning
Input Data Type: Choose how to provide the PDF file to extract resources from. Options: Binary Data (from previous node), Base64 String (directly provide base64 encoded PDF content), URL (link to PDF file).
Input Binary Field: Name of the binary property containing the PDF file when using Binary Data input type. Usually "data" for file uploads.
Base64 PDF Content: Base64 encoded string representing the PDF document content. Used when Input Data Type is "Base64 String".
PDF URL: URL to the PDF file to extract resources from. Used when Input Data Type is "URL".
Document Name: Name assigned to the document during processing. Defaults to "document.pdf".
Extraction Options: Collection of options to specify what to extract from the PDF:
  - Extract Text: Whether to extract text content (boolean).
  - Extract Images: Whether to extract images (boolean).
Advanced Options: Additional settings for extraction:
  - Pages: Specify pages to extract from using formats like "all", "1,2", or "2-5".
  - Custom Profiles: JSON string to adjust custom properties or API-specific options.
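
The Pages formats listed above ("all", "1,2", "2-5") can be checked before the node runs. The following is a minimal Python sketch, not the node's own parser: the function name parse_pages and the page_count argument are hypothetical, and accepting mixed entries such as "1,3-5" is an assumption that the documentation does not confirm.

```python
def parse_pages(spec: str, page_count: int) -> list[int]:
    """Expand a page specification into a sorted list of 1-based page numbers.

    Hypothetical helper for pre-validation only. It accepts "all",
    comma-separated numbers ("1,2"), and ranges ("2-5"). Mixing both
    forms in one string is an assumption, not documented behavior.
    """
    text = spec.strip().lower()
    if text == "all":
        return list(range(1, page_count + 1))

    pages: set[int] = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page specification: {spec!r}")
        first, sep, last = part.partition("-")
        # Both ends of a range must be plain digits.
        if not first.strip().isdigit() or (sep and not last.strip().isdigit()):
            raise ValueError(f"not a page number or range: {part!r}")
        start = int(first)
        end = int(last) if sep else start
        if start < 1 or end < start or end > page_count:
            raise ValueError(f"range {part!r} is outside 1..{page_count}")
        pages.update(range(start, end + 1))
    return sorted(pages)


if __name__ == "__main__":
    print(parse_pages("2-5", 6))  # [2, 3, 4, 5]
```

Rejecting an out-of-range or reversed range up front gives a clearer failure than an empty result from the extraction step.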

Output

The output contains the extracted resources from the PDF document. The json field will include the extracted text and/or images depending on the selected extraction options. If images are extracted, they may be provided as binary data or encoded strings suitable for further processing in n8n workflows.

Dependencies

  • Requires access to the PDF file either as binary data, base64 content, or via a URL.
  • Depends on the PDF4me API, the external service that performs the actual resource extraction.
  • Requires PDF4me API credentials configured in n8n to access that service.

Troubleshooting

  • Common Issues:
    • Providing an incorrect or inaccessible URL may cause failures in fetching the PDF.
    • Incorrect base64 encoding or corrupted binary data will prevent successful extraction.
    • Specifying invalid page ranges in the "Pages" option could lead to errors or empty results.
  • Error Messages:
    • Errors related to file retrieval usually indicate network issues or invalid URLs.
    • Extraction errors might mention unsupported file formats or corrupted PDFs.
  • Resolutions:
    • Verify URLs and ensure the PDF is publicly accessible or properly authenticated.
    • Confirm base64 strings are correctly encoded without extra characters.
    • Use valid page range syntax and test with "all" if unsure.
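
The base64 and URL checks above can be run locally before passing a value to the node. This is a rough Python sketch under stated assumptions: the helper names decode_pdf_base64 and is_http_url are hypothetical, the "%PDF-" header test is only a sanity check on the decoded bytes, and the URL check is purely syntactic (it does not confirm that the file is reachable or that the service can fetch it).

```python
import base64
import binascii
from urllib.parse import urlparse


def decode_pdf_base64(content: str) -> bytes:
    """Strictly decode a base64 string and confirm it looks like a PDF.

    Hypothetical helper. Whitespace and line breaks are removed first;
    any other character outside the base64 alphabet is rejected.
    """
    cleaned = "".join(content.split())
    try:
        raw = base64.b64decode(cleaned, validate=True)
    except binascii.Error as exc:
        raise ValueError(f"not valid base64: {exc}") from exc
    # PDF files begin with a "%PDF-" header; anything else is suspect.
    if not raw.startswith(b"%PDF-"):
        raise ValueError("decoded data does not start with '%PDF-'")
    return raw


def is_http_url(url: str) -> bool:
    """Return True when the value is syntactically an http(s) URL with a host."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


if __name__ == "__main__":
    sample = base64.b64encode(b"%PDF-1.7\n").decode("ascii")
    print(len(decode_pdf_base64(sample)))  # 9
    print(is_http_url("example.com/file.pdf"))  # False
```

A value that fails either check is worth fixing before looking at the node's own error message.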
