PDF Utils

Inspect and split PDF files using pure npm packages

Overview

PDF Utils provides operations to inspect and split PDF files using pure npm packages. It suits workflows that need to determine whether a PDF is text-based (vectorial) or image-based and, optionally, to split a multi-page PDF into individual pages. Typical uses include verifying a PDF's content type before processing it, or splitting a large PDF into single pages for further handling.

Use Case Examples

  1. Inspect a PDF to check whether it is vectorial, based on the length of the text extracted from its first page, and to read its page count.
  2. Inspect a PDF and conditionally split it into individual pages if it is not vectorial (image-based).
  3. Split a multi-page PDF into separate single-page PDF files.
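
The three use cases above map onto a simple decision about when splitting happens. The sketch below is one way to express it; the operation names `inspect` and `inspectAndSplit` come from this page, `split` is inferred from the Output section, and the helper `shouldSplit` is hypothetical rather than part of the node's code.

```typescript
// Operation names as they appear in this documentation.
type Operation = "inspect" | "split" | "inspectAndSplit";

// Hypothetical helper: decides whether the PDF should be split into pages.
function shouldSplit(operation: Operation, isVectorial: boolean): boolean {
  switch (operation) {
    case "inspect":
      // Inspection only reports on the PDF; it never splits.
      return false;
    case "split":
      // Plain split always produces single-page files.
      return true;
    case "inspectAndSplit":
      // Split only when the PDF is not vectorial (image-based).
      return !isVectorial;
  }
}
```

Under these assumptions, `shouldSplit("inspectAndSplit", false)` is true for an image-based PDF, while a vectorial PDF passes through unsplit.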

Properties

Name | Meaning
Binary Property | Name of the binary property containing the PDF file to be processed.
Text Threshold | Minimum text length on the first page for the PDF to be considered vectorial (text-based). Applies only to the inspect and inspectAndSplit operations.

Output

JSON

  • pageCount - Total number of pages in the PDF.
  • isMultiPage - Boolean indicating if the PDF has more than one page.
  • isVectorial - Boolean indicating if the PDF is considered vectorial (text-based) based on the text threshold.
  • textLength - Length of the extracted text from the first page.
  • firstPageText - Extracted text snippet (up to 200 characters) from the first page.
  • pageNumber - Page number of the split PDF page (only present in split or inspectAndSplit outputs).
  • originalFileName - Original file name of the PDF being processed (only present in split or inspectAndSplit outputs).
  • error - Error message if the node fails and continueOnFail is enabled.
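
As a rough illustration of how the fields above relate to each other, the following sketch derives them from the first-page text, the page count, and the Text Threshold. It is not the node's actual code: the function names are hypothetical, the threshold is assumed to be inclusive (since it is described as a minimum), and page numbers are assumed to start at 1.

```typescript
// Shapes mirroring the JSON fields listed above.
interface InspectResult {
  pageCount: number;
  isMultiPage: boolean;
  isVectorial: boolean;
  textLength: number;
  firstPageText: string;
}

interface PageItem {
  pageNumber: number;
  originalFileName: string;
}

// Hypothetical helper: builds the inspection fields.
// Assumption: textLength >= textThreshold counts as vectorial, and the
// text is measured as given (the node may trim or normalize it first).
function buildInspectResult(
  firstPageText: string,
  pageCount: number,
  textThreshold: number,
): InspectResult {
  const textLength = firstPageText.length;
  return {
    pageCount,
    isMultiPage: pageCount > 1,
    isVectorial: textLength >= textThreshold,
    textLength,
    // The snippet is capped at 200 characters, as documented.
    firstPageText: firstPageText.slice(0, 200),
  };
}

// Hypothetical helper: one JSON item per split page.
// Assumption: pageNumber is 1-based.
function buildPageItems(pageCount: number, originalFileName: string): PageItem[] {
  return Array.from({ length: pageCount }, (_, i) => ({
    pageNumber: i + 1,
    originalFileName,
  }));
}
```

With these definitions, a scanned PDF whose first page yields no extractable text would report `textLength` 0 and `isVectorial` false for any positive threshold.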

Dependencies

  • Uses 'pdfjs-dist' for PDF inspection and 'pdf-lib' for splitting PDFs.

Troubleshooting

  • Common errors include failure to read or parse the PDF file, which may be caused by corrupted or unsupported PDF formats.
  • If the binary property name is incorrect or the binary data is missing, the node will throw an error.
  • To resolve errors, ensure the input binary data contains a valid PDF file and the binary property name matches the input data.
  • If the node fails during splitting, verify the PDF is not encrypted or corrupted.
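
The checks above can be approximated before any parsing happens. The sketch below is a minimal pre-flight validation, not the node's implementation: the `BinaryData` shape, both function names, and the error messages are assumptions made for illustration. It relies only on the fact that a well-formed PDF begins with the bytes `%PDF-`; this is a strict check, and some PDF readers tolerate a few leading bytes before the header.

```typescript
// Hypothetical input shape: binary property name mapped to raw bytes.
type BinaryData = Record<string, Uint8Array | undefined>;

// "%PDF-" encoded as bytes.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Throws if the property is missing or the data does not start with "%PDF-".
function getPdfBytes(binary: BinaryData, propertyName: string): Uint8Array {
  const data = binary[propertyName];
  if (data === undefined) {
    throw new Error(`No binary data found in property "${propertyName}"`);
  }
  const hasMagic = PDF_MAGIC.every((byte, i) => data[i] === byte);
  if (!hasMagic) {
    throw new Error(`Binary property "${propertyName}" does not look like a PDF`);
  }
  return data;
}

// Mirrors the documented `error` output field: when continueOnFail is set,
// the failure is returned as a message instead of being thrown.
function safeGetPdfBytes(
  binary: BinaryData,
  propertyName: string,
  continueOnFail: boolean,
): { bytes?: Uint8Array; error?: string } {
  try {
    return { bytes: getPdfBytes(binary, propertyName) };
  } catch (e) {
    if (!continueOnFail) throw e;
    return { error: e instanceof Error ? e.message : String(e) };
  }
}
```

A check like this does not detect encryption or internal corruption; those still surface only when the PDF is actually parsed or split.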

Links

  • pdf-lib - Library used for splitting PDF documents.
  • pdfjs-dist - Library used for inspecting PDF content and extracting text.
