PDF Utils

Inspect and split PDF files using pure npm packages

Overview

This node inspects and splits PDF files using pure npm packages. It supports three operations:

  • Inspect - Analyzes the PDF and reports whether it is vectorial (text-based). A PDF counts as vectorial when the length of the text extracted from its first page meets the Text Threshold.
  • Inspect and Split - Runs the same inspection, then splits the PDF into individual pages only if it is not vectorial.
  • Split - Splits a multi-page PDF into individual pages regardless of its content.

The node suits workflows that need to determine a PDF's content type, or that process multi-page PDFs one page at a time.
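The decision logic described above can be sketched as follows. This is a minimal illustration, not the node's actual source: the names `classify` and `shouldSplit` and the operation identifiers are hypothetical, and the sketch assumes the comparison is "text length greater than or equal to the threshold" with no trimming of the extracted text.

```typescript
// Hypothetical operation identifiers; the node's internal values may differ.
type Operation = "inspect" | "inspectAndSplit" | "split";

interface InspectResult {
  pageCount: number;
  isMultiPage: boolean;
  isVectorial: boolean;
  textLength: number;
}

// Classify a PDF from its first-page text and page count.
// Assumption: "minimum text length" means textLength >= textThreshold.
function classify(
  firstPageText: string,
  pageCount: number,
  textThreshold: number,
): InspectResult {
  const textLength = firstPageText.length;
  return {
    pageCount,
    isMultiPage: pageCount > 1,
    isVectorial: textLength >= textThreshold,
    textLength,
  };
}

// Decide whether the chosen operation splits the PDF into pages.
function shouldSplit(operation: Operation, result: InspectResult): boolean {
  switch (operation) {
    case "split":
      return true; // splits regardless of content
    case "inspectAndSplit":
      return !result.isVectorial; // splits only non-vectorial PDFs
    case "inspect":
      return false; // inspection only
  }
}
```

Under these assumptions, a scanned PDF whose first page yields no text is split by Inspect and Split, while a text-based PDF passes through unsplit.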

Use Case Examples

  1. Inspect a PDF to check if it is text-based or image-based before deciding further processing.
  2. Automatically split scanned (non-vectorial) PDFs into individual pages while leaving text-based PDFs intact.
  3. Split any multi-page PDF into separate single-page PDFs for individual processing or storage.

Properties

  • Binary Property - Name of the binary property containing the PDF file to be processed.
  • Text Threshold - Minimum text length to consider the PDF as vectorial (text-based). Used in Inspect and Inspect and Split operations.
  • Output Binary Property - Name for the output binary property of split PDFs. Used in Split and Inspect and Split operations.

Output

JSON

  • pageCount - Total number of pages in the PDF (from Inspect and Inspect and Split operations).
  • isMultiPage - Boolean indicating if the PDF has more than one page (from Inspect and Inspect and Split operations).
  • isVectorial - Boolean indicating if the PDF is considered vectorial (text-based) based on the text threshold (from Inspect and Inspect and Split operations).
  • textLength - Length of the extracted text from the first page (from Inspect and Inspect and Split operations).
  • firstPageText - Extracted text snippet (first 200 characters) from the first page (from Inspect and Inspect and Split operations).
  • pageNumber - Page number of the split PDF page (from Split and Inspect and Split operations).
  • originalFileName - Original file name of the PDF being processed (from Split and Inspect and Split operations).
  • error - Error message if the operation fails and continueOnFail is enabled.
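For downstream handling, the fields listed above can be modelled as TypeScript types. The field names come from the list; the grouping into three separate shapes and the helper `describeItem` are assumptions made for illustration, since the source does not state whether split items from Inspect and Split also carry the inspection fields.

```typescript
// Fields reported by Inspect and Inspect and Split.
interface InspectJson {
  pageCount: number;
  isMultiPage: boolean;
  isVectorial: boolean;
  textLength: number;
  firstPageText: string; // snippet, first 200 characters
}

// Fields reported for each page produced by Split and Inspect and Split.
interface SplitPageJson {
  pageNumber: number;
  originalFileName: string;
}

// Emitted when the operation fails and continueOnFail is enabled.
interface ErrorJson {
  error: string;
}

type PdfUtilsJson = InspectJson | SplitPageJson | ErrorJson;

// Hypothetical helper: summarise an output item by checking which fields it has.
function describeItem(json: PdfUtilsJson): string {
  if ("error" in json) {
    return `failed: ${json.error}`;
  }
  if ("pageNumber" in json) {
    return `page ${json.pageNumber} of ${json.originalFileName}`;
  }
  const kind = json.isVectorial ? "text-based" : "image-based";
  return `${json.pageCount} page(s), ${kind}`;
}
```

Checking for `error` first lets a workflow route failed items separately before inspecting any other field.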

Dependencies

  • pdfjs-dist
  • pdf-lib

Troubleshooting

  • Ensure the input binary property contains a valid PDF file; otherwise, the node will throw an error.
  • If the PDF is encrypted or corrupted, inspection or splitting may fail, and the error message will state that the PDF could not be inspected or split.
  • Set the Text Threshold appropriately; a very low threshold might misclassify image-based PDFs as vectorial, and a very high threshold might misclassify text-based PDFs as non-vectorial.
  • If the node is set to continue on fail, a failure does not stop the workflow; instead, the node outputs an item whose JSON contains an error property with the error message.
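The continue-on-fail behaviour can be illustrated with a generic per-item loop. This is a sketch of the common pattern, not the node's actual code: `Item`, `processItems`, and the `handler` callback are hypothetical stand-ins, and the sketch assumes one error item is emitted per failed input.

```typescript
// Simplified stand-in for a workflow item; real items also carry binary data.
interface Item {
  json: Record<string, unknown>;
}

// Run `handler` on each item. When continueOnFail is true, a failure
// becomes an output item with an `error` property instead of aborting.
function processItems(
  items: Item[],
  handler: (item: Item) => Item[],
  continueOnFail: boolean,
): Item[] {
  const output: Item[] = [];
  for (const item of items) {
    try {
      output.push(...handler(item));
    } catch (err) {
      if (!continueOnFail) {
        throw err; // default behaviour: stop on the first failure
      }
      const message = err instanceof Error ? err.message : String(err);
      output.push({ json: { error: message } });
    }
  }
  return output;
}
```

With `continueOnFail` set to `false`, the first invalid PDF stops processing; with `true`, the remaining items are still processed and each failure appears as its own error item.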

Links

  • pdfjs-dist - Library used for PDF inspection and text extraction.
  • pdf-lib - Library used for splitting PDF documents into individual pages.
