Overview
This node inspects and splits PDF files using pure npm packages. It supports three operations:
- Inspect - Extracts text from the PDF and compares its length against the Text Threshold to determine whether the PDF is vectorial (text-based).
- Inspect and Split - Inspects the PDF and splits it into individual pages only if it is not vectorial.
- Split - Splits a multi-page PDF into individual pages regardless of its content.
The node is useful for workflows that need to determine a PDF's content type, or that process multi-page PDFs one page at a time.
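The three operations differ mainly in whether they end up splitting the file. A minimal sketch of that decision follows; the `Operation` values and the `shouldSplit` helper are hypothetical names chosen for illustration and are not taken from the node's source.

```typescript
// Hypothetical operation identifiers; the node's real parameter values may differ.
type Operation = "inspect" | "inspectAndSplit" | "split";

// Decide whether the PDF should be split into single pages.
function shouldSplit(operation: Operation, isVectorial: boolean): boolean {
  if (operation === "inspect") {
    return false; // Inspect only reports metadata, it never splits
  }
  if (operation === "inspectAndSplit") {
    return !isVectorial; // split only non-vectorial (scanned) PDFs
  }
  return true; // Split always splits, regardless of content
}

console.log(shouldSplit("inspectAndSplit", true));  // text-based PDF is left intact
console.log(shouldSplit("inspectAndSplit", false)); // scanned PDF is split
```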
Use Case Examples
- Inspect a PDF to check if it is text-based or image-based before deciding further processing.
- Automatically split scanned (non-vectorial) PDFs into individual pages while leaving text-based PDFs intact.
- Split any multi-page PDF into separate single-page PDFs for individual processing or storage.
Properties
| Name | Meaning |
|---|---|
| Binary Property | Name of the binary property containing the PDF file to be processed. |
| Text Threshold | Minimum text length to consider the PDF as vectorial (text-based). Used in Inspect and Inspect and Split operations. |
| Output Binary Property | Name for the output binary property of split PDFs. Used in Split and Inspect and Split operations. |
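To illustrate how Text Threshold affects classification, here is a small sketch. It assumes the check is an inclusive comparison of the extracted text length against the threshold, which follows from "minimum text length" but is not confirmed by the node's source; `isVectorial` and the sample strings are hypothetical.

```typescript
// Assumption: a PDF counts as vectorial when the extracted text
// is at least `textThreshold` characters long (inclusive comparison).
function isVectorial(extractedText: string, textThreshold: number): boolean {
  return extractedText.length >= textThreshold;
}

const scannedPageText = ""; // a scan without a text layer yields no text
const invoiceText = "Invoice 2024-001 Total: 120.00 EUR"; // 34 characters

console.log(isVectorial(scannedPageText, 10)); // false
console.log(isVectorial(invoiceText, 10));     // true
console.log(isVectorial(invoiceText, 500));    // false: threshold too high for a short page
console.log(isVectorial(scannedPageText, 0));  // true: a zero threshold accepts everything
```

The last two calls show why the threshold should be tuned to the documents in the workflow: too high and short text-based pages are treated as scans, too low and scans pass as text-based.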
Output
JSON
- pageCount - Total number of pages in the PDF (from Inspect and Inspect and Split operations).
- isMultiPage - Boolean indicating if the PDF has more than one page (from Inspect and Inspect and Split operations).
- isVectorial - Boolean indicating if the PDF is considered vectorial (text-based) based on the text threshold (from Inspect and Inspect and Split operations).
- textLength - Length of the extracted text from the first page (from Inspect and Inspect and Split operations).
- firstPageText - Extracted text snippet (first 200 characters) from the first page (from Inspect and Inspect and Split operations).
- pageNumber - Page number of the split PDF page (from Split and Inspect and Split operations).
- originalFileName - Original file name of the PDF being processed (from Split and Inspect and Split operations).
- error - Error message if the operation fails and continueOnFail is enabled.
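For downstream code it can help to type these fields. The sketch below mirrors the field names listed above; the interface names, the builder functions, the inclusive threshold comparison, and 1-based page numbering are assumptions made for illustration, not the node's actual implementation.

```typescript
// Fields reported by Inspect and Inspect and Split.
interface InspectJson {
  pageCount: number;
  isMultiPage: boolean;
  isVectorial: boolean;
  textLength: number;
  firstPageText: string; // snippet, at most 200 characters
}

// Fields reported for each page produced by Split and Inspect and Split.
interface SplitPageJson {
  pageNumber: number;
  originalFileName: string;
}

// Hypothetical builder: derives the inspection fields from already-extracted text.
function buildInspectJson(pageCount: number, firstPageText: string, textThreshold: number): InspectJson {
  const textLength = firstPageText.length;
  return {
    pageCount,
    isMultiPage: pageCount > 1,
    isVectorial: textLength >= textThreshold, // assumed inclusive comparison
    textLength,
    firstPageText: firstPageText.slice(0, 200),
  };
}

// Hypothetical builder: one JSON object per split page, numbered from 1 (assumed).
function buildSplitJson(pageCount: number, originalFileName: string): SplitPageJson[] {
  const pages: SplitPageJson[] = [];
  for (let pageNumber = 1; pageNumber <= pageCount; pageNumber++) {
    pages.push({ pageNumber, originalFileName });
  }
  return pages;
}

console.log(buildInspectJson(3, "Quarterly report", 10));
console.log(buildSplitJson(2, "report.pdf"));
```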
Dependencies
- pdfjs-dist
- pdf-lib
Troubleshooting
- Ensure the input binary property contains a valid PDF file; otherwise, the node will throw an error.
- If the PDF is encrypted or corrupted, inspection or splitting may fail with an error message stating that the PDF could not be inspected or split.
- Set the Text Threshold appropriately; a very low threshold might misclassify image-based PDFs as vectorial, and a very high threshold might misclassify text-based PDFs as non-vectorial.
- If the node is set to continue on fail, errors will be output as JSON with an error message property.
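The continue-on-fail behavior described in the last point can be sketched as a per-item try/catch. This is an illustrative pattern only: `processAll`, `Item`, and the handler are hypothetical and do not reproduce the node's code or the n8n API.

```typescript
// Simplified stand-in for a workflow item; only the JSON part is modeled.
type Item = { json: Record<string, unknown> };

// Run `handler` on every input. When `continueOnFail` is true, a failure
// becomes an item with an `error` property; otherwise the error is rethrown.
function processAll<T>(inputs: T[], handler: (input: T) => Item, continueOnFail: boolean): Item[] {
  const results: Item[] = [];
  for (const input of inputs) {
    try {
      results.push(handler(input));
    } catch (err) {
      if (!continueOnFail) {
        throw err; // stop on the first failure
      }
      const message = err instanceof Error ? err.message : String(err);
      results.push({ json: { error: message } });
    }
  }
  return results;
}

// Hypothetical handler that rejects anything not named like a PDF.
function inspectStub(fileName: string): Item {
  if (!fileName.endsWith(".pdf")) {
    throw new Error(`Failed to inspect PDF: ${fileName}`);
  }
  return { json: { originalFileName: fileName } };
}

console.log(processAll(["a.pdf", "notes.txt"], inspectStub, true));
```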
Links
- pdfjs-dist - Library used for PDF inspection and text extraction.
- pdf-lib - Library used for splitting PDF documents into individual pages.