PDF Extractor

Extracts text and Markdown from PDF files.

Overview

The PDF Extractor node processes PDF files provided as binary data and extracts their textual content in plain text, Markdown format, or both. It is useful for workflows that need to analyze, transform, or repurpose the contents of PDF documents without manual intervention. For example, it can be used to convert PDF reports into editable text formats for further processing, indexing, or integration with other systems.

Properties

Name	Meaning
Input Binary Field	The name of the input field containing the binary PDF file data to be processed.
Output Format	The format to convert the PDF content into. Options: TXT (plain text), Markdown, or All (both).

Output

The node outputs JSON objects with the extracted content from each PDF item:

If "TXT" is selected, the output contains a textContent field with the plain text extracted from the PDF.
If "Markdown" is selected, the output contains a markdown field with the content converted to Markdown format.
If "All" is selected, both textContent and markdown fields are included.

Each output item is paired with its corresponding input item index. The node does not output binary data; it only returns textual representations extracted from the PDF.
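
The mapping from output format to output fields can be sketched as follows. This is a minimal illustration, not the node's actual source: the function name buildOutputItem and the OutputItem shape are hypothetical, and the lowercase format values are assumed from the option names quoted under Troubleshooting. Only the field names textContent and markdown come from the description above.

```typescript
// Format values as quoted in the Troubleshooting section; assumed to be the internal option values.
type OutputFormat = "txt" | "markdown" | "all";

// Field names textContent and markdown come from the Output description.
interface ExtractedJson {
  textContent?: string;
  markdown?: string;
}

// Hypothetical item shape: the JSON payload plus the index of the input item it came from.
interface OutputItem {
  json: ExtractedJson;
  pairedItem: { item: number };
}

function buildOutputItem(
  format: OutputFormat,
  text: string,
  markdown: string,
  itemIndex: number,
): OutputItem {
  const json: ExtractedJson = {};
  // "txt" and "all" both include the plain-text field.
  if (format === "txt" || format === "all") {
    json.textContent = text;
  }
  // "markdown" and "all" both include the Markdown field.
  if (format === "markdown" || format === "all") {
    json.markdown = markdown;
  }
  return { json, pairedItem: { item: itemIndex } };
}

// Example: "all" yields both fields for input item 0.
const exampleItem = buildOutputItem("all", "Quarterly report", "# Quarterly report", 0);
console.log(JSON.stringify(exampleItem));
```

Downstream nodes can then read whichever field is present, for instance textContent for indexing and markdown for rendering.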

Dependencies

The node depends on an internal helper function to read binary data buffers from the input.
It uses an external library (imported as pdf2md) to perform the conversion from PDF binary data to text and Markdown.
No external API keys or credentials are required.

Troubleshooting

Missing Binary Data Error: If the specified input binary field does not exist or contains no data for an item, the node throws an error indicating the missing binary data for that item. Ensure the input field name matches the actual binary property containing the PDF.
Conversion Failures: Errors during PDF parsing or conversion will cause the node to fail unless "Continue On Fail" is enabled, in which case the error message is returned in the output JSON under the error field.
Incorrect Output Format: Selecting an unsupported output format will result in no output for that item. Use one of the provided options: "txt", "markdown", or "all".
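
The per-item error behavior described above can be sketched like this. It is a simplified, synchronous illustration under stated assumptions: extractAll, InputItem, and ResultItem are hypothetical names, convert is a placeholder for the conversion step rather than the pdf2md API, and the sketch assumes that a missing binary field is handled by the same "Continue On Fail" path as a conversion failure.

```typescript
// Hypothetical input shape: each item may carry named binary properties.
interface InputItem {
  binary?: Record<string, Uint8Array>;
}

// Hypothetical result shape: either extracted text or an error message, paired with the input index.
interface ResultItem {
  json: { textContent?: string; error?: string };
  pairedItem: { item: number };
}

// `convert` stands in for the PDF conversion step; it is a placeholder, not a real library call.
function extractAll(
  items: InputItem[],
  binaryField: string,
  convert: (data: Uint8Array) => string,
  continueOnFail: boolean,
): ResultItem[] {
  const results: ResultItem[] = [];
  items.forEach((item, i) => {
    try {
      const data = item.binary?.[binaryField];
      // Missing or empty binary data is reported for the specific item.
      if (data === undefined || data.length === 0) {
        throw new Error(`No binary data found in field "${binaryField}" for item ${i}`);
      }
      results.push({ json: { textContent: convert(data) }, pairedItem: { item: i } });
    } catch (error) {
      // Without "Continue On Fail", the first error aborts the whole run.
      if (!continueOnFail) {
        throw error;
      }
      // With "Continue On Fail", the message is returned under the error field instead.
      const message = error instanceof Error ? error.message : String(error);
      results.push({ json: { error: message }, pairedItem: { item: i } });
    }
  });
  return results;
}
```

With continueOnFail set to true, an item without the named binary field produces an output item whose json contains only error, so later nodes can filter on that field.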

Links and References

PDF to Markdown Conversion Library
n8n Documentation on Binary Data Handling
