Overview
The PDF Extractor node processes PDF files provided as binary data and extracts their textual content in plain text, Markdown format, or both. It is useful for workflows that need to analyze, transform, or repurpose the contents of PDF documents without manual intervention. For example, it can be used to convert PDF reports into editable text formats for further processing, indexing, or integration with other systems.
Properties
| Name | Meaning |
|---|---|
| Input Binary Field | The name of the input field containing the binary PDF file data to be processed. |
| Output Format | The format to convert the PDF content into. Options: TXT (plain text), Markdown, or All (both). |
Output
The node outputs JSON objects with the extracted content from each PDF item:
- If "TXT" is selected, the output contains a `textContent` field with the plain text extracted from the PDF.
- If "Markdown" is selected, the output contains a `markdown` field with the content converted to Markdown format.
- If "All" is selected, both `textContent` and `markdown` fields are included.
Each output item is paired with its corresponding input item index. The node does not output binary data; it only returns textual representations extracted from the PDF.
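The mapping from output format to JSON fields can be sketched as follows. This is a minimal illustration, not the node's actual source: the `buildOutputItem` function, the `ExtractedItem` type, and the `pairedItem: { item: index }` shape are assumptions made for the example. Only the `textContent` and `markdown` field names and the three format values come from this document.

```typescript
// Hypothetical types for illustration; the node's real internals may differ.
type OutputFormat = "txt" | "markdown" | "all";

interface ExtractedItem {
  json: { textContent?: string; markdown?: string };
  pairedItem: { item: number }; // assumed shape for linking back to the input item
}

// Builds one output item, including only the fields the chosen format calls for.
function buildOutputItem(
  format: OutputFormat,
  text: string,
  markdown: string,
  itemIndex: number,
): ExtractedItem {
  const json: ExtractedItem["json"] = {};
  if (format === "txt" || format === "all") {
    json.textContent = text;
  }
  if (format === "markdown" || format === "all") {
    json.markdown = markdown;
  }
  return { json, pairedItem: { item: itemIndex } };
}

// Example: "all" yields both fields, paired with input item 0.
const example = buildOutputItem("all", "Quarterly report", "# Quarterly report", 0);
console.log(JSON.stringify(example));
```

Downstream nodes can then read `json.textContent` or `json.markdown` depending on which format was selected.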
Dependencies
- The node depends on an internal helper function to read binary data buffers from the input.
- It uses an external library (imported as `pdf2md`) to perform the conversion from PDF binary data to text and Markdown.
- No external API keys or credentials are required.
Troubleshooting
- Missing Binary Data Error: If the specified input binary field does not exist or contains no data for an item, the node throws an error indicating the missing binary data for that item. Ensure the input field name matches the actual binary property containing the PDF.
- Conversion Failures: Errors during PDF parsing or conversion will cause the node to fail unless "Continue On Fail" is enabled, in which case the error message is returned in the output JSON under the `error` field.
- Incorrect Output Format: Selecting an unsupported output format will result in no output for that item. Use one of the provided options: "txt", "markdown", or "all".
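The per-item error behavior described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the node's real code: `processItems`, `InputItem`, `ResultItem`, and the `convert` callback are hypothetical names, the converter is a synchronous stand-in (the real conversion is likely asynchronous), and the error message wording is invented for the example.

```typescript
// Hypothetical shapes for illustration only.
interface InputItem {
  binary?: Record<string, Uint8Array>;
}

interface ResultItem {
  json: { markdown?: string; error?: string };
  pairedItem: { item: number };
}

// Processes each item; with continueOnFail the error is recorded under `error`
// instead of aborting the whole run.
function processItems(
  items: InputItem[],
  binaryField: string,
  convert: (data: Uint8Array) => string, // stand-in for the real PDF converter
  continueOnFail: boolean,
): ResultItem[] {
  const results: ResultItem[] = [];
  items.forEach((item, index) => {
    try {
      const data = item.binary ? item.binary[binaryField] : undefined;
      if (!data || data.length === 0) {
        // Missing Binary Data Error: field absent or empty for this item.
        throw new Error(`No binary data in field "${binaryField}" for item ${index}`);
      }
      results.push({ json: { markdown: convert(data) }, pairedItem: { item: index } });
    } catch (err) {
      if (!continueOnFail) {
        throw err; // the node fails on the first error
      }
      const message = err instanceof Error ? err.message : String(err);
      results.push({ json: { error: message }, pairedItem: { item: index } });
    }
  });
  return results;
}
```

With `continueOnFail` set to `true`, an item whose binary field is missing produces an output whose JSON contains only `error`, while the remaining items are still converted.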
Links and References
- PDF to Markdown Conversion Library (Note: Replace with actual library link if available)
- n8n Documentation on Binary Data Handling