PDF to CSV

Convert PDF documents to CSV format

Overview

This node converts PDF documents into CSV or other structured data formats. It supports input as either binary PDF data or a URL pointing to a PDF file. The node extracts text from the PDF and applies various parsing methods to detect and reconstruct tables or structured data within the document. The output can be returned as a CSV string, JSON array, downloadable binary CSV, or an Excel (.xlsx) file.

Common scenarios where this node is useful include:

- Extracting tabular data from invoices, reports, or statements in PDF format.
- Automating data ingestion workflows that require converting PDFs into spreadsheet-compatible formats.
- Parsing structured or semi-structured PDF documents for further processing or analysis.

Practical example: A user receives monthly sales reports as PDFs and wants to automatically convert them into Excel files for integration with their accounting system. This node can fetch the PDF via URL or accept it as binary input, parse the tables inside, and output an Excel file ready for import.

Properties

Name	Meaning
Input Type	Choose whether the PDF input comes from binary data (uploaded file) or a URL. Options: "Binary Data", "URL".
Binary Property	(When Input Type is Binary Data) Name of the binary property containing the PDF file. Default is "data".
PDF URL	(When Input Type is URL) URL of the PDF file to convert.
Parsing Method	Method used to parse the PDF text into tables or structured data. Options: Auto Detect Tables (automatically detect table structures); Smart Pattern Detection (advanced pattern detection for structured reports); Column-Based Table Reconstruction (reconstruct tables from PDFs whose text flows in columns); Line by Line (parse text line by line); Custom Delimiter (split text with a custom regex delimiter).
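Custom Delimiter	(When Parsing Method is Custom Delimiter) Regular expression pattern for splitting text. Default is `\s+`, which matches one or more whitespace characters; use a pattern such as `\s{2,}` to split only on runs of two or more.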
CSV Delimiter	Delimiter character to use in the output CSV file. Default is comma (`,`).
Include Headers	Whether to treat the first row as headers in the output. Boolean, default true.
Skip Empty Lines	Whether to skip empty lines when parsing the PDF text. Boolean, default true.
Output Format	Format of the output data. Options: CSV String (return the CSV as a string); JSON Array (return the parsed data as a JSON array); Binary Data (return the CSV as base64-encoded binary data for download); Excel File (return the data as an Excel .xlsx file).
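
As a rough illustration of how the Custom Delimiter, Skip Empty Lines, and CSV Delimiter properties could interact, the TypeScript sketch below splits extracted text into rows and serializes them again. The helper names `splitIntoRows` and `toCsv` are hypothetical, and the quoting rules follow a common CSV convention rather than the node's confirmed implementation (the node itself relies on papaparse for CSV generation).

```typescript
// Hypothetical helper: split extracted PDF text into rows of cells.
// `delimiterPattern` plays the role of the Custom Delimiter property.
function splitIntoRows(
  text: string,
  delimiterPattern: string = "\\s+",
  skipEmptyLines: boolean = true,
): string[][] {
  const delimiter = new RegExp(delimiterPattern);
  const result: string[][] = [];
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") {
      // Assumption: an empty line becomes an empty row when it is not skipped.
      if (!skipEmptyLines) result.push([]);
      continue;
    }
    result.push(line.split(delimiter));
  }
  return result;
}

// Hypothetical helper: join rows with the chosen CSV delimiter, quoting
// cells that contain the delimiter, a double quote, or a line break.
function toCsv(table: string[][], csvDelimiter: string = ","): string {
  const quote = (cell: string): string =>
    cell.includes(csvDelimiter) || /["\r\n]/.test(cell)
      ? `"${cell.replace(/"/g, '""')}"`
      : cell;
  return table.map((row) => row.map(quote).join(csvDelimiter)).join("\n");
}

const sample = "Item   Qty   Price\n\nBlue pen   2   1,50\n";
const rows = splitIntoRows(sample, "\\s{2,}");
const csv = toCsv(rows);
```

With the default `\s+` pattern, a cell such as "Blue pen" would be split into two cells, which is why the sample call passes `\s{2,}` instead. The value `1,50` is quoted in the result because it contains the comma used as the CSV delimiter.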

Output

The node outputs an array of items, each containing:

- json: The main output data, which varies with the selected output format:
  - For CSV String: a string containing the CSV data.
  - For JSON Array: an array of objects representing rows (if headers are included) or an array of value arrays (if they are not).
  - For Binary Data: a message confirming successful conversion.
  - For Excel File: a message including the number of rows and columns processed.
- binary (optional): Present only when the output format is Binary Data or Excel File. It contains:
  - data: Base64-encoded content of the generated file (CSV or XLSX).
  - mimeType: MIME type of the file (text/csv for CSV; the Excel MIME type, normally application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, for XLSX).
  - fileName: Suggested filename (converted.csv or converted.xlsx).
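
To make the output shapes above more concrete, here is a hedged TypeScript sketch of how parsed rows might map onto the JSON Array and Binary Data formats. The types `BinaryPayload` and `OutputItem`, the helpers `rowsToObjects` and `toBinaryItem`, and the wording of the `message` field are illustrative assumptions; only the `data`, `mimeType`, and `fileName` fields and the `converted.csv` filename come from the description above. In n8n workflows these fields normally sit under a named binary property, which the sketch flattens for brevity.

```typescript
// Illustrative shapes only; field names follow the list above.
interface BinaryPayload {
  data: string; // Base64-encoded file content
  mimeType: string;
  fileName: string;
}

interface OutputItem {
  json: unknown;
  binary?: BinaryPayload;
}

// JSON Array format: with headers, each row becomes an object keyed by the
// first row; without headers, rows stay as arrays of values.
function rowsToObjects(
  table: string[][],
  includeHeaders: boolean,
): Array<Record<string, string>> | string[][] {
  if (!includeHeaders || table.length === 0) return table;
  const headers = table[0] ?? [];
  return table.slice(1).map((row) => {
    const record: Record<string, string> = {};
    headers.forEach((header, i) => {
      // Assumption: cells missing from a short row become empty strings.
      record[header] = row[i] ?? "";
    });
    return record;
  });
}

// Binary Data format: the CSV travels as Base64 and json carries a message.
function toBinaryItem(csvText: string): OutputItem {
  return {
    json: { message: "PDF converted to CSV" }, // placeholder wording
    binary: {
      data: Buffer.from(csvText, "utf8").toString("base64"),
      mimeType: "text/csv",
      fileName: "converted.csv",
    },
  };
}

const parsedRows = [["Item", "Qty"], ["Pen", "2"]];
const asObjects = rowsToObjects(parsedRows, true);
const binaryItem = toBinaryItem("Item,Qty\nPen,2");
```

A downstream step can recover the CSV text by Base64-decoding `binaryItem.binary.data`, for example with `Buffer.from(data, "base64").toString("utf8")`.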

Dependencies

- The node relies on the following npm packages, which are bundled with it:
  - pdf-parse: extracts text content from PDF files.
  - papaparse: generates CSV strings from arrays.
  - xlsx: creates Excel files from parsed data.
- With URL input, the node performs an HTTP GET request to fetch the PDF file.
- No special environment variables or external API keys are required.

Troubleshooting

- Common issues:
  - Invalid or inaccessible PDF URL: Ensure the URL is correct and publicly accessible or accessible with provided credentials.
  - Incorrect binary property name: Verify the binary property name matches the incoming binary data.
  - Parsing errors due to unusual PDF layouts: Try different parsing methods (e.g., switch from auto-detect to smart pattern or column-based).
  - Large PDFs may cause performance delays or memory issues.
- Error messages:
  - Errors related to missing binary data or invalid buffer indicate incorrect binary property configuration.
  - HTTP request failures when fetching PDF URLs usually mean network issues or invalid URLs.
  - Parsing errors might occur if the PDF text extraction fails; try testing with simpler PDFs or adjusting parsing options.
- Resolution tips:
  - Double-check all input parameters.
  - Test with sample PDFs to find the best parsing method.
  - Enable "Continue On Fail" in the node settings to handle individual item errors gracefully.
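
Errors about missing binary data or an invalid buffer can often be narrowed down before parsing by checking that the input really looks like a PDF. The sketch below is not part of the node: `looksLikePdf` and `ascii` are hypothetical helpers that rely only on the fact that PDF files begin with the signature `%PDF-`. An HTML error page returned by a bad URL, for instance, would fail this check.

```typescript
// "%PDF-" as bytes: 0x25 0x50 0x44 0x46 0x2D
const PDF_SIGNATURE = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Hypothetical pre-flight check: true when the bytes start with "%PDF-".
// This is a strict check; it rejects files with leading junk bytes.
function looksLikePdf(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_SIGNATURE.length) return false;
  return PDF_SIGNATURE.every((value, i) => bytes[i] === value);
}

// Small helper for the demo: turn an ASCII string into bytes.
const ascii = (text: string): Uint8Array =>
  Uint8Array.from(text, (ch) => ch.charCodeAt(0));

const pdfBytes = ascii("%PDF-1.7\n");
const htmlBytes = ascii("<html>404 Not Found</html>");

const pdfOk = looksLikePdf(pdfBytes); // starts with the signature
const htmlOk = looksLikePdf(htmlBytes); // does not
```

A check like this only confirms the file header. A file can pass it and still fail text extraction, for example when it is truncated or consists of scanned images without a text layer.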

Links and References

- pdf-parse GitHub – PDF text extraction library used internally.
- PapaParse Documentation – CSV parsing and generation library.
- SheetJS xlsx Documentation – Library for creating Excel files programmatically.