Blab Document Parse

Convert documents into structured HTML/Markdown using Upstage Document Parse

Overview

This node integrates with the Upstage Document Parse API to convert documents into structured HTML, Markdown, or plain text. It supports both synchronous parsing and asynchronous submission of files, with configurable OCR (Optical Character Recognition), chart recognition, and multipage table merging. Typical uses are automating document processing workflows and extracting structured, machine-readable content from scanned images, PDFs, and other document formats for further analysis or storage.

Use Case Examples

  1. Automatically extract text and layout from scanned invoices or contracts for data entry automation.
  2. Convert research papers or reports into Markdown or HTML for easy web publishing.
  3. Submit large documents asynchronously for processing and retrieve results later to handle long-running tasks efficiently.

Properties

  • Binary Property - Name of the input item binary property that contains the file to be processed.
  • Model - Selects the document parsing model to use, with options like 'document-parse' or 'document-parse-nightly'.
  • OCR - Determines whether OCR is applied before layout detection. 'Auto' applies OCR only to image documents; 'Force' always performs OCR.
  • Base64 Encoding Categories - Specifies which categories of cropped base64 images to return, such as figures, tables, equations, or charts.
  • Merge Multipage Tables - Boolean flag to merge tables that span multiple pages into a single table.
  • Output Formats - Specifies which output formats to include in the response, such as HTML, Markdown, or Text.
  • Include Coordinates - Boolean flag to include bounding box coordinates for each layout element in the output.
  • Chart Recognition - Boolean flag to enable chart recognition; when true, charts are converted into tables.
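
As a rough illustration of how these properties might translate into a request, the Python helper below assembles a set of form fields. The field names (`model`, `ocr`, `base64_encoding`, `merge_multipage_tables`, `output_formats`, `coordinates`, `chart_recognition`), the default values, and the list-as-JSON-string encoding are assumptions made for this sketch and are not taken from the node's source; check them against the Upstage API reference before relying on them.

```python
import json


def build_parse_fields(
    model="document-parse",
    ocr="auto",
    base64_categories=(),
    merge_multipage_tables=False,
    output_formats=("html",),
    include_coordinates=True,
    chart_recognition=True,
):
    """Map node properties to form fields.

    Field names and defaults are assumed for illustration, not confirmed.
    """
    if ocr not in ("auto", "force"):
        raise ValueError(f"ocr must be 'auto' or 'force', got {ocr!r}")

    fields = {
        "model": model,
        "ocr": ocr,
        # Booleans are sent as lowercase strings in this sketch.
        "merge_multipage_tables": str(merge_multipage_tables).lower(),
        "coordinates": str(include_coordinates).lower(),
        "chart_recognition": str(chart_recognition).lower(),
        # Lists are serialized as JSON strings (an assumption).
        "output_formats": json.dumps(list(output_formats)),
    }
    # Only request cropped base64 images when categories were selected.
    if base64_categories:
        fields["base64_encoding"] = json.dumps(list(base64_categories))
    return fields


fields = build_parse_fields(
    ocr="force",
    output_formats=["html", "markdown"],
    base64_categories=["table"],
)
```

Keeping the mapping in a pure function, separate from any network call, makes it easy to inspect exactly what would be sent for a given combination of properties.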

Output

JSON

  • request_id - The unique identifier for the asynchronous document processing request.
  • submitted - Boolean indicating if the document was successfully submitted for asynchronous processing.
  • html - Extracted content formatted as HTML (available in synchronous mode).
  • markdown - Extracted content formatted as Markdown (available in synchronous mode).
  • text - Extracted content formatted as plain text (available in synchronous mode).
  • elements - Array of layout elements extracted from the document, such as paragraphs, tables, and figures (available in synchronous mode).
  • error - Error message if the processing failed.
  • statusCode - HTTP status code associated with an error.
  • timestamp - Timestamp when the error occurred.
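
Downstream steps usually need to tell these three shapes apart. The following Python sketch shows one possible way to do that, using only the field names listed above. The function name `classify_output` and the order in which the checks are applied are hypothetical choices for this example, not behavior defined by the node.

```python
def classify_output(item):
    """Return 'error', 'submitted', 'parsed', or 'unknown' for one output JSON object."""
    # An error field takes priority over everything else in this sketch.
    if "error" in item:
        return "error"
    # Asynchronous submissions carry a request_id and a truthy submitted flag.
    if "request_id" in item and item.get("submitted"):
        return "submitted"
    # Synchronous results carry at least one of the parsed-content fields.
    if any(key in item for key in ("html", "markdown", "text", "elements")):
        return "parsed"
    return "unknown"
```

For example, `classify_output({"request_id": "abc", "submitted": True})` returns `"submitted"`, while an object containing `html` or `elements` is classified as `"parsed"`.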

Dependencies

  • Requires an API key credential for the Upstage Document Parse API (referred to as 'blabApi' in the node).

Troubleshooting

  • Error 'No binary data found in property "".' indicates the specified binary property does not exist or is empty in the input item. Verify the binary property name matches the input data.
  • Missing or invalid API credentials will cause authentication errors. Ensure the API key credential is correctly configured.
  • For asynchronous operations, a missing or invalid Request ID will cause errors. Provide a valid Request ID when retrieving results.
  • Network or API endpoint errors may occur; check internet connectivity and API service status.
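
The binary property check described in the first item can be approximated with a small pre-flight helper. This Python sketch assumes an input item shaped as a dictionary with a `binary` key that maps property names to entries holding base64 content under `data`; that shape, the helper name `get_binary`, the default property name `"data"`, and the placement of the property name inside the quoted message are all assumptions for illustration.

```python
def get_binary(item, property_name="data"):
    """Return the binary entry for property_name, or raise if it is missing or empty."""
    binary = item.get("binary") or {}
    entry = binary.get(property_name)
    # Treat both a missing entry and an entry without content as an error.
    if not entry or not entry.get("data"):
        raise ValueError(f'No binary data found in property "{property_name}".')
    return entry


item = {"binary": {"data": {"data": "aGVsbG8=", "mimeType": "application/pdf"}}}
entry = get_binary(item)
```

Running a check like this before calling the API surfaces a misnamed binary property immediately, instead of as a failed request.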
