Blab Document Parse

Convert documents into structured HTML/Markdown using Upstage Document Parse

Overview

This node integrates with the Upstage Document Parse API to convert documents into structured formats such as HTML, Markdown, or plain text. It supports synchronous parsing by uploading a file, asynchronous submission and retrieval of parsing results, and listing of asynchronous requests. It is useful for automating document digitization workflows, extracting structured content from PDFs, images, or other document types, and converting charts to tables.

Use Case Examples

  1. Extract structured HTML content from uploaded PDF documents for further processing or display.
  2. Perform OCR on image documents before layout detection to extract text content.
  3. Submit large documents asynchronously and retrieve parsing results later to handle long processing times.
  4. Convert document elements into multiple output formats like HTML and Markdown for flexible use in different applications.
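The asynchronous use case (submit now, retrieve later) can be sketched as a polling loop. This is a minimal illustration, not the node's implementation: `wait_for_result` and the `fetch_status` callable are hypothetical, and the `status` values `"completed"` and `"failed"` are assumptions about the shape of the retrieval response.

```python
import time


def wait_for_result(fetch_status, request_id, interval=5.0, max_attempts=60,
                    sleep=time.sleep):
    """Poll a hypothetical fetch_status(request_id) until the request finishes.

    fetch_status stands in for whatever retrieves an async result; it is
    assumed to return a dict with a "status" key (assumed values:
    "completed", "failed", anything else means still running).
    """
    if not request_id:
        raise ValueError("request_id is required")
    for attempt in range(max_attempts):
        result = fetch_status(request_id)
        status = result.get("status")
        if status == "completed":
            return result
        if status == "failed":
            raise RuntimeError(f"request {request_id} failed")
        # Wait before the next poll, but not after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"request {request_id} not finished after {max_attempts} polls")
```

Passing `fetch_status` and `sleep` as arguments keeps the loop independent of any HTTP client, so the same function can be driven by a real API call or by a stub.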

Properties

  • Binary Property - Name of the input item binary property that contains the file to be parsed.
  • Model - Selects the document parsing model to use, e.g., 'document-parse' or 'document-parse-nightly'.
  • OCR - Determines whether to perform OCR inference on the document before layout detection. 'Auto' applies OCR only to image documents; 'Force' always performs OCR.
  • Base64 Encoding Categories - Select categories of layout elements (figure, table, equation, chart) for which cropped base64 images should be returned.
  • Merge Multipage Tables - Whether to merge tables that span multiple pages into a single table.
  • Output Formats - Specifies which output formats to include in the response. Each layout element will be formatted accordingly.
  • Include Coordinates - Whether to return coordinates of bounding boxes for each layout element.
  • Chart Recognition - Whether to use chart recognition to convert charts into tables.
  • Return - Determines the part of the response to return when performing synchronous parsing.
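To make the relationship between these properties and a parse request concrete, the sketch below collects them into a flat dictionary of form fields. Everything here is illustrative: `build_parse_form` is a hypothetical helper, the field names are assumptions modeled on the property names rather than confirmed API parameters, and the default values are placeholders, not the node's documented defaults.

```python
import json


def build_parse_form(model="document-parse", ocr="auto", output_formats=("html",),
                     base64_categories=(), merge_multipage_tables=False,
                     coordinates=True, chart_recognition=True):
    """Map node-style properties to form fields (field names are assumed)."""
    if ocr not in ("auto", "force"):
        raise ValueError(f"ocr must be 'auto' or 'force', got {ocr!r}")
    form = {
        "model": model,
        "ocr": ocr,
        # Lists are serialized as JSON strings so they fit in form fields.
        "output_formats": json.dumps(list(output_formats)),
        # Booleans are sent as lowercase strings ("true" / "false").
        "merge_multipage_tables": str(merge_multipage_tables).lower(),
        "coordinates": str(coordinates).lower(),
        "chart_recognition": str(chart_recognition).lower(),
    }
    # Only request cropped base64 images when categories were selected.
    if base64_categories:
        form["base64_encoding"] = json.dumps(list(base64_categories))
    return form
```

For example, `build_parse_form(ocr="force", output_formats=("html", "markdown"))` yields a dictionary that could be passed as the form data of a multipart upload alongside the file itself.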

Output

JSON

  • html - HTML formatted content of the parsed document.
  • markdown - Markdown formatted content of the parsed document.
  • text - Plain text content of the parsed document.
  • elements - Array of layout elements extracted from the document.
  • request_id - ID of the asynchronous request when submitting a document for async processing.
  • submitted - Boolean indicating if the async request was successfully submitted.
  • error - Error message if the request failed.
  • statusCode - HTTP status code associated with an error.
  • timestamp - Timestamp of when the error occurred.
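Because the same output shape carries parsed content, async submission receipts, and errors, downstream steps usually need to tell them apart. The following is a rough sketch that relies only on the field names listed above; `classify_item` is a hypothetical helper, and the precedence (error first, then submission, then content) is an assumption.

```python
def classify_item(item):
    """Return 'error', 'submitted', 'parsed', or 'unknown' for one output JSON."""
    # An error message takes precedence over any other field.
    if item.get("error"):
        return "error"
    # An async submission carries a request_id and a truthy submitted flag.
    if item.get("submitted") and item.get("request_id"):
        return "submitted"
    # Synchronous parsing returns at least one content field.
    if any(key in item for key in ("html", "markdown", "text", "elements")):
        return "parsed"
    return "unknown"
```

A workflow could route on the returned label, for instance storing `request_id` for later retrieval when the label is `submitted`.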

Dependencies

  • Upstage Document Parse API
  • An API key credential for authentication

Troubleshooting

  • Error 'No binary data found in property' indicates the specified binary property does not exist or is empty; ensure the input item contains the file in the correct binary property.
  • Missing or invalid API credentials will cause authentication failures; verify the API key credential is correctly configured.
  • For asynchronous operations, providing an invalid or missing request ID will cause errors; ensure the request ID is correct when retrieving results.
  • Network or API endpoint errors may occur; check internet connectivity and API service status.
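The first three checks above can be run before any request is sent. The sketch below is a hedged illustration: `preflight_errors` is a hypothetical helper, and the item layout (a dict with a `binary` mapping keyed by property name) is an assumption about how input items are structured.

```python
def preflight_errors(item, binary_property, api_key, request_id=None,
                     needs_file=True, needs_request_id=False):
    """Collect configuration problems before calling the API.

    The item is assumed to look like {"binary": {<property name>: {...}}}.
    """
    errors = []
    if needs_file:
        binary = item.get("binary") or {}
        # Missing or empty binary property.
        if not binary.get(binary_property):
            errors.append(f"No binary data found in property '{binary_property}'")
    # Missing or blank API key.
    if not (api_key or "").strip():
        errors.append("API key is missing")
    # Retrieval of async results needs a non-blank request ID.
    if needs_request_id and not (request_id or "").strip():
        errors.append("Request ID is missing")
    return errors
```

An empty list means the inputs pass these basic checks; it does not guarantee that the API key is valid or that the request ID exists on the server.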
