h2oGPTe icon

h2oGPTe

h2oGPTe is an AI-powered search assistant for your internal teams to answer questions gleaned from large volumes of documents, websites and workplace content.

Actions198

Overview

This node operation allows users to add files from an Azure Blob Storage container into a specified document collection. It is designed for scenarios where documents stored in Azure Blob Storage need to be ingested and indexed within a collection for further processing, searching, or analysis. This is particularly useful for organizations managing large volumes of documents in Azure and wanting to integrate them into their AI-powered document management or search systems.

Practical examples include:

  • Automatically ingesting new files uploaded to an Azure Blob Storage container into a knowledge base collection.
  • Migrating existing document archives from Azure Blob Storage into a searchable document collection.
  • Enriching collections with documents stored remotely without manual download and upload steps.

Properties

Name Meaning
Collection ID String ID of the collection to add the ingested documents into.
Container Name of the Azure Blob Storage container where the files are located.
Paths Path or list of paths to files or directories within the Azure Blob Storage container to ingest.
Account Name Name of the Azure storage account.
Additional Options A set of optional parameters to customize ingestion behavior:
- Audio Input Language (string): Language of audio files, default "auto".
- Chunk By Page (boolean): Whether each page is a chunk.
- Credentials (json): Azure credentials object; if container is private, must include account_key or sas_token.
- Gen Doc Questions (boolean): Auto-generate sample questions per document using LLM.
- Gen Doc Summaries (boolean): Auto-generate document summaries using LLM.
- Handwriting Check (boolean): Check pages for handwriting and use specialized models.
- Ingest Mode (options): "standard" (default) or "agent_only" mode.
- Keep Tables As One Chunk (boolean): Keep table tokens in a single chunk.
- Metadata (json): Metadata to associate with the document.
- Ocr Model (string): OCR method to extract text from images, default "auto".
- Tesseract Lang (string): Language for Tesseract OCR.
- Timeout (number): Timeout in seconds for the ingestion request.

Output

The output JSON contains the response from the ingestion API call indicating the status of the ingestion job or process. It typically includes information about the ingestion task such as success confirmation, job IDs, or error messages if any occurred.

If the ingestion involves binary data (e.g., files), this node handles the ingestion process but does not output binary data directly; instead, it returns metadata about the ingestion operation.

Dependencies

  • Requires access to an Azure Blob Storage account with appropriate permissions.
  • Requires valid Azure credentials (such as account key or SAS token) if the container is private.
  • The node communicates with an external API endpoint /ingest/azure_blob_storage to perform the ingestion.
  • Proper configuration of API authentication (an API key credential) for the external service is necessary.

Troubleshooting

  • Authentication errors: Ensure that the Azure credentials provided are correct and have sufficient permissions to access the specified container and paths.
  • Timeouts: If ingestion takes too long, increase the timeout parameter or check network connectivity.
  • Invalid paths: Verify that the paths specified exist in the Azure Blob Storage container.
  • Permission denied: Confirm that the API key credential used has permission to perform ingestion operations.
  • Incorrect ingestion mode: Using "agent_only" mode bypasses standard ingestion; ensure this mode fits your use case.
  • OCR issues: If OCR extraction fails or produces incorrect text, try changing the OCR model or specifying the correct language for Tesseract.

Links and References


This summary is based on static analysis of the node's properties and routing configuration related to the "Adds Files From the Azure Blob Storage Into a Collection" operation under the Document Ingestion resource.

Discussion