
h2oGPTe

h2oGPTe is an AI-powered search assistant that helps your internal teams answer questions using large volumes of documents, websites, and workplace content.


Overview

This node operation allows you to add files from Google Cloud Storage (GCS) into a specified document collection. It is designed for ingesting documents stored in GCS buckets or directories directly into a collection for further processing, indexing, or querying within the system.

Common scenarios where this node is beneficial include:

  • Automating the ingestion of large volumes of documents stored in GCS into a knowledge base or document management system.
  • Integrating cloud storage with AI-powered search or analysis workflows.
  • Periodically syncing or updating collections with new or updated files from GCS.

Practical example:

  • A company stores scanned contracts and reports in GCS. Using this node, they can automatically ingest these files into a collection that powers an internal search assistant, enabling employees to query contract details efficiently.

Properties

Name: Meaning
Collection ID: String ID of the collection to add the ingested documents into. This identifies the target collection where files will be added.
URLs: The path or list of paths of GCS files or directories to ingest. Supports specifying multiple files or folders in GCS.
Additional Options: A set of optional parameters to customize the ingestion process:
- Audio Input Language: Language of audio files; default is "auto" for automatic detection.
- Chunk By Page: Boolean flag indicating whether each page should be treated as a separate chunk. If true, keep_tables_as_one_chunk is ignored.
- Credentials: JSON object holding the Google Cloud service account key. If omitted, only public buckets are accessible.
- Gen Doc Questions: Boolean flag to auto-generate sample questions for each document using a large language model (LLM).
- Gen Doc Summaries: Boolean flag to auto-generate document summaries using an LLM.
- Handwriting Check: Boolean flag to check pages for handwriting and use specialized models if handwriting is detected.
- Ingest Mode: Mode of ingestion: "standard" (files ingested for retrieval-augmented generation) or "agent_only" (bypasses standard ingestion, used for agent-specific workflows).
- Keep Tables As One Chunk: Boolean flag indicating whether tables identified by the parser should be kept as a single chunk.
- Metadata: JSON object containing metadata to associate with the ingested documents.
- Ocr Model: Method to extract text from images using AI-enabled OCR models. Default is "auto".
- Tesseract Lang: Language code to use when the OCR model is set to "tesseract".
- Timeout: Timeout in seconds for the ingestion request. Default is 0 (no timeout).
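
How some of these options interact can be sketched as a small helper that assembles the optional parameters. This is only an illustrative sketch: apart from keep_tables_as_one_chunk, which the property description names, the snake_case keys and the build_ingest_options function are hypothetical and simply mirror the display names above.

```python
from typing import Any, Dict, Optional


def build_ingest_options(
    chunk_by_page: bool = False,
    keep_tables_as_one_chunk: bool = False,
    ingest_mode: str = "standard",
    ocr_model: str = "auto",
    tesseract_lang: Optional[str] = None,
    timeout: int = 0,
) -> Dict[str, Any]:
    """Assemble optional ingestion parameters (key names are assumptions)."""
    if ingest_mode not in ("standard", "agent_only"):
        raise ValueError(f"unknown ingest mode: {ingest_mode!r}")
    if timeout < 0:
        raise ValueError("timeout must be 0 (no timeout) or a positive number of seconds")

    options: Dict[str, Any] = {
        "ingest_mode": ingest_mode,
        "ocr_model": ocr_model,
        "timeout": timeout,
        "chunk_by_page": chunk_by_page,
    }
    # Chunk By Page takes precedence: the table option is ignored when it is on.
    if not chunk_by_page:
        options["keep_tables_as_one_chunk"] = keep_tables_as_one_chunk
    # Tesseract Lang only applies when the OCR model is "tesseract".
    if ocr_model == "tesseract" and tesseract_lang:
        options["tesseract_lang"] = tesseract_lang
    return options
```

For example, build_ingest_options(chunk_by_page=True, keep_tables_as_one_chunk=True) leaves the table option out entirely, which matches the rule that Chunk By Page overrides it.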

Output

The node outputs a JSON response representing the result of the ingestion request. This typically includes information about the ingestion job status, any errors encountered, and metadata about the ingested documents or collection update.

If the ingestion involves binary data (e.g., files), the node handles it internally but does not output raw binary data directly. Instead, it provides references or statuses related to the ingestion process.

Dependencies

  • Requires access to Google Cloud Storage, optionally authenticated via a Google Cloud service account JSON key provided in the credentials property.
  • The node communicates with an external API endpoint /ingest/gcs to perform the ingestion.
  • Proper permissions on the GCS bucket and the target collection are necessary.
  • Network connectivity to the API and GCS endpoints is required.
  • No additional environment variables are explicitly required beyond the credentials passed.
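
To make the dependency on the /ingest/gcs endpoint concrete, the following sketch prepares (but does not send) such a request using only the Python standard library. Only the endpoint path comes from this page. The base URL, the bearer-token header, the POST method, and the placement of collection_id, urls, and credentials in a JSON body are all assumptions made for illustration and may differ from the real API.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional


def build_gcs_ingest_request(
    base_url: str,
    api_key: str,
    collection_id: str,
    urls: List[str],
    credentials: Optional[Dict[str, Any]] = None,
) -> urllib.request.Request:
    """Prepare a request to /ingest/gcs without sending it.

    The body layout and the auth header are assumptions, not the documented API.
    """
    body: Dict[str, Any] = {"collection_id": collection_id, "urls": urls}
    # Without a service account key, only public buckets are reachable.
    if credentials is not None:
        body["credentials"] = credentials
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/ingest/gcs",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Building the request separately from sending it makes it easy to inspect the URL and payload before any network call is made.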

Troubleshooting

  • Authentication Errors: If the credentials JSON is missing or invalid, the node may fail to access private GCS buckets. Ensure the service account key is correct and has appropriate permissions.
  • Timeouts: Large file sets or slow network connections may cause timeouts. Increase the Timeout value, or leave it at the default of 0 (no timeout).
  • Invalid Paths: Incorrect or inaccessible GCS paths in the URLs property will cause ingestion failures. Verify the paths exist and are accessible.
  • Permission Denied: Lack of permission on the target collection or GCS bucket will result in errors. Confirm user and service account permissions.
  • Unsupported File Types: Some file types may not be supported by the ingestion backend. Check documentation for supported formats.
  • Handwriting Check Issues: Enabling handwriting check may increase processing time or require specific models; disable if unnecessary.
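
Several of the failures above can be caught before the node runs. The sketch below shows two hypothetical pre-flight checks: one for path syntax and one for the service account key. It assumes that paths use the gs://bucket/object form, which this page does not confirm, and it only checks fields that Google service account key files normally contain (type, client_email, private_key). It cannot verify that a path exists or that the account has permission.

```python
import json
from typing import Any, List

REQUIRED_KEY_FIELDS = ("type", "client_email", "private_key")


def check_gcs_paths(urls: List[str]) -> List[str]:
    """Return the entries that do not look like gs://bucket[/object] paths."""
    bad = []
    for url in urls:
        # Extract the bucket name; anything without the gs:// scheme has none.
        bucket = url[len("gs://"):].split("/", 1)[0] if url.startswith("gs://") else ""
        if not bucket:
            bad.append(url)
    return bad


def check_service_account_key(raw: str) -> List[str]:
    """Return a list of problems found in a service account key JSON string."""
    try:
        key: Any = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["expected a JSON object"]
    problems = [f"missing field: {name}" for name in REQUIRED_KEY_FIELDS if not key.get(name)]
    if key.get("type") and key["type"] != "service_account":
        problems.append("type is not 'service_account'")
    return problems
```

An empty list from either function means no syntactic problem was found; permission and existence errors still have to be diagnosed from the node's response.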

This summary is based solely on static analysis of the provided source code and property definitions without runtime execution.
