h2oGPTe

h2oGPTe is an AI-powered search assistant for your internal teams to answer questions gleaned from large volumes of documents, websites and workplace content.

Overview

This node operation creates a job to add files from Google Cloud Storage (GCS) into a specified document collection. It is designed for ingesting documents stored in GCS buckets or directories into a collection for further processing, indexing, or querying within the system.

Typical use cases include:

  • Automating the ingestion of large volumes of documents stored in GCS into a centralized document management or search system.
  • Periodically updating collections with new or updated files from GCS.
  • Preparing documents for downstream AI-powered analysis, such as summarization, question generation, or semantic search.

For example, a user might specify a GCS bucket path containing PDFs and images, and this node will create a job that imports those files into a collection where they can be queried or processed by other workflows.

Properties

Name: Meaning
Collection ID: String ID of the collection to add the ingested documents into. This identifies the target collection for the ingestion job.
URLs: The path or list of paths of GCS files or directories to ingest. Supports specifying multiple files or folders in GCS.
Additional Options: A set of optional parameters to customize the ingestion behavior:
- Audio Input Language: Language of audio files; default is "auto" for automatic detection.
- Chunk By Page: Boolean flag indicating whether each page should be treated as a separate chunk. If true, keep_tables_as_one_chunk is ignored.
- Credentials: JSON object holding the Google Cloud service account key. If omitted, only public buckets are accessible.
- Gen Doc Questions: Boolean flag to auto-generate sample questions for each document using a large language model (LLM).
- Gen Doc Summaries: Boolean flag to auto-generate document summaries using an LLM.
- Handwriting Check: Boolean flag to check pages for handwriting and use specialized models if handwriting is detected.
- Ingest Mode: Option to select the ingest mode: "standard" (files ingested for retrieval-augmented generation) or "agent_only" (bypasses standard ingestion, used for agent-specific workflows).
- Keep Tables As One Chunk: Boolean flag indicating whether tables identified by the parser should be kept as a single chunk.
- Metadata: JSON metadata to associate with the ingested documents.
- Ocr Model: Method to extract text from images using AI-enabled OCR models. Default is "auto".
- Tesseract Lang: Language code to use when OCR model is set to "tesseract".
- Timeout: Timeout in seconds for the ingestion job request. Default is 0 (no timeout).
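
The way these options interact can be sketched in a few lines of Python. This is a minimal illustration, not the node's actual implementation: the function name `build_ingest_options` and the snake_case keys other than `chunk_by_page` and `keep_tables_as_one_chunk` are assumptions made for the example, and the example bucket path is hypothetical. It shows a single path being normalized to a list, the "standard" / "agent_only" choice being checked, and `keep_tables_as_one_chunk` being dropped when `chunk_by_page` is true.

```python
from typing import Any, Dict, List, Union


def build_ingest_options(
    collection_id: str,
    urls: Union[str, List[str]],
    chunk_by_page: bool = False,
    keep_tables_as_one_chunk: bool = False,
    ingest_mode: str = "standard",
    ocr_model: str = "auto",
    tesseract_lang: str = "",
    timeout: int = 0,
) -> Dict[str, Any]:
    """Assemble a hypothetical options dictionary for a GCS ingestion job."""
    if not collection_id:
        raise ValueError("collection_id must not be empty")
    if ingest_mode not in ("standard", "agent_only"):
        raise ValueError(f"unknown ingest mode: {ingest_mode!r}")
    if timeout < 0:
        raise ValueError("timeout must be 0 (no timeout) or a positive number of seconds")

    # Accept either a single path or a list of paths.
    url_list = [urls] if isinstance(urls, str) else list(urls)
    if not url_list:
        raise ValueError("at least one GCS path is required")

    options: Dict[str, Any] = {
        "collection_id": collection_id,
        "urls": url_list,
        "chunk_by_page": chunk_by_page,
        "ingest_mode": ingest_mode,
        "ocr_model": ocr_model,
        "timeout": timeout,
    }
    # keep_tables_as_one_chunk is ignored when chunk_by_page is true, so omit it.
    if not chunk_by_page:
        options["keep_tables_as_one_chunk"] = keep_tables_as_one_chunk
    # The Tesseract language only matters when the OCR model is "tesseract".
    if ocr_model == "tesseract" and tesseract_lang:
        options["tesseract_lang"] = tesseract_lang
    return options


# Hypothetical usage: one GCS directory, chunked by page.
example = build_ingest_options(
    "my-collection-id",
    "gs://example-bucket/reports/",
    chunk_by_page=True,
    keep_tables_as_one_chunk=True,
)
```

Omitting the ignored option up front makes the effective configuration visible before the job is created, rather than leaving the conflict to be discovered from the resulting chunks.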

Output

The node outputs the response from the API call that creates the ingestion job. The output JSON typically contains details about the created job, such as its unique identifier, status, and any relevant metadata confirming the job creation.

This operation takes URLs and JSON options as input rather than uploaded files, so binary output is not expected.

Dependencies

  • Requires access to the Google Cloud Storage service, including appropriate permissions to read the specified buckets or files.
  • If accessing private GCS buckets, a valid Google Cloud service account JSON key must be provided in the credentials property.
  • The node communicates with an external API endpoint responsible for managing ingestion jobs; thus, an API authentication token or key credential is required (configured in n8n).
  • Network connectivity to both the API server and GCS endpoints is necessary.
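
Before supplying a service account key for a private bucket, it can help to check that the JSON is well formed. The sketch below is an illustrative pre-check, not part of the node: the helper name `missing_service_account_fields` is hypothetical, and the field list covers only a subset of the fields normally found in a Google Cloud service account key file. It does not verify that the key is valid or has permission to read the bucket.

```python
import json
from typing import List

# A subset of the fields normally present in a service account key file.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def missing_service_account_fields(credentials_json: str) -> List[str]:
    """Return the problems found in a service account key; an empty list means none.

    Raises ValueError (json.JSONDecodeError is a subclass) if the text is
    not valid JSON or is not a JSON object.
    """
    data = json.loads(credentials_json)
    if not isinstance(data, dict):
        raise ValueError("credentials must be a JSON object")

    # Report fields that are absent or empty.
    problems = [field for field in REQUIRED_KEY_FIELDS if not data.get(field)]
    # A key of another type (for example a user credential) is also a problem.
    if data.get("type") and data["type"] != "service_account":
        problems.append("type (expected 'service_account')")
    return problems
```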

Troubleshooting

  • Authentication Errors: Without a credentials JSON, only public buckets are accessible. If the credentials are missing or invalid when reading a private bucket, the job will fail due to unauthorized access to GCS. Ensure the service account key is correct and has sufficient permissions.
  • Invalid URLs: Providing incorrect or inaccessible GCS paths will cause the ingestion job to fail or skip files. Verify the URLs point to existing files or directories.
  • Timeouts: The default timeout of 0 means no timeout. If a non-zero timeout is set, large ingestion jobs may require a higher value to prevent premature termination.
  • Option Conflicts: Setting chunk_by_page to true ignores keep_tables_as_one_chunk. Be aware of these interactions to avoid unexpected chunking behavior.
  • Handwriting Check: Enabling handwriting detection may increase processing time; disable if not needed.
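
Malformed paths can be caught before the job is created. The following sketch assumes paths use the `gs://bucket/object` form; whether the API also accepts other forms is not stated here, so treat that as an assumption. The helper name `split_gcs_paths` is hypothetical. The check is purely syntactic and cannot tell whether a file or directory actually exists.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def split_gcs_paths(urls: List[str]) -> Tuple[List[str], List[str]]:
    """Partition paths into (well-formed, malformed) by the gs://bucket/... shape."""
    valid: List[str] = []
    invalid: List[str] = []
    for url in urls:
        cleaned = url.strip()
        parsed = urlparse(cleaned)
        # A usable path needs the gs scheme and a non-empty bucket name.
        if parsed.scheme == "gs" and parsed.netloc:
            valid.append(cleaned)
        else:
            invalid.append(url)
    return valid, invalid
```

Reporting the malformed entries separately, instead of failing on the first one, makes it easier to fix a long list of paths in one pass.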


This summary is based on static analysis of the node's properties and bundled source code related to the "Document Ingestion" resource and the "Creates a Job to Add Files From the Google Cloud Storage Into a Collection" operation.
