h2oGPTe

h2oGPTe is an AI-powered search assistant that helps internal teams answer questions using large volumes of documents, websites, and workplace content.

Overview

This node operation creates a job to add files from Google Cloud Storage (GCS) into a specified document collection. It is designed for ingesting documents stored in GCS buckets or directories into a collection for further processing, indexing, or querying within the system.

Typical use cases include:

Automating the ingestion of large volumes of documents stored in GCS into a centralized document management or search system.
Periodically updating collections with new or updated files from GCS.
Preparing documents for downstream AI-powered analysis, such as summarization, question generation, or semantic search.

For example, a user might specify a GCS bucket path containing PDFs and images, and this node will create a job that imports those files into a collection where they can be queried or processed by other workflows.

Properties

Name	Meaning
Collection ID	String ID of the collection to add the ingested documents into. This identifies the target collection for the ingestion job.
URLs	The path or list of paths of GCS files or directories to ingest. Supports specifying multiple files or folders in GCS.
Additional Options	A set of optional parameters to customize the ingestion behavior:
- Audio Input Language	Language of audio files; default is "auto" for automatic detection.
- Chunk By Page	Boolean flag indicating whether each page should be treated as a separate chunk. If true, `keep_tables_as_one_chunk` is ignored.
- Credentials	JSON object holding the Google Cloud service account key. If omitted, only public buckets are accessible.
- Gen Doc Questions	Boolean flag to auto-generate sample questions for each document using a large language model (LLM).
- Gen Doc Summaries	Boolean flag to auto-generate document summaries using an LLM.
- Handwriting Check	Boolean flag to check pages for handwriting and use specialized models if handwriting is detected.
- Ingest Mode	Option to select the ingest mode: "standard" (files ingested for retrieval-augmented generation) or "agent_only" (bypasses standard ingestion, used for agent-specific workflows).
- Keep Tables As One Chunk	Boolean flag indicating whether tables identified by the parser should be kept as a single chunk.
- Metadata	JSON metadata to associate with the ingested documents.
- Ocr Model	Method to extract text from images using AI-enabled OCR models. Default is "auto".
- Tesseract Lang	Language code to use when OCR model is set to "tesseract".
- Timeout	Timeout in seconds for the ingestion job request. Default is 0 (no timeout).
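
As an illustration of how these options interact, the sketch below assembles an options payload in Python. It is not the node's implementation: `build_ingest_options` is a hypothetical helper, and apart from `chunk_by_page` and `keep_tables_as_one_chunk` the snake_case key names are assumed spellings derived from the display names in the table.

```python
from typing import Any, Dict, Optional


def build_ingest_options(
    chunk_by_page: bool = False,
    keep_tables_as_one_chunk: bool = False,
    ingest_mode: str = "standard",
    ocr_model: str = "auto",
    tesseract_lang: Optional[str] = None,
    timeout: int = 0,
) -> Dict[str, Any]:
    """Build an options dict (hypothetical key names) for an ingestion job."""
    if ingest_mode not in ("standard", "agent_only"):
        raise ValueError(f"unknown ingest mode: {ingest_mode!r}")
    if timeout < 0:
        raise ValueError("timeout must be >= 0 (0 means no timeout)")

    options: Dict[str, Any] = {
        "chunk_by_page": chunk_by_page,
        "ingest_mode": ingest_mode,
        "ocr_model": ocr_model,
        "timeout": timeout,
    }
    # keep_tables_as_one_chunk is ignored when chunk_by_page is true,
    # so leave it out rather than send a value that has no effect.
    if not chunk_by_page:
        options["keep_tables_as_one_chunk"] = keep_tables_as_one_chunk
    # The Tesseract language only applies when the OCR model is "tesseract".
    if ocr_model == "tesseract" and tesseract_lang:
        options["tesseract_lang"] = tesseract_lang
    return options
```

Dropping options that would be ignored keeps the payload an honest record of what the job will actually do, which makes unexpected chunking easier to diagnose.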

Output

The node outputs the response from the API call that creates the ingestion job. The output JSON typically contains details about the created job, such as its unique identifier, status, and any relevant metadata confirming the job creation.

This operation takes GCS URLs and JSON options as input rather than uploaded files, so binary output is not expected.

Dependencies

Requires access to the Google Cloud Storage service, including appropriate permissions to read the specified buckets or files.
If accessing private GCS buckets, a valid Google Cloud service account JSON key must be provided in the credentials property.
The node communicates with an external API endpoint responsible for managing ingestion jobs; thus, an API authentication token or key credential is required (configured in n8n).
Network connectivity to both the API server and GCS endpoints is necessary.
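
Before supplying a service account key in the credentials property, it can help to check that the JSON is well formed. The following is a minimal local sketch, assuming the key is a standard Google Cloud service account key file; `find_key_problems` is a hypothetical helper, and it does not verify that the account has permission to read the bucket.

```python
import json
from typing import List

# Fields found in a standard Google Cloud service account key file.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def find_key_problems(raw_json: str) -> List[str]:
    """Return a list of problems in a service account key; empty if none found."""
    try:
        key = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["key must be a JSON object"]

    problems = [
        f"missing field: {name}" for name in REQUIRED_KEY_FIELDS if not key.get(name)
    ]
    key_type = key.get("type")
    if key_type and key_type != "service_account":
        problems.append(f"unexpected type: {key_type!r}")
    return problems
```

An empty result only means the key looks structurally complete; a revoked key or one without read access will still fail at ingestion time.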

Troubleshooting

Authentication Errors: If the credentials JSON is missing or invalid, access to private GCS buckets is unauthorized and job creation will fail; without credentials, only public buckets are accessible. Ensure the service account key is correct and has sufficient permissions.
Invalid URLs: Providing incorrect or inaccessible GCS paths will cause the ingestion job to fail or skip files. Verify the URLs point to existing files or directories.
Timeouts: Large ingestion jobs may require increasing the timeout value to prevent premature termination.
Option Conflicts: When chunk_by_page is true, keep_tables_as_one_chunk is ignored. Keep this interaction in mind to avoid unexpected chunking behavior.
Handwriting Check: Enabling handwriting detection may increase processing time; disable if not needed.
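
Malformed paths can be caught before a job is created. The sketch below assumes paths are written in the `gs://bucket/path` form, which this page does not confirm, and uses a simplified version of the GCS bucket naming rules; `split_gcs_url` and `invalid_urls` are hypothetical helpers. It checks format only and cannot tell whether a file or directory exists.

```python
import re
from typing import Iterable, List, Tuple

# Simplified bucket-name pattern: 3-63 characters, lowercase letters, digits,
# '-', '_' and '.', starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def split_gcs_url(url: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path' into (bucket, path); raise ValueError if malformed."""
    prefix = "gs://"
    if not url.startswith(prefix):
        raise ValueError(f"expected a gs:// URL, got {url!r}")
    bucket, _, path = url[len(prefix):].partition("/")
    if not _BUCKET_RE.match(bucket):
        raise ValueError(f"invalid bucket name in {url!r}")
    return bucket, path


def invalid_urls(urls: Iterable[str]) -> List[str]:
    """Return the URLs that fail the format check, in input order."""
    bad = []
    for url in urls:
        try:
            split_gcs_url(url)
        except ValueError:
            bad.append(url)
    return bad
```

For example, `invalid_urls(["gs://my-bucket/a.pdf", "https://example.com/a.pdf"])` flags only the second entry, since it does not use the assumed `gs://` scheme.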

Links and References

This summary is based on static analysis of the node's properties and bundled source code related to the "Document Ingestion" resource and the "Creates a Job to Add Files From the Google Cloud Storage Into a Collection" operation.
