h2oGPTe

h2oGPTe is an AI-powered search assistant for your internal teams to answer questions gleaned from large volumes of documents, websites and workplace content.

Overview

This node operation creates a job that imports one collection into another, existing collection. It is useful when you want to merge or copy documents and metadata from a source collection into a destination collection, optionally applying processing options during the import.

Practical examples include:

  • Migrating documents from an old project collection to a new one.
  • Aggregating multiple collections into a single unified collection.
  • Importing curated document sets with specific ingestion settings like OCR or chunking.

The operation triggers an asynchronous job that handles the import process, allowing large imports without blocking workflows.
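
Because the job runs asynchronously, a caller usually has to check on it later. The following is a minimal sketch of such a polling loop, not the node's actual behavior. The status values in `TERMINAL_STATES` and the `fetch_status` callable are hypothetical; this page does not describe how job status is retrieved, so the lookup is passed in as a function.

```python
import time
from typing import Callable

# Hypothetical terminal states; the real status values are not listed on this page.
TERMINAL_STATES = {"completed", "failed", "canceled"}


def wait_for_job(
    fetch_status: Callable[[], str],
    poll_interval: float = 2.0,
    max_wait: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until it reports a terminal state or max_wait elapses."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= max_wait:
            raise TimeoutError(f"job still '{status}' after {waited:.0f}s")
        sleep(poll_interval)
        waited += poll_interval
```

Passing `sleep` as a parameter keeps the loop testable without real delays. In a workflow, the same idea can be expressed with a wait step followed by a status check.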

Properties

Name | Meaning
Collection ID | The unique identifier of the destination collection where the source collection will be imported.
Source Collection ID | The unique identifier of the source collection to be inserted into the destination collection.
Additional Options | Optional parameters that control how the import job behaves:
- Chunk By Page: Whether each page should be treated as a separate chunk (boolean). If true, keep_tables_as_one_chunk is ignored.
- Copy Document: Whether to save a new copy of the document during import (boolean).
- Gen Doc Questions: Whether to auto-generate sample questions for each document using a language model (boolean).
- Gen Doc Summaries: Whether to auto-generate document summaries using a language model (boolean).
- Handwriting Check: Whether to check pages for handwriting and use specialized models if found (boolean).
- Ingest Mode: Mode of ingestion; options are "standard" (files ingested for retrieval-augmented generation) or "agent_only" (bypasses standard ingestion) (string option).
- Keep Tables As One Chunk: Whether tables identified by the parser should be kept in a single chunk (boolean).
- Ocr Model: Which OCR model to use for extracting text from images; default is "auto" (string).
- Tesseract Lang: Language code used when OCR model is set to "tesseract" (string).
- Timeout: Timeout in seconds for the job request (number).
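
The interaction between these options can be sketched as a small normalization step. This is an illustration under assumptions, not the node's implementation. Only `chunk_by_page` and `keep_tables_as_one_chunk` appear as parameter names on this page; the other snake_case names (`ingest_mode`, `ocr_model`, `tesseract_lang`) are guessed from the display names, and `build_import_options` is a hypothetical helper.

```python
from typing import Any, Dict

# The two ingestion modes listed in the properties above.
INGEST_MODES = {"standard", "agent_only"}


def build_import_options(**options: Any) -> Dict[str, Any]:
    """Drop unset options and remove ones that would have no effect."""
    params = {key: value for key, value in options.items() if value is not None}

    mode = params.get("ingest_mode")  # assumed parameter name
    if mode is not None and mode not in INGEST_MODES:
        raise ValueError(f"ingest_mode must be one of {sorted(INGEST_MODES)}")

    # When chunk_by_page is true, keep_tables_as_one_chunk is ignored,
    # so it is removed here to make the effective settings explicit.
    if params.get("chunk_by_page") is True:
        params.pop("keep_tables_as_one_chunk", None)

    # tesseract_lang only applies when the OCR model is "tesseract"
    # (assumed parameter names).
    if params.get("ocr_model") != "tesseract":
        params.pop("tesseract_lang", None)

    return params
```

For example, `build_import_options(chunk_by_page=True, keep_tables_as_one_chunk=True)` keeps only `chunk_by_page`, which mirrors how the two settings interact.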

Output

The output JSON contains the full HTTP response from the API call that creates the import job. This typically includes details about the created job such as its unique job ID, status, and any relevant metadata about the import process.

This operation primarily returns JSON describing the job. Any binary output the node supports would relate to the files or documents involved in the import.

Dependencies

  • Requires an API key credential for authentication to the external service managing collections and jobs.
  • The base URL for API requests is configured via credentials.
  • The node depends on the external service's API endpoint /collections/{collection_id}/import_collection_job to create the import job.
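
As a rough sketch, the request to this endpoint could be assembled as shown below. Only the path `/collections/{collection_id}/import_collection_job` comes from this page. The POST method, the `Bearer` authorization scheme, and passing `source_collection_id` and the options as query parameters are all assumptions; check the service's API documentation before relying on them. The helper `build_import_request` is hypothetical and only builds the request, it does not send it.

```python
from typing import Any, Dict, Optional
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_import_request(
    base_url: str,
    api_key: str,
    collection_id: str,
    source_collection_id: str,
    options: Optional[Dict[str, Any]] = None,
) -> Request:
    """Build (but do not send) the request that creates the import job."""
    # Path taken from this page; collection_id is the destination collection.
    path = f"/collections/{quote(collection_id, safe='')}/import_collection_job"

    # Assumption: the source collection and options travel as query parameters.
    query: Dict[str, Any] = {"source_collection_id": source_collection_id}
    query.update(options or {})
    # Render booleans as lowercase "true"/"false" rather than Python's "True".
    encoded = urlencode(
        {k: (str(v).lower() if isinstance(v, bool) else v) for k, v in query.items()}
    )

    url = f"{base_url.rstrip('/')}{path}?{encoded}"
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Accept": "application/json",
    }
    return Request(url, method="POST", headers=headers)  # assumed HTTP method
```

In the node itself, the base URL and API key come from the configured credentials, so none of this has to be written by hand.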

Troubleshooting

  • Invalid Collection IDs: Ensure both the destination and source collection IDs are valid and accessible by the authenticated user.
  • Timeouts: If the import job times out, increase the timeout parameter or check network connectivity.
  • Permission Errors: Verify that the API key has sufficient permissions to read the source collection and write to the destination collection.
  • Unsupported Option Combinations: Some options override others. For example, when chunk_by_page is true, keep_tables_as_one_chunk is ignored; check that the combination you set produces the behavior you expect.
  • API Errors: Review error messages returned by the API for hints on misconfiguration or invalid parameters.
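
The checklist above can be tied to HTTP status codes with a small lookup. This sketch relies on general HTTP semantics only. Which codes the service actually returns for each failure is an assumption, and `troubleshooting_hint` is a hypothetical helper.

```python
def troubleshooting_hint(status_code: int) -> str:
    """Map an HTTP status code to the most likely item in the checklist above."""
    if status_code in (401, 403):
        return "Permission error: check that the API key can read the source and write to the destination."
    if status_code == 404:
        return "Invalid collection ID: confirm both IDs exist and are accessible."
    if status_code in (400, 422):
        return "Invalid parameters: review the option values and combinations."
    if status_code in (408, 504):
        return "Timeout: increase the timeout or check network connectivity."
    return "API error: review the error message returned by the API."
```

A workflow could branch on the returned hint, for example retrying only on the timeout case.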

Links and References

  • Refer to the external service's API documentation for collections and job management endpoints.
  • Documentation on ingestion modes and OCR models may provide additional context for advanced options.
