h2oGPTe

h2oGPTe is an AI-powered search assistant that helps your internal teams answer questions from large volumes of documents, websites, and workplace content.

Overview

This node operation adds files stored in AWS S3 to a specified document collection. It ingests documents from S3 buckets so that they can then be searched, summarized, or queried within the collection. This is particularly useful when you have large volumes of documents in S3 and want to integrate them into your document management or AI-powered knowledge systems.

Typical use cases include:

  • Automatically importing new documents uploaded to an S3 bucket into a searchable collection.
  • Enriching collections with audio, image, or text files stored remotely on AWS.
  • Leveraging additional options like OCR, handwriting detection, or generating summaries/questions using language models during ingestion.

Properties

| Name | Meaning |
|------|---------|
| Collection ID | String ID of the collection to add the ingested documents into. |
| URLs | The path or list of paths of S3 files or directories to ingest. |
| Audio Input Language | Language of audio files; default is "auto" for automatic detection. |
| Chunk By Page | Boolean flag indicating whether each page should be treated as a separate chunk. If true, keep_tables_as_one_chunk is ignored. |
| Credentials | JSON object containing S3 credentials. If omitted, only public buckets are accessible. |
| Gen Doc Questions | Boolean flag to auto-generate sample questions for each document using a large language model (LLM). |
| Gen Doc Summaries | Boolean flag to auto-generate document summaries using an LLM. |
| Handwriting Check | Boolean flag to check pages for handwriting and use specialized models if handwriting is detected. |
| Ingest Mode | Mode of ingestion: "standard" (files ingested for retrieval-augmented generation) or "agent_only" (bypasses standard ingestion). |
| Keep Tables As One Chunk | Boolean flag indicating whether tables identified by the parser should be kept as a single chunk. |
| Metadata | JSON object containing metadata to associate with the ingested documents. |
| Ocr Model | Method to extract text from images using AI-enabled optical character recognition (OCR) models. Default is "auto". |
| Region | AWS region name used for interaction with AWS services. |
| Tesseract Lang | Language code to use when OCR model is set to "tesseract". |
| Timeout | Timeout in seconds for the ingestion request. Zero means no timeout. |
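
Two rules in the properties above are easy to overlook: Chunk By Page takes precedence over Keep Tables As One Chunk, and a Timeout of zero means no timeout. The following is a minimal Python sketch of those rules only, not the node's implementation. The function names are hypothetical, and the snake_case keys are assumed by analogy with keep_tables_as_one_chunk; the real request fields may differ.

```python
from typing import Any, Optional


def build_ingest_options(
    collection_id: str,
    urls: list[str],
    *,
    chunk_by_page: bool = False,
    keep_tables_as_one_chunk: bool = False,
    timeout: float = 0,
    metadata: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Assemble a hypothetical options dict mirroring the properties table."""
    if not collection_id:
        raise ValueError("collection_id must be a non-empty string")
    if not urls:
        raise ValueError("urls must contain at least one S3 path")
    if timeout < 0:
        raise ValueError("timeout must be zero (no timeout) or positive")

    options: dict[str, Any] = {
        "collection_id": collection_id,
        "urls": list(urls),
        "chunk_by_page": chunk_by_page,
        "timeout": timeout,
        "metadata": metadata or {},
    }
    # keep_tables_as_one_chunk is ignored when chunk_by_page is true,
    # so include it only when it can take effect.
    if not chunk_by_page:
        options["keep_tables_as_one_chunk"] = keep_tables_as_one_chunk
    return options


def effective_timeout(timeout: float) -> Optional[float]:
    """Map 'zero means no timeout' to None, the convention many HTTP clients use."""
    return None if timeout == 0 else timeout
```

Dropping the ignored flag from the dict, rather than sending it anyway, makes it visible in logs which chunking setting is actually in effect.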

Output

The output is the JSON response returned by the ingestion API call. Its exact structure depends on the backend service, but it typically includes status information and identifiers for the ingestion job or the ingested documents.

This operation is not expected to produce binary data; its output is the JSON response confirming ingestion.
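
Because the response structure is not fixed, downstream steps are safer if they read it defensively. The sketch below shows one way to do that in Python. The function name and the key names it looks for ("status", "id", and keys ending in "_id") are hypothetical and should be replaced once a real response has been inspected.

```python
import json
from typing import Any


def summarize_response(raw: str) -> dict[str, Any]:
    """Parse the JSON output and pick out status and identifier fields if present."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        # The backend may return a list or scalar; keep it without guessing.
        return {"status": None, "ids": {}, "raw": data}
    # Hypothetical key names, not confirmed by the node's documentation.
    status = data.get("status")
    ids = {k: v for k, v in data.items() if k == "id" or k.endswith("_id")}
    return {"status": status, "ids": ids, "raw": data}
```

Keeping the untouched payload under "raw" means no information is lost if the guessed keys turn out to be wrong.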

Dependencies

  • Requires access to AWS S3 storage, including appropriate permissions to read files from the specified buckets.
  • If private buckets are used, valid AWS credentials must be provided via the "Credentials" property.
  • The node communicates with an external API endpoint (configured via credentials) that handles the ingestion process.
  • Optional dependencies include AI models for OCR, handwriting detection, and language models for generating summaries or questions, which are managed by the backend service.

Troubleshooting

  • Access Denied Errors: Ensure that the provided AWS credentials have sufficient permissions to access the specified S3 buckets and objects.
  • Timeouts: Large files or slow network connections may cause timeouts. Increase the "Timeout" property as needed.
  • Invalid URLs: Verify that the "URLs" property correctly specifies existing S3 paths or directories.
  • Incorrect Region: Make sure the "Region" matches the AWS region where the S3 buckets reside.
  • OCR or Handwriting Detection Issues: If these features are enabled but not working as expected, verify that the backend supports the selected OCR model and that the input files are compatible.
  • Empty or Missing Metadata: If metadata is required for downstream processes, ensure it is correctly formatted as JSON.
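
Several of the problems above (invalid URLs, malformed metadata, a bad timeout value) can be caught before the node runs. The following Python sketch is one possible pre-flight check, using only the standard library. It assumes paths are written in the s3://bucket/key form, which this page does not confirm, and all function names are hypothetical.

```python
import json
from urllib.parse import urlparse


def check_s3_url(url: str) -> list[str]:
    """Return a list of problems found in a single S3 path (empty if none)."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        problems.append(f"{url!r}: expected an s3:// scheme")
    if not parsed.netloc:
        problems.append(f"{url!r}: missing bucket name")
    return problems


def check_metadata(text: str) -> list[str]:
    """Check that the metadata text parses as a JSON object."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"metadata is not valid JSON: {exc.msg}"]
    if not isinstance(value, dict):
        return ["metadata must be a JSON object"]
    return []


def preflight(urls, metadata_text="{}", timeout=0):
    """Collect every problem found across URLs, metadata, and timeout."""
    problems = []
    for url in urls:
        problems.extend(check_s3_url(url))
    problems.extend(check_metadata(metadata_text))
    if timeout < 0:
        problems.append("timeout must be zero or positive")
    return problems
```

This check only validates syntax. It cannot tell whether a bucket exists, whether the credentials grant access, or whether the Region is correct; those still have to be verified against AWS.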
