h2oGPTe

h2oGPTe is an AI-powered search assistant that helps internal teams answer questions from large volumes of documents, websites, and workplace content.

Overview

This node operation adds files stored in AWS S3 to a specified document collection. It ingests documents from S3 buckets so that they can be searched, summarized, or used for question answering within the collection. It is particularly useful when you have large volumes of documents in S3 and want to bring them into a document management or AI-powered knowledge system.

Typical use cases include:

Automatically importing new documents uploaded to an S3 bucket into a searchable collection.
Enriching collections with audio, image, or text files stored remotely on AWS.
Leveraging additional options like OCR, handwriting detection, or generating summaries/questions using language models during ingestion.

Properties

Name	Meaning
Collection ID	String ID of the collection to add the ingested documents into.
URLs	The path or list of paths of S3 files or directories to ingest.
Audio Input Language	Language of audio files; default is "auto" for automatic detection.
Chunk By Page	Boolean flag indicating whether each page should be treated as a separate chunk. If true, `keep_tables_as_one_chunk` is ignored.
Credentials	JSON object containing S3 credentials. If omitted, only public buckets are accessible.
Gen Doc Questions	Boolean flag to auto-generate sample questions for each document using a large language model (LLM).
Gen Doc Summaries	Boolean flag to auto-generate document summaries using an LLM.
Handwriting Check	Boolean flag to check pages for handwriting and use specialized models if handwriting is detected.
Ingest Mode	Mode of ingestion: "standard" (files ingested for retrieval-augmented generation) or "agent_only" (bypasses standard ingestion).
Keep Tables As One Chunk	Boolean flag indicating whether tables identified by the parser should be kept as a single chunk.
Metadata	JSON object containing metadata to associate with the ingested documents.
Ocr Model	Method to extract text from images using AI-enabled optical character recognition (OCR) models. Default is "auto".
Region	AWS region name used for interaction with AWS services.
Tesseract Lang	Language code to use when OCR model is set to "tesseract".
Timeout	Timeout in seconds for the ingestion request. Zero means no timeout.
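
The rules in the table can be sketched as a small helper that assembles the ingestion parameters. This is a minimal sketch, not the node's actual implementation: the function name `build_ingest_params` is hypothetical, and the snake_case keys are assumed by analogy with `keep_tables_as_one_chunk`, the only key name the table confirms. It covers only a subset of the properties.

```python
from typing import Any, Dict, List, Optional, Union

# The two ingestion modes listed in the property table.
VALID_INGEST_MODES = {"standard", "agent_only"}


def build_ingest_params(
    collection_id: str,
    urls: Union[str, List[str]],
    *,
    chunk_by_page: bool = False,
    keep_tables_as_one_chunk: bool = False,
    ingest_mode: str = "standard",
    timeout: int = 0,
    metadata: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble a parameter dict mirroring the table (key names are assumed)."""
    if not collection_id:
        raise ValueError("collection_id must be a non-empty string")
    if ingest_mode not in VALID_INGEST_MODES:
        raise ValueError(f"ingest_mode must be one of {sorted(VALID_INGEST_MODES)}")
    if timeout < 0:
        raise ValueError("timeout must be 0 (no timeout) or a positive number of seconds")

    params: Dict[str, Any] = {
        "collection_id": collection_id,
        # The URLs property accepts a single path or a list of paths.
        "urls": [urls] if isinstance(urls, str) else list(urls),
        "chunk_by_page": chunk_by_page,
        "ingest_mode": ingest_mode,
        "timeout": timeout,
    }
    # keep_tables_as_one_chunk is ignored when chunk_by_page is true, so omit it.
    if not chunk_by_page:
        params["keep_tables_as_one_chunk"] = keep_tables_as_one_chunk
    if metadata is not None:
        params["metadata"] = metadata
    return params
```

For example, `build_ingest_params("my-collection", "s3://my-bucket/reports/", chunk_by_page=True)` returns a dict with a one-element `urls` list and no `keep_tables_as_one_chunk` key. The `s3://` path form used here is also an assumption.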

Output

The output is the JSON response from the ingestion API call. Its exact structure depends on the backend service, but it generally carries status information and identifiers for the ingestion job or the ingested documents.

This operation is not expected to emit binary data; its output is the JSON response confirming ingestion.

Dependencies

Requires access to AWS S3 storage, including appropriate permissions to read files from the specified buckets.
If private buckets are used, valid AWS credentials must be provided via the "Credentials" property.
The node communicates with an external API endpoint (configured via credentials) that handles the ingestion process.
Optional dependencies include AI models for OCR, handwriting detection, and language models for generating summaries or questions, which are managed by the backend service.

Troubleshooting

Access Denied Errors: Ensure that the provided AWS credentials have sufficient permissions to access the specified S3 buckets and objects.
Timeouts: Large files or slow network connections may cause timeouts. Increase the "Timeout" property as needed.
Invalid URLs: Verify that the "URLs" property correctly specifies existing S3 paths or directories.
Incorrect Region: Make sure the "Region" matches the AWS region where the S3 buckets reside.
OCR or Handwriting Detection Issues: If these features are enabled but not working as expected, verify that the backend supports the selected OCR model and that the input files are compatible.
Empty or Missing Metadata: If downstream processes require metadata, ensure the "Metadata" property is set and is correctly formatted as a JSON object.
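
Two of the checks above (invalid URLs and malformed metadata) can be caught before the request is sent. The following is a minimal sketch under stated assumptions: the helper name `preflight_issues` is hypothetical, and it assumes S3 paths are written as `s3://bucket/key` URIs, which the property description does not confirm. It checks form only; it cannot tell whether a path exists or whether the credentials grant access.

```python
import json
from typing import List
from urllib.parse import urlparse


def preflight_issues(urls: List[str], metadata_json: str = "") -> List[str]:
    """Return human-readable problems found before an ingestion request is sent."""
    issues: List[str] = []
    if not urls:
        issues.append("URLs: no paths given")
    for url in urls:
        parsed = urlparse(url)
        # Assumption: paths are written as s3://bucket/key URIs.
        if parsed.scheme != "s3" or not parsed.netloc:
            issues.append(f"URLs: {url!r} is not of the form s3://bucket/key")
    # An empty metadata string is treated as "no metadata" and is not an error.
    if metadata_json.strip():
        try:
            parsed_meta = json.loads(metadata_json)
        except json.JSONDecodeError as exc:
            issues.append(f"Metadata: invalid JSON ({exc.msg})")
        else:
            if not isinstance(parsed_meta, dict):
                issues.append("Metadata: expected a JSON object")
    return issues
```

An empty list means no formatting problems were found; access, region, and timeout problems still have to be diagnosed from the ingestion response itself.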
