h2oGPTe

h2oGPTe is an AI-powered search assistant for your internal teams to answer questions gleaned from large volumes of documents, websites and workplace content.

Overview

This operation creates a job that ingests one or more previously uploaded documents into a specified collection. Use it when documents have already been uploaded and must be processed before they can be used for search, analysis, or AI-powered querying. It is well suited to high-volume ingestion workflows that involve text extraction, chunking, and metadata assignment, with optional features such as handwriting detection and auto-generation of summaries and questions.

Practical examples include:

- Automatically ingesting scanned PDFs into a knowledge base collection.
- Adding audio transcripts with language detection into a searchable document collection.
- Processing uploaded documents with specific OCR models and metadata for enterprise content management.

Properties

Name	Meaning
Upload IDs	ID(s) of the uploaded document(s) to be ingested.
Collection ID	The string ID of the target collection where the ingested documents will be added.
Additional Options	A set of optional parameters to customize the ingestion process:
- Audio Input Language	Language code for audio files; default is "auto" for automatic detection.
- Chunk By Page	Boolean flag indicating whether each page should be treated as a separate chunk. If true, `keep_tables_as_one_chunk` is ignored.
- Gen Doc Questions	Boolean flag to enable auto-generation of sample questions per document using a large language model (LLM).
- Gen Doc Summaries	Boolean flag to enable auto-generation of document summaries using LLM.
- Handwriting Check	Boolean flag to check pages for handwriting and use specialized models if found.
- Ingest Mode	Mode of ingestion: "standard" (default) for regular ingestion suitable for retrieval-augmented generation (RAG), or "agent_only" which bypasses standard ingestion.
- Keep Tables As One Chunk	Boolean flag indicating whether tables identified by the parser should be kept in a single chunk.
- Metadata	JSON object containing metadata to associate with the document during ingestion.
- Ocr Model	Specifies the OCR method to extract text from images; default is "auto".
- Permissions	String listing usernames who will have permissions to access the document.
- Restricted	Boolean flag indicating if the document should be restricted to certain users only.
- Tesseract Lang	Language code used specifically when the OCR model is set to "tesseract".
- Timeout	Number specifying the timeout duration in seconds for the ingestion job.
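
The way these options interact can be sketched in a few lines of Python. This is an illustration only, not the node's actual implementation: `keep_tables_as_one_chunk` is the only parameter name given in the table above, the other snake_case keys are assumed by analogy with the display names, and `build_ingest_options` is a hypothetical helper.

```python
from typing import Any


def build_ingest_options(options: dict[str, Any]) -> dict[str, Any]:
    """Return a cleaned copy of the options, ready to send with the job request."""
    # Drop options the user left unset so the service defaults apply.
    cleaned = {key: value for key, value in options.items() if value is not None}

    # Per the table: when chunking by page, keep_tables_as_one_chunk is ignored,
    # so remove it rather than send a setting that has no effect.
    if cleaned.get("chunk_by_page"):
        cleaned.pop("keep_tables_as_one_chunk", None)

    # Per the table: the Tesseract language only applies when the OCR model
    # is "tesseract" (key names here are assumed, not confirmed).
    if cleaned.get("ocr_model", "auto") != "tesseract":
        cleaned.pop("tesseract_lang", None)

    return cleaned


example = build_ingest_options({
    "chunk_by_page": True,
    "keep_tables_as_one_chunk": True,
    "ocr_model": "auto",
    "tesseract_lang": "eng",
    "timeout": None,
})
print(example)
```

Removing settings that have no effect keeps the request small and makes it easier to see which options actually influenced a given ingestion job.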

Output

The output contains the full response from the API call that creates the ingestion job. This typically includes details about the created job such as its unique identifier, status, and any relevant metadata confirming the ingestion request was accepted. The output is structured as JSON data.

If the ingestion involves binary data (e.g., files), this node handles the ingestion job creation but does not directly output binary data itself.

Dependencies

- Requires an API key credential for authentication with the external service.
- The node sends HTTP POST requests to the endpoint `/uploads/{upload_ids}/ingest/job`, with query parameters and body data derived from the input properties.
- The API URL and credentials must be configured correctly in n8n.
- Optional features such as handwriting detection and auto-generation of summaries or questions depend on the availability of OCR models and LLM services.
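
To make the request shape concrete, the following Python sketch builds (but does not send) a POST request for the endpoint named above. Several details are assumptions rather than documented behavior: joining multiple upload IDs with commas, passing the collection ID as a `collection_id` query parameter, sending the remaining options as a JSON body, and using a Bearer token header. The base URL, API key, and the helper name `build_ingest_job_request` are placeholders.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode


def build_ingest_job_request(base_url, api_key, upload_ids, collection_id, options=None):
    """Build (but do not send) the POST request that creates an ingestion job."""
    # Assumption: multiple upload IDs are joined with commas in the path segment.
    ids_segment = quote(",".join(upload_ids), safe=",")
    # Assumption: the collection ID travels as a query parameter.
    query = urlencode({"collection_id": collection_id})
    url = f"{base_url.rstrip('/')}/uploads/{ids_segment}/ingest/job?{query}"

    # Assumption: remaining options (for example metadata) go in a JSON body.
    body = json.dumps(options or {}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


request = build_ingest_job_request(
    "https://h2ogpte.example.com/api/v1",  # placeholder URL
    "sk-placeholder",                      # placeholder API key
    ["upload-1", "upload-2"],
    "collection-123",
    {"metadata": {"department": "legal"}},
)
print(request.get_method(), request.full_url)
```

Inside n8n the node performs this step itself; a sketch like this is mainly useful for reproducing a failing request outside the workflow when debugging credentials or IDs.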

Troubleshooting

Common Issues:
- Invalid or missing upload IDs can cause the job creation to fail.
- Incorrect collection ID or insufficient permissions may result in authorization errors.
- A timeout that is set too low may terminate the ingestion job prematurely.
- A misconfigured OCR model or an unsupported language can lead to incomplete text extraction.
Error Messages:
- "Upload ID not found" indicates the provided upload ID does not exist or is inaccessible.
- "Collection not found" means the target collection ID is invalid.
- "Permission denied" suggests the API key or user lacks rights to add documents to the collection.
- "Timeout exceeded" implies the ingestion took longer than allowed; increase the timeout value.
Resolutions:
- Verify all required IDs are correct and accessible.
- Ensure the API key has appropriate permissions.
- Adjust timeout and ingestion options according to document size and complexity.
- Confirm OCR and language settings match the document content.
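
The error messages and resolutions above can be combined into a small lookup, for example in a downstream step that decides how to react to a failed job. This is a minimal Python sketch: the message fragments are taken from the list above, but the exact wording returned by the service may differ, and `suggest_resolution` is a hypothetical helper.

```python
# Map lowercase fragments of the documented error messages to suggested fixes.
RESOLUTIONS = {
    "upload id not found": "Verify that the upload ID exists and is accessible.",
    "collection not found": "Verify the target collection ID.",
    "permission denied": "Ensure the API key has rights to add documents to the collection.",
    "timeout exceeded": "Increase the timeout value for the ingestion job.",
}

FALLBACK = "Check IDs, permissions, timeout, and OCR/language settings."


def suggest_resolution(error_message: str) -> str:
    """Return the first matching suggestion, or a general checklist."""
    lowered = error_message.lower()
    for fragment, advice in RESOLUTIONS.items():
        if fragment in lowered:
            return advice
    return FALLBACK


print(suggest_resolution("Error: Timeout exceeded while ingesting document"))
```

Matching on lowercase fragments rather than exact strings keeps the lookup tolerant of prefixes or extra detail in the message.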
