h2oGPTe

h2oGPTe is an AI-powered search assistant that helps internal teams answer questions from large volumes of documents, websites, and workplace content.

Overview

This node operation adds files from the local file system to a specified document collection. It ingests locally stored documents by taking a root directory and a glob pattern that selects which files to include. Options customize the ingestion process, including the language setting for audio files, chunking behavior, OCR model selection, handwriting detection, and a request timeout.

This functionality is beneficial in scenarios where organizations want to bulk import local documents into a centralized collection for further processing, searching, or analysis. For example, a user might have a folder of PDFs, images, or audio files on their computer that they want to ingest into a knowledge base or document management system.

Practical examples:

- Importing scanned contracts or reports stored locally into a searchable document collection.
- Adding audio recordings from a local directory with automatic language detection for transcription.
- Ingesting a set of research papers matched by a glob pattern for semantic search and question answering.

Properties

Name	Meaning
Collection ID	String ID of the collection to add the ingested documents into. This identifies the target collection where files will be added.
Root Dir	String path of the root directory on the local file system where the node will look for files to ingest.
Glob	String glob pattern used to match files within the root directory. Only files matching this pattern will be ingested.
Additional Options	A collection of optional parameters to customize ingestion:
- Audio Input Language	Language code for audio files; default is "auto" for automatic detection.
- Chunk By Page	Boolean indicating whether each page should be treated as a separate chunk. If true, `keep_tables_as_one_chunk` is ignored.
- Gen Doc Questions	Boolean to enable auto-generation of sample questions for each document using a large language model (LLM).
- Gen Doc Summaries	Boolean to enable auto-generation of document summaries using an LLM.
- Handwriting Check	Boolean to enable checking pages for handwriting, which triggers specialized models if handwriting is detected.
- Ingest Mode	Option to select the ingest mode: "standard" (default) for regular ingestion suitable for retrieval-augmented generation (RAG), or "agent_only" to bypass standard ingestion.
- Keep Tables As One Chunk	Boolean indicating whether tables identified by the table parser should be kept as a single chunk.
- Ocr Model	String specifying the OCR method to extract text from images. Default is "auto". Supported methods include docTR, tesseract, etc.
- Tesseract Lang	Language code to use when OCR model is set to "tesseract".
- Timeout	Number specifying the timeout in seconds for the ingestion request. Default is 0 (no timeout).
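The interaction between these options can be sketched in a few lines of Python. This is an illustration only, not the node's implementation: the snake_case keys are assumptions inferred from the option display names (only `keep_tables_as_one_chunk` appears in this document), and `build_ingest_options` and `effective_timeout` are hypothetical helpers.

```python
from typing import Any, Optional

# Hypothetical snake_case keys inferred from the option display names above.
# Only the defaults stated in the properties table are listed here.
DEFAULTS: dict[str, Any] = {
    "audio_input_language": "auto",
    "ocr_model": "auto",
    "ingest_mode": "standard",
    "timeout": 0,
}

VALID_INGEST_MODES = ("standard", "agent_only")


def build_ingest_options(**overrides: Any) -> dict[str, Any]:
    """Merge overrides onto the documented defaults and apply the documented rules."""
    options = {**DEFAULTS, **overrides}

    if options["ingest_mode"] not in VALID_INGEST_MODES:
        raise ValueError(f"ingest_mode must be one of {VALID_INGEST_MODES}")

    # Chunk By Page takes precedence: keep_tables_as_one_chunk is ignored.
    if options.get("chunk_by_page"):
        options.pop("keep_tables_as_one_chunk", None)

    return options


def effective_timeout(options: dict[str, Any]) -> Optional[float]:
    """Translate the documented 'timeout = 0 means no timeout' rule into None."""
    timeout = options.get("timeout", 0)
    return None if timeout == 0 else float(timeout)
```

For example, `build_ingest_options(chunk_by_page=True, keep_tables_as_one_chunk=True)` returns a dictionary without the `keep_tables_as_one_chunk` key, mirroring the rule that page-level chunking overrides table chunking.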

Output

The node outputs JSON data representing the response from the ingestion API endpoint. This typically includes metadata about the ingestion job or confirmation of successful ingestion. The exact structure depends on the backend API but generally contains status information and identifiers related to the ingested documents or job.

The node does not output binary data.

Dependencies

- Requires access to the backend API service that manages document collections and ingestion.
- Requires an API authentication token or API key credential configured in n8n to authorize requests.
- The local file system must be accessible to the environment running the node to read files from the specified root directory.
- No additional external services are required unless specific ingestion options (like OCR) depend on them via the backend.

Troubleshooting

Common Issues:
- Incorrect Collection ID: Ensure the collection ID exists and is accessible with the provided credentials.
- Invalid Root Dir or Glob pattern: Verify the path exists and the glob pattern correctly matches intended files.
- Timeout errors: Increase the timeout value if ingestion takes longer than expected, or leave it at 0 (the default) to disable the timeout.
- Permission errors: Confirm the API key has sufficient permissions to ingest documents into the collection.
- Unsupported OCR model or language: Use supported values for OCR model and language options.

Error Messages:
- "Collection not found": The specified collection ID does not exist or is inaccessible.
- "No files matched the glob pattern": The glob pattern did not match any files in the root directory.
- "Timeout exceeded": The ingestion process took longer than the allowed timeout.
- "Unauthorized" or "Forbidden": Authentication failed or insufficient permissions.

Resolving these usually involves verifying input parameters, checking API credentials, and adjusting timeout or option settings.
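Root Dir and Glob problems can often be caught before the node runs by previewing the match locally. The sketch below uses Python's standard `pathlib`; `preview_matches` is a hypothetical helper, and Python's glob semantics are assumed to be close to, but not necessarily identical to, those applied during ingestion.

```python
from pathlib import Path


def preview_matches(root_dir: str, pattern: str) -> list[str]:
    """List files under root_dir that match a relative glob pattern.

    Raises early on the two input problems described above: a root
    directory that does not exist, and a pattern that matches nothing.
    """
    root = Path(root_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"Root Dir is not a directory: {root_dir}")

    # Keep regular files only; report paths relative to the root directory.
    matches = sorted(
        p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file()
    )
    if not matches:
        raise FileNotFoundError(f"No files matched {pattern!r} under {root_dir}")
    return matches
```

Run this in the same environment that runs the node, since the root directory must be readable from there. A pattern such as `**/*.pdf` searches subdirectories recursively, while `*.pdf` only matches files directly inside the root directory.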

Links and References

- Glob Pattern Syntax
- Optical Character Recognition (OCR)
- Retrieval-Augmented Generation (RAG)
- Documentation for the backend API managing document ingestion (refer to your platform's API docs).