
h2oGPTe

h2oGPTe is an AI-powered search assistant that helps your internal teams answer questions drawn from large volumes of documents, websites, and workplace content.


Overview

This node operation creates a job that crawls a specified URL and ingests its content into a document collection, automating the import of web pages and linked documents for further processing, indexing, or analysis.

Common scenarios where this node is beneficial include:

  • Automatically gathering and organizing web content related to a specific topic or project.
  • Building searchable document collections from online resources.
  • Ingesting website data for AI-powered search assistants or knowledge bases.

For example, you can use this node to create a job that crawls a company’s internal wiki URL and ingests all relevant pages into a centralized collection for team-wide access and querying.
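
As an illustration, the parameters for that wiki-crawl scenario might be configured roughly as in the sketch below; the names follow the property list in the next section and are illustrative rather than the node's exact internal identifiers:

```typescript
// Sketch only: illustrative parameter values for the
// "Creates a Job to Crawl and Ingest a URL Into a Collection" operation.
// Property names are assumptions based on the fields documented below.
const nodeParameters = {
  collectionId: 'my-team-collection',      // required: target collection ID
  url: 'https://wiki.example.internal/',   // required: start URL to crawl
  additionalOptions: {
    followLinks: true,      // recursively import pages linked from the start URL
    maxDepth: 2,            // follow links at most two levels deep
    maxDocuments: 200,      // stop after 200 documents
    genDocSummaries: true,  // let an LLM summarize each ingested document
  },
};
```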

Properties

  • Collection ID (required): String ID of the collection to add the ingested documents to.
  • URL (required): The URL to crawl and ingest content from.
  • Additional Options: Optional parameters that customize the ingestion job:
    - Audio Input Language: Language code for audio files; the default "auto" detects the language automatically.
    - Chunk By Page: Boolean; if true, each page is treated as a separate chunk and keep_tables_as_one_chunk is ignored.
    - Follow Links: Boolean; if true, recursively imports all web pages linked from the initial URL. External links are ignored.
    - Gen Doc Questions: Boolean; auto-generate sample questions for each document using a large language model (LLM).
    - Gen Doc Summaries: Boolean; auto-generate document summaries using an LLM.
    - Handwriting Check: Boolean; check pages for handwriting and use specialized models if any is found.
    - Ingest Mode: "standard" (default) for regular ingestion, or "agent_only" to bypass standard ingestion.
    - Keep Tables As One Chunk: Boolean; keep tables identified by the parser in a single chunk.
    - Max Depth: Maximum recursion depth when following links (only applies if follow_links is true); -1 means unlimited, 0 means no recursion. See the sketch after this list.
    - Max Documents: Maximum number of documents to ingest when following links (only applies if follow_links is true); 0 means the system default limit.
    - OCR Model: Which OCR model to use for extracting text from images; default is "auto".
    - Tesseract Lang: Language code used when the OCR model is set to "tesseract".
    - Timeout: Timeout in seconds for the ingestion job request; default is 0 (no timeout).
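
For intuition, the crawl-scope options combine roughly as follows; the values are illustrative assumptions, not recommendations:

```typescript
// Illustrative crawl scopes when follow_links is enabled (for intuition only):
// max_depth: 0   -> only the start URL, no linked pages
// max_depth: 1   -> the start URL plus pages it links to directly
// max_depth: -1  -> no depth limit; rely on max_documents to bound the crawl
const shallowCrawl = { follow_links: true, max_depth: 1, max_documents: 50 };
const fullCrawl    = { follow_links: true, max_depth: -1, max_documents: 0 }; // 0 = system default cap
```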

Output

The node outputs the response from the API call that creates the ingestion job. This typically includes details about the created job such as its unique identifier, status, and any metadata returned by the service.

The output JSON structure will contain fields representing the job information, enabling tracking or further actions on the job.
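
For orientation only, the response could be shaped roughly like the following sketch; the field names are assumptions rather than the service's documented schema:

```typescript
// Hypothetical shape of the job-creation response, for orientation only.
interface IngestJobResponse {
  id: string;                          // unique identifier of the created job
  status: string;                      // e.g. a queued/running/completed indicator
  metadata?: Record<string, unknown>;  // any extra details returned by the service
}
```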

No binary data output is expected from this operation.

Dependencies

  • Requires an API key credential configured in n8n to authenticate with the external service.
  • The node sends HTTP POST requests to the /ingest/website/job endpoint of the configured API base URL (see the sketch after this list).
  • Proper network access to the target URL and the API service is necessary.
  • Optional dependencies include availability of OCR models and LLM services if related options are enabled.
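
A minimal sketch of the underlying request, assuming bearer-token authentication and a JSON body (the exact auth scheme and payload field names are assumptions; consult the service's API documentation for the authoritative schema):

```typescript
// Minimal sketch of the HTTP call this node issues, under the assumption of
// bearer-token auth and a JSON payload. Field names are illustrative.
async function createIngestWebsiteJob(baseUrl: string, apiKey: string) {
  const response = await fetch(`${baseUrl}/ingest/website/job`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      collection_id: 'my-team-collection',    // assumed field name
      url: 'https://wiki.example.internal/',  // assumed field name
      follow_links: true,
      max_depth: 2,
    }),
  });
  if (!response.ok) {
    throw new Error(`Ingestion job request failed: ${response.status}`);
  }
  return response.json(); // job details: id, status, metadata (see Output above)
}
```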

Troubleshooting

  • Invalid Collection ID or URL: Ensure the collection ID exists and the URL is valid and accessible.
  • Timeouts: If the ingestion job times out, consider increasing the timeout property or checking network connectivity.
  • Permission Errors: Verify that the API key has sufficient permissions to create ingestion jobs and access the specified collection.
  • Follow Links Limitations: When enabling follow_links, be mindful of max_depth and max_documents to avoid excessive crawling.
  • OCR or LLM Failures: If OCR or document question/summary generation fails, verify that the respective models are available and properly configured in the backend.

Links and References


This summary is based on static analysis of the provided source code and property definitions for the "Document Ingestion" resource and the "Creates a Job to Crawl and Ingest a URL Into a Collection" operation.
