
h2oGPTe

h2oGPTe is an AI-powered search assistant that helps your internal teams answer questions drawn from large volumes of documents, websites, and workplace content.


Overview

This node operation creates a job that crawls a specified URL and ingests its content into a document collection, automating the import of web pages and linked documents for further processing, indexing, or analysis.

Common scenarios where this node is beneficial include:

  • Automatically gathering and organizing web content related to a specific topic or project.
  • Building searchable document collections from online resources.
  • Ingesting website data for AI-powered search assistants or knowledge bases.

For example, you can use this node to create a job that crawls a company’s internal wiki URL and ingests all relevant pages into a centralized collection for team-wide access and querying.
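
As an illustration, the parameters for that wiki-crawl scenario might be configured roughly as in the sketch below; the names follow the property list in the next section and are illustrative rather than the node's exact internal identifiers:

```typescript
// Sketch only: illustrative parameter values for the
// "Creates a Job to Crawl and Ingest a URL Into a Collection" operation.
// Property names are assumptions based on the fields documented below.
const nodeParameters = {
  collectionId: 'my-team-collection',      // required: target collection ID
  url: 'https://wiki.example.internal/',   // required: start URL to crawl
  additionalOptions: {
    followLinks: true,      // recursively import pages linked from the start URL
    maxDepth: 2,            // follow links at most two levels deep
    maxDocuments: 200,      // stop after 200 documents
    genDocSummaries: true,  // let an LLM summarize each ingested document
  },
};
```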

Properties

  • Collection ID (required): String ID of the collection to add the ingested documents to.
  • URL (required): The URL to crawl and ingest content from.
  • Additional Options: Optional parameters that customize the ingestion job:
    - Audio Input Language: Language code for audio files; the default "auto" detects the language automatically.
    - Chunk By Page: Boolean; if true, each page is treated as a separate chunk and keep_tables_as_one_chunk is ignored.
    - Follow Links: Boolean; if true, recursively imports all web pages linked from the initial URL. External links are ignored.
    - Gen Doc Questions: Boolean; auto-generate sample questions for each document using a large language model (LLM).
    - Gen Doc Summaries: Boolean; auto-generate document summaries using an LLM.
    - Handwriting Check: Boolean; check pages for handwriting and use specialized models if any is found.
    - Ingest Mode: "standard" (default) for regular ingestion, or "agent_only" to bypass standard ingestion.
    - Keep Tables As One Chunk: Boolean; keep tables identified by the parser in a single chunk.
    - Max Depth: Maximum recursion depth when following links (only applies if follow_links is true); -1 means unlimited, 0 means no recursion. See the sketch after this list.
    - Max Documents: Maximum number of documents to ingest when following links (only applies if follow_links is true); 0 means the system default limit.
    - OCR Model: Which OCR model to use for extracting text from images; default is "auto".
    - Tesseract Lang: Language code used when the OCR model is set to "tesseract".
    - Timeout: Timeout in seconds for the ingestion job request; default is 0 (no timeout).
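
For intuition, the crawl-scope options combine roughly as follows; the values are illustrative assumptions, not recommendations:

```typescript
// Illustrative crawl scopes when follow_links is enabled (for intuition only):
// max_depth: 0   -> only the start URL, no linked pages
// max_depth: 1   -> the start URL plus pages it links to directly
// max_depth: -1  -> no depth limit; rely on max_documents to bound the crawl
const shallowCrawl = { follow_links: true, max_depth: 1, max_documents: 50 };
const fullCrawl    = { follow_links: true, max_depth: -1, max_documents: 0 }; // 0 = system default cap
```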

Output

The node outputs the response from the API call that creates the ingestion job. This typically includes details about the created job such as its unique identifier, status, and any metadata returned by the service.

The output JSON structure will contain fields representing the job information, enabling tracking or further actions on the job.
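
For orientation only, the response could be shaped roughly like the following sketch; the field names are assumptions rather than the service's documented schema:

```typescript
// Hypothetical shape of the job-creation response, for orientation only.
interface IngestJobResponse {
  id: string;                          // unique identifier of the created job
  status: string;                      // e.g. a queued/running/completed indicator
  metadata?: Record<string, unknown>;  // any extra details returned by the service
}
```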

No binary data output is expected from this operation.

Dependencies

  • Requires an API key credential configured in n8n to authenticate with the external service.
  • The node sends HTTP POST requests to the /ingest/website/job endpoint of the configured API base URL (see the sketch after this list).
  • Proper network access to the target URL and the API service is necessary.
  • Optional dependencies include availability of OCR models and LLM services if related options are enabled.
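
A minimal sketch of the underlying request, assuming bearer-token authentication and a JSON body (the exact auth scheme and payload field names are assumptions; consult the service's API documentation for the authoritative schema):

```typescript
// Minimal sketch of the HTTP call this node issues, under the assumption of
// bearer-token auth and a JSON payload. Field names are illustrative.
async function createIngestWebsiteJob(baseUrl: string, apiKey: string) {
  const response = await fetch(`${baseUrl}/ingest/website/job`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      collection_id: 'my-team-collection',    // assumed field name
      url: 'https://wiki.example.internal/',  // assumed field name
      follow_links: true,
      max_depth: 2,
    }),
  });
  if (!response.ok) {
    throw new Error(`Ingestion job request failed: ${response.status}`);
  }
  return response.json(); // job details: id, status, metadata (see Output above)
}
```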

Troubleshooting

  • Invalid Collection ID or URL: Ensure the collection ID exists and the URL is valid and accessible.
  • Timeouts: If the ingestion job times out, consider increasing the timeout property or checking network connectivity.
  • Permission Errors: Verify that the API key has sufficient permissions to create ingestion jobs and access the specified collection.
  • Follow Links Limitations: When enabling follow_links, be mindful of max_depth and max_documents to avoid excessive crawling.
  • OCR or LLM Failures: If OCR or document question/summary generation fails, verify that the respective models are available and properly configured in the backend.

Links and References


This summary is based on static analysis of the provided source code and property definitions for the "Document Ingestion" resource and the "Creates a Job to Crawl and Ingest a URL Into a Collection" operation.
