h2oGPTe

h2oGPTe is an AI-powered search assistant that lets your internal teams get answers drawn from large volumes of documents, websites, and workplace content.

Overview

The "Crawls and Ingest a URL Into a Collection" operation of the Document Ingestion resource allows users to crawl a specified web URL and ingest the content into a designated document collection. This process imports the web page or documents linked from the URL into the collection, optionally following links within the same domain to ingest additional pages.

This node is useful when you want to gather and index web content automatically for internal search, knowledge bases, or AI-powered assistants. For example, a company could ingest its website's documentation pages into a searchable collection, or gather competitor information from public web pages.

Practical examples:

  • Automatically ingest product manuals from a vendor’s website into your internal knowledge base.
  • Crawl and import blog posts from a specific URL to analyze trends or generate summaries.
  • Collect and index research articles linked from a university webpage for academic purposes.

Properties

Name | Meaning
Collection ID | String ID of the collection to add the ingested documents into (required).
URL | The URL string to crawl and ingest content from (required).
Additional Options | A set of optional parameters to customize the ingestion behavior:
- Audio Input Language | Language code for audio files; default is "auto" for automatic detection.
- Chunk By Page | Boolean indicating whether each page should be treated as a separate chunk. If true, keep_tables_as_one_chunk is ignored.
- Follow Links | Boolean indicating whether to recursively import all web pages linked from the initial URL. External links are ignored.
- Gen Doc Questions | Boolean to auto-generate sample questions for each document using a large language model (LLM).
- Gen Doc Summaries | Boolean to auto-generate document summaries using an LLM.
- Handwriting Check | Boolean to check pages for handwriting and use specialized models if found.
- Ingest Mode | Mode of ingestion: "standard" (files ingested for retrieval-augmented generation) or "agent_only" (bypasses standard ingestion).
- Keep Tables As One Chunk | Boolean indicating whether tables identified by the parser should be kept as a single chunk.
- Max Depth | Maximum recursion depth when following links (only applies if follow_links is true). A value of 0 means no link following; -1 means unlimited.
- Max Documents | Maximum number of documents to ingest when following links (only applies if follow_links is true). 0 means system default (automatic).
- Ocr Model | Method to extract text from images using AI-enabled OCR models. Default is "auto".
- Tesseract Lang | Language to use with the Tesseract OCR model.
- Timeout | Timeout in seconds for the ingestion request. Default is 0 (no timeout).
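
The way Follow Links, Max Depth, and Max Documents interact can be easier to see in code. The following is a minimal sketch of the semantics described above, not the service's actual crawler: the `plan_crawl` function, the in-memory `links` map that stands in for real page fetching, and the `default_max_documents` value are all hypothetical, and it assumes the document limit counts the initial URL.

```python
from collections import deque
from urllib.parse import urlparse


def plan_crawl(start_url, links, follow_links=False, max_depth=-1,
               max_documents=0, default_max_documents=100):
    """Return the URLs a crawl would visit, in breadth-first order.

    `links` is a hypothetical map {url: [linked urls]} replacing real fetches.
    `default_max_documents` is an invented stand-in for the "system default".
    """
    if not follow_links or max_depth == 0:
        return [start_url]  # only the initial URL is ingested

    # 0 means "use the default"; any positive value is an explicit cap.
    limit = max_documents if max_documents > 0 else default_max_documents
    start_host = urlparse(start_url).netloc
    visited = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])

    while queue and len(visited) < limit:
        url, depth = queue.popleft()
        if max_depth != -1 and depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for target in links.get(url, []):
            if urlparse(target).netloc != start_host:
                continue  # external links are ignored
            if target in seen:
                continue  # already scheduled
            seen.add(target)
            visited.append(target)
            queue.append((target, depth + 1))
            if len(visited) >= limit:
                break
    return visited
```

With a depth of 1, only pages linked directly from the initial URL are added; with -1, the crawl continues until it runs out of same-domain links or reaches the document limit.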

Output

The node outputs JSON data representing the response from the ingestion API endpoint. This typically includes metadata about the ingestion job or the ingested documents, such as IDs, status, and any generated summaries or questions if those options were enabled.

This operation primarily returns a JSON response describing the ingestion result rather than binary data.

Dependencies

  • Requires an API key credential for authentication with the external service providing the ingestion API.
  • The node sends HTTP POST requests to the /ingest/website endpoint of the configured API base URL.
  • Proper network access to the target URLs is necessary for crawling.
  • Optional dependencies include AI models for generating document summaries and questions, and OCR models for image text extraction.
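
As a rough illustration of the request described above, the sketch below builds (but does not send) a POST request to `/ingest/website` using only the Python standard library. Several details are assumptions, since this page does not state them: the `Bearer` authorization scheme, the snake_case field names other than `follow_links` and `keep_tables_as_one_chunk`, and the choice to place every parameter in a JSON body rather than in query parameters. The `build_ingest_request` helper itself is hypothetical.

```python
import json
import urllib.request


def build_ingest_request(base_url, api_key, collection_id, url, **options):
    """Build (but do not send) a POST request for the /ingest/website endpoint."""
    if not collection_id or not url:
        raise ValueError("collection_id and url are both required")

    # Field names and body placement are assumptions, not the documented API.
    payload = {"collection_id": collection_id, "url": url}
    # Drop unset options so the service falls back to its own defaults.
    payload.update({k: v for k, v in options.items() if v is not None})

    return urllib.request.Request(
        base_url.rstrip("/") + "/ingest/website",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the resulting object to `urllib.request.urlopen` would send it; inspecting it first is a convenient way to check the URL, headers, and body against what the service actually expects.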

Troubleshooting

  • Timeouts: If the ingestion takes too long, consider increasing the Timeout property or checking network connectivity.
  • Permission errors: Ensure the API key credential has sufficient permissions to perform ingestion operations.
  • Invalid URL: Verify that the URL provided is accessible and correctly formatted.
  • Link following limits: When enabling Follow Links, be cautious with Max Depth and Max Documents to avoid excessive crawling.
  • OCR issues: If text extraction from images fails, try changing the Ocr Model or specifying the correct Tesseract Lang.
  • Handwriting detection: Enabling Handwriting Check may increase processing time; disable if not needed.
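
For the "Invalid URL" case, a quick format check before running the node can rule out the simplest mistakes. The helper below is a hypothetical sketch that only tests whether a string is an absolute `http` or `https` URL with a host; restricting to those two schemes is an assumption, and the check says nothing about whether the page is actually reachable.

```python
from urllib.parse import urlparse


def is_well_formed_url(url):
    """Return True if `url` is an absolute http(s) URL with a host."""
    try:
        parts = urlparse(url.strip())
    except (AttributeError, ValueError):
        # Not a string, or malformed in a way urlparse rejects (e.g. bad IPv6).
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

A value such as `example.com/docs` fails this check because it lacks a scheme, which is a common cause of rejected URLs.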

This summary is based on static analysis of the node's properties and routing configuration for the "Crawls and Ingest a URL Into a Collection" operation under the Document Ingestion resource.