h2oGPTe

h2oGPTe is an AI-powered search assistant for your internal teams to answer questions gleaned from large volumes of documents, websites and workplace content.

Overview

The "Crawls and Ingest a URL Into a Collection" operation of the Document Ingestion resource crawls a specified web URL and ingests its content into a designated document collection. It imports the web page or the documents linked from the URL, and can optionally follow links within the same domain to ingest additional pages.

Use this node to automatically gather and index web content for internal search, knowledge bases, or AI-powered assistants. For example, a company could ingest its website's documentation pages into a searchable collection, or gather competitor information from public web pages.

Practical examples:

Automatically ingest product manuals from a vendor’s website into your internal knowledge base.
Crawl and import blog posts from a specific URL to analyze trends or generate summaries.
Collect and index research articles linked from a university webpage for academic purposes.

Properties

Name	Meaning
Collection ID	String ID of the collection to add the ingested documents into (required).
URL	The URL string to crawl and ingest content from (required).
Additional Options	A set of optional parameters to customize the ingestion behavior:
- Audio Input Language	Language code for audio files; default is "auto" for automatic detection.
- Chunk By Page	Boolean indicating whether each page should be treated as a separate chunk. If true, `keep_tables_as_one_chunk` is ignored.
- Follow Links	Boolean indicating whether to recursively import all web pages linked from the initial URL. External links are ignored.
- Gen Doc Questions	Boolean to auto-generate sample questions for each document using a large language model (LLM).
- Gen Doc Summaries	Boolean to auto-generate document summaries using an LLM.
- Handwriting Check	Boolean to check pages for handwriting and use specialized models if found.
- Ingest Mode	Mode of ingestion: "standard" (files ingested for retrieval-augmented generation) or "agent_only" (bypasses standard ingestion).
- Keep Tables As One Chunk	Boolean indicating whether tables identified by the parser should be kept as a single chunk.
- Max Depth	Maximum recursion depth when following links (only applies if `follow_links` is true). A value of 0 means no link following; -1 means unlimited.
- Max Documents	Maximum number of documents to ingest when following links (only applies if `follow_links` is true). 0 means system default (automatic).
- Ocr Model	Method to extract text from images using AI-enabled OCR models. Default is "auto".
- Tesseract Lang	Language to use with the Tesseract OCR model.
- Timeout	Timeout in seconds for the ingestion request. Default is 0 (no timeout).
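
The interactions between these options (Max Depth and Max Documents apply only when Follow Links is true, and Chunk By Page overrides Keep Tables As One Chunk) can be sketched in Python as a small helper that assembles the parameters before a request is made. This is a minimal sketch, not the node's actual implementation: the function name `build_ingest_params` is hypothetical, and apart from `follow_links` and `keep_tables_as_one_chunk`, the snake_case keys are assumed by analogy with the display names in the table.

```python
from urllib.parse import urlparse


def build_ingest_params(collection_id, url, **options):
    """Assemble a parameter dict for the crawl-and-ingest operation.

    Hypothetical helper: key names other than follow_links and
    keep_tables_as_one_chunk are assumptions, not confirmed API fields.
    """
    # Collection ID and URL are the two required properties.
    if not collection_id:
        raise ValueError("collection_id is required")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")

    params = {"collection_id": collection_id, "url": url}
    params.update(options)

    # Max Depth and Max Documents only apply when Follow Links is true,
    # so drop them otherwise to keep the request unambiguous.
    if not params.get("follow_links", False):
        params.pop("max_depth", None)
        params.pop("max_documents", None)

    # When Chunk By Page is true, keep_tables_as_one_chunk is ignored.
    if params.get("chunk_by_page", False):
        params.pop("keep_tables_as_one_chunk", None)

    return params
```

For instance, calling `build_ingest_params("my-collection", "https://example.com/docs", follow_links=True, max_depth=2, max_documents=50)` keeps both limits, while the same call with `follow_links=False` drops them.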

Output

The node outputs JSON data representing the response from the ingestion API endpoint. This typically includes metadata about the ingestion job or the ingested documents, such as IDs, status, and any generated summaries or questions if those options were enabled.

If the ingestion involves binary data (e.g., files), the node handles them accordingly, but this operation primarily deals with JSON responses describing the ingestion result.

Dependencies

Requires an API key credential for authentication with the external service providing the ingestion API.
The node sends HTTP POST requests to the /ingest/website endpoint of the configured API base URL.
Proper network access to the target URLs is necessary for crawling.
Optional dependencies include AI models for generating document summaries and questions, and OCR models for image text extraction.
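
As a rough illustration of the request the node issues, the Python sketch below builds (but does not send) an HTTP POST to the /ingest/website endpoint using only the standard library. Several details are assumptions rather than facts from the node's source: the function name `make_ingest_request` is hypothetical, the API key is assumed to travel as a Bearer token, and the parameters are assumed to be sent as a JSON body. The real API may expect some of them as query parameters instead.

```python
import json
import urllib.request


def make_ingest_request(base_url, api_key, params):
    """Build a POST request for the /ingest/website endpoint.

    Assumptions: Bearer-token authentication and a JSON request body.
    Check the API reference for the actual header and parameter placement.
    """
    endpoint = base_url.rstrip("/") + "/ingest/website"
    body = json.dumps(params).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Example with placeholder values; nothing is sent over the network here.
request = make_ingest_request(
    "https://h2ogpte.example.com/api/v1",  # placeholder base URL
    "YOUR_API_KEY",
    {"collection_id": "my-collection", "url": "https://example.com/docs"},
)
```

Passing the resulting object to `urllib.request.urlopen(request, timeout=...)` would send it. Note that this client-side timeout is separate from the node's Timeout property.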

Troubleshooting

Timeouts: If the ingestion request times out, increase the Timeout property (the default of 0 means no timeout) or check network connectivity to the target URL.
Permission errors: Ensure the API key credential has sufficient permissions to perform ingestion operations.
Invalid URL: Verify that the URL provided is accessible and correctly formatted.
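Link following limits: When enabling Follow Links, set Max Depth and Max Documents deliberately to avoid excessive crawling. A Max Depth of -1 removes the depth limit, and a Max Documents of 0 leaves the cap to the system default.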
OCR issues: If text extraction from images fails, try changing the Ocr Model or specifying the correct Tesseract Lang.
Handwriting detection: Enabling Handwriting Check may increase processing time; disable if not needed.
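
Some of these checks can be run before a crawl is started. The Python sketch below flags option combinations that the notes above warn about. It is an illustrative pre-flight check, not part of the node: the function name `crawl_warnings` and the snake_case keys `max_depth`, `max_documents`, and `handwriting_check` are assumptions (only `follow_links` appears in the property descriptions).

```python
def crawl_warnings(params):
    """Return human-readable warnings for risky crawl settings.

    Hypothetical helper; key names other than follow_links are assumed.
    """
    warnings = []

    if params.get("follow_links", False):
        # -1 means unlimited recursion depth when following links.
        if params.get("max_depth") == -1:
            warnings.append("max_depth is -1: link following has no depth limit")
        # 0 (or unset) leaves the document cap to the system default.
        if params.get("max_documents", 0) == 0:
            warnings.append("max_documents is 0: document cap is the system default")

    # Handwriting Check may increase processing time.
    if params.get("handwriting_check", False):
        warnings.append("handwriting_check is enabled: expect longer processing")

    return warnings
```

An empty list means none of these conditions were detected. It does not guarantee that the crawl will be small.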

Links and References

This summary is based on static analysis of the node's properties and routing configuration for the "Crawls and Ingest a URL Into a Collection" operation under the Document Ingestion resource.
