HDW Web Parser (beta)

Parse and crawl websites using Horizon Data Wave API

Overview

The HDW Web Parser (beta) node parses and crawls websites through the Horizon Data Wave API. It supports three operations:

Scrape: Extract content from a single webpage in various formats such as Markdown, HTML, raw HTML, screenshots, and links.
Map: Discover URLs starting from a given URL, optionally filtering by search terms or controlling how the sitemap is used.
Crawl: Perform a multi-page crawl starting from a specified URL, useful for gathering data across many pages within a site.

The Crawl operation starts at a given URL and traverses multiple pages, bounded by a configurable timeout. It suits tasks that need automated traversal of many pages, such as comprehensive website data extraction, SEO analysis, or content auditing.

Practical Examples

Start from a homepage URL and crawl the linked pages within a domain to collect structured data or metadata.
Gather product information across an e-commerce site by crawling category and product pages automatically.
Crawl news websites to aggregate articles published over time.

Properties

Name	Meaning
Base URL	Custom API base URL to override the default Horizon Data Wave API endpoint.
URL	Starting URL for the crawl operation; the initial page from which crawling begins.
Timeout (seconds)	Maximum time, in seconds, that the crawl operation may run before it times out. Default: 300.
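
As a rough illustration of how these three properties might combine into a single API call, the sketch below builds a request description in Python. The `/crawl` path suffix and the `url` and `timeout` body field names are assumptions made for illustration and are not confirmed by this page; `build_crawl_request` is a hypothetical helper, not part of the node.

```python
# Default base URL as stated in the Dependencies section of this page.
DEFAULT_BASE_URL = "https://api.horizondatawave.ai/api/website"


def build_crawl_request(url, timeout=300, base_url=None):
    """Return a dict describing the crawl request (hypothetical helper)."""
    # "Base URL" overrides the default endpoint when it is set.
    base = (base_url or DEFAULT_BASE_URL).rstrip("/")
    return {
        # ASSUMPTION: the crawl endpoint lives at "<base>/crawl".
        "endpoint": f"{base}/crawl",
        # ASSUMPTION: the API accepts "url" and "timeout" body fields.
        "body": {"url": url, "timeout": timeout},
    }
```

Calling `build_crawl_request("https://example.com")` keeps the default base URL and the 300-second timeout, while passing `base_url` swaps in a custom endpoint.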

Output

The output JSON contains the results returned by the Horizon Data Wave API crawl endpoint.
If the API returns an array, each element is output as a separate item.
The structure of each JSON object corresponds to the crawled data for individual pages or resources discovered during the crawl.
No binary data output is indicated for this operation.
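
The array-splitting behavior described above can be sketched as follows. This is a minimal Python illustration, not the node's actual code: `to_items` is a hypothetical name, and wrapping each result under a `json` key mirrors the usual n8n item shape.

```python
def to_items(response):
    """Turn an API response into a list of n8n-style items (hypothetical helper)."""
    # An array response yields one item per element;
    # any other response yields a single item.
    elements = response if isinstance(response, list) else [response]
    return [{"json": element} for element in elements]
```

A response containing two page objects therefore produces two output items, and a single object produces one.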

Dependencies

Requires an API key credential for authentication with the Horizon Data Wave API service.
By default the node calls the Horizon Data Wave API at the base URL https://api.horizondatawave.ai/api/website; set the "Base URL" property to override it.
Network access to the API endpoint is necessary.
The node relies on n8n's HTTP request helper with authentication support.

Troubleshooting

Timeouts: If the crawl takes longer than the specified timeout, the operation may fail or return partial results. Increase the "Timeout (seconds)" value if needed.
Authentication errors: Ensure that the API key credential is correctly configured and valid.
Invalid URL: Providing an invalid or unreachable starting URL will cause errors. Verify the URL format and accessibility.
API errors: Any error messages returned by the Horizon Data Wave API will be surfaced in the node output if "Continue On Fail" is enabled; otherwise, they will stop execution.
Empty results: If no pages are crawled, check if the starting URL is correct and accessible, and verify network connectivity.
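
Two of the points above, invalid URLs and the effect of "Continue On Fail", can be illustrated with a short Python sketch. It is written under assumptions: `validate_start_url` and `run_items` are hypothetical helpers, the `operation` callable stands in for the real API call, and the check covers URL format only, not whether the page is reachable.

```python
from urllib.parse import urlparse


def validate_start_url(url):
    """Reject URLs that lack an http(s) scheme or a host (format check only)."""
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid starting URL: {url!r}")
    return parsed.geturl()


def run_items(urls, operation, continue_on_fail=False):
    """Run `operation` per URL, mimicking a Continue On Fail switch."""
    results = []
    for url in urls:
        try:
            results.append({"json": operation(validate_start_url(url))})
        except Exception as exc:
            if not continue_on_fail:
                # Without Continue On Fail, the first error stops execution.
                raise
            # With Continue On Fail, the error is surfaced in the output.
            results.append({"json": {"error": str(exc)}})
    return results
```

With `continue_on_fail=True`, a malformed URL yields an item carrying an `error` field and the remaining URLs are still processed; with the default, the first failure raises.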

Links and References

Horizon Data Wave API Documentation (general reference, actual docs URL may vary)
n8n HTTP Request Node documentation: https://docs.n8n.io/nodes/n8n-nodes-base.httpRequest/