HDW Web Parser (beta)

Parse and crawl websites using Horizon Data Wave API

Overview

The "HDW Web Parser (beta)" node parses and crawls websites through the Horizon Data Wave API. It supports three main operations:

  • Scrape: Extract content from a single webpage in various formats such as Markdown, HTML, raw HTML, screenshots, and links.
  • Map: Discover URLs starting from a given URL, optionally filtering by search terms or sitemap usage.
  • Crawl: Perform a multi-page crawl starting from a specified URL, useful for gathering data across many pages within a site.

The Crawl operation starts a multi-page crawl from a given URL and is bounded by a configurable timeout. It suits tasks that need automated traversal of many pages, such as comprehensive website data extraction, SEO analysis, or content auditing.

Practical Examples

  • Starting from a homepage URL, the node can crawl through all linked pages within a domain to collect structured data or metadata.
  • Use it to gather product information across an e-commerce site by crawling category and product pages automatically.
  • Crawl news websites to aggregate articles published over time.

Properties

Name | Meaning
--- | ---
Base URL | Custom API base URL to override the default Horizon Data Wave API endpoint.
URL | Starting URL for the crawl operation; the initial page from which crawling begins.
Timeout (seconds) | Maximum duration allowed for the crawl operation before it times out (default: 300 seconds).
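
As a rough illustration of how the three properties fit together, the sketch below models them as a small Python container with the documented defaults and a basic sanity check. The `CrawlParams` class and its `validate` method are hypothetical names introduced for this example; they are not part of the node or of the Horizon Data Wave API, and the node's real validation rules may differ.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Default endpoint as stated in the Dependencies section of this page.
DEFAULT_BASE_URL = "https://api.horizondatawave.ai/api/website"


@dataclass
class CrawlParams:
    """Hypothetical container mirroring the node's three Crawl properties."""

    url: str                          # "URL": starting page of the crawl
    timeout_seconds: int = 300        # "Timeout (seconds)": documented default
    base_url: str = DEFAULT_BASE_URL  # "Base URL": optional override

    def validate(self) -> None:
        """Raise ValueError if a value cannot describe a usable crawl."""
        parsed = urlparse(self.url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"URL must be an absolute http(s) URL: {self.url!r}")
        if self.timeout_seconds <= 0:
            raise ValueError("Timeout (seconds) must be a positive number")
        if not self.base_url.startswith(("http://", "https://")):
            raise ValueError(f"Base URL must be an http(s) URL: {self.base_url!r}")


# Only the starting URL is given; timeout and base URL fall back to defaults.
params = CrawlParams(url="https://example.com")
params.validate()
```

Checking the starting URL before the request is sent catches the "Invalid URL" case described under Troubleshooting without spending an API call.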

Output

  • The output JSON contains the results returned by the Horizon Data Wave API crawl endpoint.
  • If the API returns an array, each element is output as a separate item.
  • The structure of each JSON object corresponds to the crawled data for individual pages or resources discovered during the crawl.
  • No binary data output is indicated for this operation.
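
The array-splitting rule above can be sketched as a small pure function. This is an illustrative sketch, not the node's source code: `to_items` is a hypothetical name, the `{"json": ...}` wrapper follows the usual n8n item convention, and wrapping a non-array response as a single item is an assumption.

```python
from typing import Any, Dict, List


def to_items(response: Any) -> List[Dict[str, Any]]:
    """Turn an API response into a list of n8n-style items."""
    if isinstance(response, list):
        # One output item per array element, as described above.
        return [{"json": element} for element in response]
    # Assumption: anything that is not an array becomes a single item.
    return [{"json": response}]


# Made-up sample data; the real field names depend on the API response.
pages = [{"url": "https://example.com/a"}, {"url": "https://example.com/b"}]
items = to_items(pages)
```

With the sample data, `items` holds two entries, so downstream nodes process each crawled page separately.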

Dependencies

  • Requires an API key credential for authentication with the Horizon Data Wave API service.
  • The node uses the Horizon Data Wave API base URL https://api.horizondatawave.ai/api/website by default but allows overriding via the "Base URL" property.
  • Network access to the API endpoint is necessary.
  • The node relies on n8n's HTTP request helper with authentication support.

Troubleshooting

  • Timeouts: If the crawl takes longer than the specified timeout, the operation may fail or return partial results. Increase the "Timeout (seconds)" value if needed.
  • Authentication errors: Ensure that the API key credential is correctly configured and valid.
  • Invalid URL: Providing an invalid or unreachable starting URL will cause errors. Verify the URL format and accessibility.
  • API errors: Any error messages returned by the Horizon Data Wave API will be surfaced in the node output if "Continue On Fail" is enabled; otherwise, they will stop execution.
  • Empty results: If no pages are crawled, check if the starting URL is correct and accessible, and verify network connectivity.
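
The difference that "Continue On Fail" makes for API errors can be sketched as follows. This is a minimal model of the behavior described above, not the node's implementation: `run_crawl`, `call_api`, and `failing_call` are hypothetical names, and the exact shape of the error item is an assumption.

```python
from typing import Any, Callable, Dict, List


def run_crawl(call_api: Callable[[], Any], continue_on_fail: bool) -> List[Dict[str, Any]]:
    """Run one crawl call and decide what happens when it fails."""
    try:
        result = call_api()
    except Exception as error:
        if continue_on_fail:
            # Surface the error message as output instead of stopping.
            return [{"json": {"error": str(error)}}]
        # Without "Continue On Fail", the error stops execution.
        raise
    return [{"json": result}]


def failing_call() -> Any:
    """Stand-in for an API call rejected because of a bad API key."""
    raise RuntimeError("401 Unauthorized")


error_items = run_crawl(failing_call, continue_on_fail=True)
```

Here `error_items` carries the error message as data, whereas calling `run_crawl(failing_call, continue_on_fail=False)` would re-raise the `RuntimeError`.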
