
HDW Web Parser (beta)

Parse and crawl websites using Horizon Data Wave API

Overview

This node integrates with the Horizon Data Wave API to parse and crawl websites. Specifically, the Crawl operation starts a crawl from a given URL and explores multiple pages within the site. This is useful for gathering extensive data across many linked pages automatically.

Common scenarios include:

  • Collecting content or metadata from an entire website starting from a homepage or specific entry point.
  • Automating large-scale web data extraction where manual scraping of individual pages would be inefficient.
  • Monitoring changes or updates across multiple pages on a domain.

For example, you might start a crawl at https://example.com to gather all accessible pages' content or links, enabling comprehensive analysis or archiving.
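To make the crawl request concrete, here is a minimal Python sketch of how the node's properties could map onto an API call. The endpoint path (`/crawl`), the auth header name (`access-token`), and the body field names (`url`, `timeout`) are illustrative assumptions based on the properties listed below, not confirmed details of the Horizon Data Wave API.

```python
# Assumptions: the "/crawl" path, the "access-token" header, and the
# "url"/"timeout" body fields are illustrative, not confirmed API names.
DEFAULT_BASE_URL = "https://api.horizondatawave.ai/api/website"

def build_crawl_request(start_url, timeout_seconds, api_key, base_url=None):
    """Assemble the HTTP request a Crawl operation might send."""
    base = (base_url or DEFAULT_BASE_URL).rstrip("/")
    return {
        "method": "POST",
        "url": f"{base}/crawl",                # endpoint path: assumption
        "headers": {"access-token": api_key},  # header name: assumption
        "json": {"url": start_url, "timeout": timeout_seconds},
    }

req = build_crawl_request("https://example.com", 300, "MY_API_KEY")
```

Leaving `base_url` as `None` falls back to the default endpoint, matching the "Base URL" override behavior described under Properties.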

Properties

Name               Meaning
Base URL           Custom API base URL that overrides the default Horizon Data Wave API endpoint.
URL                Starting URL for the crawl operation; the initial webpage from which crawling begins.
Timeout (Seconds)  Maximum time, in seconds, the crawl operation may run before timing out.

Output

The node outputs JSON data representing the results of the crawl operation. The structure depends on the API response but generally includes details about the crawled pages such as URLs, page content summaries, metadata, or discovered links.

If the API returns multiple items (e.g., multiple pages), each item is output as a separate JSON object in the array of results.

This operation does not produce binary output.
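The per-item output described above can be sketched as follows. This mirrors the common n8n convention of wrapping each result in a `json` key; the exact shape of the API response is an assumption.

```python
def to_items(response_json):
    """Flatten an API response into n8n-style items: a list becomes
    one item per element, a single object becomes a single item."""
    if isinstance(response_json, list):
        return [{"json": page} for page in response_json]
    return [{"json": response_json}]

# A response covering two crawled pages yields two separate items.
items = to_items([
    {"url": "https://example.com", "title": "Home"},
    {"url": "https://example.com/about", "title": "About"},
])
```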

Dependencies

  • Requires an API key credential for authenticating with the Horizon Data Wave API.
  • The node uses the Horizon Data Wave API endpoint https://api.horizondatawave.ai/api/website by default, but this can be overridden via the "Base URL" property.
  • Network access to the API endpoint must be available.
  • Proper configuration of the API authentication credential in n8n is necessary.

Troubleshooting

  • Timeouts: If the crawl takes longer than the specified timeout, the operation may fail or return partial results. Increase the "Timeout (Seconds)" value if needed.
  • Authentication errors: Ensure the API key credential is valid and has sufficient permissions.
  • API errors: The node captures HTTP status codes and error messages returned by the API. Common issues include rate limiting, invalid URLs, or malformed requests.
  • Empty or incomplete results: Verify that the starting URL is correct and accessible. Also, check if the API service is operational.
  • Network issues: Confirm that your environment allows outbound HTTPS requests to the API endpoint.

Error responses include detailed information such as HTTP status, API error messages, request IDs, execution times, and token usage points to aid debugging.
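A small sketch of surfacing those diagnostic fields when a request fails. The field names (`message`, `request_id`) are illustrative assumptions about the error body, not documented Horizon Data Wave response keys.

```python
class HdwApiError(Exception):
    """Carries the diagnostic details described above for debugging."""

def raise_for_api_error(status_code, body):
    """Raise on HTTP errors, attaching status and (assumed) error fields.

    Field names like "message" and "request_id" are assumptions about
    the error body's shape, used here only for illustration.
    """
    if status_code < 400:
        return
    raise HdwApiError({
        "status": status_code,
        "message": body.get("message"),
        "request_id": body.get("request_id"),
    })
```

For example, a 429 response (rate limiting, one of the common issues listed above) would raise with the status and any message the API returned.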
