HDW Web Parser (beta)

Parse and crawl websites using Horizon Data Wave API

Overview

This node integrates with the Horizon Data Wave API to parse and crawl websites. Specifically, the Map operation discovers URLs starting from a given URL, optionally filtering them by a search term or limiting results. It can use sitemap.xml files and/or HTML links for discovery, and supports including subdomains in the results.

Common scenarios include:

  • Generating a list of URLs from a website for further processing or scraping.
  • Discovering site structure or content pages automatically.
  • Filtering URLs based on keywords to focus on relevant pages.

For example, you might start from a homepage URL and map all linked pages containing a certain keyword, limiting the output to 1000 URLs.
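
To make the filtering semantics concrete, here is a minimal client-side sketch of the same idea: keep URLs on the starting host (optionally its subdomains), match a search term, and stop at a limit. The function name `filter_urls` and the substring-matching rule are assumptions for illustration; the API's own matching logic may differ.

```python
from urllib.parse import urlparse


def filter_urls(urls, start_url, search_term=None, include_subdomains=False, limit=1000):
    """Approximate Map-style filtering locally (illustrative, not the API's logic)."""
    root = urlparse(start_url).hostname or ""
    selected = []
    for url in urls:
        if len(selected) >= limit:
            break  # cap reached, remaining URLs are dropped
        host = urlparse(url).hostname or ""
        same_host = host == root
        is_subdomain = host.endswith("." + root)
        if not (same_host or (include_subdomains and is_subdomain)):
            continue
        # Assumption: the search term is a case-insensitive substring match.
        if search_term and search_term.lower() not in url.lower():
            continue
        selected.append(url)
    return selected
```

Calling `filter_urls(candidates, "https://example.com", search_term="blog", limit=1000)` mirrors the scenario above: start from a homepage, keep only URLs containing the keyword, and return at most 1000 of them.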

Properties

Name                Meaning
Base URL            Custom API base URL; leave empty to use the default Horizon Data Wave API endpoint.
URL                 Starting URL for URL discovery (required).
Search Term         Optional keyword to filter discovered URLs by matching text.
Ignore Sitemap      If true, skip sitemap.xml discovery and only use HTML links for URL discovery.
Sitemap Only        If true, only use sitemap.xml for discovery, ignoring HTML links.
Include Subdomains  If true, include URLs from subdomains in the results.
Limit               Maximum number of URLs to return (default 1000).
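
As a rough illustration of how these properties might be combined into a request body, the sketch below assembles a dictionary and rejects the one combination the properties themselves rule out. The JSON key names (`url`, `search`, `ignoreSitemap`, `sitemapOnly`, `includeSubdomains`, `limit`) and the `False` defaults for the boolean options are assumptions, not taken from the Horizon Data Wave API; only the default limit of 1000 comes from the table above.

```python
def build_map_payload(url, search_term="", ignore_sitemap=False,
                      sitemap_only=False, include_subdomains=False, limit=1000):
    """Build a request body from the node properties (key names are hypothetical)."""
    if not url:
        raise ValueError("URL is required")
    if ignore_sitemap and sitemap_only:
        # One option disables sitemap.xml, the other disables HTML links.
        raise ValueError("Ignore Sitemap and Sitemap Only cannot both be true")
    payload = {
        "url": url,
        "ignoreSitemap": ignore_sitemap,
        "sitemapOnly": sitemap_only,
        "includeSubdomains": include_subdomains,
        "limit": limit,
    }
    if search_term:
        payload["search"] = search_term  # omitted when no filter is requested
    return payload
```

Validating the conflicting flags before sending anything keeps the failure local and explicit instead of relying on whatever the API does with contradictory input.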

Output

The node outputs an array of JSON objects, each representing a discovered URL and its associated metadata as returned by the Horizon Data Wave API. The exact fields depend on the API response; each item typically includes the URL string and may carry additional metadata about the link.

No binary data is output by this operation.
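
Because the exact item shape depends on the API response, downstream code is safer if it reads items defensively. The sketch below collects unique URL strings from a list of items; it assumes each URL lives under a `url` key (a hypothetical field name) and also accepts n8n's usual `{"json": {...}}` item wrapper or bare strings.

```python
def extract_urls(items):
    """Collect unique URL strings from output items, preserving order."""
    urls = []
    seen = set()
    for item in items:
        if isinstance(item, str):
            url = item
        elif isinstance(item, dict):
            # n8n wraps item data under "json"; fall back to the item itself.
            data = item.get("json", item)
            url = data.get("url") if isinstance(data, dict) else None  # "url" key is assumed
        else:
            url = None
        if isinstance(url, str) and url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Items without a usable URL are skipped silently here; a stricter pipeline might log or count them instead.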

Dependencies

  • Requires an API key credential for the Horizon Data Wave API.
  • The node makes authenticated HTTP POST requests to the API endpoints.
  • Optionally configurable base URL for the API if not using the default.
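
For readers who want to see what an authenticated POST with an optional base URL could look like outside n8n, here is a sketch that only constructs the request and does not send it. The default base URL, the `/map` path, and the `Authorization: Bearer` header scheme are all placeholders chosen for illustration; consult the Horizon Data Wave API documentation for the real endpoint and authentication header.

```python
import json
import urllib.request

# Placeholder only: NOT the real Horizon Data Wave endpoint.
DEFAULT_BASE_URL = "https://api.example.invalid"


def build_request(api_key, payload, base_url=""):
    """Construct (but do not send) an authenticated JSON POST request."""
    root = (base_url or DEFAULT_BASE_URL).rstrip("/")
    return urllib.request.Request(
        root + "/map",  # hypothetical path for the Map operation
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen(request, timeout=30)` would send it; an empty `base_url` falls back to the default, mirroring the Base URL property.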

Troubleshooting

  • Common issues:

    • Invalid or missing API credentials will cause authentication failures.
    • Providing an invalid or unreachable starting URL may result in errors or empty results.
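    • Ignore Sitemap and Sitemap Only are mutually exclusive: the first disables sitemap.xml discovery and the second disables HTML link discovery, so enabling both may produce unexpected or empty results.
    • Output is capped at the Limit value, so sites with more matching URLs return a truncated list; timeout constraints may also cut results short.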
  • Error messages:

    • Errors from the API are propagated; typical messages relate to network issues, invalid parameters, or authentication failures.
    • To resolve, verify API credentials, check URL validity, and adjust parameters accordingly.
    • Use the "Continue On Fail" option in n8n to handle errors gracefully if desired.
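
The troubleshooting advice above can be sketched as a small batch runner: validate each starting URL before calling the API, and either stop on the first error or record it and move on, similar in spirit to "Continue On Fail". The names `run_all` and `map_fn` are hypothetical; `map_fn` stands in for whatever function actually performs the Map call.

```python
from urllib.parse import urlparse


def is_valid_start_url(url):
    """Accept only absolute http(s) URLs with a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def run_all(start_urls, map_fn, continue_on_fail=True):
    """Run map_fn per URL; on failure, record the error or re-raise."""
    results = []
    for url in start_urls:
        try:
            if not is_valid_start_url(url):
                raise ValueError(f"invalid starting URL: {url!r}")
            results.append({"url": url, "links": map_fn(url)})
        except Exception as exc:
            if not continue_on_fail:
                raise
            # Keep the failure alongside successful items for later inspection.
            results.append({"url": url, "error": str(exc)})
    return results
```

With `continue_on_fail=True`, one bad URL or network failure produces an item with an `error` field while the remaining URLs are still processed.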

Links and References

  • Horizon Data Wave API documentation, the authoritative source for endpoints, parameters, and response fields.
  • n8n HTTP Request Node documentation for understanding request options and authentication setup.
