
HDW Web Parser (beta)

Parse and crawl websites using Horizon Data Wave API

Overview

This node integrates with the Horizon Data Wave API to parse and crawl websites. It supports three main operations:

  • Scrape: Extract content from a single webpage in various formats such as Markdown, HTML, raw HTML, screenshots, and links.
  • Map: Discover URLs starting from a given URL by analyzing sitemap.xml and/or HTML links on the page, optionally filtering results by search terms or subdomains.
  • Crawl: Perform a multi-page crawl starting from a URL, gathering data across multiple linked pages.

The Map operation is useful for scenarios where you want to gather a list of URLs related to a website, for example to build a sitemap, analyze site structure, or prepare a batch scraping job. You can filter discovered URLs by keywords, limit the number of results, and control whether to include subdomains or rely solely on sitemap.xml.

Practical Examples

  • Discover all product pages on an e-commerce site by starting from the homepage URL and filtering URLs containing "product".
  • Generate a list of blog post URLs by crawling only the sitemap.xml of a news website.
  • Collect URLs from a corporate website including its subdomains for comprehensive site analysis.
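The keyword filtering used in the first example can also be done client-side after the node runs, for instance in a Code node. A minimal sketch (the function name and sample URLs are illustrative, not part of the node's API):

```typescript
// Filter a list of discovered URLs by a keyword, mirroring what the
// Search Term property does on the API side.
function filterBySearchTerm(urls: string[], term: string): string[] {
  const needle = term.toLowerCase();
  return urls.filter((u) => u.toLowerCase().includes(needle));
}

const discovered = [
  "https://example.com/product/red-shoe",
  "https://example.com/about",
  "https://example.com/products",
];

// Keeps only the two URLs containing "product"
console.log(filterBySearchTerm(discovered, "product"));
```

Using the node's own Search Term property is preferable when possible, since it filters before the results are transferred.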

Properties

  • Base URL: Custom API base URL to override the default Horizon Data Wave API endpoint.
  • URL: Starting URL for URL discovery (required).
  • Search Term: Optional keyword to filter discovered URLs; only URLs containing this term will be returned.
  • Ignore Sitemap: If true, skip sitemap.xml discovery and use only HTML links found on pages.
  • Sitemap Only: If true, use only sitemap.xml for discovery and ignore HTML links.
  • Include Subdomains: If true, include URLs from subdomains of the starting URL in the results.
  • Limit: Maximum number of URLs to return (minimum 1).
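The properties above map onto the request the node sends. The sketch below shows one plausible shape of that request body and validates the two constraints the properties imply; the field names are assumptions for illustration, not the Horizon Data Wave API's documented schema:

```typescript
// Hypothetical /map request body assembled from the node's properties.
// Field names here are illustrative assumptions.
interface MapRequest {
  url: string;               // required starting URL
  search?: string;           // Search Term
  ignoreSitemap?: boolean;   // Ignore Sitemap
  sitemapOnly?: boolean;     // Sitemap Only
  includeSubdomains?: boolean; // Include Subdomains
  limit?: number;            // Limit (minimum 1)
}

function buildMapRequest(params: MapRequest): MapRequest {
  if (params.limit !== undefined && params.limit < 1) {
    throw new Error("Limit must be at least 1");
  }
  if (params.ignoreSitemap && params.sitemapOnly) {
    throw new Error("Ignore Sitemap and Sitemap Only are mutually exclusive");
  }
  return params;
}
```

Validating the Ignore Sitemap / Sitemap Only combination up front avoids the unexpected behavior noted in Troubleshooting below.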

Output

The output is an array of JSON objects, each representing a discovered URL or related metadata returned by the API. The exact structure depends on the API response but generally includes URL strings and possibly additional information about each link.

No binary data output is produced by the Map operation.

Example output snippet:

[
  {
    "url": "https://example.com/page1",
    "lastModified": "2024-01-01T12:00:00Z"
  },
  {
    "url": "https://blog.example.com/post1",
    "lastModified": "2024-01-02T08:30:00Z"
  }
]
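Downstream nodes often only need the URL strings. A small sketch for post-processing output shaped like the snippet above (the `MapResult` interface follows that snippet; the real response may carry additional fields):

```typescript
// Pull hostnames out of Map results shaped like the example snippet.
interface MapResult {
  url: string;
  lastModified?: string;
}

function extractHosts(results: MapResult[]): string[] {
  return results.map((r) => new URL(r.url).hostname);
}

const sample: MapResult[] = [
  { url: "https://example.com/page1", lastModified: "2024-01-01T12:00:00Z" },
  { url: "https://blog.example.com/post1", lastModified: "2024-01-02T08:30:00Z" },
];

// Yields one hostname per result, e.g. for grouping by subdomain
console.log(extractHosts(sample));
```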

Dependencies

  • Requires an API key credential for the Horizon Data Wave API.
  • The node makes authenticated HTTP POST requests to the /map endpoint (or, when Base URL is set, custom base URL + /map).
  • No other external dependencies are needed.
  • Ensure the API key credential is configured properly in n8n.
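For reference, the authenticated POST described above can be sketched as follows. The authentication header name (`access-token` here) is an assumption; the actual scheme is defined by the Horizon Data Wave API credential configured in n8n:

```typescript
// Build the options for an authenticated POST to the /map endpoint.
// The "access-token" header name is an illustrative assumption.
function buildRequestOptions(apiKey: string, body: object) {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "access-token": apiKey,
    },
    body: JSON.stringify(body),
  };
}

// Usage (not executed here):
// const res = await fetch(`${baseUrl}/map`,
//   buildRequestOptions(apiKey, { url: "https://example.com" }));
```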

Troubleshooting

  • Common issues:

    • Invalid or missing API key credential will cause authentication errors.
    • Providing an invalid or unreachable starting URL may result in HTTP errors or empty results.
    • Enabling both Ignore Sitemap and Sitemap Only is contradictory and may lead to unexpected behavior.
    • Setting the Limit parameter too high may cause slow responses or trigger API rate limiting.
  • Error messages:

    • Errors from the API are captured and returned with details such as HTTP status, API error message, request ID, execution time, and token usage.
    • Typical errors include authentication failures, invalid parameters, or server-side issues.
    • To resolve, verify credentials, check URL validity, and adjust parameters accordingly.
    • Use the "Continue On Fail" option in n8n to handle errors gracefully without stopping workflow execution.
