
HDW Web Parser (beta)

Parse and crawl websites using Horizon Data Wave API

Overview

This node integrates with the Horizon Data Wave API to parse and crawl websites. It supports three main operations:

  • Scrape: Extract content from a single webpage in various formats such as Markdown, HTML, raw HTML, screenshots, and links.
  • Map: Discover URLs starting from a given URL by analyzing sitemap.xml and/or HTML links on the page, optionally filtering results by search terms or subdomains.
  • Crawl: Perform a multi-page crawl starting from a URL, gathering data across multiple linked pages.

The Map operation is useful for scenarios where you want to gather a list of URLs related to a website, for example to build a sitemap, analyze site structure, or prepare a batch scraping job. You can filter discovered URLs by keywords, limit the number of results, and control whether to include subdomains or rely solely on sitemap.xml.

Practical Examples

  • Discover all product pages on an e-commerce site by starting from the homepage URL and filtering URLs containing "product".
  • Generate a list of blog post URLs by crawling only the sitemap.xml of a news website.
  • Collect URLs from a corporate website including its subdomains for comprehensive site analysis.
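The keyword filtering used in the first example can also be done client-side after the node runs, for instance in a Code node. A minimal sketch (the function name and sample URLs are illustrative, not part of the node's API):

```typescript
// Filter a list of discovered URLs by a keyword, mirroring what the
// Search Term property does on the API side.
function filterBySearchTerm(urls: string[], term: string): string[] {
  const needle = term.toLowerCase();
  return urls.filter((u) => u.toLowerCase().includes(needle));
}

const discovered = [
  "https://example.com/product/red-shoe",
  "https://example.com/about",
  "https://example.com/products",
];

// Keeps only the two URLs containing "product"
console.log(filterBySearchTerm(discovered, "product"));
```

Using the node's own Search Term property is preferable when possible, since it filters before the results are transferred.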

Properties

  • Base URL: Custom API base URL to override the default Horizon Data Wave API endpoint.
  • URL: Starting URL for URL discovery (required).
  • Search Term: Optional keyword to filter discovered URLs; only URLs containing this term will be returned.
  • Ignore Sitemap: If true, skip sitemap.xml discovery and use only HTML links found on pages.
  • Sitemap Only: If true, use only sitemap.xml for discovery and ignore HTML links.
  • Include Subdomains: If true, include URLs from subdomains of the starting URL in the results.
  • Limit: Maximum number of URLs to return (minimum 1).
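The properties above map onto the request the node sends. The sketch below shows one plausible shape of that request body and validates the two constraints the properties imply; the field names are assumptions for illustration, not the Horizon Data Wave API's documented schema:

```typescript
// Hypothetical /map request body assembled from the node's properties.
// Field names here are illustrative assumptions.
interface MapRequest {
  url: string;               // required starting URL
  search?: string;           // Search Term
  ignoreSitemap?: boolean;   // Ignore Sitemap
  sitemapOnly?: boolean;     // Sitemap Only
  includeSubdomains?: boolean; // Include Subdomains
  limit?: number;            // Limit (minimum 1)
}

function buildMapRequest(params: MapRequest): MapRequest {
  if (params.limit !== undefined && params.limit < 1) {
    throw new Error("Limit must be at least 1");
  }
  if (params.ignoreSitemap && params.sitemapOnly) {
    throw new Error("Ignore Sitemap and Sitemap Only are mutually exclusive");
  }
  return params;
}
```

Validating the Ignore Sitemap / Sitemap Only combination up front avoids the unexpected behavior noted in Troubleshooting below.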

Output

The output is an array of JSON objects, each representing a discovered URL or related metadata returned by the API. The exact structure depends on the API response but generally includes URL strings and possibly additional information about each link.

No binary data output is produced by the Map operation.

Example output snippet:

[
  {
    "url": "https://example.com/page1",
    "lastModified": "2024-01-01T12:00:00Z"
  },
  {
    "url": "https://blog.example.com/post1",
    "lastModified": "2024-01-02T08:30:00Z"
  }
]
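Downstream nodes often only need the URL strings. A small sketch for post-processing output shaped like the snippet above (the `MapResult` interface follows that snippet; the real response may carry additional fields):

```typescript
// Pull hostnames out of Map results shaped like the example snippet.
interface MapResult {
  url: string;
  lastModified?: string;
}

function extractHosts(results: MapResult[]): string[] {
  return results.map((r) => new URL(r.url).hostname);
}

const sample: MapResult[] = [
  { url: "https://example.com/page1", lastModified: "2024-01-01T12:00:00Z" },
  { url: "https://blog.example.com/post1", lastModified: "2024-01-02T08:30:00Z" },
];

// Yields one hostname per result, e.g. for grouping by subdomain
console.log(extractHosts(sample));
```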

Dependencies

  • Requires an API key credential for the Horizon Data Wave API.
  • The node makes authenticated HTTP POST requests to the /map endpoint (or, when Base URL is set, custom base URL + /map).
  • No other external dependencies are needed.
  • Ensure the API key credential is configured properly in n8n.
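For reference, the authenticated POST described above can be sketched as follows. The authentication header name (`access-token` here) is an assumption; the actual scheme is defined by the Horizon Data Wave API credential configured in n8n:

```typescript
// Build the options for an authenticated POST to the /map endpoint.
// The "access-token" header name is an illustrative assumption.
function buildRequestOptions(apiKey: string, body: object) {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "access-token": apiKey,
    },
    body: JSON.stringify(body),
  };
}

// Usage (not executed here):
// const res = await fetch(`${baseUrl}/map`,
//   buildRequestOptions(apiKey, { url: "https://example.com" }));
```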

Troubleshooting

  • Common issues:

    • Invalid or missing API key credential will cause authentication errors.
    • Providing an invalid or unreachable starting URL may result in HTTP errors or empty results.
    • Enabling both Ignore Sitemap and Sitemap Only is contradictory and may lead to unexpected behavior.
    • Setting the Limit parameter too high may cause slow responses or trigger API rate limiting.
  • Error messages:

    • Errors from the API are captured and returned with details such as HTTP status, API error message, request ID, execution time, and token usage.
    • Typical errors include authentication failures, invalid parameters, or server-side issues.
    • To resolve, verify credentials, check URL validity, and adjust parameters accordingly.
    • Use the "Continue On Fail" option in n8n to handle errors gracefully without stopping workflow execution.
