
HDW Web Parser (beta)

Parse and crawl websites using the Horizon Data Wave API

Overview

This node integrates with the Horizon Data Wave (HDW) Web Parser API to perform web data extraction and crawling tasks. It supports three main operations:

  • Scrape: Extract content from a single webpage in various formats such as Markdown, HTML, raw HTML, screenshots, and links. It can focus on the main content area and supports mobile viewport emulation.
  • Map: Discover URLs starting from a given URL, optionally filtering by search terms, including or excluding subdomains, and controlling sitemap usage.
  • Crawl: Perform a crawl starting from a URL to traverse multiple pages within a time limit.

This node is useful for scenarios like gathering structured content from websites, generating site maps, or performing automated website analysis and data collection.
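For orientation, the sketch below shows the general request pattern such a node could follow when talking to the HDW Web Parser API. The HTTP method, the operation-specific paths (/scrape, /map, /crawl), and the Authorization header are illustrative assumptions, not confirmed details of the HDW API.

```typescript
// Minimal sketch of the call pattern, not the node's actual implementation.
// Endpoint paths, HTTP method, and the auth header name are assumptions.
const BASE_URL = "https://api.horizondatawave.ai/api/website";

type Operation = "scrape" | "map" | "crawl";

async function callWebParser(
  operation: Operation,
  body: Record<string, unknown>,
  token: string,
): Promise<unknown> {
  const response = await fetch(`${BASE_URL}/${operation}`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Hypothetical header; check the HDW credential setup for the real one.
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify(body),
  });
  if (!response.ok) {
    throw new Error(`HDW Web Parser request failed: ${response.status}`);
  }
  return response.json();
}
```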

Practical Examples

  • Scraping product descriptions and images from an e-commerce page.
  • Mapping all accessible URLs on a company website for SEO auditing.
  • Crawling news articles starting from a homepage to collect recent posts.

Properties

  • Base URL: Custom API base URL for the HDW Web Parser API; leave empty to use the default endpoint.

Note: The node also supports additional properties depending on the selected operation:

For "Scrape" Operation:

  • URL: The webpage URL to scrape.
  • Formats: Content formats to extract: Markdown, HTML, Raw HTML, Screenshot, Links, Full Page Screenshot.
  • Only Main Content: Whether to extract only the main content, filtering out navigation, footers, etc.
  • Mobile: Use a mobile viewport for scraping.
  • Skip TLS Verification: Skip verification of TLS certificates (useful for self-signed certificates).
  • Timeout (ms): Maximum wait time in milliseconds for the page to load.
  • Remove Base64 Images: Remove base64-encoded images from the output.
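A Scrape request body might look like the following sketch; the JSON key names are assumptions derived from the property labels above and may not match the API's actual schema.

```typescript
// Illustrative scrape request body; key names mirror the property labels
// above and are not confirmed API field names.
const scrapeBody = {
  url: "https://example.com/product/123",
  formats: ["markdown", "links"],
  onlyMainContent: true,
  mobile: false,
  skipTlsVerification: false,
  timeout: 30000, // milliseconds to wait for the page to load
  removeBase64Images: true,
};
```

With the earlier sketch, this could be sent as callWebParser("scrape", scrapeBody, token).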

For "Map" Operation:

  • URL: Starting URL for URL discovery.
  • Search Term: Optional term to filter discovered URLs.
  • Ignore Sitemap: Skip sitemap.xml discovery; use only HTML links.
  • Sitemap Only: Use only sitemap.xml for discovery; ignore HTML links.
  • Include Subdomains: Include URLs from subdomains in results.
  • Limit: Maximum number of URLs to return.
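A hypothetical Map request body, with key names again inferred from the property labels rather than taken from the API schema:

```typescript
// Illustrative map request body; key names are assumptions.
const mapBody = {
  url: "https://example.com",
  search: "pricing",       // optional filter term
  ignoreSitemap: false,    // discover via sitemap.xml and HTML links
  sitemapOnly: false,
  includeSubdomains: true,
  limit: 500,              // cap on returned URLs
};
```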

For "Crawl" Operation:

  • URL: Starting URL for the crawl.
  • Timeout (seconds): Maximum duration in seconds for the crawl operation.
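And a minimal Crawl request body under the same assumptions about key names:

```typescript
// Illustrative crawl request body; key names are assumptions.
const crawlBody = {
  url: "https://example.com/blog",
  timeout: 120, // seconds before the crawl stops
};
```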

Output

The node outputs JSON data representing the response from the HDW Web Parser API:

  • For Scrape, the output includes extracted content in the requested formats (e.g., Markdown text, HTML snippets, screenshots as URLs or binary references, lists of links).
  • For Map, the output is an array of discovered URLs matching the criteria.
  • For Crawl, the output contains crawl results such as visited URLs and possibly extracted data per page.

If the API returns an array, each element is output as a separate item. Otherwise, the entire response is output as a single JSON object.

Binary data such as screenshots may be included as URLs or references but is not directly embedded in the output.
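The array-versus-object behavior described above can be pictured with a small sketch (TypeScript, n8n-style items with a json wrapper; an illustration, not the node's actual source):

```typescript
// Fan a response out into items: an array becomes one item per element,
// anything else becomes a single item.
function toItems(response: unknown): Array<{ json: Record<string, unknown> }> {
  if (Array.isArray(response)) {
    return response.map((entry) => ({ json: entry as Record<string, unknown> }));
  }
  return [{ json: response as Record<string, unknown> }];
}
```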

Dependencies

  • Requires an API authentication token credential for the Horizon Data Wave API.
  • Uses the HDW Web Parser API endpoint, defaulting to https://api.horizondatawave.ai/api/website unless overridden by the "Base URL" property.
  • Network access to the API endpoint must be available.
  • No other external dependencies.
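The fallback between the "Base URL" property and the default endpoint can be expressed as a short sketch (illustrative only):

```typescript
// An empty "Base URL" property falls back to the documented default endpoint.
const DEFAULT_BASE_URL = "https://api.horizondatawave.ai/api/website";

function resolveBaseUrl(baseUrlProperty: string): string {
  const trimmed = baseUrlProperty.trim();
  return trimmed.length > 0 ? trimmed : DEFAULT_BASE_URL;
}
```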

Troubleshooting

  • Common issues:

    • Invalid or missing API credentials will cause authentication failures.
    • Incorrect URLs or unreachable targets may result in HTTP errors or timeouts.
    • Setting very low timeout values might cause premature termination of requests.
    • Skipping TLS verification can expose security risks; use only if necessary.
  • Error messages:

    • Errors returned from the API are passed through in the output JSON under an error field if "Continue On Fail" is enabled.
    • Network or HTTP errors will throw exceptions unless handled by the node's error handling settings.
  • Resolutions:

    • Verify API credentials and permissions.
    • Ensure target URLs are correct and accessible.
    • Adjust timeout settings according to network conditions.
    • Enable "Continue On Fail" to handle partial failures gracefully.
