Overview
This node, named "HDW Web Parser (beta)", enables users to parse and crawl websites using the Horizon Data Wave API. It supports three main operations:
- Scrape: Extract content from a single webpage in various formats such as Markdown, HTML, raw HTML, screenshots, and links.
- Map: Discover URLs starting from a given URL, optionally filtering by search terms or sitemap usage.
- Crawl: Perform a multi-page crawl starting from a specified URL, useful for gathering data across many pages within a site.
The Crawl operation starts from a given URL and follows links across multiple pages within a configurable timeout. This is useful for scenarios such as comprehensive website data extraction, SEO analysis, or content auditing, where automated traversal of many pages is required. A rough sketch of how the operations might map to API endpoints is shown below.
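As a loose illustration, the three operations presumably correspond to different endpoints under the same base URL. The path suffixes in this sketch are assumptions for illustration only; they are not taken from the Horizon Data Wave documentation.

```typescript
// Hypothetical mapping of node operations to API paths.
// The /scrape, /map, and /crawl suffixes are assumptions, not confirmed endpoint names.
type Operation = 'scrape' | 'map' | 'crawl';

const DEFAULT_BASE_URL = 'https://api.horizondatawave.ai/api/website';

function endpointFor(operation: Operation, baseUrl: string = DEFAULT_BASE_URL): string {
  const paths: Record<Operation, string> = {
    scrape: '/scrape', // single-page extraction
    map: '/map',       // URL discovery
    crawl: '/crawl',   // multi-page crawl
  };
  return `${baseUrl}${paths[operation]}`;
}

// Example: endpointFor('crawl') -> 'https://api.horizondatawave.ai/api/website/crawl'
```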
Practical Example
- Starting from a homepage URL, the node can crawl through all linked pages within a domain to collect structured data or metadata.
- Use it to gather product information across an e-commerce site by crawling category and product pages automatically.
- Crawl news websites to aggregate articles published over time.
Properties
| Name | Meaning |
|---|---|
| Base URL | Custom API base URL to override the default Horizon Data Wave API endpoint. |
| URL | Starting URL for the crawl operation; the initial page from which crawling begins. |
| Timeout (seconds) | Maximum duration allowed for the crawl operation before it times out (default 300 sec). |
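As a minimal illustration, a Crawl configuration might look like the following. The internal property names are assumptions; only the labels in the table above come from the node itself.

```typescript
// Illustrative values for a Crawl run (property names are assumed; labels match the table above).
const crawlParams = {
  baseUrl: 'https://api.horizondatawave.ai/api/website', // "Base URL": keep the default unless overriding
  url: 'https://example.com',                            // "URL": starting page for the crawl
  timeout: 300,                                          // "Timeout (seconds)": default is 300
};
```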
Output
- The output JSON contains the results returned by the Horizon Data Wave API crawl endpoint.
- If the API returns an array, each element is output as a separate item (see the sketch after this list).
- The structure of each JSON object corresponds to the crawled data for individual pages or resources discovered during the crawl.
- This operation does not produce binary output.
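A minimal sketch of how an array response could be split into separate n8n items, assuming the response has already been parsed as JSON:

```typescript
import type { IDataObject, INodeExecutionData } from 'n8n-workflow';

// Split an array response into one n8n item per element; wrap anything else as a single item.
function toItems(response: unknown): INodeExecutionData[] {
  if (Array.isArray(response)) {
    return response.map((entry) => ({ json: entry as IDataObject }));
  }
  return [{ json: response as IDataObject }];
}
```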
Dependencies
- Requires an API key credential for authentication with the Horizon Data Wave API service.
- The node uses the Horizon Data Wave API base URL https://api.horizondatawave.ai/api/website by default, but this can be overridden via the "Base URL" property.
- Network access to the API endpoint is necessary.
- The node relies on n8n's HTTP request helper with authentication support.
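A minimal sketch of how such an authenticated call could look inside an n8n node's execute context. The credential name `hdwApi`, the `/crawl` path, and the request body shape are assumptions, not taken from the node's source:

```typescript
import type { IExecuteFunctions, IHttpRequestOptions } from 'n8n-workflow';

// Sketch only: issue an authenticated POST to a crawl endpoint via n8n's request helper.
// 'hdwApi' (credential name), the '/crawl' path, and the body shape are assumptions.
async function runCrawl(this: IExecuteFunctions, baseUrl: string, url: string, timeout: number) {
  const options: IHttpRequestOptions = {
    method: 'POST',
    url: `${baseUrl}/crawl`,
    body: { url, timeout },
    json: true,
    timeout: timeout * 1000, // helper timeout is in milliseconds
  };
  return this.helpers.httpRequestWithAuthentication.call(this, 'hdwApi', options);
}
```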
Troubleshooting
- Timeouts: If the crawl takes longer than the specified timeout, the operation may fail or return partial results. Increase the "Timeout (seconds)" value if needed.
- Authentication errors: Ensure that the API key credential is correctly configured and valid.
- Invalid URL: Providing an invalid or unreachable starting URL will cause errors. Verify the URL format and accessibility.
- API errors: Any error messages returned by the Horizon Data Wave API are surfaced in the node output if "Continue On Fail" is enabled; otherwise, they stop execution (see the sketch after this list).
- Empty results: If no pages are crawled, check if the starting URL is correct and accessible, and verify network connectivity.
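A minimal sketch of the usual n8n error-handling pattern behind the "Continue On Fail" behavior, reusing the hypothetical runCrawl helper sketched above:

```typescript
import type { IExecuteFunctions, INodeExecutionData } from 'n8n-workflow';

// Typical n8n pattern: with "Continue On Fail" enabled, an error becomes an output item
// instead of stopping the workflow. runCrawl is the hypothetical helper sketched earlier.
async function crawlWithErrorHandling(
  this: IExecuteFunctions,
  baseUrl: string,
  url: string,
  timeout: number,
): Promise<INodeExecutionData[]> {
  try {
    const response = await runCrawl.call(this, baseUrl, url, timeout);
    return [{ json: response }];
  } catch (error) {
    if (this.continueOnFail()) {
      return [{ json: { error: (error as Error).message } }];
    }
    throw error;
  }
}
```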
Links and References
- Horizon Data Wave API Documentation (general reference, actual docs URL may vary)
- n8n HTTP Request Node documentation: https://docs.n8n.io/nodes/n8n-nodes-base.httpRequest/
