Overview
This node integrates with the Horizon Data Wave (HDW) Web Parser API to perform web data extraction and crawling tasks. It supports three main operations:
- Scrape: Extract content from a single webpage in various formats such as Markdown, HTML, raw HTML, screenshots, and links. It can focus on the main content area and supports mobile viewport emulation.
- Map: Discover URLs starting from a given URL, optionally filtering by search terms, including or excluding subdomains, and controlling sitemap usage.
- Crawl: Perform a crawl starting from a URL to traverse multiple pages within a time limit.
This node is useful for scenarios like gathering structured content from websites, generating site maps, or performing automated website analysis and data collection.
Practical Examples
- Scraping product descriptions and images from an e-commerce page.
- Mapping all accessible URLs on a company website for SEO auditing.
- Crawling news articles starting from a homepage to collect recent posts.
Properties
| Name | Meaning |
|---|---|
| Base URL | Custom API base URL for the HDW Web Parser API; leave empty to use the default endpoint. |
Note: The node also supports additional properties depending on the selected operation:
For "Scrape" Operation:
| Name | Meaning |
|---|---|
| URL | The webpage URL to scrape. |
| Formats | Content formats to extract: Markdown, HTML, Raw HTML, Screenshot, Links, Full Page Screenshot. |
| Only Main Content | Whether to extract only the main content, filtering out navigation, footers, etc. |
| Mobile | Use a mobile viewport for scraping. |
| Skip TLS Verification | Skip verification of TLS certificates (useful for self-signed certs). |
| Timeout (ms) | Maximum wait time in milliseconds for the page to load. |
| Remove Base64 Images | Remove base64 encoded images from the output. |
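For orientation, a scrape call might look roughly like the following if issued directly over HTTP. This is a minimal sketch: the `/scrape` path, the authorization header name, and the request body field names are assumptions made for illustration, not the documented HDW contract; the node assembles the real request from the properties above.

```typescript
// Hypothetical direct call to the HDW Web Parser scrape endpoint.
// Path, header name, and body fields are illustrative assumptions.
const BASE_URL = "https://api.horizondatawave.ai/api/website";

async function scrapePage(apiToken: string): Promise<unknown> {
  const response = await fetch(`${BASE_URL}/scrape`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiToken}`, // header name is an assumption
    },
    body: JSON.stringify({
      url: "https://example.com/product/123", // URL
      formats: ["markdown", "links"],         // Formats
      onlyMainContent: true,                  // Only Main Content
      mobile: false,                          // Mobile
      skipTlsVerification: false,             // Skip TLS Verification
      timeout: 30000,                         // Timeout (ms)
      removeBase64Images: true,               // Remove Base64 Images
    }),
  });
  if (!response.ok) {
    throw new Error(`Scrape failed: ${response.status} ${response.statusText}`);
  }
  return response.json();
}
```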
For "Map" Operation:
| Name | Meaning |
|---|---|
| URL | Starting URL for URL discovery. |
| Search Term | Optional term to filter discovered URLs. |
| Ignore Sitemap | Skip sitemap.xml discovery; use only HTML links. |
| Sitemap Only | Use only sitemap.xml for discovery; ignore HTML links. |
| Include Subdomains | Include URLs from subdomains in results. |
| Limit | Maximum number of URLs to return. |
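A comparable sketch for the map operation; again the `/map` path and body field names are assumptions, chosen only to mirror the properties above.

```typescript
// Hypothetical direct call for the map operation; the "/map" path and
// body field names are illustrative assumptions.
async function mapSite(apiToken: string): Promise<string[]> {
  const baseUrl = "https://api.horizondatawave.ai/api/website";
  const response = await fetch(`${baseUrl}/map`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiToken}`, // header name is an assumption
    },
    body: JSON.stringify({
      url: "https://example.com", // URL
      search: "pricing",          // Search Term
      ignoreSitemap: false,       // Ignore Sitemap
      sitemapOnly: false,         // Sitemap Only
      includeSubdomains: true,    // Include Subdomains
      limit: 500,                 // Limit
    }),
  });
  if (!response.ok) {
    throw new Error(`Map failed: ${response.status}`);
  }
  // Assumes the API returns a JSON array of URL strings.
  return (await response.json()) as string[];
}
```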
For "Crawl" Operation:
| Name | Meaning |
|---|---|
| URL | Starting URL for the crawl. |
| Timeout (seconds) | Maximum duration in seconds for the crawl operation. |
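And a sketch for the crawl operation, under the same caveats: the `/crawl` path and field names are assumptions for illustration.

```typescript
// Hypothetical direct call for the crawl operation; the "/crawl" path and
// body field names are illustrative assumptions.
async function crawlSite(apiToken: string): Promise<unknown> {
  const baseUrl = "https://api.horizondatawave.ai/api/website";
  const response = await fetch(`${baseUrl}/crawl`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiToken}`, // header name is an assumption
    },
    body: JSON.stringify({
      url: "https://example.com/news", // URL
      timeout: 120,                    // Timeout (seconds)
    }),
  });
  if (!response.ok) {
    throw new Error(`Crawl failed: ${response.status}`);
  }
  return response.json();
}
```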
Output
The node outputs JSON data representing the response from the HDW Web Parser API:
- For Scrape, the output includes extracted content in the requested formats (e.g., Markdown text, HTML snippets, screenshots as URLs or binary references, lists of links).
- For Map, the output is an array of discovered URLs matching the criteria.
- For Crawl, the output contains crawl results such as visited URLs and possibly extracted data per page.
If the API returns an array, each element is output as a separate item. Otherwise, the entire response is output as a single JSON object.
Binary data such as screenshots may be included as URLs or references but are not directly embedded in the output.
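The array-versus-object behavior amounts to a small normalization step, sketched below under the assumption that the response is plain JSON; the node's actual internals may differ.

```typescript
// Sketch of the output mapping described above: an array response becomes
// one output item per element, anything else becomes a single item.
type Json = Record<string, unknown>;

function toOutputItems(apiResponse: unknown): Json[] {
  if (Array.isArray(apiResponse)) {
    // e.g. a Map response: one item per discovered URL
    return apiResponse.map((element) =>
      typeof element === "object" && element !== null
        ? (element as Json)
        : { value: element },
    );
  }
  // Scrape and Crawl responses are emitted as a single item
  return [apiResponse as Json];
}
```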
Dependencies
- Requires an API authentication token credential for the Horizon Data Wave API.
- Uses the HDW Web Parser API endpoint, defaulting to `https://api.horizondatawave.ai/api/website` unless overridden by the "Base URL" property (see the sketch after this list).
- Network access to the API endpoint must be available.
- No other external dependencies.
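The base-URL fallback mentioned above can be pictured as follows; the trailing-slash handling here is an assumption for illustration.

```typescript
// Sketch: resolve the effective endpoint from the optional Base URL property.
const DEFAULT_BASE_URL = "https://api.horizondatawave.ai/api/website";

function resolveBaseUrl(baseUrlProperty?: string): string {
  const base = baseUrlProperty?.trim();
  // An empty property falls back to the default endpoint; a trailing slash
  // is stripped so operation paths can be appended uniformly (an assumption).
  return (base ? base : DEFAULT_BASE_URL).replace(/\/+$/, "");
}
```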
Troubleshooting
Common issues:
- Invalid or missing API credentials will cause authentication failures.
- Incorrect URLs or unreachable targets may result in HTTP errors or timeouts.
- Setting very low timeout values might cause premature termination of requests.
- Skipping TLS verification can expose security risks; use only if necessary.
Error messages:
- Errors returned from the API are passed through in the output JSON under an `error` field if "Continue On Fail" is enabled (sketched below).
- Network or HTTP errors will throw exceptions unless handled by the node's error handling settings.
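In n8n-style nodes, "Continue On Fail" handling typically follows the pattern below. This is a simplified sketch, not the node's actual source; `callApi` is a hypothetical stand-in for the real request logic.

```typescript
// Simplified sketch of "Continue On Fail": failed items are emitted with an
// `error` field instead of aborting the whole run.
async function processItems(
  items: Array<{ url: string }>,
  continueOnFail: boolean,
  callApi: (url: string) => Promise<unknown>, // hypothetical request helper
): Promise<unknown[]> {
  const results: unknown[] = [];
  for (const item of items) {
    try {
      results.push(await callApi(item.url));
    } catch (err) {
      if (!continueOnFail) throw err; // default: abort on the first error
      results.push({ error: (err as Error).message, url: item.url });
    }
  }
  return results;
}
```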
Resolutions:
- Verify API credentials and permissions.
- Ensure target URLs are correct and accessible.
- Adjust timeout settings according to network conditions.
- Enable "Continue On Fail" to handle partial failures gracefully.
Links and References
- Horizon Data Wave API documentation (location assumed from the API domain; verify against the actual docs).
- n8n documentation on the HTTP Request node, for background on HTTP request options.
- General web scraping best practices and legal considerations.
