Overview
This node, named "HDW Web Parser (beta)", enables scraping and crawling of web pages using the Horizon Data Wave API. It supports three main operations:
- Scrape: Extract content from a single webpage in various formats such as Markdown, HTML, raw HTML, screenshots, and links.
- Map: Discover URLs starting from a given URL, optionally filtering by search terms or sitemap usage.
- Crawl: Start from a URL and collect data from multiple pages, stopping when a specified timeout is reached.
The node is useful for scenarios where automated extraction of web content is needed, such as content aggregation, SEO analysis, competitive research, or data collection for machine learning.
For example, you can scrape a news article's main content in Markdown format, capture a screenshot of a product page, or map all URLs on a website related to a specific topic.
Properties
| Name | Meaning |
|---|---|
| Base URL | Custom API base URL to override the default Horizon Data Wave API endpoint. |
| URL | The target webpage URL to scrape (required for Scrape operation). |
| Formats | Content formats to extract; options include Markdown, HTML, Raw HTML, Screenshot, Links, Full Page Screenshot. Default is Markdown. |
| Only Main Content | Whether to extract only the main content, filtering out navigation, footers, etc. (boolean, default true). |
| Mobile | Use mobile viewport rendering for scraping (boolean, default false). |
| Skip TLS Verification | Skip TLS certificate verification when loading the page (boolean, default false). |
| Timeout (ms) | Maximum time in milliseconds to wait for the page to load (default 1500 ms). |
| Remove Base64 Images | Remove base64 encoded images from the output (boolean, default false). |
These properties are specifically for the Scrape operation.
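To show how the properties above might translate into a request body, here is a minimal sketch. The JSON field names (`onlyMainContent`, `skipTlsVerification`, and so on) are assumptions made for illustration; this document lists only the property labels and defaults, not the names the API expects.

```python
# Hypothetical mapping from the node's property labels to JSON field names.
# The real field names are not given in this document, so every key below
# is an assumption; only the default values come from the table above.
DEFAULTS = {
    "formats": ["markdown"],       # Formats: default is Markdown
    "onlyMainContent": True,       # Only Main Content: default true
    "mobile": False,               # Mobile: default false
    "skipTlsVerification": False,  # Skip TLS Verification: default false
    "timeout": 1500,               # Timeout (ms): default 1500
    "removeBase64Images": False,   # Remove Base64 Images: default false
}


def build_scrape_payload(url, **overrides):
    """Return a Scrape payload: documented defaults plus caller overrides."""
    if not url:
        raise ValueError("URL is required for the Scrape operation")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown properties: {sorted(unknown)}")
    payload = {"url": url, **DEFAULTS, **overrides}
    # Copy the list so callers never mutate the shared default.
    payload["formats"] = list(payload["formats"])
    return payload


payload = build_scrape_payload(
    "https://example.com/article",
    formats=["markdown", "links"],
    timeout=5000,
)
```

Any property not overridden keeps its documented default, so the sketch mirrors what the node would send when only a few fields are changed in the UI.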
Output
The node outputs an array of JSON objects representing the scraped or crawled data. For the Scrape operation, the output JSON includes fields corresponding to the requested formats, such as:
- Markdown content of the page.
- HTML or raw HTML content.
- Screenshots as image data (likely base64 encoded).
- Extracted links from the page.
If multiple items are returned (e.g., from Map or Crawl operations), each item is output as a separate JSON object in the array.
Binary data output (such as screenshots) is represented within the JSON response, typically as encoded strings.
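If a screenshot does arrive as an encoded string, a downstream step could turn it back into image bytes along these lines. This is a sketch under two assumptions that this document does not confirm: that the field is named `screenshot`, and that its value is base64 text, possibly wrapped in a data URI.

```python
import base64


def decode_screenshot(value: str) -> bytes:
    """Decode a base64 string, tolerating an optional data-URI prefix."""
    if value.startswith("data:"):
        # "data:image/png;base64,AAAA" -> keep the part after the first comma
        _, _, value = value.partition(",")
    return base64.b64decode(value, validate=True)


# Example item shaped like the assumed Scrape output (field name is hypothetical).
item = {
    "screenshot": "data:image/png;base64,"
    + base64.b64encode(b"\x89PNG\r\n").decode("ascii")
}
image_bytes = decode_screenshot(item["screenshot"])
```

If the API returns a URL to the image instead of inline data, this helper does not apply and the image would need to be downloaded separately.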
Dependencies
- Requires access to the Horizon Data Wave API service at https://api.horizondatawave.ai/api/website by default, or a custom API base URL if provided.
- Requires an API authentication token credential configured in n8n (referred to generically as an API key credential).
- Uses HTTP POST requests with JSON payloads to interact with the API endpoints /scrape, /map, and /crawl.
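The request shape described above can be sketched with the Python standard library. The base URL and endpoint paths come from this document; the `Authorization: Bearer` header is an assumption, since the document does not say how the token is transmitted, and `build_request` is a hypothetical helper name.

```python
import json
import urllib.request

DEFAULT_BASE_URL = "https://api.horizondatawave.ai/api/website"
ENDPOINTS = {"scrape": "/scrape", "map": "/map", "crawl": "/crawl"}


def build_request(operation, payload, api_key, base_url=DEFAULT_BASE_URL):
    """Build (but do not send) a JSON POST request for one operation."""
    if operation not in ENDPOINTS:
        raise ValueError(f"Unsupported operation: {operation!r}")
    url = base_url.rstrip("/") + ENDPOINTS[operation]
    headers = {
        "Content-Type": "application/json",
        # Assumption: bearer-style token; the real header name may differ.
        "Authorization": f"Bearer {api_key}",
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_request("scrape", {"url": "https://example.com"}, api_key="YOUR_TOKEN")
```

The request is only constructed here, not sent; passing it to `urllib.request.urlopen` would perform the call. Supplying `base_url` overrides the default endpoint, which corresponds to the node's Base URL property.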
Troubleshooting
- Timeouts: If the page takes longer than the specified timeout to load, the request may fail or return incomplete data. Increase the "Timeout (ms)" property if necessary.
- TLS Errors: If the target site has an invalid, expired, or self-signed certificate, enable "Skip TLS Verification" to bypass certificate checks.
- Authentication Failures: Ensure the API key credential is correctly configured and valid.
- Empty or Unexpected Output: Verify the URL is correct and accessible. Check that the requested formats are supported by the API.
- Base64 Image Removal: Enabling "Remove Base64 Images" will strip embedded images; disable it if images are required.
Error messages returned from the API are passed through; common errors relate to invalid URLs, authentication issues, or exceeding rate limits.
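One way to act on passed-through errors in a workflow is to map the HTTP status code to the matching troubleshooting step. The sketch below relies on standard HTTP status semantics only; the status codes this API actually returns are not documented here, so the mapping is an assumption and `troubleshooting_hint` is a hypothetical helper.

```python
def troubleshooting_hint(status: int) -> str:
    """Map an HTTP status code to a troubleshooting suggestion."""
    if status in (401, 403):
        return "Authentication: check that the API key credential is valid."
    if status == 429:
        return "Rate limit: slow down or retry after a pause."
    if status in (400, 422):
        return "Invalid request: verify the URL and the requested formats."
    if status in (408, 504):
        return "Timeout: consider increasing the Timeout (ms) property."
    return "Unexpected error: inspect the message returned by the API."


hint = troubleshooting_hint(429)
```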
Links and References
- Horizon Data Wave API Documentation (assumed public API docs)
- n8n HTTP Request Node documentation for understanding HTTP interactions: https://docs.n8n.io/nodes/n8n-nodes-base.httpRequest/
