Overview
This node integrates with the Horizon Data Wave API to parse and crawl websites. It supports three main operations:
- Scrape: Extract content from a single webpage in various formats such as Markdown, HTML, raw HTML, screenshots, and links.
- Map: Discover URLs starting from a given URL by analyzing sitemap.xml and/or HTML links on the page, optionally filtering results by search terms or subdomains.
- Crawl: Perform a multi-page crawl starting from a URL, gathering data across multiple linked pages.
The Map operation is useful for scenarios where you want to gather a list of URLs related to a website, for example to build a sitemap, analyze site structure, or prepare a batch scraping job. You can filter discovered URLs by keywords, limit the number of results, and control whether to include subdomains or rely solely on sitemap.xml.
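The filtering itself happens server-side, but a minimal sketch of the equivalent logic helps clarify how the options interact. The function below is a hypothetical illustration, not the node's actual implementation, and assumes simple substring matching for the search term:

```typescript
// Hypothetical sketch of the Map operation's filtering rules.
// The real filtering is performed server-side by the Horizon Data Wave API.
interface MapFilterOptions {
  startUrl: string;          // the "URL" property
  searchTerm?: string;       // the "Search Term" property
  includeSubdomains: boolean;
  limit: number;             // minimum 1
}

function filterMappedUrls(urls: string[], opts: MapFilterOptions): string[] {
  const baseHost = new URL(opts.startUrl).hostname;
  return urls
    .filter((u) => {
      const host = new URL(u).hostname;
      const sameHost = host === baseHost;
      const isSubdomain = host.endsWith("." + baseHost);
      return sameHost || (opts.includeSubdomains && isSubdomain);
    })
    .filter((u) => !opts.searchTerm || u.includes(opts.searchTerm))
    .slice(0, Math.max(1, opts.limit));
}
```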
Practical Examples
- Discover all product pages on an e-commerce site by starting from the homepage URL and filtering URLs containing "product".
- Generate a list of blog post URLs by crawling only the sitemap.xml of a news website.
- Collect URLs from a corporate website including its subdomains for comprehensive site analysis.
Properties
| Name | Meaning |
|---|---|
| Base URL | Custom API base URL to override the default Horizon Data Wave API endpoint. |
| URL | Starting URL for URL discovery (required). |
| Search Term | Optional keyword to filter discovered URLs; only URLs containing this term will be returned. |
| Ignore Sitemap | If true, skip sitemap.xml discovery and use only HTML links found on pages. |
| Sitemap Only | If true, use only sitemap.xml for discovery and ignore HTML links. |
| Include Subdomains | If true, include URLs from subdomains of the starting URL in the results. |
| Limit | Maximum number of URLs to return (minimum 1). |
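The properties above map onto the body of the POST request the node sends. The sketch below shows one plausible assembly; the request field names (`searchTerm`, `ignoreSitemap`, and so on) and the default endpoint are assumptions, not confirmed by the Horizon Data Wave API documentation:

```typescript
// Sketch of assembling a POST /map request from the node's properties.
// Field names and the placeholder default base URL are assumptions.
interface MapNodeProperties {
  baseUrl?: string;
  url: string;
  searchTerm?: string;
  ignoreSitemap?: boolean;
  sitemapOnly?: boolean;
  includeSubdomains?: boolean;
  limit?: number;
}

function buildMapRequest(props: MapNodeProperties) {
  // Placeholder default; the node normally supplies the real API endpoint.
  const base = props.baseUrl ?? "https://api.horizondatawave.example";
  const endpoint = base.replace(/\/$/, "") + "/map";
  const body: Record<string, unknown> = { url: props.url };
  if (props.searchTerm) body.searchTerm = props.searchTerm;
  if (props.ignoreSitemap !== undefined) body.ignoreSitemap = props.ignoreSitemap;
  if (props.sitemapOnly !== undefined) body.sitemapOnly = props.sitemapOnly;
  if (props.includeSubdomains !== undefined) body.includeSubdomains = props.includeSubdomains;
  if (props.limit !== undefined) body.limit = Math.max(1, props.limit); // enforce minimum of 1
  return { endpoint, method: "POST" as const, body };
}
```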
Output
The output is an array of JSON objects, each representing a discovered URL or related metadata returned by the API. The exact structure depends on the API response but generally includes URL strings and possibly additional information about each link.
No binary data output is produced by the Map operation.
Example output snippet:

```json
[
  {
    "url": "https://example.com/page1",
    "lastModified": "2024-01-01T12:00:00Z"
  },
  {
    "url": "https://blog.example.com/post1",
    "lastModified": "2024-01-02T08:30:00Z"
  }
]
```
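Downstream nodes typically consume this array directly, but when you only need the URL strings, a small helper can flatten the objects. The sketch below assumes the `url`/`lastModified` shape shown in the example; the real response may carry additional metadata per link:

```typescript
// Extract URL strings from the Map output, newest first.
// Assumes the url/lastModified shape shown in the example above.
interface MappedLink {
  url: string;
  lastModified?: string; // ISO 8601 timestamp
}

function extractUrls(links: MappedLink[]): string[] {
  return [...links]
    // ISO 8601 timestamps sort correctly as strings; missing ones sort last.
    .sort((a, b) => (b.lastModified ?? "").localeCompare(a.lastModified ?? ""))
    .map((l) => l.url);
}
```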
Dependencies
- Requires an API key credential for the Horizon Data Wave API.
- The node makes authenticated HTTP POST requests to the API endpoint `/map` (or custom base URL + `/map`).
- No other external dependencies are needed.
- Ensure the API key credential is configured properly in n8n.
Troubleshooting
Common issues:
- Invalid or missing API key credential will cause authentication errors.
- Providing an invalid or unreachable starting URL may result in HTTP errors or empty results.
- Setting conflicting options, such as enabling both `Ignore Sitemap` and `Sitemap Only`, might lead to unexpected behavior.
- Setting the `Limit` parameter too high could cause performance delays or API rate limiting.
Error messages:
- Errors from the API are captured and returned with details such as HTTP status, API error message, request ID, execution time, and token usage.
- Typical errors include authentication failures, invalid parameters, or server-side issues.
- To resolve, verify credentials, check URL validity, and adjust parameters accordingly.
- Use the "Continue On Fail" option in n8n to handle errors gracefully without stopping workflow execution.
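When logging or surfacing these errors, it can help to collapse the returned details into a single readable message. The sketch below is illustrative only; the field names (`status`, `requestId`, and so on) follow the details listed above but are assumptions about the exact error payload shape:

```typescript
// Sketch of turning the API's error details into one readable message.
// Field names are assumptions about the exact response shape.
interface ApiErrorDetails {
  status: number;
  message: string;
  requestId?: string;
  executionTimeMs?: number;
  tokensUsed?: number;
}

function formatApiError(err: ApiErrorDetails): string {
  const parts = [`HTTP ${err.status}: ${err.message}`];
  if (err.requestId) parts.push(`request ${err.requestId}`);
  if (err.executionTimeMs !== undefined) parts.push(`${err.executionTimeMs} ms`);
  if (err.tokensUsed !== undefined) parts.push(`${err.tokensUsed} tokens`);
  return parts.join(" | ");
}
```

With "Continue On Fail" enabled, a message like this can be attached to the item's error output instead of aborting the workflow.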
Links and References
- Horizon Data Wave API Documentation (hypothetical link)
- n8n documentation on HTTP Request Node
- General web crawling and scraping best practices articles and tutorials
