
HDW Web Parser (beta)

Parse and crawl websites using the Horizon Data Wave API

Overview

This node integrates with the Horizon Data Wave (HDW) Web Parser API to perform web data extraction and crawling tasks. It supports three main operations:

  • Scrape: Extract content from a single webpage in various formats such as Markdown, HTML, raw HTML, screenshots, and links. It can focus on the main content area and supports mobile viewport emulation.
  • Map: Discover URLs starting from a given URL, optionally filtering by search terms, including or excluding subdomains, and controlling sitemap usage.
  • Crawl: Perform a crawl starting from a URL to traverse multiple pages within a time limit.

This node is useful for scenarios like gathering structured content from websites, generating site maps, or performing automated website analysis and data collection.
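For orientation, the sketch below shows the general request pattern such a node could follow when talking to the HDW Web Parser API. The HTTP method, the operation-specific paths (/scrape, /map, /crawl), and the Authorization header are illustrative assumptions, not confirmed details of the HDW API.

```typescript
// Minimal sketch of the call pattern, not the node's actual implementation.
// Endpoint paths, HTTP method, and the auth header name are assumptions.
const BASE_URL = "https://api.horizondatawave.ai/api/website";

type Operation = "scrape" | "map" | "crawl";

async function callWebParser(
  operation: Operation,
  body: Record<string, unknown>,
  token: string,
): Promise<unknown> {
  const response = await fetch(`${BASE_URL}/${operation}`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Hypothetical header; check the HDW credential setup for the real one.
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify(body),
  });
  if (!response.ok) {
    throw new Error(`HDW Web Parser request failed: ${response.status}`);
  }
  return response.json();
}
```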

Practical Examples

  • Scraping product descriptions and images from an e-commerce page.
  • Mapping all accessible URLs on a company website for SEO auditing.
  • Crawling news articles starting from a homepage to collect recent posts.

Properties

  • Base URL: Custom API base URL for the HDW Web Parser API; leave empty to use the default endpoint.

Note: The node also supports additional properties depending on the selected operation:

For "Scrape" Operation:

  • URL: The webpage URL to scrape.
  • Formats: Content formats to extract: Markdown, HTML, Raw HTML, Screenshot, Links, Full Page Screenshot.
  • Only Main Content: Whether to extract only the main content, filtering out navigation, footers, etc.
  • Mobile: Use a mobile viewport for scraping.
  • Skip TLS Verification: Skip verification of TLS certificates (useful for self-signed certificates).
  • Timeout (ms): Maximum wait time in milliseconds for the page to load.
  • Remove Base64 Images: Remove base64-encoded images from the output.
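A Scrape request body might look like the following sketch; the JSON key names are assumptions derived from the property labels above and may not match the API's actual schema.

```typescript
// Illustrative scrape request body; key names mirror the property labels
// above and are not confirmed API field names.
const scrapeBody = {
  url: "https://example.com/product/123",
  formats: ["markdown", "links"],
  onlyMainContent: true,
  mobile: false,
  skipTlsVerification: false,
  timeout: 30000, // milliseconds to wait for the page to load
  removeBase64Images: true,
};
```

With the earlier sketch, this could be sent as callWebParser("scrape", scrapeBody, token).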

For "Map" Operation:

  • URL: Starting URL for URL discovery.
  • Search Term: Optional term to filter discovered URLs.
  • Ignore Sitemap: Skip sitemap.xml discovery; use only HTML links.
  • Sitemap Only: Use only sitemap.xml for discovery; ignore HTML links.
  • Include Subdomains: Include URLs from subdomains in results.
  • Limit: Maximum number of URLs to return.
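A hypothetical Map request body, with key names again inferred from the property labels rather than taken from the API schema:

```typescript
// Illustrative map request body; key names are assumptions.
const mapBody = {
  url: "https://example.com",
  search: "pricing",       // optional filter term
  ignoreSitemap: false,    // discover via sitemap.xml and HTML links
  sitemapOnly: false,
  includeSubdomains: true,
  limit: 500,              // cap on returned URLs
};
```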

For "Crawl" Operation:

  • URL: Starting URL for the crawl.
  • Timeout (seconds): Maximum duration in seconds for the crawl operation.
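And a minimal Crawl request body under the same assumptions about key names:

```typescript
// Illustrative crawl request body; key names are assumptions.
const crawlBody = {
  url: "https://example.com/blog",
  timeout: 120, // seconds before the crawl stops
};
```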

Output

The node outputs JSON data representing the response from the HDW Web Parser API:

  • For Scrape, the output includes extracted content in the requested formats (e.g., Markdown text, HTML snippets, screenshots as URLs or binary references, lists of links).
  • For Map, the output is an array of discovered URLs matching the criteria.
  • For Crawl, the output contains crawl results such as visited URLs and possibly extracted data per page.

If the API returns an array, each element is output as a separate item. Otherwise, the entire response is output as a single JSON object.

Binary data such as screenshots may be included as URLs or references but is not directly embedded in the output.
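The array-versus-object behavior described above can be pictured with a small sketch (TypeScript, n8n-style items with a json wrapper; an illustration, not the node's actual source):

```typescript
// Fan a response out into items: an array becomes one item per element,
// anything else becomes a single item.
function toItems(response: unknown): Array<{ json: Record<string, unknown> }> {
  if (Array.isArray(response)) {
    return response.map((entry) => ({ json: entry as Record<string, unknown> }));
  }
  return [{ json: response as Record<string, unknown> }];
}
```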

Dependencies

  • Requires an API authentication token credential for the Horizon Data Wave API.
  • Uses the HDW Web Parser API endpoint, defaulting to https://api.horizondatawave.ai/api/website unless overridden by the "Base URL" property.
  • Network access to the API endpoint must be available.
  • No other external dependencies.
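The fallback between the "Base URL" property and the default endpoint can be expressed as a short sketch (illustrative only):

```typescript
// An empty "Base URL" property falls back to the documented default endpoint.
const DEFAULT_BASE_URL = "https://api.horizondatawave.ai/api/website";

function resolveBaseUrl(baseUrlProperty: string): string {
  const trimmed = baseUrlProperty.trim();
  return trimmed.length > 0 ? trimmed : DEFAULT_BASE_URL;
}
```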

Troubleshooting

  • Common issues:

    • Invalid or missing API credentials will cause authentication failures.
    • Incorrect URLs or unreachable targets may result in HTTP errors or timeouts.
    • Setting very low timeout values might cause premature termination of requests.
    • Skipping TLS verification can expose security risks; use only if necessary.
  • Error messages:

    • Errors returned from the API are passed through in the output JSON under an error field if "Continue On Fail" is enabled.
    • Network or HTTP errors will throw exceptions unless handled by the node's error handling settings.
  • Resolutions:

    • Verify API credentials and permissions.
    • Ensure target URLs are correct and accessible.
    • Adjust timeout settings according to network conditions.
    • Enable "Continue On Fail" to handle partial failures gracefully.
