HDW Web Parser (beta)

Parse and crawl websites using Horizon Data Wave API

Overview

The HDW Web Parser node enables scraping, mapping, and crawling of web pages using the Horizon Data Wave API. It is designed to extract structured content from websites, discover URLs starting from a given page, or perform multi-page crawls.

For the Scrape operation (the focus here), the node fetches content from a specified URL and extracts it in various formats such as Markdown, HTML, raw HTML, screenshots, and links. It can filter to only the main content of the page, exclude base64 images, and simulate mobile viewport rendering. This node is useful for automating data extraction from web pages for content analysis, archiving, or integration into workflows without manual copying.

Practical examples:

  • Extracting article text from news sites in Markdown format.
  • Capturing full-page screenshots of product pages for visual records.
  • Collecting all hyperlinks on a page for link analysis or further crawling.
  • Scraping content with mobile layout rendering to see how pages appear on phones.

Properties

  • Base URL: Custom API base URL that overrides the default Horizon Data Wave API endpoint.
  • URL: The webpage URL to scrape.
  • Formats: Content formats to extract. Options: Markdown, HTML, Raw HTML, Screenshot, Links, Full Page Screenshot.
  • Only Main Content: Whether to extract only the main content, filtering out navigation bars, footers, etc.
  • Mobile: Use a mobile viewport to render the page before scraping.
  • Skip TLS Verification: Skip verification of TLS certificates (useful for self-signed or invalid certificates).
  • Timeout (Ms): Maximum time in milliseconds to wait for the page to load before scraping.
  • Remove Base64 Images: Remove base64-encoded images from the output to reduce size or avoid embedded image data.
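The properties above map onto fields in the request body sent to the API. As a rough sketch of how such a payload might be assembled (the field names are assumptions for illustration, not documented API parameters):

```python
# Hypothetical sketch of a request body built from the node's properties.
# The field names below are assumptions, not the documented API schema.

def build_scrape_payload(url, formats, only_main_content=True,
                         mobile=False, skip_tls_verification=False,
                         timeout_ms=30000, remove_base64_images=True):
    """Assemble a JSON-serializable payload from the node's properties."""
    return {
        "url": url,
        "formats": formats,                        # e.g. ["markdown", "links"]
        "onlyMainContent": only_main_content,
        "mobile": mobile,
        "skipTlsVerification": skip_tls_verification,
        "timeout": timeout_ms,                     # milliseconds
        "removeBase64Images": remove_base64_images,
    }

payload = build_scrape_payload("https://example.com", ["markdown", "links"])
```

Leaving the boolean options at their defaults keeps the payload small; only the URL and the desired formats vary between typical scrape calls.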

Output

The node outputs JSON objects containing the scraped data according to the requested formats. The structure depends on the selected formats but generally includes:

  • Extracted textual content in Markdown or HTML.
  • Raw HTML source if requested.
  • URLs found on the page if "Links" format is selected.
  • Screenshot data (likely as binary or base64-encoded images) when screenshot options are chosen.

If multiple items are returned (e.g., multiple links), the output will be an array of JSON objects, each representing one item.

Binary data (screenshots) is included in the output but not detailed here; it represents image captures of the rendered webpage.
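The fan-out behavior for multi-valued results can be pictured as follows. This is a sketch only; the response field names ("markdown", "links") are assumptions about the API's shape:

```python
# Hypothetical response shape; the field names are assumptions.
response = {
    "markdown": "# Example Domain\n...",
    "links": ["https://example.com/a", "https://example.com/b"],
}

def to_items(response):
    """Fan a multi-valued result (e.g. the Links format) out into one
    JSON object per item, mirroring how the node emits its output."""
    links = response.get("links") or []
    if links:
        return [{"url": link} for link in links]
    return [response]

items = to_items(response)
# Two links in -> two output items, each a JSON object with one URL.
```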

Dependencies

  • Requires an API key credential for authenticating with the Horizon Data Wave API.
  • The node makes HTTP POST requests to the API endpoint (default https://api.horizondatawave.ai/api/website).
  • No additional environment variables are required beyond the API authentication setup.
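The authenticated POST described above could be constructed as in the sketch below. The authentication header name and body fields are assumptions for illustration; only the default endpoint URL comes from this document:

```python
import json
import urllib.request

# Default endpoint from the node's documentation.
API_URL = "https://api.horizondatawave.ai/api/website"

def build_request(api_key, payload):
    """Build (but do not send) the authenticated POST request.
    The 'access-token' header name is an assumed placeholder."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "access-token": api_key,  # assumed header name
        },
        method="POST",
    )

req = build_request("YOUR_API_KEY", {"url": "https://example.com"})
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) would return the JSON body described in the Output section above.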

Troubleshooting

  • Common issues:

    • Invalid or missing API credentials will cause authentication failures.
    • Network errors or incorrect URLs may result in request failures.
    • TLS certificate errors if the target site uses invalid SSL certificates (can be bypassed by enabling "Skip TLS Verification").
    • Timeouts if the page takes longer than the configured timeout to load.
  • Error messages:

    • API errors include HTTP status codes and error messages returned by the Horizon Data Wave API.
    • Detailed error information may be included in response headers or body, such as request IDs and execution times.
    • If "Continue On Fail" is enabled, errors are returned as JSON objects with error details instead of stopping execution.
  • Resolutions:

    • Verify API key validity and permissions.
    • Check the URL correctness and accessibility.
    • Increase timeout value for slow-loading pages.
    • Enable "Skip TLS Verification" cautiously if encountering SSL errors.
    • Review error details in output to diagnose specific API or network issues.
