Overview
The node enables automated interaction with websites using the Hyperbrowser service. It supports multiple operations including crawling websites, scraping content, extracting specific data using AI, and controlling browser actions via different agents.
For the Crawl operation specifically, the node visits a starting URL and follows links to crawl multiple pages up to a specified limit. It collects content from these pages, optionally focusing only on the main content, and returns the aggregated results in a chosen format such as Markdown, HTML, or links.
This node is useful for scenarios like:
- Gathering content from multiple pages of a website for analysis or archiving.
- Automatically exploring site structure by following links.
- Collecting data for SEO audits or competitive research.
- Preparing datasets for machine learning or content summarization.
Example: Crawl a blog homepage and retrieve the main article content from up to 10 linked pages, outputting the results in Markdown format.
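The core crawl behavior — start at one URL, follow discovered links, and stop at a page limit — can be sketched as a breadth-first traversal. This is a conceptual illustration only, not the Hyperbrowser implementation; here an in-memory dict stands in for the network:

```python
from collections import deque
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects href values from anchor tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(pages, start_url, max_pages=10):
    """Breadth-first crawl: visit start_url, follow links, stop at max_pages.

    `pages` stands in for the network: a dict mapping URL -> HTML body.
    Returns (url, html) pairs in visit order.
    """
    queue = deque([start_url])
    visited = set()
    results = []
    while queue and len(results) < max_pages:
        url = queue.popleft()
        if url in visited or url not in pages:
            continue
        visited.add(url)
        html = pages[url]
        results.append((url, html))
        parser = LinkParser()
        parser.feed(html)
        queue.extend(parser.links)
    return results

site = {
    "/": '<a href="/a">A</a><a href="/b">B</a>',
    "/a": '<a href="/">home</a>',
    "/b": '<a href="/c">C</a>',
    "/c": "leaf page",
}
print([url for url, _ in crawl(site, "/", max_pages=3)])  # ['/', '/a', '/b']
```

Note how "Maximum Pages" acts as a hard cap on visits rather than on links discovered: queued links beyond the limit are simply never fetched.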
Properties
| Name | Meaning |
|---|---|
| URL | The starting webpage URL to begin crawling from. |
| Maximum Pages | Maximum number of pages to crawl (default 10). |
| Only Main Content | Whether to return only the main content of each page (true/false, default true). |
| Output Format | Format of the output content: HTML, Links, or Markdown (default Markdown). |
| Use Proxy | Whether to use a proxy server during crawling (true/false, default false). |
| Proxy Country | If using a proxy, specify the country for the proxy server (e.g., "US"). |
| Solve CAPTCHAs | Whether to attempt solving CAPTCHAs encountered during crawling (true/false, default false). |
| Timeout (Ms) | Maximum time in milliseconds to wait when navigating to each page (default 15000 ms). |
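As an illustration, a configuration for the blog example above might look like the following. The key names here follow the property table but are hypothetical; the exact parameter names in the node UI and API may differ:

```json
{
  "url": "https://example.com/blog",
  "maxPages": 10,
  "onlyMainContent": true,
  "outputFormat": "markdown",
  "useProxy": false,
  "solveCaptchas": false,
  "timeoutMs": 15000
}
```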
Output
The output JSON object for the Crawl operation includes:
- url: The initial URL provided for crawling.
- data: The collected crawl data, containing the content of the crawled pages formatted according to the selected output format (Markdown by default).
- status: Status information about the crawl operation (e.g., success or error status).
The node does not output binary data for this operation.
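An illustrative output object might look like the following. The nested field names are assumptions for the sake of example; the actual shape of `data` depends on the Hyperbrowser service response and the chosen output format:

```json
{
  "url": "https://example.com/blog",
  "status": "completed",
  "data": [
    {
      "url": "https://example.com/blog/post-1",
      "markdown": "# Post 1\n\nMain article content..."
    }
  ]
}
```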
Dependencies
- Requires an API key credential for the Hyperbrowser service to authenticate requests.
- Relies on the external Hyperbrowser SDK to perform crawling and other web interactions.
- Optional proxy configuration requires access to proxy servers, potentially with geographic selection.
Troubleshooting
- Timeouts: Crawling may fail if pages take too long to load. Increase the "Timeout (Ms)" property if needed.
- CAPTCHA Challenges: If the target site uses CAPTCHAs, enabling "Solve CAPTCHAs" can help but may not always succeed.
- Proxy Issues: Using proxies incorrectly or specifying invalid countries may cause connection failures.
- Unsupported URLs: Some sites may block automated crawling or require authentication, leading to errors.
- API Key Errors: Ensure the Hyperbrowser API key is valid and has sufficient permissions.
Error messages include details about the failure reason. Enabling "Continue On Fail" lets the workflow keep processing remaining items even when individual items error.