My Browserless

Interact with a Browserless instance for web scraping

Overview

This node interacts with a Browserless server to perform web scraping by loading a specified webpage and returning its HTML content. It is useful for automating the extraction of webpage data without needing to manage a browser instance directly. Typical use cases include scraping product details, news articles, or any dynamic content that requires waiting for certain elements to load before capturing the page.

For example, you can configure this node to scrape the HTML of a product page after ensuring the main heading (<h1>) has loaded, enabling reliable extraction of page content even if it loads asynchronously.

Properties

Name                     Meaning
Browserless Server URL   The base URL of your Browserless server instance (e.g., http://localhost:3000).
API Token                Your authentication token to authorize requests to the Browserless server.
Target URL               The full URL of the webpage you want to scrape.
Wait For Selector        A CSS selector to wait for on the page before returning the content (default is h1).

Output

The node outputs an array of items where each item contains a json object with a single field:

  • html: A string containing the full HTML content of the target webpage after the specified element has loaded.

No binary data output is produced by this node.
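
As a rough illustration of the output shape described above, the sketch below models an item as an object with a json.html string and shows how a downstream step might read it. The names ScrapeItem, toItems, and extractTitle are hypothetical and are not part of the node; the title extraction uses a simple regular expression for brevity, not a full HTML parser.

```typescript
// Hypothetical type mirroring the documented output: one json object per item,
// holding a single "html" string field.
interface ScrapeItem {
  json: { html: string };
}

// Wrap raw HTML strings into the documented item shape.
function toItems(pages: string[]): ScrapeItem[] {
  return pages.map((html) => ({ json: { html } }));
}

// Illustrative downstream step: pull the <title> text out of an item's HTML.
// Returns null when no <title> element is found.
function extractTitle(item: ScrapeItem): string | null {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(item.json.html);
  return match?.[1]?.trim() ?? null;
}
```

A later step that needs structured data would normally hand json.html to a proper HTML parser; the regular expression here only keeps the sketch free of dependencies.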

Dependencies

  • Requires access to a running Browserless server instance.
  • Requires a valid API token for authenticating requests to the Browserless server.
  • Uses HTTP POST requests to the /content endpoint of the Browserless server.
  • The node expects the Browserless server to support options such as waitFor (CSS selector) and Puppeteer's gotoOptions.
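
The request implied by the points above might look roughly like the following sketch. It is not the node's actual source. Several details are assumptions: passing the API token as a token query parameter, the waitUntil value inside gotoOptions, and the names BrowserlessParams, buildContentRequest, and fetchHtml. The send argument stands in for any fetch-compatible function, so the sketch does not depend on a particular HTTP client.

```typescript
// Hypothetical container for the four node properties.
interface BrowserlessParams {
  serverUrl: string;       // Browserless Server URL
  apiToken: string;        // API Token
  targetUrl: string;       // Target URL
  waitForSelector: string; // Wait For Selector
}

interface RequestInitLike {
  method: string;
  headers: Record<string, string>;
  body: string;
}

interface ResponseLike {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}

type Send = (url: string, init: RequestInitLike) => Promise<ResponseLike>;

// Build the POST request for the /content endpoint without sending it.
function buildContentRequest(p: BrowserlessParams): { endpoint: string; init: RequestInitLike } {
  const base = p.serverUrl.replace(/\/+$/, ''); // drop trailing slashes
  // Assumption: the token is accepted as a "token" query parameter.
  const endpoint = `${base}/content?token=${encodeURIComponent(p.apiToken)}`;
  const body = {
    url: p.targetUrl,
    waitFor: p.waitForSelector || 'h1',         // documented default selector
    gotoOptions: { waitUntil: 'networkidle2' }, // illustrative Puppeteer option
  };
  return {
    endpoint,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    },
  };
}

// Send the request and return the page HTML, or throw on a non-2xx status.
async function fetchHtml(p: BrowserlessParams, send: Send): Promise<string> {
  const { endpoint, init } = buildContentRequest(p);
  const res = await send(endpoint, init);
  if (!res.ok) {
    throw new Error(`Browserless responded with HTTP ${res.status}`);
  }
  return res.text();
}
```

On a runtime with a global fetch (for example Node 18 or later), fetch can be passed as send. Keeping the request builder separate from the network call makes the payload easy to inspect when a request fails.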

Troubleshooting

  • Common issues:

    • Incorrect or unreachable Browserless Server URL will cause connection failures.
    • Invalid or missing API token will result in authorization errors.
    • If the waitFor selector does not exist on the page, the request may time out or return incomplete content.
    • Network issues or server downtime can cause request failures.
  • Error messages:

    • Authorization errors typically indicate invalid API tokens; verify and update the token.
    • Connection refused or timeout errors suggest the Browserless server URL is incorrect or the server is down.
    • Unexpected empty or partial HTML might mean the waitFor selector is incorrect or the page did not fully load.
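
The checks above can be sketched as a small helper that maps a failed or suspicious result to one of these hints. This is an illustration only: the Failure shape and the diagnose function are hypothetical, the use of HTTP 401 and 403 for authorization errors is an assumption about how the server responds, and the error codes are the standard Node.js network error codes.

```typescript
// Hypothetical description of what a caller observed:
// an HTTP status, a Node.js network error code, or the returned HTML.
interface Failure {
  status?: number;
  code?: string;
  html?: string;
}

// Translate an observation into the matching troubleshooting hint.
function diagnose(f: Failure): string {
  // Assumption: authorization failures surface as HTTP 401 or 403.
  if (f.status === 401 || f.status === 403) {
    return 'Authorization error: verify and update the API token.';
  }
  // Standard Node.js codes for refused, unresolved, or timed-out connections.
  if (f.code === 'ECONNREFUSED' || f.code === 'ENOTFOUND' || f.code === 'ETIMEDOUT') {
    return 'Connection problem: check the Browserless Server URL and that the server is running.';
  }
  // No error was reported, but the HTML is missing or blank.
  if (f.status === undefined && f.code === undefined && (f.html ?? '').trim() === '') {
    return 'Empty HTML: check the Wait For Selector and whether the page fully loaded.';
  }
  return 'Unrecognized failure: inspect the raw response from the Browserless server.';
}
```

For example, diagnose({ code: 'ECONNREFUSED' }) returns the connection hint, which points at the server URL rather than at the token or the selector.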
