Puppeteer

Automate browser interactions using Puppeteer

Overview

This node uses Puppeteer, a headless browser automation library, to interact with web pages programmatically. Specifically, the "Get Page Content" operation loads a given URL and retrieves the full HTML content of the page along with HTTP response headers and status code.

Common scenarios where this node is beneficial include:

  • Web scraping: Extracting raw HTML content for parsing or data extraction.
  • Monitoring: Checking page content changes over time.
  • Automation workflows that require fetching dynamic page content rendered by JavaScript.
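
For the monitoring scenario, one simple approach is to store a hash of the returned HTML and compare it on the next run. The following is a minimal sketch, not part of the node itself; `fingerprint` and `hasChanged` are hypothetical helper names, and the input is assumed to be the HTML string the node returns.

```typescript
import { createHash } from "node:crypto";

// Hash the HTML so two fetches can be compared without storing full pages.
function fingerprint(html: string): string {
  return createHash("sha256").update(html, "utf8").digest("hex");
}

// True when the newly fetched HTML differs from the stored fingerprint.
// A missing fingerprint (first run) is treated as a change.
function hasChanged(previousFingerprint: string | undefined, html: string): boolean {
  return previousFingerprint !== fingerprint(html);
}
```

Pages that embed timestamps or per-request tokens will report a change on every run, so in practice it may be necessary to hash only the relevant fragment of the page.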

Practical examples:

  • Fetch the HTML content of a product page on an e-commerce site to extract pricing or availability information.
  • Retrieve the content of a news article page to analyze or archive it.

Properties

Name: Meaning
URL: The web address of the page to load and retrieve content from.
Query Parameters: Optional key-value pairs appended as query parameters to the URL before loading the page.
Options: A collection of advanced settings controlling browser behavior and page loading:
  - Batch Size: Maximum number of pages to open simultaneously (affects memory and CPU usage).
  - Browser WebSocket Endpoint: Connect to an existing browser instance via WebSocket instead of launching a new one.
  - Emulate Device: Emulate a specific device's viewport and user agent (e.g., iPhone, iPad).
  - Executable Path: Custom path to the browser executable to use. Ignored if connecting via WebSocket.
  - Extra Headers: Additional HTTP headers to send with the page request.
  - File Name: Filename to assign to binary outputs (not applicable for Get Page Content).
  - Launch Arguments: Extra command line arguments passed to the browser process.
  - Timeout: Maximum navigation time in milliseconds (0 disables timeout).
  - Wait Until: Event to consider navigation complete: load, domcontentloaded, networkidle0, or networkidle2.
  - Page Caching: Enable or disable page-level caching (default enabled).
  - Headless Mode: Run browser without UI (default true).
  - Use Chrome Headless Shell: Run browser in headless shell mode (requires headless mode enabled and chrome-headless-shell in PATH).
  - Stealth Mode: Apply techniques to avoid detection as a headless browser.
  - Proxy Server: Use a custom proxy server for requests (e.g., localhost:8080, socks5://localhost:1080).
  - Add Container Arguments: Automatically add recommended arguments for container environments (--no-sandbox, etc.).
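
To illustrate how a few of these properties could translate into concrete values, here is a minimal sketch. It is an assumption about the general mechanism, not the node's actual implementation: `buildUrl`, `buildLaunchArgs`, and the field names in `LaunchOptionsSketch` are hypothetical. The `--proxy-server` flag is a standard Chromium command line switch; only `--no-sandbox` is added for containers because it is the only container argument named above.

```typescript
// Hypothetical option names; the node's internal field names may differ.
interface LaunchOptionsSketch {
  launchArguments?: string[];
  proxyServer?: string;
  addContainerArgs?: boolean;
}

// Append key-value pairs to the URL's query string.
// `new URL` throws a TypeError for malformed input.
function buildUrl(base: string, query: Record<string, string>): string {
  const url = new URL(base);
  for (const [key, value] of Object.entries(query)) {
    url.searchParams.append(key, value);
  }
  return url.toString();
}

// Assemble the browser command line arguments from the selected options.
function buildLaunchArgs(options: LaunchOptionsSketch): string[] {
  const args = [...(options.launchArguments ?? [])];
  if (options.proxyServer) {
    args.push(`--proxy-server=${options.proxyServer}`);
  }
  // Assumption: only the argument named in the table is added here.
  if (options.addContainerArgs && !args.includes("--no-sandbox")) {
    args.push("--no-sandbox");
  }
  return args;
}
```

For example, `buildUrl("https://example.com/search", { q: "web scraping" })` yields `https://example.com/search?q=web+scraping`, since the query string serializer encodes spaces as `+`.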

Output

The output is an array of items, each corresponding to an input item processed. For the "Get Page Content" operation, each item contains:

  • json:
    • body: The full HTML content of the loaded page as a string.
    • headers: An object containing HTTP response headers received when loading the page.
    • statusCode: The HTTP status code of the page response (e.g., 200).
    • url: The final URL of the loaded page, after any redirects.
  • pairedItem: References the original input item index.

No binary data is produced for this operation.
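
The output fields above map closely onto Puppeteer's own page API. The sketch below shows one way such an item could be assembled. It is an illustration under stated assumptions rather than the node's source: `getPageContent`, `PageLike`, `ResponseLike`, and `PageContentItem` are hypothetical names, the interfaces cover only the Puppeteer methods used (`page.goto`, `page.content`, `page.url`, `response.status`, `response.headers`), and the `{ item: index }` shape of `pairedItem` is assumed.

```typescript
type WaitUntil = "load" | "domcontentloaded" | "networkidle0" | "networkidle2";

// Minimal structural types for the Puppeteer methods this sketch relies on.
interface ResponseLike {
  status(): number;
  headers(): Record<string, string>;
}

interface PageLike {
  goto(
    url: string,
    options?: { timeout?: number; waitUntil?: WaitUntil },
  ): Promise<ResponseLike | null>;
  content(): Promise<string>;
  url(): string;
}

interface PageContentItem {
  json: {
    body: string;
    headers: Record<string, string>;
    statusCode: number;
    url: string;
  };
  pairedItem: { item: number }; // assumed shape
}

async function getPageContent(
  page: PageLike,
  url: string,
  itemIndex: number,
  options: { timeout?: number; waitUntil?: WaitUntil } = {},
): Promise<PageContentItem> {
  // Navigate and wait for the configured lifecycle event.
  const response = await page.goto(url, options);
  if (response === null) {
    // Puppeteer resolves with null in some cases, e.g. about:blank.
    throw new Error(`No response received for ${url}`);
  }
  return {
    json: {
      body: await page.content(), // serialized HTML after rendering
      headers: response.headers(),
      statusCode: response.status(),
      url: page.url(), // reflects redirects
    },
    pairedItem: { item: itemIndex },
  };
}
```

With Puppeteer installed, a page obtained from `browser.newPage()` should be usable as the `page` argument, since it exposes these methods.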

Dependencies

  • Requires Puppeteer and puppeteer-extra libraries for browser automation.
  • Optionally uses the stealth plugin to evade headless detection.
  • If using "Browser WebSocket Endpoint," requires access to an existing browser instance exposing a WebSocket debugging endpoint.
  • No internal credential types are required. Access to pages behind authentication must be handled externally or via custom scripts.
  • Environment variables can influence allowed modules and console output during script execution but do not affect this operation directly.

Troubleshooting

  • Failed to launch/connect to browser: Indicates Puppeteer could not start or connect to a browser instance. Check that the executable path is correct, dependencies are installed, and no conflicting processes block browser launch.
  • Invalid URL: The provided URL is malformed or cannot be parsed. Ensure the URL is valid and properly formatted.
  • Request failed with status code XXX: The page returned a non-200 HTTP status code. This may indicate that the page is unavailable, requires authentication, or is blocked by a firewall or proxy.
  • Timeout errors: Navigation took longer than the specified timeout. Increase the timeout value or check network conditions.
  • Memory/CPU overload: Using a large batch size opens many pages simultaneously, which can exhaust system resources. Reduce batch size to mitigate.
  • Stealth mode issues: Enabling stealth mode may cause unexpected behavior on some sites; disable if problems occur.
  • Proxy configuration: Incorrect proxy server strings can prevent page loading. Verify proxy format and accessibility.
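
The batch size advice above can be pictured with a small sketch of batched processing. This is an assumption about how a batch size limit typically works, not the node's actual code; `chunk` and `processInBatches` are hypothetical helpers, and `worker` stands in for whatever loads a single page.

```typescript
// Split the input into consecutive groups of at most `batchSize` items.
function chunk<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Run `worker` on every item, with at most `batchSize` calls in flight.
// Each batch must finish before the next one starts.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  let offset = 0;
  for (const batch of chunk(items, batchSize)) {
    const start = offset; // index of the batch's first item in `items`
    const batchResults = await Promise.all(
      batch.map((item, i) => worker(item, start + i)),
    );
    results.push(...batchResults);
    offset += batch.length;
  }
  return results;
}
```

A smaller batch size lowers peak memory and CPU use at the cost of total run time, because fewer pages are open at once.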
