Puppeteer

Automate browser interactions using Puppeteer

Overview

This node uses Puppeteer, a browser automation library, to interact with web pages programmatically. The "Get Page Content" operation fetches the full HTML content of a specified URL, optionally applying query parameters and custom headers. It supports advanced browser options such as device emulation, stealth mode to reduce detection, proxy usage, and concurrent batch processing of multiple pages.
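
To illustrate how query parameters might be applied to the URL before navigation, here is a minimal sketch. The helper name `buildUrl` is hypothetical, and appending (rather than overwriting) existing parameters is an assumption, since the node's actual merging behavior is not described here.

```typescript
// Hypothetical helper: merges key-value pairs into a URL's query string.
// Assumption: new parameters are appended after any that the URL already has.
function buildUrl(base: string, params: Record<string, string>): string {
  const url = new URL(base); // throws a TypeError if `base` cannot be parsed
  for (const key of Object.keys(params)) {
    url.searchParams.append(key, params[key]);
  }
  return url.toString();
}

const target = buildUrl("https://example.com/products", { page: "2", sort: "price" });
console.log(target); // https://example.com/products?page=2&sort=price
```

Using the standard `URL` and `URLSearchParams` classes means values are percent-encoded automatically, so they do not need to be escaped by hand.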

Common scenarios include:

  • Scraping or extracting raw HTML content from websites for data analysis.
  • Testing or monitoring web page content changes.
  • Automating workflows that require fetching dynamic page content rendered by JavaScript.

Practical examples:

  • Fetching the HTML content of a product page on an e-commerce site to extract pricing or availability information.
  • Retrieving the content of multiple URLs in parallel while emulating a mobile device to test responsive layouts.
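
The parallel retrieval mentioned above can be pictured as processing URLs in fixed-size groups. The sketch below is one possible approach, not the node's confirmed implementation: `chunk` and `runInBatches` are hypothetical helpers, and `task` stands in for whatever opens a page and reads its content.

```typescript
// Hypothetical helper: splits a list into consecutive groups of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new RangeError("size must be a positive integer");
  }
  const groups: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// Runs `task` for every item, with at most `batchSize` tasks in flight at once.
// Each group must finish before the next one starts.
async function runInBatches<T, R>(
  items: T[],
  batchSize: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (const group of chunk(items, batchSize)) {
    const groupResults = await Promise.all(group.map((item) => task(item)));
    results.push(...groupResults);
  }
  return results;
}
```

With five URLs and a batch size of 2, this opens pages in groups of 2, 2, and 1, which caps how many pages are open at the same time.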

Properties

| Name | Meaning |
| --- | --- |
| URL | The web address of the page to retrieve content from. |
| Query Parameters | Key-value pairs appended to the URL as query string parameters. |
| Batch Size | Maximum number of pages to open simultaneously. Higher values increase resource usage (CPU, memory). |
| Browser WebSocket Endpoint | WebSocket URL used to connect to an existing browser instance instead of launching a new one. |
| Emulate Device | A predefined device profile to emulate (e.g., screen size, user agent). |
| Executable Path | File system path to a specific browser executable. Ignored when connecting via a WebSocket endpoint. |
| Extra Headers | Custom HTTP headers to send with the request. |
| File Name | Filename to assign to binary outputs (only relevant for screenshot or PDF operations, not for Get Page Content). |
| Launch Arguments | Additional command-line arguments passed to the browser process. |
| Timeout | Maximum navigation time in milliseconds before aborting. Set to 0 to disable the timeout. |
| Wait Until | Event to wait for before considering navigation complete: load, domcontentloaded, networkidle0, or networkidle2. |
| Page Caching | Enable or disable page-level caching. Defaults to enabled. |
| Headless Mode | Run the browser without a UI. Defaults to true. |
| Use Chrome Headless Shell | Run the browser in minimal headless shell mode. Requires headless mode to be enabled and chrome-headless-shell to be in the system PATH. |
| Stealth Mode | Apply techniques to reduce detection of headless Puppeteer usage. Defaults to disabled. |
| Proxy Server | Proxy server address to route browser traffic through (e.g., localhost:8080 or socks5://localhost:1080). |
| Add Container Arguments | Automatically add recommended launch arguments for container environments (--no-sandbox, etc.). Defaults to true. |
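
As a rough illustration of how some of these properties could translate into browser launch options, consider the sketch below. The `BrowserSettings` interface and `toLaunchOptions` function are hypothetical names, not the node's own code. The only container flag shown is `--no-sandbox`, because the full list the node adds is not given here.

```typescript
// Hypothetical shape covering a subset of the properties listed above.
interface BrowserSettings {
  headless: boolean;
  executablePath?: string;
  launchArguments: string[];
  proxyServer?: string;
  addContainerArgs: boolean;
}

// Builds an options object of the kind Puppeteer's launch() accepts
// (headless, executablePath, args).
function toLaunchOptions(s: BrowserSettings): {
  headless: boolean;
  executablePath?: string;
  args: string[];
} {
  const args = s.launchArguments.slice(); // copy so the input is not mutated
  if (s.proxyServer) {
    args.push(`--proxy-server=${s.proxyServer}`); // Chromium's proxy flag
  }
  if (s.addContainerArgs && args.indexOf("--no-sandbox") === -1) {
    args.push("--no-sandbox"); // assumption: the node may add further flags
  }
  return { headless: s.headless, executablePath: s.executablePath, args };
}

const options = toLaunchOptions({
  headless: true,
  launchArguments: ["--lang=en-US"],
  proxyServer: "socks5://localhost:1080",
  addContainerArgs: true,
});
console.log(options.args.join(" "));
// --lang=en-US --proxy-server=socks5://localhost:1080 --no-sandbox
```

In Puppeteer itself, an object like this would be passed to `puppeteer.launch()`, while Timeout and Wait Until correspond to the `timeout` and `waitUntil` options of `page.goto()`. How the node wires these together internally is not confirmed by this page.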

Output

The output is an array with one item per processed input item. For the "Get Page Content" operation, each output item contains:

  • json:
    • body: The full HTML content of the loaded page as a string.
    • headers: The HTTP response headers received when loading the page.
    • statusCode: The HTTP status code of the response (e.g., 200).
    • url: The final URL after any redirects.

No binary data is produced for this operation.
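
For downstream processing, the item structure above could be modeled as follows. This is a sketch: the `PageContentItem` type name, the sample values, and the `extractTitle` helper are illustrative assumptions, and header values are assumed to be plain strings.

```typescript
// Hypothetical type mirroring the documented fields of one output item.
interface PageContentItem {
  json: {
    body: string; // full HTML of the loaded page
    headers: Record<string, string>; // HTTP response headers
    statusCode: number; // e.g. 200
    url: string; // final URL after redirects
  };
}

// Made-up sample item for illustration only.
const sample: PageContentItem = {
  json: {
    body: "<html><head><title>Example Product</title></head><body></body></html>",
    headers: { "content-type": "text/html; charset=utf-8" },
    statusCode: 200,
    url: "https://example.com/product/1",
  },
};

// Pulls the <title> text out of the HTML body, or returns null if none is found.
function extractTitle(item: PageContentItem): string | null {
  const match = /<title[^>]*>([^<]*)<\/title>/i.exec(item.json.body);
  return match ? match[1].trim() : null;
}

console.log(extractTitle(sample)); // Example Product
```

A regular expression is adequate for a single well-known tag like this. For anything more involved, an HTML parser is the more robust choice.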

Dependencies

  • Requires Puppeteer and puppeteer-extra libraries for browser automation.
  • Optionally uses a stealth plugin to evade detection.
  • Supports connection to an external browser instance via WebSocket.
  • May require a compatible Chromium or Chrome browser installed or accessible via executable path.
  • Environment variables can control which built-in and external modules are allowed in the script execution sandbox.
  • If using headless shell mode, requires chrome-headless-shell available in system PATH.

Troubleshooting

  • Invalid URL error: Occurs if the provided URL is malformed or cannot be parsed. Ensure the URL is valid and properly formatted.
  • Failed to launch/connect to browser: Happens if Puppeteer cannot start or connect to a browser instance. Check the executable path, the WebSocket endpoint, and system dependencies.
  • Timeout errors: Navigation fails if the page takes longer to load than the configured timeout. Increase the timeout or set it to 0 to disable it.
  • Resource limits exceeded: A very high batch size can exhaust CPU or memory and cause failures. Reduce the batch size.
  • Stealth mode issues: Enabling stealth mode may cause unexpected behavior on some sites; disable if problems occur.
  • Proxy configuration errors: Incorrect proxy server strings can prevent page loading. Verify proxy format and availability.
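
To catch the invalid URL case before a workflow runs, a pre-check along these lines may help. It is a sketch with a hypothetical `isParsableUrl` helper. It relies on the standard `URL` constructor, which may not match the node's own validation exactly.

```typescript
// Hypothetical pre-check: returns true if the standard URL parser accepts the input.
function isParsableUrl(input: string): boolean {
  try {
    new URL(input); // throws a TypeError on malformed input
    return true;
  } catch (e) {
    return false;
  }
}

console.log(isParsableUrl("https://example.com/path")); // true
console.log(isParsableUrl("example.com/path")); // false (no scheme)
```

A common cause of this error is a missing scheme, so checking that the value starts with http:// or https:// is a reasonable first step.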
