Puppeteer

Automate browser interactions using Puppeteer

Overview

This node uses Puppeteer, a browser automation library, to interact with web pages programmatically. The "Get Screenshot" operation loads each given URL and captures a screenshot of the page. It can capture either the full scrollable page or only the visible viewport, outputs PNG, JPEG, or WebP, and lets you set the image quality for the formats that support it (JPEG and WebP).

Common scenarios where this node is beneficial include:

  • Automatically generating website previews or thumbnails.
  • Monitoring visual changes on web pages over time.
  • Archiving web page appearances for compliance or record-keeping.
  • Creating images for social media sharing or reports.

For example, given the URL of a product page, the node can return a PNG screenshot of the entire scrollable page, which can then be used in marketing materials or automated reports.
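The format, quality, and full-page settings described above correspond to the `type`, `quality`, and `fullPage` options of Puppeteer's `page.screenshot()`. The following is a minimal sketch of that mapping, not the node's actual implementation; the helper name `buildScreenshotOptions` and the clamping behavior are assumptions made for illustration.

```typescript
// Hypothetical helper: maps the node's Type, Quality and Full Page settings
// onto an object shaped like the argument of Puppeteer's page.screenshot().
type ImageType = "png" | "jpeg" | "webp";

interface ScreenshotOptions {
  type: ImageType;
  fullPage: boolean;
  quality?: number; // only meaningful for jpeg and webp
}

function buildScreenshotOptions(
  type: ImageType,
  fullPage: boolean,
  quality?: number
): ScreenshotOptions {
  const options: ScreenshotOptions = { type, fullPage };
  // Quality does not apply to PNG, so it is dropped rather than passed through.
  if (type !== "png" && quality !== undefined) {
    // Assumption: clamp to the documented 0-100 range and round to an integer.
    options.quality = Math.min(100, Math.max(0, Math.round(quality)));
  }
  return options;
}

// With a Puppeteer `page` already in scope, the result could be used as:
//   const image = await page.screenshot(buildScreenshotOptions("jpeg", true, 80));
```

Keeping the mapping in a pure function makes it easy to check that a quality value never reaches a PNG capture.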

Properties

| Name | Meaning |
| --- | --- |
| URL | The web address of the page to capture. |
| Property Name | The name of the binary property where the screenshot image data will be stored. |
| Type | The image format for the screenshot: PNG, JPEG, or WebP. |
| Quality | Image quality from 0 to 100; applies only to JPEG and WebP formats (not PNG). |
| Full Page | Whether to capture the entire scrollable page (true) or just the visible viewport (false). |
| Query Parameters | Additional query parameters to append to the URL before loading the page. |
| Batch Size | Number of pages to open simultaneously; higher values use more memory and CPU. |
| Browser WebSocket Endpoint | WebSocket URL to connect to an existing browser instance instead of launching a new one. |
| Emulate Device | Optionally emulate a specific device's viewport and user agent. |
| Executable Path | Path to a custom browser executable to use instead of the bundled one. |
| Extra Headers | Custom HTTP headers to send with the page request. |
| File Name | Filename to assign to the binary data output (useful for saving files downstream). |
| Launch Arguments | Additional command line arguments to pass when launching the browser. |
| Timeout | Maximum navigation time in milliseconds; 0 disables timeout. |
| Wait Until | Event to wait for before considering navigation complete: load, domcontentloaded, networkidle0, networkidle2. |
| Page Caching | Enable or disable page-level caching (default enabled). |
| Headless mode | Run browser in headless mode (default true). |
| Use Chrome Headless Shell | Run browser in headless shell mode (requires headless mode enabled and chrome-headless-shell in PATH). |
| Stealth mode | Apply techniques to make headless browser detection harder. |
| Proxy Server | Proxy server configuration string (e.g., localhost:8080, socks5://localhost:1080). |
| Add Container Arguments | Add recommended launch arguments for container environments (--no-sandbox, etc.). |
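
To make the Query Parameters, Timeout, and Wait Until properties more concrete, here is a rough sketch of how they might be combined before navigation. The helper names `buildTargetUrl` and `buildNavigationOptions` are hypothetical and the node may handle these values differently; the `timeout` and `waitUntil` fields mirror the options accepted by Puppeteer's `page.goto()`.

```typescript
// Hypothetical helpers; the names are illustrative, not taken from the node's source.
type WaitUntil = "load" | "domcontentloaded" | "networkidle0" | "networkidle2";

interface NavigationOptions {
  timeout: number; // milliseconds; 0 disables the timeout
  waitUntil: WaitUntil;
}

// Append extra query parameters to a URL, keeping any it already has.
function buildTargetUrl(baseUrl: string, params: Record<string, string>): string {
  const url = new URL(baseUrl); // throws a TypeError if baseUrl is malformed
  for (const name of Object.keys(params)) {
    url.searchParams.append(name, params[name]);
  }
  return url.toString();
}

// Assumption: negative timeouts are rejected up front instead of being passed on.
function buildNavigationOptions(timeout: number, waitUntil: WaitUntil): NavigationOptions {
  if (timeout < 0) {
    throw new RangeError("timeout must be 0 or greater");
  }
  return { timeout, waitUntil };
}

// With a Puppeteer `page` in scope, these could be used as:
//   await page.goto(buildTargetUrl(url, params), buildNavigationOptions(30000, "networkidle2"));
```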

Output

The node outputs items containing binary data representing the screenshot image. The binary property name is configurable via the "Property Name" input. Each item includes:

  • binary: Contains the image data in the specified format (PNG, JPEG, or WebP).
  • json: Metadata about the response including:
    • headers: HTTP response headers from the page request.
    • statusCode: HTTP status code of the page response.
    • url: The final URL loaded (including any query parameters).

The binary data can be used downstream for saving to disk, uploading, or further processing.
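
As an illustration of the item layout described above, the sketch below assembles one output item from raw image bytes and response metadata. The `json` fields come from this page; the field names inside the binary entry (`data`, `mimeType`, `fileName`) and the helper `buildOutputItem` are assumptions, so check them against the workflow engine's own binary data format before relying on them.

```typescript
// Response metadata as listed in this document.
interface PageResponseInfo {
  headers: Record<string, string>;
  statusCode: number;
  url: string;
}

// Assumed shape of one binary entry; these field names are not confirmed here.
interface BinaryEntry {
  data: string; // base64-encoded image bytes
  mimeType: string;
  fileName?: string;
}

interface OutputItem {
  json: PageResponseInfo;
  binary: Record<string, BinaryEntry>;
}

// Encode bytes as base64 using only globally available functions.
function toBase64(bytes: Uint8Array): string {
  let chars = "";
  for (let i = 0; i < bytes.length; i++) {
    chars += String.fromCharCode(bytes[i]);
  }
  return btoa(chars);
}

// Hypothetical helper: stores the image under the configured property name.
function buildOutputItem(
  propertyName: string,
  image: Uint8Array,
  type: "png" | "jpeg" | "webp",
  response: PageResponseInfo,
  fileName?: string
): OutputItem {
  const entry: BinaryEntry = { data: toBase64(image), mimeType: `image/${type}` };
  if (fileName !== undefined) {
    entry.fileName = fileName;
  }
  return { json: response, binary: { [propertyName]: entry } };
}
```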

Dependencies

  • Requires Puppeteer and puppeteer-extra libraries for browser automation.
  • Uses puppeteer-extra-plugin-stealth if stealth mode is enabled.
  • Supports connecting to an existing browser instance via WebSocket endpoint or launching a new Chromium browser.
  • The node requires no credentials of its own. To access protected pages, supply the appropriate authentication headers (through Extra Headers) or proxy settings.
  • Environment variables can influence behavior, e.g., enabling stdout logging or allowing external modules.
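
The choice between connecting to an existing browser and launching a new one can be sketched as a small decision function. This is an assumption about how the settings might be combined, not the node's actual logic: `planBrowser` and `CONTAINER_ARGS` are hypothetical names, and the list of container flags beyond `--no-sandbox` is a guess at commonly used Chromium switches. The resulting option objects use field names accepted by `puppeteer.connect()` and `puppeteer.launch()`.

```typescript
interface BrowserSettings {
  browserWSEndpoint?: string;
  executablePath?: string;
  headless: boolean;
  proxyServer?: string;
  addContainerArgs: boolean;
  launchArgs: string[];
}

interface LaunchOptions {
  headless: boolean;
  executablePath?: string;
  args: string[];
}

type BrowserPlan =
  | { mode: "connect"; options: { browserWSEndpoint: string } }
  | { mode: "launch"; options: LaunchOptions };

// Assumed set of container flags; this document only names --no-sandbox explicitly.
const CONTAINER_ARGS = ["--no-sandbox", "--disable-setuid-sandbox", "--disable-dev-shm-usage"];

function planBrowser(settings: BrowserSettings): BrowserPlan {
  // A WebSocket endpoint takes precedence: reuse the running browser.
  if (settings.browserWSEndpoint) {
    return { mode: "connect", options: { browserWSEndpoint: settings.browserWSEndpoint } };
  }
  const args = settings.launchArgs.slice();
  if (settings.addContainerArgs) {
    for (const flag of CONTAINER_ARGS) {
      if (args.indexOf(flag) === -1) args.push(flag); // avoid duplicate flags
    }
  }
  if (settings.proxyServer) {
    args.push(`--proxy-server=${settings.proxyServer}`);
  }
  const options: LaunchOptions = { headless: settings.headless, args };
  if (settings.executablePath) {
    options.executablePath = settings.executablePath;
  }
  return { mode: "launch", options };
}

// A caller would then dispatch on the plan:
//   plan.mode === "connect" -> puppeteer.connect(plan.options)
//   plan.mode === "launch"  -> puppeteer.launch(plan.options)
```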

Troubleshooting

  • Invalid URL error: If the provided URL is malformed, the node will throw an error indicating an invalid URL. Ensure URLs are properly formatted.
  • Navigation timeout: If the page takes longer than the configured timeout to load, a timeout error occurs. Increase the timeout or check network conditions.
  • Failed to launch/connect to browser: Errors launching Chromium or connecting to a WebSocket endpoint indicate misconfiguration or missing dependencies. Verify executable paths, WebSocket URLs, and that Chromium is installed.
  • Permission errors in container environments: If running inside containers, ensure container-specific launch arguments are enabled (Add Container Arguments) to avoid sandboxing issues.
  • Stealth mode issues: Enabling stealth mode may cause unexpected behavior on some sites; disable it if problems arise.
  • Memory/CPU overload: Setting a high batch size opens many pages simultaneously, which can exhaust system resources. Reduce batch size if performance degrades.
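
Two of the issues above, malformed URLs and oversized batches, can be caught before any page is opened. The sketch below is one possible pre-check and not part of the node itself: `partitionUrls` and `chunk` are hypothetical helpers, and restricting valid URLs to `http:` and `https:` is an assumption.

```typescript
// Split URLs into those that parse as http(s) URLs and those that do not.
function partitionUrls(urls: string[]): { valid: string[]; invalid: string[] } {
  const valid: string[] = [];
  const invalid: string[] = [];
  for (const candidate of urls) {
    try {
      const parsed = new URL(candidate); // throws on malformed input
      if (parsed.protocol === "http:" || parsed.protocol === "https:") {
        valid.push(candidate);
      } else {
        invalid.push(candidate);
      }
    } catch {
      invalid.push(candidate);
    }
  }
  return { valid, invalid };
}

// Group items into batches of at most batchSize entries each.
function chunk<T>(items: T[], batchSize: number): T[][] {
  if (Math.floor(batchSize) !== batchSize || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Each batch could then be processed one after another, with the pages inside
// a batch opened concurrently, so at most batchSize pages are open at a time.
```

Lowering the batch size trades throughput for a smaller peak in memory and CPU use.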
