
Puppeteer

Automate browser interactions using Puppeteer

Overview

The Get Page Content operation of the Puppeteer node loads a web page in a browser controlled by Puppeteer and returns its full HTML content. Because the page is fetched as it would appear in a real browser, the result includes content produced by JavaScript, and the operation works for both dynamic and static pages. This is particularly useful for scraping data from websites that require JavaScript rendering, testing web page output, or archiving web content.

Common scenarios:

  • Scraping product details from e-commerce sites that use client-side rendering.
  • Capturing the state of a web page after user interactions or authentication.
  • Monitoring changes on dynamic web pages.

Example:
You can use this node to fetch the rendered HTML of a news article page, including content loaded via JavaScript.


Properties

Below are the supported input properties for the Get Page Content operation:

Display Name | Type | Meaning
URL | String (required) | The web address of the page to retrieve.
Query Parameters | Collection | List of key-value pairs to append as query parameters to the URL.
Options | Collection | Advanced settings for browser behavior and request customization.
├─ Batch Size | Number | Maximum number of pages to open simultaneously. Higher values use more memory/CPU.
├─ Browser WebSocket Endpoint | String | Connects to an existing browser instance via WebSocket instead of launching a new one.
├─ Emulate Device | Options | Emulates a specific device (e.g., mobile, tablet) for the browser session.
├─ Executable path | String | Path to the browser executable. Ignored if WebSocket endpoint is set.
├─ Extra Headers | Collection | Additional HTTP headers to send with the request.
├─ File Name | String | Not used in this operation. (Relevant for PDF/Screenshot only.)
├─ Launch Arguments | Collection | Additional command-line arguments for the browser process.
├─ Timeout | Number | Maximum navigation time in milliseconds (default: 30000).
├─ Protocol Timeout | Number | Maximum time to wait for protocol responses in milliseconds (default: 30000).
├─ Wait Until | Options | When to consider navigation successful (e.g., load, domcontentloaded, networkidle0, networkidle2).
├─ Page Caching | Boolean | Enable/disable page-level caching (default: true).
├─ Headless mode | Boolean | Run browser in headless mode (default: true).
├─ Use Chrome Headless Shell | Boolean | Use chrome-headless-shell binary (requires headless mode and shell in $PATH).
├─ Stealth mode | Boolean | Makes detection of automation harder (anti-bot evasion).
├─ Human typing mode | Boolean | Simulates human-like typing in input fields.
├─ Human Typing Options | Collection | Fine-tune delays and typo simulation for human typing mode.
├─ Proxy Server | String | Use a proxy server for outgoing requests.
└─ Add Container Arguments | Boolean | Adds recommended flags for container environments (default: true).
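
To illustrate how the URL and Query Parameters properties combine, here is a minimal sketch using the standard URL API. The helper name buildUrl and the QueryParameter shape are hypothetical, chosen for illustration only; the node's own implementation may differ.

```typescript
// Hypothetical shape for one entry of the "Query Parameters" collection.
interface QueryParameter {
  name: string;
  value: string;
}

// Appends each key-value pair to the URL's query string.
// Existing query parameters on baseUrl are preserved.
function buildUrl(baseUrl: string, params: QueryParameter[]): string {
  const url = new URL(baseUrl); // throws a TypeError if baseUrl is malformed
  for (const { name, value } of params) {
    url.searchParams.append(name, value);
  }
  return url.toString();
}

const target = buildUrl("https://example.com/search", [
  { name: "q", value: "laptop stand" },
  { name: "page", value: "2" },
]);
console.log(target); // https://example.com/search?q=laptop+stand&page=2
```

Note that values are percent-encoded automatically (spaces become "+"), so raw values can be supplied as-is.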

Output

The node outputs an array of items, each containing the following structure in the json field:

{
  "body": "<string>",         // The full HTML content of the fetched page.
  "headers": { ... },         // HTTP response headers returned by the server.
  "statusCode": <number>,     // HTTP status code of the response.
  "url": "<string>"           // The final URL after any redirects.
}
  • If an error occurs, the output will contain an error field with the error message.

Note: This operation does not output binary data.
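
When processing this output downstream (for example, in a Code node), it can help to separate successful pages from failed ones. The sketch below assumes the item shapes described above; the type names and the splitResults helper are hypothetical, and it assumes a failed item carries only the error field.

```typescript
// Assumed item shapes, based on the fields listed above.
interface PageContentSuccess {
  body: string;
  headers: Record<string, string>;
  statusCode: number;
  url: string;
}

interface PageContentFailure {
  error: string;
}

type PageContentItem = PageContentSuccess | PageContentFailure;

// Type guard: true when the item carries an error message.
function isFailure(item: PageContentItem): item is PageContentFailure {
  return "error" in item;
}

// Hypothetical helper: keeps 2xx pages, collects everything else as errors.
function splitResults(items: PageContentItem[]): {
  pages: PageContentSuccess[];
  errors: string[];
} {
  const pages: PageContentSuccess[] = [];
  const errors: string[] = [];
  for (const item of items) {
    if (isFailure(item)) {
      errors.push(item.error);
    } else if (item.statusCode >= 200 && item.statusCode < 300) {
      pages.push(item);
    } else {
      errors.push(`Unexpected status ${item.statusCode} for ${item.url}`);
    }
  }
  return { pages, errors };
}

// Sample data for illustration only.
const sampleItems: PageContentItem[] = [
  { body: "<html></html>", headers: {}, statusCode: 200, url: "https://example.com/" },
  { body: "Not Found", headers: {}, statusCode: 404, url: "https://example.com/missing" },
  { error: "Navigation timeout of 30000 ms exceeded" },
];
const { pages, errors } = splitResults(sampleItems);
console.log(pages.length, errors.length); // 1 2
```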


Dependencies

  • External Services: None required for basic usage.
  • API Keys: Not required.
  • n8n Configuration:
    • For advanced options, you may need:
      • A compatible version of Puppeteer and its plugins.
      • Access to a browser executable (Chrome/Chromium) if not connecting via WebSocket.
      • Proper environment variables if running in a containerized environment (for example, to ensure Chrome runs correctly).

Troubleshooting

Common Issues:

  • Invalid URL:

    • Error: "Invalid URL: <your-url>"
    • Cause: The provided URL is malformed or missing.
    • Solution: Ensure the URL is complete and valid (including protocol, e.g., https://).
  • Navigation Timeout:

    • Error: "Navigation timeout of <timeout> ms exceeded"
    • Cause: The page took too long to load.
    • Solution: Increase the "Timeout" property or check your network connection.
  • Request failed with status code X:

    • Error: "Request failed with status code <number>"
    • Cause: The server responded with an error (e.g., 404, 500).
    • Solution: Check the target URL and server availability.
  • Failed to launch/connect to browser:

    • Error: "Failed to launch/connect to browser: <details>"
    • Cause: Missing browser executable, incompatible environment, or misconfigured options.
    • Solution: Verify Puppeteer dependencies, browser path, and environment setup.
  • Resource Limits:

    • Cause: A high batch size or many simultaneously open pages may exhaust system memory or CPU.
    • Solution: Lower the "Batch Size" or increase available memory/CPU.
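
The "Invalid URL" case above can often be caught before the node runs. The following is a minimal sketch of a pre-check, for example in an upstream Code node; the function name isFetchableUrl is hypothetical, and the rule it applies (http or https only) is an assumption about what the node accepts.

```typescript
// Returns true only for well-formed URLs that use http or https.
// Assumption: other schemes (ftp:, file:, ...) are not valid targets here.
function isFetchableUrl(candidate: string): boolean {
  try {
    const { protocol } = new URL(candidate.trim());
    return protocol === "http:" || protocol === "https:";
  } catch {
    // new URL() throws a TypeError when the string cannot be parsed,
    // for example when the protocol is missing.
    return false;
  }
}

console.log(isFetchableUrl("https://example.com/news")); // true
console.log(isFetchableUrl("example.com/news"));         // false (no protocol)
```

Checking the protocol explicitly matters because a string such as "localhost:3000" parses successfully, with "localhost:" treated as the scheme.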
