Puppeteer

Automate browser interactions using Puppeteer

Overview

This node uses Puppeteer, a headless browser automation library, to interact with web pages programmatically. The "Get Page Content" operation loads a specified URL and retrieves the full HTML content of the page along with HTTP response headers and status code.

Common scenarios where this node is beneficial include:

  • Web scraping: Extracting raw HTML content for further parsing or data extraction.
  • Monitoring website changes by fetching page content regularly.
  • Testing or validating web page responses in workflows.
  • Integrating dynamic web content into automation pipelines.

For example, you can use this node to fetch the HTML of a product page on an e-commerce site, then parse it in subsequent nodes to extract pricing or availability information.
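As a rough illustration of that follow-up parsing step, the sketch below pulls a price out of the returned HTML. It assumes a hypothetical markup where the price sits in an element with class "price"; real product pages will differ, and a proper HTML parser is usually more robust than a regular expression. The function name extractPrice is invented for this example.

```typescript
// Hypothetical markup: assumes the price is rendered as
// <span class="price">$1,299.50</span>. Adjust the pattern for real pages.
function extractPrice(html: string): number | null {
  const match = html.match(/class="price"[^>]*>\s*[^\d<]*([\d.,]+)/);
  if (!match) return null;
  // Drop thousands separators before parsing (assumes "." is the decimal mark).
  const raw = (match[1] ?? "").replace(/,/g, "");
  const value = Number.parseFloat(raw);
  return Number.isNaN(value) ? null : value;
}

const sampleBody = '<html><body><span class="price">$1,299.50</span></body></html>';
console.log(extractPrice(sampleBody)); // 1299.5
```

In a workflow, the input string would be the "body" field of this node's output.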

Properties

Name: Meaning
URL: The web address of the page to load and retrieve content from.
Query Parameters: Optional key-value pairs appended as query parameters to the URL before loading the page.
Options: A collection of advanced settings controlling browser behavior and page loading:
- Batch Size: Maximum number of pages to open simultaneously (affects memory and CPU usage).
- Browser WebSocket Endpoint: WebSocket URL to connect to an existing browser instance instead of launching a new one.
- Browser WebSocket Headers: Headers sent when connecting to the browser WebSocket endpoint.
- Emulate Device: Select a device profile to emulate (e.g., mobile devices with specific screen sizes and user agents).
- Executable Path: Path to a custom browser executable to use instead of the bundled one.
- Extra Headers: Additional HTTP headers to send with page requests.
- File Name: Filename to assign to binary outputs (not applicable to Get Page Content, but used by other operations).
- Launch Arguments: Extra command-line arguments passed to the browser on launch.
- Timeout: Maximum navigation time in milliseconds before aborting. Set to 0 to disable the timeout.
- Protocol Timeout: Maximum wait time for protocol responses in milliseconds. Set to 0 to disable the timeout.
- Wait Until: Event that determines when navigation is considered finished: load, domcontentloaded, networkidle0, or networkidle2.
- Page Caching: Enable or disable page-level caching (default enabled).
- Headless Mode: Run the browser without a UI (default true).
- Use Chrome Headless Shell: Run the browser in headless shell mode (requires headless mode to be enabled and chrome-headless-shell in PATH).
- Stealth Mode: Apply techniques that make headless browser detection harder.
- Human Typing Mode: Enables the .typeHuman() function to simulate human-like typing.
- Human Typing Options: Settings controlling delays and typo chances for human typing simulation.
- Proxy Server: Custom proxy server configuration (e.g., localhost:8080 or socks5://localhost:1080).
- Add Container Arguments: Automatically add recommended arguments for running inside container environments (--no-sandbox, etc.).
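
To make the URL, Query Parameters, Wait Until, and Timeout properties more concrete, here is a minimal sketch of how they could be combined before navigation. It is not the node's actual implementation: the helper planNavigation and the NavigationPlan type are hypothetical, and the 30000 ms default is an assumption borrowed from Puppeteer's own default navigation timeout.

```typescript
type WaitUntil = "load" | "domcontentloaded" | "networkidle0" | "networkidle2";

interface NavigationPlan {
  url: string;
  gotoOptions: { waitUntil: WaitUntil; timeout: number };
}

// Hypothetical helper: appends query parameters and collects navigation options.
function planNavigation(
  baseUrl: string,
  queryParameters: Record<string, string>,
  waitUntil: WaitUntil = "load",
  timeoutMs = 30000, // assumed default; 0 disables the timeout
): NavigationPlan {
  const url = new URL(baseUrl); // throws on a malformed URL
  for (const [key, value] of Object.entries(queryParameters)) {
    url.searchParams.append(key, value); // values are percent-encoded as needed
  }
  return { url: url.toString(), gotoOptions: { waitUntil, timeout: timeoutMs } };
}

const plan = planNavigation(
  "https://example.com/search",
  { q: "red shoes", page: "2" },
  "networkidle2",
  0,
);
console.log(plan.url); // https://example.com/search?q=red+shoes&page=2
```

With Puppeteer itself, the result would typically be passed along as page.goto(plan.url, plan.gotoOptions).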

Output

The output contains JSON data with the following structure:

{
  "body": "<html>...</html>",       // Full HTML content of the loaded page
  "headers": {                     // HTTP response headers received
    "content-type": "text/html; charset=UTF-8",
    ...
  },
  "statusCode": 200,               // HTTP status code of the response
  "url": "https://example.com"    // Final URL after any redirects
}
  • The output is paired with the input item it corresponds to.
  • No binary data is produced by this operation.
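
For downstream code, the structure above could be typed roughly as follows. The interface name PageContentResult and the helper isHtmlSuccess are assumptions made for this sketch; only the four field names come from the output shown above.

```typescript
// Field names mirror the JSON output; the type name itself is hypothetical.
interface PageContentResult {
  body: string;
  headers: Record<string, string>;
  statusCode: number;
  url: string;
}

// Hypothetical check for a later step: 2xx status and an HTML content type.
function isHtmlSuccess(result: PageContentResult): boolean {
  const contentType = result.headers["content-type"] ?? "";
  const ok = result.statusCode >= 200 && result.statusCode < 300;
  return ok && contentType.includes("text/html");
}

const example: PageContentResult = {
  body: "<html>...</html>",
  headers: { "content-type": "text/html; charset=UTF-8" },
  statusCode: 200,
  url: "https://example.com",
};
console.log(isHtmlSuccess(example)); // true
```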

Dependencies

  • Requires Puppeteer and puppeteer-extra libraries with stealth and human typing plugins.
  • Optionally connects to an existing browser via WebSocket if configured.
  • If using WebSocket connection with authentication, requires an API key credential or bearer token credential configured in n8n.
  • For emulating devices, relies on Puppeteer's known device descriptors.
  • Running in containerized environments may require enabling container-specific launch arguments.
  • If using headless shell mode, chrome-headless-shell must be installed and available in system PATH.
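
As an illustration of the WebSocket dependency, the sketch below assembles settings for connecting to an existing browser. The field names browserWSEndpoint and headers follow the options accepted by Puppeteer's connect(); everything else is an assumption, including the helper name buildConnectSettings, the example endpoint, and the idea that a bearer token is sent as an Authorization header. How the node maps its credentials to headers is not documented here.

```typescript
interface ConnectSettings {
  browserWSEndpoint: string;
  headers?: Record<string, string>;
}

// Hypothetical helper: validates the endpoint scheme and attaches an auth header.
function buildConnectSettings(endpoint: string, bearerToken?: string): ConnectSettings {
  const parsed = new URL(endpoint); // throws on a malformed endpoint
  if (parsed.protocol !== "ws:" && parsed.protocol !== "wss:") {
    throw new Error(`Expected a ws:// or wss:// endpoint, got ${parsed.protocol}`);
  }
  const settings: ConnectSettings = { browserWSEndpoint: endpoint };
  if (bearerToken) {
    // Assumption: the remote browser service expects a standard Bearer header.
    settings.headers = { Authorization: `Bearer ${bearerToken}` };
  }
  return settings;
}

// "ws://localhost:3000" is a placeholder endpoint for illustration only.
console.log(buildConnectSettings("ws://localhost:3000", "my-token"));
```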

Troubleshooting

  • Timeout errors: If navigation takes longer than the configured timeout, increase the "Timeout" property or set it to 0 to disable the timeout entirely.
  • Invalid URL error: Ensure the URL is well-formed and includes the scheme (for example, https://).
  • Browser launch failures: Check that the executable path is correct or that the environment supports launching Chromium. In containers, ensure sandboxing flags are set correctly.
  • WebSocket connection issues: Verify the WebSocket endpoint URL and authentication headers if connecting to an existing browser.
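  • Page content empty or incomplete: Adjust the "Wait Until" option so navigation waits for a later event; for pages that load content via JavaScript, networkidle0 or networkidle2 may give more complete results than load or domcontentloaded.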
  • Stealth mode not working: Some sites may still detect headless browsers despite stealth mode; consider disabling or adjusting stealth settings.
  • Human typing simulation slow: Adjust human typing delay options to balance realism and speed.
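
To catch the invalid URL case before the node runs, a pre-check along these lines may help. This is a sketch: the function name isValidHttpUrl is hypothetical, and restricting to http and https is an assumption rather than the node's documented validation rule.

```typescript
// Returns true only for absolute URLs with an http: or https: scheme.
function isValidHttpUrl(candidate: string): boolean {
  try {
    const parsed = new URL(candidate);
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch {
    // new URL() throws when the string is not an absolute URL.
    return false;
  }
}

console.log(isValidHttpUrl("https://example.com/page")); // true
console.log(isValidHttpUrl("example.com/page")); // false, scheme is missing
```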
