Puppeteer

Automate browser interactions using Puppeteer

Overview

This node uses Puppeteer, a browser automation library that runs the browser headless by default, to interact with web pages programmatically. The "Get Page Content" operation loads a given URL and returns the page's full HTML together with the HTTP response headers and status code.
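
As a rough illustration of what "Get Page Content" does, the sketch below navigates a page and collects its HTML, response headers, status code, and final URL. It is not the node's actual source. `PageLike`, `ResponseLike`, `PageContent`, and `getPageContent` are hypothetical names, and the two `...Like` interfaces only mirror the Puppeteer methods `page.goto`, `page.content`, `page.url`, `response.headers`, and `response.status` so that the function can be exercised without launching a browser.

```typescript
// Minimal structural types mirroring the Puppeteer methods used below.
// They are illustrative assumptions, not types exported by the node.
type WaitUntil = "load" | "domcontentloaded" | "networkidle0" | "networkidle2";

interface NavigationOptions {
  waitUntil?: WaitUntil;
  timeout?: number; // milliseconds; 0 disables the timeout
}

interface ResponseLike {
  headers(): Record<string, string>;
  status(): number;
}

interface PageLike {
  goto(url: string, options?: NavigationOptions): Promise<ResponseLike | null>;
  content(): Promise<string>;
  url(): string;
}

interface PageContent {
  body: string;
  headers: Record<string, string>;
  statusCode: number;
  url: string;
}

// Navigate to `url`, then read the rendered HTML and the main response metadata.
async function getPageContent(
  page: PageLike,
  url: string,
  options: NavigationOptions = {},
): Promise<PageContent> {
  const response = await page.goto(url, options);
  if (response === null) {
    // Puppeteer resolves goto() with null in a few cases (e.g. about:blank).
    throw new Error(`No response received for ${url}`);
  }
  return {
    body: await page.content(),
    headers: response.headers(),
    statusCode: response.status(),
    url: page.url(), // final URL after any redirects
  };
}
```

A Puppeteer `Page` exposes methods with these names, so a real page object could be passed where the sketch expects a `PageLike`. Keeping the dependency behind a small interface also makes the logic easy to try with a stub.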

Common scenarios where this node is beneficial include:

Web scraping: Extracting raw HTML content for parsing or data extraction.
Monitoring: Checking page content changes over time.
Automation: Workflows that need page content rendered dynamically by JavaScript.

Practical examples:

Fetch the HTML content of a product page on an e-commerce site to extract pricing or availability information.
Retrieve the content of a news article page to analyze or archive it.

Properties

Name	Meaning
URL	The web address of the page to load and retrieve content from.
Query Parameters	Optional key-value pairs appended as query parameters to the URL before loading the page.
Options	A collection of advanced settings controlling browser behavior and page loading:
- Batch Size	Maximum number of pages to open simultaneously (affects memory and CPU usage).
- Browser WebSocket Endpoint	Connect to an existing browser instance via WebSocket instead of launching a new one.
- Emulate Device	Emulate a specific device's viewport and user agent (e.g., iPhone, iPad).
- Executable Path	Custom path to the browser executable to use. Ignored if connecting via WebSocket.
- Extra Headers	Additional HTTP headers to send with the page request.
- File Name	Filename to assign to binary outputs (not applicable for Get Page Content).
- Launch Arguments	Extra command line arguments passed to the browser process.
- Timeout	Maximum navigation time in milliseconds (0 disables timeout).
- Wait Until	Event to consider navigation complete: `load`, `domcontentloaded`, `networkidle0`, or `networkidle2`.
- Page Caching	Enable or disable page-level caching (default enabled).
- Headless Mode	Run browser without UI (default true).
- Use Chrome Headless Shell	Run the browser in headless shell mode (requires Headless Mode to be enabled and `chrome-headless-shell` to be available in PATH).
- Stealth Mode	Apply techniques to avoid detection as a headless browser.
- Proxy Server	Use a custom proxy server for requests (e.g., `localhost:8080`, `socks5://localhost:1080`).
- Add Container Arguments	Automatically add recommended arguments for container environments (`--no-sandbox`, etc.).
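
To make the Query Parameters property concrete, the following sketch appends key-value pairs to a URL using the standard `URL` and `URLSearchParams` APIs. The helper name `withQueryParameters` is hypothetical, and whether the node appends to or overwrites existing parameters is an assumption. The sketch appends.

```typescript
// Append key-value pairs to a URL's query string, keeping existing parameters.
function withQueryParameters(
  baseUrl: string,
  params: Record<string, string>,
): string {
  const url = new URL(baseUrl); // throws a TypeError if baseUrl is malformed
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.append(key, value);
  }
  return url.toString();
}

console.log(
  withQueryParameters("https://example.com/search?lang=en", {
    q: "red shoes",
    page: "2",
  }),
);
// prints "https://example.com/search?lang=en&q=red+shoes&page=2"
```

Because `new URL` rejects malformed input, a helper like this also catches an invalid URL before any browser navigation is attempted.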

Output

The output is an array of items, one per processed input item. For the "Get Page Content" operation, each item contains:

json:
- body: The full HTML content of the loaded page as a string.
- headers: An object containing HTTP response headers received when loading the page.
- statusCode: The HTTP status code of the page response (e.g., 200).
- url: The final URL of the loaded page, after any redirects.
pairedItem: References the original input item index.

No binary data is produced for this operation.
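
The item structure above can be written down as a type and consumed by a later step, for example to pull the page title out of `body`. This is only a sketch: `PageContentItem` and `extractTitle` are hypothetical names, and representing `pairedItem` as `{ item: number }` is an assumption about how the input index is stored.

```typescript
// Assumed shape of one output item; `pairedItem.item` as the input index
// is an assumption made for illustration.
interface PageContentItem {
  json: {
    body: string;
    headers: Record<string, string>;
    statusCode: number;
    url: string;
  };
  pairedItem: { item: number };
}

// Return the trimmed text of the first <title> element, or null if none exists.
function extractTitle(item: PageContentItem): string | null {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(item.json.body);
  return match?.[1]?.trim() ?? null;
}

const example: PageContentItem = {
  json: {
    body: "<html><head><title> Example Domain </title></head><body></body></html>",
    headers: { "content-type": "text/html; charset=UTF-8" },
    statusCode: 200,
    url: "https://example.com/",
  },
  pairedItem: { item: 0 },
};

console.log(extractTitle(example)); // prints "Example Domain"
```

A regular expression is enough for a single well-known tag. For anything more structured, a proper HTML parser is the safer choice.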

Dependencies

Requires the Puppeteer and puppeteer-extra libraries for browser automation.
Optionally uses the stealth plugin to evade headless detection.
If using "Browser WebSocket Endpoint," requires access to an existing browser instance exposing a WebSocket debugging endpoint.
No internal credential types are required. Pages behind authentication must be handled externally or via custom scripts.
Environment variables can influence which modules are allowed and what console output is shown during script execution, but they do not directly affect this operation.

Troubleshooting

Failed to launch/connect to browser: Puppeteer could not start or connect to a browser instance. Check that the executable path is correct, that the browser's dependencies are installed, and that no conflicting process is blocking the launch.
Invalid URL: The provided URL is malformed or cannot be parsed. Ensure the URL is valid and properly formatted.
Request failed with status code XXX: The page returned a non-200 HTTP status code. The page may be unavailable, may require authentication, or may be blocked by a firewall or proxy.
Timeout errors: Navigation took longer than the specified timeout. Increase the timeout value or check network conditions.
Memory/CPU overload: Using a large batch size opens many pages simultaneously, which can exhaust system resources. Reduce batch size to mitigate.
Stealth mode issues: Enabling stealth mode may cause unexpected behavior on some sites; disable if problems occur.
Proxy configuration: Incorrect proxy server strings can prevent page loading. Verify proxy format and accessibility.
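
The effect of Batch Size on resource usage can be pictured with a small batching helper. The sketch below processes inputs in consecutive groups so that no more than `batchSize` tasks are in flight at once. `processInBatches` is a hypothetical name, and how the node actually schedules pages internally is an assumption.

```typescript
// Run `worker` over `inputs` in consecutive batches of at most `batchSize`,
// so that no more than `batchSize` tasks are in flight at the same time.
async function processInBatches<T, R>(
  inputs: T[],
  batchSize: number,
  worker: (input: T) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const results: R[] = [];
  for (let start = 0; start < inputs.length; start += batchSize) {
    const batch = inputs.slice(start, start + batchSize);
    // Wait for the whole batch to finish before starting the next one.
    const batchResults = await Promise.all(batch.map((input) => worker(input)));
    results.push(...batchResults);
  }
  return results;
}
```

With a worker that opens one page per input, a smaller `batchSize` lowers peak memory and CPU use at the cost of a longer total run time. Results come back in input order because `Promise.all` preserves it.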