Overview
This node uses Puppeteer, a headless browser automation library, to interact with web pages programmatically. Specifically, the "Get Page Content" operation loads a web page at a specified URL and returns its full HTML content along with HTTP response headers and status code.
Common scenarios where this node is beneficial include:
- Scraping or extracting raw HTML content from websites for data processing.
- Capturing the fully rendered HTML after JavaScript execution (unlike simple HTTP requests).
- Automating web page interactions that require a real browser environment.
Practical example:
- You want to scrape product details from an e-commerce site that relies heavily on client-side rendering. Using this node, you can load the product page URL and get the complete HTML content as rendered by the browser, which includes dynamically loaded elements.
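As a rough illustration of the operation described above, the sketch below loads a page, reads the rendered HTML, and collects the response metadata. It is not the node's actual source. The `BrowserLike`, `PageLike`, and `ResponseLike` interfaces are hypothetical minimal shapes that list only the Puppeteer methods used here (`newPage`, `goto`, `content`, `close`, `status`, `headers`, `url`), and the `getPageContent` name and the 30000 ms default are assumptions.

```typescript
// Hypothetical minimal shapes; a real Puppeteer browser is expected to satisfy them.
type WaitUntil = "load" | "domcontentloaded" | "networkidle0" | "networkidle2";

interface ResponseLike {
  status(): number;
  headers(): Record<string, string>;
  url(): string;
}

interface PageLike {
  goto(url: string, options?: { waitUntil?: WaitUntil; timeout?: number }): Promise<ResponseLike | null>;
  content(): Promise<string>;
  close(): Promise<void>;
}

interface BrowserLike {
  newPage(): Promise<PageLike>;
}

interface PageContentResult {
  body: string;
  headers: Record<string, string>;
  statusCode: number | null;
  url: string;
}

async function getPageContent(
  browser: BrowserLike,
  url: string,
  waitUntil: WaitUntil = "load",
  timeout: number = 30000, // assumed default, in milliseconds
): Promise<PageContentResult> {
  const page = await browser.newPage();
  try {
    // goto resolves once the chosen waitUntil condition is met.
    const response = await page.goto(url, { waitUntil, timeout });
    // content() serializes the DOM as it stands after scripts have run.
    const body = await page.content();
    return {
      body,
      headers: response ? response.headers() : {},
      statusCode: response ? response.status() : null,
      url: response ? response.url() : url,
    };
  } finally {
    // Always release the page, even when navigation throws.
    await page.close();
  }
}
```

Typing the function against small interfaces rather than the library itself keeps the sketch runnable with a stub browser, which is convenient for trying out the flow without launching Chromium.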
Properties
| Name | Meaning |
|---|---|
| URL | The web page URL to load and retrieve content from. |
| Query Parameters | Optional key-value pairs appended as query parameters to the URL before loading the page. |
| Options | A collection of advanced settings controlling browser behavior: |
| - Batch Size | Maximum number of pages to open simultaneously. Higher values increase resource usage. |
| - Browser WebSocket Endpoint | Connect to an existing browser instance via WebSocket instead of launching a new one. |
| - Emulate Device | Emulate a specific device's viewport and user agent (e.g., iPhone, iPad). |
| - Executable Path | Path to a custom browser executable to use instead of the bundled one. Ignored if connecting via WebSocket. |
| - Extra Headers | Additional HTTP headers to send with the page request. |
| - File Name | Filename to assign to binary outputs (only relevant for screenshot or PDF operations, not for Get Page Content). |
| - Launch Arguments | Extra command line arguments passed to the browser process. Ignored if connecting via WebSocket. |
| - Timeout | Maximum navigation time in milliseconds before the navigation is aborted. Set to 0 to disable the timeout. |
| - Wait Until | When to consider navigation succeeded: load, domcontentloaded, networkidle0, or networkidle2. |
| - Page Caching | Enable or disable page-level caching. Defaults to enabled. |
| - Headless Mode | Run browser without UI. Defaults to true. |
| - Use Chrome Headless Shell | Run browser in headless shell mode (requires chrome-headless-shell in system PATH). |
| - Stealth Mode | Apply techniques to make headless browser detection harder. Defaults to false. |
| - Proxy Server | Use a custom proxy server for all browser traffic (e.g., localhost:8080 or socks5://localhost:1080). |
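To make the Query Parameters row concrete, here is a minimal sketch of appending key-value pairs to a URL with the standard `URL` and `URLSearchParams` APIs. The `buildUrl` helper is a hypothetical name, and the node may assemble the final URL differently.

```typescript
// Hypothetical helper: appends key-value pairs to a URL's query string.
function buildUrl(baseUrl: string, queryParameters: Record<string, string>): string {
  // The URL constructor throws a TypeError when baseUrl is malformed.
  const url = new URL(baseUrl);
  for (const key of Object.keys(queryParameters)) {
    // append() percent-encodes values and keeps parameters already in the URL.
    url.searchParams.append(key, queryParameters[key] ?? "");
  }
  return url.toString();
}
```

For example, `buildUrl("https://example.com/products", { page: "2" })` yields `https://example.com/products?page=2`. Because the constructor rejects malformed input, the same helper doubles as a quick validity check before the URL is handed to the browser.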
Output
The output JSON contains:
- body: The full HTML content of the loaded page as a string.
- headers: An object containing the HTTP response headers received when loading the page.
- statusCode: The HTTP status code of the page response.
- url: The final URL loaded (after following any redirects).
There is no binary output for the "Get Page Content" operation.
Example output JSON snippet:
    {
      "body": "<!DOCTYPE html><html>...</html>",
      "headers": {
        "content-type": "text/html; charset=UTF-8",
        "cache-control": "max-age=3600"
      },
      "statusCode": 200,
      "url": "https://example.com/page"
    }
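Downstream code that consumes this output may want to check its shape before using it. The following sketch assumes the four fields shown above. The `PageContentOutput` interface and both helper names are hypothetical and not part of the node.

```typescript
// Assumed shape of one output item, based on the fields listed above.
interface PageContentOutput {
  body: string;
  headers: Record<string, string>;
  statusCode: number;
  url: string;
}

// Type guard: verifies that an unknown value has the expected fields.
function isPageContentOutput(value: unknown): value is PageContentOutput {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.body === "string" &&
    typeof v.headers === "object" &&
    v.headers !== null &&
    typeof v.statusCode === "number" &&
    typeof v.url === "string"
  );
}

// Looks up a header without relying on the casing of its name.
function getHeader(output: PageContentOutput, headerName: string): string | undefined {
  const wanted = headerName.toLowerCase();
  for (const key of Object.keys(output.headers)) {
    if (key.toLowerCase() === wanted) return output.headers[key];
  }
  return undefined;
}
```

A caller could then branch on `statusCode >= 400` or on `getHeader(output, "Content-Type")` before parsing `body`.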
Dependencies
- Requires Puppeteer and puppeteer-extra libraries for browser automation.
- Supports optional integration with a CAPTCHA solving service via an API key credential (used only if configured).
- No internal credential names are exposed; users must provide an appropriate API key credential if using CAPTCHA solving features.
- Node configuration may require setting environment variables to allow external modules or built-in Node.js modules.
- If using "Browser WebSocket Endpoint," a running browser instance accessible via WebSocket must be available.
- For stealth mode, the node uses a stealth plugin to reduce detection of headless browsing.
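The choice between connecting to an existing browser and launching a new one can be sketched as a small decision function. This is an illustration under assumptions, not the node's code: `BrowserOptions`, `BrowserPlan`, and `planBrowser` are hypothetical names, while the option keys in the result mirror those accepted by Puppeteer's `connect` (`browserWSEndpoint`) and `launch` (`headless`, `executablePath`, `args`).

```typescript
// Hypothetical subset of the node's options.
interface BrowserOptions {
  browserWSEndpoint?: string;
  executablePath?: string;
  launchArguments?: string[];
  headless?: boolean;
}

interface LaunchOptions {
  headless: boolean;
  executablePath?: string;
  args: string[];
}

type BrowserPlan =
  | { mode: "connect"; connectOptions: { browserWSEndpoint: string } }
  | { mode: "launch"; launchOptions: LaunchOptions };

function planBrowser(options: BrowserOptions): BrowserPlan {
  if (options.browserWSEndpoint) {
    // Executable path and launch arguments are ignored when connecting.
    return { mode: "connect", connectOptions: { browserWSEndpoint: options.browserWSEndpoint } };
  }
  const launchOptions: LaunchOptions = {
    headless: options.headless ?? true, // headless defaults to true
    args: options.launchArguments ?? [],
  };
  if (options.executablePath) {
    launchOptions.executablePath = options.executablePath;
  }
  return { mode: "launch", launchOptions };
}
```

Returning a plain description instead of calling Puppeteer directly keeps the decision easy to inspect; a caller would pass `connectOptions` to `puppeteer.connect` or `launchOptions` to `puppeteer.launch`.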
Troubleshooting
- Failed to launch/connect to browser: This error indicates Puppeteer could not start or connect to the browser. Check that the executable path is correct, required dependencies are installed, and no conflicting browser instances block the connection.
- Invalid URL: If the provided URL is malformed, the node will throw an error. Ensure URLs are valid and properly formatted.
- Request failed with status code >= 400: The page returned an error HTTP status. Verify the URL is accessible and does not require authentication or special headers.
- Timeout errors: If navigation takes longer than the specified timeout, increase the timeout value or check network conditions.
- Resource limits exceeded: Setting batch size too high may cause excessive memory/CPU usage leading to failures. Reduce batch size if encountering performance issues.
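- Stealth mode issues: Some sites may still detect headless browsers even with stealth mode enabled. Try adjusting other options, such as Emulate Device, Extra Headers, or Headless Mode, or disable stealth mode to check whether it is causing the problem.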
- Proxy server misconfiguration: Incorrect proxy format or unreachable proxy will cause connection failures.
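To show why batch size bounds resource usage, the sketch below processes items in fixed-size chunks so that at most `batchSize` tasks run at once. How the node schedules pages internally is not documented here, so chunking is an assumption, and `processInBatches` is a hypothetical helper.

```typescript
// Runs worker over items in chunks, so at most batchSize tasks are in flight.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const results: R[] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    // Wait for the whole chunk before starting the next one.
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
  }
  return results;
}
```

With this approach, lowering `batchSize` directly lowers the peak number of open pages, at the cost of a longer total run time. Results are returned in input order.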