Overview
This node uses Puppeteer, a headless browser automation library, to interact with web pages programmatically. Specifically, the "Get Page Content" operation loads a given URL and retrieves the full HTML content of the page, along with the HTTP response headers and status code.
Common scenarios where this node is beneficial include:
- Web scraping: Extracting raw HTML content for parsing or data extraction.
- Monitoring: Checking page content changes over time.
- Automation workflows that require fetching dynamic page content rendered by JavaScript.
Practical examples:
- Fetch the HTML content of a product page on an e-commerce site to extract pricing or availability information.
- Retrieve the content of a news article page to analyze or archive it.
Properties
| Name | Meaning |
|---|---|
| URL | The web address of the page to load and retrieve content from. |
| Query Parameters | Optional key-value pairs appended as query parameters to the URL before loading the page. |
| Options | A collection of advanced settings controlling browser behavior and page loading: |
| - Batch Size | Maximum number of pages to open simultaneously (affects memory and CPU usage). |
| - Browser WebSocket Endpoint | Connect to an existing browser instance via WebSocket instead of launching a new one. |
| - Emulate Device | Emulate a specific device's viewport and user agent (e.g., iPhone, iPad). |
| - Executable Path | Custom path to the browser executable to use. Ignored if connecting via WebSocket. |
| - Extra Headers | Additional HTTP headers to send with the page request. |
| - File Name | Filename to assign to binary outputs (not applicable for Get Page Content). |
| - Launch Arguments | Extra command line arguments passed to the browser process. |
| - Timeout | Maximum navigation time in milliseconds (0 disables timeout). |
| - Wait Until | Event to consider navigation complete: load, domcontentloaded, networkidle0, or networkidle2. |
| - Page Caching | Enable or disable page-level caching (default enabled). |
| - Headless Mode | Run browser without UI (default true). |
| - Use Chrome Headless Shell | Run browser in headless shell mode (requires headless mode enabled and chrome-headless-shell in PATH). |
| - Stealth Mode | Apply techniques to avoid detection as a headless browser. |
| - Proxy Server | Use a custom proxy server for requests (e.g., localhost:8080, socks5://localhost:1080). |
| - Add Container Arguments | Automatically add recommended arguments for container environments (--no-sandbox, etc.). |
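The Query Parameters property is described as key-value pairs appended to the URL before the page loads. A minimal sketch of that step, assuming the standard `URL` API is used; `buildUrl` and `QueryParams` are hypothetical names for illustration and are not taken from the node's source:

```typescript
// Hypothetical helper: illustrates appending query parameters to a URL.
// The node's actual implementation may differ.
type QueryParams = Record<string, string>;

function buildUrl(baseUrl: string, params: QueryParams = {}): string {
  // The URL constructor throws a TypeError when the input cannot be parsed,
  // which is one way an "Invalid URL" error can surface.
  const url = new URL(baseUrl);
  for (const [key, value] of Object.entries(params)) {
    // append() percent-encodes values and keeps any parameters already present.
    url.searchParams.append(key, value);
  }
  return url.toString();
}

console.log(buildUrl("https://example.com/products", { page: "2", sort: "price" }));
// prints "https://example.com/products?page=2&sort=price"
```

Using `searchParams.append` rather than string concatenation avoids encoding mistakes when a value contains spaces or reserved characters.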
Output
The output is an array of items, each corresponding to an input item processed. For the "Get Page Content" operation, each item contains:
- json:
  - body: The full HTML content of the loaded page as a string.
  - headers: An object containing HTTP response headers received when loading the page.
  - statusCode: The HTTP status code of the page response (e.g., 200).
  - url: The final URL loaded (including any redirects).
- pairedItem: References the original input item index.
No binary data is produced for this operation.
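To make the output shape concrete, the following sketch models one output item and derives a short summary from it. The `PageContentItem` interface is an assumption built from the fields listed above, the `{ item: number }` form of `pairedItem` is assumed, and `getHeader` and `summarize` are hypothetical helpers rather than part of the node:

```typescript
// Assumed shape of one output item, based on the documented fields.
interface PageContentItem {
  json: {
    body: string;
    headers: Record<string, string>;
    statusCode: number;
    url: string;
  };
  pairedItem: { item: number }; // assumed form of the input item reference
}

// Case-insensitive header lookup, so the sketch does not depend on header casing.
function getHeader(headers: Record<string, string>, headerName: string): string | undefined {
  const wanted = headerName.toLowerCase();
  for (const key of Object.keys(headers)) {
    if (key.toLowerCase() === wanted) return headers[key];
  }
  return undefined;
}

// Hypothetical downstream step: summarize a fetched page.
function summarize(item: PageContentItem) {
  const { body, headers, statusCode, url } = item.json;
  const titleMatch = /<title[^>]*>([^<]*)<\/title>/i.exec(body);
  return {
    url,
    ok: statusCode >= 200 && statusCode < 300, // treat any 2xx status as success
    contentType: getHeader(headers, "content-type"),
    title: titleMatch?.[1]?.trim(),
  };
}

const sampleItem: PageContentItem = {
  json: {
    body: "<html><head><title> Example </title></head><body></body></html>",
    headers: { "Content-Type": "text/html; charset=utf-8" },
    statusCode: 200,
    url: "https://example.com/",
  },
  pairedItem: { item: 0 },
};

console.log(summarize(sampleItem));
```

A regular expression is enough for pulling a single tag out of the `body` string in a sketch like this; real extraction work is usually better served by an HTML parser.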
Dependencies
- Requires Puppeteer and puppeteer-extra libraries for browser automation.
- Optionally uses the stealth plugin to evade headless detection.
- If using "Browser WebSocket Endpoint," requires access to an existing browser instance exposing a WebSocket debugging endpoint.
- No internal credential types are required, but if accessing pages behind authentication, users must handle that externally or via custom scripts.
- Environment variables can influence allowed modules and console output during script execution but do not affect this operation directly.
Troubleshooting
- Failed to launch/connect to browser: Indicates Puppeteer could not start or connect to a browser instance. Check that the executable path is correct, dependencies are installed, and no conflicting processes block browser launch.
- Invalid URL: The provided URL is malformed or cannot be parsed. Ensure the URL is valid and properly formatted.
- Request failed with status code XXX: The page returned a non-200 HTTP status code. This may indicate that the page is unavailable, requires authentication, or is blocked by a firewall or proxy.
- Timeout errors: Navigation took longer than the specified timeout. Increase the timeout value or check network conditions.
- Memory/CPU overload: Using a large batch size opens many pages simultaneously, which can exhaust system resources. Reduce batch size to mitigate.
- Stealth mode issues: Enabling stealth mode may cause unexpected behavior on some sites; disable if problems occur.
- Proxy configuration: Incorrect proxy server strings can prevent page loading. Verify proxy format and accessibility.
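The memory and CPU note above follows from how a batch size caps the number of pages open at once. The sketch below shows one common way to enforce such a cap by processing items in fixed-size chunks. It is an illustration only: `processInBatches` and `fakeLoad` are hypothetical, no browser is involved, and the node may limit concurrency differently.

```typescript
// Hypothetical sketch: run an async worker over items, at most batchSize at a time.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const results: R[] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    // Every worker in this batch runs concurrently; the next batch starts
    // only after all of them have settled successfully.
    const batchResults = await Promise.all(
      batch.map((item, offset) => worker(item, start + offset)),
    );
    results.push(...batchResults);
  }
  return results;
}

// Stand-in for loading a page; it only records how many calls overlap.
let activeLoads = 0;
let peakLoads = 0;
async function fakeLoad(pageUrl: string): Promise<string> {
  activeLoads++;
  peakLoads = Math.max(peakLoads, activeLoads);
  await Promise.resolve(); // yield once, as a real page load would
  activeLoads--;
  return `loaded ${pageUrl}`;
}

const demoUrls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"];
processInBatches(demoUrls, 2, fakeLoad).then((loaded) => {
  console.log(loaded.length, "pages loaded; peak concurrency:", peakLoads);
});
```

With this approach a smaller batch size trades throughput for a lower peak in open pages, which is why reducing it relieves resource pressure.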