Puppeteer

Automate browser interactions using Puppeteer

Overview

This node uses Puppeteer to drive a browser for web scraping and content retrieval. It can fetch a page's HTML, take screenshots, generate PDFs, and run custom scripts on web pages. Typical uses are extracting data from websites, capturing visual snapshots, and automating page interactions.

Use Case Examples

  1. Extract the HTML content of a webpage for data analysis.
  2. Capture a screenshot of a webpage for visual documentation.
  3. Generate a PDF of a webpage for offline reading or archiving.
  4. Run custom JavaScript on a webpage to interact with dynamic content or scrape specific elements.
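As a rough illustration of how the first three use cases map onto Puppeteer page methods, here is a minimal sketch. The `SnapshotPage` interface, the operation names, and the result shape are hypothetical stand-ins chosen for this example; they are not taken from the node's source. The sketch relies only on the `content()`, `screenshot()`, and `pdf()` methods of a Puppeteer page.

```typescript
// Hypothetical structural view of the page methods used here, so the sketch
// does not depend on the puppeteer package being installed.
interface SnapshotPage {
  content(): Promise<string>;
  screenshot(options?: { fullPage?: boolean }): Promise<Uint8Array>;
  pdf(options?: { format?: string }): Promise<Uint8Array>;
}

// Hypothetical operation identifiers; the node's internal names may differ.
type Operation = "getPageContent" | "getScreenshot" | "getPDF";

type OperationResult =
  | { kind: "json"; body: string }
  | { kind: "binary"; mimeType: string; data: Uint8Array };

// Dispatch one operation against an already-navigated page.
async function runOperation(
  page: SnapshotPage,
  operation: Operation,
): Promise<OperationResult> {
  switch (operation) {
    case "getPageContent":
      // Serialized HTML of the current document.
      return { kind: "json", body: await page.content() };
    case "getScreenshot":
      // Puppeteer produces PNG data unless another type is requested.
      return {
        kind: "binary",
        mimeType: "image/png",
        data: await page.screenshot({ fullPage: true }),
      };
    case "getPDF":
      return {
        kind: "binary",
        mimeType: "application/pdf",
        data: await page.pdf({ format: "A4" }),
      };
  }
}
```

Separating JSON results from binary results mirrors the distinction the node makes between page content and file outputs such as screenshots or PDFs.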

Properties

  • URL - The web page URL to navigate to and interact with.
  • Query Parameters - Additional query parameters to append to the URL when making the request.
  • Batch Size - Maximum number of pages to open simultaneously, to control resource usage.
  • Browser WebSocket Endpoint - WebSocket URL to connect to an existing browser instance instead of launching a new one.
  • Protocol - Protocol to use for browser communication, e.g., Chrome DevTools Protocol or WebDriver BiDi.
  • Emulate Device - Emulate a specific device's viewport and user agent.
  • Executable path - Path to a custom browser executable to use instead of the bundled one.
  • Extra Headers - Custom HTTP headers to send with the page requests.
  • File Name - File name to assign to binary data outputs like screenshots or PDFs.
  • Launch Arguments - Additional command line arguments to pass to the browser instance.
  • Timeout - Maximum navigation time in milliseconds before timing out.
  • Protocol Timeout - Maximum time in milliseconds to wait for a protocol response.
  • Wait Until - Event to wait for before navigation is considered successful (e.g., load, domcontentloaded).
  • Page Caching - Enable or disable page-level caching.
  • Headless mode - Run the browser in headless mode (no UI).
  • Use Chrome Headless Shell - Run the browser in headless shell mode. Requires chrome-headless-shell in PATH.
  • Stealth mode - Apply techniques to make headless Puppeteer harder to detect.
  • Human typing mode - Enable human-like typing simulation on input elements.
  • Human Typing Options - Settings to customize the human typing simulation behavior.
  • Proxy Server - Custom proxy server configuration for browser requests.
  • Capture Downloads - Automatically capture and return files downloaded during script execution.
  • Add Container Arguments - Automatically add recommended Chrome arguments when running in container environments.
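To make the launch-related properties more concrete, the sketch below shows one way they could be folded into an options object of the kind Puppeteer's `launch()` accepts (`headless`, `executablePath`, `args`, `protocolTimeout`). The `NodeSettings` interface, the `buildLaunchOptions` helper, and the specific list of container flags are assumptions made for illustration; the node may assemble its options differently.

```typescript
// Hypothetical subset of the node's properties; field names are illustrative.
interface NodeSettings {
  headless: boolean;
  executablePath?: string;
  launchArguments?: string[];
  proxyServer?: string; // e.g. "http://127.0.0.1:8080"
  addContainerArgs?: boolean;
  protocolTimeout?: number; // milliseconds
}

// Mirrors a few fields of Puppeteer's launch options.
interface LaunchOptionsSketch {
  headless: boolean;
  executablePath?: string;
  args: string[];
  protocolTimeout?: number;
}

// Chrome flags often suggested for containers. Whether the node adds exactly
// these flags is an assumption.
const CONTAINER_ARGS = [
  "--no-sandbox",
  "--disable-setuid-sandbox",
  "--disable-dev-shm-usage",
  "--disable-gpu",
];

function buildLaunchOptions(settings: NodeSettings): LaunchOptionsSketch {
  // Start from the user's own arguments so they are never dropped.
  const args = [...(settings.launchArguments ?? [])];

  if (settings.addContainerArgs) {
    for (const flag of CONTAINER_ARGS) {
      // Skip flags the user already supplied to avoid duplicates.
      if (!args.includes(flag)) args.push(flag);
    }
  }

  if (settings.proxyServer) {
    // Chrome reads its proxy from a command line switch.
    args.push(`--proxy-server=${settings.proxyServer}`);
  }

  const options: LaunchOptionsSketch = { headless: settings.headless, args };
  if (settings.executablePath) options.executablePath = settings.executablePath;
  if (settings.protocolTimeout !== undefined) {
    options.protocolTimeout = settings.protocolTimeout;
  }
  return options;
}
```

With the puppeteer package installed, the returned object could be passed to `puppeteer.launch(...)`. Properties such as Stealth mode or Human typing mode are handled by the puppeteer-extra plugins and are outside the scope of this sketch.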

Output

JSON

  • body - The HTML content of the page (for the Get Page Content operation).
  • headers - HTTP response headers from the page request.
  • statusCode - HTTP status code of the page response.
  • url - The final URL of the page after navigation and redirects.
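The four output fields can be derived from a Puppeteer navigation roughly as sketched below. This is an illustration under stated assumptions, not the node's code: `NavigablePage`, `ResponseInfo`, `withQueryParameters`, and `getPageContent` are hypothetical names. The sketch uses only `page.goto()`, `page.content()`, `page.url()`, and the response's `headers()` and `status()` methods. The 30000 ms default matches Puppeteer's own default navigation timeout.

```typescript
// Hypothetical structural views of the Puppeteer objects used here.
interface ResponseInfo {
  headers(): Record<string, string>;
  status(): number;
}

type WaitUntil = "load" | "domcontentloaded" | "networkidle0" | "networkidle2";

interface NavigablePage {
  goto(
    url: string,
    options?: { waitUntil?: WaitUntil; timeout?: number },
  ): Promise<ResponseInfo | null>;
  content(): Promise<string>;
  url(): string;
}

interface PageContentOutput {
  body: string;
  headers: Record<string, string>;
  statusCode: number | null;
  url: string;
}

// Append query parameters to a base URL using the standard URL API.
function withQueryParameters(
  baseUrl: string,
  params: Record<string, string>,
): string {
  const target = new URL(baseUrl);
  for (const [key, value] of Object.entries(params)) {
    target.searchParams.append(key, value);
  }
  return target.toString();
}

async function getPageContent(
  page: NavigablePage,
  baseUrl: string,
  params: Record<string, string> = {},
  waitUntil: WaitUntil = "load",
  timeoutMs = 30000,
): Promise<PageContentOutput> {
  const target = withQueryParameters(baseUrl, params);
  // goto() can resolve with null, so the response fields are guarded below.
  const response = await page.goto(target, { waitUntil, timeout: timeoutMs });
  return {
    body: await page.content(),
    headers: response ? response.headers() : {},
    statusCode: response ? response.status() : null,
    // page.url() reflects the address after any redirects.
    url: page.url(),
  };
}
```

Reading `url` from the page rather than from the request is what makes it the final URL: after a redirect it can differ from the address that was passed in.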

Dependencies

  • puppeteer-extra
  • puppeteer-extra-plugin-stealth
  • puppeteer-extra-plugin-human-typing
  • puppeteer
  • vm2

Troubleshooting

  • Ensure the URL is valid and accessible to avoid navigation errors.
  • Check that the browser executable path is correct if using a custom browser.
  • If running in a container, verify that container arguments are properly set or disabled as needed.
  • Timeout errors can occur if the page takes too long to load; increase Timeout (navigation) or Protocol Timeout accordingly.
  • When using stealth mode, some websites may still detect automation; try toggling stealth mode off if issues arise.
  • If capturing downloads, ensure the download path is writable and has sufficient space.
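For the timeout case in particular, a custom script or wrapper could retry navigation with progressively larger limits instead of failing on the first attempt. The sketch below is one possible pattern, not part of the node: `retryWithLongerTimeout` and `isTimeoutError` are hypothetical helpers, and the check assumes the thrown error carries the name `TimeoutError`, as Puppeteer's `TimeoutError` class does.

```typescript
// Assumption: timeout failures surface as an Error named "TimeoutError".
function isTimeoutError(err: unknown): boolean {
  return err instanceof Error && err.name === "TimeoutError";
}

// Run `attempt` once per timeout value, moving to the next (larger) value
// only when the previous attempt failed with a timeout.
async function retryWithLongerTimeout<T>(
  attempt: (timeoutMs: number) => Promise<T>,
  timeoutsMs: number[],
): Promise<T> {
  let lastError: unknown = new Error("no timeouts provided");
  for (const timeoutMs of timeoutsMs) {
    try {
      return await attempt(timeoutMs);
    } catch (err) {
      // Errors other than timeouts (bad URL, DNS failure) are not retried.
      if (!isTimeoutError(err)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}
```

A call might look like `retryWithLongerTimeout((t) => page.goto(url, { timeout: t }), [30000, 60000])`, where `page` and `url` come from the surrounding script.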
