Puppeteer

Automate browser interactions using Puppeteer

Overview

This node uses Puppeteer, a headless browser automation library, to load web pages and generate PDF documents from them. The "Get PDF" operation navigates to a specified URL, optionally appends query parameters, and renders the page as a PDF file. Page size, margins, orientation, scaling, headers and footers, and background handling are all configurable.

Typical uses include:

  • Automatically generating PDFs of web reports or dashboards.
  • Archiving web pages in PDF format for compliance or record-keeping.
  • Creating printable versions of dynamic web content.
  • Automating PDF generation for invoices, tickets, or certificates hosted on web pages.

Practical example: You want to create a PDF snapshot of a sales dashboard at a specific URL every day. Using this node, you specify the dashboard URL, set the paper size to A4, enable background graphics, and save the resulting PDF binary data for further processing or storage.
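
The dashboard snapshot above could be sketched roughly as follows. This is a minimal illustration, not the node's actual implementation: `PageLike` and `PdfSettings` are hypothetical interfaces listing only what the sketch uses, and `snapshotDashboard` is an invented helper name. `format` and `printBackground` are Puppeteer PDF options that correspond to the paper size and background graphics settings.

```typescript
// Hypothetical minimal view of the PDF settings used in this sketch.
interface PdfSettings {
  format?: string;           // paper format, e.g. "A4"
  printBackground?: boolean; // include background graphics
}

// Hypothetical minimal view of a browser page: only the two calls needed here.
// A Puppeteer Page exposes methods with these names.
interface PageLike {
  goto(url: string): Promise<unknown>;
  pdf(settings: PdfSettings): Promise<Uint8Array>;
}

// Navigate to the dashboard, then render it as an A4 PDF with backgrounds enabled.
async function snapshotDashboard(page: PageLike, dashboardUrl: string): Promise<Uint8Array> {
  await page.goto(dashboardUrl);
  return page.pdf({ format: "A4", printBackground: true });
}
```

The returned bytes are what the node would expose as binary data; scheduling the daily run is left to whatever triggers the workflow.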

Properties

Each property is listed by name, followed by its meaning:

  • URL: The web address of the page to convert into a PDF.
  • Property Name: The name of the binary property where the generated PDF data will be stored.
  • Page Ranges: Which pages to print, e.g., "1-5, 8, 11-13". Optional.
  • Scale: Scales the rendering of the web page; must be between 0.1 and 2. Default is 1 (normal scale).
  • Prefer CSS Page Size: If true, any CSS @page size declared in the page takes priority over the width, height, or format options.
  • Format: Paper format used when printing the PDF (e.g., Letter, Legal, A4). Used only if "Prefer CSS Page Size" is false.
  • Height: Custom paper height (number or string with unit). Used only if "Prefer CSS Page Size" is false.
  • Width: Custom paper width (number or string with unit). Used only if "Prefer CSS Page Size" is false.
  • Landscape: Whether to print the PDF in landscape orientation (true) or portrait (false).
  • Margin: Collection of margin sizes (top, bottom, left, right) for the PDF. Each margin can be specified as a string with units.
  • Display Header/Footer: Whether to show a header and footer in the PDF.
  • Header Template: HTML template for the header when "Display Header/Footer" is enabled. Supports classes such as date, title, url, pageNumber, and totalPages for dynamic content injection.
  • Footer Template: HTML template for the footer when "Display Header/Footer" is enabled. Supports the same classes as the header template (date, title, url, pageNumber, totalPages).
  • Transparent Background: If true, hides the default white background, allowing transparent PDFs.
  • Background Graphics: If true, includes background graphics in the PDF.
  • Query Parameters: List of key-value pairs appended as query parameters to the URL before loading the page.
  • Options: Additional Puppeteer launch and navigation options, including:
    • Batch Size: Number of pages processed simultaneously.
    • Browser WebSocket Endpoint: Connect to an existing browser instead of launching one.
    • Emulate Device: Simulate a specific device.
    • Executable Path: Path to the browser executable.
    • Extra Headers: HTTP headers to send with requests.
    • File Name: File name for the output PDF.
    • Launch Arguments: Extra command-line arguments passed to the browser.
    • Timeout: Maximum navigation time in milliseconds.
    • Wait Until: Event that marks navigation as complete.
    • Page Caching: Enable or disable caching.
    • Headless mode: Run the browser headless.
    • Use Chrome Headless Shell: Use chrome-headless-shell.
    • Stealth mode: Apply evasion techniques to reduce bot detection.
    • Proxy Server: Proxy configuration.
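
To make the relationship between these properties and Puppeteer's PDF options more concrete, here is a minimal sketch of one possible mapping. The input interface `NodePdfProperties`, its field names, and the helper `toPdfOptions` are all hypothetical and chosen for illustration; only the output field names (`scale`, `preferCSSPageSize`, `printBackground`, `omitBackground`, and so on) come from Puppeteer's PDF options. How the node itself performs this mapping is not documented here, so treat the logic as an assumption.

```typescript
// Hypothetical shape of the node's PDF properties; field names are invented for this sketch.
interface NodePdfProperties {
  scale?: number;
  preferCssPageSize?: boolean;
  format?: string;
  width?: string | number;
  height?: string | number;
  landscape?: boolean;
  pageRanges?: string;
  margin?: { top?: string; bottom?: string; left?: string; right?: string };
  displayHeaderFooter?: boolean;
  headerTemplate?: string;
  footerTemplate?: string;
  transparentBackground?: boolean;
  backgroundGraphics?: boolean;
}

// Subset of Puppeteer's PDF option names used below.
interface PdfOptions {
  scale: number;
  preferCSSPageSize: boolean;
  landscape: boolean;
  printBackground: boolean;
  omitBackground: boolean;
  displayHeaderFooter: boolean;
  format?: string;
  width?: string | number;
  height?: string | number;
  pageRanges?: string;
  margin?: { top?: string; bottom?: string; left?: string; right?: string };
  headerTemplate?: string;
  footerTemplate?: string;
}

function toPdfOptions(props: NodePdfProperties): PdfOptions {
  const scale = props.scale ?? 1;
  // Reject values outside the documented 0.1 to 2 range (this also rejects NaN).
  if (!(scale >= 0.1 && scale <= 2)) {
    throw new RangeError(`Scale must be between 0.1 and 2, got ${scale}`);
  }
  const options: PdfOptions = {
    scale,
    preferCSSPageSize: props.preferCssPageSize ?? false,
    landscape: props.landscape ?? false,
    printBackground: props.backgroundGraphics ?? false,
    omitBackground: props.transparentBackground ?? false,
    displayHeaderFooter: props.displayHeaderFooter ?? false,
  };
  // Following the property list: paper size settings apply only when CSS page size is not preferred.
  if (!options.preferCSSPageSize) {
    if (props.format !== undefined) options.format = props.format;
    if (props.width !== undefined) options.width = props.width;
    if (props.height !== undefined) options.height = props.height;
  }
  if (props.pageRanges) options.pageRanges = props.pageRanges;
  if (props.margin !== undefined) options.margin = props.margin;
  // Templates are passed only when the header/footer is displayed.
  if (options.displayHeaderFooter) {
    if (props.headerTemplate !== undefined) options.headerTemplate = props.headerTemplate;
    if (props.footerTemplate !== undefined) options.footerTemplate = props.footerTemplate;
  }
  return options;
}

// Usage: A4 with background graphics and a "page / total" footer.
const examplePdfOptions = toPdfOptions({
  format: "A4",
  backgroundGraphics: true,
  displayHeaderFooter: true,
  footerTemplate: '<span class="pageNumber"></span> / <span class="totalPages"></span>',
});
```

The footer template shows how the pageNumber and totalPages classes are meant to be used: Puppeteer fills elements carrying those classes with the corresponding values at print time.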

Output

The node outputs an array of items, each containing:

  • binary: An object with one property, named after the "Property Name" input, that holds the generated PDF as binary data.
  • json: Metadata about the response including:
    • headers: HTTP response headers from the page request.
    • statusCode: HTTP status code of the page request.
    • url: The final URL loaded (including query parameters).

This structure allows downstream nodes to access both the raw PDF file and metadata about the page fetch.
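
As an illustration of the item structure described above, the following sketch models one output item and shows how a downstream step might read it. The type names, the helper functions, and the inner fields of the binary entry (`data` as base64 text, `mimeType`, `fileName`) are assumptions made for this example and are not confirmed by the text; the example URL and values are placeholders.

```typescript
// Hypothetical entry layout; the inner fields are assumptions for illustration.
interface BinaryEntry {
  data: string; // PDF content, assumed here to be base64-encoded
  mimeType: string;
  fileName?: string;
}

// Item shape following the description above.
interface PdfItem {
  binary: Record<string, BinaryEntry>;
  json: { headers: Record<string, string>; statusCode: number; url: string };
}

// Append key-value pairs to a URL, as the Query Parameters property describes.
function withQueryParameters(baseUrl: string, params: Array<[string, string]>): string {
  const url = new URL(baseUrl);
  for (const [key, value] of params) {
    url.searchParams.append(key, value);
  }
  return url.toString();
}

// Read the PDF entry a downstream step would use, failing loudly if it is absent.
function getPdfEntry(item: PdfItem, propertyName: string): BinaryEntry {
  const entry = item.binary[propertyName];
  if (entry === undefined) {
    throw new Error(`No binary property named "${propertyName}"`);
  }
  return entry;
}

// Example item with "pdf" as the configured Property Name.
const exampleItem: PdfItem = {
  binary: {
    // Truncated placeholder content, not a complete PDF.
    pdf: { data: "JVBERi0=", mimeType: "application/pdf", fileName: "dashboard.pdf" },
  },
  json: {
    headers: { "content-type": "text/html" },
    statusCode: 200,
    url: withQueryParameters("https://example.com/dashboard", [["range", "today"]]),
  },
};
```

Keeping the page metadata in json next to the binary entry lets a later step decide, for instance, to skip storage when statusCode is not 200.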

Dependencies

  • Requires Puppeteer and puppeteer-extra libraries for browser automation.
  • Optionally requires an API key credential for a CAPTCHA solving service if stealth mode with CAPTCHA bypass is enabled.
  • The host environment must allow launching Chromium or connecting to an existing browser via WebSocket.
  • Device emulation relies on Puppeteer's known device descriptors.
  • For stealth mode, uses puppeteer-extra-plugin-stealth.
  • For CAPTCHA solving, uses puppeteer-extra-plugin-recaptcha with a 2Captcha API key.
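
The choice between launching Chromium and connecting to an existing browser can be sketched as a small decision function. This is an assumption-laden illustration: `BrowserOptions`, `BrowserPlan`, and `planBrowser` are hypothetical names, and the rule that a WebSocket endpoint takes precedence over launch settings is a guess at sensible behavior, not documented fact. The resulting option objects use field names accepted by Puppeteer's `connect` (`browserWSEndpoint`) and `launch` (`executablePath`, `headless`, `args`), and `--proxy-server` is a Chromium command-line switch.

```typescript
// Hypothetical subset of the node's Options relevant to starting a browser.
interface BrowserOptions {
  browserWSEndpoint?: string;
  executablePath?: string;
  headless?: boolean;
  launchArguments?: string[];
  proxyServer?: string;
}

interface LaunchSettings {
  executablePath?: string;
  headless: boolean;
  args: string[];
}

// Either connect to a running browser or launch a new one.
type BrowserPlan =
  | { mode: "connect"; options: { browserWSEndpoint: string } }
  | { mode: "launch"; options: LaunchSettings };

function planBrowser(opts: BrowserOptions): BrowserPlan {
  // Assumption: an explicit WebSocket endpoint wins over any launch settings.
  if (opts.browserWSEndpoint) {
    return { mode: "connect", options: { browserWSEndpoint: opts.browserWSEndpoint } };
  }
  // Copy the user's arguments so the caller's array is not mutated.
  const args = (opts.launchArguments ?? []).slice();
  if (opts.proxyServer) {
    args.push(`--proxy-server=${opts.proxyServer}`);
  }
  const launch: LaunchSettings = { headless: opts.headless ?? true, args };
  if (opts.executablePath !== undefined) {
    launch.executablePath = opts.executablePath;
  }
  return { mode: "launch", options: launch };
}
```

Separating the decision from the actual `launch` or `connect` call keeps the logic easy to check without starting a browser.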

Troubleshooting

  • Failed to launch/connect to browser: Indicates issues starting Chromium or connecting to a remote browser. Check executable path, WebSocket endpoint, and system dependencies.
  • Invalid URL: The provided URL is malformed or cannot be parsed. Verify the URL syntax and ensure it is reachable.
  • Request failed with status code XXX: The page returned an HTTP error status. Confirm the URL is correct and accessible.
  • Custom script must return an array of items: When running custom scripts, ensure the script returns an array of objects as expected.
  • Timeouts: Navigation may time out if the page takes too long to load. Increase the timeout value or check network conditions.
  • Memory/CPU usage high: Increasing batch size or running many pages simultaneously consumes more resources. Reduce batch size if encountering performance issues.
  • Stealth mode detection: Some sites detect automated browsers even with stealth mode enabled. Consider disabling headless mode or adjusting launch arguments.
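
Two of the errors listed above, "Invalid URL" and "Request failed with status code XXX", can be reproduced outside the node with a short pre-check. The sketch below is illustrative only: the helper names are invented, and treating every status of 400 or higher as a failure is an assumption about where the node draws the line.

```typescript
// Parse the URL up front so malformed input fails before any browser work starts.
function assertValidUrl(raw: string): URL {
  try {
    return new URL(raw);
  } catch {
    throw new Error(`Invalid URL: ${raw}`);
  }
}

// Assumption: statuses of 400 and above count as failed requests.
function assertOkStatus(statusCode: number): void {
  if (statusCode >= 400) {
    throw new Error(`Request failed with status code ${statusCode}`);
  }
}
```

Running the URL through `assertValidUrl` before configuring the node separates syntax problems from reachability problems, which only show up once the page is actually requested.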
