
Scrapfly

Scrapfly data collection APIs for web page scraping, screenshots, and AI data extraction

Overview

The Scrapfly node provides a comprehensive interface to Scrapfly's data collection APIs, enabling users to scrape web pages, capture screenshots, perform AI-powered data extraction, and access account information. Specifically, the Scrape Web Page operation allows users to programmatically retrieve content from any web page URL with advanced options such as proxy usage, JavaScript rendering, anti-scraping bypass, caching, and more.

This node is useful wherever automated, reliable web scraping is needed without managing infrastructure or dealing with common challenges such as IP blocking, dynamic content loading, or bot detection. For example, it can be used to:

  • Extract product details from e-commerce sites that require JavaScript rendering.
  • Collect news articles or blog posts while avoiding anti-bot protections.
  • Capture structured data from complex web pages using custom extraction templates.
  • Automate monitoring of website changes with caching and retry logic.
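To make the options described below concrete, here is a minimal sketch of the kind of HTTP request the node issues to Scrapfly's scrape API. The endpoint and the key, url, render_js, asp, and country parameters follow Scrapfly's public scrape API; the response field result.content is an assumption to verify against the current API reference.

```typescript
// Minimal sketch (Node 18+): scrape a JS-rendered page via Scrapfly's API.
// The response field `result.content` is an assumption -- check the actual
// payload returned for your account before relying on it.
async function scrape(url: string): Promise<string> {
  const params = new URLSearchParams({
    key: process.env.SCRAPFLY_KEY ?? '', // same API key the n8n credential holds
    url,
    render_js: 'true', // headless browser rendering
    asp: 'true',       // anti-scraping protection
    country: 'us',     // proxy geolocation
  });
  const res = await fetch(`https://api.scrapfly.io/scrape?${params}`);
  if (!res.ok) throw new Error(`Scrapfly request failed: ${res.status}`);
  const data = await res.json();
  return data.result?.content ?? '';
}

scrape('https://example.com/product/123').then((html) => console.log(html.slice(0, 200)));
```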

Properties

| Name | Meaning |
| --- | --- |
| URL | The web page URL to scrape. |
| Method | HTTP method to use for the request. Options: GET, HEAD, OPTIONS, PATCH, POST, PUT. |
| Additional Fields | A collection of optional parameters to customize the scrape, listed in the table below. |

Additional Fields options:

| Option | Meaning |
| --- | --- |
| Body | HTTP request body content (for methods such as POST). |
| Headers | Custom HTTP headers as key-value pairs (e.g., Accept-Language). |
| Retry | Whether to retry the request on failure (boolean). |
| Timeout | Maximum time allowed for the scrape, in milliseconds (default 150000 ms). |
| Proxy Pool | Proxy IP address pool to use. Options: Public Datacenter Pool, Public Residential Pool. |
| Country | Proxy geolocation country code (e.g., "us"). |
| Anti-Scraping Protection (ASP) | Enable anti-scraping protection to bypass anti-bot systems (boolean). |
| Cost Budget | Budget limit for dynamic retries/upgrades during ASP, to control cost (number). |
| Render JS | Enable JavaScript rendering via a headless browser (boolean). |
| Auto Scroll | Automatically scroll down the page when rendering with the headless browser (boolean). |
| Rendering Wait | Time in milliseconds to wait after page load before scraping when rendering JS (default 1000 ms). |
| Rendering Stage | Stage to wait for during rendering: Complete (full page load) or Dom Content Load (faster, partial load). |
| Wait For Selector | XPath or CSS selector to wait for before scraping (string). |
| JavaScript Injection | JavaScript code to inject and execute in the headless browser context (string). |
| JavaScript Scenario | Base64-encoded JSON describing scripted interactions (clicks, waits, fills) for the headless browser (see the encoding sketch after this table). |
| Screenshots | Multiple named screenshots to take, each targeting a selector or "fullpage" (see the sketch after this table). |
| Screenshot Flags | Flags to customize screenshot behavior: Block Banners, Dark Mode, High Quality, Load Images, Print Media Format. |
| Format | Output format of the scraped content: Clean HTML, JSON, Markdown, Raw (default), Text. |
| Format Options | Options for the Markdown format: No Links, No Images, Only Content. |
| Extraction Template | Base64-encoded JSON template to extract structured data from the scraped content (see the sketch after this table). |
| Extraction Prompt | Instruction text for AI-based extraction or question answering on the scraped content. |
| Extraction Model | AI model identifier for automatic parsing of the scraped document. |
| Session | Alphanumeric session name used to reuse cookies, fingerprint, and proxy across scrapes. |
| Session Sticky Proxy | Whether to reuse the same proxy IP within a session (boolean). |
| Cache | Enable the cache layer to return cached content when available (boolean). |
| Cache TTL | Cache time-to-live in seconds (default 86400, i.e., one day). |
| Cache Clear | Force a cache refresh and scrape anew (boolean). |
| Proxified Response | Return the raw page content as the response body, replacing status code and headers accordingly (boolean). |
| Debug | Enable debug mode for detailed logs (boolean). |
| Tags | Tags to group scrapes for filtering in the dashboard; multiple string tags allowed. |
| Operating System | Operating system to emulate in the user agent: Win11 (default), Mac, Linux, Chromeos. Cannot be set together with a User-Agent header. |
| Language | Web page language setting; configures the Accept-Language header. |
| Geolocation | Latitude and longitude to spoof for the geolocation permission, in the format "latitude,longitude". |
| DNS | Retrieve DNS information for the target (boolean). |
| SSL | Retrieve the remote SSL certificate and TLS information (HTTPS targets only) (boolean). |
| Correlation ID | Helper ID to correlate groups of scrapes for monitoring purposes (string). |
| Webhook | Queue the scrape request and redirect the API response to a named webhook endpoint (string). |
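The JavaScript Scenario field expects Base64-encoded JSON. A sketch of building and encoding a scenario follows; the step names (fill, click, wait_for_selector) mirror Scrapfly's scenario format as documented, but verify them against the current API reference.

```typescript
// Hypothetical scenario: fill a search box, submit the form, then wait
// for the results container. Step names are assumptions based on
// Scrapfly's scenario format -- confirm them before use.
const scenario = [
  { fill: { selector: 'input[name="q"]', value: 'n8n' } },
  { click: { selector: 'button[type="submit"]' } },
  { wait_for_selector: { selector: '.results' } },
];

// The node expects the scenario as a Base64-encoded JSON string:
const encodedScenario = Buffer.from(JSON.stringify(scenario)).toString('base64');
```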
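The Extraction Template field is encoded the same way. The template structure below is purely illustrative; the actual field names and schema are defined by Scrapfly's extraction API, so consult its documentation for the real structure.

```typescript
// Illustrative template only -- `source` and `selectors` are placeholder
// names, not Scrapfly's confirmed schema.
const template = {
  source: 'html',
  selectors: [
    { name: 'title', query: 'h1', type: 'css' },
    { name: 'price', query: '.price', type: 'css' },
  ],
};
const encodedTemplate = Buffer.from(JSON.stringify(template)).toString('base64');
```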
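Named screenshots map to per-name parameters on the underlying API. The screenshots[name]=target encoding shown below is an assumption drawn from Scrapfly's query-parameter convention; the node builds these parameters for you from the Screenshots field.

```typescript
// Sketch: two named screenshots, one scoped to a CSS selector and one
// covering the whole page. The screenshots[name]=target encoding is an
// assumption about how the node serializes the Screenshots field.
const params = new URLSearchParams({ key: '...', url: 'https://example.com' });
params.set('screenshots[hero]', '.hero-banner'); // capture a single element
params.set('screenshots[full]', 'fullpage');     // capture the entire page
```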

Output

The node outputs an array of JSON objects, one per scrape request. The structure of each object depends on the requested options but generally includes:

  • The raw or formatted content of the scraped page (HTML, JSON, Markdown, or plain text).
  • Metadata such as HTTP status code, headers, and timing information.
  • Extracted structured data, if extraction templates or AI models were enabled.
  • Screenshot data, if requested (likely as URLs or binary references).
  • Debug information, if debug mode is enabled.
  • Cache status and other operational metadata.

If the Proxified Response option is enabled, the output will contain the direct content of the page as the body, with status code and headers replaced accordingly, matching the selected format.

Binary data output is not explicitly described here, but screenshots or other media may be returned as URLs or references rather than raw binary.
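Downstream nodes can pull the scraped content out of each item. Below is a minimal sketch for an n8n Code node (Run Once for All Items); the result.content field name is an assumption about the response shape, so adjust it to match the output you actually see.

```typescript
// n8n Code node sketch: $input is provided by n8n's Code node runtime.
// The `result.content` path is an assumption -- inspect a real item first.
const out = [];
for (const item of $input.all()) {
  const result = item.json.result ?? item.json;
  out.push({ json: { url: result.url, content: result.content } });
}
return out;
```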

Dependencies

  • Requires an active Scrapfly API key credential configured in n8n.
  • Depends on Scrapfly's external web scraping service for all operations.
  • Network connectivity to Scrapfly API endpoints.
  • Optional: Webhook endpoint configured in Scrapfly dashboard if using webhook queuing.
  • No additional local dependencies are required.

Troubleshooting

  • Common Issues:

    • An invalid or missing API key credential will cause authentication errors.
    • Malformed or unreachable URLs will result in request failures.
    • Incompatible option combinations (e.g., setting both Operating System and a custom User-Agent header) may cause errors.
    • Exceeding the timeout or cost budget limits may abort the scrape prematurely.
    • Improperly encoded extraction templates or JavaScript scenarios may fail silently or cause errors.
    • Proxy pool or country settings may lead to blocked requests if the proxies are blacklisted.
  • Error Messages:

    • Authentication errors: Verify API key credential is correctly set.
    • Timeout errors: Increase timeout or reduce rendering wait times.
    • Rate limiting or quota exceeded: Check Scrapfly account limits.
    • Invalid parameter errors: Review property values for correctness and encoding.
    • Network errors: Ensure network connectivity and Scrapfly service availability.
  • Resolutions:

    • Double-check all input parameters and their formats.
    • Use debug mode to get detailed logs for troubleshooting.
    • Adjust retry and timeout settings according to target site responsiveness.
    • Consult Scrapfly documentation for proxy pool and anti-scraping configurations.
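For transient timeouts or rate limits that the node's built-in Retry option does not cover, a simple exponential backoff wrapper around the call can help. This is a generic pattern, not a Scrapfly feature:

```typescript
// Generic exponential backoff: retry an async operation with growing
// delays (1s, 2s, 4s, ...) before giving up.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i >= attempts - 1) throw err; // out of attempts, rethrow
      await new Promise((r) => setTimeout(r, 1000 * 2 ** i));
    }
  }
}
```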

Links and References

  • Scrapfly website: https://scrapfly.io
  • Scrapfly API documentation: https://scrapfly.io/docs