
Scrapfly

Scrapfly data collection APIs for web page scraping, screenshots, and AI data extraction

Overview

The Scrapfly node provides a comprehensive interface to Scrapfly's data collection APIs, enabling users to scrape web pages, capture screenshots, perform AI-powered data extraction, and access account information. Specifically, the Scrape Web Page operation allows users to programmatically retrieve content from any web page URL with advanced options such as proxy usage, JavaScript rendering, anti-scraping bypass, caching, and more.

This node is useful wherever automated, reliable web scraping is needed without managing infrastructure or dealing with common challenges such as IP blocking, dynamic content loading, or bot detection. For example, it can be used to:

  • Extract product details from e-commerce sites that require JavaScript rendering.
  • Collect news articles or blog posts while avoiding anti-bot protections.
  • Capture structured data from complex web pages using custom extraction templates.
  • Automate monitoring of website changes with caching and retry logic.
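To make the options described below concrete, here is a minimal sketch of the kind of HTTP request the node issues to Scrapfly's scrape API. The endpoint and the key, url, render_js, asp, and country parameters follow Scrapfly's public scrape API; the response field result.content is an assumption to verify against the current API reference.

```typescript
// Minimal sketch (Node 18+): scrape a JS-rendered page via Scrapfly's API.
// The response field `result.content` is an assumption -- check the actual
// payload returned for your account before relying on it.
async function scrape(url: string): Promise<string> {
  const params = new URLSearchParams({
    key: process.env.SCRAPFLY_KEY ?? '', // same API key the n8n credential holds
    url,
    render_js: 'true', // headless browser rendering
    asp: 'true',       // anti-scraping protection
    country: 'us',     // proxy geolocation
  });
  const res = await fetch(`https://api.scrapfly.io/scrape?${params}`);
  if (!res.ok) throw new Error(`Scrapfly request failed: ${res.status}`);
  const data = await res.json();
  return data.result?.content ?? '';
}

scrape('https://example.com/product/123').then((html) => console.log(html.slice(0, 200)));
```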

Properties

| Name | Meaning |
| --- | --- |
| URL | The web page URL to scrape. |
| Method | HTTP method to use for the request. Options: GET, HEAD, OPTIONS, PATCH, POST, PUT. |
| Additional Fields | A collection of optional parameters to customize the scrape, listed in the table below. |

Additional Fields options:

| Option | Meaning |
| --- | --- |
| Body | HTTP request body content (for methods such as POST). |
| Headers | Custom HTTP headers as key-value pairs (e.g., Accept-Language). |
| Retry | Whether to retry the request on failure (boolean). |
| Timeout | Maximum time allowed for the scrape, in milliseconds (default 150000 ms). |
| Proxy Pool | Proxy IP address pool to use. Options: Public Datacenter Pool, Public Residential Pool. |
| Country | Proxy geolocation country code (e.g., "us"). |
| Anti-Scraping Protection (ASP) | Enable anti-scraping protection to bypass anti-bot systems (boolean). |
| Cost Budget | Budget limit for dynamic retries/upgrades during ASP, to control cost (number). |
| Render JS | Enable JavaScript rendering via a headless browser (boolean). |
| Auto Scroll | Automatically scroll down the page when rendering with the headless browser (boolean). |
| Rendering Wait | Time in milliseconds to wait after page load before scraping when rendering JS (default 1000 ms). |
| Rendering Stage | Stage to wait for during rendering: Complete (full page load) or Dom Content Load (faster, partial load). |
| Wait For Selector | XPath or CSS selector to wait for before scraping (string). |
| JavaScript Injection | JavaScript code to inject and execute in the headless browser context (string). |
| JavaScript Scenario | Base64-encoded JSON describing scripted interactions (clicks, waits, fills) for the headless browser (see the encoding sketch after this table). |
| Screenshots | Multiple named screenshots to take, each targeting a selector or "fullpage" (see the sketch after this table). |
| Screenshot Flags | Flags to customize screenshot behavior: Block Banners, Dark Mode, High Quality, Load Images, Print Media Format. |
| Format | Output format of the scraped content: Clean HTML, JSON, Markdown, Raw (default), Text. |
| Format Options | Options for the Markdown format: No Links, No Images, Only Content. |
| Extraction Template | Base64-encoded JSON template to extract structured data from the scraped content (see the sketch after this table). |
| Extraction Prompt | Instruction text for AI-based extraction or question answering on the scraped content. |
| Extraction Model | AI model identifier for automatic parsing of the scraped document. |
| Session | Alphanumeric session name used to reuse cookies, fingerprint, and proxy across scrapes. |
| Session Sticky Proxy | Whether to reuse the same proxy IP within a session (boolean). |
| Cache | Enable the cache layer to return cached content when available (boolean). |
| Cache TTL | Cache time-to-live in seconds (default 86400, i.e., one day). |
| Cache Clear | Force a cache refresh and scrape anew (boolean). |
| Proxified Response | Return the raw page content as the response body, replacing status code and headers accordingly (boolean). |
| Debug | Enable debug mode for detailed logs (boolean). |
| Tags | Tags to group scrapes for filtering in the dashboard; multiple string tags allowed. |
| Operating System | Operating system to emulate in the user agent: Win11 (default), Mac, Linux, Chromeos. Cannot be set together with a User-Agent header. |
| Language | Web page language setting; configures the Accept-Language header. |
| Geolocation | Latitude and longitude to spoof for the geolocation permission, in the format "latitude,longitude". |
| DNS | Retrieve DNS information for the target (boolean). |
| SSL | Retrieve the remote SSL certificate and TLS information (HTTPS targets only) (boolean). |
| Correlation ID | Helper ID to correlate groups of scrapes for monitoring purposes (string). |
| Webhook | Queue the scrape request and redirect the API response to a named webhook endpoint (string). |
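The JavaScript Scenario field expects Base64-encoded JSON. A sketch of building and encoding a scenario follows; the step names (fill, click, wait_for_selector) mirror Scrapfly's scenario format as documented, but verify them against the current API reference.

```typescript
// Hypothetical scenario: fill a search box, submit the form, then wait
// for the results container. Step names are assumptions based on
// Scrapfly's scenario format -- confirm them before use.
const scenario = [
  { fill: { selector: 'input[name="q"]', value: 'n8n' } },
  { click: { selector: 'button[type="submit"]' } },
  { wait_for_selector: { selector: '.results' } },
];

// The node expects the scenario as a Base64-encoded JSON string:
const encodedScenario = Buffer.from(JSON.stringify(scenario)).toString('base64');
```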
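The Extraction Template field is encoded the same way. The template structure below is purely illustrative; the actual field names and schema are defined by Scrapfly's extraction API, so consult its documentation for the real structure.

```typescript
// Illustrative template only -- `source` and `selectors` are placeholder
// names, not Scrapfly's confirmed schema.
const template = {
  source: 'html',
  selectors: [
    { name: 'title', query: 'h1', type: 'css' },
    { name: 'price', query: '.price', type: 'css' },
  ],
};
const encodedTemplate = Buffer.from(JSON.stringify(template)).toString('base64');
```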
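Named screenshots map to per-name parameters on the underlying API. The screenshots[name]=target encoding shown below is an assumption drawn from Scrapfly's query-parameter convention; the node builds these parameters for you from the Screenshots field.

```typescript
// Sketch: two named screenshots, one scoped to a CSS selector and one
// covering the whole page. The screenshots[name]=target encoding is an
// assumption about how the node serializes the Screenshots field.
const params = new URLSearchParams({ key: '...', url: 'https://example.com' });
params.set('screenshots[hero]', '.hero-banner'); // capture a single element
params.set('screenshots[full]', 'fullpage');     // capture the entire page
```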

Output

The node outputs an array of JSON objects, one per scrape request. The structure of each object depends on the requested options but generally includes:

  • The raw or formatted content of the scraped page (HTML, JSON, Markdown, or plain text).
  • Metadata such as HTTP status code, headers, and timing information.
  • Extracted structured data, if extraction templates or AI models were enabled.
  • Screenshot data, if requested (likely as URLs or binary references).
  • Debug information, if debug mode is enabled.
  • Cache status and other operational metadata.

If the Proxified Response option is enabled, the output will contain the direct content of the page as the body, with status code and headers replaced accordingly, matching the selected format.

Binary data output is not explicitly described here, but screenshots or other media may be returned as URLs or references rather than raw binary.
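Downstream nodes can pull the scraped content out of each item. Below is a minimal sketch for an n8n Code node (Run Once for All Items); the result.content field name is an assumption about the response shape, so adjust it to match the output you actually see.

```typescript
// n8n Code node sketch: $input is provided by n8n's Code node runtime.
// The `result.content` path is an assumption -- inspect a real item first.
const out = [];
for (const item of $input.all()) {
  const result = item.json.result ?? item.json;
  out.push({ json: { url: result.url, content: result.content } });
}
return out;
```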

Dependencies

  • Requires an active Scrapfly API key credential configured in n8n.
  • Depends on Scrapfly's external web scraping service for all operations.
  • Network connectivity to Scrapfly API endpoints.
  • Optional: Webhook endpoint configured in Scrapfly dashboard if using webhook queuing.
  • No additional local dependencies are required.

Troubleshooting

  • Common Issues:

    • An invalid or missing API key credential will cause authentication errors.
    • Malformed or unreachable URLs will result in request failures.
    • Incompatible option combinations (e.g., setting both Operating System and a custom User-Agent header) may cause errors.
    • Exceeding the timeout or cost budget limits may abort the scrape prematurely.
    • Improperly encoded extraction templates or JavaScript scenarios may fail silently or cause errors.
    • Proxy pool or country settings may lead to blocked requests if the proxies are blacklisted.
  • Error Messages:

    • Authentication errors: Verify API key credential is correctly set.
    • Timeout errors: Increase timeout or reduce rendering wait times.
    • Rate limiting or quota exceeded: Check Scrapfly account limits.
    • Invalid parameter errors: Review property values for correctness and encoding.
    • Network errors: Ensure network connectivity and Scrapfly service availability.
  • Resolutions:

    • Double-check all input parameters and their formats.
    • Use debug mode to get detailed logs for troubleshooting.
    • Adjust retry and timeout settings according to target site responsiveness.
    • Consult Scrapfly documentation for proxy pool and anti-scraping configurations.
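For transient timeouts or rate limits that the node's built-in Retry option does not cover, a simple exponential backoff wrapper around the call can help. This is a generic pattern, not a Scrapfly feature:

```typescript
// Generic exponential backoff: retry an async operation with growing
// delays (1s, 2s, 4s, ...) before giving up.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i >= attempts - 1) throw err; // out of attempts, rethrow
      await new Promise((r) => setTimeout(r, 1000 * 2 ** i));
    }
  }
}
```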

Links and References

  • Scrapfly website: https://scrapfly.io
  • Scrapfly API documentation: https://scrapfly.io/docs