
Scrapfly

Scrapfly data collection APIs for web page scraping, screenshots, and AI data extraction

Overview

This node integrates with Scrapfly, a service providing data collection APIs for web scraping, screenshots, and AI-powered data extraction. Specifically, the Scrape API Request operation allows users to perform HTTP requests to scrape web pages or APIs with advanced features like proxy rotation, session management, and anti-scraping protection.

Common scenarios include:

  • Extracting data from websites that require custom headers or specific HTTP methods.
  • Bypassing anti-bot protections using built-in anti-scraping features.
  • Using proxy pools to avoid IP bans or access geo-restricted content.
  • Managing sessions to maintain cookies and fingerprints across multiple requests.

Practical example:

  • Scraping product details from an e-commerce site by sending a GET request with custom headers and rotating proxies to avoid detection (a sketch of the equivalent API call follows below).
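
As an illustration, the sketch below shows roughly the kind of request the node issues to Scrapfly's Scrape API on your behalf; in practice the node builds this call for you. The parameter names used here (key, url, asp, country, proxy_pool, session, and the headers[...] encoding) are assumptions mirroring the node properties documented below, so consult Scrapfly's API documentation for the authoritative names.

    // A minimal sketch of a Scrapfly Scrape API call (TypeScript, Node 18+ global fetch).
    // Parameter names are assumptions based on the node properties described below.
    async function scrapeProductPage(): Promise<void> {
      const params = new URLSearchParams({
        key: process.env.SCRAPFLY_API_KEY ?? '',      // the API key credential
        url: 'https://example-shop.com/product/123',  // page to scrape
        asp: 'true',                                  // Anti-Scraping Protection
        country: 'us',                                // proxy geolocation
        proxy_pool: 'public_residential_pool',        // assumed pool identifier
        session: 'productcrawl1',                     // alphanumeric session name
      });
      // Custom headers; the exact query-string encoding is an assumption.
      params.append('headers[Accept-Language]', 'en-US');

      const response = await fetch(`https://api.scrapfly.io/scrape?${params.toString()}`);
      const payload = await response.json();
      console.log(payload);
    }

    scrapeProductPage().catch(console.error);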

Properties

  • URL: The web page URL to scrape.
  • Method: The HTTP method to use for the request. Options: GET, HEAD, OPTIONS, PATCH, POST, PUT.
  • Additional Fields: A collection of optional parameters:
    • Body: The HTTP request body (for methods such as POST or PUT).
    • Headers: Custom HTTP headers as key-value pairs to include in the request (e.g., Accept-Language: en-US).
    • Proxy Pool: The proxy pool to use for the request. Options: Public Datacenter Pool, Public Residential Pool.
    • Country: The country code for proxy geolocation (e.g., us for United States).
    • Anti-Scraping Protection (asp): Enable to bypass anti-bot protections automatically.
    • Session: A named session string to reuse cookies, fingerprint, and proxy across multiple scrapes. Must be alphanumeric and at most 255 characters.
    • Session Sticky Proxy: Whether to reuse the same proxy IP within the session (best effort). Defaults to true.
    • Debug: Enable debug mode to get detailed logs for troubleshooting.
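
For illustration, the e-commerce scenario from the Overview could be expressed as the object below. The interface and field names are hypothetical (the real node collects these values through the n8n editor UI), but the values mirror the properties listed above.

    // Hypothetical shape mirroring the properties above; illustration only.
    interface ScrapflyScrapeParameters {
      url: string;
      method: 'GET' | 'HEAD' | 'OPTIONS' | 'PATCH' | 'POST' | 'PUT';
      additionalFields?: {
        body?: string;                     // request body for POST/PUT
        headers?: Record<string, string>;  // custom headers as key-value pairs
        proxyPool?: 'Public Datacenter Pool' | 'Public Residential Pool';
        country?: string;                  // e.g. "us"
        asp?: boolean;                     // Anti-Scraping Protection
        session?: string;                  // alphanumeric, max 255 characters
        sessionStickyProxy?: boolean;      // defaults to true
        debug?: boolean;
      };
    }

    const productScrape: ScrapflyScrapeParameters = {
      url: 'https://example-shop.com/product/123',
      method: 'GET',
      additionalFields: {
        headers: { 'Accept-Language': 'en-US' },
        proxyPool: 'Public Residential Pool',
        country: 'us',
        asp: true,
        session: 'productcrawl1',
      },
    };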

Output

The node outputs an array of JSON objects representing the response from the Scrapfly API for each input item processed. Each JSON object typically contains:

  • The scraped data or API response content.
  • Metadata about the request such as status codes, headers, and any error messages if applicable.

Binary data is not typical for this operation, which focuses on JSON/text responses; if it were returned, it would represent downloaded files or screenshots.
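
A downstream n8n Code node (mode: Run Once for All Items) can pick the relevant pieces out of each output item, as in the sketch below. $input.all() and the { json: ... } return shape are standard Code node conventions; the nested field names (result.content, result.status_code) are assumptions about Scrapfly's response shape, so inspect one real item to confirm them.

    // Downstream n8n Code node: flatten each Scrapfly response item.
    // The nested field names are assumptions; adjust after inspecting real output.
    const flattened = $input.all().map((item) => {
      const response = item.json;
      return {
        json: {
          content: response.result?.content ?? null,         // scraped page body
          statusCode: response.result?.status_code ?? null,  // upstream HTTP status
          raw: response,                                      // keep the full payload
        },
      };
    });

    return flattened;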

Dependencies

  • Requires an active Scrapfly API key credential configured in n8n.
  • Internet access to reach Scrapfly's API endpoints.
  • Proxy usage is optional and relies on the Scrapfly proxy pool selected in the node properties.

Troubleshooting

  • Common issues:

    • Invalid or missing API key: Ensure the Scrapfly API key credential is correctly set up.
    • Network errors or timeouts: Check internet connectivity and Scrapfly service status.
    • Incorrect URL or unsupported HTTP method: Verify the URL format and method compatibility.
    • Proxy-related errors: If using proxy pools, ensure the selected pool is available and supports the target region.
    • Session misconfiguration: Session names must be alphanumeric and at most 255 characters.
  • Error messages:

    • Authentication failures usually indicate invalid API credentials.
    • HTTP errors (4xx, 5xx) reflect issues with the target server or request parameters.
    • Debug mode can be enabled to get more detailed error information for diagnosis.
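
Building on the checks above, a small pre-flight helper can catch the URL and session issues before a request is ever sent. This sketch only encodes the constraints stated in this document and is not part of the node itself.

    // Illustrative pre-flight checks mirroring the common issues listed above.
    function validateScrapeConfig(url: string, session?: string): string[] {
      const problems: string[] = [];

      // Incorrect URL format.
      try {
        new URL(url);
      } catch {
        problems.push(`Invalid URL: ${url}`);
      }

      // Session names must be alphanumeric and at most 255 characters.
      if (session !== undefined) {
        if (!/^[A-Za-z0-9]+$/.test(session)) {
          problems.push('Session name must be alphanumeric.');
        }
        if (session.length > 255) {
          problems.push('Session name must be at most 255 characters.');
        }
      }

      return problems;
    }

    // Surfaces the session problem before the request is sent.
    console.log(validateScrapeConfig('https://example-shop.com/product/123', 'my session!'));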

Links and References

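  • Scrapfly website and API documentation: https://scrapfly.io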