ScrapeNinja

Consume the ScrapeNinja Web Scraping API. See the full documentation at https://scrapeninja.net/docs/

Overview

The ScrapeNinja node's "Scrape Single Page (Browser, Slow)" operation scrapes a single web page in a real browser environment (headless Chrome). Because JavaScript executes on the target page, this operation can scrape dynamic content that requires JS rendering, handle complex interactions, and capture screenshots. It is particularly useful for:

  • Extracting data from modern websites that rely heavily on client-side JavaScript.
  • Bypassing basic anti-bot protections by mimicking real browser behavior.
  • Capturing rendered HTML, screenshots, and even specific iframe contents.

Practical examples:

  • Scraping product details from e-commerce sites with dynamic loading.
  • Collecting news articles from sites that render content via JavaScript.
  • Taking screenshots of landing pages for monitoring or archival purposes.

Properties

Below are the supported input properties for this operation, with their display names, types, and meanings:

  • URL to Scrape (string): The URL of the web page to scrape.
  • Headers (string[]): Custom request headers, one per line ("HeaderName: value"). User-Agent and other basic headers are added automatically.
  • Retry Count (number): Number of retry attempts when a retry condition is met (e.g., HTTP errors, unexpected text, or unexpected status codes).
  • Geo Location (options): Proxy geo location or custom proxy selection. When a geo option is used, each attempt may go through a different IP.
  • Custom Proxy URL (string): Premium or custom proxy URL; used only when Geo Location is set to "[Custom or Premium Proxy]".
  • Text Not Expected (string[]): Text patterns that, if found in the response, trigger a retry with another proxy.
  • Status Not Expected (number[]): HTTP status codes that trigger a retry with another proxy. Defaults include 403 and 502.
  • Extractor (Custom JS) (string): Custom JavaScript function for extracting data from the HTML. It receives the page HTML and a Cheerio parser as arguments and must return a JSON object.
  • Timeout (Seconds) (number): Timeout per attempt, in seconds, for JS-based scraping.
  • Wait For Selector (string): CSS selector to wait for before the page is considered loaded.
  • Dump Iframe (string): Name of an iframe to dump; waits for this iframe to appear in the DOM.
  • Wait For Selector in Iframe (string): CSS selector to wait for inside the specified iframe.
  • Extractor Target Iframe (boolean): Run the custom extractor on the iframe HTML instead of the main page.
  • Block Images (boolean): Block images in Chrome to speed up page loading.
  • Block Media (CSS, Fonts) (boolean): Block CSS and fonts in Chrome to speed up page loading.
  • Screenshot (boolean): Take a screenshot of the page (slower when enabled).
  • Catch Ajax Headers URL Mask (string): If set, captures and dumps XHR requests/responses whose URL matches this mask.
  • Post-Load Wait Time (number): Seconds to wait after page load (1–12s). Use this if automatic waiting fails.
  • Viewport Settings (JSON) (string): Advanced: custom viewport size/settings as a JSON object. Default is 1920x1080.
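As described above, the "Extractor (Custom JS)" property takes a function that receives the page HTML and a Cheerio parser and must return a JSON object. A minimal sketch of such a function is shown below; the function name and the selectors are illustrative only, so check the ScrapeNinja documentation for the exact expected signature:

```javascript
// Illustrative custom extractor: receives the raw page HTML and a Cheerio
// instance (per the property description above) and returns a plain JSON
// object. The selectors below are examples; adapt them to your target page.
function extract(input, cheerio) {
  const $ = cheerio.load(input);
  return {
    // First <h1> text, whitespace-trimmed
    title: $('h1').first().text().trim(),
    // All href attributes of <a> elements, as a plain array
    links: $('a')
      .map((i, el) => $(el).attr('href'))
      .get(),
  };
}
```

The extractor runs inside ScrapeNinja's environment, so it should not reference outside variables and should always return a serializable object, even when nothing matches (e.g., an empty array rather than undefined).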

Output

The output will be a JSON object containing the results of the scraping operation. The structure can vary depending on the options selected, but typically includes:

  • Extracted Data: If a custom extractor is provided, the returned JSON object from your extractor function.
  • HTML Content: The full HTML of the page (or iframe, if specified).
  • Screenshot (optional): If enabled, binary data representing the screenshot (as a file attachment).
  • XHR/Ajax Data (optional): If "Catch Ajax Headers URL Mask" is set, relevant request/response data.
  • Meta Information: Such as HTTP status, headers, timing, and possibly proxy/geolocation info.

Note: If binary data (like screenshots) is included, it will be available in the binary output field.
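For orientation only, a run with a custom extractor might produce output roughly shaped like the fragment below. The field names here are hypothetical, not taken from this document; consult the ScrapeNinja API documentation for the exact response structure:

```json
{
  "info": {
    "statusCode": 200,
    "headers": {}
  },
  "body": "<html>...</html>",
  "extractor": {
    "result": { "title": "Example Domain" }
  }
}
```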


Dependencies

  • External Service: Requires access to the ScrapeNinja API.
  • API Key: You must configure the scrapeNinjaApi credential in n8n.
  • Proxy (optional): For custom proxies, follow the proxy setup guide.

Troubleshooting

Common Issues:

  • Invalid API Key: Ensure your ScrapeNinja API credentials are correctly configured in n8n.
  • Timeouts: Increase the "Timeout (Seconds)" property if the target site is slow to load.
  • Blocked Requests: Some sites may still block scraping despite browser emulation. Try changing the Geo Location or using a premium proxy.
  • Extractor Errors: If your custom JS extractor throws an error, ensure it returns a valid JSON object and uses the correct function signature.
  • Binary Output Handling: If you enable screenshots, make sure downstream nodes can handle binary data.

Error Messages:

  • { "error": "<message>", "details": "<additional details>" }: General error format. Check the message and details for clues (e.g., network errors, invalid selectors, extractor exceptions).
  • HTTP Status Not Expected: If you see retries or failures due to status codes, adjust the "Status Not Expected" list as needed.
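For instance, a retry configuration along the following lines widens the retry triggers; the text patterns and status codes are illustrative, so choose values that match the block pages your target site actually serves:

```
Text Not Expected:
  Access Denied
  captcha
Status Not Expected:
  403, 429, 502
```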

Links and References

  • ScrapeNinja documentation: https://scrapeninja.net/docs/
