
ScrapeNinja

Consume the ScrapeNinja Web Scraping API. See the full documentation at https://scrapeninja.net/docs/

Overview

The ScrapeNinja node's "Crawl Website (Many Pages)" operation allows you to start a comprehensive web crawling process from a specified URL, traversing multiple pages of a website according to your configuration. This is useful for scenarios such as:

  • Collecting large datasets from websites for research or analysis.
  • Indexing documentation or blog sections for search or archiving.
  • Monitoring website changes across many pages.
  • Gathering content for training language models.

Practical Example:
You want to crawl all HTML pages under https://example.com/docs/, excluding PDFs and admin sections, and store the results in a database for later processing.
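
As a sketch, the settings for that scenario could look like the following. The property names below are paraphrased from the node's inputs (listed under Properties), and the exact wildcard semantics come from the node's URL Pattern Matching Guide, so treat the patterns as illustrative rather than verified:

// Illustrative crawl settings for the /docs/ example above; the key names
// and wildcard patterns are a sketch, not the node's exact defaults.
const crawlSettings = {
  startUrl: "https://example.com/docs/",
  maxDepth: 3,           // follow links up to 3 levels below the start page
  maxPages: 500,         // hard stop after 500 pages
  concurrentRequests: 3, // the node allows 1-5 simultaneous requests
  urlInclusionPatterns: ["https://example.com/docs/**"], // stay inside /docs/
  urlExclusionPatterns: ["**/*.pdf", "**/admin/**"],     // skip PDFs and admin sections
};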

Properties

Below are the supported input properties for this operation, with their display names, types, and meanings:

  • Crawler Settings (notice): Informational message noting that the crawler node can take a long time to finish and that progress can be tracked via logs and the Postgres tables.
  • Start URL (string, required): The initial URL where the crawler begins its traversal. Example: https://example.com
  • Max Depth (number): Limits how deep the crawler traverses from the start page (1 = only the start page).
  • Max Pages (number): Maximum number of pages to crawl before stopping.
  • Concurrent Requests (number): Number of simultaneous requests (1-5) the crawler makes. Controls speed and server load.
  • URL Pattern Matching Guide (notice): Explains how to use wildcards (*, **) in inclusion/exclusion patterns.
  • URL Inclusion Patterns (string[]): Only URLs matching these patterns are crawled. Supports wildcards for flexible matching.
  • URL Exclusion Patterns (string[]): URLs matching these patterns are skipped. Useful for avoiding unwanted content.
  • Re-Set Crawler Tables (boolean): If enabled, drops and recreates all crawler-related tables before starting. Use with caution.
  • HTML embedding warning (notice): Warns about potential memory issues when including HTML in the output; only enable the next parameter when crawling fewer than 30 pages.
  • Embed HTML of Scraped Pages in Node Output (boolean): If enabled, includes the full HTML of each scraped page in the node's output. Not recommended for large crawls.
  • Scraping Engine Settings (notice): Information about scraping engine options.
  • Engine Type (options): Selects the scraping engine. Fast (No JS) is high-performance with no JavaScript execution; Real Browser (With JS) uses real Chrome, supports JavaScript, and is slower.
  • Headers (string[]): Custom HTTP headers to send with requests, one per line, e.g. X-Header: value. Basic headers are added automatically.
  • Retry Count (number): Number of retry attempts if certain conditions fail.
  • Geo Location (options): Proxy location or custom proxy selection. Options include US, EU, Australia, and others.
  • Custom Proxy URL (string): URL of a premium or custom proxy. Only shown if Geo Location is set to "[Custom or Premium Proxy]".
  • Text Not Expected (string[]): Text patterns that, if found in a response, trigger a retry with another proxy.
  • Status Not Expected (number[]): HTTP status codes that trigger a retry with another proxy.
  • Follow Redirects (boolean): Whether to follow HTTP redirects (Fast engine only).
  • Timeout (Seconds) (number): Timeout per attempt, in seconds, for the Fast engine.
  • Timeout (Seconds) (number): Timeout per attempt, in seconds, for the JS-based engine.
  • Wait For Selector (string): CSS selector to wait for before considering a JS-rendered page loaded.
  • Block Images (boolean): Blocks images in real Chrome to speed up loading (JS engine only).
  • Block Media (CSS, Fonts) (boolean): Blocks CSS and fonts in real Chrome to speed up loading (JS engine only).
  • Post-Load Wait Time (number): Wait time in seconds after page load for the JS engine (0-12 seconds).
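
For the engine-related properties, a hedged sketch of typical values might look like this. Again, the key names paraphrase the display names above and the values are examples, not documented defaults:

// Illustrative engine settings; key names paraphrase the property display names.
const engineSettings = {
  engineType: "Fast (No JS)",         // or "Real Browser (With JS)" for JS-heavy sites
  headers: [
    "X-Header: value",                // one "Name: value" header per line
    "Accept-Language: en-US,en;q=0.9",
  ],
  retryCount: 2,
  geoLocation: "US",
  textNotExpected: ["Access denied"], // retry with another proxy if this text appears
  statusNotExpected: [403, 503],      // retry with another proxy on these status codes
  followRedirects: true,              // Fast engine only
  timeoutSeconds: 20,
};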

Output

The output of this operation is an array of objects, each representing a crawled page. The structure typically includes:

{
  "url": "https://example.com/page",
  "status": 200,
  "headers": { /* response headers */ },
  "contentType": "text/html",
  "responseTimeMs": 1234,
  "depth": 2,
  "matchedIncludePattern": "/docs/**",
  "matchedExcludePattern": null,
  "html": "<!DOCTYPE html>...</html>", // Only present if "Embed HTML..." is enabled
  "error": null // or error message if failed
}
  • url: The URL of the crawled page.
  • status: HTTP status code returned by the server.
  • headers: Response headers.
  • contentType: MIME type of the response.
  • responseTimeMs: Time taken to fetch the page.
  • depth: How deep this page was from the start URL.
  • matchedIncludePattern: Which inclusion pattern matched (if any).
  • matchedExcludePattern: Which exclusion pattern matched (if any).
  • html: Full HTML content (only if "Embed HTML..." is enabled).
  • error: Error message if the crawl failed for this page.

Note: For large crawls, HTML is not included unless explicitly enabled due to memory concerns.
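
If you post-process these items in a downstream Code node or script, a minimal sketch (assuming only the output shape shown above) could look like this:

// Minimal sketch: keep only successfully crawled HTML pages.
// The CrawledPage shape mirrors the example object above; "html" is only
// present when "Embed HTML of Scraped Pages in Node Output" is enabled.
interface CrawledPage {
  url: string;
  status: number;
  contentType: string;
  depth: number;
  html?: string;
  error: string | null;
}

function successfulHtmlPages(pages: CrawledPage[]): CrawledPage[] {
  return pages.filter(
    (p) => p.error === null && p.status === 200 && p.contentType.includes("text/html")
  );
}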

Dependencies

  • External Services:
    • Requires access to the ScrapeNinja API.
    • Requires a valid ScrapeNinja API key (configured as n8n credentials).
    • Requires a PostgreSQL database connection (configured as n8n credentials) for storing crawl state and results.
  • n8n Configuration:
    • Credentials for both the ScrapeNinja API and Postgres must be set up in n8n.

Troubleshooting

Common Issues:

  • Long Execution Times: Crawling many pages can take significant time. Monitor progress using n8n logs or by querying the crawler_runs, crawler_queue, and crawler_logs tables in your Postgres database (see the query sketch after this list).
  • Memory Usage: Enabling "Embed HTML..." for large crawls may cause high memory usage or node failures. For large crawls, retrieve HTML directly from the database instead.
  • Proxy/Geo Errors: Invalid or misconfigured proxies may result in connection errors or incomplete crawls.
  • Pattern Mismatches: Incorrect inclusion/exclusion patterns may result in missing or extra pages being crawled.
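
For the monitoring point above, a minimal query sketch is shown below. Only the table names come from this documentation; their columns are not documented here, so the queries avoid referencing specific columns:

// Sketch: peek at crawl progress via the Postgres tables the node maintains.
// Table names (crawler_runs, crawler_queue, crawler_logs) are taken from the
// troubleshooting notes above; column names are unknown, so we use SELECT *.
import { Client } from "pg";

async function checkCrawlProgress(connectionString: string): Promise<void> {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    const runs = await client.query("SELECT * FROM crawler_runs LIMIT 5");
    const queue = await client.query("SELECT COUNT(*) AS pending FROM crawler_queue");
    console.log("Recent runs:", runs.rows);
    console.log("URLs still queued:", queue.rows[0].pending);
  } finally {
    await client.end();
  }
}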

Error Messages:

  • "No additional details available": Indicates a generic failure without specific error data. Check logs and database tables for more information.
  • HTTP/Network Errors: If a page fails to load, the error field in the output will contain the error message. Review the error and adjust settings (e.g., increase retries, check proxy).
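
When a crawl partially fails, grouping the failed items by their error message makes review easier. A small sketch, assuming only the url and error fields documented in the Output section:

// Sketch: group failed pages by error message for quick triage.
function failuresByError(pages: Array<{ url: string; error: string | null }>): Map<string, string[]> {
  const grouped = new Map<string, string[]>();
  for (const page of pages) {
    if (page.error === null) continue;      // skip successful pages
    const urls = grouped.get(page.error) ?? [];
    urls.push(page.url);
    grouped.set(page.error, urls);
  }
  return grouped;
}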

Links and References

  • ScrapeNinja documentation: https://scrapeninja.net/docs/