
ScrapeNinja

Consume the ScrapeNinja Web Scraping API. See the full documentation at https://scrapeninja.net/docs/

Overview

The ScrapeNinja node's "Crawl Website (Many Pages)" operation allows you to start a comprehensive web crawling process from a specified URL, traversing multiple pages of a website according to your configuration. This is useful for scenarios such as:

  • Collecting large datasets from websites for research or analysis.
  • Indexing documentation or blog sections for search or archiving.
  • Monitoring website changes across many pages.
  • Gathering content for training language models.

Practical Example:
You want to crawl all HTML pages under https://example.com/docs/, excluding PDFs and admin sections, and store the results in a database for later processing.
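
As a sketch, the settings for that scenario could look like the following. The property names below are paraphrased from the node's inputs (listed under Properties), and the exact wildcard semantics come from the node's URL Pattern Matching Guide, so treat the patterns as illustrative rather than verified:

// Illustrative crawl settings for the /docs/ example above; the key names
// and wildcard patterns are a sketch, not the node's exact defaults.
const crawlSettings = {
  startUrl: "https://example.com/docs/",
  maxDepth: 3,           // follow links up to 3 levels below the start page
  maxPages: 500,         // hard stop after 500 pages
  concurrentRequests: 3, // the node allows 1-5 simultaneous requests
  urlInclusionPatterns: ["https://example.com/docs/**"], // stay inside /docs/
  urlExclusionPatterns: ["**/*.pdf", "**/admin/**"],     // skip PDFs and admin sections
};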

Properties

Below are the supported input properties for this operation, with their display names, types, and meanings:

  • Crawler Settings (notice): Informational message noting that the crawler node can take a long time to finish and that progress can be tracked via logs and the Postgres tables.
  • Start URL (string, required): The initial URL where the crawler begins its traversal. Example: https://example.com
  • Max Depth (number): Limits how deep the crawler traverses from the start page (1 = only the start page).
  • Max Pages (number): Maximum number of pages to crawl before stopping.
  • Concurrent Requests (number): Number of simultaneous requests (1-5) the crawler makes. Controls speed and server load.
  • URL Pattern Matching Guide (notice): Explains how to use wildcards (*, **) in inclusion/exclusion patterns.
  • URL Inclusion Patterns (string[]): Only URLs matching these patterns are crawled. Supports wildcards for flexible matching.
  • URL Exclusion Patterns (string[]): URLs matching these patterns are skipped. Useful for avoiding unwanted content.
  • Re-Set Crawler Tables (boolean): If enabled, drops and recreates all crawler-related tables before starting. Use with caution.
  • HTML embedding warning (notice): Warns about potential memory issues when including HTML in the output; only enable the next parameter when crawling fewer than 30 pages.
  • Embed HTML of Scraped Pages in Node Output (boolean): If enabled, includes the full HTML of each scraped page in the node's output. Not recommended for large crawls.
  • Scraping Engine Settings (notice): Information about scraping engine options.
  • Engine Type (options): Selects the scraping engine. Fast (No JS) is high-performance with no JavaScript execution; Real Browser (With JS) uses real Chrome, supports JavaScript, and is slower.
  • Headers (string[]): Custom HTTP headers to send with requests, one per line, e.g. X-Header: value. Basic headers are added automatically.
  • Retry Count (number): Number of retry attempts if certain conditions fail.
  • Geo Location (options): Proxy location or custom proxy selection. Options include US, EU, Australia, and others.
  • Custom Proxy URL (string): URL of a premium or custom proxy. Only shown if Geo Location is set to "[Custom or Premium Proxy]".
  • Text Not Expected (string[]): Text patterns that, if found in a response, trigger a retry with another proxy.
  • Status Not Expected (number[]): HTTP status codes that trigger a retry with another proxy.
  • Follow Redirects (boolean): Whether to follow HTTP redirects (Fast engine only).
  • Timeout (Seconds) (number): Timeout per attempt, in seconds, for the Fast engine.
  • Timeout (Seconds) (number): Timeout per attempt, in seconds, for the JS-based engine.
  • Wait For Selector (string): CSS selector to wait for before considering a JS-rendered page loaded.
  • Block Images (boolean): Blocks images in real Chrome to speed up loading (JS engine only).
  • Block Media (CSS, Fonts) (boolean): Blocks CSS and fonts in real Chrome to speed up loading (JS engine only).
  • Post-Load Wait Time (number): Wait time in seconds after page load for the JS engine (0-12 seconds).
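
For the engine-related properties, a hedged sketch of typical values might look like this. Again, the key names paraphrase the display names above and the values are examples, not documented defaults:

// Illustrative engine settings; key names paraphrase the property display names.
const engineSettings = {
  engineType: "Fast (No JS)",         // or "Real Browser (With JS)" for JS-heavy sites
  headers: [
    "X-Header: value",                // one "Name: value" header per line
    "Accept-Language: en-US,en;q=0.9",
  ],
  retryCount: 2,
  geoLocation: "US",
  textNotExpected: ["Access denied"], // retry with another proxy if this text appears
  statusNotExpected: [403, 503],      // retry with another proxy on these status codes
  followRedirects: true,              // Fast engine only
  timeoutSeconds: 20,
};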

Output

The output of this operation is an array of objects, each representing a crawled page. The structure typically includes:

{
  "url": "https://example.com/page",
  "status": 200,
  "headers": { /* response headers */ },
  "contentType": "text/html",
  "responseTimeMs": 1234,
  "depth": 2,
  "matchedIncludePattern": "/docs/**",
  "matchedExcludePattern": null,
  "html": "<!DOCTYPE html>...</html>", // Only present if "Embed HTML..." is enabled
  "error": null // or error message if failed
}
  • url: The URL of the crawled page.
  • status: HTTP status code returned by the server.
  • headers: Response headers.
  • contentType: MIME type of the response.
  • responseTimeMs: Time taken to fetch the page.
  • depth: How deep this page was from the start URL.
  • matchedIncludePattern: Which inclusion pattern matched (if any).
  • matchedExcludePattern: Which exclusion pattern matched (if any).
  • html: Full HTML content (only if "Embed HTML..." is enabled).
  • error: Error message if the crawl failed for this page.

Note: For large crawls, HTML is not included unless explicitly enabled due to memory concerns.
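
If you post-process these items in a downstream Code node or script, a minimal sketch (assuming only the output shape shown above) could look like this:

// Minimal sketch: keep only successfully crawled HTML pages.
// The CrawledPage shape mirrors the example object above; "html" is only
// present when "Embed HTML of Scraped Pages in Node Output" is enabled.
interface CrawledPage {
  url: string;
  status: number;
  contentType: string;
  depth: number;
  html?: string;
  error: string | null;
}

function successfulHtmlPages(pages: CrawledPage[]): CrawledPage[] {
  return pages.filter(
    (p) => p.error === null && p.status === 200 && p.contentType.includes("text/html")
  );
}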

Dependencies

  • External Services:
    • Requires access to the ScrapeNinja API.
    • Requires a valid ScrapeNinja API key (configured as n8n credentials).
    • Requires a PostgreSQL database connection (configured as n8n credentials) for storing crawl state and results.
  • n8n Configuration:
    • Credentials for both the ScrapeNinja API and Postgres must be set up in n8n.

Troubleshooting

Common Issues:

  • Long Execution Times: Crawling many pages can take significant time. Monitor progress using n8n logs or by querying the crawler_runs, crawler_queue, and crawler_logs tables in your Postgres database (see the query sketch after this list).
  • Memory Usage: Enabling "Embed HTML..." for large crawls may cause high memory usage or node failures. For large crawls, retrieve HTML directly from the database instead.
  • Proxy/Geo Errors: Invalid or misconfigured proxies may result in connection errors or incomplete crawls.
  • Pattern Mismatches: Incorrect inclusion/exclusion patterns may result in missing or extra pages being crawled.
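
For the monitoring point above, a minimal query sketch is shown below. Only the table names come from this documentation; their columns are not documented here, so the queries avoid referencing specific columns:

// Sketch: peek at crawl progress via the Postgres tables the node maintains.
// Table names (crawler_runs, crawler_queue, crawler_logs) are taken from the
// troubleshooting notes above; column names are unknown, so we use SELECT *.
import { Client } from "pg";

async function checkCrawlProgress(connectionString: string): Promise<void> {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    const runs = await client.query("SELECT * FROM crawler_runs LIMIT 5");
    const queue = await client.query("SELECT COUNT(*) AS pending FROM crawler_queue");
    console.log("Recent runs:", runs.rows);
    console.log("URLs still queued:", queue.rows[0].pending);
  } finally {
    await client.end();
  }
}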

Error Messages:

  • "No additional details available": Indicates a generic failure without specific error data. Check logs and database tables for more information.
  • HTTP/Network Errors: If a page fails to load, the error field in the output will contain the error message. Review the error and adjust settings (e.g., increase retries, check proxy).
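
When a crawl partially fails, grouping the failed items by their error message makes review easier. A small sketch, assuming only the url and error fields documented in the Output section:

// Sketch: group failed pages by error message for quick triage.
function failuresByError(pages: Array<{ url: string; error: string | null }>): Map<string, string[]> {
  const grouped = new Map<string, string[]>();
  for (const page of pages) {
    if (page.error === null) continue;      // skip successful pages
    const urls = grouped.get(page.error) ?? [];
    urls.push(page.url);
    grouped.set(page.error, urls);
  }
  return grouped;
}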

Links and References

  • ScrapeNinja documentation: https://scrapeninja.net/docs/