Actions
Overview
The ScrapeNinja node's "Crawl Website (Many Pages)" operation allows you to start a comprehensive web crawling process from a specified URL, traversing multiple pages of a website according to your configuration. This is useful for scenarios such as:
- Collecting large datasets from websites for research or analysis.
- Indexing documentation or blog sections for search or archiving.
- Monitoring website changes across many pages.
- Gathering content for training language models.
Practical Example:
You want to crawl all HTML pages under https://example.com/docs/, excluding PDFs and admin sections, and store the results in a database for later processing.
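As a rough sketch of that configuration, the inclusion and exclusion patterns could look like the following. The JSON keys are only illustrative shorthand for the node's properties (Start URL, Max Depth, URL Inclusion Patterns, and so on); in n8n these are set through the node's UI, and the depth and page limits shown are arbitrary example values.

```json
{
  "startUrl": "https://example.com/docs/",
  "maxDepth": 3,
  "maxPages": 500,
  "concurrentRequests": 2,
  "urlInclusionPatterns": ["https://example.com/docs/**"],
  "urlExclusionPatterns": ["**/*.pdf", "**/admin/**"]
}
```

The include pattern keeps the crawler inside the docs section, while the exclude patterns drop PDF files and anything under an admin path.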
Properties
Below are the supported input properties for this operation, with their display names, types, and meanings:
| Display Name | Type | Meaning |
|---|---|---|
| Crawler Settings. Crawler node can take long time to finish! ... | notice | Informational message about crawler duration and progress tracking via logs and Postgres tables. |
| Start URL | string (required) | The initial URL where the crawler begins its traversal. Example: https://example.com |
| Max Depth | number | Limits how deep the crawler will traverse from the start page. (1 = only the start page) |
| Max Pages | number | Maximum number of pages to crawl before stopping. |
| Concurrent Requests | number | Number of simultaneous requests (1-5) the crawler will make. Controls speed and server load. |
| URL Pattern Matching Guide | notice | Explains how to use wildcards (*, **) in inclusion/exclusion patterns. |
| URL Inclusion Patterns | string[] | Only URLs matching these patterns will be crawled. Supports wildcards for flexible matching. |
| URL Exclusion Patterns | string[] | URLs matching these patterns will be skipped. Useful for avoiding unwanted content. |
| Re-Set Crawler Tables | boolean | If enabled, drops and recreates all crawler-related tables before starting. Use with caution. |
| WARNING: Only enable next parameter if crawling less than 30 pages... | notice | Warns about potential memory issues when including HTML in output. |
| Embed HTML of Scraped Pages in Node Output | boolean | If enabled, includes the full HTML of each scraped page in the node's output. Not recommended for large crawls. |
| Scraping Engine Settings | notice | Information about scraping engine options. |
| Engine Type | options | Selects the scraping engine: Fast (No JS) is high-performance with no JavaScript execution; Real Browser (With JS) uses real Chrome, supports JavaScript, and is slower. |
| Headers | string[] | Custom HTTP headers to send with requests. One per line, e.g., X-Header: value. Basic headers are added automatically. |
| Retry Count | number | Number of retry attempts when a request fails or the response matches the Text Not Expected / Status Not Expected conditions (see the sketch after this table). |
| Geo Location | options | Proxy location or custom proxy selection. Options include US, EU, Australia, etc. |
| Custom Proxy URL | string | URL for a premium or custom proxy. Only shown if Geo Location is set to "[Custom or Premium Proxy]". |
| Text Not Expected | string[] | List of text patterns; if found in a response, triggers a retry with another proxy. |
| Status Not Expected | number[] | HTTP status codes that trigger a retry with another proxy. |
| Follow Redirects | boolean | Whether to follow HTTP redirects (only for Fast engine). |
| Timeout (Seconds) | number | Timeout per attempt (in seconds) for Fast engine. |
| Timeout (Seconds) | number | Timeout per attempt (in seconds) for JS-based engine. |
| Wait For Selector | string | CSS selector to wait for before considering a JS-rendered page loaded. |
| Block Images | boolean | Blocks images in real Chrome to speed up loading (JS engine only). |
| Block Media (CSS, Fonts) | boolean | Blocks CSS/fonts in real Chrome to speed up loading (JS engine only). |
| Post-Load Wait Time | number | Wait time (seconds) after page load for JS engine (0-12s). |
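To make the retry-related options above more concrete (Retry Count, Text Not Expected, Status Not Expected, and Headers), here is a hedged sketch of values one might use. The header lines, text fragments, and status codes are assumptions chosen for illustration, not defaults, and the JSON keys are shorthand for the node's UI properties.

```json
{
  "headers": [
    "X-Requested-With: XMLHttpRequest",
    "Referer: https://example.com/"
  ],
  "retryCount": 3,
  "textNotExpected": ["Access denied", "captcha"],
  "statusNotExpected": [403, 429, 503]
}
```

With settings like these, a response containing "Access denied" or "captcha", or returning a 403, 429, or 503 status, would be retried through another proxy up to the configured retry count.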
Output
The output of this operation is an array of objects, each representing a crawled page. The structure typically includes:
```json
{
  "url": "https://example.com/page",
  "status": 200,
  "headers": { /* response headers */ },
  "contentType": "text/html",
  "responseTimeMs": 1234,
  "depth": 2,
  "matchedIncludePattern": "/docs/**",
  "matchedExcludePattern": null,
  "html": "<!DOCTYPE html>...</html>", // Only present if "Embed HTML..." is enabled
  "error": null // or an error message if the request failed
}
```
- url: The URL of the crawled page.
- status: HTTP status code returned by the server.
- headers: Response headers.
- contentType: MIME type of the response.
- responseTimeMs: Time taken to fetch the page.
- depth: How deep this page was from the start URL.
- matchedIncludePattern: Which inclusion pattern matched (if any).
- matchedExcludePattern: Which exclusion pattern matched (if any).
- html: Full HTML content (only if "Embed HTML..." is enabled).
- error: Error message if the crawl failed for this page.
Note: For large crawls, HTML is not included unless explicitly enabled due to memory concerns.
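For instance, a page that could not be fetched might appear in the output roughly like this; the status code and error text are illustrative and depend on the actual failure:

```json
{
  "url": "https://example.com/docs/broken-page",
  "status": 503,
  "depth": 3,
  "matchedIncludePattern": "https://example.com/docs/**",
  "matchedExcludePattern": null,
  "error": "Request failed after 3 retries"
}
```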
Dependencies
External Services:
- Requires access to the ScrapeNinja API.
- Requires a valid ScrapeNinja API key (configured as n8n credentials).
- Requires a PostgreSQL database connection (configured as n8n credentials) for storing crawl state and results.
n8n Configuration:
- Credentials for both ScrapeNinja API and Postgres must be set up in n8n.
Troubleshooting
Common Issues:
- Long Execution Times: Crawling many pages can take significant time. Monitor progress using n8n logs or by querying the `crawler_runs`, `crawler_queue`, and `crawler_logs` tables in your Postgres database.
- Memory Usage: Enabling "Embed HTML..." for large crawls may cause high memory usage or node failures. For large crawls, retrieve HTML directly from the database instead.
- Proxy/Geo Errors: Invalid or misconfigured proxies may result in connection errors or incomplete crawls.
- Pattern Mismatches: Incorrect inclusion/exclusion patterns may result in missing or extra pages being crawled.
Error Messages:
- "No additional details available": Indicates a generic failure without specific error data. Check logs and database tables for more information.
- HTTP/Network Errors: If a page fails to load, the error field in the output will contain the error message. Review the error and adjust settings (e.g., increase retries, check proxy).