## Overview
The node "Crawl4AI: Basic Crawler" is designed to crawl multiple URLs using the Crawl4AI service. It enables users to fetch and extract web content programmatically, supporting advanced options such as JavaScript execution, headless browsing, caching strategies, and selective content extraction via CSS selectors. This node is beneficial for scenarios like web scraping, data aggregation from multiple websites, SEO analysis, or monitoring changes on web pages.
For example, a user can input a list of URLs to crawl product pages from different e-commerce sites, specify browser options to mimic real user behavior, and filter the extracted content by CSS selectors to focus only on product descriptions.
## Properties
| Name | Meaning |
|---|---|
| URLs | Comma-separated list of URLs to crawl. Example: `https://example.com, https://example.org` |
| Browser Options | Collection of settings controlling the browser environment during crawling: - Enable JavaScript: Whether to execute JavaScript on pages. - Headless Mode: Run browser without UI. - Timeout (Ms): Max wait time for page load. - User Agent: Custom user agent string. - Viewport Height & Width: Dimensions of the browser viewport. |
| Crawler Options | Collection of crawling behavior settings: - Cache Mode: How the cache is used (Enabled, Bypass, or Only). - Check Robots.txt: Respect robots.txt rules. - CSS Selector: Extract specific content by CSS selector. - Exclude External Links: Omit external links from results. - Excluded Tags: HTML tags to exclude. - Max Retries: Number of retries on failure. - Page Timeout (Ms): Max wait for page load. - Request Timeout (Ms): Max wait for network requests. - Stream Results: Stream output as available. - Word Count Threshold: Minimum word count for content inclusion. |
| Options | Additional options: - Include Media Data: Include images/videos in output. - Verbose Response: Include detailed data like HTML and status codes. - Max Concurrent Crawls: Limit concurrent crawling tasks. |
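For orientation, a configuration using these properties might look like the sketch below. All field names and values here are illustrative assumptions drawn from the table above, not the node's exact internal parameter keys.

```typescript
// A minimal sketch of how this node's parameters might appear in an
// n8n workflow export. Field names are assumptions, not the node's
// actual internal keys.
const crawlerNodeParameters = {
  urls: "https://example.com, https://example.org",
  browserOptions: {
    enableJavaScript: true,   // execute JavaScript on each page
    headlessMode: true,       // run the browser without a UI
    timeoutMs: 30_000,        // max wait time for page load
    userAgent: "Mozilla/5.0 (compatible; ExampleCrawler/1.0)", // hypothetical UA
    viewportWidth: 1280,
    viewportHeight: 720,
  },
  crawlerOptions: {
    cacheMode: "Enabled",     // one of "Enabled" | "Bypass" | "Only"
    checkRobotsTxt: true,     // respect robots.txt rules
    cssSelector: ".product-description", // placeholder selector
    excludeExternalLinks: true,
    excludedTags: ["nav", "footer", "script"],
    maxRetries: 2,
    pageTimeoutMs: 30_000,
    requestTimeoutMs: 15_000,
    streamResults: false,
    wordCountThreshold: 10,   // drop text blocks shorter than this
  },
  options: {
    includeMediaData: false,
    verboseResponse: true,
    maxConcurrentCrawls: 3,
  },
};
```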
## Output
The node outputs JSON data representing the crawled content for each URL. The structure typically includes:
- Extracted text content (filtered by CSS selector if specified).
- Metadata such as HTTP status codes, response times, and possibly raw HTML if verbose mode is enabled.
- Optionally included media data (images, videos) if enabled.
- If streaming is enabled, partial results may be emitted progressively.
Binary data output is not explicitly indicated; media data is likely included as URLs or base64-encoded strings within JSON.
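As a rough guide, each crawled URL could yield an item shaped like the interface below. This is a sketch inferred from the description above; the actual property names depend on the Crawl4AI response format.

```typescript
// Hypothetical shape of one per-URL result item; field names are
// assumptions based on this page, not the service's documented schema.
interface CrawlResultItem {
  url: string;               // the crawled URL
  content: string;           // extracted text, filtered by CSS selector if set
  statusCode?: number;       // HTTP status code (verbose mode)
  responseTimeMs?: number;   // response time (verbose mode)
  html?: string;             // raw page HTML (verbose mode)
  media?: {                  // present only if Include Media Data is enabled
    images?: string[];       // likely URLs or base64-encoded strings
    videos?: string[];
  };
}
```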
## Dependencies
- Requires an API key credential for the Crawl4AI service to authenticate requests.
- Relies on the Crawl4AI backend for crawling and rendering web pages.
- No other external dependencies are indicated.
- Users must configure the API key credential in n8n before using this node.
## Troubleshooting
**Common Issues:**
- An invalid or missing API key credential will cause authentication failures.
- Network timeouts can occur when target URLs are slow or unresponsive; raise the timeout settings accordingly.
- Crawling may be blocked when Check Robots.txt is enabled and the site's robots.txt disallows it; disable the check only where site policies permit.
- Excessive concurrency may trigger rate limiting or IP blocking; reduce Max Concurrent Crawls.
- An incorrect CSS selector can result in empty extracted content (see the quick check below).
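One way to rule out a bad selector is to test it in the target page's browser devtools console before entering it in the node; the `.product-description` selector below is just a placeholder.

```typescript
// Run in the browser devtools console on the target page.
// An empty NodeList means the selector matches nothing, so the node
// would extract no content for it.
document.querySelectorAll(".product-description");
```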
**Error Messages:**
- Authentication errors: Verify API key configuration.
- Timeout errors: Increase timeout values or check network connectivity.
- Parsing errors: Check CSS selector syntax and excluded tags.
- Rate limit errors: Lower concurrency or add delays between requests.
## Links and References
- [Crawl4AI (GitHub)](https://github.com/unclecode/crawl4ai): API documentation and service details.
- [n8n Documentation](https://docs.n8n.io/): general usage of custom nodes and credentials.
- [CSS Selectors Reference (MDN)](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors): crafting selectors for content extraction.