Overview
The "Crawl4AI: Basic Crawler" node crawls multiple URLs through the Crawl4AI service, fetching and extracting web content programmatically with configurable browser and crawler settings. It is useful for scenarios such as web scraping, data extraction from multiple websites, monitoring websites for changes, or aggregating content from various sources.
Practical examples include:
- Extracting article content from a list of news websites.
- Monitoring product pages for price changes.
- Collecting metadata or media assets from multiple URLs for analysis.
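Since the node accepts its targets as a single comma-separated string (see the URLs property below), the input is conceptually split and trimmed into a work list before crawling. A minimal sketch of that parsing step (illustrative only; `parse_url_list` is a hypothetical helper, not part of the node's source):

```python
def parse_url_list(urls_field: str) -> list[str]:
    """Split a comma-separated URLs field into a clean list of URLs.

    Illustrative sketch -- the actual node implementation may differ.
    Empty entries (e.g. from a trailing comma) are dropped.
    """
    return [u.strip() for u in urls_field.split(",") if u.strip()]

urls = parse_url_list("https://example.com, https://example.org/news ,")
# urls -> ['https://example.com', 'https://example.org/news']
```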
Properties
| Name | Meaning |
|---|---|
| URLs | Comma-separated list of URLs to crawl. The node will process each URL in this list. |
| Browser Options | Collection of settings controlling the browser behavior during crawling: • Enable JavaScript (true/false) • Headless Mode (true/false) • Timeout (ms) for page load • User Agent string • Viewport Width and Height (pixels) |
| Crawler Options | Collection of crawler-specific settings: • Cache Mode: Enabled (read/write), Bypass (force fresh), Only (read only) • Check Robots.txt rules (true/false) • CSS Selector to focus on specific content • Exclude External Links (true/false) • Excluded HTML tags (comma-separated) • Max Retries for failed requests • Page Timeout (ms) • Request Timeout (ms) • Stream Results (true/false) • Word Count Threshold for including content |
| Options | Additional options: • Include Media Data (images, videos) in output (true/false) • Verbose Response with detailed data like HTML and status codes (true/false) • Max Concurrent Crawls (number) |
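The property collections above can be thought of as assembling into a single request configuration. The sketch below shows one hypothetical way such a payload might be structured; all key names (`browser`, `crawler`, `css_selector`, etc.) are illustrative assumptions, not the Crawl4AI wire format:

```python
def build_crawl_payload(urls, css_selector=None, headless=True,
                        timeout_ms=30000, cache_mode="enabled",
                        include_media=False):
    """Assemble an illustrative crawl request from the node's option groups.

    Hypothetical structure -- field names here are assumptions, not the
    actual API schema of the Crawl4AI service.
    """
    payload = {
        "urls": urls,
        "browser": {"headless": headless, "timeout": timeout_ms},
        "crawler": {"cache_mode": cache_mode},
        "options": {"include_media": include_media},
    }
    if css_selector:
        # Restrict extraction to a specific part of each page.
        payload["crawler"]["css_selector"] = css_selector
    return payload

p = build_crawl_payload(["https://example.com"], css_selector="article")
```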
Output
The node outputs JSON data containing the results of crawling each URL. The structure typically includes extracted content based on the CSS selector or full page content if no selector is specified. If enabled, media data such as images and videos are included. When verbose response is active, additional details like raw HTML, HTTP status codes, and metadata are provided.
If streaming is enabled, results may be emitted progressively as they become available.
Binary data output is not explicitly provided; media data, when requested, appears to be embedded or referenced within the JSON output rather than emitted as separate binary items.
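A downstream node would typically iterate over the per-URL results. The sketch below parses a hypothetical result array; the field names (`success`, `content`, `status_code`, `media`) are assumptions about the output shape, which in practice depends on the Crawl4AI service and the Verbose Response setting:

```python
import json

# Hypothetical per-URL result shape -- field names are illustrative.
sample = json.loads("""
[
  {"url": "https://example.com", "success": true,
   "content": "Example Domain", "status_code": 200,
   "media": {"images": [], "videos": []}}
]
""")

# Keep only the URLs that were crawled successfully.
succeeded = [r["url"] for r in sample if r.get("success")]
```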
Dependencies
- Requires an API key credential for the Crawl4AI service to authenticate requests.
- Relies on the Crawl4AI external web crawling API.
- No other explicit environment variables or n8n configurations are required beyond standard API credential setup.
Troubleshooting
Common Issues:
- Invalid or missing API credentials will cause authentication failures.
- Network timeouts can occur if the target URLs are slow or unresponsive; increase the page or request timeout settings accordingly.
- If the robots.txt check is enabled, restrictive rules may block crawling; ensure the target sites permit crawling, or disable the check where appropriate.
- Exceeding max concurrent crawls may lead to rate limiting or throttling by the Crawl4AI service.
- Incorrect CSS selectors may result in empty or incomplete content extraction.
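The concurrency issue above is commonly addressed with a semaphore that caps in-flight requests, which is the general idea behind the Max Concurrent Crawls option. A minimal sketch (the `crawl_one` placeholder stands in for the real service call):

```python
import asyncio

async def crawl_one(url: str) -> str:
    # Placeholder for the real crawl request to the Crawl4AI service.
    await asyncio.sleep(0)
    return f"crawled {url}"

async def crawl_all(urls, max_concurrent=3):
    """Crawl URLs with at most `max_concurrent` requests in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with sem:
            return await crawl_one(url)

    # gather() preserves input order in its results.
    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(
    crawl_all(["https://example.com", "https://example.org"])
)
```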
Error Messages:
- Authentication errors: Verify that the API key credential is correctly configured.
- Timeout errors: Increase timeout values in browser or crawler options.
- Rate limit errors: Reduce max concurrent crawls or add retry logic.
- Parsing errors: Check CSS selector syntax and excluded tags configuration.
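For timeout and rate-limit errors, the usual remedy is retrying with exponential backoff, as the Max Retries option suggests. A self-contained sketch of that pattern (`with_retries` and the `flaky` fetcher are illustrative, not part of the node):

```python
import time

def with_retries(fetch, url, max_retries=3, base_delay=0.01):
    """Retry a flaky fetch with exponential backoff (illustrative)."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except TimeoutError:
            if attempt == max_retries:
                raise
            # Back off 1x, 2x, 4x, ... the base delay between attempts.
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}

def flaky(url):
    # Simulated fetcher that times out twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return {"url": url, "status": 200}

result = with_retries(flaky, "https://example.com")
# succeeds on the third attempt
```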
Links and References
- Crawl4AI Official Website (for API documentation and usage guidelines)
- n8n Documentation (general node usage and credential setup)