Overview
The "Crawl4AI: Basic Crawler" node crawls multiple URLs using the Crawl4AI service. It lets users fetch and extract web content programmatically, with support for advanced browser options and crawling configurations. It is useful for web scraping, content aggregation, SEO analysis, and monitoring changes on websites.
For example, a user can supply a list of URLs to crawl, specify whether JavaScript should be executed on each page, set timeouts, and filter content by CSS selector. The node can also respect robots.txt rules, exclude certain HTML tags, and control caching behavior. It supports concurrent crawling and can optionally include media data or verbose response details.
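As a sketch of that flow, the comma-separated URL list can be parsed and combined with browser options into a single request payload. The field names below (`urls`, `browser`, `enable_javascript`, `timeout_ms`) are illustrative assumptions, not the actual Crawl4AI API schema:

```python
def build_crawl_payload(urls: str, enable_js: bool = True, timeout_ms: int = 30000) -> dict:
    """Split the comma-separated URL list and attach browser options.

    The payload keys are hypothetical; consult the Crawl4AI API docs
    for the real request schema.
    """
    url_list = [u.strip() for u in urls.split(",") if u.strip()]
    return {
        "urls": url_list,
        "browser": {
            "enable_javascript": enable_js,
            "timeout_ms": timeout_ms,
        },
    }

payload = build_crawl_payload("https://example.com, https://example.org")
print(payload["urls"])  # ['https://example.com', 'https://example.org']
```

Stripping whitespace and dropping empty entries makes the node tolerant of input like trailing commas or extra spaces around separators.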
Properties
| Name | Meaning |
|---|---|
| URLs | Comma-separated list of URLs to crawl. Example: https://example.com, https://example.org |
| Browser Options | Settings controlling the browser environment during crawling:<br>- Enable JavaScript (boolean): Whether to execute JavaScript on pages.<br>- Headless Mode (boolean): Run the browser without a UI.<br>- Timeout (ms) (number): Maximum wait time for page load.<br>- User Agent (string): Custom user agent string.<br>- Viewport Height (number): Browser viewport height in pixels.<br>- Viewport Width (number): Browser viewport width in pixels. |
| Crawler Options | Crawling-specific settings:<br>- Cache Mode (options): How to use the cache (enabled, bypass, only).<br>- Check Robots.txt (boolean): Respect robots.txt rules.<br>- CSS Selector (string): Focus extraction on content matching a CSS selector.<br>- Exclude External Links (boolean): Exclude links outside the domain.<br>- Excluded Tags (string): Comma-separated HTML tags to exclude.<br>- Max Retries (number): Number of retries for failed requests.<br>- Page Timeout (ms) (number): Maximum wait for page load.<br>- Request Timeout (ms) (number): Maximum wait for network requests.<br>- Stream Results (boolean): Stream results as they become available.<br>- Word Count Threshold (number): Minimum word count for a content block to be included. |
| Options | Additional options:<br>- Include Media Data (boolean): Include images and videos in the output.<br>- Verbose Response (boolean): Include detailed data such as raw HTML and status codes.<br>- Max Concurrent Crawls (number): Maximum number of parallel crawls. |
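To make the option resolution concrete, the crawler settings above could be overlaid on a set of defaults before each run. The default values and key names below are assumptions for illustration, not the node's documented defaults:

```python
# Hypothetical defaults; the node's real defaults may differ.
DEFAULT_CRAWLER_OPTIONS = {
    "cache_mode": "enabled",
    "check_robots_txt": False,
    "css_selector": "",
    "exclude_external_links": False,
    "excluded_tags": ["script", "style"],
    "max_retries": 3,
    "page_timeout_ms": 30000,
    "request_timeout_ms": 10000,
    "stream_results": False,
    "word_count_threshold": 10,
}

def resolve_crawler_options(user_options: dict) -> dict:
    """Overlay user-supplied options on the defaults, rejecting unknown keys."""
    unknown = set(user_options) - set(DEFAULT_CRAWLER_OPTIONS)
    if unknown:
        raise ValueError(f"Unknown crawler options: {sorted(unknown)}")
    return {**DEFAULT_CRAWLER_OPTIONS, **user_options}

opts = resolve_crawler_options({"cache_mode": "bypass", "max_retries": 5})
print(opts["cache_mode"], opts["max_retries"])  # bypass 5
```

Rejecting unknown keys up front surfaces typos in option names at configuration time rather than as silently ignored settings.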
Output
The node outputs JSON data containing the crawled content from each URL. Depending on options selected, the output may include:
- Extracted text content filtered by CSS selectors.
- Metadata such as HTTP status codes and response headers.
- Media data like images and videos if enabled.
- Verbose information including full HTML source.
- Streaming output if enabled, providing partial results progressively.
If media data is included, binary data fields may be present representing downloaded media files.
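A downstream step might normalize that per-URL output into a compact summary. The field names used here (`url`, `status_code`, `text`, `media`) are assumed for illustration; the actual keys depend on the options selected and the service's response schema:

```python
def summarize_results(results: list) -> list:
    """Reduce each crawl result to a small summary record."""
    summaries = []
    for item in results:
        text = item.get("text") or ""
        summaries.append({
            "url": item.get("url"),
            "status": item.get("status_code"),
            "has_media": bool(item.get("media")),
            "text_preview": text[:80],
        })
    return summaries

sample = [
    {"url": "https://example.com", "status_code": 200,
     "text": "Example Domain", "media": [{"type": "image"}]},
    {"url": "https://example.org", "status_code": 404, "text": None},
]
print(summarize_results(sample)[0]["has_media"])  # True
```

Guarding against `None` text (as in the 404 result above) keeps the summary step from failing when a page yields no extractable content.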
Dependencies
- Requires an API key credential for the Crawl4AI service.
- The node depends on the Crawl4AI external API to perform crawling operations.
- No additional environment variables are explicitly required beyond the API authentication.
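Credential handling reduces to attaching the API key to each outgoing request. A bearer-token header is assumed in this sketch; the actual authentication scheme is defined by the Crawl4AI service:

```python
def auth_headers(api_key: str) -> dict:
    """Build request headers from the configured credential.

    The Bearer scheme is an assumption; check the service's auth docs.
    """
    if not api_key or not api_key.strip():
        raise ValueError("Crawl4AI API key credential is missing")
    return {"Authorization": f"Bearer {api_key.strip()}"}

print(auth_headers("abc123"))  # {'Authorization': 'Bearer abc123'}
```

Failing fast on an empty credential turns a confusing downstream authentication error into an immediate, descriptive one.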
Troubleshooting
Common issues:
- Invalid or missing API key will cause authentication failures.
- Network timeouts if URLs are slow or unreachable; adjust timeout settings accordingly.
- Robots.txt restrictions may block crawling if enabled.
- Incorrect CSS selectors may result in empty content extraction.
- Exceeding max concurrent crawls may lead to rate limiting or errors from the service.
Error messages:
- Authentication errors indicate invalid credentials; verify API key setup.
- Timeout errors suggest increasing timeout values or checking network connectivity.
- Parsing errors may occur if the page structure differs significantly from expectations; review CSS selectors and excluded tags.
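For transient timeout errors, the Max Retries setting maps naturally onto a retry loop with exponential backoff. The following is a minimal sketch of that pattern, not the node's actual internal retry logic:

```python
import time

def with_retries(fn, max_retries: int = 3, base_delay: float = 0.1):
    """Call fn(), retrying on TimeoutError with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_retries:
                raise  # retries exhausted; surface the timeout to the caller
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky fetch: times out twice, then succeeds.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("page load exceeded timeout")
    return "content"

print(with_retries(flaky_fetch, max_retries=3, base_delay=0.01))  # content
```

Backoff spaces out repeat attempts so a temporarily slow page is not hammered at full rate, while a persistently unreachable URL still fails with the original timeout error.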
Links and References
- Crawl4AI Official Website — For API documentation and service details.
- Robots.txt Specification — Understanding robots.txt rules.
- CSS Selectors Reference — Guide to writing CSS selectors for content extraction.