Overview
The "Crawl4AI: Basic Crawler" node fetches and processes the content of a single URL. It exposes configurable browser and crawler options that control how the page is loaded, rendered, and parsed, which makes it useful for web scraping, content extraction, SEO analysis, and monitoring changes on web pages.
For example, you can use it to:
- Extract article content from news sites by specifying CSS selectors.
- Crawl pages that require JavaScript execution to fully load content.
- Respect robots.txt rules to comply with website crawling policies.
- Cache results to optimize repeated crawls of the same URLs.
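The robots.txt use case, for instance, follows the standard exclusion rules. The sketch below reproduces the kind of check the service performs on your behalf when Check Robots.txt is enabled, using only Python's standard-library robotparser; the node does this server-side, so the snippet is purely illustrative:

```python
from urllib.robotparser import RobotFileParser

# A toy robots.txt that disallows one path prefix for all crawlers.
robots_txt = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Disallowed path: a compliant crawler must skip this URL.
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
# Anything not disallowed is fair game.
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
```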
Properties
| Name | Meaning |
|---|---|
| URL | The URL to crawl. |
| Browser Options | Settings controlling the browser environment:<br>- Enable JavaScript: Whether to execute JavaScript on the page.<br>- Headless Mode: Run the browser without a visible UI.<br>- Timeout (Ms): Maximum wait time, in milliseconds, for the page to load.<br>- User Agent: Custom user agent string sent with requests.<br>- Viewport Height: Browser viewport height in pixels.<br>- Viewport Width: Browser viewport width in pixels. |
| Crawler Options | Settings controlling crawling behavior:<br>- Cache Mode: How to use the cache ("Enabled", "Bypass", or "Only").<br>- Check Robots.txt: Whether to respect the site's robots.txt rules.<br>- CSS Selector: CSS selector used to extract specific content.<br>- Exclude External Links: Whether to drop links pointing outside the domain.<br>- Excluded Tags: Comma-separated list of HTML tags to ignore.<br>- JavaScript Code: Custom JS code to run after the page loads.<br>- JavaScript Only Mode: Execute only the JS without crawling the page.<br>- Max Retries: Number of retry attempts on failure.<br>- Page Timeout (Ms): Maximum wait time, in milliseconds, for page loading.<br>- Request Timeout (Ms): Maximum wait time, in milliseconds, for network requests.<br>- Session ID: Identifier used to maintain browser state across multiple crawls.<br>- Word Count Threshold: Minimum word count for content to be included. |
| Options | Additional output options:<br>- Include Media Data: Include images and videos in the output.<br>- Verbose Response: Include detailed data such as the full HTML and status codes. |
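Taken together, the properties above can be pictured as a single configuration object. The sketch below is illustrative only: every key name is an assumption made for this example and does not come from the node's actual schema.

```python
# Illustrative configuration mirroring the property table above.
# All key names here are assumptions for the example, not the node's real schema.
config = {
    "url": "https://example.com/article",
    "browserOptions": {
        "enableJavaScript": True,
        "headlessMode": True,
        "timeoutMs": 30000,
        "userAgent": "Mozilla/5.0 (compatible; ExampleBot/1.0)",
        "viewportWidth": 1280,
        "viewportHeight": 800,
    },
    "crawlerOptions": {
        "cacheMode": "Enabled",          # "Enabled" | "Bypass" | "Only"
        "checkRobotsTxt": True,
        "cssSelector": "article.main",
        "excludeExternalLinks": False,
        "excludedTags": "script,style,nav",
        "maxRetries": 2,
        "pageTimeoutMs": 30000,
        "requestTimeoutMs": 15000,
        "wordCountThreshold": 10,
    },
    "options": {
        "includeMediaData": False,
        "verboseResponse": True,
    },
}

# A quick sanity check before handing the configuration to the node.
assert config["crawlerOptions"]["cacheMode"] in ("Enabled", "Bypass", "Only")
```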
Output
The node outputs JSON data representing the crawled page content and metadata. Depending on options, the output may include:
- Extracted text content filtered by CSS selectors and excluded tags.
- Lists of internal and external links found on the page.
- Media data such as images and videos if enabled.
- Detailed response information including HTTP status codes, raw HTML, and timing details when verbose mode is active.
If media data is included, binary data fields may be present representing downloaded media files.
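As a concrete illustration of consuming this output, the snippet below post-processes a crawl result of the general shape described above. The payload is hand-written for the example, so the real field names may differ from this sketch:

```python
# Hand-written illustration of a verbose crawl result; the real payload's
# field names may differ from this sketch.
result = {
    "url": "https://example.com/article",
    "status_code": 200,
    "text": "A short example article body with just a handful of words in it.",
    "links": {
        "internal": ["https://example.com/about"],
        "external": ["https://other.site/page"],
    },
}

WORD_COUNT_THRESHOLD = 5  # mirrors the node's Word Count Threshold option

def passes_threshold(text: str, threshold: int) -> bool:
    """True when the extracted text meets the minimum word count."""
    return len(text.split()) >= threshold

# Keep only successful crawls whose extracted text is long enough.
if result["status_code"] == 200 and passes_threshold(result["text"], WORD_COUNT_THRESHOLD):
    print(f"kept {result['url']} "
          f"({len(result['links']['internal'])} internal, "
          f"{len(result['links']['external'])} external links)")
```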
Dependencies
- Requires an API key credential for the Crawl4AI service to authenticate requests.
- Relies on the Crawl4AI backend to perform crawling operations.
- No additional environment variables are explicitly required beyond the API credential.
Troubleshooting
Common issues:
- Invalid or missing URL input will cause the node to fail.
- Slow or unresponsive target sites can cause network timeouts; increase the Timeout (Ms), Page Timeout (Ms), or Request Timeout (Ms) settings accordingly.
- Pages that rely heavily on dynamic content may come back incomplete; ensure Enable JavaScript is turned on and, if needed, raise Page Timeout (Ms) so scripts have time to run.
- If Check Robots.txt is enabled, paths disallowed by the site's robots.txt will not be crawled; disable the check only when you are certain you are still complying with the site's policies.
- Cache mode misconfiguration might lead to stale or missing data.
Error messages:
- Authentication errors indicate invalid or missing API credentials.
- Timeout errors suggest increasing timeout values or checking network connectivity.
- Parsing errors may occur if CSS selectors are incorrect or the page structure changes.
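Several of the failures above can be caught before a crawl even starts. A minimal pre-flight check for the "invalid or missing URL" case, using only the Python standard library (the helper name is made up for this sketch, not part of the node):

```python
from urllib.parse import urlparse

def validate_url(url: str) -> str:
    """Reject empty or scheme-less URLs before handing them to the crawler."""
    parsed = urlparse(url or "")
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid URL for crawling: {url!r}")
    return url

validate_url("https://example.com/page")   # passes through unchanged
try:
    validate_url("example.com/page")       # no scheme -> rejected
except ValueError as err:
    print(err)
```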
Links and References
- Crawl4AI Official Documentation (hypothetical link)
- Web Scraping Best Practices
- robots.txt Specification