Overview
The node "Crawl4AI: Basic Crawler" is designed to crawl websites by fetching and processing the content of a single URL. It is useful for scenarios where you want to extract web page data programmatically, such as scraping articles, monitoring website changes, or gathering structured information from web pages.
This node supports advanced crawling options including JavaScript execution, respecting robots.txt rules, caching strategies, and filtering content by CSS selectors or HTML tags. It can also maintain browser sessions across multiple crawls, making it suitable for multi-step crawling workflows.
Practical examples:
- Extracting article text from a news website by specifying a CSS selector.
- Crawling product pages with JavaScript-rendered content enabled.
- Running custom JavaScript on the page to interact with dynamic elements before extracting data (see the sketch after this list).
- Using cache modes to optimize repeated crawls of the same URLs.
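For the custom JavaScript case, the snippet below is a minimal sketch of the kind of code you could paste into the "JavaScript code to execute after load" option. The `.load-more` selector and the two-second wait are assumptions for illustration; adapt them to the page you are crawling.

```js
// Hypothetical snippet for the "JavaScript code to execute after load" option.
// '.load-more' is a placeholder selector; replace it with one that matches
// the dynamic element on your target page.
(async () => {
  const loadMore = document.querySelector('.load-more');
  if (loadMore instanceof HTMLElement) {
    loadMore.click();                               // trigger lazily loaded content
    await new Promise((r) => setTimeout(r, 2000));  // give new content time to render
  }
})();
```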
Properties
| Name | Meaning |
|---|---|
| URL | The web address to crawl. |
| Browser Options | Settings controlling the browser environment during crawling:<br>- Enable JavaScript (true/false)<br>- Headless Mode (true/false)<br>- Timeout in milliseconds<br>- User Agent string<br>- Viewport width and height |
| Crawler Options | Controls crawling behavior:<br>- Cache Mode: Enabled (read/write), Bypass (force fresh), Only (read only)<br>- Check Robots.txt (true/false)<br>- CSS Selector to focus on specific content<br>- Exclude external links (true/false)<br>- Excluded HTML tags (comma-separated)<br>- JavaScript code to execute after load<br>- JavaScript only mode (true/false)<br>- Max retries for failed requests<br>- Page and request timeouts in milliseconds<br>- Session ID to maintain browser state<br>- Word count threshold to filter content |
| Options | Additional output options:<br>- Include media data (images, videos) (true/false)<br>- Verbose response with detailed data like HTML and status codes (true/false) |
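As a rough illustration of how these properties fit together, the sketch below shows one possible configuration for crawling a news article. The field names are paraphrased from the table above and are not guaranteed to match the node's internal parameter names.

```js
// Illustrative configuration only: names are paraphrased from the Properties
// table and may differ from the exact option names shown in the n8n UI.
const exampleCrawl = {
  url: 'https://example.com/news/article-123',
  browserOptions: {
    enableJavaScript: true,          // render client-side content
    headlessMode: true,
    timeout: 30000,                  // milliseconds
    userAgent: 'Mozilla/5.0 (compatible; ExampleCrawler/1.0)',
    viewportWidth: 1280,
    viewportHeight: 800,
  },
  crawlerOptions: {
    cacheMode: 'enabled',            // or 'bypass' to force a fresh fetch
    checkRobotsTxt: true,
    cssSelector: 'article.main-content',
    excludeExternalLinks: true,
    excludedTags: 'nav,footer,aside',
    sessionId: 'news-crawl-1',       // reuse to keep browser state across crawls
    wordCountThreshold: 10,
  },
  options: {
    includeMediaData: false,
    verboseResponse: false,
  },
};
```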
Output
The node outputs JSON data representing the crawled content of the specified URL. The structure typically includes extracted text content, metadata about the page, and optionally media data if enabled. When verbose response is selected, additional details such as raw HTML, HTTP status codes, and other diagnostic information are included.
If media inclusion is enabled, the output may contain binary data references for images or videos found on the page.
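For orientation, a single output item might look roughly like the sketch below. The field names are assumptions based on the description above rather than a guaranteed schema; the exact structure depends on the Crawl4AI service version and the options you enable.

```js
// Assumed output shape for illustration; not the node's guaranteed schema.
const exampleOutput = {
  url: 'https://example.com/news/article-123',
  success: true,
  statusCode: 200,              // typically present when Verbose Response is enabled
  text: 'Headline\n\nArticle body text…',
  metadata: {
    title: 'Headline',
    description: 'Short summary of the article',
  },
  media: {                      // present only when media inclusion is enabled
    images: [{ src: 'https://example.com/img/hero.jpg', alt: 'Hero image' }],
    videos: [],
  },
  html: '<!DOCTYPE html>…',     // raw HTML, included with Verbose Response
};
```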
Dependencies
- Requires an API key credential for the Crawl4AI service to perform crawling operations.
- The node internally uses a browser automation environment to fetch and process web pages, supporting JavaScript execution and session management.
- No additional environment variables or configurations are explicitly required beyond the API credential.
Troubleshooting
- Timeouts: If the page takes too long to load, increase the timeout settings under Browser Options or Crawler Options.
- JavaScript Execution Issues: If dynamic content is not loaded correctly, ensure JavaScript is enabled and consider adding custom JavaScript code to trigger necessary actions.
- Cache Behavior: Unexpected stale data might be due to cache mode settings; adjust cache mode to bypass or disable caching if fresh content is needed.
- Robots.txt Restrictions: Enabling robots.txt checking may prevent some pages from being crawled; disable this option only if you are permitted to access the content, and respect legal and ethical considerations.
- Session Management: For multi-step crawls, use the same Session ID in every step so browser state is maintained across crawls (see the sketch after this list).
- Error Messages: Common errors may relate to invalid URLs, network issues, or authentication failures with the API key. Verify URL correctness, network connectivity, and that the API key is valid and has sufficient permissions.
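As a minimal sketch of the session-management point above: give each crawler node in a multi-step workflow the same Session ID so that later steps reuse the browser state established by earlier ones. The field names, URLs, and selector below are hypothetical.

```js
// Hypothetical sketch: two sequential crawls sharing one browser session.
// The value of sessionId is arbitrary; it only matters that every step
// that should share state uses the identical ID.
const sessionId = 'checkout-flow-42';

const stepOne = {
  url: 'https://example.com/login',
  crawlerOptions: {
    sessionId,
    // Placeholder selector: submit a login form before the next step runs.
    jsCode: "document.querySelector('#login-form button').click();",
  },
};

const stepTwo = {
  url: 'https://example.com/account/orders',
  crawlerOptions: {
    sessionId,                  // same ID, so cookies and page state carry over
  },
};
```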
Links and References
- Crawl4AI Official Documentation (example placeholder link)
- n8n Documentation on Creating Custom Nodes
- Web Scraping Best Practices