
Crawl4AI: Basic Crawler

Crawl websites using Crawl4AI

Overview

The "Crawl4AI: Basic Crawler" node crawls websites by fetching and processing the content of a single URL. Configurable browser and crawler options control how the page is loaded, rendered, and parsed. This node is useful for web scraping, content extraction, SEO analysis, and automated data collection from web pages.

For example, you can use it to:

  • Extract article content from news sites by specifying a CSS selector (see the sketch after this list).
  • Crawl product pages while respecting robots.txt rules.
  • Execute custom JavaScript on the page to load dynamic content before extraction.
  • Maintain session state across multiple crawls using a session ID.
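As an illustration of the first use case, a configuration for this node might look roughly like the sketch below. All field names are assumptions that mirror the properties documented in the next section, not the node's verified parameter schema:

```javascript
// Hypothetical parameter set for a "Crawl4AI: Basic Crawler" node that
// extracts article bodies from a news site. Field names mirror the
// properties documented below; the node's actual JSON shape may differ.
const crawlParams = {
  url: "https://example.com/news/some-article",
  browserOptions: {
    enableJavascript: true,   // render client-side content before extraction
    headless: true,
    timeout: 30000,           // browser timeout in milliseconds
    viewportWidth: 1280,
    viewportHeight: 800,
  },
  crawlerOptions: {
    cssSelector: "article.main-content", // focus extraction on the article body
    checkRobotsTxt: true,                // respect the site's robots.txt
    wordCountThreshold: 10,              // skip text blocks shorter than 10 words
  },
};
```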

Properties

  • URL: The URL of the webpage to crawl.
  • Browser Options: Settings controlling the browser environment during crawling:
    - Enable JavaScript (true/false)
    - Headless Mode (true/false)
    - Timeout in milliseconds
    - User Agent string
    - Viewport width and height
  • Crawler Options: Controls crawling behavior:
    - Cache Mode: Enabled (read and write cache), Bypass (always fetch fresh), Only (read from cache only)
    - Check Robots.txt (true/false)
    - CSS Selector to focus extraction on specific content
    - Exclude External Links (true/false)
    - Excluded HTML Tags (comma-separated)
    - JavaScript Code to execute after page load (see the example below)
    - JavaScript Only mode (true/false)
    - Max Retries for failed requests
    - Page Timeout in milliseconds
    - Request Timeout in milliseconds
    - Session ID to maintain browser state across crawls
    - Word Count Threshold for including content
  • Options: Additional output options:
    - Include Media Data (images, videos) (true/false)
    - Verbose Response with detailed data such as raw HTML and status codes (true/false)
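The JavaScript Code option accepts a script that runs inside the loaded page before extraction. A minimal sketch for surfacing lazy-loaded content might look like this; the button selector is a placeholder you would adapt to the target site:

```javascript
// Runs in the crawled page after load (via the JavaScript Code option).
// Scroll to the bottom to trigger lazy loading, then click a hypothetical
// "load more" button if one exists.
window.scrollTo(0, document.body.scrollHeight);
const loadMore = document.querySelector("button.load-more"); // placeholder selector
if (loadMore) {
  loadMore.click();
}
```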

Output

The node outputs JSON data representing the crawled content of the specified URL. Depending on options, this may include:

  • Extracted text content filtered by CSS selectors and word count thresholds.
  • Metadata about the page, such as the HTTP status code.
  • Media data such as images and videos, when Include Media Data is enabled.
  • Raw HTML and other detailed information, when Verbose Response is enabled.

If media data is included, the output may also contain binary fields holding the downloaded media files.
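For orientation, a single output item might look roughly like the sketch below. The field names are illustrative assumptions; the actual schema depends on the node version and the options you enable:

```javascript
// Illustrative output item (field names are assumptions, not a guaranteed schema).
const outputItem = {
  json: {
    url: "https://example.com/news/some-article",
    statusCode: 200,              // included with Verbose Response
    text: "Headline\n\nFirst paragraph of the article...",
    media: {                      // present only when Include Media Data is enabled
      images: [{ src: "https://example.com/img/hero.jpg", alt: "Hero image" }],
    },
  },
  binary: {},                     // downloaded media files, when included
};
```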

Dependencies

  • Requires an API key credential for the Crawl4AI service to perform crawling.
  • The node internally uses a browser automation environment configured via the provided browser options.
  • No additional environment variables are required beyond the API credential.

Troubleshooting

  • Timeouts: If the page takes too long to load, increase the Timeout, Page Timeout, or Request Timeout values.
  • JavaScript Execution Issues: If dynamic content does not load, ensure Enable JavaScript is true and consider supplying custom JavaScript code (such as the snippet under Properties) to trigger content loading.
  • Robots.txt Blocking: Enabling Check Robots.txt may prevent crawling some URLs; disable it only if you have permission to crawl the target.
  • Cache Behavior: A misconfigured cache mode can cause stale data or unnecessary requests; choose the mode that matches your needs.
  • Session Management: Use a consistent Session ID to maintain state across multiple crawls; otherwise each crawl runs in an isolated session (see the sketch after this list).
  • Invalid CSS Selectors: Incorrect CSS selectors may result in empty content extraction; verify selectors carefully.
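For instance, reusing one Session ID across two crawls keeps them in the same browser context. The sketch below illustrates the session behavior described above; the "sessionId" field name is again an assumption mirroring the Session ID property:

```javascript
// Hypothetical: two consecutive crawls sharing browser state (cookies,
// storage) by passing the same Session ID.
const sessionId = "news-site-session";

const loginCrawl = {
  url: "https://example.com/login",
  crawlerOptions: { sessionId },
};
const articleCrawl = {
  url: "https://example.com/members/article",
  crawlerOptions: { sessionId }, // same ID, so the login state carries over
};
// Changing or omitting sessionId would run each crawl in an isolated session.
```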

Common error messages relate to network failures, invalid URLs, or authentication issues with the Crawl4AI API. Ensure the API key is valid and the URL is reachable.
