Crawl4AI: Basic Crawler

Crawl websites using Crawl4AI

Overview

The node "Crawl4AI: Basic Crawler" is designed to crawl multiple URLs using the Crawl4AI service. It enables users to fetch and extract web content programmatically, supporting advanced options such as JavaScript execution, headless browsing, caching strategies, and selective content extraction via CSS selectors. This node is beneficial for scenarios like web scraping, data aggregation from multiple websites, SEO analysis, or monitoring changes on web pages.

For example, a user can supply a list of product-page URLs from different e-commerce sites, set browser options to mimic real user behavior, and filter the extracted content with a CSS selector so that only product descriptions are returned; a configuration along these lines is sketched after the Properties list below.

Properties

  • URLs: Comma-separated list of URLs to crawl. Example: https://example.com, https://example.org
  • Browser Options: Settings that control the browser environment during crawling:
    - Enable JavaScript: Whether to execute JavaScript on pages.
    - Headless Mode: Run the browser without a UI.
    - Timeout (Ms): Maximum wait time for page load.
    - User Agent: Custom user agent string.
    - Viewport Height & Width: Dimensions of the browser viewport.
  • Crawler Options: Settings that control crawling behavior:
    - Cache Mode: How to use the cache (Enabled, Bypass, Only).
    - Check Robots.txt: Respect robots.txt rules.
    - CSS Selector: Extract specific content by CSS selector.
    - Exclude External Links: Omit external links from results.
    - Excluded Tags: HTML tags to exclude.
    - Max Retries: Number of retries on failure.
    - Page Timeout (Ms): Maximum wait for page load.
    - Request Timeout (Ms): Maximum wait for network requests.
    - Stream Results: Stream output as it becomes available.
    - Word Count Threshold: Minimum word count for content inclusion.
  • Options: Additional settings:
    - Include Media Data: Include images/videos in the output.
    - Verbose Response: Include detailed data such as raw HTML and status codes.
    - Max Concurrent Crawls: Limit the number of concurrent crawling tasks.
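To make the options concrete, here is a hypothetical configuration for the e-commerce example from the Overview, written as a TypeScript object literal. The field names mirror the property labels above; the node's internal parameter keys may differ.

```typescript
// Hypothetical parameter set for the Crawl4AI: Basic Crawler node.
// Field names mirror the property labels above; the node's actual
// internal parameter keys may differ.
const crawlerConfig = {
  urls: "https://shop-a.example.com/product/123, https://shop-b.example.com/item/456",
  browserOptions: {
    enableJavaScript: true,   // render client-side content
    headlessMode: true,       // no visible browser UI
    timeoutMs: 30000,         // max wait for page load
    userAgent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64)", // mimic a real browser
    viewportWidth: 1280,
    viewportHeight: 800,
  },
  crawlerOptions: {
    cacheMode: "Bypass",                  // always fetch fresh content
    checkRobotsTxt: true,                 // respect robots.txt rules
    cssSelector: ".product-description",  // extract only product descriptions
    excludeExternalLinks: true,
    excludedTags: ["script", "style", "nav"],
    maxRetries: 2,
    pageTimeoutMs: 30000,
    requestTimeoutMs: 15000,
    streamResults: false,
    wordCountThreshold: 10,               // drop fragments shorter than 10 words
  },
  options: {
    includeMediaData: false,
    verboseResponse: true,                // include HTML and status codes
    maxConcurrentCrawls: 3,               // stay polite to target sites
  },
};
```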

Output

The node outputs JSON data representing the crawled content for each URL. The structure typically includes:

  • Extracted text content (filtered by CSS selector if specified).
  • Metadata such as HTTP status codes, response times, and possibly raw HTML if verbose mode is enabled.
  • Optionally included media data (images, videos) if enabled.
  • If streaming is enabled, partial results may be emitted progressively.

Binary data output is not explicitly indicated; media data is most likely included as URLs or base64-encoded strings within the JSON.
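As a rough guide, each per-URL output item might be shaped like the following TypeScript interface. All field names here are assumptions inferred from the behavior described above, not a confirmed schema.

```typescript
// Assumed shape of one output item per crawled URL; field names are
// illustrative, not a confirmed schema.
interface CrawlResultItem {
  url: string;                 // the crawled URL
  content: string;             // extracted text (filtered by CSS selector, if set)
  statusCode?: number;         // present when Verbose Response is enabled
  responseTimeMs?: number;     // present when Verbose Response is enabled
  html?: string;               // raw HTML, verbose mode only
  media?: {                    // present when Include Media Data is enabled
    images?: string[];         // likely URLs or base64-encoded strings
    videos?: string[];
  };
}
```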

Dependencies

  • Requires an API key credential for the Crawl4AI service to authenticate requests.
  • Relies on the Crawl4AI backend for crawling and rendering web pages.
  • No other external dependencies are indicated.
  • Users must configure the API key credential in n8n before using this node.

Troubleshooting

  • Common Issues:

    • An invalid or missing API key credential causes authentication failures.
    • Slow or unresponsive target URLs can trigger network timeouts; raise the timeout values accordingly.
    • Crawling may be blocked when Check Robots.txt is enabled; disable the option only where appropriate, and always respect site policies.
    • Excessive concurrency can lead to rate limiting or IP blocking; lower Max Concurrent Crawls.
    • An incorrect CSS selector can result in empty extracted content.
  • Error Messages:

    • Authentication errors: Verify API key configuration.
    • Timeout errors: Increase timeout values or check network connectivity.
    • Parsing errors: Check CSS selector syntax and excluded tags.
    • Rate limit errors: Lower concurrency or add delays between requests.
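When debugging, it can help to separate failed crawls in a downstream n8n Code node. The sketch below assumes Verbose Response is enabled and that each item carries a statusCode field (an assumption; check your actual output first).

```typescript
// n8n Code node sketch: keep only failed crawls for inspection.
// Assumes each item exposes json.statusCode (verbose mode); adjust the
// field name to match your actual output.
const failed = $input.all().filter((item) => {
  const status = item.json.statusCode as number | undefined;
  return status === undefined || status >= 400;
});
return failed.length > 0 ? failed : [{ json: { message: "All crawls succeeded" } }];
```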
