Crawl4AI: Basic Crawler

Crawl websites using Crawl4AI

Overview

The "Crawl4AI: Basic Crawler" node processes raw HTML content to extract and transform web page data. It parses the HTML you provide and can focus on specific parts of the content, filter out unwanted elements, and optionally include media data or verbose details in the output.

Common scenarios where this node is beneficial include:

Extracting article content from a full HTML page for further analysis or storage.
Cleaning up HTML by removing navigation bars, footers, or sidebars before processing.
Filtering links to exclude external URLs when gathering internal site data.
Applying a minimum word count so that only substantial content is processed.
Including media information such as images or videos embedded in the HTML.

Practical example:
You have an HTML snapshot of a news article and want to extract just the main text without ads or navigation menus. You provide the raw HTML, specify a CSS selector targeting the article body, exclude tags like <nav> and <footer>, and set a minimum word count threshold to ignore short snippets. The node returns cleaned content ready for use in newsletters or summaries.
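
As a rough illustration of the tag exclusion and word count steps in this example, the following Python sketch uses only the standard library. It is not the node's or Crawl4AI's implementation: `clean_html`, `BlockExtractor`, and the two tag sets are hypothetical names chosen for this sketch, the CSS selector step is left out because the standard library has no selector engine, and the input is assumed to be well-formed HTML.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Tags treated as block boundaries in this sketch (an assumption, not a standard).
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6",
              "article", "section", "blockquote"}


class BlockExtractor(HTMLParser):
    """Collect text blocks, skipping everything inside excluded tags."""

    def __init__(self, excluded_tags):
        super().__init__()
        self.excluded = set(excluded_tags)
        self.skip_depth = 0   # > 0 while inside an excluded element
        self.blocks = []      # finished text blocks
        self._buf = []        # text pieces of the block being built

    def flush_block(self):
        # Join the buffered pieces and normalise whitespace.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or tag in self.excluded:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush_block()

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush_block()

    def handle_data(self, data):
        if not self.skip_depth:
            self._buf.append(data)


def clean_html(html, excluded_tags="nav,footer", word_count_threshold=5):
    """Return text blocks outside excluded tags that meet the word threshold."""
    # Excluded tags arrive as a comma-separated string, as in the node's option.
    tags = [t.strip().lower() for t in excluded_tags.split(",") if t.strip()]
    parser = BlockExtractor(tags)
    parser.feed(html)
    parser.close()
    parser.flush_block()
    return [b for b in parser.blocks if len(b.split()) >= word_count_threshold]
```

Calling `clean_html(page_html, "nav,footer", 5)` on an article page would keep paragraphs of five or more words and drop both the navigation and footer text and short fragments such as "Read more".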

Properties

Name	Meaning
HTML Content	The raw HTML content to process.
Base URL	The base URL used to resolve relative links within the HTML content.
Crawler Options	Collection of options to customize crawling behavior:
- CSS Selector	CSS selector string to focus extraction on specific parts of the HTML (e.g., `article.content`).
- Exclude External Links	Boolean flag to exclude external links from the results.
- Excluded Tags	Comma-separated list of HTML tags to exclude from processing (e.g., `nav,footer,aside`).
- Word Count Threshold	Minimum number of words required for content to be included in the output.
Options	Additional processing options:
- Include Media Data	Whether to include media data such as images and videos in the output.
- Verbose Response	Whether to include detailed response data like original HTML and status codes in the output.
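
To make the Base URL and Exclude External Links properties more concrete, here is a minimal Python sketch using only the standard library. It shows one plausible way relative links can be resolved and external links dropped; it is an assumption about the general technique, not the node's actual code, and `LinkCollector` and `collect_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def collect_links(html, base_url, exclude_external=True):
    """Resolve links against base_url and optionally drop external ones."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    base_host = urlparse(base_url).netloc.lower()
    links = []
    for href in parser.hrefs:
        # urljoin leaves absolute URLs unchanged and resolves relative ones.
        absolute = urljoin(base_url, href)
        # In this sketch, "external" means the host differs from the base host.
        is_external = urlparse(absolute).netloc.lower() != base_host
        if exclude_external and is_external:
            continue
        links.append(absolute)
    return links
```

Under this definition, links without a host (for example `mailto:` addresses) also count as external; a real implementation may treat them differently.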

Output

The node outputs JSON data representing the processed content extracted from the raw HTML. This typically includes cleaned text content filtered according to the specified selectors and exclusions.

If enabled, media data such as image and video metadata will also be included in the output.
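
The exact shape of the media data is not documented here. As a hedged illustration of what image and video metadata can look like when gathered from HTML, the Python sketch below collects `src` and `alt` attributes with the standard library. The function name `collect_media` and the `images`, `videos`, `src`, and `alt` keys are hypothetical and may not match the node's real output fields.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MediaCollector(HTMLParser):
    """Collect <img> and <video> sources, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        # Hypothetical structure; the node's real field names may differ.
        self.media = {"images": [], "videos": []}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        src = attributes.get("src")
        if not src:
            return  # nothing to record without a source
        absolute = urljoin(self.base_url, src)
        if tag == "img":
            self.media["images"].append(
                {"src": absolute, "alt": attributes.get("alt") or ""}
            )
        elif tag == "video":
            self.media["videos"].append({"src": absolute})


def collect_media(html, base_url):
    """Return a dict of image and video entries found in the HTML."""
    parser = MediaCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.media
```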

When verbose mode is active, additional fields may appear containing the original HTML snippet, HTTP status codes, or other diagnostic information useful for debugging or detailed analysis.

The node's source code does not indicate any binary data output.

Dependencies

Requires an API key credential for the Crawl4AI service to perform crawling operations.
The node depends on the Crawl4AI platform's API to process and parse the HTML content.
No other external dependencies are explicitly mentioned.

Troubleshooting

Missing or invalid API credentials: Ensure that a valid API key credential for the Crawl4AI service is configured in n8n.
Empty or malformed HTML input: Providing empty or invalid HTML content may result in no output or errors. Validate the HTML before passing it to the node.
Incorrect CSS selectors: If the CSS selector does not match any elements, the output may be empty. Verify selectors using browser developer tools.
Excluding too many tags: Overly broad excluded tags might remove all content. Adjust the exclusion list carefully.
Word count threshold too high: Setting a very high threshold could filter out all content unintentionally.
Verbose output confusion: Enabling verbose response adds extra data fields to the output; if these fields are unexpected in later workflow steps, disable this option.
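
Some of these problems can be caught before the node runs. The Python sketch below is one possible pre-flight check for empty input and for a word count threshold that no content could satisfy. It uses only the standard library; `preflight` and `TextCounter` are hypothetical helpers, not part of the node or of Crawl4AI.

```python
from html.parser import HTMLParser


class TextCounter(HTMLParser):
    """Count tags and whitespace-separated words in an HTML string."""

    def __init__(self):
        super().__init__()
        self.words = 0
        self.tags = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1

    def handle_data(self, data):
        # Counts all text, including script and style content, so the total
        # is an upper bound on the visible word count.
        self.words += len(data.split())


def preflight(html, word_count_threshold=0):
    """Return a list of warnings; an empty list means no issue was found."""
    if not html or not html.strip():
        return ["HTML content is empty"]
    counter = TextCounter()
    counter.feed(html)
    counter.close()
    warnings = []
    if counter.tags == 0:
        warnings.append("input contains no HTML tags")
    if counter.words < word_count_threshold:
        warnings.append(
            f"only {counter.words} words in total, "
            f"below the threshold of {word_count_threshold}"
        )
    return warnings
```

Because the word total is an upper bound, a threshold warning from this check means no block in the input can reach the threshold, while the absence of a warning does not guarantee that any block will.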

Links and References

Crawl4AI Official Website — For API documentation and service details.
CSS Selectors Reference — To help craft effective selectors.
HTML Parsing Best Practices — Guidance on working with HTML content.
