Crawl4AI: Basic Crawler

Crawl websites using Crawl4AI

Overview

This node, "Crawl4AI: Basic Crawler," processes raw HTML content and extracts meaningful data from it. It is useful when you already have raw HTML and want structured information such as text content, links, or media elements, optionally narrowed with a CSS selector or by excluding certain tags.

Practical examples include:

- Extracting the main article content from a downloaded HTML page.
- Filtering out navigation bars, footers, or sidebars to focus on primary content.
- Resolving relative URLs based on a base URL.
- Optionally including media data like images or videos in the output.
- Applying word count thresholds to filter out insignificant content blocks.
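
To make the URL-resolution example concrete, here is a minimal sketch using only the Python standard library. It illustrates the general idea of resolving relative links against a base URL and flagging links that point to another host. The function name `resolve_links` and the output keys `href` and `external` are hypothetical, and this is not a description of how Crawl4AI implements the step internally.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(hrefs, base_url):
    """Resolve each href against base_url and flag links that leave the base host.

    Hypothetical helper for illustration only; not part of Crawl4AI or n8n.
    """
    base_host = urlparse(base_url).netloc
    resolved = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # relative -> absolute
        resolved.append({
            "href": absolute,
            # A link counts as external when its host differs from the base host.
            "external": urlparse(absolute).netloc != base_host,
        })
    return resolved


links = resolve_links(
    ["/about", "img/logo.png", "https://other.org/page"],
    "https://example.com",
)
internal_only = [link for link in links if not link["external"]]
```

Filtering on the `external` flag, as in `internal_only`, mirrors what an option like Exclude External Links is meant to achieve.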

Properties

Name	Meaning
HTML Content	The raw HTML content to process. Example: `<html><body><h1>Example</h1><p>Content</p></body></html>`
Base URL	The base URL used to resolve relative links within the HTML content. Default: `https://example.com`
Crawler Options	Collection of options to customize crawling behavior:
- CSS Selector	CSS selector string to focus extraction on specific parts of the HTML (e.g., `article.content`). Empty means full page.
- Exclude External Links	Boolean flag to exclude external links from the results.
- Excluded Tags	Comma-separated list of HTML tags to exclude from processing (e.g., `nav,footer,aside`).
- Word Count Threshold	Minimum number of words required for content to be included in the output.
Options	Additional options controlling output details:
- Include Media Data	Boolean flag indicating whether to include media data such as images and videos in the output.
- Verbose Response	Boolean flag to include detailed response data like original HTML and status codes.
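
The interaction between Excluded Tags and Word Count Threshold can be sketched in plain Python. The example below is an assumption-laden illustration, not the Crawl4AI implementation: it treats each `<p>` element as one content block, assumes the excluded tags are container elements with closing tags, and interprets the threshold as "keep blocks with at least this many words". All names (`parse_excluded_tags`, `BlockExtractor`, `extract_blocks`) are hypothetical.

```python
from html.parser import HTMLParser


def parse_excluded_tags(value):
    """Turn a comma-separated string like 'nav, footer,aside' into a set of tag names."""
    return {tag.strip().lower() for tag in value.split(",") if tag.strip()}


class BlockExtractor(HTMLParser):
    """Collects the text of each <p> block, skipping anything inside excluded tags."""

    def __init__(self, excluded):
        super().__init__()
        self.excluded = excluded
        self.skip_depth = 0    # greater than 0 while inside an excluded element
        self.in_block = False  # True while inside a <p> that is being kept
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.excluded:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_block = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag in self.excluded and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p":
            self.in_block = False

    def handle_data(self, data):
        if self.in_block and self.skip_depth == 0:
            self.blocks[-1] += data


def extract_blocks(html, excluded_tags="", word_count_threshold=0):
    """Return non-empty <p> blocks outside excluded tags that meet the word threshold."""
    parser = BlockExtractor(parse_excluded_tags(excluded_tags))
    parser.feed(html)
    parser.close()
    cleaned = [" ".join(block.split()) for block in parser.blocks]
    return [b for b in cleaned if b and len(b.split()) >= word_count_threshold]


sample_html = """
<html><body>
<nav><p>Home About Contact</p></nav>
<p>Short note.</p>
<p>This paragraph has more than five words in it.</p>
<footer><p>Copyright notice for the site footer text here.</p></footer>
</body></html>
"""

blocks = extract_blocks(sample_html, "nav, footer", 5)
```

With these inputs, the `nav` and `footer` paragraphs are skipped by tag exclusion and "Short note." is dropped by the threshold, so only the longer paragraph remains in `blocks`.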

Output

The node outputs JSON data representing the processed content extracted from the provided raw HTML. This typically includes cleaned and filtered textual content, resolved links, and optionally media data if enabled.

If media inclusion is enabled, the output will contain media metadata such as image or video URLs.

When verbose response is enabled, additional fields may be present, such as the original HTML snippet processed, HTTP status codes, or other diagnostic information.

The source code does not indicate any binary data output.

Dependencies

- Requires an API key credential for the Crawl4AI service to perform crawling and processing.
- The node depends on the Crawl4AI API endpoint to process the HTML content.
- No other external dependencies are indicated.

Troubleshooting

Common issues:
- Invalid or missing API credentials will cause authentication failures.
- Providing malformed or empty HTML content may result in no output or errors.
- Incorrect CSS selectors might lead to empty or incomplete extraction.
- Setting too high a word count threshold could filter out all content unintentionally.
Error messages:
- Authentication errors related to the API key require verifying and updating the credential.
- Network or API errors suggest checking internet connectivity and Crawl4AI service status.
- Parsing errors indicate invalid HTML input; ensure the HTML is well-formed.
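
Several of the issues above can be caught before the node runs. The following is a rough sketch of such a pre-check, assuming you can inspect the inputs in a preceding step. The function `preflight_warnings` is hypothetical, and the tag-stripping regex only gives an approximate word count, not a real HTML parse.

```python
import re


def preflight_warnings(html_content, word_count_threshold=0):
    """Return warnings for inputs that are likely to produce empty output.

    Hypothetical helper for illustration; not part of Crawl4AI or n8n.
    """
    warnings = []
    text = (html_content or "").strip()
    if not text:
        warnings.append("HTML content is empty")
        return warnings
    if "<" not in text:
        warnings.append("HTML content contains no tags")
    # Rough upper bound: words left after removing tags with a simple regex.
    total_words = len(re.sub(r"<[^>]*>", " ", text).split())
    if word_count_threshold > total_words:
        warnings.append(
            f"word count threshold {word_count_threshold} exceeds "
            f"the {total_words} words in the input"
        )
    return warnings


empty_case = preflight_warnings("")
threshold_case = preflight_warnings("<p>one two three</p>", 10)
ok_case = preflight_warnings("<p>one two three</p>", 2)
```

A threshold larger than the total word count of the input guarantees that every block is filtered out, so this check flags the "too high a threshold" case early.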

Links and References

Crawl4AI Official Website (for API documentation and usage)
n8n Documentation on Creating Custom Nodes