Crawl4AI: Basic Crawler

Crawl websites using Crawl4AI

Overview

The "Crawl4AI: Basic Crawler" node processes raw HTML content and extracts structured information from it, such as text content, links, and media elements. It is most useful when you already have the HTML as input and want to narrow the extraction with filters such as CSS selectors or excluded tags.

Practical examples include:

Extracting the main article content from a downloaded HTML page.
Filtering out navigation bars, footers, or sidebars to focus on primary content.
Resolving relative URLs based on a base URL.
Optionally including media data like images or videos in the output.
Applying word count thresholds to filter out insignificant content blocks.
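To make the base URL example concrete, the sketch below shows how relative links can be resolved against a base URL using only the Python standard library. It illustrates the general idea and is not the node's or the Crawl4AI service's actual implementation; the names `LinkCollector`, `resolve_links`, and `exclude_external` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the raw href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def resolve_links(html, base_url, exclude_external=False):
    """Return absolute URLs for the links in html, resolved against base_url.

    Assumption: a link counts as external when its host differs from
    the host of base_url.
    """
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    base_host = urlparse(base_url).netloc
    resolved = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # relative -> absolute
        is_external = urlparse(absolute).netloc != base_host
        if exclude_external and is_external:
            continue
        resolved.append(absolute)
    return resolved


sample = (
    '<a href="/about">About</a>'
    '<a href="news/today.html">News</a>'
    '<a href="https://other.org/x">Other</a>'
)
print(resolve_links(sample, "https://example.com/blog/"))
print(resolve_links(sample, "https://example.com/blog/", exclude_external=True))
```

With a base URL of `https://example.com/blog/`, the root-relative `/about` resolves to the site root, while `news/today.html` resolves under `/blog/`. A trailing slash on the base URL therefore changes the result for path-relative links.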

Properties

Name	Meaning
HTML Content	The raw HTML content to process. Example: `<html><body><h1>Example</h1><p>Content</p></body></html>`
Base URL	The base URL used to resolve relative links within the HTML content. Default: `https://example.com`
Crawler Options	Collection of options to customize crawling behavior:
- CSS Selector	CSS selector string to focus extraction on specific parts of the HTML (e.g., `article.content`). Leave empty to process full page.
- Exclude External Links	Boolean flag to exclude external links from the extracted results.
- Excluded Tags	Comma-separated list of HTML tags to exclude from processing (e.g., `nav,footer,aside`).
- Word Count Threshold	Minimum number of words required for content to be included in the output.
Options	Additional options affecting output details:
- Include Media Data	Whether to include media elements such as images and videos in the output.
- Verbose Response	Whether to include detailed response data such as original HTML and status codes in the output.
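The following sketch illustrates how the Excluded Tags and Word Count Threshold options could interact. It is a simplified, standard-library approximation, not the Crawl4AI implementation: it assumes well-formed HTML with explicit closing tags, treats each text node as one content block, and uses hypothetical names (`BlockExtractor`, `parse_excluded_tags`, `extract_blocks`).

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BlockExtractor(HTMLParser):
    """Collects text nodes that are not nested inside an excluded tag."""

    def __init__(self, excluded_tags):
        super().__init__()
        self.excluded_tags = set(excluded_tags)
        self.skip_depth = 0  # > 0 while inside an excluded element
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Enter an excluded element, or go one level deeper inside one.
        if self.skip_depth or tag in self.excluded_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not self.skip_depth:
            self.blocks.append(text)


def parse_excluded_tags(value):
    """Turn a comma-separated string such as 'nav,footer,aside' into a list."""
    return [tag.strip().lower() for tag in value.split(",") if tag.strip()]


def extract_blocks(html, excluded_tags="", word_count_threshold=0):
    """Return text blocks outside excluded tags with enough words."""
    extractor = BlockExtractor(parse_excluded_tags(excluded_tags))
    extractor.feed(html)
    extractor.close()
    return [b for b in extractor.blocks
            if len(b.split()) >= word_count_threshold]


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<h1>Example</h1><p>This paragraph has five words.</p>"
    "<footer><p>Copyright notice</p></footer></body></html>"
)
print(extract_blocks(page, "nav, footer", word_count_threshold=3))
```

In this sketch the `nav` and `footer` content is dropped first, and the one-word heading is then removed by the threshold of 3, leaving only the paragraph. A threshold higher than the longest block would return an empty list.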

Output

The node outputs JSON data representing the processed content extracted from the provided raw HTML. This typically includes cleaned and filtered textual content, resolved links, and optionally media data if enabled.

If the "Include Media Data" option is selected, the output will also contain information about images and videos found within the processed HTML.

When "Verbose Response" is enabled, the output includes additional metadata such as the original HTML content and HTTP status codes related to the crawling operation.

Based on its source code and properties, the node does not output binary data.

Dependencies

Requires an API key credential for the Crawl4AI service to perform crawling and processing.
The node depends on the external Crawl4AI API to handle the actual crawling and HTML processing logic.
No other external dependencies or environment variables are explicitly required.

Troubleshooting

Common Issues:
- Invalid or missing API credentials will prevent the node from functioning.
- Providing malformed or incomplete HTML content may result in empty or incorrect output.
- Incorrect CSS selectors might lead to no content being extracted.
- Setting too high a word count threshold could filter out all content unintentionally.
Error Messages:
- Authentication errors indicate issues with the API key; verify and re-enter credentials.
- Network or timeout errors suggest connectivity problems with the Crawl4AI service.
- Parsing errors may occur if the HTML content is not well-formed; validate your input HTML.
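Several of the issues above can be caught before the node runs. The sketch below is a hypothetical pre-flight check (for example, in a preceding code step); the function name `check_inputs` and its messages are illustrative assumptions and not part of the node or the Crawl4AI API. It does not validate credentials or CSS selectors.

```python
import re
from urllib.parse import urlparse


def check_inputs(html_content, base_url, word_count_threshold):
    """Return a list of likely problems; an empty list means none were found."""
    problems = []
    html_content = html_content or ""

    if not html_content.strip():
        problems.append("HTML Content is empty")
    elif "<" not in html_content:
        problems.append("HTML Content contains no tags")

    parsed = urlparse(base_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("Base URL is not an absolute http(s) URL")

    if word_count_threshold < 0:
        problems.append("Word Count Threshold is negative")
    else:
        # Rough word count of the whole document with tags stripped.
        # If the threshold exceeds it, no single block can pass the filter.
        total_words = len(re.sub(r"<[^>]*>", " ", html_content).split())
        if word_count_threshold > total_words:
            problems.append("Word Count Threshold exceeds the total word count")

    return problems


print(check_inputs("<p>Hello world</p>", "https://example.com", 10))
```

The tag stripping here is deliberately crude and only gives an upper bound on the words available; it is meant to flag a threshold that can never be met, not to predict the node's exact output.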

Links and References

Crawl4AI Official Website — For API documentation and account setup.
CSS Selectors Reference — To craft effective CSS selectors for content targeting.
HTML5 Specification — For understanding HTML structure and tags.