Overview
This node, named "Crawl4AI: Basic Crawler," processes raw HTML content to extract and transform web page data. It works on HTML supplied as input: it focuses on specific parts of the content, filters out unwanted elements, and can optionally include media data or verbose details in the output.
Common scenarios where this node is beneficial include:
- Extracting article content from a full HTML page for further analysis or storage.
- Cleaning up HTML by removing navigation bars, footers, or sidebars before processing.
- Filtering links to exclude external URLs when gathering internal site data.
- Applying a minimum word count so that only substantial content is processed.
- Including media information such as images or videos embedded in the HTML.
Practical example:
You have an HTML snapshot of a news article and want to extract just the main text without ads or navigation menus. You provide the raw HTML, specify a CSS selector targeting the article body, exclude tags like <nav> and <footer>, and set a minimum word count threshold to ignore short snippets. The node returns cleaned content ready for use in newsletters or summaries.
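
To make the filtering steps in this example concrete, here is a minimal standard-library sketch of the same idea: skip excluded tags, gather text per block element, and drop blocks below a word threshold. It is an illustration only, not the node's or Crawl4AI's implementation. The names `BlockTextExtractor` and `extract_blocks` are hypothetical, and CSS selector matching is left out because the Python standard library does not provide it.

```python
from html.parser import HTMLParser

# Elements treated as block boundaries in this sketch (an assumption).
BLOCK_TAGS = {"p", "div", "article", "section", "li",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Void elements never get an end tag, so they must not open a skipped region.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class BlockTextExtractor(HTMLParser):
    """Collects text per block element, skipping excluded tags."""

    def __init__(self, excluded_tags):
        super().__init__()
        # script/style bodies are never readable content, so always skip them.
        self.excluded = {t.lower() for t in excluded_tags} | {"script", "style"}
        self.skip_depth = 0   # > 0 while inside an excluded element
        self.blocks = []      # finished text blocks
        self._current = []    # text fragments of the block being built

    def flush(self):
        # Join fragments and collapse runs of whitespace.
        text = " ".join(" ".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in self.excluded and tag not in VOID_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.excluded and tag not in VOID_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self._current.append(data)


def extract_blocks(html, excluded_tags, word_count_threshold):
    """Return text blocks outside excluded tags that meet the word threshold."""
    parser = BlockTextExtractor(excluded_tags)
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any trailing text after the last block tag
    return [b for b in parser.blocks if len(b.split()) >= word_count_threshold]
```

Calling `extract_blocks(html, ["nav", "footer"], 5)` on a news page would keep article paragraphs of five or more words and discard menu and footer text, which mirrors the configuration described above.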
Properties
| Name | Meaning |
|---|---|
| HTML Content | The raw HTML content to process. |
| Base URL | The base URL used to resolve relative links within the HTML content. |
| Crawler Options | Collection of options to customize crawling behavior: |
| - CSS Selector | CSS selector string to focus extraction on specific parts of the HTML (e.g., article.content). |
| - Exclude External Links | Boolean flag to exclude external links from the results. |
| - Excluded Tags | Comma-separated list of HTML tags to exclude from processing (e.g., nav,footer,aside). |
| - Word Count Threshold | Minimum number of words required for content to be included in the output. |
| Options | Additional processing options: |
| - Include Media Data | Whether to include media data such as images and videos in the output. |
| - Verbose Response | Whether to include detailed response data like original HTML and status codes in the output. |
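
As an illustration of what the Base URL, Exclude External Links, and Excluded Tags properties imply, the following sketch uses Python's `urllib.parse`. The helper names `parse_excluded_tags` and `resolve_links` are hypothetical, and treating a link as external whenever its host differs from the base URL's host is an assumption made for this sketch; the service may apply a different rule.

```python
from urllib.parse import urljoin, urlparse


def parse_excluded_tags(value):
    """Split a comma-separated tag list such as 'nav, footer,aside'."""
    return [t.strip().lower() for t in value.split(",") if t.strip()]


def resolve_links(hrefs, base_url, exclude_external=False):
    """Resolve hrefs against base_url, optionally dropping external ones."""
    base_host = urlparse(base_url).netloc.lower()
    resolved = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # relative links become absolute
        host = urlparse(absolute).netloc.lower()
        # Assumption: "external" means a different host than the base URL.
        # Links without a host (for example mailto:) also count as external.
        if exclude_external and host != base_host:
            continue
        resolved.append(absolute)
    return resolved


links = resolve_links(
    ["/news/1", "story.html", "https://other.example.org/x"],
    "https://example.com/articles/",
    exclude_external=True,
)
print(links)
print(parse_excluded_tags("nav, footer,aside"))
```

Without a Base URL, a relative link such as `story.html` cannot be turned into an absolute address, which is why the property matters when the HTML was saved from a live site.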
Output
The node outputs JSON data representing the processed content extracted from the raw HTML. This typically includes cleaned text content filtered according to the specified selectors and exclusions.
If enabled, media data such as image and video metadata will also be included in the output.
When verbose mode is active, additional fields may appear containing the original HTML snippet, HTTP status codes, or other diagnostic information useful for debugging or detailed analysis.
The source code does not indicate any binary data output.
Dependencies
- Requires an API key credential for the Crawl4AI service to perform crawling operations.
- The node depends on the Crawl4AI platform's API to process and parse the HTML content.
- No other external dependencies are explicitly mentioned.
Troubleshooting
- Missing or invalid API credentials: Ensure that a valid API key credential for the Crawl4AI service is configured in n8n.
- Empty or malformed HTML input: Providing empty or invalid HTML content may result in no output or errors. Validate the HTML before passing it to the node.
- Incorrect CSS selectors: If the CSS selector does not match any elements, the output may be empty. Verify selectors using browser developer tools.
- Excluding too many tags: Overly broad excluded tags might remove all content. Adjust the exclusion list carefully.
- Word count threshold too high: Setting a very high threshold could filter out all content unintentionally. Lower the threshold if the output is unexpectedly empty.
- Unexpected extra fields: Enabling Verbose Response adds extra data fields to the output. Disable this option if those fields are not wanted.
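
Several of the issues above can be caught before the node runs. The following is a minimal pre-flight sketch, assuming the inputs are available as plain Python values (for example in a script that prepares data for the workflow). The function name `preflight` and the set of tags treated as too broad are hypothetical choices for this illustration, not part of the node.

```python
from html.parser import HTMLParser


class _Stats(HTMLParser):
    """Records which tags occur and how many words of text the input has."""

    def __init__(self):
        super().__init__()
        self.tags = set()
        self.words = 0

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)

    def handle_data(self, data):
        self.words += len(data.split())


def preflight(html, excluded_tags=(), word_count_threshold=0):
    """Return a list of warnings; an empty list means no obvious problem."""
    if not html or not html.strip():
        return ["HTML content is empty"]

    stats = _Stats()
    stats.feed(html)
    stats.close()

    warnings = []
    if not stats.tags:
        warnings.append("no HTML tags found; input may not be HTML")

    # Assumption: excluding these container tags is likely to remove everything.
    broad = {"html", "body", "main", "article"} & {t.lower() for t in excluded_tags}
    if broad:
        warnings.append(
            "excluded tags may remove all content: " + ", ".join(sorted(broad))
        )

    # The count includes script and style text, so this is only an upper bound.
    if word_count_threshold > stats.words:
        warnings.append("word count threshold exceeds total words in the input")
    return warnings
```

A check like this does not verify CSS selectors; those are still best tested in the browser developer tools as noted above.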
Links and References
- Crawl4AI Official Website — For API documentation and service details.
- CSS Selectors Reference — To help craft effective selectors.
- HTML Parsing Best Practices — Guidance on working with HTML content.