
Crawl4AI: Basic Crawler

Crawl websites using Crawl4AI

Overview

This node, named "Crawl4AI: Basic Crawler," processes raw HTML content that you supply and extracts structured information from it, such as text content, links, and media elements. You can narrow the extraction with a CSS selector or by excluding specific tags.

Practical examples include:

  • Extracting the main article content from a downloaded HTML page.
  • Filtering out navigation bars, footers, or sidebars to focus on primary content.
  • Resolving relative URLs based on a base URL.
  • Optionally including media data like images or videos in the output.
  • Applying word count thresholds to filter out insignificant content blocks.
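
As an illustration of the relative URL resolution mentioned above, here is a minimal sketch using Python's standard library. It is not the node's implementation; the helper name `resolve_links` and the sample URLs are hypothetical and only show the general idea of resolving links against a base URL.

```python
from urllib.parse import urljoin


def resolve_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve each href against base_url (hypothetical helper).

    Absolute URLs are returned unchanged; relative ones are joined
    to the base URL following standard URL resolution rules.
    """
    return [urljoin(base_url, href) for href in hrefs]


links = resolve_links(
    "https://example.com/blog/post.html",
    ["/about", "images/cover.png", "https://other.example/page"],
)
for link in links:
    print(link)
```

Root-relative paths such as `/about` resolve against the host, while `images/cover.png` resolves against the directory of the base URL, so the choice of Base URL affects every relative link in the output.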

Properties

Name: Meaning
HTML Content: The raw HTML content to process. Example: <html><body><h1>Example</h1><p>Content</p></body></html>
Base URL: The base URL used to resolve relative links within the HTML content. Default: https://example.com
Crawler Options: Collection of options to customize crawling behavior:
- CSS Selector: CSS selector string to focus extraction on specific parts of the HTML (e.g., article.content). Leave empty to process full page.
- Exclude External Links: Boolean flag to exclude external links from the extracted results.
- Excluded Tags: Comma-separated list of HTML tags to exclude from processing (e.g., nav,footer,aside).
- Word Count Threshold: Minimum number of words required for content to be included in the output.
Options: Additional options affecting output details:
- Include Media Data: Whether to include media elements such as images and videos in the output.
- Verbose Response: Whether to include detailed response data such as original HTML and status codes in the output.
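
To make the effect of Excluded Tags and Word Count Threshold more concrete, the following sketch approximates the behavior with Python's standard `html.parser`. It is an assumption-laden illustration, not the logic Crawl4AI actually runs: the names `BlockExtractor` and `extract_blocks` are hypothetical, each run of text between tags is treated as one block, and the threshold is assumed to be inclusive.

```python
from html.parser import HTMLParser


class BlockExtractor(HTMLParser):
    """Collect text blocks, skipping text nested inside excluded tags.

    Assumes excluded tags are container elements with a closing tag
    (nav, footer, aside); void elements such as img are not handled.
    """

    def __init__(self, excluded_tags: set[str]):
        super().__init__()
        self.excluded_tags = excluded_tags
        self.skip_depth = 0  # greater than 0 while inside an excluded element
        self.blocks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.excluded_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.excluded_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.blocks.append(text)


def extract_blocks(html: str, excluded_tags: str, word_count_threshold: int) -> list[str]:
    """Return text blocks with at least word_count_threshold words."""
    # Parse the comma-separated tag list, e.g. "nav,footer,aside".
    tags = {t.strip().lower() for t in excluded_tags.split(",") if t.strip()}
    parser = BlockExtractor(tags)
    parser.feed(html)
    parser.close()
    return [b for b in parser.blocks if len(b.split()) >= word_count_threshold]


sample_html = (
    "<html><body><nav>Home About Contact</nav><h1>Example</h1>"
    "<p>This paragraph has enough words to pass.</p>"
    "<footer>Copyright notice here</footer></body></html>"
)
print(extract_blocks(sample_html, "nav,footer,aside", 3))
```

In this sketch the one-word heading is dropped by a threshold of 3, which shows how a threshold that is set too high can remove short but meaningful content.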

Output

The node outputs JSON data representing the processed content extracted from the provided raw HTML. This typically includes cleaned and filtered textual content, resolved links, and optionally media data if enabled.

If the "Include Media Data" option is selected, the output will also contain information about images and videos found within the processed HTML.

When "Verbose Response" is enabled, the output includes additional metadata such as the original HTML content and HTTP status codes related to the crawling operation.

Neither the source code nor the properties indicate any binary data output.

Dependencies

  • Requires an API key credential for the Crawl4AI service to perform crawling and processing.
  • The node depends on the external Crawl4AI API to handle the actual crawling and HTML processing logic.
  • No other external dependencies or environment variables are explicitly required.

Troubleshooting

  • Common Issues:

    • Invalid or missing API credentials will prevent the node from functioning.
    • Providing malformed or incomplete HTML content may result in empty or incorrect output.
    • Incorrect CSS selectors might lead to no content being extracted.
    • Setting too high a word count threshold could filter out all content unintentionally.
  • Error Messages:

    • Authentication errors indicate issues with the API key; verify and re-enter credentials.
    • Network or timeout errors suggest connectivity problems with the Crawl4AI service.
    • Parsing errors may occur if the HTML content is not well-formed; validate your input HTML.
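
Some of the issues above can be caught before the node runs. The sketch below is a hypothetical pre-flight check, assuming the inputs are available as plain Python values; the function name `check_inputs` and its messages are illustrative and are not part of the node or the Crawl4AI API.

```python
from urllib.parse import urlparse


def check_inputs(html: str, base_url: str, word_count_threshold: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []

    # Empty or whitespace-only HTML can only produce empty output.
    if not html.strip():
        problems.append("HTML Content is empty")

    # Relative links can only be resolved against an absolute http(s) URL.
    parsed = urlparse(base_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("Base URL is not an absolute http(s) URL")

    # A negative minimum word count has no meaningful interpretation.
    if word_count_threshold < 0:
        problems.append("Word Count Threshold is negative")

    return problems


print(check_inputs("<p>Hello</p>", "https://example.com", 0))
print(check_inputs("   ", "example.com", -1))
```

A check like this does not validate API credentials or CSS selectors; those can only be confirmed by running the node against the Crawl4AI service.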
