Crawl4AI: Basic Crawler

Crawl websites using Crawl4AI

Overview

The "Crawl4AI: Basic Crawler" node processes raw HTML that you supply and extracts structured data from it: text content, links, and optionally media elements. It is most useful when you already have the HTML and want to narrow the extraction with a CSS selector, a list of excluded tags, or a minimum word count.

Practical examples include:

  • Extracting the main article content from a downloaded HTML page.
  • Filtering out navigation bars, footers, or sidebars to focus on primary content.
  • Resolving relative URLs based on a base URL.
  • Optionally including media data like images or videos in the output.
  • Applying word count thresholds to filter out insignificant content blocks.
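
Two of the examples above, resolving relative URLs and applying a word count threshold, can be illustrated with a minimal Python sketch. This is not the node's implementation; the function names `resolve_links` and `filter_blocks` are hypothetical, and treating the threshold as inclusive (a block with exactly that many words is kept) is an assumption.

```python
from urllib.parse import urljoin


def resolve_links(hrefs, base_url):
    """Resolve each href against base_url; absolute URLs pass through unchanged."""
    return [urljoin(base_url, href) for href in hrefs]


def filter_blocks(blocks, word_count_threshold):
    """Keep text blocks with at least word_count_threshold words.

    Assumption: the comparison is inclusive. The real service may differ.
    """
    return [b for b in blocks if len(b.split()) >= word_count_threshold]


links = resolve_links(
    ["/about", "img/logo.png", "https://other.org/x"],
    "https://example.com/blog/",
)
blocks = filter_blocks(["Home", "This paragraph has enough words to keep."], 5)
```

With the base URL `https://example.com/blog/`, the root-relative `/about` resolves against the host, `img/logo.png` resolves under `/blog/`, and the one-word block "Home" is dropped by a threshold of 5.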

Properties

| Name | Meaning |
|---|---|
| HTML Content | The raw HTML content to process. Example: <html><body><h1>Example</h1><p>Content</p></body></html> |
| Base URL | The base URL used to resolve relative links within the HTML content. Default: https://example.com |
| Crawler Options | Collection of options to customize crawling behavior: |
| - CSS Selector | CSS selector string to focus extraction on specific parts of the HTML (e.g., article.content). Empty means full page. |
| - Exclude External Links | Boolean flag to exclude external links from the results. |
| - Excluded Tags | Comma-separated list of HTML tags to exclude from processing (e.g., nav,footer,aside). |
| - Word Count Threshold | Minimum number of words required for content to be included in the output. |
| Options | Additional options controlling output details: |
| - Include Media Data | Boolean flag indicating whether to include media data such as images and videos in the output. |
| - Verbose Response | Boolean flag to include detailed response data like original HTML and status codes. |
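
To make the shape of these settings concrete, here is a small Python sketch that gathers them into one dictionary. The key names (`html_content`, `crawler_options`, and so on) and the helper `parse_excluded_tags` are hypothetical, chosen only to mirror the property names above; they are not the node's internal parameter names.

```python
def parse_excluded_tags(raw):
    """Turn a comma-separated string into a clean list of lowercase tag names."""
    return [tag.strip().lower() for tag in raw.split(",") if tag.strip()]


# Hypothetical representation of the node's properties; values are examples.
settings = {
    "html_content": "<html><body><h1>Example</h1><p>Content</p></body></html>",
    "base_url": "https://example.com",
    "crawler_options": {
        "css_selector": "article.content",   # empty string means the full page
        "exclude_external_links": True,
        "excluded_tags": parse_excluded_tags("nav, footer,aside"),
        "word_count_threshold": 10,
    },
    "options": {
        "include_media": False,
        "verbose_response": False,
    },
}
```

Normalizing the Excluded Tags string up front avoids surprises from stray spaces or trailing commas in the field.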

Output

The node outputs JSON describing the content extracted from the supplied HTML. This typically includes the cleaned and filtered text and the resolved links.

If Include Media Data is enabled, the output also contains media metadata such as image or video URLs.

If Verbose Response is enabled, additional fields may be present, such as the original HTML that was processed, HTTP status codes, or other diagnostic information.

The source code gives no indication that the node produces binary output.

Dependencies

  • Requires an API key credential for the Crawl4AI service to perform crawling and processing.
  • The node depends on the Crawl4AI API endpoint to process the HTML content.
  • No other external dependencies are indicated.

Troubleshooting

  • Common issues:

    • Invalid or missing API credentials will cause authentication failures.
    • Providing malformed or empty HTML content may result in no output or errors.
    • Incorrect CSS selectors might lead to empty or incomplete extraction.
    • Setting too high a word count threshold could filter out all content unintentionally.
  • Error messages:

    • Authentication errors related to the API key require verifying and updating the credential.
    • Network or API errors suggest checking internet connectivity and Crawl4AI service status.
    • Parsing errors indicate invalid HTML input; ensure the HTML is well-formed.
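
Several of the issues above (empty HTML, a threshold that filters out everything) can be caught locally before the node runs. The following is a rough pre-flight sketch using only the Python standard library; `preflight` and `_TextBlocks` are hypothetical names, the notion of a "block" (one run of text between tags) is a simplification, and the check does not reproduce how Crawl4AI actually segments content.

```python
from html.parser import HTMLParser


class _TextBlocks(HTMLParser):
    """Collect text runs, skipping anything nested inside excluded tags."""

    def __init__(self, excluded_tags):
        super().__init__()
        self.excluded = {t.lower() for t in excluded_tags}
        self.depth = 0    # greater than 0 while inside an excluded tag
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Note: excluding a void tag such as <br> would never be closed here.
        if tag in self.excluded:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.excluded and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.depth == 0:
            self.blocks.append(text)


def preflight(html, excluded_tags=(), word_count_threshold=0):
    """Return a list of warnings; an empty list means no obvious problem."""
    if not html or not html.strip():
        return ["HTML content is empty"]
    parser = _TextBlocks(excluded_tags)
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return ["no text remains after excluding tags"]
    longest = max(len(block.split()) for block in parser.blocks)
    if longest < word_count_threshold:
        return [
            f"word count threshold {word_count_threshold} exceeds "
            f"the longest block ({longest} words)"
        ]
    return []
```

For example, `preflight("<p>one two</p>", word_count_threshold=5)` returns a single warning, which suggests lowering the threshold before suspecting the API or the credentials.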
