Crawl4AI: Content Extractor

Extract structured content from web pages using Crawl4AI

Overview

The Crawl4AI: Content Extractor node uses a large language model (LLM) to extract structured content from web pages according to user-defined instructions and a schema. It is most useful when you need specific data points from a website without writing custom scraping code, for example product details, article summaries, or other custom information from HTML pages.

Typical use cases include:

  • E-commerce: Extracting product names, prices, descriptions, and other attributes from online stores.
  • News aggregation: Summarizing articles or extracting headlines and authors.
  • Market research: Collecting structured data from competitor websites.
  • Data enrichment: Pulling additional info from public web pages to augment datasets.

The node loads the target URL in a browser environment (optionally with JavaScript enabled) and can run custom JavaScript on the page. It then sends the extracted text, together with the extraction instructions and schema fields, to an LLM provider, which parses the text and returns structured data.
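
The last step of that flow, combining the instructions, schema fields, and page text into one request for the LLM, can be sketched as below. This is only an illustration: the function name `build_extraction_prompt`, the field dictionary keys, and the `max_chars` limit are assumptions, not the node's actual implementation.

```python
def build_extraction_prompt(instructions, schema_fields, page_text, max_chars=20000):
    """Combine instructions, schema fields, and page text into one prompt string.

    Hypothetical helper: the real node may format its request differently.
    """
    field_lines = []
    for field in schema_fields:
        required = "required" if field.get("required") else "optional"
        field_lines.append(
            f"- {field['name']} ({field['type']}, {required}): "
            f"{field.get('description', '')}"
        )
    # Truncate very long pages so the prompt stays within a rough size budget.
    text = page_text[:max_chars]
    return "\n".join([
        "Instructions:",
        instructions.strip(),
        "",
        "Return a JSON object with these fields:",
        *field_lines,
        "",
        "Page text:",
        text,
    ])


fields = [
    {"name": "title", "type": "string", "description": "Product name", "required": True},
    {"name": "price", "type": "number", "description": "Price in USD", "required": False},
]
prompt = build_extraction_prompt(
    "Extract the product name and price.", fields, "Example Product - $19.99"
)
print(prompt)
```

Listing each field with its type, required flag, and description mirrors what the Schema Fields property collects, so the LLM sees the same information the user entered.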

Properties

  • URL: The web page URL to extract content from.
  • Extraction Instructions: Text instructions guiding the LLM on what specific information to extract from the page (e.g., "Extract the product name, price, and description").
  • Schema Fields: Defines the fields to extract, each with:
      - Field Name: Identifier for the extracted field.
      - Field Type: Data type (String, Number, Boolean, Array).
      - Description: Helps the LLM understand the field.
      - Required: Whether the field must be present.
  • Browser Options: Controls how the page is loaded:
      - Enable JavaScript: Whether to run JS on the page.
      - Headless Mode: Run the browser invisibly.
      - JavaScript Code: Custom JS to execute before extraction (e.g., click buttons).
      - Timeout: Max wait time for page load.
      - Viewport Height/Width: Browser window size.
  • LLM Options: Settings for the language model:
      - LLM Provider: Choose among supported LLMs (Anthropic Claude, Groq Llama, Ollama Llama, OpenAI GPT-3.5 Turbo, OpenAI GPT-4o).
      - Max Tokens: Max tokens in the response.
      - Override LLM Provider: Use a custom API key instead of the default credentials.
      - Provider API Key: API key to use when overriding.
      - Temperature: Controls creativity/randomness of the output.
  • Options: Additional options:
      - Cache Mode: How to use caching when crawling (enabled, bypass, only).
      - Include Original Text: Whether to include the full webpage text in the output.
      - CSS Selector: Limit extraction to a specific part of the page using a CSS selector.
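
To make the Schema Fields property more concrete, here is a minimal sketch of how a list of such fields could be turned into a JSON Schema object. The dictionary keys, the `TYPE_MAP` mapping, and the function name `to_json_schema` are assumptions for illustration; the node's internal representation is not documented here.

```python
# Assumed mapping from the node's Field Type labels to JSON Schema type names.
TYPE_MAP = {
    "String": "string",
    "Number": "number",
    "Boolean": "boolean",
    "Array": "array",
}


def to_json_schema(schema_fields):
    """Build a JSON Schema 'object' description from a list of field definitions."""
    properties = {}
    required = []
    for field in schema_fields:
        prop = {"type": TYPE_MAP[field["type"]]}
        if field.get("description"):
            prop["description"] = field["description"]
        properties[field["name"]] = prop
        if field.get("required"):
            required.append(field["name"])
    return {"type": "object", "properties": properties, "required": required}


schema = to_json_schema([
    {"name": "title", "type": "String", "description": "Product name", "required": True},
    {"name": "tags", "type": "Array", "description": "Labels shown on the page"},
])
print(schema)
```

A clear Description on each field matters because it is the only per-field hint the LLM receives beyond the name and type.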

Output

The node outputs JSON data containing the extracted fields as defined by the user’s schema. Each field will have a value matching its specified type (string, number, boolean, or array). If "Include Original Text" is enabled, the original webpage text is also included in the output.

This node focuses on structured JSON extraction. Binary data output, such as downloaded files or screenshots, is not shown here.

Example output structure (simplified; "originalText" is present only when "Include Original Text" is enabled):

{
  "extractedData": {
    "title": "Example Product",
    "price": 19.99,
    "description": "A great product for your needs.",
    "available": true,
    "tags": ["new", "sale"]
  },
  "originalText": "<full webpage text here>"
}
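
Because the values come from an LLM, a downstream step may want to confirm that each field really has the declared type. The check below is a hedged sketch: `validate_extracted` and the field dictionary layout are hypothetical, and the sample data simply mirrors the example output above.

```python
import json


def validate_extracted(data, schema_fields):
    """Return a list of problems; an empty list means the data matches the schema."""
    checks = {
        "String": lambda v: isinstance(v, str),
        # bool is a subclass of int in Python, so exclude it explicitly.
        "Number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
        "Boolean": lambda v: isinstance(v, bool),
        "Array": lambda v: isinstance(v, list),
    }
    problems = []
    for field in schema_fields:
        name = field["name"]
        if name not in data or data[name] is None:
            if field.get("required"):
                problems.append(f"missing required field: {name}")
            continue
        if not checks[field["type"]](data[name]):
            problems.append(f"wrong type for {name}: expected {field['type']}")
    return problems


output = json.loads("""
{
  "extractedData": {
    "title": "Example Product",
    "price": 19.99,
    "description": "A great product for your needs.",
    "available": true,
    "tags": ["new", "sale"]
  }
}
""")
fields = [
    {"name": "title", "type": "String", "required": True},
    {"name": "price", "type": "Number", "required": True},
    {"name": "available", "type": "Boolean"},
    {"name": "tags", "type": "Array"},
]
print(validate_extracted(output["extractedData"], fields))  # prints []
```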

Dependencies

  • Requires an active internet connection to load the target URL.
  • Needs an API key credential for the Crawl4AI service to perform extraction.
  • Requires credentials or API keys for the selected LLM provider if overriding defaults.
  • Uses a headless browser environment internally to load and interact with web pages.
  • Node configuration may require setting environment variables or credentials for LLM providers and Crawl4AI API access.

Troubleshooting

  • Page Load Failures: If the URL is incorrect or the site blocks automated browsers, extraction will fail. Verify the URL and consider adjusting browser options, such as toggling JavaScript or increasing the timeout.
  • Incorrect or Missing Data: Ensure extraction instructions and schema fields are clear and accurate. Ambiguous instructions can confuse the LLM.
  • Cache Issues: If stale data is returned, try changing the cache mode to "Bypass" to force fresh extraction.
  • API Key Errors: Invalid or missing API keys for Crawl4AI or LLM providers will cause authentication errors. Check credentials and permissions.
  • Timeouts: Complex pages or slow networks might require increasing the timeout setting.
  • JavaScript Execution Problems: If dynamic content is not loading, verify that JavaScript is enabled and that any custom JS code is correct.
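
For the timeout case, one option outside the node is to retry with progressively longer limits. The sketch below assumes a hypothetical `extract(url, timeout=...)` callable that raises `TimeoutError` when the page does not load in time; the timeout values and their unit are placeholders, not the node's defaults.

```python
def extract_with_retries(extract, url, timeouts=(30, 60, 120)):
    """Call extract(url, timeout=t) with growing timeouts until one succeeds."""
    if not timeouts:
        raise ValueError("timeouts must contain at least one value")
    last_error = None
    for timeout in timeouts:
        try:
            return extract(url, timeout=timeout)
        except TimeoutError as err:
            # Remember the failure and try again with the next, longer timeout.
            last_error = err
    raise last_error


# Stand-in for a real extraction call: succeeds only with a timeout of 60 or more.
def fake_extract(url, timeout):
    if timeout < 60:
        raise TimeoutError(f"page load exceeded {timeout}")
    return {"url": url, "timeout_used": timeout}


print(extract_with_retries(fake_extract, "https://example.com"))
```

Only timeouts are retried here; authentication or invalid-URL errors propagate immediately, since repeating the call would not fix them.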
