Crawl4AI: Content Extractor

Extract structured content from web pages using Crawl4AI

Overview

The Crawl4AI: Content Extractor node extracts structured content from web pages by having a Large Language Model (LLM) interpret the page content according to user instructions. It is intended for cases where you want to turn web data into structured JSON objects without writing scraping code by hand.

Typical use cases include:

  • Extracting job listings with titles, locations, and URLs from a careers page.
  • Pulling product details such as names, prices, and descriptions from e-commerce sites.
  • Collecting event information like dates, venues, and descriptions from event listing pages.

The node combines browser automation options (e.g., enabling JavaScript, running headless, executing custom JS) with LLM-powered extraction guided by user-defined schema fields and instructions, so that unstructured web content can be converted into usable data.


Properties

URL: The web page URL to extract content from.

Extraction Instructions: Text instructions for the LLM describing what to extract from the page. For multiple items, specify that all items should be extracted.

Schema Fields: Defines the fields to extract, each with:
- Field Name: identifier for the extracted field.
- Field Type: String, Number, Boolean, or Array.
- Description: helps the LLM understand what to extract.
- Required: whether the field must be present.

Browser Options: Controls browser behavior during extraction:
- Enable JavaScript: run page scripts.
- Headless Mode: run the browser without a visible window.
- JavaScript Code: custom JS to execute before extraction (e.g., click buttons).
- Timeout: maximum wait time for page load, in milliseconds.
- Viewport Height/Width: browser window size.

LLM Options: Settings for the language model:
- Frequency Penalty, Presence Penalty: control token repetition.
- LLM Provider: OpenAI, Groq, Anthropic, Ollama.
- Max Tokens: maximum response length.
- Model Name or ID: specific model selection.
- Override LLM Provider: use a custom API key instead of the configured credentials.
- Provider API Key: optional API key, used when overriding credentials.
- Temperature, Top P: control creativity and diversity.

Options: Additional extraction options:
- Cache Mode: enabled, bypass, or read-only cache usage.
- Extract Multiple Items: whether to extract multiple entries or just one.
- Include Original Text: include the full webpage text in the output.
- CSS Selector: restrict extraction to a specific part of the page.
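
To illustrate how Schema Fields could translate into a schema that an LLM can follow, the sketch below builds a JSON Schema object from a list of field definitions. The SchemaField class, the to_json_schema helper, and the type mapping are assumptions made for illustration; the node's actual internal conversion is not described in this page.

```python
from dataclasses import dataclass

# Assumed mapping from the node's Field Type options to JSON Schema type names.
TYPE_MAP = {
    "String": "string",
    "Number": "number",
    "Boolean": "boolean",
    "Array": "array",
}


@dataclass
class SchemaField:
    """Hypothetical mirror of one Schema Fields entry."""
    name: str
    field_type: str
    description: str = ""
    required: bool = False


def to_json_schema(fields):
    """Build a JSON Schema object describing one extracted item."""
    properties = {}
    required = []
    for field in fields:
        if field.field_type not in TYPE_MAP:
            raise ValueError(f"Unsupported field type: {field.field_type}")
        properties[field.name] = {
            "type": TYPE_MAP[field.field_type],
            "description": field.description,
        }
        if field.required:
            required.append(field.name)
    return {"type": "object", "properties": properties, "required": required}


# Example: fields for the job-listing use case mentioned in the overview.
job_fields = [
    SchemaField("title", "String", "Job title", required=True),
    SchemaField("location", "String", "Office location or 'Remote'"),
    SchemaField("url", "String", "Link to the job posting", required=True),
]
job_schema = to_json_schema(job_fields)
```

Writing a specific Description for each field matters because, in a schema like this, the description is the only hint the model gets about what a field should contain.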

Output

The node outputs JSON whose structure follows the user-defined schema fields, with each field's value parsed from the web page content.

If "Extract Multiple Items" is enabled, the output will be an array of objects, each representing one extracted item (e.g., one job listing per object). Otherwise, a single object is returned.

If "Include Original Text" is enabled, the output also contains the full original webpage text alongside the extracted data.
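
Because the output can be either a single object or an array, downstream code may want to treat both shapes the same way. The following is a minimal sketch of that idea; normalize_items is a hypothetical helper, and the sample payloads are illustrative rather than real node output.

```python
import json


def normalize_items(output):
    """Return extracted items as a list, whether one object or an array was emitted."""
    if isinstance(output, list):
        return output
    if isinstance(output, dict):
        return [output]
    raise TypeError(f"Expected a JSON object or array, got {type(output).__name__}")


# Illustrative payloads only; real field names come from your Schema Fields.
single = json.loads('{"title": "Backend Engineer", "location": "Berlin"}')
multiple = json.loads('[{"title": "Backend Engineer"}, {"title": "Data Analyst"}]')

single_items = normalize_items(single)
multiple_items = normalize_items(multiple)
```

With this in place, later processing can always iterate over a list regardless of how "Extract Multiple Items" is set.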

No binary output is indicated by the source code; the node returns structured JSON only.


Dependencies

  • Requires an API key credential for the Crawl4AI service to perform extraction.
  • Supports multiple LLM providers (OpenAI, Groq, Anthropic, Ollama), which may require separate API keys or credentials.
  • Uses a headless browser environment to load and interact with web pages, optionally executing JavaScript.
  • Requires network access to the target URLs and to the selected LLM provider's endpoint.

Troubleshooting

  • Page Load Failures or Timeouts: If the page does not load within the specified timeout, increase the "Timeout (MS)" property or check network connectivity.
  • Incorrect or Incomplete Extraction: Ensure the "Extraction Instructions" clearly describe what to extract and that the "Schema Fields" match expected data types and names.
  • JavaScript Execution Issues: If dynamic content is not loaded, verify "Enable JavaScript" is true and consider adding custom "JavaScript Code" to trigger loading (e.g., clicking "Load More" buttons).
  • Cache Problems: If stale data is returned, set "Cache Mode" to bypass so the page is fetched fresh.
  • LLM Errors or Rate Limits: Check API keys and usage limits for the selected LLM provider. Use the override option to supply a valid API key if needed.
  • Model Selection Issues: If no models appear or fetching the model list fails, the node falls back to default models. Entering a valid model ID explicitly avoids relying on those defaults.
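
When diagnosing incorrect or incomplete extraction, it can help to check each extracted item against the schema you configured. The sketch below is one way to do that outside the node; find_problems, the (name, type, required) tuple layout, and the type mapping are assumptions for illustration, not part of the node itself.

```python
# Assumed mapping from Field Type options to Python types for parsed JSON values.
PYTHON_TYPES = {
    "String": str,
    "Number": (int, float),
    "Boolean": bool,
    "Array": list,
}


def find_problems(item, fields):
    """Return human-readable problems for one extracted item.

    `fields` is a list of (name, field_type, required) tuples.
    """
    problems = []
    for name, field_type, required in fields:
        if name not in item or item[name] is None:
            if required:
                problems.append(f"missing required field '{name}'")
            continue
        value = item[name]
        # bool is a subclass of int in Python, so reject it explicitly for Number.
        if field_type == "Number" and isinstance(value, bool):
            problems.append(f"field '{name}' should be Number, got bool")
        elif not isinstance(value, PYTHON_TYPES[field_type]):
            problems.append(
                f"field '{name}' should be {field_type}, got {type(value).__name__}"
            )
    return problems


fields = [
    ("name", "String", True),
    ("price", "Number", True),
    ("in_stock", "Boolean", False),
]
# The price came back as text, a typical sign the instructions need tightening.
item = {"name": "Desk Lamp", "price": "19.99"}
problems = find_problems(item, fields)
```

A type mismatch like the one above usually points to the field description or the extraction instructions, for example by stating that prices should be returned as plain numbers without currency symbols.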

Most error messages relate to network issues, invalid API keys, or misconfigured extraction parameters. Reviewing the execution logs and adjusting the corresponding property usually resolves them.


This summary is based solely on static analysis of the provided source code and property definitions.