Overview
The Crawl4AI: Content Extractor node extracts structured data from web pages by leveraging a Large Language Model (LLM) to interpret and parse the content. It is designed to fetch a webpage, optionally execute JavaScript on it (e.g., to load dynamic content), and then use an LLM to extract specific information according to user-defined instructions and schema.
This node is beneficial in scenarios such as:
- Scraping product details (name, price, description) from e-commerce sites.
- Extracting summaries or key points from articles or blog posts.
- Gathering structured data from complex web pages where traditional scrapers struggle due to dynamic content or inconsistent HTML structure.
Practical example:
- You want to extract the title, price, and features of a product from a retail website. You provide the URL, specify extraction instructions for these fields, define a schema for the expected output, and the node returns structured JSON with the extracted data.
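The product example above can be sketched concretely. The schema and output below are illustrative (the field names and values are assumptions, not taken from a real page); the schema is expressed as the JSON Schema equivalent of the "Simple Fields" definition:

```javascript
// Hypothetical "Advanced JSON" schema for the product-extraction example.
const productSchema = {
  type: "object",
  properties: {
    title: { type: "string", description: "The product's display name" },
    price: { type: "number", description: "Current price shown on the page" },
    features: {
      type: "array",
      items: { type: "string" },
      description: "Bullet-point feature list",
    },
  },
  required: ["title", "price"],
};

// The node would return structured JSON shaped like this (sample values):
const exampleOutput = {
  title: "Acme Wireless Mouse",
  price: 24.99,
  features: ["Bluetooth 5.0", "Ergonomic grip"],
};
```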
Properties
| Name | Meaning |
|---|---|
| URL | The web page URL to extract content from. |
| Extraction Instructions | Text instructions guiding the LLM on what specific information to extract from the page (e.g., "Extract the product name, price, and description"). |
| Schema Input Mode | Method to define the extraction schema: either via simple individual field inputs ("Simple Fields") or by providing a full JSON schema ("Advanced JSON"). |
| Schema Fields | (When using Simple Fields mode) A collection of fields specifying each field's name, type (string, number, boolean, array), description, and whether it is required. |
| JSON Schema | (When using Advanced JSON mode) A JSON-formatted schema defining the structure, types, descriptions, and required fields for the extracted data. |
| Browser Options | Settings controlling the browser environment used to load the page, including enabling/disabling JavaScript execution, headless mode, custom JavaScript code to run before extraction, timeout, and viewport dimensions. |
| LLM Options | Configuration for the LLM provider used for extraction, including choice of provider, maximum tokens, temperature (randomness), option to override default provider, and API key if overriding. |
| Options | Additional options such as how to handle arrays in the extracted data (e.g., keep as object, split top-level arrays, smart splitting), cache usage mode (enabled, bypass, read-only), CSS selector to focus extraction, and whether to include metadata or original text in outputs. |
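The Browser Options row above can be illustrated with a configuration sketch. All property names here are assumptions for illustration, not the node's exact internal field names:

```javascript
// Illustrative Browser Options configuration (names are hypothetical):
const browserOptions = {
  javaScriptEnabled: true,   // render dynamic content before extraction
  headless: true,            // run the browser without a visible window
  timeout: 30000,            // ms to wait for the page to load
  viewport: { width: 1280, height: 800 },
  // Custom JS run before extraction, e.g. to trigger lazy loading:
  jsCode: "window.scrollTo(0, document.body.scrollHeight);",
};
```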
Output
The node outputs structured JSON data under the json field that matches the user-defined schema. This JSON contains the extracted fields as interpreted by the LLM from the webpage content.
If array handling options are enabled, the output may be split into multiple items based on arrays detected in the extracted data, allowing easier processing of lists or collections.
The node can also include metadata such as the source URL and success status, and optionally the full original text of the webpage if requested.
The node does not produce binary data output.
Dependencies
- Requires access to a web browser environment capable of loading and rendering web pages, including optional JavaScript execution.
- Requires an API key credential for the Crawl4AI service to perform the extraction.
- Supports multiple LLM providers; if overriding the default provider, requires an API key for the chosen LLM.
- Network access to target URLs and LLM endpoints is necessary.
- Proper configuration of caching behavior can improve performance and reduce repeated requests.
Troubleshooting
- Page Load Failures: If the page does not load within the specified timeout, extraction will fail. Increase the timeout or check network connectivity.
- JavaScript Execution Issues: Some pages require JavaScript to render content. Ensure JavaScript is enabled in browser options and any necessary custom JS code is correct.
- Invalid Schema: Providing malformed JSON in the advanced schema or inconsistent field definitions can cause errors. Validate JSON syntax and ensure required fields are correctly defined.
- LLM Errors: Exceeding token limits or invalid API keys for the LLM provider can cause failures. Adjust max tokens and verify API credentials.
- Cache Inconsistencies: Using cache modes improperly might return stale data or no data. Choose appropriate cache mode based on freshness requirements.
- Empty or Incorrect Extraction: If instructions or schema do not match the page content well, the LLM may return incomplete or incorrect data. Refine instructions and schema descriptions for clarity.
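For the "Invalid Schema" case, a quick sanity check before pasting a schema into the Advanced JSON field can catch most problems. This is a standalone sketch; the node's own validation may differ:

```javascript
// Validate an "Advanced JSON" schema string before using it in the node.
function validateSchemaText(text) {
  let schema;
  try {
    schema = JSON.parse(text); // catches malformed JSON syntax
  } catch (err) {
    return { ok: false, error: `Invalid JSON: ${err.message}` };
  }
  if (schema.type !== "object" || typeof schema.properties !== "object") {
    return { ok: false, error: "Schema must be an object with properties" };
  }
  // Every required field must actually be defined under properties
  const missing = (schema.required ?? []).filter((f) => !(f in schema.properties));
  if (missing.length > 0) {
    return { ok: false, error: `Required fields not defined: ${missing.join(", ")}` };
  }
  return { ok: true, schema };
}
```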
Links and References
- JSON Schema Documentation
- n8n Documentation on Custom Nodes
- Relevant LLM provider documentation (OpenAI, Anthropic, etc.) depending on selected provider.