Overview
The "Crawl4AI: Content Extractor" node extracts structured content from web pages using CSS selectors. It loads a specified URL, optionally executes JavaScript on the page (e.g., clicking buttons or scrolling), and extracts data with user-defined CSS selectors that are evaluated relative to a base selector identifying repeating elements such as product items or article cards.
Typical use cases include:
- Scraping product details from e-commerce listings.
- Extracting article titles, summaries, or links from news sites.
- Collecting structured data from any webpage with repeated content blocks.
For example, you can specify the URL of an online store category page, set the base selector to the product item container, and define fields such as product title, price, and image URL using CSS selectors. The node returns an array of extracted items, one per matched container, each holding these fields.
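The example above can be written down as a selector schema. The following Python sketch builds one as a plain dictionary. The key names (`baseSelector`, `fields`, `selector`, `type`, `attribute`) follow the schema layout used by Crawl4AI's CSS-based JSON extraction, but how this node maps its Fields property onto such a schema internally is an assumption, and the URL, selectors, and the `make_field` helper are hypothetical.

```python
from typing import Optional


def make_field(name: str, selector: str, field_type: str = "text",
               attribute: Optional[str] = None) -> dict:
    """Describe one field; `selector` is evaluated relative to the base selector."""
    if field_type not in ("text", "html", "attribute"):
        raise ValueError(f"unsupported field type: {field_type}")
    if field_type == "attribute" and not attribute:
        raise ValueError("attribute fields need an attribute name, e.g. 'href'")
    field = {"name": name, "selector": selector, "type": field_type}
    if attribute:
        field["attribute"] = attribute
    return field


# Hypothetical category page and selectors, for illustration only.
request = {
    "url": "https://example.com/category/shoes",
    "schema": {
        "name": "products",
        "baseSelector": "div.product-item",  # one match per product card
        "fields": [
            make_field("title", "h2.product-title"),
            make_field("price", "span.price"),
            make_field("image", "img", "attribute", "src"),
        ],
    },
}
```

Keeping every field selector relative to the base selector means each product card is parsed in isolation, so a missing price in one card does not shift values between items.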
Properties
| Name | Meaning |
|---|---|
| URL | The web address of the page to extract content from. |
| Base Selector | CSS selector identifying the repeating element on the page (e.g., each product card or article block). Extraction fields are relative to this selector. |
| Fields | A collection of fields to extract for each repeating element. Each field includes: Field Name (identifier for the extracted data); CSS Selector (selector relative to the base selector); Field Type (Text, HTML, or Attribute); Attribute Name (for Attribute fields, the attribute to read, e.g., href or src). |
| Browser Options | Settings controlling browser behavior: Enable JavaScript (whether to run JS on the page); Headless Mode (run the browser without a UI); JavaScript Code (custom JS to execute before extraction, e.g., clicking "load more"); Timeout (maximum wait time for page load); Viewport Height and Width (size of the browser window). |
| Options | Additional options: Cache Mode (how caching is used: enabled, bypass, or read-only); Include Original Text (whether to include the full original page text in the output); Clean Text (normalize extracted text by removing extra spaces and newlines). |
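To illustrate what the Clean Text option describes, here is a minimal Python sketch of whitespace normalization. It is not the node's actual implementation, which may treat edge cases differently; the `clean_text` helper is hypothetical.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()


print(clean_text("  Product   1\n\n   $19.99 \t"))  # prints "Product 1 $19.99"
```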
Output
The node outputs JSON data representing an array of extracted items, where each item corresponds to one element matched by the base selector. Each item contains key-value pairs for the requested fields, with values being either text content, raw HTML, or attribute values depending on the field type.
If the Include Original Text option is enabled, the output may also include the full original text of the webpage.
Neither the source code nor the node properties indicate any binary data output.
Example output structure (simplified):
[
  {
    "title": "Product 1",
    "price": "$19.99",
    "link": "https://example.com/product1"
  },
  {
    "title": "Product 2",
    "price": "$29.99",
    "link": "https://example.com/product2"
  }
]
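Because every extracted value arrives as a string, downstream steps usually need to convert types. The Python sketch below parses the example output above and sums the prices. The `parse_price` helper is hypothetical and assumes prices formatted like `$1,234.56`; other currencies or locales would need different handling.

```python
import json
from decimal import Decimal

# Example output copied from above; in practice this comes from the node.
raw_output = """
[
  {"title": "Product 1", "price": "$19.99", "link": "https://example.com/product1"},
  {"title": "Product 2", "price": "$29.99", "link": "https://example.com/product2"}
]
"""


def parse_price(price: str) -> Decimal:
    """Strip the dollar sign and thousands separators, then parse as a decimal."""
    return Decimal(price.replace("$", "").replace(",", "").strip())


items = json.loads(raw_output)
total = sum(parse_price(item["price"]) for item in items)
print(total)  # prints 49.98
```

`Decimal` is used instead of `float` so that currency sums stay exact.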
Dependencies
- Requires an API key credential for the Crawl4AI service to function.
- Uses a headless browser environment to load and interact with web pages.
- Supports executing custom JavaScript on the page before extraction.
- Provides a caching mechanism, configurable via Options, to speed up repeated requests.
Troubleshooting
- Page not loading or timing out: Increase the timeout setting or check network connectivity.
- No data extracted: Verify CSS selectors are correct and match the page structure. Ensure JavaScript execution is enabled if the page relies on client-side rendering.
- JavaScript code errors: Custom JS provided in browser options must be valid and safe; errors here can prevent extraction.
- Cache issues: If stale data is returned, set Cache Mode to bypass or clear the cache.
- API authentication errors: Confirm the API key credential is correctly configured and has necessary permissions.
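When some fields come back empty, it helps to find out which selector is failing. The following Python sketch counts, per field, how many extracted items lack a value. The `count_missing` helper is hypothetical, and whether the node emits an empty string or omits the key for an unmatched selector is an assumption, so the sketch treats both cases as missing.

```python
from collections import Counter


def count_missing(items: list, expected_fields: list) -> Counter:
    """Count, per field, how many items lack a non-empty value for it."""
    missing = Counter()
    for item in items:
        for field in expected_fields:
            value = item.get(field)
            if value is None or str(value).strip() == "":
                missing[field] += 1
    return missing


# Sample items where the hypothetical "link" selector matched nothing.
items = [
    {"title": "Product 1", "price": "$19.99", "link": ""},
    {"title": "Product 2", "price": "$29.99"},
]
print(dict(count_missing(items, ["title", "price", "link"])))  # prints {'link': 2}
```

A field that is missing in every item usually points to a wrong selector or attribute name, while a field missing in only some items suggests the page structure varies between blocks.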