Overview
The "Crawl4AI: Content Extractor" node extracts structured content from web pages using CSS selectors. It is particularly useful for scraping sites where content is organized in repeating elements such as product listings, article cards, or other structured blocks. You specify a base CSS selector that identifies the repeating element, then define fields with their own relative CSS selectors to pull out specific pieces of information: text, HTML, or attribute values.
Practical examples include:
- Extracting product details (title, price, image URL) from an e-commerce category page.
- Scraping article headlines, authors, and publication dates from a news site.
- Collecting event information (name, date, location) from an events listing page.
The node supports browser automation features such as enabling JavaScript execution, running in headless mode, executing custom JavaScript code before extraction (e.g., clicking "load more" buttons), and setting viewport dimensions, which helps in handling dynamic content loading.
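The browser automation options above can be pictured as a plain configuration object. The sketch below is illustrative only: the key names are assumptions, not the node's actual internal schema, and the embedded JavaScript targets a hypothetical `button.load-more` element.

```python
# Hypothetical sketch of the node's Browser Options as a plain dict.
# Key names are illustrative assumptions, not the node's exact schema.
browser_options = {
    "enable_javascript": True,   # run JS so dynamic content can render
    "headless": True,            # run the browser without a visible UI
    "timeout_ms": 30000,         # max wait time for page load
    "viewport_width": 1280,
    "viewport_height": 800,
    # Custom JS executed before extraction, e.g. to reveal lazy-loaded items.
    # 'button.load-more' is a hypothetical selector for illustration.
    "js_code": """
        const btn = document.querySelector('button.load-more');
        if (btn) { btn.click(); }
    """,
}
```

A "load more" click like this runs in the page context before the selectors are evaluated, so content revealed by the click is included in the extraction.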
Properties
| Name | Meaning |
|---|---|
| URL | The web page URL to extract content from. |
| Base Selector | CSS selector identifying the repeating element on the page (e.g., each product item or article card). |
| Fields | A collection of fields to extract from each repeating element. Each field includes:<br>- Field Name: identifier for the extracted data.<br>- CSS Selector: relative selector within the base element.<br>- Field Type: type of data to extract: Text (plain text content), HTML (inner HTML), or Attribute (a specified attribute value such as `href` or `src`).<br>- Attribute Name: name of the attribute to extract when Field Type is Attribute. |
| Browser Options | Settings controlling browser behavior during extraction:<br>- Enable JavaScript: whether to run JS on the page.<br>- Headless Mode: run the browser without a UI.<br>- JavaScript Code: custom JS to execute before extraction.<br>- Timeout (ms): maximum wait time for page load.<br>- Viewport Height & Width: dimensions of the browser window. |
| Options | Additional options:<br>- Cache Mode: controls the caching strategy (enabled, bypass, only).<br>- Include Original Text: whether to include the full original webpage text in the output.<br>- Clean Text: normalize extracted text by removing extra spaces and newlines. |
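Putting the properties together, a full node configuration might look like the following sketch. The key names and selectors are assumptions for illustration, not the node's exact parameter schema.

```python
# Illustrative extraction configuration mirroring the Properties table.
# Key names and selectors are assumptions, not the node's exact schema.
extraction_config = {
    "url": "https://example.com/products",
    "base_selector": "div.product-card",   # one result object per card
    "fields": [
        {"name": "title",    "selector": "h2.title",   "type": "text"},
        {"name": "price",    "selector": "span.price", "type": "text"},
        # Attribute fields also need the attribute name to read.
        {"name": "imageUrl", "selector": "img",        "type": "attribute",
         "attribute": "src"},
    ],
    "options": {
        "cache_mode": "enabled",        # enabled | bypass | only
        "include_original_text": False,
        "clean_text": True,             # collapse extra whitespace/newlines
    },
}
```

Each field's selector is resolved relative to the base selector, so `img` above matches the image inside each product card, not the first image on the page.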
Output
The node outputs JSON data representing an array of objects, each corresponding to one instance of the base selector found on the page. Each object contains key-value pairs where keys are the field names defined in the input properties, and values are the extracted content according to the specified field type.
If "Include Original Text" is enabled, the output will also contain the full raw text of the webpage.
The node does not produce binary output; it focuses on textual and attribute extraction.
Example output structure (simplified):
```json
[
  {
    "title": "Product 1",
    "price": "$19.99",
    "imageUrl": "https://example.com/image1.jpg"
  },
  {
    "title": "Product 2",
    "price": "$29.99",
    "imageUrl": "https://example.com/image2.jpg"
  }
]
```
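Downstream nodes typically post-process this array. The sketch below shows two common steps: the whitespace normalization that the Clean Text option performs (per the description above) and parsing display prices into numbers. The helper names and regexes are illustrative, not part of the node.

```python
import re

# Illustrative post-processing of the node's output array.
# Helper names and regexes are assumptions, not part of the node itself.

def clean_text(value: str) -> str:
    """Collapse runs of spaces and newlines into single spaces."""
    return re.sub(r"\s+", " ", value).strip()

def parse_price(price: str) -> float:
    """Convert a display price like '$19.99' into a float."""
    return float(re.sub(r"[^\d.]", "", price))

items = [
    {"title": "Product 1", "price": "$19.99",
     "imageUrl": "https://example.com/image1.jpg"},
    {"title": "Product 2", "price": "$29.99",
     "imageUrl": "https://example.com/image2.jpg"},
]

print(clean_text("  Product\n 1 "))   # "Product 1"
total = round(sum(parse_price(i["price"]) for i in items), 2)
```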
Dependencies
- Requires an API key credential for the Crawl4AI service to perform content extraction.
- Uses a headless browser environment internally to load and parse web pages, supporting JavaScript execution.
- No additional external dependencies are explicitly mentioned beyond the Crawl4AI API.
Troubleshooting
Common Issues:
- Incorrect CSS selectors may result in empty or incomplete data extraction.
- Pages heavily reliant on JavaScript might require enabling JavaScript execution and possibly adding custom JS code to trigger dynamic content loading.
- Network issues or invalid URLs can cause failures in fetching the page.
- Cache settings might cause stale data to be returned if not configured properly.
Error Messages:
- Timeout errors if the page takes too long to load; increase the Timeout (ms) property.
- Authentication errors if the API key credential is missing or invalid.
- Parsing errors if the CSS selectors do not match any elements; verify selectors with browser developer tools.
Links and References
- CSS Selectors Reference
- Headless Chrome Documentation
- Web Scraping Best Practices
- Crawl4AI Official Website (for API documentation and usage guidelines)