Overview
The "Crawl4AI: Content Extractor" node extracts structured content from web pages using CSS selectors. It is particularly useful for scraping sites where content is organized in repeating elements such as product listings, article cards, or other structured blocks. You specify a base CSS selector that identifies the repeating element, then define one or more fields, each with its own relative CSS selector, to extract specific pieces of information (text, HTML, or attribute values).
Practical examples include:
- Extracting product details (title, price, image URL) from an e-commerce category page.
- Scraping article headlines, authors, and publication dates from a news site.
- Collecting event information (name, date, location) from an events listing page.
The node supports browser automation features: enabling JavaScript execution, running in headless mode, executing custom JavaScript before extraction (e.g., clicking "load more" buttons), and setting viewport dimensions. These options help when a page loads its content dynamically.
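The base-selector/fields model described above can be sketched as a configuration object. All key names below are illustrative, not the node's actual wire format; the helper shows the documented default for attribute fields:

```python
# Hypothetical configuration mirroring the node's extraction model.
# Key names are illustrative; the node's real format may differ.
schema = {
    "url": "https://example.com/products",
    "baseSelector": "div.product-card",  # the repeating element
    "fields": [
        {"name": "title", "selector": "h2.title", "type": "text"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
        {"name": "blurb", "selector": "div.desc", "type": "html"},
    ],
}

def apply_field_defaults(field: dict) -> dict:
    """Attribute fields default to reading "href" when no attribute is named."""
    if field.get("type") == "attribute":
        field.setdefault("attribute", "href")
    return field

link = apply_field_defaults({"name": "link", "selector": "a", "type": "attribute"})
print(link["attribute"])  # → "href"
```

Each entry in `fields` is resolved relative to one match of `baseSelector`, which is why the selectors inside it are short and scoped.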
Properties
| Name | Meaning |
|---|---|
| URL | The web page URL to extract content from. |
| Base Selector | CSS selector identifying the repeating element on the page (e.g., each product item or article card). |
| Fields | A collection of fields to extract from each repeating element. Each field includes:<br>- Field Name: identifier for the extracted data.<br>- CSS Selector: relative selector within the base element.<br>- Field Type: type of data to extract (Text, HTML, or Attribute).<br>- Attribute Name: for Attribute fields, the attribute to read (default "href"). |
| Browser Options | Settings controlling browser behavior during extraction:<br>- Enable JavaScript: whether to run JavaScript on the page.<br>- Headless Mode: run the browser without a UI.<br>- JavaScript Code: custom JS to execute before extraction.<br>- Timeout (ms): maximum wait time for page load.<br>- Viewport Height & Width: browser window size. |
| Options | Additional options:<br>- Cache Mode: how to use caching ("Enabled", "Bypass", "Only").<br>- Include Original Text: whether to include the full original webpage text in the output.<br>- Clean Text: whether to clean/normalize extracted text (remove extra spaces/newlines). |
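The exact cleaning rules for "Clean Text" are not documented; a plausible sketch of the documented behavior (removing extra spaces and newlines) is:

```python
import re

def clean_text(raw: str) -> str:
    # Collapse runs of spaces, tabs, and newlines into single spaces,
    # then trim leading/trailing whitespace. This mirrors the documented
    # "remove extra spaces/newlines" behavior; the node's exact rules may differ.
    return re.sub(r"\s+", " ", raw).strip()

print(clean_text("  Product\n\t Name  here  "))  # → "Product Name here"
```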
Output
The node outputs JSON data representing an array of objects, each corresponding to one instance of the base selector found on the page. Each object contains key-value pairs where keys are the user-defined field names and values are the extracted content according to the specified field type:
- For Text fields: plain text content extracted and optionally cleaned.
- For HTML fields: inner HTML content of the selected element.
- For Attribute fields: the value of the specified attribute.
If enabled, the output may also include the full original text of the webpage.
Binary data output is not documented, so the node is assumed to produce JSON-structured data only.
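Given the description above, the output for a page with two base-selector matches might look like this (field names and values are illustrative):

```python
# One object per base-selector match, keyed by the user-defined field names.
output = [
    {"title": "Widget A", "price": "$9.99", "image": "https://example.com/a.png"},
    {"title": "Widget B", "price": "$14.99", "image": "https://example.com/b.png"},
]

# Every object carries the same user-defined keys.
assert all(set(item) == {"title", "price", "image"} for item in output)
```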
Dependencies
- Requires an API key credential for the Crawl4AI service to perform content extraction.
- Uses a headless browser environment internally to load and interact with web pages.
- No additional external dependencies are indicated beyond the Crawl4AI API access.
Troubleshooting
Common issues:
- Incorrect or overly broad CSS selectors may result in no data or incorrect data being extracted.
- Pages heavily reliant on JavaScript might require enabling JavaScript execution and possibly custom JS code to trigger content loading.
- Network timeouts can occur if the page takes too long to load; increasing the timeout option may help.
- Cache settings might cause stale data to be returned; adjusting cache mode can resolve this.
Error messages:
- Authentication errors likely indicate missing or invalid API credentials.
- Timeout errors suggest the page did not load within the specified time.
- Selector-related errors may occur if the CSS selectors do not match any elements.
Resolving these typically involves verifying credentials, adjusting browser options (like timeout and JS execution), and refining CSS selectors.
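When a selector returns no data, a quick local sanity check can rule out the selector itself before adjusting browser options. This stdlib-only sketch counts elements carrying a given class in a saved HTML snapshot, which is equivalent to a simple `.class` base selector (it does not handle compound selectors):

```python
from html.parser import HTMLParser

class ClassCounter(HTMLParser):
    """Count elements carrying a given CSS class in an HTML snapshot."""
    def __init__(self, cls: str):
        super().__init__()
        self.cls = cls
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.cls in classes:
            self.count += 1

counter = ClassCounter("product-card")
counter.feed('<div class="product-card"></div><div class="product-card"></div>')
print(counter.count)  # → 2
```

If the count is zero against the HTML you expect the node to see, the base selector needs refining; if it is non-zero locally but the node returns nothing, the live page likely needs JavaScript execution or custom JS to render those elements.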
Links and References
- CSS Selectors Reference
- Headless Browser Automation Concepts
- General web scraping best practices and legal considerations should be reviewed before extensive use.