Overview
The "Crawl4AI: Content Extractor" node extracts structured content from web pages using CSS selectors. It is particularly useful for scraping sites where content is organized in repeating elements such as product listings, article cards, or other structured blocks. You specify a base CSS selector that identifies the repeating element, then define one or more fields, each with its own relative CSS selector, to extract specific pieces of information (text, HTML, or attribute values).
Practical examples include:
- Extracting product details (title, price, image URL) from an e-commerce category page.
- Scraping article headlines, authors, and publication dates from a news site.
- Collecting event information (name, date, location) from an events listing page.
The node supports browser automation features: enabling JavaScript execution, running in headless mode, executing custom JavaScript before extraction (e.g., clicking "load more" buttons), and setting viewport dimensions. These options help when a page loads its content dynamically.
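The base-selector/fields model described above can be sketched as a configuration object. All key names below are illustrative, not the node's actual wire format; the helper shows the documented default for attribute fields:

```python
# Hypothetical configuration mirroring the node's extraction model.
# Key names are illustrative; the node's real format may differ.
schema = {
    "url": "https://example.com/products",
    "baseSelector": "div.product-card",  # the repeating element
    "fields": [
        {"name": "title", "selector": "h2.title", "type": "text"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
        {"name": "blurb", "selector": "div.desc", "type": "html"},
    ],
}

def apply_field_defaults(field: dict) -> dict:
    """Attribute fields default to reading "href" when no attribute is named."""
    if field.get("type") == "attribute":
        field.setdefault("attribute", "href")
    return field

link = apply_field_defaults({"name": "link", "selector": "a", "type": "attribute"})
print(link["attribute"])  # → "href"
```

Each entry in `fields` is resolved relative to one match of `baseSelector`, which is why the selectors inside it are short and scoped.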
Properties
| Name | Meaning |
|---|---|
| URL | The web page URL to extract content from. |
| Base Selector | CSS selector identifying the repeating element on the page (e.g., each product item or article card). |
| Fields | A collection of fields to extract from each repeating element. Each field includes:<br>- Field Name: identifier for the extracted data.<br>- CSS Selector: relative selector within the base element.<br>- Field Type: type of data to extract (Text, HTML, or Attribute).<br>- Attribute Name: for Attribute fields, the attribute to read (default "href"). |
| Browser Options | Settings controlling browser behavior during extraction:<br>- Enable JavaScript: whether to run JavaScript on the page.<br>- Headless Mode: run the browser without a UI.<br>- JavaScript Code: custom JS to execute before extraction.<br>- Timeout (ms): maximum wait time for page load.<br>- Viewport Height & Width: browser window size. |
| Options | Additional options:<br>- Cache Mode: how to use caching ("Enabled", "Bypass", "Only").<br>- Include Original Text: whether to include the full original webpage text in the output.<br>- Clean Text: whether to clean/normalize extracted text (remove extra spaces/newlines). |
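The exact cleaning rules for "Clean Text" are not documented; a plausible sketch of the documented behavior (removing extra spaces and newlines) is:

```python
import re

def clean_text(raw: str) -> str:
    # Collapse runs of spaces, tabs, and newlines into single spaces,
    # then trim leading/trailing whitespace. This mirrors the documented
    # "remove extra spaces/newlines" behavior; the node's exact rules may differ.
    return re.sub(r"\s+", " ", raw).strip()

print(clean_text("  Product\n\t Name  here  "))  # → "Product Name here"
```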
Output
The node outputs JSON data representing an array of objects, each corresponding to one instance of the base selector found on the page. Each object contains key-value pairs where keys are the user-defined field names and values are the extracted content according to the specified field type:
- For Text fields: plain text content extracted and optionally cleaned.
- For HTML fields: inner HTML content of the selected element.
- For Attribute fields: the value of the specified attribute.
If enabled, the output may also include the full original text of the webpage.
Binary data output is not documented, so the node is assumed to produce JSON-structured data only.
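Given the description above, the output for a page with two base-selector matches might look like this (field names and values are illustrative):

```python
# One object per base-selector match, keyed by the user-defined field names.
output = [
    {"title": "Widget A", "price": "$9.99", "image": "https://example.com/a.png"},
    {"title": "Widget B", "price": "$14.99", "image": "https://example.com/b.png"},
]

# Every object carries the same user-defined keys.
assert all(set(item) == {"title", "price", "image"} for item in output)
```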
Dependencies
- Requires an API key credential for the Crawl4AI service to perform content extraction.
- Uses a headless browser environment internally to load and interact with web pages.
- No additional external dependencies are indicated beyond the Crawl4AI API access.
Troubleshooting
Common issues:
- Incorrect or overly broad CSS selectors may result in no data or incorrect data being extracted.
- Pages heavily reliant on JavaScript might require enabling JavaScript execution and possibly custom JS code to trigger content loading.
- Network timeouts can occur if the page takes too long to load; increasing the timeout option may help.
- Cache settings might cause stale data to be returned; adjusting cache mode can resolve this.
Error messages:
- Authentication errors likely indicate missing or invalid API credentials.
- Timeout errors suggest the page did not load within the specified time.
- Selector-related errors may occur if the CSS selectors do not match any elements.
Resolving these typically involves verifying credentials, adjusting browser options (like timeout and JS execution), and refining CSS selectors.
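When a selector returns no data, a quick local sanity check can rule out the selector itself before adjusting browser options. This stdlib-only sketch counts elements carrying a given class in a saved HTML snapshot, which is equivalent to a simple `.class` base selector (it does not handle compound selectors):

```python
from html.parser import HTMLParser

class ClassCounter(HTMLParser):
    """Count elements carrying a given CSS class in an HTML snapshot."""
    def __init__(self, cls: str):
        super().__init__()
        self.cls = cls
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.cls in classes:
            self.count += 1

counter = ClassCounter("product-card")
counter.feed('<div class="product-card"></div><div class="product-card"></div>')
print(counter.count)  # → 2
```

If the count is zero against the HTML you expect the node to see, the base selector needs refining; if it is non-zero locally but the node returns nothing, the live page likely needs JavaScript execution or custom JS to render those elements.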
Links and References
- CSS Selectors Reference
- Headless Browser Automation Concepts
- General web scraping best practices and legal considerations should be reviewed before extensive use.