Crawl4AI: Content Extractor

Extract structured content from web pages using Crawl4AI

Overview

The node "Crawl4AI: Content Extractor" is designed to extract structured JSON content from web pages. It supports extracting JSON data directly from a URL or embedded within HTML pages, including JSON inside <script> tags or JSON-LD structured data. This node is useful for scenarios where you need to scrape or gather structured data from websites that expose their data in JSON format or embed it within their HTML.

Practical examples include:

Extracting product information from e-commerce sites that embed product details as JSON-LD.
Fetching API-like JSON responses directly from URLs.
Scraping dynamic web pages where JSON data is embedded inside script tags and requires browser rendering with JavaScript enabled.
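
To make the JSON-LD case concrete, here is a minimal sketch of how such data can be pulled out of already-fetched HTML using only the Python standard library. It is not the node's implementation; the class name `JsonLdCollector`, the function `extract_json_ld`, and the sample HTML are hypothetical and exist only for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # The attribute value can be None (e.g. <script type>), so guard it.
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.blocks.append("".join(self._buffer))


def extract_json_ld(html):
    """Return every JSON-LD block that parses as valid JSON."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    results = []
    for raw in collector.blocks:
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
    return results


# Hypothetical sample page, for illustration only.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Widget"}
</script>
</head><body></body></html>
"""

products = extract_json_ld(SAMPLE_HTML)
```

This sketch only sees JSON-LD that is present in the HTML it is given. Pages that inject the data with JavaScript still need browser rendering first.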

Properties

Name	Meaning
URL	The URL of the JSON data source to extract from.
JSON Path	The path within the JSON response to extract specific data (e.g., `data.items`). Leave empty to extract the entire JSON response.
Source Type	Where to find the JSON data on the page. Options: • Direct JSON URL — URL returns JSON directly. • JSON in Script Tag — JSON is embedded inside an HTML `<script>` tag. • JSON-LD — JSON-LD structured data.
Script Selector	CSS selector to identify the `<script>` tag containing the JSON data when "Source Type" is set to "JSON in Script Tag".
Browser Options	Collection of options controlling the headless browser behavior: • Headless Mode — Run browser without UI. • Enable JavaScript — Allow JS execution on the page. • Timeout (MS) — Max wait time for page load. • JavaScript Code — Custom JS code to execute before extraction (e.g., scrolling).
Options	Additional options: • Cache Mode — How to use caching (enabled, bypass, read-only). • Include Full Content — Whether to include the full JSON content along with extracted data. • Headers — HTTP headers to send with the request (in JSON format).
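
The JSON Path property can be pictured with a small sketch. This section does not specify the exact path syntax the node accepts, so the code below assumes a simple dot-separated path such as `data.items`, with numeric segments treated as list indexes. The function name `resolve_path` and the sample response are hypothetical.

```python
def resolve_path(document, path):
    """Walk a dot-separated path through nested dicts and lists.

    Assumption: dot-separated keys, with decimal segments used as list indexes.
    Returns None when any segment cannot be resolved.
    """
    if not path:
        # An empty path means "return the entire JSON document".
        return document
    current = document
    for segment in path.split("."):
        if isinstance(current, dict) and segment in current:
            current = current[segment]
        elif (
            isinstance(current, list)
            and segment.isdecimal()
            and int(segment) < len(current)
        ):
            current = current[int(segment)]
        else:
            return None
    return current


# Hypothetical response body, for illustration only.
response = {"data": {"items": [{"id": 1}, {"id": 2}]}}

items = resolve_path(response, "data.items")
second_id = resolve_path(response, "data.items.1.id")
missing = resolve_path(response, "data.missing")
```

Returning `None` for an unresolved path mirrors the "empty result" behavior described under Troubleshooting. A real JSONPath implementation supports far more syntax than this sketch.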

Output

The node outputs JSON data extracted from the specified URL or page. The structure of the output depends on the JSON Path property:

If a JSON Path is provided, the output contains the subset of JSON data at that path.
If no JSON Path is given, the entire JSON response or extracted JSON object is returned.
If "Include Full Content" is enabled, the full JSON content fetched is also included alongside the extracted data.
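
The three cases above can be sketched as a small function. The output field names `data` and `fullContent` are assumptions made for this illustration; this section does not state the actual keys the node emits.

```python
def build_output(full_json, extracted, include_full_content=False):
    """Assemble an output item from the extracted subset and, optionally, the full JSON.

    Assumption: "data" and "fullContent" are hypothetical key names.
    """
    output = {"data": extracted}
    if include_full_content:
        output["fullContent"] = full_json
    return output


# Hypothetical response body, for illustration only.
full = {"data": {"items": [1, 2, 3]}, "meta": {"page": 1}}

# JSON Path given: only the subset at that path is returned.
subset_only = build_output(full, full["data"]["items"])

# No JSON Path: the whole document is the extracted value.
whole = build_output(full, full)

# "Include Full Content" enabled: subset plus the full document.
with_full = build_output(full, full["data"]["items"], include_full_content=True)
```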

The node is not documented as producing binary data, so its output is assumed to be JSON only.

Dependencies

Requires an API key credential for the Crawl4AI service to perform the extraction.
Optionally uses a headless browser to render pages and execute JavaScript.
Supports custom HTTP headers and caching strategies which may require proper configuration depending on the target website.
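
Because the Headers option is entered as JSON, it can help to validate the value before running the node. The following is a minimal sketch of such a check; the function name `parse_headers` is hypothetical, and the node's own validation may behave differently.

```python
import json


def parse_headers(raw):
    """Parse a JSON string of HTTP headers into a flat dict of strings.

    Raises ValueError if the text is not valid JSON or is not a JSON object.
    (json.JSONDecodeError is a subclass of ValueError.)
    """
    if not raw or not raw.strip():
        return {}  # treat an empty field as "no extra headers"
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("Headers must be a JSON object, e.g. {\"Accept\": \"application/json\"}")
    # Header names and values are sent as text, so coerce both to strings.
    return {str(name): str(value) for name, value in parsed.items()}


headers = parse_headers('{"Accept": "application/json", "X-Retry": 3}')
```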

Troubleshooting

Common issues:
- Incorrect or missing URL can cause failures to fetch data.
- Invalid JSON Path may result in empty or incorrect extraction results.
- If JSON is embedded in script tags, an incorrect CSS selector will fail to locate the JSON.
- Pages requiring JavaScript might not load correctly if JavaScript execution is disabled.
- Timeout errors occur if the page takes too long to load or to execute scripts.
- Cache mode misconfiguration could lead to stale or missing data.

Error messages and resolutions:
- Failed to fetch URL: Check network connectivity, URL correctness, and API key validity.
- JSON parsing error: Verify that the source actually returns valid JSON or that the JSON Path is correct.
- Timeout exceeded: Increase the timeout value in Browser Options or check page load performance.
- No JSON found in script tag: Confirm the script selector matches the actual page structure.
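
When a JSON parsing error appears, a common cause is that the source returned HTML (for example an error or login page) instead of JSON. A quick local check like the sketch below can confirm this; the function name `diagnose_json` is hypothetical and is not part of the node.

```python
import json


def diagnose_json(text):
    """Report whether text is valid JSON and, if not, where parsing failed."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
    return "valid JSON"


ok_report = diagnose_json('{"data": {"items": []}}')
html_report = diagnose_json("<html><body>Access denied</body></html>")
```

A failure at line 1, column 1 usually means the body is not JSON at all, which points to the URL or Source Type setting rather than the JSON Path.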

Links and References

JSONPath Syntax
JSON-LD Introduction
Headless Browser Automation Concepts (Puppeteer as an example)
Crawl4AI official documentation