Crawl4AI: Content Extractor

Extract structured content from web pages using Crawl4AI


Overview

The "Crawl4AI: Content Extractor" node extracts structured JSON content from web pages. It can fetch JSON directly from a URL or pull JSON embedded in an HTML page, including JSON inside <script> tags and JSON-LD structured data. Use it when a website exposes its data as JSON or embeds that data in its HTML.

Practical examples include:

Extracting product information from e-commerce sites that provide JSON data in script tags.
Gathering metadata or structured data from news articles using JSON-LD.
Fetching API-like JSON responses directly from URLs for further processing.
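
To illustrate the embedded-JSON cases above, here is a minimal sketch of pulling JSON-LD blocks out of an HTML string. It uses only Python's standard library and is not the node's actual implementation; the names `JsonScriptCollector` and `extract_json_blocks` are hypothetical, and the sketch matches scripts by their `type` attribute rather than by CSS selector.

```python
import json
from html.parser import HTMLParser


class JsonScriptCollector(HTMLParser):
    """Collect the raw text of <script> tags with a given type attribute."""

    def __init__(self, script_type="application/ld+json"):
        super().__init__()
        self.script_type = script_type
        self._inside = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Start buffering only for script tags of the requested type.
        if tag == "script" and dict(attrs).get("type") == self.script_type:
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self._inside = False
            self.blocks.append("".join(self._buffer))


def extract_json_blocks(html, script_type="application/ld+json"):
    """Return every matching script block that parses as valid JSON."""
    collector = JsonScriptCollector(script_type)
    collector.feed(html)
    collector.close()
    results = []
    for raw in collector.blocks:
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
    return results
```

Passing `script_type="application/json"` would collect plain JSON script blocks instead, which roughly corresponds to the "JSON in Script Tag" case.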

Properties

Name	Meaning
URL	The URL of the JSON data to extract. This is the web address from which the JSON content will be fetched.
JSON Path	The path within the JSON structure to extract specific data. If left empty, the entire JSON response is returned.
Source Type	Specifies where to find the JSON data on the page. Options: Direct JSON URL (the URL returns JSON directly); JSON in Script Tag (JSON is embedded inside a `<script>` tag); JSON-LD (JSON-LD structured data).
Script Selector	CSS selector to identify the `<script>` tag containing the JSON data when "Source Type" is set to "JSON in Script Tag".
Browser Options	Options controlling browser behavior during extraction: Headless Mode (run the browser without a UI); Enable JavaScript (allow JavaScript execution on the page); Timeout (MS) (maximum wait time for page load, in milliseconds); JavaScript Code (custom JavaScript to run before extraction, e.g., scrolling).
Options	Additional options: Cache Mode (controls caching behavior: enable, bypass, or only use the cache); Include Full Content (whether to include the full JSON content along with the extracted data); Headers (HTTP headers to send with the request, formatted as a JSON string).
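
Because the Headers option is entered as a JSON string, it can help to check the string before using it. The following is a minimal sketch of that check, assuming the value must be a JSON object of header names to values; `parse_headers` and `build_request` are hypothetical helpers and do not reflect how the node itself processes headers.

```python
import json
import urllib.request


def parse_headers(headers_json):
    """Parse a JSON string of HTTP headers into a dict of strings."""
    if not headers_json.strip():
        return {}  # an empty field means no extra headers
    headers = json.loads(headers_json)
    if not isinstance(headers, dict):
        raise ValueError("Headers must be a JSON object, e.g. {\"Accept\": \"application/json\"}")
    return {str(name): str(value) for name, value in headers.items()}


def build_request(url, headers_json=""):
    """Build (but do not send) a request carrying the parsed headers."""
    return urllib.request.Request(url, headers=parse_headers(headers_json))
```

For example, `parse_headers('{"Accept": "application/json"}')` yields a one-entry dict, while a JSON array such as `'["Accept"]'` raises `ValueError`.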

Output

The node outputs JSON data extracted from the specified source. The output json field contains either the entire JSON response or the subset defined by the JSON Path property. If "Include Full Content" is enabled, the full JSON content is also included alongside the extracted data.
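
The output behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the node's code: `resolve_path` handles only a simplified subset of JSONPath (dotted keys and numeric indexes such as `$.offers[0].price`), and the output field names `data` and `fullContent` are hypothetical.

```python
import re

# Matches either a key segment (name) or an index segment ([0]).
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve_path(data, path):
    """Resolve a simplified path like '$.offers[0].price' against parsed JSON."""
    if not path:
        return data  # empty path: return the entire JSON response
    if path.startswith("$"):
        path = path[1:]
    current = data
    for key, index in _TOKEN.findall(path):
        current = current[int(index)] if index else current[key]
    return current


def build_output(full_json, json_path="", include_full_content=False):
    """Shape the result: extracted subset, plus full content if requested."""
    output = {"data": resolve_path(full_json, json_path)}
    if include_full_content:
        output["fullContent"] = full_json  # hypothetical field name
    return output
```

In this sketch, a path that does not exist raises `KeyError` or `IndexError`, which loosely mirrors the "Invalid JSON Path" case listed under Troubleshooting.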

This node focuses on JSON extraction and does not explicitly handle binary output.

Dependencies

Requires an API key credential for the Crawl4AI service to perform the extraction.
Optionally uses a headless browser to load pages and execute JavaScript when needed.
Requires network access to fetch URLs.
Proper configuration of HTTP headers and cache settings can affect performance and results.

Troubleshooting

Common issues:
- Incorrect or missing URL leading to failed requests.
- Invalid JSON Path causing no data to be extracted.
- Wrong Script Selector when extracting JSON from script tags resulting in empty output.
- Timeouts when the page takes too long to load, or missing data when JavaScript execution is disabled but the page requires it.
- Cache mode misconfiguration causing stale or missing data.
Error messages and resolutions:
- Failed to fetch URL: Check the URL correctness and network connectivity.
- Invalid JSON Path: Verify the JSON Path syntax and ensure the path exists in the JSON structure.
- No JSON found in script tag: Confirm the CSS selector matches the script tag containing JSON.
- Timeout exceeded: Increase the timeout value or optimize the page loading process.
- Authentication errors: Ensure the API key credential is correctly configured and valid.
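
Several of the issues above can be caught before the node runs. The following is a minimal pre-flight check, offered as a sketch only: `check_config` is a hypothetical helper, and treating an empty Script Selector as a problem for "JSON in Script Tag" is an assumption rather than documented node behavior.

```python
from urllib.parse import urlparse


def check_config(url, source_type, script_selector, timeout_ms):
    """Return a list of human-readable problems found in the settings."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("URL must be an absolute http(s) address")
    # Assumption: a selector is needed to locate the script tag.
    if source_type == "JSON in Script Tag" and not script_selector.strip():
        problems.append("Script Selector is empty for 'JSON in Script Tag'")
    if timeout_ms <= 0:
        problems.append("Timeout (MS) must be a positive number")
    return problems
```

An empty list means none of these checks found a problem; it does not guarantee that the extraction itself will succeed.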

Links and References

JSONPath Syntax Guide
JSON-LD Introduction
Crawl4AI Documentation
