Crawl4AI: Content Extractor

Extract structured content from web pages using Crawl4AI

Overview

The "Crawl4AI: Content Extractor" node extracts structured JSON content from web pages. It can read JSON returned directly by a URL or JSON embedded in an HTML page, including JSON inside <script> tags and JSON-LD structured data. It is useful when a website exposes its data as JSON or embeds that data in its HTML.

Practical examples include:

  • Extracting product information from e-commerce sites that provide JSON data in script tags.
  • Gathering metadata or structured data from news articles using JSON-LD.
  • Fetching API-like JSON responses directly from URLs for further processing.
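As a rough illustration of the first two examples above, the sketch below pulls script-tag JSON and JSON-LD out of an HTML string using only the Python standard library. It is not the node's implementation: the class `ScriptJsonCollector`, the two helper functions, and the sample HTML are all hypothetical, and the script tag is matched by its `id` attribute rather than by a full CSS selector.

```python
import json
from html.parser import HTMLParser


class ScriptJsonCollector(HTMLParser):
    """Collect the attributes and text of every <script> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._attrs = None   # attributes of the <script> tag currently open
        self._chunks = []    # text pieces seen inside that tag
        self.scripts = []    # list of (attrs_dict, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._attrs = dict(attrs)
            self._chunks = []

    def handle_data(self, data):
        if self._attrs is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._attrs is not None:
            self.scripts.append((self._attrs, "".join(self._chunks)))
            self._attrs = None


def collect_scripts(html):
    parser = ScriptJsonCollector()
    parser.feed(html)
    parser.close()
    return parser.scripts


def extract_json_ld(html):
    """Return every JSON-LD block on the page that parses as JSON."""
    results = []
    for attrs, text in collect_scripts(html):
        if (attrs.get("type") or "").lower() == "application/ld+json":
            try:
                results.append(json.loads(text))
            except json.JSONDecodeError:
                continue  # skip malformed blocks instead of failing the page
    return results


def extract_script_json_by_id(html, script_id):
    """Return the JSON inside <script id="..."> or None if it is absent or invalid."""
    for attrs, text in collect_scripts(html):
        if attrs.get("id") == script_id:
            try:
                return json.loads(text)
            except json.JSONDecodeError:
                return None
    return None


# Made-up page used only for demonstration.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "NewsArticle", "headline": "Example"}</script>
<script id="product-data" type="application/json">{"sku": "A-1", "price": 9.5}</script>
</head><body></body></html>
"""

articles = extract_json_ld(SAMPLE_HTML)
product = extract_script_json_by_id(SAMPLE_HTML, "product-data")
```

A page that builds its data with JavaScript would not expose these tags in the raw HTML, which is the case the browser options are meant to cover.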

Properties

Name | Meaning
URL | The URL of the JSON data to extract. This is the web address from which the JSON content will be fetched.
JSON Path | The path within the JSON structure to extract specific data. If left empty, the entire JSON response is returned.
Source Type | Specifies where to find the JSON data on the page. Options are: Direct JSON URL (the URL returns JSON directly); JSON in Script Tag (JSON is embedded inside a <script> tag); JSON-LD (JSON-LD structured data).
Script Selector | CSS selector to identify the <script> tag containing the JSON data when "Source Type" is set to "JSON in Script Tag".
Browser Options | Collection of options controlling browser behavior during extraction: Headless Mode (run browser without UI); Enable JavaScript (allow JS execution on the page); Timeout (MS) (max wait time for page load); JavaScript Code (custom JS code to run before extraction, e.g., scrolling).
Options | Additional options: Cache Mode (controls caching behavior with options to enable, bypass, or only use cache); Include Full Content (whether to include the full JSON content along with extracted data); Headers (HTTP headers to send with the request, formatted as JSON string).
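The JSON Path behavior described above can be pictured with a small sketch. The page does not state which path syntax the node accepts, so the dot-separated form used here (with numeric segments indexing into lists) is an assumption, and `resolve_path` is a hypothetical helper rather than part of the node.

```python
def resolve_path(document, path):
    """Walk a parsed JSON document along a dot-separated path.

    Assumed syntax (not confirmed by the node's documentation):
    "products.1.name" means document["products"][1]["name"].
    An empty path returns the whole document, mirroring the
    "If left empty" rule for the JSON Path property.
    """
    if not path:
        return document

    current = document
    for segment in path.split("."):
        try:
            if isinstance(current, list):
                current = current[int(segment)]
            elif isinstance(current, dict):
                current = current[segment]
            else:
                raise KeyError(segment)  # cannot descend into a scalar
        except (KeyError, IndexError, ValueError) as exc:
            raise LookupError(f"path segment {segment!r} not found") from exc
    return current


# Made-up response used only for demonstration.
response = {"products": [{"name": "Lamp"}, {"name": "Mug"}]}

everything = resolve_path(response, "")
second_name = resolve_path(response, "products.1.name")
```

Raising one error type for every failed lookup keeps the "path does not exist" case easy to tell apart from a fetch or parse failure.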

Output

The node outputs JSON data extracted from the specified source. The output json field contains either the entire JSON response or the subset defined by the JSON Path property. If "Include Full Content" is enabled, the full JSON content is also included alongside the extracted data.
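The effect of "Include Full Content" on the output can be sketched as follows. Only the outer `json` field comes from the description above; the inner keys `data` and `fullContent` are placeholder names chosen for illustration and may not match what the node actually emits.

```python
def build_output(full_json, extracted, include_full_content=False):
    """Assemble one output item (hypothetical key names, see the note above)."""
    item = {"json": {"data": extracted}}
    if include_full_content:
        # The full response travels alongside the extracted subset.
        item["json"]["fullContent"] = full_json
    return item


# Made-up values used only for demonstration.
full = {"article": {"title": "Example", "tags": ["a", "b"]}}
subset = full["article"]["title"]

compact_item = build_output(full, subset)
verbose_item = build_output(full, subset, include_full_content=True)
```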

This node focuses on JSON extraction and does not produce binary output.

Dependencies

  • Requires an API key credential for the Crawl4AI service to perform the extraction.
  • Optionally uses a headless browser environment to load pages and execute JavaScript.
  • Requires network access to fetch the target URLs.
  • Proper configuration of HTTP headers and cache settings can affect performance and results.

Troubleshooting

  • Common issues:

    • Incorrect or missing URL leading to failed requests.
    • Invalid JSON Path causing no data to be extracted.
    • Wrong Script Selector when extracting JSON from script tags resulting in empty output.
    • Timeouts if the page takes too long to load, or missing data if JavaScript execution is disabled but the page needs it.
    • Cache mode misconfiguration causing stale or missing data.
  • Error messages and resolutions:

    • Failed to fetch URL: Check the URL correctness and network connectivity.
    • Invalid JSON Path: Verify the JSON Path syntax and ensure the path exists in the JSON structure.
    • No JSON found in script tag: Confirm the CSS selector matches the script tag containing JSON.
    • Timeout exceeded: Increase the timeout value or optimize the page loading process.
    • Authentication errors: Ensure the API key credential is correctly configured and valid.
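Several of the issues above can be caught before a run. The sketch below is one possible pre-flight check written with the Python standard library; `check_inputs` and its messages are hypothetical and are not the node's own validation or error text.

```python
import json
from urllib.parse import urlparse


def check_inputs(url, headers_json, timeout_ms):
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []

    # The URL must be absolute so the request has a scheme and a host.
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("URL must be an absolute http(s) address")

    # The Headers option is described as a JSON string, so it must parse to an object.
    if headers_json.strip():
        try:
            headers = json.loads(headers_json)
        except json.JSONDecodeError as exc:
            problems.append(f"Headers is not valid JSON: {exc.msg}")
        else:
            if not isinstance(headers, dict):
                problems.append("Headers must be a JSON object")

    # A zero or negative timeout can never be satisfied.
    if not isinstance(timeout_ms, int) or timeout_ms <= 0:
        problems.append("Timeout (MS) must be a positive integer")

    return problems


# Made-up inputs used only for demonstration.
ok = check_inputs("https://example.com/data.json", '{"Accept": "application/json"}', 30000)
bad = check_inputs("example.com/data.json", "{not json", 0)
```

Here `ok` is an empty list, while `bad` reports one problem each for the URL, the headers, and the timeout.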
