Overview
The node "Crawl4AI: Content Extractor" is designed to extract structured JSON content from web pages. It supports fetching JSON data directly from URLs or extracting JSON embedded within HTML pages, such as inside <script> tags or JSON-LD structured data. This node is useful for scenarios where you want to scrape or gather structured data from websites that expose their data in JSON format or embed it within their HTML.
Practical examples include:
- Extracting product listings or details from e-commerce sites that provide JSON APIs or embed product data in script tags.
- Gathering event or article metadata from web pages using JSON-LD structured data.
- Automating data collection workflows by pulling JSON data from various web sources without manual parsing.
Properties
| Name | Meaning |
|---|---|
| URL | The URL of the JSON data source to extract from. |
| JSON Path | The path within the JSON response to extract specific data (e.g., data.items). Leave empty to extract the entire JSON response. |
| Source Type | Where to find the JSON data on the page. Options: • Direct JSON URL — URL returns JSON directly. • JSON in Script Tag — JSON is embedded inside a <script> tag. • JSON-LD — JSON-LD structured data on the page. |
| Script Selector | CSS selector to identify the <script> tag containing the JSON data when "Source Type" is set to "JSON in Script Tag". |
| Browser Options | Collection of options controlling browser behavior during extraction: • Headless Mode — Run browser without UI. • Enable JavaScript — Allow JS execution on page. • Timeout (MS) — Max wait time for page load. • JavaScript Code — Custom JS to run before extraction (e.g., scrolling). |
| Options | Additional options: • Cache Mode — How to use caching: Enabled (read/write), Bypass (force fresh fetch), Only (read only). • Include Full Content — Whether to include full JSON content along with extracted data. • Headers — HTTP headers to send with the request in JSON format. |
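To make the "JSON in Script Tag" and "JSON-LD" source types concrete, here is a minimal sketch of how embedded JSON can be pulled out of an HTML page using only the Python standard library. It is not the node's implementation: the class `ScriptJSONCollector`, the function `extract_embedded_json`, and the sample HTML are hypothetical, and plain attribute matching stands in for the CSS selector that the "Script Selector" property accepts.

```python
import json
from html.parser import HTMLParser


class ScriptJSONCollector(HTMLParser):
    """Collect the text of <script> tags whose attributes match a filter.

    Hypothetical helper: attribute matching is a simplified stand-in
    for a real CSS selector.
    """

    def __init__(self, wanted_attrs):
        super().__init__()
        self.wanted_attrs = wanted_attrs  # e.g. {"type": "application/ld+json"}
        self._capturing = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attr_map = dict(attrs)
        if all(attr_map.get(k) == v for k, v in self.wanted_attrs.items()):
            self._capturing = True
            self._buffer = []

    def handle_data(self, data):
        if self._capturing:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._capturing:
            self._capturing = False
            self.payloads.append("".join(self._buffer))


def extract_embedded_json(html, wanted_attrs):
    """Return every matching script body that parses as JSON."""
    collector = ScriptJSONCollector(wanted_attrs)
    collector.feed(html)
    collector.close()
    results = []
    for raw in collector.payloads:
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip script bodies that are not valid JSON
    return results


# Made-up page with one JSON-LD block and one plain JSON script tag.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "Event", "name": "Launch Day"}</script>
<script id="product-data" type="application/json">{"sku": "A1", "price": 19.5}</script>
</head><body></body></html>
"""

json_ld = extract_embedded_json(SAMPLE_HTML, {"type": "application/ld+json"})
product = extract_embedded_json(SAMPLE_HTML, {"id": "product-data"})
print(json_ld)
print(product)
```

In this sketch, matching on `type="application/ld+json"` corresponds to the JSON-LD source type, while matching on an `id` plays the role that a selector such as `script#product-data` would play in the "Script Selector" property.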
Output
The node outputs JSON data extracted from the specified URL or page. The output structure depends on the JSON Path property:
- If a JSON Path is provided, the output contains the subset of JSON data at that path.
- If no JSON Path is given, the entire JSON response or extracted JSON object is returned.
- If the "Include Full Content" option is enabled, the full fetched JSON content is included alongside the extracted data.
The node does not appear to handle binary data; its output is assumed to be JSON only.
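The output rules above can be sketched as follows. This is an illustration under stated assumptions, not the node's actual code: the functions `resolve_path` and `build_output` and the output keys `data` and `fullContent` are hypothetical, and the path syntax is assumed to be the simple dot notation shown in the `data.items` example, with numeric segments used as list indexes.

```python
def resolve_path(document, path):
    """Walk a dot-separated path such as 'data.items' through parsed JSON.

    An empty path returns the whole document. Numeric segments index
    into lists (an assumption; the node's real syntax may differ).
    """
    if not path:
        return document
    current = document
    for segment in path.split("."):
        if isinstance(current, list):
            current = current[int(segment)]
        else:
            current = current[segment]
    return current


def build_output(document, path="", include_full_content=False):
    """Assemble a result object; the key names here are hypothetical."""
    output = {"data": resolve_path(document, path)}
    if include_full_content:
        output["fullContent"] = document
    return output


# Made-up response used only for illustration.
sample = {"data": {"items": [{"id": 1}, {"id": 2}]}, "total": 2}

subset = build_output(sample, "data.items")
whole = build_output(sample)
both = build_output(sample, "data.items.0", include_full_content=True)
print(subset)
print(both)
```

Here `subset` holds only the list found at `data.items`, `whole` holds the entire document because no path was given, and `both` carries the first item together with the full content.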
Dependencies
- Requires an API key credential for the Crawl4AI service to perform the extraction.
- Optionally uses a headless browser to load pages and execute JavaScript when needed.
- Supports custom HTTP headers and caching strategies which may require proper configuration depending on the target site.
- No other external dependencies are indicated.
Troubleshooting
Common issues:
- An incorrect or missing URL causes the fetch to fail.
- Invalid JSON Path may result in empty or incorrect extraction results.
- If JSON is embedded in script tags, an incorrect CSS selector will fail to locate the JSON.
- Pages requiring JavaScript to render JSON data need "Enable JavaScript" enabled; otherwise, extraction may fail.
- A timeout that is too short may cause incomplete page loads and extraction errors.
Error messages:
- Errors related to network requests or invalid responses typically indicate connectivity or URL issues.
- Parsing errors suggest malformed JSON or incorrect JSON Path usage.
- Authentication errors occur if the required API key credential is missing or invalid.
Resolutions:
- Verify URL correctness and accessibility.
- Test JSON Path expressions separately to ensure they match the expected JSON structure.
- Adjust browser options like enabling JavaScript or increasing timeout for dynamic pages.
- Ensure valid API credentials are configured in n8n.
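One way to test a JSON Path separately is to walk it against a saved copy of the response and report where the lookup first fails. The sketch below assumes the simple dot notation from the `data.items` example; the function `find_broken_segment` is hypothetical and is not part of the node or of any JSONPath library.

```python
def find_broken_segment(document, path):
    """Return None if the path resolves, else the dotted prefix where it fails."""
    if not path:
        return None  # an empty path always matches the whole document
    current = document
    walked = []
    for segment in path.split("."):
        walked.append(segment)
        if isinstance(current, dict) and segment in current:
            current = current[segment]
        elif (
            isinstance(current, list)
            and segment.isdigit()
            and int(segment) < len(current)
        ):
            current = current[int(segment)]
        else:
            return ".".join(walked)
    return None


# Made-up response used only for illustration.
saved_response = {"data": {"items": [{"id": 1}]}}

for candidate in ["data.items", "data.item", "data.items.3"]:
    broken_at = find_broken_segment(saved_response, candidate)
    status = "ok" if broken_at is None else f"fails at '{broken_at}'"
    print(f"{candidate}: {status}")
```

A result pointing at a specific prefix, such as `data.item`, usually means a misspelled key or an index beyond the end of a list, which matches the "empty or incorrect extraction results" symptom listed under common issues.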
Links and References
- JSONPath Syntax Reference
- JSON-LD Introduction
- Crawl4AI Documentation (general)