# Crawl4AI Plus for n8n
Enhanced fork targeting Crawl4AI v0.8.0 with 8 Basic Crawler operations, 7 Content Extractor operations, streaming crawl, async job submission, and comprehensive browser/session/LLM configuration.
## Project History & Attribution
This is a maintained fork with enhanced features for Crawl4AI 0.8.0.
### Fork Chain
- Original author: Heictor Hsiao — `golfamigo/n8n-nodes-crawl4j`
- First maintainer: Matias Lopez — `qmatiaslopez/n8n-nodes-crawl4j`
- Current maintainer: Max Soukhomlinov — `msoukhomlinov/n8n-nodes-crawl4ai-plus`
All credit for the original implementation goes to Heictor Hsiao and Matias Lopez.
> **Breaking change:** v4.0.0 changes all field names, output shapes, and operation behaviour from v3.x. Existing workflows will need rebuilding. See CHANGELOG.md for full details.
## Features
### Basic Crawler Node (8 operations)
- **Crawl Single URL** — extract content from a single page with full browser and crawler configuration
- **Crawl Multiple URLs** — process multiple pages or use recursive keyword-driven discovery
  - Manual list — comma-separated URLs crawled in parallel
  - Recursive Discovery — BestFirst (recommended), BFS, or DFS strategies with a seed URL and keyword query
- **Crawl Stream** — stream crawl results one item at a time via `/crawl/stream`; each result carries its own timestamp
- **Process Raw HTML** — parse and extract content from raw HTML without a network request
- **Discover Links** — extract, filter, and score all links from a page (internal/external, include/exclude patterns)
- **Submit Crawl Job** — submit an async crawl job to `/crawl/job` and receive a `task_id` for large or long-running crawls; supports webhook callbacks
- **Get Job Status** — poll `/job/{task_id}` to check status; returns full result data when complete
- **Health Check** — query `/monitor/health` and `/monitor/endpoints/stats` to verify server reachability and resource usage
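The filtering behaviour of Discover Links can be conveyed with a small sketch. This is a hypothetical standalone function, not the node's actual implementation; the option names (`include`, `exclude`) are illustrative:

```typescript
// Classify links as internal/external relative to the page origin, then
// apply optional include/exclude substring patterns, as the Discover Links
// operation describes. Illustrative only.
function discoverLinks(
  pageUrl: string,
  hrefs: string[],
  opts: { include?: string[]; exclude?: string[] } = {},
): { internal: string[]; external: string[] } {
  const origin = new URL(pageUrl).origin;
  const keep = (href: string): boolean => {
    if (opts.exclude?.some((p) => href.includes(p))) return false;
    if (opts.include && !opts.include.some((p) => href.includes(p))) return false;
    return true;
  };
  const internal: string[] = [];
  const external: string[] = [];
  for (const href of hrefs) {
    const abs = new URL(href, pageUrl).toString();
    if (!keep(abs)) continue;
    (abs.startsWith(origin) ? internal : external).push(abs);
  }
  return { internal, external };
}
```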
### Content Extractor Node (7 operations)
- **CSS Selector Extractor** — structured extraction using `JsonCssExtractionStrategy` with field-level selectors and attribute extraction
- **LLM Extractor** — AI-powered structured extraction with schema support
  - Input formats: markdown (default), HTML, or `fit_markdown`
  - Schema modes: simple fields or advanced JSON schema
- **JSON Extractor** — extract JSON from direct URLs, embedded `<script>` tags (CSS or XPath selector), or JSON-LD
- **Regex Extractor** — pattern-based extraction with 21 built-in patterns, custom regex, LLM-generated patterns, or quick presets (Contact Info, Financial Data)
- **Cosine Similarity Extractor** — semantic similarity clustering via `CosineStrategy`; requires the `unclecode/crawl4ai:all` Docker image
- **SEO Metadata Extractor** — extract title, meta tags, Open Graph, Twitter Cards, JSON-LD, robots directives, and hreflang tags
- **Submit LLM Job** — submit an async LLM extraction job to `/llm/job` and receive a `task_id`
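For orientation, a `JsonCssExtractionStrategy`-style schema has roughly the shape below. The key names follow Crawl4AI's documented schema format, but treat the exact shape as an assumption and verify against the Crawl4AI docs for your server version:

```typescript
// A CSS extraction schema of the kind the CSS Selector Extractor builds:
// a repeating base selector plus per-field selectors, with optional
// attribute extraction. Shape shown for illustration.
interface CssField {
  name: string;
  selector: string;
  type: 'text' | 'attribute';
  attribute?: string;
}

interface CssSchema {
  name: string;
  baseSelector: string; // matches each repeating element on the page
  fields: CssField[];
}

const productSchema: CssSchema = {
  name: 'products',
  baseSelector: 'div.product-card',
  fields: [
    { name: 'title', selector: 'h2', type: 'text' },
    { name: 'link', selector: 'a', type: 'attribute', attribute: 'href' },
  ],
};
```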
Table extraction is available in the Basic Crawler node via the Table Extraction crawler option (LLM-based or default heuristics).
## Requirements
- n8n: 1.79.1 or higher
- Crawl4AI Docker: 0.8.0
  - Standard operations: `unclecode/crawl4ai:latest`
  - Cosine Similarity Extractor: `unclecode/crawl4ai:all` (includes sentence-transformers)
## Installation
```bash
# Install with pnpm (required — npm/yarn not supported)
pnpm install
pnpm build
```
Then restart your n8n instance. The nodes are declared in `package.json` → `"n8n"` → `"nodes"` and loaded from `dist/`.
## Setup
### Credentials
1. Go to Settings → Credentials → New → Crawl4AI API
2. Configure:
   - **Docker URL** — URL of your Crawl4AI container (default: `http://crawl4ai:11235`)
   - **Authentication** — defaults to No Authentication, which is correct for a standard Docker quickstart deployment. Switch to Token or Basic auth only if your Crawl4AI instance is configured with authentication.
   - **LLM Settings** — enable and configure a provider for AI-powered operations: OpenAI, Anthropic, Groq, Ollama, or a custom LiteLLM endpoint
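Taken together, a filled-in credential looks something like the sketch below. The property names and example values are hypothetical, not the node's internal field names:

```typescript
// Hypothetical shape of the values entered in the Crawl4AI API credential.
// All names and values here are illustrative placeholders.
const credentialExample = {
  dockerUrl: 'http://crawl4ai:11235', // the documented default
  authentication: 'none' as 'none' | 'token' | 'basic',
  llm: {
    enabled: true,
    provider: 'openai', // or anthropic, groq, ollama, custom LiteLLM
    model: 'gpt-4o-mini', // example model name only
  },
};
```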
### Basic Crawler
1. Add Crawl4AI Plus: Basic Crawler to your workflow
2. Select an operation
3. Configure the required fields (shown at top level — no digging through collapsed options for required parameters)
4. Optional browser, crawler, and output options are in expandable collections
### Content Extractor
1. Add Crawl4AI Plus: Content Extractor to your workflow
2. Select an extraction strategy
3. Enter the URL and strategy-specific configuration
4. LLM-based strategies use the provider configured in credentials
## Configuration Reference
### Browser Options
| Option | Description |
|---|---|
| Browser Type | Chromium (default), Firefox, or Webkit |
| Headless Mode | Run browser without a visible window |
| Enable JavaScript | Enable JS execution (required for dynamic pages) |
| Enable Stealth Mode | Hides webdriver properties to bypass bot detection |
| Extra Browser Arguments | Command-line flags passed to the browser process |
| Init Scripts | JavaScript injected before page load (stealth setup) |
| Viewport Width / Height | Browser viewport dimensions |
| Timeout (MS) | Maximum page load wait time |
| User Agent | Override the browser user agent string |
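As a rough sketch, these options map onto a browser configuration payload along the following lines. The snake_case key names mirror Crawl4AI's Python `BrowserConfig` parameters and are an assumption for the REST payload; verify them against your server version:

```typescript
// Illustrative browser configuration assembled from the options above.
// Key names follow Crawl4AI's BrowserConfig convention (an assumption).
const browserConfig = {
  browser_type: 'chromium',       // Browser Type: chromium | firefox | webkit
  headless: true,                 // Headless Mode
  java_script_enabled: true,      // Enable JavaScript (needed for dynamic pages)
  extra_args: ['--disable-gpu'],  // Extra Browser Arguments (example flag)
  viewport_width: 1280,           // Viewport Width
  viewport_height: 720,           // Viewport Height
  user_agent: undefined,          // User Agent override (unset = browser default)
};
```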
### Session & Authentication
| Option | Description |
|---|---|
| Storage State (JSON) | Browser state (cookies, localStorage) as JSON — works on n8n Cloud |
| Cookies | Structured cookie entries for authentication |
| Session ID | Reuse a named browser context across requests |
| Use Managed Browser | Connect to an existing managed browser instance |
| Use Persistent Context | Persist browser profile to disk (self-hosted only) |
| User Data Directory | Path to the browser profile directory |
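Crawl4AI drives Playwright under the hood, so the Storage State (JSON) option typically follows Playwright's storage-state format: an array of cookies plus per-origin `localStorage` entries. Treat the exact shape as an assumption and export it from a real Playwright session where possible:

```typescript
// Example Playwright-style storage state: cookies plus per-origin
// localStorage, suitable for pasting into the Storage State (JSON) option.
const storageState = {
  cookies: [
    {
      name: 'session',
      value: 'abc123',
      domain: 'example.com',
      path: '/',
      expires: -1,           // -1 marks a session cookie
      httpOnly: true,
      secure: true,
      sameSite: 'Lax' as const,
    },
  ],
  origins: [
    {
      origin: 'https://example.com',
      localStorage: [{ name: 'token', value: 'xyz' }],
    },
  ],
};
```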
### Crawler Options
| Option | Description |
|---|---|
| Cache Mode | ENABLED, BYPASS, DISABLED, READ_ONLY, or WRITE_ONLY |
| CSS Selector | Pre-filter page content before extraction |
| JavaScript Code | Execute custom JS on the page before extracting |
| Wait For | CSS selector or JS expression to wait for before extracting |
| Check Robots.txt | Respect the site's robots.txt rules |
| Word Count Threshold | Minimum word count for a content block to be included |
| Exclude External Links | Strip external links from results |
| Preserve HTTPS for Internal Links | Normalise internal link protocols |
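A combined crawler configuration built from these options might look like the sketch below. The snake_case names mirror Crawl4AI's `CrawlerRunConfig` parameters and are an assumption for the REST payload:

```typescript
// Illustrative crawler run configuration drawn from the options table.
// Key names follow Crawl4AI's CrawlerRunConfig convention (an assumption).
const crawlerConfig = {
  cache_mode: 'BYPASS',              // Cache Mode: ENABLED | BYPASS | DISABLED | READ_ONLY | WRITE_ONLY
  css_selector: 'main article',      // pre-filter page content before extraction
  wait_for: 'css:.content-loaded',   // selector/expression to wait for
  check_robots_txt: true,            // respect the site's robots.txt
  word_count_threshold: 10,          // minimum words per content block
  exclude_external_links: true,      // strip external links from results
};
```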
### Deep Crawl Options (Crawl Multiple URLs — Discover mode)
| Option | Description |
|---|---|
| Seed URL | Starting URL for recursive discovery |
| Discovery Query | Keywords that guide which links to follow (required for BestFirst) |
| Strategy | BestFirst (recommended), BFS, or DFS |
| Max Depth | Maximum link depth to follow |
| Max Pages | Maximum number of pages to crawl |
| Score Threshold | Minimum relevance score for BestFirst |
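BestFirst ranks candidate links by their relevance to the Discovery Query and follows the highest-scoring ones first, discarding links below the Score Threshold. A simplified scorer conveying the idea (not Crawl4AI's actual scorer) could look like:

```typescript
// Toy relevance scorer: fraction of query keywords found in the URL.
// BestFirst-style discovery crawls the highest-scoring links first.
function relevanceScore(url: string, query: string): number {
  const keywords = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (keywords.length === 0) return 0;
  const haystack = url.toLowerCase();
  const hits = keywords.filter((k) => haystack.includes(k)).length;
  return hits / keywords.length;
}

// Rank a frontier of candidate links, best first, above a threshold.
function rankFrontier(urls: string[], query: string, threshold = 0): string[] {
  return urls
    .map((u) => ({ u, s: relevanceScore(u, query) }))
    .filter((x) => x.s >= threshold)
    .sort((a, b) => b.s - a.s)
    .map((x) => x.u);
}
```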
### Output Options
| Option | Description |
|---|---|
| Include HTML | Include raw HTML in content.html |
| Include Links | Include links.internal and links.external arrays |
| Include Media | Include images, videos, and audio metadata |
| Screenshot | Capture a screenshot (base64) |
| PDF | Generate a PDF (base64) |
| SSL Certificate | Extract SSL certificate details |
## Output Shape
All operations return a consistent output object:
```json
{
  "domain": "example.com",
  "url": "https://example.com/page",
  "fetchedAt": "2026-02-18T10:00:00.000Z",
  "success": true,
  "statusCode": 200,
  "content": {
    "markdownRaw": "...",
    "markdownFit": "..."
  },
  "extracted": {
    "strategy": "JsonCssExtractionStrategy",
    "json": { ... }
  },
  "links": {
    "internal": [{ "href": "...", "text": "..." }],
    "external": []
  },
  "metrics": {
    "durationMs": 1240
  }
}
```
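The shape above can be written down as a TypeScript interface. This is a sketch derived from the example; treating `extracted` and `links` as optional is an assumption (they plausibly depend on the operation and output options chosen):

```typescript
// Sketch of the output object shown above. Optionality of extracted/links
// is assumed, not documented.
interface CrawlOutput {
  domain: string;
  url: string;
  fetchedAt: string; // ISO 8601 timestamp
  success: boolean;
  statusCode: number;
  content: {
    markdownRaw: string;
    markdownFit: string;
  };
  extracted?: {
    strategy: string;
    json: unknown;
  };
  links?: {
    internal: Array<{ href: string; text: string }>;
    external: Array<{ href: string; text: string }>;
  };
  metrics: {
    durationMs: number;
  };
}
```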
## Async Job Workflow
For large or long-running crawls, use the async pattern:
1. **Submit Crawl Job** → returns `task_id`
2. **Get Job Status** (poll with `task_id`) → returns `status: pending | processing | completed | failed`
3. When `completed`, result fields are returned directly at top level alongside `task_id` and `status`
Webhook callbacks are supported in Submit Crawl Job for push-based notification when the job finishes.
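The submit-then-poll pattern can be sketched as a generic helper. `getStatus` is an injected stand-in for the `/job/{task_id}` call, so this runs without a server; it is not the node's actual implementation:

```typescript
// Generic poll-until-terminal helper mirroring the Submit Crawl Job /
// Get Job Status pattern. getStatus stands in for the /job/{task_id} call.
type JobStatus = 'pending' | 'processing' | 'completed' | 'failed';

async function pollJob(
  getStatus: (taskId: string) => Promise<{ status: JobStatus; result?: unknown }>,
  taskId: string,
  { intervalMs = 2000, maxAttempts = 30 } = {},
): Promise<unknown> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const { status, result } = await getStatus(taskId);
    if (status === 'completed') return result; // result fields arrive at top level
    if (status === 'failed') throw new Error(`Job ${taskId} failed`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${taskId} did not finish within ${maxAttempts} attempts`);
}
```

In a real workflow the webhook callback is usually preferable to polling, since the server pushes a notification when the job finishes.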
## Project Structure
```
nodes/
├── Crawl4aiPlusBasicCrawler/
│   ├── Crawl4aiPlusBasicCrawler.node.ts
│   ├── crawl4aiplus.svg
│   ├── actions/
│   │   ├── operations.ts                    # Operation list and UI aggregation
│   │   ├── router.ts                        # Dispatch to operation execute()
│   │   ├── crawlSingleUrl.operation.ts
│   │   ├── crawlMultipleUrls.operation.ts   # Manual list + recursive discovery
│   │   ├── crawlStream.operation.ts         # Streaming crawl via /crawl/stream
│   │   ├── processRawHtml.operation.ts
│   │   ├── discoverLinks.operation.ts
│   │   ├── submitCrawlJob.operation.ts      # Async job submission
│   │   ├── getJobStatus.operation.ts        # Async job polling
│   │   └── healthCheck.operation.ts
│   └── helpers/
│       ├── interfaces.ts
│       ├── utils.ts                         # createBrowserConfig, createCrawlerRunConfig, buildLlmConfig, etc.
│       ├── apiClient.ts                     # Crawl4aiClient — all HTTP calls
│       └── formatters.ts                    # formatCrawlResult, formatExtractionResult
│
└── Crawl4aiPlusContentExtractor/
    ├── Crawl4aiPlusContentExtractor.node.ts
    ├── crawl4aiplus.svg
    ├── actions/
    │   ├── operations.ts
    │   ├── router.ts
    │   ├── cssExtractor.operation.ts
    │   ├── llmExtractor.operation.ts
    │   ├── jsonExtractor.operation.ts
    │   ├── regexExtractor.operation.ts
    │   ├── cosineExtractor.operation.ts
    │   ├── seoExtractor.operation.ts
    │   └── submitLlmJob.operation.ts        # Async LLM job submission
    └── helpers/
        ├── interfaces.ts
        └── utils.ts                         # Re-exports from BasicCrawler + extractor-specific helpers
credentials/
└── Crawl4aiApi.credentials.ts               # Docker URL, auth, LLM provider config
```
## Version History
See CHANGELOG.md for detailed version history and breaking changes.
## License
MIT