Overview
This node integrates with the Firecrawl API to perform website crawling and scraping tasks. It allows users to crawl a specified URL, optionally including or excluding certain paths, and scrape content in various formats such as HTML, Markdown, JSON, or screenshots. The node supports advanced options like ignoring sitemaps, handling query parameters, following external links, and interacting with dynamic page elements before scraping.
Common scenarios where this node is beneficial include:
- Extracting structured data from websites for analysis or automation.
- Monitoring website content changes by crawling and scraping pages.
- Collecting URLs and metadata for SEO audits.
- Automating data collection workflows that require navigating complex web structures.
Practical example: A user wants to crawl their blog site up to 3 levels deep, exclude admin pages, only scrape main content in Markdown format, and take screenshots of each page. This node can be configured with those parameters to automate the process efficiently.
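The example above can be sketched as a request body for Firecrawl's crawl endpoint. Field names follow the Firecrawl v1 API as reflected in the Properties table below; the starting URL and limit are hypothetical, and the exact body the node sends may differ.

```python
# Sketch of a Firecrawl v1 crawl request body for the example:
# crawl 3 levels deep, exclude admin pages, scrape main content
# as Markdown plus a screenshot of each page.
payload = {
    "url": "https://example-blog.com",  # hypothetical starting URL
    "excludePaths": ["admin/*"],        # skip admin pages
    "maxDepth": 3,                      # crawl up to 3 levels deep
    "limit": 100,                       # hypothetical cap on results
    "scrapeOptions": {
        "formats": ["markdown", "screenshot"],
        "onlyMainContent": True,        # drop headers/footers/nav
    },
}
# The node would POST this (with the API key) to /v1/crawl.
```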
Properties
| Name | Meaning |
|---|---|
| Url | The starting URL to begin crawling (e.g., https://firecrawl.dev). |
| Exclude Paths | List of URL path patterns (regex-like) to exclude from crawling (e.g., blog/* excludes all blog articles). |
| Include Paths | List of URL path patterns to include in crawling, filtering out others (e.g., blog/* includes only blog articles). |
| Max Depth | Maximum depth level relative to the starting URL to crawl (minimum 1). |
| Limit | Maximum number of crawl results to return. |
| Crawl Options | Collection of boolean flags controlling crawl behavior: • Ignore Sitemap: skip the sitemap during the crawl. • Ignore Query Params: treat URLs that differ only in query parameters as the same page. • Allow Backward Links: enable navigation to previously linked pages. • Allow External Links: follow links to external domains. |
| Scrape Options | Settings for scraping content during crawl: • Formats: output formats such as HTML, JSON, Markdown, links, raw HTML, screenshot. • Only Main Content: extract only main page content excluding headers/footers. • Include Tags: specify HTML tags to include. • Exclude Tags: specify HTML tags to exclude. • Headers: custom HTTP headers for requests. • Wait For (Ms): delay before scraping page. • Mobile: emulate mobile device. • Skip TLS Verification: ignore TLS certificate errors. • Timeout (Ms): request timeout. • Actions: list of interactions (click, scroll, wait, write, press, screenshot) to perform on page before scraping. • Location: country code and preferred languages for request. • Remove Base64 Images: remove embedded base64 images from output. • Block Ads: enable ad and cookie popup blocking. • Proxy: proxy type to use (Basic or Stealth). |
| Use Custom Body | Whether to send a custom request body instead of the standard one. |
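To illustrate the Actions list under Scrape Options, a sequence that waits, clicks, scrolls, and screenshots before scraping might look like the sketch below. The action types come from the table above; the selector and timing values are hypothetical, and the exact shape of each action should be checked against the Firecrawl API docs.

```python
# Hypothetical actions list: interactions performed on the page
# before scraping. Each dict is one action of a type listed above.
actions = [
    {"type": "wait", "milliseconds": 2000},       # let dynamic content load
    {"type": "click", "selector": "#load-more"},  # hypothetical button selector
    {"type": "scroll", "direction": "down"},      # reveal lazy-loaded content
    {"type": "screenshot"},                       # capture the final state
]
```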
Output
The node outputs JSON data representing the crawl and scrape results. The structure typically includes:
- Metadata about the crawl such as URLs visited, status codes, and crawl depth.
- Scraped content in the requested formats (HTML, Markdown, JSON, etc.).
- Extracted links if requested.
- Screenshots, returned as binary image data, if the screenshot format is selected.
- Additional information depending on the crawl and scrape options set.
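A sketch of walking the crawl output is shown below. The field names (`status`, `data`, `markdown`, `metadata`, `sourceURL`, `statusCode`) are illustrative of the typical shape; consult the API documentation for the exact schema.

```python
# Illustrative crawl result; the real response schema may differ.
result = {
    "status": "completed",
    "data": [
        {
            "markdown": "# Welcome\nHello world",
            "metadata": {"sourceURL": "https://example.com", "statusCode": 200},
        },
    ],
}

# Collect the URLs that were scraped successfully (HTTP 200).
ok_urls = [
    page["metadata"]["sourceURL"]
    for page in result["data"]
    if page["metadata"].get("statusCode") == 200
]
```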
Dependencies
- Requires an API key credential for authenticating with the Firecrawl API.
- Network access to the Firecrawl service endpoint (default: https://api.firecrawl.dev/v1).
- Optional proxy configuration depending on user settings.
- No other external dependencies are required within n8n.
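Authentication against the Firecrawl API is a bearer token sent in the request headers. A minimal sketch, assuming the standard `Authorization: Bearer` scheme (the key shown is a placeholder, not a real credential):

```python
API_KEY = "fc-YOUR_KEY"  # placeholder; substitute your Firecrawl API key
BASE_URL = "https://api.firecrawl.dev/v1"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
# e.g. requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers)
```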
Troubleshooting
- Timeouts: If crawling large sites or slow pages, increase the "Timeout (Ms)" property to avoid premature termination.
- Empty Results: Check "Include Paths" and "Exclude Paths" filters to ensure they are not overly restrictive.
- Authentication Errors: Verify the API key credential is correctly configured and has necessary permissions.
- TLS Errors: Enable "Skip TLS Verification" if connecting to sites with self-signed certificates. Note that this disables certificate validation, so use it only for sites you trust.
- Dynamic Content Not Loaded: Use "Actions" to interact with page elements (e.g., click buttons, wait) before scraping.
- Proxy Issues: Switch between "Basic" and "Stealth" proxy types if requests fail due to network restrictions.
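For intermittent timeouts, a client-side retry with exponential backoff can complement a larger "Timeout (Ms)" value. The helper below is an illustrative sketch, not part of the node; the fake request simulates one failure followed by success.

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on failure.

    Illustrative helper for flaky crawl requests; not part of the node.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * (2 ** attempt))  # back off: 1s, 2s, ...

# Example with a fake request that fails once, then succeeds:
calls = {"n": 0}
def fake_request():
    calls["n"] += 1
    if calls["n"] < 2:
        raise TimeoutError("simulated timeout")
    return "ok"

result = with_retries(fake_request, base_delay=0.01)  # → "ok" after one retry
```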
Links and References
- Firecrawl official website: https://firecrawl.dev
- Firecrawl API documentation: https://docs.firecrawl.dev/api
- MDN Web Docs on Accept-Language header: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Language