Firecrawl

Get data from Firecrawl API

Overview

The Firecrawl node enables crawling and scraping of websites using the Firecrawl API. It is designed to fetch structured data, content, or metadata from web pages by following links and extracting information according to user-defined parameters. This node is useful for scenarios such as:

  • Extracting blog posts, product listings, or news articles from a website.
  • Monitoring changes on web pages with change tracking.
  • Capturing screenshots of web pages for visual records.
  • Collecting links or summaries from a site for analysis.

For example, you can crawl a news website to gather the latest headlines in markdown format, or scrape an e-commerce site to extract product details while excluding certain paths like user reviews.
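As a rough illustration of the second scenario, the request for an e-commerce crawl might carry a body like the one built below. This is a sketch under stated assumptions, not the node's actual implementation: the helper `build_crawl_body` and the shop URL are hypothetical, and the camelCase field names (`excludePaths`, `scrapeOptions`, `onlyMainContent`) are assumptions inferred from the property names, so check them against the Firecrawl API reference before relying on them.

```python
import json


def build_crawl_body(url, limit=50, exclude_paths=None, formats=None):
    """Assemble a crawl request body (hypothetical helper, assumed field names)."""
    body = {
        "url": url,
        "limit": limit,
        "scrapeOptions": {
            # Default to markdown output when no formats are given.
            "formats": formats or ["markdown"],
            "onlyMainContent": True,
        },
    }
    # Only add the exclusion list when the caller supplied one.
    if exclude_paths:
        body["excludePaths"] = list(exclude_paths)
    return body


body = build_crawl_body(
    "https://shop.example.com",  # placeholder URL for illustration
    limit=100,
    exclude_paths=["reviews/.*"],
)
print(json.dumps(body, indent=2))
```

Keeping the body in one small function makes it easy to compare against what the node sends when "Use Custom Body" is enabled.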

Properties

Name: Meaning
Url: The starting URL of the crawl (e.g., https://firecrawl.dev).
Prompt: A prompt string used during the crawl to guide content extraction or summarization.
Limit: Maximum number of results (pages or items) to return from the crawl.
Delay: Delay in milliseconds between requests, to avoid overloading the target server.
Max Concurrency: Maximum number of concurrent page scrapes allowed during the crawl.
Exclude Paths: List of URL path patterns (regex-like) to exclude from the crawl (e.g., "blog/*" excludes all blog subpaths).
Include Paths: List of URL path patterns to include; when set, only matching paths are crawled.
Crawl Options: Collection of boolean options controlling crawl behavior:
  - Ignore Sitemap: whether to ignore sitemap.xml.
  - Ignore Query Params: treat URLs that differ only in their query parameters as the same page.
  - Allow External Links: follow links to external domains.
  - Allow Subdomains: follow links to subdomains of the main domain.
Scrape Options: Options for scraping content during the crawl, including output formats and actions:
  - Formats: output types such as markdown, html, json, links, screenshot, summary, change tracking.
  - Actions: interactions to perform before scraping, such as click, scroll, wait, write, press, screenshot.
  - Only Main Content: whether to extract only the main page content, excluding headers and footers.
  - Include Tags / Exclude Tags: HTML tags to include in or exclude from the output.
  - Location: country and language preferences for the request.
  - Remove Base64 Images: remove embedded base64 images from the output.
  - Block Ads: enable ad blocking and cookie popup blocking.
  - Store In Cache: whether to cache the page in the Firecrawl index.
  - Proxy: type of proxy to use ("Basic" or "Stealth").
Headers: Custom HTTP headers to send with each request (key-value pairs).
Wait For (Ms): Time in milliseconds to wait for the page to load before scraping content.
Mobile: Whether to emulate a mobile device when scraping.
Skip TLS Verification: Whether to skip TLS certificate verification for HTTPS requests.
Timeout (Ms): Request timeout in milliseconds.
Additional Fields: Custom JSON properties to add to the request body for advanced or custom API usage.
Use Custom Body: Option to provide a fully custom request body instead of using the standard parameters.
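To make the interaction between Include Paths and Exclude Paths concrete, the sketch below filters a list of URLs locally. It assumes the patterns are regular expressions searched within the URL path and that exclusions are checked before inclusions; the actual matching is done by the Firecrawl service and may differ. The function `path_allowed` is a hypothetical helper, not part of the node.

```python
import re
from urllib.parse import urlparse


def path_allowed(url, include_paths=(), exclude_paths=()):
    """Return True if the URL's path passes the include/exclude patterns.

    Assumption: each pattern is a regular expression searched within the
    path (leading slash stripped), with exclusions checked first.
    """
    path = urlparse(url).path.lstrip("/")
    if any(re.search(pattern, path) for pattern in exclude_paths):
        return False
    if include_paths:
        # With an include list, only matching paths are kept.
        return any(re.search(pattern, path) for pattern in include_paths)
    return True


urls = [
    "https://example.com/blog/post-1",
    "https://example.com/products/42",
    "https://example.com/products/42/reviews",
]
kept = [
    u for u in urls
    if path_allowed(u, include_paths=["^products/"], exclude_paths=["reviews"])
]
print(kept)
```

Running candidate patterns against a few known URLs like this is a quick way to check that they do not filter out every page.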

Output

The node outputs JSON data representing the results of the crawl and scrape operation. The structure typically includes:

  • Extracted content in requested formats (markdown, html, json, etc.).
  • Metadata about each crawled page such as URL, status, timestamps.
  • Change tracking data if enabled.
  • Screenshots as binary data if requested (image files).
  • Lists of links or summaries depending on the scrape options.

If screenshots are included, the node outputs binary data representing the image files captured during the crawl.
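A downstream step might reduce the JSON output to one summary row per page. The sketch below assumes a shape the text above does not spell out: a top-level `data` list whose entries carry a `markdown` string and a `metadata` object with `sourceURL` and `statusCode` keys. Treat these key names as assumptions and adjust them to the output you actually receive.

```python
def summarize_pages(result):
    """Collect URL, status, and markdown length for each crawled page.

    Assumption: pages live under "data", each with optional "markdown"
    and a "metadata" dict holding "sourceURL" and "statusCode".
    """
    rows = []
    for page in result.get("data", []):
        meta = page.get("metadata", {})
        rows.append({
            "url": meta.get("sourceURL"),
            "status": meta.get("statusCode"),
            # Pages without markdown count as zero characters.
            "markdown_chars": len(page.get("markdown") or ""),
        })
    return rows


# Hand-written sample data for illustration only.
sample = {
    "data": [
        {"markdown": "# Hello",
         "metadata": {"sourceURL": "https://example.com/", "statusCode": 200}},
        {"metadata": {"sourceURL": "https://example.com/missing", "statusCode": 404}},
    ]
}
rows = summarize_pages(sample)
for row in rows:
    print(row)
```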

Dependencies

  • Requires an API key credential for authenticating with the Firecrawl API.
  • Network access to the Firecrawl API endpoint (default https://api.firecrawl.dev/v2).
  • Optional proxy configuration supported via node properties.
  • No other external dependencies are required within n8n.
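For testing the credential and endpoint outside n8n, a request can be assembled with the Python standard library. The sketch below only builds the request and does not send it. The `/crawl` path, the Bearer authorization scheme, and the placeholder key are assumptions for illustration; `make_crawl_request` is a hypothetical helper.

```python
import json
import urllib.request

API_BASE = "https://api.firecrawl.dev/v2"  # default endpoint listed above


def make_crawl_request(api_key, body, base_url=API_BASE):
    """Build (but do not send) a POST request to an assumed /crawl path."""
    return urllib.request.Request(
        url=f"{base_url}/crawl",
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumption: the API key is passed as a Bearer token.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = make_crawl_request("fc-YOUR-API-KEY", {"url": "https://firecrawl.dev"})
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out here so the example runs without network access or a real key.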

Troubleshooting

  • Timeouts: If requests time out, consider increasing the "Timeout (Ms)" property or reducing concurrency.
  • Empty Results: Check that the URL is correct and accessible; verify include/exclude path patterns do not filter out all pages.
  • Authentication Errors: Ensure the API key credential is valid and has proper permissions.
  • TLS Errors: If encountering TLS certificate errors, enable "Skip TLS Verification" cautiously.
  • Rate Limits: When crawling large sites, increase "Delay" and lower "Max Concurrency" to avoid being blocked.
  • Incorrect Output Format: Verify the selected scrape formats and prompts match the expected output schema.
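The timeout advice above (raise the timeout, reduce concurrency) can be automated in a script that drives crawls. The following is a minimal sketch: `run_with_backoff` and the settings keys `timeout_ms` and `max_concurrency` are hypothetical names, and `fake_crawl` is a stand-in for a real crawl call.

```python
def run_with_backoff(crawl, settings, max_attempts=3):
    """Retry `crawl(settings)` on TimeoutError with gentler settings each time."""
    current = dict(settings)  # work on a copy so the caller's dict is untouched
    for attempt in range(1, max_attempts + 1):
        try:
            return crawl(current)
        except TimeoutError:
            if attempt == max_attempts:
                raise
            # Double the timeout and halve concurrency (never below 1).
            current["timeout_ms"] *= 2
            current["max_concurrency"] = max(1, current["max_concurrency"] // 2)


def fake_crawl(settings):
    """Stand-in crawl that succeeds only once the timeout is generous enough."""
    if settings["timeout_ms"] < 60000:
        raise TimeoutError("simulated timeout")
    return settings


initial = {"timeout_ms": 30000, "max_concurrency": 8}
final = run_with_backoff(fake_crawl, initial)
print(final)
```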
