Firecrawl

Get data from Firecrawl API

Overview

The Firecrawl node crawls and scrapes websites using the Firecrawl API. Starting from a specified URL, it programmatically navigates through web pages, extracts their content, and returns structured data. This makes it useful for web data extraction, SEO analysis, content aggregation, and monitoring website changes.

For example, you can crawl a blog to collect article content in Markdown, exclude paths such as admin pages, limit the crawl depth to avoid excessive requests, and customize the scrape options to return only the main content or specific HTML tags.
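
A minimal sketch of how the blog-crawl scenario above could be configured. The field names mirror the Properties section below; the node's exact internal parameter schema is an assumption:

    // Hypothetical parameter sketch for the blog example.
    // Names mirror the Properties section, not the node's internal schema.
    const parameters = {
      url: "https://firecrawl.dev",
      includePaths: ["blog/*"],   // crawl only blog articles
      excludePaths: ["admin/*"],  // skip admin pages
      maxDepth: 2,                // follow links at most two levels deep
      limit: 50,                  // return at most 50 pages
      scrapeOptions: {
        formats: ["markdown"],    // collect article content as Markdown
        onlyMainContent: true,    // drop headers and footers
      },
    };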

Properties

Url: The starting URL to begin crawling from (e.g., https://firecrawl.dev).
Exclude Paths: URL path patterns (regex-like) to exclude from the crawl (e.g., blog/* excludes all blog articles).
Include Paths: URL path patterns to include in the crawl, filtering out all others (e.g., blog/* includes only blog articles).
Max Depth: Maximum crawl depth relative to the starting URL (minimum 1). Controls how far links are followed.
Limit: Maximum number of results (pages) to return from the crawl.
Crawl Options: Boolean flags controlling crawl behavior:
  • Ignore Sitemap: skip the site's sitemap when discovering pages.
  • Ignore Query Params: ignore query parameters when comparing URLs, so pages that differ only in their query string are treated as the same page.
  • Allow Backward Links: allow navigation to previously linked pages.
  • Allow External Links: follow links to external domains.
Scrape Options: Settings for scraping page content during the crawl:
  • Formats: output formats such as HTML, JSON, Markdown, links, raw HTML, or screenshot.
  • Only Main Content: extract only the main content, excluding headers and footers.
  • Include Tags: HTML tags to include.
  • Exclude Tags: HTML tags to exclude.
  • Headers: HTTP headers to send with requests.
  • Wait For (Ms): delay before scraping each page.
  • Mobile: emulate a mobile device.
  • Skip TLS Verification: ignore TLS certificate errors.
  • Timeout (Ms): request timeout.
  • Actions: interactions (click, press keys, scroll, wait, write text, take screenshots) to perform on dynamic content before scraping; see the request-body sketch after this list.
  • Location: country code and preferred languages for the request.
  • Remove Base64 Images: remove embedded base64 images from the output.
  • Block Ads: enable ad blocking and cookie-popup blocking.
  • Proxy: type of proxy to use (Basic or Stealth).
Use Custom Body: Whether to send a custom request body instead of the one the node builds from the properties above.
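
When Use Custom Body is enabled, you supply the request body yourself; otherwise the node builds one from the properties above. As a rough illustration, a body in the style of Firecrawl's v1 crawl request might look like the sketch below; the exact key names and action shapes are assumptions, so check the Firecrawl API reference for the authoritative schema:

    // Sketch of a custom crawl request body (assumed Firecrawl v1 style).
    const body = {
      url: "https://firecrawl.dev",
      excludePaths: ["admin/*"],
      maxDepth: 2,
      limit: 25,
      allowExternalLinks: false,
      scrapeOptions: {
        formats: ["markdown", "screenshot"],
        onlyMainContent: true,
        waitFor: 1000,   // ms to wait before scraping each page
        timeout: 30000,  // per-request timeout in ms
        actions: [
          { type: "wait", milliseconds: 2000 },      // let dynamic content load
          { type: "click", selector: "#load-more" }, // hypothetical selector
          { type: "screenshot" },                    // capture the final state
        ],
      },
    };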

Output

The node outputs JSON data representing the results of the crawl and scrape operation. Each item typically describes one crawled page: its URL, the extracted content in the requested formats (e.g., Markdown, HTML), any links found, and optionally a screenshot.

If binary output such as screenshots is requested, the node attaches the binary data to the corresponding items.
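
For orientation, a single output item might look roughly like the sketch below. The field names are an assumption based on Firecrawl's v1 crawl response, and the actual fields depend on the formats you requested:

    // Assumed shape of one crawled page in the node's JSON output.
    const item = {
      markdown: "# Post title\n\nPost body…",
      links: ["https://firecrawl.dev/blog/another-post"],
      metadata: {
        title: "Post title",
        sourceURL: "https://firecrawl.dev/blog/post",
        statusCode: 200,
      },
    };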

Dependencies

  • Requires an API key credential for authenticating with the Firecrawl API (see the connectivity sketch after this list).
  • Network access to the Firecrawl API endpoint at https://api.firecrawl.dev/v1.
  • Optional proxy configuration, depending on the Proxy scrape option.
  • No other external dependencies are required within n8n.
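
If authentication or connectivity is in doubt, it can help to test the credential and network path directly, outside of n8n. This sketch assumes Firecrawl's standard Bearer-token scheme:

    // Minimal auth/connectivity check (Node 18+, global fetch).
    // Replace fc-YOUR-API-KEY with your actual key.
    const res = await fetch("https://api.firecrawl.dev/v1/crawl", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: "Bearer fc-YOUR-API-KEY",
      },
      body: JSON.stringify({ url: "https://firecrawl.dev", limit: 1 }),
    });
    console.log(res.status); // 401/403 points to the credential, not the network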

Troubleshooting

  • Common issues:

    • An invalid or missing API key credential will cause authentication failures.
    • An incorrectly formatted URL may result in request errors.
    • Overly restrictive Include Paths/Exclude Paths patterns might yield no results.
    • Very high Max Depth or Limit values can lead to long execution times or API rate limiting.
    • Dynamic content that requires interaction may not be scraped correctly unless appropriate Actions are configured.
  • Error messages:

    • Authentication errors indicate invalid or missing API credentials; verify and re-enter the API key.
    • Timeout errors suggest increasing the Timeout (Ms) property or checking network connectivity; for long-running crawls, see the polling sketch after this list.
    • Validation errors on input properties usually mean required fields are missing or contain invalid values.
    • If no results are returned, check the Url, the include/exclude path patterns, and the crawl options.
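
For long-running crawls, note that Firecrawl's crawl endpoint works asynchronously: the initial POST returns a job id that is then polled until the crawl completes (this description is based on Firecrawl's documented v1 behavior and is worth verifying for your API version). A rough polling sketch for debugging outside the node:

    // Hypothetical polling loop for an asynchronous crawl job.
    async function waitForCrawl(jobId: string, apiKey: string) {
      for (let attempt = 0; attempt < 60; attempt++) {
        const res = await fetch(`https://api.firecrawl.dev/v1/crawl/${jobId}`, {
          headers: { Authorization: `Bearer ${apiKey}` },
        });
        const job = await res.json();
        if (job.status === "completed") return job.data; // scraped pages
        if (job.status === "failed") throw new Error("Crawl failed");
        await new Promise((r) => setTimeout(r, 5000)); // wait 5 s between polls
      }
      throw new Error("Crawl did not finish within the polling window");
    }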
