Firecrawl Tool
Overview
The node implements a web crawling operation using the Firecrawl v2 API. It starts from a specified URL and recursively discovers and scrapes multiple pages within the same site or subdomain, based on user-defined options such as crawl depth, page limits, and path filters. This is useful for gathering large sets of web content automatically, such as scraping all blog posts, articles, or product pages from a website.
Practical examples include:
- Crawling an entire blog to collect all posts published in 2024.
- Scraping a news website while excluding admin or private sections.
- Collecting data from a documentation site with controlled depth and page limits.
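As a rough orientation, starting such a crawl amounts to a single authenticated POST to the Firecrawl crawl endpoint. The sketch below is a minimal illustration, assuming a v2-style /v2/crawl path, a Bearer-token Authorization header, and url/limit body fields; confirm the exact contract against the Firecrawl API documentation linked at the end of this page.

```typescript
// Minimal sketch: start a Firecrawl crawl job for a site.
// Endpoint path, header, and body fields are assumptions based on the Firecrawl docs;
// verify against https://docs.firecrawl.dev before relying on them.
const FIRECRAWL_URL = "https://api.firecrawl.dev"; // default endpoint (see Dependencies below)
const API_KEY = process.env.FIRECRAWL_API_KEY ?? "";

async function startCrawl(startUrl: string): Promise<unknown> {
  const res = await fetch(`${FIRECRAWL_URL}/v2/crawl`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`, // Firecrawl expects a Bearer token
    },
    body: JSON.stringify({ url: startUrl, limit: 10 }), // crawl at most 10 pages
  });
  if (!res.ok) throw new Error(`Crawl request failed: ${res.status}`);
  return res.json();
}

startCrawl("https://example.com/blog").then(console.log).catch(console.error);
```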
Properties
| Name | Meaning |
|---|---|
| URL | The starting URL for the crawl. The crawler will discover and scrape all accessible pages from this starting point. |
| Crawl Options | Collection of sub-options (listed below) that control crawl behavior. |
| - Limit | Maximum number of pages to crawl. |
| - Max Depth | Maximum depth to crawl from the starting URL (how many link levels deep). |
| - Smart Crawl Prompt | Natural language prompt to intelligently guide the crawl, e.g., "Only crawl blog posts from 2024". |
| - Include Paths | Comma-separated list of URL paths to include in the crawl (e.g., "/blog,/articles"). |
| - Exclude Paths | Comma-separated list of URL paths to exclude from the crawl (e.g., "/admin,/private"). |
| - Allow External Links | Whether to follow links to external domains outside the starting domain. |
| - Ignore Sitemap | Whether to ignore the site's sitemap when crawling. |
| - Wait for Completion | Whether to wait for the crawl job to complete before returning results; if disabled, the node returns immediately with a job ID for asynchronous processing. |
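To show how these properties could translate into an API payload, the following sketch maps the node's crawl options onto a hypothetical request body. The field names (maxDepth, prompt, includePaths, excludePaths, allowExternalLinks, ignoreSitemap) mirror the option labels above but are assumptions about the Firecrawl request schema rather than a confirmed contract; note how the comma-separated path filters are split into arrays.

```typescript
// Sketch: translate the node's Crawl Options into a request body.
// Field names are assumed to mirror the UI labels; verify against the Firecrawl docs.
interface CrawlOptions {
  limit?: number;
  maxDepth?: number;
  smartCrawlPrompt?: string;
  includePaths?: string;   // comma-separated, e.g. "/blog,/articles"
  excludePaths?: string;   // comma-separated, e.g. "/admin,/private"
  allowExternalLinks?: boolean;
  ignoreSitemap?: boolean;
}

function buildCrawlBody(url: string, opts: CrawlOptions): Record<string, unknown> {
  // Convert "a,b,c" into ["a", "b", "c"], dropping empty entries.
  const splitPaths = (s?: string) =>
    s ? s.split(",").map((p) => p.trim()).filter(Boolean) : undefined;

  return {
    url,
    limit: opts.limit,
    maxDepth: opts.maxDepth,
    prompt: opts.smartCrawlPrompt,          // natural-language crawl guidance
    includePaths: splitPaths(opts.includePaths),
    excludePaths: splitPaths(opts.excludePaths),
    allowExternalLinks: opts.allowExternalLinks,
    ignoreSitemap: opts.ignoreSitemap,
  };
}
```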
Output
The output JSON contains the crawl results returned by the Firecrawl API. When waiting for completion, it includes the full crawl data such as discovered URLs, scraped content, metadata, and status. If not waiting for completion, it returns a job ID that can be used to query the crawl status later.
The structure typically includes:
- status: The current status of the crawl job (e.g., "completed", "failed").
- data: The crawled pages and their extracted content.
- id: The job identifier if the crawl is asynchronous.
- error: Error message if the crawl failed.
No binary data output is indicated for this operation.
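A downstream consumer of the node's output might distinguish the two shapes (completed crawl vs. asynchronous job ID) as in the sketch below. Only the status, data, id, and error fields come from the structure described above; the type names and per-page fields are illustrative assumptions.

```typescript
// Sketch: handle both output shapes described above (completed crawl vs. async job ID).
// Fields beyond status/data/id/error are illustrative assumptions.
interface CrawlOutput {
  status?: string;                    // e.g. "completed", "failed"
  data?: Array<{ markdown?: string; metadata?: Record<string, unknown> }>;
  id?: string;                        // job identifier for asynchronous crawls
  error?: string;                     // error message if the crawl failed
}

function summarizeOutput(output: CrawlOutput): string {
  if (output.error) return `Crawl failed: ${output.error}`;
  if (output.status === "completed" && output.data) {
    return `Crawl completed with ${output.data.length} pages`;
  }
  if (output.id) return `Crawl running asynchronously, job ID: ${output.id}`;
  return `Crawl status: ${output.status ?? "unknown"}`;
}
```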
Dependencies
- Requires an API key credential for the Firecrawl API.
- The node makes HTTP requests to the Firecrawl API endpoint (default: https://api.firecrawl.dev).
- No other external dependencies are required.
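The credential and endpoint requirements above can be expressed as a small guard before any request is made. This is a sketch only: the environment-variable names are illustrative, and the node itself reads the API key from its n8n credential rather than the environment.

```typescript
// Sketch: resolve the Firecrawl endpoint and credential before making any request.
// Env-var names are illustrative; the node reads these values from its credential store.
function resolveFirecrawlConfig(): { baseUrl: string; apiKey: string } {
  const baseUrl = process.env.FIRECRAWL_BASE_URL ?? "https://api.firecrawl.dev";
  const apiKey = process.env.FIRECRAWL_API_KEY;
  if (!apiKey) {
    // Mirrors the node's behavior: fail fast when no API key credential is provided.
    throw new Error("Firecrawl API key is missing. Add a valid API key credential.");
  }
  return { baseUrl, apiKey };
}
```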
Troubleshooting
- Missing API Key: The node throws an error if the Firecrawl API key is not provided in credentials. Ensure you add a valid API key.
- Crawl Job Timeout: If waiting for completion, the node waits up to 5 minutes for the crawl to finish; if the crawl takes longer, a timeout error is thrown. Consider lowering the page limit or crawl depth, or disabling Wait for Completion and handling long crawls asynchronously (see the polling sketch after this list).
- Crawl Job Failure: If the crawl job fails, the node reports the error message returned by the API. Check the crawl parameters and network accessibility of the target site.
- Invalid Path Filters: Incorrectly formatted include/exclude paths may cause unexpected crawl results. Use comma-separated paths without spaces.
- Allow External Links Misconfiguration: Enabling external links may cause the crawl to expand beyond the intended domain, potentially increasing crawl time and data volume.
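For long crawls where Wait for Completion is disabled, the returned job ID can be polled until the job finishes. The sketch below assumes a GET status endpoint at /v2/crawl/{id} that returns a status field; verify the exact path and response shape against the Firecrawl documentation.

```typescript
// Sketch: poll an asynchronous crawl job until it completes or a deadline passes.
// The status endpoint path and response fields are assumptions; verify against the Firecrawl docs.
async function waitForCrawl(
  jobId: string,
  apiKey: string,
  timeoutMs = 5 * 60 * 1000, // mirrors the node's 5-minute wait described above
  pollIntervalMs = 5000,
): Promise<unknown> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await fetch(`https://api.firecrawl.dev/v2/crawl/${jobId}`, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    if (!res.ok) throw new Error(`Status check failed: ${res.status}`);
    const job = (await res.json()) as { status?: string; error?: string };
    if (job.status === "completed") return job;
    if (job.status === "failed") throw new Error(job.error ?? "Crawl job failed");
    await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
  }
  throw new Error(`Crawl job ${jobId} did not finish within ${timeoutMs / 1000}s`);
}
```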
Links and References
- Firecrawl API Documentation: https://docs.firecrawl.dev
- Web Crawling Concepts: https://en.wikipedia.org/wiki/Web_crawler
- n8n HTTP Request Node Documentation: https://docs.n8n.io/nodes/n8n-nodes-base.httpRequest/