Firecrawl Tool
Overview
The node implements a web crawling operation using the Firecrawl v2 API. It starts from a specified URL and recursively discovers and scrapes multiple pages within the same site or subdomain, based on user-defined options such as crawl depth, page limits, and path filters. This is useful for gathering large sets of web content automatically, such as scraping all blog posts, articles, or product pages from a website.
Practical examples include:
- Crawling an entire blog to collect all posts published in 2024.
- Scraping a news website while excluding admin or private sections.
- Collecting data from a documentation site with controlled depth and page limits.
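As a rough orientation, starting such a crawl amounts to a single authenticated POST to the Firecrawl crawl endpoint. The sketch below is a minimal illustration, assuming a v2-style /v2/crawl path, a Bearer-token Authorization header, and url/limit body fields; confirm the exact contract against the Firecrawl API documentation linked at the end of this page.

```typescript
// Minimal sketch: start a Firecrawl crawl job for a site.
// Endpoint path, header, and body fields are assumptions based on the Firecrawl docs;
// verify against https://docs.firecrawl.dev before relying on them.
const FIRECRAWL_URL = "https://api.firecrawl.dev"; // default endpoint (see Dependencies below)
const API_KEY = process.env.FIRECRAWL_API_KEY ?? "";

async function startCrawl(startUrl: string): Promise<unknown> {
  const res = await fetch(`${FIRECRAWL_URL}/v2/crawl`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`, // Firecrawl expects a Bearer token
    },
    body: JSON.stringify({ url: startUrl, limit: 10 }), // crawl at most 10 pages
  });
  if (!res.ok) throw new Error(`Crawl request failed: ${res.status}`);
  return res.json();
}

startCrawl("https://example.com/blog").then(console.log).catch(console.error);
```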
Properties
| Name | Meaning |
|---|---|
| URL | The starting URL for the crawl. The crawler will discover and scrape all accessible pages from this starting point. |
| Crawl Options | Collection of sub-options (listed below) that control crawl behavior. |
| - Limit | Maximum number of pages to crawl. |
| - Max Depth | Maximum depth to crawl from the starting URL (how many link levels deep). |
| - Smart Crawl Prompt | Natural language prompt to intelligently guide the crawl, e.g., "Only crawl blog posts from 2024". |
| - Include Paths | Comma-separated list of URL paths to include in the crawl (e.g., "/blog,/articles"). |
| - Exclude Paths | Comma-separated list of URL paths to exclude from the crawl (e.g., "/admin,/private"). |
| - Allow External Links | Whether to follow links to external domains outside the starting domain. |
| - Ignore Sitemap | Whether to ignore the site's sitemap when crawling. |
| - Wait for Completion | Whether to wait for the crawl job to complete before returning results; if disabled, the node returns immediately with a job ID for asynchronous processing. |
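To show how these properties could translate into an API payload, the following sketch maps the node's crawl options onto a hypothetical request body. The field names (maxDepth, prompt, includePaths, excludePaths, allowExternalLinks, ignoreSitemap) mirror the option labels above but are assumptions about the Firecrawl request schema rather than a confirmed contract; note how the comma-separated path filters are split into arrays.

```typescript
// Sketch: translate the node's Crawl Options into a request body.
// Field names are assumed to mirror the UI labels; verify against the Firecrawl docs.
interface CrawlOptions {
  limit?: number;
  maxDepth?: number;
  smartCrawlPrompt?: string;
  includePaths?: string;   // comma-separated, e.g. "/blog,/articles"
  excludePaths?: string;   // comma-separated, e.g. "/admin,/private"
  allowExternalLinks?: boolean;
  ignoreSitemap?: boolean;
}

function buildCrawlBody(url: string, opts: CrawlOptions): Record<string, unknown> {
  // Convert "a,b,c" into ["a", "b", "c"], dropping empty entries.
  const splitPaths = (s?: string) =>
    s ? s.split(",").map((p) => p.trim()).filter(Boolean) : undefined;

  return {
    url,
    limit: opts.limit,
    maxDepth: opts.maxDepth,
    prompt: opts.smartCrawlPrompt,          // natural-language crawl guidance
    includePaths: splitPaths(opts.includePaths),
    excludePaths: splitPaths(opts.excludePaths),
    allowExternalLinks: opts.allowExternalLinks,
    ignoreSitemap: opts.ignoreSitemap,
  };
}
```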
Output
The output JSON contains the crawl results returned by the Firecrawl API. When waiting for completion, it includes the full crawl data such as discovered URLs, scraped content, metadata, and status. If not waiting for completion, it returns a job ID that can be used to query the crawl status later.
The structure typically includes:
- status: The current status of the crawl job (e.g., "completed", "failed").
- data: The crawled pages and their extracted content.
- id: The job identifier if the crawl is asynchronous.
- error: Error message if the crawl failed.
No binary data output is indicated for this operation.
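A downstream consumer of the node's output might distinguish the two shapes (completed crawl vs. asynchronous job ID) as in the sketch below. Only the status, data, id, and error fields come from the structure described above; the type names and per-page fields are illustrative assumptions.

```typescript
// Sketch: handle both output shapes described above (completed crawl vs. async job ID).
// Fields beyond status/data/id/error are illustrative assumptions.
interface CrawlOutput {
  status?: string;                    // e.g. "completed", "failed"
  data?: Array<{ markdown?: string; metadata?: Record<string, unknown> }>;
  id?: string;                        // job identifier for asynchronous crawls
  error?: string;                     // error message if the crawl failed
}

function summarizeOutput(output: CrawlOutput): string {
  if (output.error) return `Crawl failed: ${output.error}`;
  if (output.status === "completed" && output.data) {
    return `Crawl completed with ${output.data.length} pages`;
  }
  if (output.id) return `Crawl running asynchronously, job ID: ${output.id}`;
  return `Crawl status: ${output.status ?? "unknown"}`;
}
```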
Dependencies
- Requires an API key credential for the Firecrawl API.
- The node makes HTTP requests to the Firecrawl API endpoint (default: https://api.firecrawl.dev).
- No other external dependencies are required.
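The credential and endpoint requirements above can be expressed as a small guard before any request is made. This is a sketch only: the environment-variable names are illustrative, and the node itself reads the API key from its n8n credential rather than the environment.

```typescript
// Sketch: resolve the Firecrawl endpoint and credential before making any request.
// Env-var names are illustrative; the node reads these values from its credential store.
function resolveFirecrawlConfig(): { baseUrl: string; apiKey: string } {
  const baseUrl = process.env.FIRECRAWL_BASE_URL ?? "https://api.firecrawl.dev";
  const apiKey = process.env.FIRECRAWL_API_KEY;
  if (!apiKey) {
    // Mirrors the node's behavior: fail fast when no API key credential is provided.
    throw new Error("Firecrawl API key is missing. Add a valid API key credential.");
  }
  return { baseUrl, apiKey };
}
```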
Troubleshooting
- Missing API Key: The node throws an error if the Firecrawl API key is not provided in credentials. Ensure you add a valid API key.
- Crawl Job Timeout: If waiting for completion, the node waits up to 5 minutes for the crawl to finish; if the crawl takes longer, a timeout error is thrown. Consider lowering the page limit or crawl depth, or disabling Wait for Completion and handling long crawls asynchronously (see the polling sketch after this list).
- Crawl Job Failure: If the crawl job fails, the node reports the error message returned by the API. Check the crawl parameters and network accessibility of the target site.
- Invalid Path Filters: Incorrectly formatted include/exclude paths may cause unexpected crawl results. Use comma-separated paths without spaces.
- Allow External Links Misconfiguration: Enabling external links may cause the crawl to expand beyond the intended domain, potentially increasing crawl time and data volume.
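For long crawls where Wait for Completion is disabled, the returned job ID can be polled until the job finishes. The sketch below assumes a GET status endpoint at /v2/crawl/{id} that returns a status field; verify the exact path and response shape against the Firecrawl documentation.

```typescript
// Sketch: poll an asynchronous crawl job until it completes or a deadline passes.
// The status endpoint path and response fields are assumptions; verify against the Firecrawl docs.
async function waitForCrawl(
  jobId: string,
  apiKey: string,
  timeoutMs = 5 * 60 * 1000, // mirrors the node's 5-minute wait described above
  pollIntervalMs = 5000,
): Promise<unknown> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await fetch(`https://api.firecrawl.dev/v2/crawl/${jobId}`, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    if (!res.ok) throw new Error(`Status check failed: ${res.status}`);
    const job = (await res.json()) as { status?: string; error?: string };
    if (job.status === "completed") return job;
    if (job.status === "failed") throw new Error(job.error ?? "Crawl job failed");
    await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
  }
  throw new Error(`Crawl job ${jobId} did not finish within ${timeoutMs / 1000}s`);
}
```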
Links and References
- Firecrawl API Documentation: https://docs.firecrawl.dev
- Web Crawling Concepts: https://en.wikipedia.org/wiki/Web_crawler
- n8n HTTP Request Node Documentation: https://docs.n8n.io/nodes/n8n-nodes-base.httpRequest/