
Firecrawl

Firecrawl is an LLM-friendly web crawling system.

Overview

The node "Firecrawl" provides web crawling and scraping capabilities, focusing on the V1 resource's operation to create a full-site crawl task ("创建整站获取任务"). This operation enables users to start a comprehensive crawl of an entire website starting from a specified URL. It supports controlling crawl depth, page limits, content extraction formats, and various advanced options such as filtering by tags or paths, handling headers, and proxy settings.

This node is beneficial for scenarios where you need to gather structured data or snapshots from multiple pages of a website automatically. For example:

  • Archiving or backing up website content.
  • Extracting product listings or blog posts across a site.
  • Monitoring changes or updates on a website.
  • Generating datasets for analysis or machine learning from web content.
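As a sketch, this operation boils down to one authenticated POST that launches the crawl. The endpoint path, JSON field names, and base URL below are assumptions modelled on the node's parameter list, not a confirmed API contract; verify them against your Firecrawl API docs before use.

```python
# Hypothetical sketch of preparing a full-site crawl request for the
# Firecrawl v1 API. Endpoint path and field names are assumptions.

def build_crawl_request(api_key: str, start_url: str,
                        base_url: str = "https://api.firecrawl.dev") -> dict:
    """Prepare (but do not send) a full-site crawl request."""
    return {
        "url": f"{base_url}/v1/crawl",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "json": {
            "url": start_url,  # 网页链接 (v1Url), required
            "maxDepth": 3,     # 最大深度, node default
            "limit": 50,       # 最大页面数量, node default
            "scrapeOptions": {"formats": ["markdown"]},  # 返回格式 default
        },
    }

req = build_crawl_request("YOUR_API_KEY", "https://example.com")
# To actually start the crawl (needs network access and a valid key):
#   import requests
#   resp = requests.post(req["url"], headers=req["headers"], json=req["json"])
```

Separating request construction from sending keeps the sketch testable offline and makes the parameter mapping easy to inspect.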

Properties

Each property is listed as its display name, its parameter key, and its meaning:

  • 网页链接 (v1Url): The starting webpage URL for the crawl. Required.
  • 最大深度 (v1CrawlMaxDepth): Maximum depth of links to follow from the starting page during the crawl. Default: 3.
  • 最大页面数量 (v1CrawlLimit): Maximum number of pages to crawl. Default: 50.
  • 返回格式 (v1CrawlFormats): Formats in which to return the scraped content. Options: Extract, HTML, Links, Markdown, 原始HTML (raw HTML), 整个页面截图 (full-page screenshot), 网页截图 (page screenshot). Default: Markdown.
  • 仅获取网页主要内容 (v1CrawlOnlyMainContent): Whether to extract only the main content of each webpage, ignoring sidebars, ads, etc. Default: true.
  • 等待时间(毫秒) (v1CrawlWaitFor): Time in milliseconds to wait for the page to render before scraping. Default: 0.
  • HTTP Headers (v1CrawlHeaders): Custom HTTP headers to send with requests during crawling.
  • Tag标签白名单 (v1CrawlOnlyIncludeTags): List of HTML tag names to include when scraping content.
  • 排除的Tag标签 (v1CrawlExcludeTags): List of HTML tag names to exclude from scraping.
  • 包含的网页链接 (v1CrawlIncludePaths): List of URL path patterns (supports regex-like syntax) to include in the crawl.
  • 排除的网页链接 (v1CrawlExcludePaths): List of URL path patterns to exclude from the crawl.
  • 超时时间(毫秒) (v1CrawlTimeout): Timeout in milliseconds for waiting for page rendering. Default: 60.
  • 跳过TLS验证 (v1CrawlSkipTlsVerification): Whether to skip TLS certificate verification during requests. Default: true.
  • 获取移动端页面 (v1CrawlMobile): Whether to fetch the mobile version of pages. Default: false.
  • 移除Base64图片 (v1CrawlRemoveBase64Images): Whether to remove embedded Base64 images from the scraped content. Default: false.
  • 忽略站点地图 (v1CrawlIgnoreSitemap): Whether to ignore sitemap files during crawling. Default: true.
  • 允许反向链接 (v1CrawlAllowBackwardLinks): Whether to allow crawling backward links (links pointing back to previously visited pages). Default: true.
  • 允许外部链接 (v1CrawlAllowExternalLinks): Whether to allow crawling external links outside the starting domain. Default: true.
  • Webhook地址 (v1CrawlWebhook): URL to receive webhook callbacks for crawl events such as start, page crawled, completion, or failure.
  • 屏蔽广告 (v1CrawlBlockAds): Enable ad blocking and cookie-popup blocking during crawling. Default: false.
  • 代理类型 (v1CrawlProxy): Proxy type used for crawling. Options: Basic (fast; for sites without advanced anti-bot measures) and Stealth (slower but more reliable against advanced anti-bot protections). Default: Basic.

Output

The node outputs JSON data representing the results of the crawl task. This typically includes:

  • Metadata about the crawl task status.
  • Details of each crawled page including extracted content in the requested formats (e.g., markdown, HTML).
  • Links discovered during the crawl.
  • Screenshots if requested.

If screenshots are included, binary data representing the image(s) may be part of the output.
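A sketch of consuming that output follows. The JSON shape here (a status field plus a list of per-page entries) is an assumption modelled on the description above, not a guaranteed schema; adjust the keys to match your actual crawl results.

```python
# Hypothetical crawl result, shaped after the output description above:
# task status metadata plus per-page content, links, and source URL.
sample_result = {
    "status": "completed",
    "data": [
        {
            "markdown": "# Home\nWelcome.",
            "links": ["https://example.com/about"],
            "metadata": {"sourceURL": "https://example.com"},
        },
        {
            "markdown": "# About\nHi.",
            "links": [],
            "metadata": {"sourceURL": "https://example.com/about"},
        },
    ],
}

def collect_markdown(result: dict) -> dict:
    """Map each crawled URL to its extracted markdown, if present."""
    return {
        page["metadata"]["sourceURL"]: page["markdown"]
        for page in result.get("data", [])
        if "markdown" in page
    }

pages = collect_markdown(sample_result)
```

Pages that were requested in other formats (e.g., screenshots only) would simply be skipped by the `"markdown" in page` guard.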

Dependencies

  • Requires an API key credential for authenticating with the Firecrawl service.
  • Needs the base URL of the Firecrawl API configured in credentials.
  • Network access to target websites and optionally to webhook URLs.
  • Optional proxy configuration depending on the selected proxy type.
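One common way to wire the API key and base URL is through environment variables. The variable names below are hypothetical, and the fallback base URL is an assumption; use whatever your Firecrawl deployment actually exposes.

```python
import os

# Hypothetical environment-variable names for the two required credentials.
def load_firecrawl_config(env: dict = None) -> dict:
    """Read the API key and base URL from an environment mapping."""
    env = env if env is not None else dict(os.environ)
    api_key = env.get("FIRECRAWL_API_KEY")
    if not api_key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set")
    return {
        "api_key": api_key,
        # Base URL falls back to a public endpoint if unset (assumed value).
        "base_url": env.get("FIRECRAWL_BASE_URL", "https://api.firecrawl.dev"),
    }

cfg = load_firecrawl_config({"FIRECRAWL_API_KEY": "test-key"})
```

Passing the environment as a plain dict keeps the helper testable without touching the real process environment.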

Troubleshooting

  • Common issues:

    • An invalid or unreachable starting URL will cause crawl failures.
    • Setting the max depth or page limit too high may lead to long execution times or timeouts.
    • An incorrect webhook URL may result in missed event notifications.
    • Sites with self-signed certificates may require skipping TLS verification.
    • Proxy misconfiguration can block crawling or cause slowdowns.
  • Error messages:

    • "Invalid URL" — Check the format and accessibility of the provided URL.
    • "Timeout exceeded" — Increase timeout or reduce crawl depth/page limit.
    • "Authentication failed" — Verify API key credential correctness.
    • "Webhook delivery failed" — Ensure webhook endpoint is reachable and accepts POST requests.

Links and References

  • Firecrawl official documentation (not provided here; consult your Firecrawl API docs)
  • Web crawling best practices and ethical guidelines
  • n8n documentation on creating custom nodes and using HTTP request nodes
