
Firecrawl

Firecrawl is an LLM-friendly web scraping system.

Overview

The "Firecrawl" node performs web scraping through a service designed to produce LLM (Large Language Model)-friendly output. Specifically, the V1 resource's "创建批量获取任务 Batch/Scrape" operation creates batch scraping jobs from multiple URLs, with options controlling how the content is extracted and formatted.

This node is beneficial in scenarios where you need to collect data from multiple web pages efficiently, such as:

  • Aggregating product information from e-commerce sites.
  • Collecting news articles or blog posts in bulk.
  • Extracting structured data from multiple web pages for analysis.
  • Taking screenshots of multiple pages for visual monitoring.

Practical example: You want to scrape the main content and links from a list of URLs to build a dataset for sentiment analysis or market research. You can input all URLs at once, specify the desired output formats (e.g., Markdown and Links), and configure options like waiting time or HTTP headers to mimic real user behavior.
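To make the example above concrete, here is a minimal sketch of the kind of request body such a batch job might build. The field names (`urls`, `formats`, `waitFor`, `headers`) are assumptions modeled on Firecrawl-style batch scrape options, not a verified API contract:

```python
# Hypothetical batch-scrape payload; key names are assumptions,
# check your Firecrawl API docs for the authoritative schema.
payload = {
    "urls": [                               # 网页链接: all URLs submitted at once
        "https://example.com/review-1",
        "https://example.com/review-2",
    ],
    "formats": ["markdown", "links"],       # 返回格式: Markdown and Links
    "waitFor": 2000,                        # 等待时间(毫秒): let the page render
    "headers": {                            # custom HTTP headers to mimic a browser
        "User-Agent": "Mozilla/5.0",
    },
}

print(len(payload["urls"]), payload["formats"])
```

In the node UI these values are entered as separate properties (listed below); the node assembles them into a single request per batch job.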

Properties

| Name | Meaning |
|------|---------|
| 网页链接 (v1CrawlUrls) | A collection of URLs to scrape. Users can add multiple web page links, which will be processed in batch. |
| 返回格式 (v1Formats) | The output format(s) for the scraped content. Options: Extract (structured extraction), HTML, Links, Markdown, 原始HTML (raw HTML), 整个页面截图 (full-page screenshot). Default: Markdown. |
| 选项 (options) | Additional optional settings, listed below. |

Optional settings under 选项 (options):

- 仅获取网页主要内容: only extract the page's main content (boolean).
- 等待时间(毫秒): wait time in milliseconds before scraping, to allow the page to render.
- HTTP Headers: custom headers to send with the request.
- Tag标签白名单: whitelist of HTML tags to include.
- 排除的Tag标签: HTML tags to exclude.
- 超时时间(毫秒): scraping timeout in milliseconds.
- 跳过TLS验证: skip TLS certificate verification (boolean).
- 获取移动端页面: fetch the mobile version of the page (boolean).
- 移除Base64图片: remove base64-encoded images (boolean).
- Extract 配置: configuration for extraction prompts and schema.
- 页面操作: JSON array of page actions to perform.
- 地理位置设置: location settings, including country and languages.
- 屏蔽广告: enable ad and cookie-popup blocking (boolean).
- 代理类型: proxy type to use ("Basic" or "Stealth").
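As a sketch of how these optional settings might map onto a request body, the dictionary below pairs each UI label with a plausible JSON key. The key names are assumptions in the style of Firecrawl's scrape options; consult the API docs for the real schema:

```python
# Hypothetical mapping of 选项 (options) fields to request-body keys.
# All key names are assumptions, shown only to illustrate the shape.
options = {
    "onlyMainContent": True,        # 仅获取网页主要内容
    "waitFor": 1500,                # 等待时间(毫秒)
    "timeout": 30000,               # 超时时间(毫秒)
    "skipTlsVerification": False,   # 跳过TLS验证
    "mobile": False,                # 获取移动端页面
    "removeBase64Images": True,     # 移除Base64图片
    "blockAds": True,               # 屏蔽广告
    "proxy": "basic",               # 代理类型: "basic" or "stealth"
    "location": {                   # 地理位置设置
        "country": "US",
        "languages": ["en"],
    },
}

print(sorted(options))
```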

Output

The node outputs JSON data representing the results of the batch scraping task. The structure typically includes the scraped content in the requested formats (e.g., extracted text, HTML snippets, links, markdown, raw HTML, or screenshots). If screenshots are requested, binary data representing the image may be included or referenced.

The output JSON contains one entry per submitted URL, with fields corresponding to the selected formats and extraction configuration.
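For illustration, downstream code might iterate over the per-URL results like this. The exact field names (`url`, `markdown`, `links`) are assumptions based on the requested formats, not a guaranteed output schema:

```python
# Hypothetical per-URL output shape: one entry per submitted URL,
# with keys matching the requested formats. Field names are assumptions.
results = [
    {"url": "https://example.com/a", "markdown": "# Page A",
     "links": ["https://example.com/b"]},
    {"url": "https://example.com/b", "markdown": "# Page B",
     "links": []},
]

for item in results:
    # e.g. count outbound links found on each scraped page
    print(item["url"], len(item.get("links", [])))
```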

Dependencies

  • Requires an API key credential for authentication with the Firecrawl service.
  • Needs network access to the target URLs.
  • Supports proxy configuration (basic or stealth) to handle anti-bot measures.
  • May require environment variables or n8n credentials setup for the Firecrawl API base URL and authentication token.

Troubleshooting

  • Timeouts: If pages take too long to load, increase the "超时时间(毫秒)" (timeout) or "等待时间(毫秒)" (waitFor) properties.
  • TLS Errors: If TLS certificate errors occur, enable "跳过TLS验证" (skip TLS verification).
  • Incomplete Data: Ensure "仅获取网页主要内容" (onlyMainContent) is set appropriately; disabling it may yield more complete but noisier data.
  • Proxy Issues: Use "Stealth" proxy if basic proxy fails due to anti-bot protections.
  • Headers Not Applied: Verify custom HTTP headers are correctly formatted.
  • Ad Blocking: Enable "屏蔽广告" (blockAds) to reduce interference from ads and popups.
  • Invalid JSON in Extract Config: Check the JSON schema and prompt strings for correctness.

Common error messages usually relate to network issues, invalid URLs, or authentication failures. Confirm API credentials and URL validity first.
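Since invalid URLs are a common failure mode, a cheap pre-flight check before submitting the batch can save a failed job. This is a generic sketch using the Python standard library, not part of the node itself:

```python
from urllib.parse import urlparse

def looks_valid(url: str) -> bool:
    """Cheap pre-flight check: require an http(s) scheme and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

urls = ["https://example.com/a", "not-a-url", "ftp://example.com/file"]
good = [u for u in urls if looks_valid(u)]
print(good)
```

Filtering out malformed entries up front makes authentication and network errors easier to isolate when they do occur.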

Links and References

  • Firecrawl official documentation (not provided here, check your Firecrawl API docs)
  • Web scraping best practices and legal considerations
  • n8n documentation on creating custom nodes and handling HTTP requests
