Firecrawl

Firecrawl is an LLM-friendly web crawling system.

Overview

The Firecrawl node scrapes web pages and returns their content in formats suited to large language model (LLM) workflows. With the "V1" resource and the "获取网页 Scrape" (Get Webpage Scrape) operation, it fetches and processes the content of a single webpage, or of pages handled as batch tasks, with a choice of output formats and customization options.

This node is beneficial in scenarios such as:

  • Extracting main content or specific tags from articles or blogs.
  • Capturing full-page screenshots or HTML snapshots for archival or analysis.
  • Collecting links or markdown-formatted content for further processing.
  • Automating data collection from websites that require rendering time or custom headers.

Practical examples:

  • Automatically scrape news articles' main text and metadata for sentiment analysis.
  • Generate markdown summaries of product pages for e-commerce monitoring.
  • Capture screenshots of webpages for visual regression testing or documentation.

Properties

Name | Meaning
网页链接 / Webpage URL (v1Url) | The URL of the webpage to scrape. Required for operations such as fetching a single page or creating site-wide tasks.
返回格式 / Return Formats (v1Formats) | Output formats to return. Options: Extract (structured extraction), HTML, Links, Markdown, 原始HTML (raw HTML), 整个页面截图 (full-page screenshot). Used for single-page fetches and batch tasks.
返回格式 / Return Formats (v1SearchFormats) | Same as v1Formats, but used only by the "搜索网页" (Search Webpage) operation. Adds a "网页截图" (webpage screenshot) option.
选项 / Options (options) | Collection of optional settings:
- 仅获取网页主要内容 / Main Content Only (v1OnlyMainContent) | Whether to extract only the main content of the webpage. Defaults to true.
- 等待时间(毫秒) / Wait Time, ms (v1WaitFor) | Time in milliseconds to wait for the webpage to render before scraping. Useful for dynamic content.
- HTTP Headers (v1Headers) | Custom HTTP headers to send with the request, given as one or more key-value pairs.
- Tag标签白名单 / Tag Whitelist (v1OnlyIncludeTags) | HTML tags to include in the extraction.
- 排除的Tag标签 / Excluded Tags (v1ExcludeTags) | HTML tags to exclude from the extraction.
- 超时时间(毫秒) / Timeout, ms (v1Timeout) | Timeout in milliseconds for the scraping operation.
- 跳过TLS验证 / Skip TLS Verification (v1SkipTlsVerification) | Whether to skip TLS certificate verification. Defaults to true.
- 获取移动端页面 / Fetch Mobile Page (v1Mobile) | Whether to fetch the mobile version of the webpage. Defaults to false.
- 移除Base64图片 / Remove Base64 Images (v1RemoveBase64Images) | Whether to remove Base64-encoded images from the output. Defaults to false.
- Extract 配置 / Extract Configuration (v1Extract) | Configuration for structured extraction: a system prompt, a user prompt, and a JSON schema.
- 页面操作 / Page Actions (v1Actions) | JSON array of actions to perform on the page before scraping (e.g., clicks, scrolls).
- 地理位置设置 / Location Settings (v1Location) | Country and preferred languages used to simulate a location during scraping.
- 屏蔽广告 / Block Ads (v1BlockAds) | Whether to block ads and cookie popups during scraping. Defaults to false.
- 代理类型 / Proxy Type (v1Proxy) | Proxy type to use for scraping. Options: Basic (fast, for sites without advanced anti-bot protection), Stealth (slower, for sites with advanced anti-bot protection). Defaults to Basic.
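
As a rough illustration of how these properties might translate into a scrape request, the sketch below assembles a JSON body from node-style options. The body field names (`onlyMainContent`, `waitFor`, `includeTags`, `excludeTags`, `extract`, `proxy`) are assumed to mirror the Firecrawl v1 scrape API; the mapping from the node's `v1*` identifiers to those fields is not confirmed by this page. `build_scrape_payload` is a hypothetical helper, and the 30000 ms timeout default is a placeholder.

```python
from __future__ import annotations

import json
from typing import Any, Optional


def build_scrape_payload(
    url: str,
    formats: Optional[list[str]] = None,
    only_main_content: bool = True,
    wait_for_ms: int = 0,
    timeout_ms: int = 30000,  # placeholder default, not taken from the node
    headers: Optional[dict[str, str]] = None,
    include_tags: Optional[list[str]] = None,
    exclude_tags: Optional[list[str]] = None,
    extract: Optional[dict[str, Any]] = None,
    proxy: str = "basic",
) -> dict[str, Any]:
    """Build a scrape request body from node-style options (hypothetical helper)."""
    if not url:
        raise ValueError("url is required")

    payload: dict[str, Any] = {
        "url": url,
        # Copy the list so the caller's argument is never mutated.
        "formats": list(formats) if formats else ["markdown"],
        "onlyMainContent": only_main_content,
        "timeout": timeout_ms,
        "proxy": proxy,
    }

    # Optional settings are only included when the caller sets them.
    if wait_for_ms > 0:
        payload["waitFor"] = wait_for_ms
    if headers:
        payload["headers"] = dict(headers)
    if include_tags:
        payload["includeTags"] = list(include_tags)
    if exclude_tags:
        payload["excludeTags"] = list(exclude_tags)
    if extract:
        # Structured extraction is assumed to need the "extract" format as well.
        if "extract" not in payload["formats"]:
            payload["formats"].append("extract")
        payload["extract"] = dict(extract)

    return payload


payload = build_scrape_payload(
    "https://example.com/article",
    formats=["markdown", "links"],
    wait_for_ms=2000,
    headers={"Accept-Language": "en-US"},
    exclude_tags=["nav", "footer"],
)
print(json.dumps(payload, indent=2))
```

Leaving unset options out of the body, rather than sending empty values, keeps the request close to whatever defaults the service applies.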

Output

The node outputs JSON data containing the scraped content according to the requested formats. The structure depends on the selected output formats:

  • Extract: Structured data extracted based on the provided schema and prompts.
  • HTML / 原始HTML (raw HTML): The HTML content of the page, either processed (HTML) or unmodified (原始HTML).
  • Links: An array of hyperlinks found on the page.
  • Markdown: The page content converted into markdown format.
  • 整个页面截图 (full-page screenshot) / 网页截图 (webpage screenshot): Binary image data representing a screenshot of the entire page or of the viewport.

If screenshots are requested, the node outputs binary data representing the captured images.
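
To make the format-dependent structure more concrete, the sketch below summarizes which formats are present in one output item. The key names (`markdown`, `html`, `rawHtml`, `links`, `extract`) are assumed from the format names above and are not confirmed for this node; `summarize_result` and the sample item are hypothetical.

```python
from typing import Any


def summarize_result(item: dict[str, Any]) -> dict[str, str]:
    """Describe which formats one result item contains (assumed key names)."""
    summary: dict[str, str] = {}

    # Text formats: report their length instead of dumping the content.
    for key in ("markdown", "html", "rawHtml"):
        value = item.get(key)
        if isinstance(value, str):
            summary[key] = f"{len(value)} characters"

    links = item.get("links")
    if isinstance(links, list):
        summary["links"] = f"{len(links)} links"

    extract = item.get("extract")
    if isinstance(extract, dict):
        summary["extract"] = "fields: " + ", ".join(sorted(extract))

    return summary


sample = {
    "markdown": "# Title\n\nBody text.",
    "links": ["https://example.com/a", "https://example.com/b"],
    "extract": {"title": "Title", "author": "Unknown"},
}
print(summarize_result(sample))
```

A check like this can help a downstream step confirm that the formats it expects were actually requested.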

Dependencies

  • Requires an API key credential for the Firecrawl service.
  • Needs network access to the target URLs.
  • Supports proxy configuration to bypass anti-scraping measures.
  • Optional location and language settings may require additional configuration on the Firecrawl platform.

Troubleshooting

  • Timeouts: If the page takes too long to load, increase "超时时间" (v1Timeout). If dynamic content is missing because the page was scraped before it finished rendering, increase "等待时间" (v1WaitFor).
  • Incomplete content: Check the "仅获取网页主要内容" (v1OnlyMainContent) setting; disabling it may help if important parts of the page are missing.
  • TLS errors: If TLS verification fails, enabling "跳过TLS验证" (v1SkipTlsVerification) can resolve issues with self-signed certificates.
  • Blocked requests: Set "代理类型" (v1Proxy) to Stealth to get past advanced anti-bot protections.
  • Incorrect output format: Verify that the requested "返回格式" (v1Formats) matches the expected data type and that downstream nodes can handle it.
  • Headers not applied: Confirm that custom HTTP headers are entered as well-formed key-value pairs.
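
The remedies above can be sketched as a small function that adjusts scrape options for a given symptom. The symptom labels and `apply_remedy` are hypothetical. The option keys reuse the property identifiers from the Properties section, the `"stealth"` value is an assumed spelling of the Stealth option, and both the 30000 ms fallback and the doubling of the timeout are arbitrary illustrative choices.

```python
from typing import Any


def apply_remedy(options: dict[str, Any], symptom: str) -> dict[str, Any]:
    """Return a copy of the options adjusted for one symptom (hypothetical helper)."""
    updated = dict(options)  # leave the caller's options untouched

    if symptom == "timeout":
        # Doubling is an arbitrary choice; 30000 ms is a placeholder fallback.
        updated["v1Timeout"] = int(updated.get("v1Timeout", 30000)) * 2
    elif symptom == "incomplete_content":
        updated["v1OnlyMainContent"] = False
    elif symptom == "tls_error":
        updated["v1SkipTlsVerification"] = True
    elif symptom == "blocked":
        updated["v1Proxy"] = "stealth"
    else:
        raise ValueError(f"unknown symptom: {symptom}")

    return updated


base = {"v1Timeout": 15000, "v1Proxy": "basic", "v1OnlyMainContent": True}
print(apply_remedy(base, "timeout"))
print(apply_remedy(base, "blocked"))
```

Returning a copy keeps each retry attempt explicit, so the original settings stay available for comparison.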
