Firecrawl

Firecrawl is an LLM-friendly web crawling system.

Overview

The Firecrawl node scrapes and extracts content from web pages through Firecrawl, a web crawling system that is friendly to large language models (LLMs). The "获取单个网页" ("Get Single Webpage") operation under the "V0" resource fetches the content of a single webpage, with options that control what is extracted and returned.

Common scenarios for this node include:

  • Extracting the main textual content of an article or blog post.
  • Capturing screenshots of webpages for visual records.
  • Retrieving HTML or raw HTML for further processing or analysis.
  • Filtering specific HTML tags to include or exclude certain parts of the page.
  • Adding custom HTTP headers for authenticated or specialized requests.
  • Waiting for dynamic content to load before scraping.

Practical examples:

  • Automatically fetching news articles' main content without ads or sidebars.
  • Taking full-page screenshots of product pages for monitoring changes.
  • Collecting all links from a webpage for link analysis.
  • Scraping markdown-formatted content for integration into documentation systems.

Properties

Name | Meaning
网页链接 ("Webpage URL") (v0Url) | The URL of the webpage to fetch. Required.
仅获取网页主要内容 ("Only fetch main content") (v0OnlyMainContent) | Whether to extract only the main content of the webpage (true/false). Defaults to true.
返回HTML ("Return HTML") (v0IncludeHtml) | Whether to include the processed HTML content in the response (true/false). Defaults to false.
返回原始HTML ("Return raw HTML") (v0IncludeRawHtml) | Whether to include the raw HTML source of the webpage in the response (true/false). Defaults to false.
截图 ("Screenshot") (v0Screenshot) | Whether to capture a screenshot of the visible part of the webpage (true/false). Defaults to false.
整体页面截图 ("Full-page screenshot") (v0FullPageScreenshot) | Whether to capture a full-page screenshot (the entire scrollable area) of the webpage (true/false). Defaults to false.
等待时间(毫秒) ("Wait time (ms)") (v0WaitFor) | Time in milliseconds to wait for the webpage to render before scraping. Useful for pages with dynamic content. Defaults to 0.
HTTP Headers (v0Headers) | Custom HTTP headers to send with the request, given as one or more key-value pairs. Useful for authentication or for setting a user-agent string.
Tag标签白名单 ("Tag whitelist") (v0OnlyIncludeTags) | List of HTML tag names to include in the extracted content. If specified, only these tags are included.
排除的Tag标签 ("Excluded tags") (v0RemoveTags) | List of HTML tag names to remove from the extracted content.
超时时间(毫秒) ("Timeout (ms)") (v0Timeout) | Time in milliseconds to wait for rendering and scraping to finish before aborting. Defaults to 60,000 ms (60 seconds).
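
As a rough illustration of how the properties above fit together, the sketch below gathers them into a single dictionary using the documented defaults. The function name `build_scrape_options` and its argument names are hypothetical; the dictionary keys are the node's property names, and how the node maps them onto the Firecrawl API request is not described in this document.

```python
from typing import Any, Dict, List, Optional


def build_scrape_options(
    url: str,
    only_main_content: bool = True,
    include_html: bool = False,
    include_raw_html: bool = False,
    screenshot: bool = False,
    full_page_screenshot: bool = False,
    wait_for_ms: int = 0,
    headers: Optional[Dict[str, str]] = None,
    only_include_tags: Optional[List[str]] = None,
    remove_tags: Optional[List[str]] = None,
    timeout_ms: int = 60_000,
) -> Dict[str, Any]:
    """Collect the node's properties into one dict (hypothetical helper).

    Defaults mirror the ones listed in the Properties section.
    """
    if not url:
        # v0Url is the only required property.
        raise ValueError("v0Url is required")

    options: Dict[str, Any] = {
        "v0Url": url,
        "v0OnlyMainContent": only_main_content,
        "v0IncludeHtml": include_html,
        "v0IncludeRawHtml": include_raw_html,
        "v0Screenshot": screenshot,
        "v0FullPageScreenshot": full_page_screenshot,
        "v0WaitFor": wait_for_ms,
        "v0Timeout": timeout_ms,
    }
    # Optional collections are only added when the caller supplies them.
    if headers:
        options["v0Headers"] = dict(headers)
    if only_include_tags:
        options["v0OnlyIncludeTags"] = list(only_include_tags)
    if remove_tags:
        options["v0RemoveTags"] = list(remove_tags)
    return options


# Example: wait two seconds for dynamic content and drop navigation chrome.
example = build_scrape_options(
    "https://example.com/post",
    wait_for_ms=2000,
    remove_tags=["nav", "footer"],
)
```

Leaving the optional collections out when they are empty keeps the settings close to what a user would see in the node UI with nothing added.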

Output

The node outputs JSON data containing the scraped webpage content according to the selected options. The output structure may include:

  • Main content text extracted from the webpage.
  • Processed HTML content if requested.
  • Raw HTML source if requested.
  • Screenshots as binary data (image files), either visible viewport or full page.
  • Lists of links or other extracted elements depending on options.

If screenshots are enabled, the node outputs binary image data representing the captured webpage screenshot(s).
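
Because the parts of the output depend on which options were enabled, downstream steps may want to check what a given item actually contains. The sketch below does this for one output item. The field names (`content`, `html`, `rawHtml`, `screenshot`) are hypothetical placeholders, since this document does not list the exact output keys; adjust them to match the node's real output.

```python
from typing import Any, Dict, List

# Hypothetical field names: the exact output keys are not documented here.
OPTIONAL_PARTS: Dict[str, str] = {
    "content": "main content text",
    "html": "processed HTML",
    "rawHtml": "raw HTML source",
    "screenshot": "screenshot",
}


def present_parts(item: Dict[str, Any]) -> List[str]:
    """Return labels for the parts that are present and non-empty in one item."""
    return [label for key, label in OPTIONAL_PARTS.items() if item.get(key)]


# Example item with text and raw HTML but an empty processed-HTML field.
sample_item = {"content": "Hello", "html": "", "rawHtml": "<p>Hello</p>"}
parts = present_parts(sample_item)
```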

Dependencies

  • Requires an API key credential for the Firecrawl service.
  • The node sends requests to the Firecrawl API base URL configured in credentials.
  • No other external dependencies are indicated.
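
To make the two credential values concrete, the following sketch prepares (but does not send) an authenticated request from a base URL and an API key. The endpoint path `/v0/scrape`, the bearer-token header, and the request body shape are assumptions that are not confirmed by this document; check the Firecrawl API documentation before relying on them. The function name `build_request` is hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict


def build_request(base_url: str, api_key: str, body: Dict[str, Any]) -> urllib.request.Request:
    """Prepare, without sending, an authenticated JSON POST request."""
    # Assumed endpoint path; the document only says a base URL is configured.
    endpoint = base_url.rstrip("/") + "/v0/scrape"
    return urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumed authentication scheme (bearer token).
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Example with placeholder credentials; nothing is sent over the network.
request = build_request(
    "https://firecrawl.example/",
    "test-key",
    {"url": "https://example.com"},
)
```

Stripping the trailing slash from the base URL avoids a double slash when the credential is stored as `https://host/`.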

Troubleshooting

  • Timeouts: If the webpage takes too long to load, increase "超时时间(毫秒)" (v0Timeout). If dynamic content has not rendered by the time the page is scraped, increase "等待时间(毫秒)" (v0WaitFor).
  • Empty content: Ensure the URL is correct and accessible. Check if the page requires authentication or special headers; use the HTTP Headers property accordingly.
  • Screenshots not generated: Verify that the screenshot options are enabled and that the API supports capturing screenshots for the target page.
  • Invalid tags in whitelist/blacklist: Use valid HTML tag names; incorrect tags may cause unexpected results.
  • API errors: Confirm that the API key credential is valid and has sufficient permissions.
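
Several of the problems above can be caught before the node runs. The sketch below is one possible pre-flight check over the node's settings; the function name `check_settings` is hypothetical, and the rule that the wait time should be shorter than the timeout is an assumption drawn from the description of v0Timeout, not a documented constraint.

```python
import re
from typing import Any, Dict, List
from urllib.parse import urlparse

# A tag name starts with a letter, followed by letters, digits, or hyphens.
TAG_NAME = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def check_settings(settings: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []

    # Empty content is often caused by a malformed or relative URL.
    parsed = urlparse(settings.get("v0Url", ""))
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("v0Url must be an absolute http(s) URL")

    # Assumption: waiting as long as the timeout leaves no time to scrape.
    wait_for = settings.get("v0WaitFor", 0)
    timeout = settings.get("v0Timeout", 60_000)
    if wait_for >= timeout:
        problems.append("v0WaitFor should be shorter than v0Timeout")

    # Tag lists take bare tag names such as "nav", not markup such as "<nav>".
    for prop in ("v0OnlyIncludeTags", "v0RemoveTags"):
        for tag in settings.get(prop, []):
            if not TAG_NAME.fullmatch(tag):
                problems.append(f"{prop}: '{tag}' is not a valid tag name")

    return problems


# Example: a relative URL, an excessive wait, and a tag written as markup.
issues = check_settings(
    {"v0Url": "example.com", "v0WaitFor": 70_000, "v0RemoveTags": ["nav", "<div>"]}
)
```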

Links and References

  • Firecrawl official website and API documentation.
  • General web scraping best practices.
  • n8n documentation on creating and using custom nodes.
