Firecrawl

Get data from the Firecrawl API

Overview

The Firecrawl node extracts structured data from web pages through the Firecrawl API. It suits web scraping and data extraction tasks that gather information from multiple URLs, optionally guided by a prompt or a JSON schema that shapes the extracted data.

Common scenarios include:

  • Extracting product details, prices, or reviews from e-commerce sites.
  • Gathering news articles or blog content based on URL patterns.
  • Monitoring website changes with change tracking formats.
  • Capturing screenshots of web pages for visual records.
  • Collecting links or summaries from specified web pages.

Practical example: You can provide a list of URLs (supporting glob patterns) and define a JSON schema to extract specific fields like titles and dates from each page. The node will crawl these URLs, scrape the content according to your settings, and return structured JSON data.
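The practical example above can be sketched as a small helper that assembles the inputs. This is an illustrative sketch only: the key names `urls`, `schema`, and `prompt`, the helper `build_extract_payload`, and the example.com URL are assumptions made for this example, not confirmed field names of the node or the API.

```python
import json


def build_extract_payload(urls, schema, prompt=None):
    """Assemble a request body for a hypothetical extract call.

    The key names used here are assumptions for illustration.
    """
    if not urls:
        raise ValueError("at least one URL or glob pattern is required")
    payload = {"urls": list(urls), "schema": schema}
    if prompt:
        payload["prompt"] = prompt
    return payload


# JSON schema asking for a title and a date on each page.
article_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "date": {"type": "string"},
    },
    "required": ["title"],
}

payload = build_extract_payload(
    ["https://example.com/blog/*"],  # glob pattern covering several pages
    article_schema,
    prompt="Extract the title and publication date of each article.",
)
print(json.dumps(payload, indent=2))
```

The helper makes no network request; it only shows how a glob URL, a schema, and an optional prompt fit together.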

Properties

Name | Meaning
URLs | List of URLs to extract data from. Supports glob format to specify multiple pages.
Prompt | A text prompt to guide the extraction process, helping tailor the data extraction to your needs.
Schema | JSON schema defining the structure of the extracted data, ensuring output matches expected fields and types.
Ignore Sitemap | Whether to ignore the website's sitemap during crawling (default: true).
Include Subdomains | Whether to include subdomains of the target website in the crawl (default: false).
Enable Web Search | Enables web search to find additional data beyond the provided URLs (default: false).
Show Sources | Whether to include the sources used for data extraction in the output (default: false).
Scrape Options | Options controlling how content is scraped, including output formats (e.g., markdown, HTML, JSON, screenshots), change tracking modes, screenshot quality and viewport size, and actions to interact with dynamic content before scraping.
Only Main Content | Whether to return only the main content of the page, excluding headers, navigation bars, footers, etc. (default: true).
Include Tags | Specifies HTML tags to include in the output (e.g., header, article).
Exclude Tags | Specifies HTML tags to exclude from the output (e.g., footer, nav).
Headers | Custom HTTP headers to send with requests, allowing customization such as user-agent or authorization headers.
Wait For (Ms) | Milliseconds to wait for the page to load before fetching content, useful for pages with dynamic content loading.
Mobile | Emulate scraping from a mobile device (default: false).
Skip TLS Verification | Whether to skip TLS certificate verification when making requests (default: false).
Timeout (Ms) | Request timeout in milliseconds (default: 30000).
Actions | List of actions to perform on the page before scraping, such as clicking elements, scrolling, waiting, writing text, pressing keys, or taking screenshots.
Location | Location settings for the request, including country code and preferred languages/locales, affecting localization of content.
Remove Base64 Images | Whether to remove base64 encoded images from the output to reduce payload size (default: true).
Block Ads | Enables ad-blocking and cookie popup blocking during scraping (default: true).
Store In Cache | Whether to store the page in the Firecrawl index and cache; disable if handling sensitive data (default: true).
Proxy | Type of proxy to use for requests; options are Basic or Stealth.
Additional Fields | Allows adding custom JSON properties to the request body for advanced use cases.
Use Custom Body | Option to use a fully custom request body instead of the standard parameters.
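One way to picture how the documented defaults, "Additional Fields", and "Use Custom Body" could interact is sketched below. The default values are taken from the property list above. The camelCase key names, the helper names, and the merge precedence (additional fields override standard parameters, and a custom body replaces everything) are assumptions for illustration, since the source does not describe the node's internal behavior.

```python
# Defaults copied from the property list; the camelCase key names are
# assumptions for illustration, not confirmed request field names.
SCRAPE_DEFAULTS = {
    "onlyMainContent": True,
    "mobile": False,
    "skipTlsVerification": False,
    "timeout": 30000,
    "removeBase64Images": True,
    "blockAds": True,
    "storeInCache": True,
}


def build_scrape_options(**overrides):
    """Start from the documented defaults and apply explicit overrides."""
    options = dict(SCRAPE_DEFAULTS)
    options.update(overrides)
    return options


def resolve_body(standard, additional_fields=None, custom_body=None):
    """Pick the body to send (assumed precedence, see lead-in).

    A custom body, when given, is used as-is; otherwise additional
    fields are merged over the standard parameters.
    """
    if custom_body is not None:
        return dict(custom_body)
    body = dict(standard)
    body.update(additional_fields or {})
    return body


# Sensitive pages: raise the timeout and opt out of caching.
options = build_scrape_options(timeout=60000, storeInCache=False)
body = resolve_body(
    {"urls": ["https://example.com/*"], "scrapeOptions": options},
    additional_fields={"showSources": True},
)
print(body)
```

Copying the defaults before updating keeps `SCRAPE_DEFAULTS` unchanged between calls, so each request starts from the same documented baseline.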

Output

The node outputs JSON data representing the content extracted from the specified URLs. The structure of the output depends on the provided schema and on the formats selected in the scrape options.

Output may include:

  • Structured data matching the user-defined schema.
  • Raw or processed HTML content.
  • Markdown or summary text.
  • Lists of links found on the pages.
  • Change tracking information showing differences between crawls.
  • Screenshots as binary data (if screenshot format is selected).

If "Show Sources" is enabled, the output also includes metadata about the sources used for extraction.

Binary data output (screenshots) is provided as file attachments suitable for further processing or saving.
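Because the structured output is meant to match the user-defined schema, a downstream step may want to confirm that required fields are present. The sketch below checks only the standard JSON Schema `required` keyword. The helper `missing_required` and the sample records are hypothetical and do not represent actual node output.

```python
def missing_required(record, schema):
    """Return the names of required schema fields absent from a record."""
    required = schema.get("required", [])
    return [name for name in required if name not in record]


schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "date": {"type": "string"},
    },
    "required": ["title", "date"],
}

# Hypothetical extracted records, made up for this example.
records = [
    {"title": "Release notes", "date": "2024-05-01"},
    {"title": "Untitled draft"},
]

# Map each record index to the required fields it lacks.
report = {index: missing_required(record, schema)
          for index, record in enumerate(records)}
print(report)
```

This is a presence check only; full validation of types and nested objects would need a JSON Schema validator.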

Dependencies

  • Requires an API key credential for authenticating with the Firecrawl API.
  • Requires network access to the Firecrawl API endpoint (default: https://api.firecrawl.dev/v2).
  • Optional proxy configuration depending on network requirements.
  • No other external dependencies are required.

Troubleshooting

  • Timeouts: If requests time out, consider increasing the "Timeout (Ms)" property or checking network connectivity.
  • Invalid URLs: Ensure URLs are correctly formatted and accessible. Glob patterns should be valid.
  • Schema errors: Malformed JSON schemas can cause extraction failures. Validate JSON syntax before use.
  • Permission issues: Verify that the API key has sufficient permissions and is correctly configured.
  • Dynamic content not loaded: Use "Actions" to interact with the page (e.g., wait, click) before scraping if content loads dynamically.
  • TLS errors: If you encounter TLS certificate errors, enabling "Skip TLS Verification" may help, but use it with caution because it disables certificate checks.
  • Ad-blocking issues: Disabling "Block Ads" may be necessary if it interferes with content loading.
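The "Invalid URLs" and "Schema errors" checks above can be run before the node executes. The following is a minimal sketch using only the Python standard library. The helper names `check_url` and `check_schema` are hypothetical, and the URL check covers only the scheme and host, not reachability or glob validity.

```python
import json
from urllib.parse import urlparse


def check_url(url):
    """Accept only http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def check_schema(text):
    """Return (ok, message) after parsing a schema given as JSON text."""
    try:
        schema = json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg} (line {exc.lineno}, column {exc.colno})"
    if not isinstance(schema, dict):
        return False, "schema must be a JSON object"
    return True, "ok"


print(check_url("https://example.com/blog/*"))   # scheme and host present
print(check_url("example.com/page"))             # missing scheme
print(check_schema('{"type": "object"'))         # missing closing brace
print(check_schema('{"type": "object"}'))
```

Catching a malformed schema locally gives a line and column for the syntax error, which is usually easier to act on than a failed extraction.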
