Firecrawl

Get data from Firecrawl API

Overview

The Firecrawl node extracts structured data from web pages through the Firecrawl API. It is designed for scraping and parsing content from one or more URLs, optionally guided by a prompt and a JSON schema that shape the extracted data. It suits automated data collection from websites, such as gathering product details, news articles, or any other web-based information that a schema can describe.

Practical examples include:

  • Extracting product specifications from e-commerce sites using URL patterns.
  • Collecting news headlines and summaries from multiple news portals.
  • Scraping job listings with specific fields like title, location, and salary.

Properties

Name: Meaning
URLs: List of URLs (glob patterns supported) to extract data from. Multiple URLs can be specified.
Prompt: A text prompt that guides the extraction and tailors it to specific needs.
Schema: JSON schema defining the structure of the extracted data, specifying expected properties and their types.
Ignore Sitemap: Whether to ignore the website's sitemap during crawling (default: true).
Include Subdomains: Whether to include subdomains of the target website in the crawl (default: false).
Enable Web Search: Whether to use web search to find additional data beyond the provided URLs (default: false).
Show Sources: Whether to include the sources used for data extraction in the output (default: false).
Scrape Options: Collection of options controlling how content is scraped, including:
- Formats: Output formats such as HTML, JSON, Markdown, links, raw HTML, screenshot.
- Only Main Content: Return only the main page content.
- Include Tags: HTML tags to include.
- Exclude Tags: HTML tags to exclude.
- Headers: Custom headers to send with requests.
- Wait For (Ms): Delay in milliseconds before fetching content.
- Mobile: Emulate a mobile device when scraping.
- Skip TLS Verification: Skip TLS (SSL) certificate verification.
- Timeout (Ms): Request timeout in milliseconds.
- Actions: List of actions (click, press, scroll, wait, write, screenshot) to interact with dynamic content before scraping.
- Location: Country code and preferred languages for the request.
- Remove Base64 Images: Remove base64-encoded images from the output.
- Block Ads: Enable ad blocking and cookie popup blocking.
- Proxy: Type of proxy to use (Basic or Stealth).
Use Custom Body: Whether to use a custom request body instead of the standard parameters (default: false).
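
To make the properties above more concrete, here is a minimal Python sketch of how they might be combined into a single request body, for example when "Use Custom Body" is enabled. It is an illustration, not the node's actual implementation: the helper build_extract_body is hypothetical, and the camelCase key names (ignoreSitemap, scrapeOptions, onlyMainContent, and so on) are assumptions modelled on the property labels. The example schema reuses the job listing fields mentioned in the Overview.

```python
from __future__ import annotations

import json
from typing import Any, Optional


def build_extract_body(
    urls: list[str],
    prompt: Optional[str] = None,
    schema: Optional[dict[str, Any]] = None,
    ignore_sitemap: bool = True,       # documented default: true
    include_subdomains: bool = False,  # documented default: false
    enable_web_search: bool = False,   # documented default: false
    show_sources: bool = False,        # documented default: false
    scrape_options: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Assemble a request body from the node properties.

    The key names are assumptions, not confirmed by this page.
    """
    if not urls:
        raise ValueError("at least one URL is required")
    body: dict[str, Any] = {
        "urls": urls,
        "ignoreSitemap": ignore_sitemap,
        "includeSubdomains": include_subdomains,
        "enableWebSearch": enable_web_search,
        "showSources": show_sources,
    }
    # Optional properties are only sent when they are set.
    if prompt:
        body["prompt"] = prompt
    if schema is not None:
        body["schema"] = schema
    if scrape_options:
        body["scrapeOptions"] = scrape_options
    return body


# Example schema for job listings (title, location, salary).
job_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "location": {"type": "string"},
        "salary": {"type": "string"},
    },
    "required": ["title"],
}

body = build_extract_body(
    urls=["https://example.com/jobs/*"],
    prompt="Extract the job title, location, and salary.",
    schema=job_schema,
    scrape_options={"onlyMainContent": True, "timeout": 30000},  # assumed keys
)
print(json.dumps(body, indent=2))
```

Leaving unset optional properties out of the body keeps the payload small and lets the API apply its own defaults for them.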

Output

The node outputs JSON data containing the extracted content, structured according to the provided schema and prompt. The output covers the specified URLs and, when "Enable Web Search" is on, any additional sources that were discovered. If "Show Sources" is enabled, the output may also include metadata about the sources used for extraction.

If the "Formats" option includes screenshot, the node can output binary data representing screenshots of the pages.

Dependencies

  • Requires an API key credential for the Firecrawl API.
  • The node communicates with the Firecrawl API endpoint (default: https://api.firecrawl.dev/v1).
  • No additional external dependencies are required within n8n, but proper network access to the API and target URLs is necessary.
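
As a rough sketch of what the dependencies above imply outside n8n, the Python snippet below builds (but does not send) an authenticated request against the default endpoint. The /extract path, the Bearer authorization scheme, the helper name build_request, and the placeholder key are assumptions for illustration; this page only states the base URL and that an API key is required.

```python
import json
import urllib.request

DEFAULT_BASE_URL = "https://api.firecrawl.dev/v1"  # default endpoint named above


def build_request(api_key, body, base_url=DEFAULT_BASE_URL, path="/extract"):
    """Build a POST request carrying the API key; nothing is sent here.

    The path and the Bearer scheme are assumptions, not taken from this page.
    """
    if not api_key:
        raise ValueError("an API key is required")
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder key and URL; replace with real values before sending.
request = build_request("fc-YOUR-API-KEY", {"urls": ["https://example.com/*"]})
print(request.get_method(), request.full_url)
```

Passing the result to urllib.request.urlopen would perform the call, which requires network access to the API as noted above.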

Troubleshooting

  • Timeouts: If requests time out, consider increasing the "Timeout (Ms)" property.
  • Invalid URLs: Ensure URLs are correctly formatted and accessible; glob patterns should be valid.
  • Schema Errors: Incorrect JSON schema definitions may cause extraction failures; validate the schema JSON.
  • TLS Issues: If you encounter SSL errors, enabling "Skip TLS Verification" may help, but use it cautiously because it disables certificate validation.
  • Dynamic Content Not Loaded: Use "Actions" to interact with dynamic page elements (e.g., clicking buttons or waiting) before scraping.
  • API Authentication Errors: Verify that the API key credential is correctly configured and has necessary permissions.
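
Some of the problems above (invalid URLs, malformed schema JSON) can be caught before the node runs. The following Python sketch shows one possible pre-flight check using only the standard library. The helper names check_schema and check_url are hypothetical, and the checks are deliberately shallow: they do not reproduce whatever validation the node or the Firecrawl API performs.

```python
import json
from urllib.parse import urlsplit


def check_schema(schema_text):
    """Return a list of problems found in a JSON schema given as text."""
    try:
        schema = json.loads(schema_text)
    except json.JSONDecodeError as exc:
        return [f"schema is not valid JSON: {exc.msg} "
                f"(line {exc.lineno}, column {exc.colno})"]
    if not isinstance(schema, dict):
        return ["schema must be a JSON object"]
    problems = []
    # An object schema without a "properties" mapping gives the extractor
    # nothing to fill in.
    if schema.get("type") == "object" and not isinstance(schema.get("properties"), dict):
        problems.append('object schema has no "properties" mapping')
    return problems


def check_url(url):
    """Return a list of problems found in a URL or glob pattern."""
    parts = urlsplit(url)
    problems = []
    if parts.scheme not in ("http", "https"):
        problems.append(f"{url!r}: scheme must be http or https")
    if not parts.netloc:
        problems.append(f"{url!r}: missing host name")
    return problems


print(check_schema('{"type": "object", "properties": {"title": {"type": "string"}}}'))
print(check_schema('{"type": "object",}'))
print(check_url("https://example.com/products/*"))
print(check_url("example.com/products/*"))
```

An empty list means no problem was detected by these checks; it does not guarantee that the extraction will succeed.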
