Firecrawl

Get data from Firecrawl API

Actions: 6

Overview

The Firecrawl node extracts structured data from web pages through the Firecrawl API. Use it to scrape and parse content from one or more URLs, optionally guided by a prompt and a JSON schema that shape the extracted data. Typical uses are automated collection of product details, news articles, or any other web content that can be accessed programmatically.

Practical examples include:

Extracting product information from e-commerce sites using URL patterns.
Collecting news headlines and summaries from multiple news portals.
Scraping blog posts or documentation pages with specific content structures.
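
As a rough illustration of what a URL pattern selects, the sketch below applies a glob to a few made-up URLs with Python's standard fnmatch module. This only approximates glob semantics: in fnmatch, `*` also matches `/`, and the matching rules used by the Firecrawl service may differ. The example.com URLs are placeholders.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern and candidate URLs, for illustration only.
pattern = "https://example.com/products/*"
candidates = [
    "https://example.com/products/123",
    "https://example.com/products/shoes/42",
    "https://example.com/blog/post-1",
]

# fnmatchcase does a case-sensitive glob match; "*" matches any run of
# characters, including "/", so nested product paths match as well.
matched = [url for url in candidates if fnmatchcase(url, pattern)]

for url in matched:
    print(url)
```

Checking a pattern against a handful of known URLs in this way is a cheap test that the glob is not broader or narrower than intended before running the node.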

Properties

Name	Meaning
URLs	The list of URLs (supports glob format) from which to extract data. Multiple URLs can be specified.
Prompt	A textual prompt to guide the extraction process, helping tailor the data extraction to specific needs.
Schema	JSON schema defining the expected structure of the extracted data, allowing precise control over output formatting.
Ignore Sitemap	Whether to ignore the website's sitemap during crawling (default: true).
Include Subdomains	Whether to include subdomains of the target website in the crawl (default: false).
Enable Web Search	Enables web search to find additional relevant data beyond the provided URLs (default: false).
Show Sources	Whether to display the sources used for data extraction alongside the results (default: false).
Scrape Options	Collection of options controlling how content is scraped, including: • Formats: Output formats like HTML, JSON, Markdown, links, raw HTML, screenshot. • Only Main Content: Extract only main page content. • Include/Exclude Tags: Specify HTML tags to include or exclude. • Headers: Custom HTTP headers. • Wait For (Ms): Delay before scraping. • Mobile: Emulate mobile device. • Skip TLS Verification: Ignore TLS errors. • Timeout (Ms): Request timeout. • Actions: List of interactions (click, press, scroll, wait, write, screenshot) to perform on the page before scraping. • Location: Country and preferred languages for the request. • Remove Base64 Images: Remove base64 encoded images from output. • Block Ads: Enable ad and cookie popup blocking. • Proxy: Type of proxy to use (Basic or Stealth).
Use Custom Body	Whether to send a custom request body instead of the standard parameters (default: false).
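
To make the relationship between these properties concrete, here is a minimal sketch of how they might be assembled into a single request body, for example when "Use Custom Body" is enabled. The camelCase key names (ignoreSitemap, scrapeOptions, and so on), the shape of the action entries, and the helper build_extract_body are assumptions modeled on the property names above, not a confirmed description of what the node sends; check the Firecrawl API documentation before relying on them.

```python
import json

def build_extract_body(urls, prompt=None, schema=None, *, ignore_sitemap=True,
                       include_subdomains=False, enable_web_search=False,
                       show_sources=False, scrape_options=None):
    """Assemble a request body from the node's properties.

    Key names are assumed camelCase forms of the property names.
    Defaults mirror the defaults listed in the Properties table.
    """
    if not urls:
        raise ValueError("at least one URL is required")
    body = {
        "urls": list(urls),
        "ignoreSitemap": ignore_sitemap,
        "includeSubdomains": include_subdomains,
        "enableWebSearch": enable_web_search,
        "showSources": show_sources,
    }
    # Optional properties are only sent when they are set.
    if prompt:
        body["prompt"] = prompt
    if schema:
        body["schema"] = schema
    if scrape_options:
        body["scrapeOptions"] = scrape_options
    return body

# Hypothetical JSON schema for a product page.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["name"],
}

body = build_extract_body(
    ["https://example.com/products/*"],
    prompt="Extract the product name and price.",
    schema=product_schema,
    scrape_options={
        "onlyMainContent": True,
        "timeout": 30000,  # milliseconds
        # Assumed action shapes: wait two seconds, then click a button.
        "actions": [
            {"type": "wait", "milliseconds": 2000},
            {"type": "click", "selector": "#load-more"},
        ],
    },
)
print(json.dumps(body, indent=2))
```

Leaving unset optional properties out of the body, rather than sending empty values, keeps the request close to what the standard parameters would produce.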

Output

The node outputs JSON data representing the extracted content, shaped by the specified schema and prompt. The structure of the json output field matches the user-defined schema: each property appears with the type the schema declares for it, such as string or number.

Depending on the options enabled, the output may also include:

Source information detailing where each piece of data was extracted from (when "Show Sources" is enabled).
Additional content formats such as HTML, Markdown, JSON, or screenshots, according to the formats selected in the scrape options.

When screenshots are requested, the node can also produce binary output containing the image data of the captured page view.
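
The following sketch shows one way a downstream step might separate the structured data from a screenshot. It assumes an item with a json field plus a binary field whose entries carry base64-encoded data and a mimeType; the "screenshot" key, the field values, and the helper split_item are hypothetical and only illustrate the idea.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Hypothetical output item. A real screenshot would hold a full image;
# here only the PNG signature bytes are encoded, to keep the example small.
item = {
    "json": {"name": "Example Widget", "price": 19.99},
    "binary": {
        "screenshot": {
            "mimeType": "image/png",
            "fileName": "page.png",
            "data": base64.b64encode(PNG_SIGNATURE).decode("ascii"),
        }
    },
}

def split_item(item):
    """Return (extracted_data, images) from one output item."""
    extracted = item.get("json", {})
    images = {}
    for key, entry in item.get("binary", {}).items():
        # Keep only image entries and decode them back to raw bytes.
        if entry.get("mimeType", "").startswith("image/"):
            images[key] = base64.b64decode(entry["data"])
    return extracted, images

extracted, images = split_item(item)
print(extracted["name"], len(images["screenshot"]))
```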

Dependencies

An active connection to the Firecrawl API, authenticated with an API key credential.
Network access to target URLs for crawling and scraping.
Optional proxy configuration for requests.
Properly configured n8n credentials for authentication with the Firecrawl service.
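
For orientation, the sketch below builds (but does not send) an authenticated request using only the Python standard library. The endpoint URL, the Bearer token scheme, and the placeholder key are assumptions for illustration; inside n8n the credential handles this step, and the Firecrawl API documentation is the authority on the actual endpoint and header format.

```python
import json
import urllib.request

# Assumed endpoint; verify against the Firecrawl API documentation.
API_URL = "https://api.firecrawl.dev/v1/extract"

def make_request(api_key, body):
    """Build a POST request carrying the API key; nothing is sent here."""
    if not api_key:
        raise ValueError("missing API key")
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
    )
    # Assumed scheme: the API key is passed as a Bearer token.
    req.add_header("Authorization", f"Bearer {api_key}")
    req.add_header("Content-Type", "application/json")
    return req

# Placeholder key, not a real credential.
req = make_request("fc-EXAMPLE-KEY", {"urls": ["https://example.com"]})
print(req.get_method(), req.full_url)
```

Sending the request would be a call to urllib.request.urlopen(req), which requires network access and a valid key.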

Troubleshooting

Timeouts: If requests time out, consider increasing the "Timeout (Ms)" property or checking network connectivity.
Invalid URLs or Glob Patterns: Ensure URLs are correctly formatted and valid; incorrect globs may result in no data being extracted.
Schema Parsing Errors: The schema must be valid JSON; malformed JSON will cause errors.
TLS Verification Failures: If connecting to sites with self-signed certificates, enable "Skip TLS Verification".
Insufficient Permissions: Verify that the API key credential has proper permissions and is not expired.
Dynamic Content Not Loading: Use the "Actions" property to interact with dynamic elements (e.g., clicking buttons or waiting) before scraping.
Ad Blocking Issues: If content is blocked or missing, try toggling the "Block Ads" option.
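
Two of the problems above, malformed URLs and invalid schema JSON, can be caught before the node runs. The sketch below is one possible pre-flight check using only the Python standard library; the helper names check_url, check_schema, and preflight are hypothetical, and the checks are deliberately basic (they do not validate glob syntax or JSON Schema semantics).

```python
import json
from urllib.parse import urlsplit

def check_url(url):
    """Return a problem description for a URL, or None if it looks usable."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return f"{url!r}: scheme must be http or https"
    if not parts.netloc:
        return f"{url!r}: missing host"
    return None

def check_schema(text):
    """Return a problem description for schema text, or None if it parses."""
    try:
        schema = json.loads(text)
    except json.JSONDecodeError as exc:
        return (f"schema is not valid JSON: {exc.msg} "
                f"(line {exc.lineno}, column {exc.colno})")
    if not isinstance(schema, dict):
        return "schema must be a JSON object"
    return None

def preflight(urls, schema_text):
    """Collect every problem found in the URLs and the schema text."""
    problems = [p for p in map(check_url, urls) if p]
    schema_problem = check_schema(schema_text)
    if schema_problem:
        problems.append(schema_problem)
    return problems

# A URL without a scheme and a truncated schema both get reported.
for problem in preflight(["example.com/products/*"], '{"type": '):
    print(problem)
```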

Links and References

Firecrawl API Documentation (for detailed API capabilities)
MDN Web Docs - Accept-Language Header (for language settings)
Glob Pattern Syntax (to understand URL pattern matching)