Firecrawl

Interact with the Firecrawl API

Overview

This node interacts with the Firecrawl API to scrape web pages. It allows users to extract various formats of data from a given URL, such as Markdown, HTML, raw HTML, content text, links, screenshots, and JSON representations. The node supports options to focus on the main content of the page, include or exclude specific HTML tags, classes, or IDs, and leverage caching to speed up repeated requests.

Common scenarios where this node is beneficial include:

  • Extracting clean article content from news websites while ignoring navigation menus and footers.
  • Collecting all links from a webpage for link analysis or crawling.
  • Capturing screenshots of webpages for visual monitoring or archiving.
  • Obtaining structured JSON data extracted from a page for further processing.

Practical example: A user wants to scrape a blog post URL to get its main content in Markdown format, excluding ads and sidebars, and cache the result for faster future access.
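As a rough sketch of how the practical example might map onto request options, the helper below turns node-style settings into an options object. The names `NodeSettings`, `ScrapeOptions`, `parseTagList`, and `buildScrapeOptions`, as well as the option keys (`formats`, `onlyMainContent`, `includeTags`, `excludeTags`), are assumptions made for illustration; they are not taken from the node's source and should be checked against the Firecrawl SDK version in use.

```typescript
// Hypothetical shape of the node settings; field names are illustrative only.
interface NodeSettings {
  url: string;
  formats: string[];
  onlyMainContent: boolean;
  includeTags: string; // comma-separated, as entered in the node UI
  excludeTags: string; // comma-separated, as entered in the node UI
}

// Assumed shape of the options sent alongside the URL.
interface ScrapeOptions {
  formats: string[];
  onlyMainContent: boolean;
  includeTags?: string[];
  excludeTags?: string[];
}

// Split a comma-separated field into trimmed, non-empty entries.
function parseTagList(raw: string): string[] {
  return raw
    .split(",")
    .map((tag) => tag.trim())
    .filter((tag) => tag.length > 0);
}

// Build the options object; the URL itself would be passed separately.
function buildScrapeOptions(settings: NodeSettings): ScrapeOptions {
  const options: ScrapeOptions = {
    formats: settings.formats,
    onlyMainContent: settings.onlyMainContent,
  };
  const include = parseTagList(settings.includeTags);
  const exclude = parseTagList(settings.excludeTags);
  // Only set the tag filters when the user actually supplied some.
  if (include.length > 0) options.includeTags = include;
  if (exclude.length > 0) options.excludeTags = exclude;
  return options;
}

// The blog post scenario: Markdown only, main content, ads and sidebars removed.
const blogPostOptions = buildScrapeOptions({
  url: "https://example.com/blog/post",
  formats: ["markdown"],
  onlyMainContent: true,
  includeTags: "",
  excludeTags: ".ad, #sidebar, aside",
});
```

Leaving empty tag lists out of the options keeps the request minimal, so the service's defaults apply when the user specifies nothing.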

Properties

Name | Meaning
Url | The URL of the webpage to scrape.
Formats | The output formats to retrieve from the scraped page. Options include: Markdown, HTML, Raw HTML, Content (text), Links, Screenshot, Full Page Screenshot, Extracted data, and JSON. Multiple formats can be selected.
Only Main Content | If enabled, only the main content of the page is returned, excluding headers, navigation bars, footers, and other peripheral elements.
Include Tags | A list of tags, classes, or IDs to specifically include in the final output. This filters the content to only those elements specified. Accepts multiple comma-separated values.
Exclude Tags | A list of tags, classes, or IDs to remove from the page before output. Accepts multiple comma-separated values.
Cache | Selects whether to use caching to speed up scraping. Options are None (no cache) or Postgres (cache results in a PostgreSQL database).
Cache TTL | Time-to-live for cached entries in seconds when using Postgres cache. A value of -1 means no expiration (cache indefinitely).
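
The Cache TTL semantics can be sketched as follows. Keyv expects its TTL in milliseconds, while this node's setting is in seconds, so a conversion is presumably needed somewhere. The helper names `toKeyvTtl` and `cacheKey` and the key layout are hypothetical; the node's real implementation may differ.

```typescript
// Convert the node's TTL (seconds, -1 = never expire) to a millisecond TTL.
// Returning undefined stands for "store without expiration".
function toKeyvTtl(ttlSeconds: number): number | undefined {
  if (ttlSeconds < 0) return undefined;
  return ttlSeconds * 1000;
}

// Hypothetical cache key: the same URL scraped with different options
// should not collide, and option order should not matter.
function cacheKey(url: string, options: Record<string, unknown>): string {
  const sortedKeys = Object.keys(options).sort();
  const normalized = sortedKeys.map((key) => [key, options[key]]);
  return `scrape:${url}:${JSON.stringify(normalized)}`;
}

const oneHour = toKeyvTtl(3600);
const forever = toKeyvTtl(-1);
const keyA = cacheKey("https://example.com", { formats: ["markdown"], onlyMainContent: true });
const keyB = cacheKey("https://example.com", { onlyMainContent: true, formats: ["markdown"] });
```

With Keyv, the converted value would be passed as the optional third argument of `set(key, value, ttl)`.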

Output

The node outputs a JSON object containing the scraped data according to the requested formats. The structure depends on the selected formats but generally includes fields such as:

  • markdown: The page content converted to Markdown.
  • html: The cleaned HTML content.
  • rawHtml: The original raw HTML of the page.
  • content: Plain text content extracted from the page.
  • links: An array of URLs found on the page.
  • screenshot: A screenshot image of the page encoded in a suitable format (e.g., base64).
  • screenshot@fullPage: A full-page screenshot image.
  • extract: Extracted structured data from the page.
  • json: JSON representation of the scraped data.

If screenshots are requested, binary data representing the images is included in the output.
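
For downstream nodes it can help to treat the output as an object whose fields are all optional. The sketch below is an assumed typing based on the field list above, not a type exported by the node; `ScrapeOutput` and `presentFormats` are illustrative names, and the value types for `extract` and `json` are left as `unknown` because the source does not specify them.

```typescript
// Assumed output shape; a field is only present when the matching
// format was requested.
interface ScrapeOutput {
  markdown?: string;
  html?: string;
  rawHtml?: string;
  content?: string;
  links?: string[];
  screenshot?: string;
  "screenshot@fullPage"?: string; // needs bracket access: output["screenshot@fullPage"]
  extract?: unknown;
  json?: unknown;
}

// List the formats that are actually present in a given result.
function presentFormats(output: ScrapeOutput): string[] {
  return Object.entries(output)
    .filter(([, value]) => value !== undefined && value !== null)
    .map(([name]) => name);
}

const sample: ScrapeOutput = {
  markdown: "# Title",
  links: ["https://example.com/a"],
};
const available = presentFormats(sample);
```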

Dependencies

  • Requires an API key credential for the Firecrawl service to authenticate API calls.
  • Optionally requires PostgreSQL credentials if caching with Postgres is enabled.
  • Uses external libraries:
    • Firecrawl JavaScript SDK for interacting with the Firecrawl API.
    • Keyv with Postgres adapter for caching.
  • The n8n node must be configured with these credentials.

Troubleshooting

  • Common issues:

    • Invalid or missing API key credential will cause authentication failures.
    • Incorrect PostgreSQL credentials or connectivity issues will prevent caching.
    • Requesting unsupported operations will throw errors.
    • Network issues may cause timeouts or failed API calls.
  • Error messages:

    • "The operation "scrapeUrl" failed: <error>" indicates the Firecrawl API returned an error during scraping. Check the URL validity and API quota.
    • "The operation "<operation>" is not implemented." means an unsupported operation was requested; verify the operation parameter.
  • Resolutions:

    • Ensure valid and active API keys are configured.
    • Verify PostgreSQL connection details if caching is used.
    • Confirm the URL is accessible and correctly formatted.
    • Use supported operation names only.
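
In a workflow's error branch, the two messages quoted above can be told apart by their fixed wording. This is a minimal sketch, assuming the messages keep exactly the format shown in this section; `suggestResolution` is a hypothetical helper and not part of the node.

```typescript
// Map an error message from the node to a suggested next step.
function suggestResolution(message: string): string {
  // Unsupported operation: the operation parameter itself is wrong.
  if (/^The operation ".*" is not implemented\.$/.test(message)) {
    return "Use a supported operation name.";
  }
  // The scrape call failed: the cause follows the colon.
  if (message.startsWith('The operation "scrapeUrl" failed: ')) {
    return "Check the URL, the API key, and the API quota.";
  }
  // Anything else is most likely connectivity or credentials.
  return "Check network connectivity and credentials.";
}

const unsupported = suggestResolution('The operation "crawl" is not implemented.');
const failed = suggestResolution('The operation "scrapeUrl" failed: Unauthorized');
const other = suggestResolution("connect ETIMEDOUT");
```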
