Firecrawl

Get data from Firecrawl API

Overview

The Firecrawl node calls the Firecrawl API to crawl websites and extract their content. It suits web scraping tasks that gather content from many pages while applying path filters and scrape options, such as collecting product information from e-commerce sites, aggregating blog posts, or extracting data for research. For example, the node can crawl a news website and return only the articles in one section while excluding advertisement pages and other irrelevant paths.

Properties

Name | Meaning
Url | The URL of the website to crawl (default: http://localhost:3002).
Exclude Paths | Regex patterns for URL paths to exclude from the crawl (e.g., blog/* excludes /blog/article-1).
Include Paths | Regex patterns for URL paths to include in the crawl; only matching paths are crawled (e.g., blog/* keeps /blog/article-1).
Max Depth | Maximum depth to crawl relative to the entered URL (default: 2).
Limit | Maximum number of results to return (default: 50).
Crawl Options | Options that affect crawling behavior, such as ignoring sitemaps or allowing external links.
Scrape Options | Options for scraping content during the crawl, including output formats and tag inclusion/exclusion.
Additional Fields | Custom fields to send in the request body, including custom JSON properties.
Use Custom Body | A flag indicating whether to use a custom body for the request.
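
As a rough illustration of how the properties above might map onto a crawl request body, the sketch below assembles a JSON payload in Python. The `build_crawl_body` helper is hypothetical, and the field names (`includePaths`, `excludePaths`, `maxDepth`, `limit`, `scrapeOptions`) as well as the treatment of Use Custom Body as "send the supplied JSON instead of the form fields" are assumptions; verify them against the Firecrawl API version in use.

```python
from __future__ import annotations

import json
from typing import Any, Optional


def build_crawl_body(
    url: str,
    include_paths: Optional[list[str]] = None,
    exclude_paths: Optional[list[str]] = None,
    max_depth: int = 2,   # default taken from the Max Depth property
    limit: int = 50,      # default taken from the Limit property
    scrape_options: Optional[dict[str, Any]] = None,
    custom_body: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Assemble a crawl request body (hypothetical helper; field names are assumed)."""
    # Assumed meaning of "Use Custom Body": send the caller's JSON unchanged.
    if custom_body is not None:
        return dict(custom_body)

    body: dict[str, Any] = {"url": url, "maxDepth": max_depth, "limit": limit}
    # Only include optional filters when they are set.
    if include_paths:
        body["includePaths"] = list(include_paths)
    if exclude_paths:
        body["excludePaths"] = list(exclude_paths)
    if scrape_options:
        body["scrapeOptions"] = dict(scrape_options)
    return body


body = build_crawl_body(
    "https://example.com",
    include_paths=["blog/.*"],
    exclude_paths=["blog/drafts/.*"],
    scrape_options={"formats": ["markdown"]},
)
print(json.dumps(body, indent=2))
```

Leaving unset filters out of the payload keeps the request small and avoids sending empty lists whose server-side meaning is not documented here.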

Output

The node outputs the data scraped from the specified URL, formatted according to the selected scrape options, for example as HTML content or JSON objects. Any binary output would correspond to images or files extracted during the crawl.

Dependencies

  • Firecrawl API: An API key credential is required to authenticate requests.
  • Base URL Configuration: Users can set a base URL for the API, defaulting to http://localhost:3002/v1.
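
To make the two dependencies concrete, the following sketch builds (but does not send) an authenticated request using only the Python standard library. The `make_crawl_request` helper is hypothetical, and both the `/crawl` path and the `Bearer` authorization scheme are assumptions rather than details stated on this page; `YOUR_API_KEY` is a placeholder.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3002/v1"  # default base URL listed above


def make_crawl_request(
    api_key: str, body: dict, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a POST request object; the /crawl path and Bearer scheme are assumed."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/crawl",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = make_crawl_request("YOUR_API_KEY", {"url": "https://example.com", "limit": 50})
print(req.get_method(), req.full_url)
```

The request could then be sent with `urllib.request.urlopen(req, timeout=30)`; it is left unsent here so the sketch runs without a Firecrawl instance.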

Troubleshooting

  • Common Issues:

    • Invalid URL: Ensure that the provided URL is correctly formatted and accessible.
    • Authentication Errors: Verify that the API key is valid and has the necessary permissions.
    • Timeouts: Adjust the timeout settings if requests are taking too long to respond.
  • Error Messages:

    • "Failed to connect": Indicates issues with the network or incorrect URL. Check connectivity and URL validity.
    • "Unauthorized": Suggests problems with API key authentication. Confirm that the correct credentials are being used.
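
Some of the checks above can be run before a workflow is executed. The sketch below validates the URL format and maps a few HTTP status codes to the hints in this section. Both helpers (`check_url`, `hint_for_status`) and the status-code mapping are illustrative assumptions, not part of the node.

```python
from urllib.parse import urlparse


def check_url(url: str) -> bool:
    """Return True if the URL has an http(s) scheme and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def hint_for_status(status: int) -> str:
    """Map an HTTP status code to a troubleshooting hint (illustrative mapping)."""
    if status == 401:
        return "Unauthorized: confirm that the correct API key is being used."
    if status in (408, 504):
        return "Timeout: adjust the timeout settings."
    return "Unexpected status: check connectivity and URL validity."


print(check_url("https://example.com/news"))
print(check_url("example.com/news"))  # no scheme, so the check fails
print(hint_for_status(401))
```

A format check of this kind only catches malformed URLs; whether the site is reachable still has to be confirmed by an actual request.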
