
FireCrawl

FireCrawl API

Overview

The Batch Scrape operation of the FireCrawl node allows users to scrape multiple web pages in a single batch request. It is designed to fetch and extract various types of data from a list of URLs, supporting different output formats such as extracted structured data, screenshots, HTML content, links, and markdown. This operation is useful for scenarios like competitive analysis, content aggregation, SEO monitoring, or any task requiring automated collection of web page data at scale.

For example, you can provide a list of product page URLs to scrape their main content and images, or capture full-page screenshots of news articles for archival purposes. The node supports advanced options such as waiting for the page to load, emulating mobile devices, filtering HTML tags, and handling invalid URLs gracefully.
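As a rough sketch of what a batch request looks like, the helper below assembles a request body from node-style parameters. The field names (urls, formats, webhook, ignoreInvalidURLs) follow FireCrawl's public API conventions but should be treated as assumptions, not node internals:

```python
def build_batch_scrape_body(urls, formats=("markdown",), webhook_url=None,
                            ignore_invalid_urls=False):
    """Assemble a batch-scrape request body from node-style parameters.

    Key names are assumptions based on FireCrawl's public API; verify
    against the actual node before relying on them.
    """
    body = {
        "urls": list(urls),
        "formats": list(formats),
    }
    if webhook_url:
        body["webhook"] = webhook_url
    if ignore_invalid_urls:
        body["ignoreInvalidURLs"] = True
    return body

# Two product pages, scraped as markdown plus a viewport screenshot.
body = build_batch_scrape_body(
    ["https://example.com/product-1", "https://example.com/product-2"],
    formats=["markdown", "screenshot"],
)
```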

Properties

  • URLs: The list of URLs to scrape. Multiple URLs can be provided to perform batch scraping.
  • Webhook URL: A URL that receives webhook notifications about the progress of the batch scrape job. Useful for asynchronous processing and tracking.
  • Formats: Output format(s) for the scraped data. Options include: Extract (structured data), Full Page Screenshot, HTML, Links, Markdown, Raw HTML, and Screenshot (viewport only). Multiple formats can be selected simultaneously.
  • Additional Options: Collection of optional settings:
    - Only Main Content: Return only the main content, excluding headers, navigation bars, footers, etc.
    - Include Tags: Comma-separated list of HTML tags to include.
    - Exclude Tags: Comma-separated list of HTML tags to exclude.
    - Headers: Custom HTTP headers to send with requests (e.g., cookies, user-agent).
    - Wait for (MS): Delay in milliseconds before fetching content, to allow the page to load.
    - Mobile: Emulate a mobile device when scraping.
    - Skip TLS Verification: Ignore TLS certificate errors.
    - Timeout (MS): Request timeout duration in milliseconds.
    - Remove Base64 Images: Remove base64-encoded images from the output.
    - Ignore Invalid URLs: Continue scraping valid URLs when some are invalid, instead of failing the entire batch.
  • Extract: When the "Extract" format is selected, this fixed collection specifies extraction details:
    - Schema: JSON schema for structured data extraction.
    - System Prompt: System prompt guiding the extraction.
    - Prompt: Extraction prompt used when no schema is provided.
  • Use Custom Body: Boolean flag indicating whether to send a custom request body instead of the standard parameters.
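To make the Extract properties concrete, here is an illustrative set of extraction options for pulling a product's name and price from each page. The schema fields and the systemPrompt key are examples chosen for this sketch, not values required by the node:

```python
import json

# Illustrative Extract options: a JSON Schema describing the structured
# data to pull from each scraped page, plus a guiding system prompt.
extract_options = {
    "schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "string"},
        },
        "required": ["name"],
    },
    "systemPrompt": "You extract product data from e-commerce pages.",
}

# The node would serialize options like these into the request body.
schema_json = json.dumps(extract_options["schema"])
```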

Output

The node outputs JSON data containing the results of the batch scrape operation. Each item corresponds to one URL and includes fields depending on the requested formats:

  • Extract: Structured data extracted according to the provided schema or prompts.
  • Full Page Screenshot / Screenshot: Image data representing the screenshot of the page (likely as a binary or base64 string).
  • HTML / Raw HTML: The HTML content of the page, either cleaned or raw.
  • Links: List of hyperlinks found on the page.
  • Markdown: Page content converted into markdown format.
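The per-format fields above can be routed with a small helper. The sample response shape here (one object per URL, one key per requested format) is an assumption inferred from the format list, not a documented contract:

```python
# Assumed response shape: one object per scraped URL, keyed by format.
sample_results = [
    {"markdown": "# Page 1", "links": ["https://example.com/a"]},
    {"markdown": "# Page 2", "links": []},
]

def collect_format(results, fmt):
    """Gather one format's values across all scraped pages.

    Returns None for pages that did not produce that format.
    """
    return [item.get(fmt) for item in results]

all_markdown = collect_format(sample_results, "markdown")
```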

If binary data such as screenshots is included, it will be available in the binary output field, typically encoded appropriately for further processing or saving.
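Assuming screenshots arrive as base64-encoded strings (as noted above), they can be decoded to raw bytes for the binary output field. A minimal sketch:

```python
import base64

def screenshot_to_bytes(b64_data: str) -> bytes:
    """Decode a base64-encoded screenshot into raw image bytes."""
    return base64.b64decode(b64_data)

# Tiny stand-in payload; a real screenshot would be PNG or JPEG data.
encoded = base64.b64encode(b"fake-png-bytes").decode("ascii")
raw = screenshot_to_bytes(encoded)
```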

Dependencies

  • Requires an active connection to the FireCrawl API service.
  • Needs an API authentication token configured in the node credentials.
  • Network access to target URLs must be allowed.
  • Optional webhook URL endpoint should be accessible if webhook notifications are used.
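If webhook notifications are used, the receiving endpoint needs to classify incoming progress events. The event field names below ("type", "id") are assumptions for this sketch; check the actual payload FireCrawl sends to your endpoint:

```python
def handle_webhook_event(event: dict) -> str:
    """Classify an incoming batch-scrape progress notification.

    Event shape is assumed; adapt to the real webhook payload.
    """
    kind = event.get("type", "")
    if kind.endswith("completed"):
        return "completed"
    if kind.endswith("failed"):
        return "failed"
    return "in-progress"

status = handle_webhook_event({"type": "batch_scrape.completed", "id": "job-123"})
```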

Troubleshooting

  • Invalid URLs: If URLs are malformed or unreachable, the node may fail unless "Ignore Invalid URLs" is enabled.
  • Timeouts: Requests may time out if pages take too long to load; adjust the "Timeout (MS)" or "Wait for (MS)" properties accordingly.
  • TLS Errors: For sites with problematic SSL certificates, enable "Skip TLS Verification" to bypass verification.
  • Empty or Missing Data: Ensure that the correct formats are selected and extraction prompts/schemas are properly defined.
  • Webhook Failures: Verify that the webhook URL is reachable and correctly handles incoming notifications.

Common error messages usually relate to network issues, invalid input URLs, or authentication failures. Checking API credentials and network connectivity often resolves these.
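For transient failures such as timeouts or brief network issues, a generic retry-with-backoff wrapper around the node call is a common pattern. This is not part of the node itself, just a sketch of the approach:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Call fn, retrying on exception with exponential backoff."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # exhausted retries; surface the error
            time.sleep(base_delay * (2 ** i))

# Simulate a request that times out twice, then succeeds.
calls = {"n": 0}
def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("page took too long to load")
    return "ok"

result = with_retries(flaky_request)
```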

This summary is based solely on static analysis of the provided source code and property definitions.
