Overview
This node integrates with the Firecrawl web scraping service to extract content from a specified URL. It is designed to be connected to an AI agent, enabling automated retrieval and processing of web page data for further analysis or use within workflows.
Common scenarios where this node is beneficial include:
- Extracting main article content from news websites while excluding navigation menus and footers.
- Collecting links or raw HTML from a webpage for data aggregation or monitoring.
- Capturing screenshots of webpages for visual records or audits.
- Providing structured JSON or markdown summaries of web content to AI agents for enhanced understanding or decision-making.
For example, a user might configure this node to scrape the main content of a product page in markdown format, then feed that content into an AI model for sentiment analysis or summarization.
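As an illustration, the node's scrape parameters map onto a Firecrawl scrape request roughly like this. This is a minimal sketch: the payload shape follows Firecrawl's scrape API conventions, and the helper name `build_scrape_payload` is our own, not part of the node or the Firecrawl SDK.

```python
def build_scrape_payload(url, formats=("markdown",), only_main_content=True,
                         include_tags=None, exclude_tags=None):
    """Assemble the JSON body the node would send to Firecrawl's scrape endpoint."""
    payload = {
        "url": url,
        "formats": list(formats),           # e.g. ["markdown", "links", "screenshot"]
        "onlyMainContent": only_main_content,
    }
    if include_tags:
        payload["includeTags"] = include_tags  # tags, classes, or IDs to keep
    if exclude_tags:
        payload["excludeTags"] = exclude_tags  # tags, classes, or IDs to drop
    return payload

# The product-page scenario above: markdown only, menus and footers excluded.
body = build_scrape_payload("https://example.com/product/42",
                            formats=["markdown"],
                            exclude_tags=["nav", "footer"])
```

The resulting `body` would then be sent with the Firecrawl API key credential attached; the node handles that request for you.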
Properties
| Name | Meaning |
|---|---|
| This node must be connected to an AI agent. | A notice indicating that this node requires connection to an AI agent node to function properly. |
| Description | A text description explaining to the AI what this tool does; defaults to "Scrapes content from a given URL using Firecrawl". |
| URL | The web address to scrape content from. Supports placeholders for dynamic URLs. |
| Formats | The output formats to return. Options include: Markdown, HTML, Raw HTML, Content (plain text), Links, Screenshot, Full Page Screenshot, Extracted data, and JSON. Multiple can be selected simultaneously. |
| Only Main Content | Boolean flag indicating whether to return only the main content of the page, excluding headers, navigation bars, footers, etc. Defaults to true. |
| Include Tags | Comma-separated list of HTML tags, classes, or IDs to explicitly include in the output. |
| Exclude Tags | Comma-separated list of HTML tags, classes, or IDs to exclude from the output. |
| Cache | Option to enable caching of scraping results to speed up repeated requests. Choices are None or Postgres-based caching. |
| Cache TTL | Time-to-live for cached entries in seconds when using Postgres caching. Default is -1 (no expiration). |
| Placeholder Definitions | Defines named placeholders that can be used in the URL or other parameters. Each placeholder has a name, description, and type (String or URL). Used for dynamic parameter substitution before scraping. |
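The placeholder mechanism can be pictured as a named-substitution pass over the URL before scraping. The helper below is a hypothetical sketch, not the node's actual implementation; it also mirrors the documented behavior of erroring on a defined-but-unused placeholder.

```python
def apply_placeholders(template, values):
    """Replace {name} markers in a URL template with placeholder values."""
    result = template
    for name, value in values.items():
        marker = "{" + name + "}"
        if marker not in result:
            # Mirrors the node's behavior: a placeholder that is defined
            # but never referenced is treated as a configuration error.
            raise ValueError(f"Placeholder '{name}' is defined but never used")
        result = result.replace(marker, value)
    return result

url = apply_placeholders("https://example.com/products/{product_id}",
                         {"product_id": "42"})
```

With this substitution done up front, an AI agent can supply only the placeholder values at runtime while the URL shape stays fixed in the node configuration.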
Output
The node outputs a single field, json, containing the scraped data in the requested formats. The structure depends on the selected formats but generally includes:
- Textual content in markdown, HTML, raw HTML, or plain content form.
- Extracted links as arrays or objects.
- Screenshots as encoded image data (the exact encoding is not detailed here).
- Extracted structured data or JSON representations of the page content.
If multiple formats are selected, the output contains one field per selected format, all serialized together as a single JSON string.
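Conceptually, the multi-format output is one JSON string whose top-level keys mirror the selected formats. The field names below are illustrative; inspect the node's actual output for the exact keys.

```python
import json

def serialize_results(results):
    """Flatten per-format results into the node's single json output field."""
    return json.dumps(results)

# Hypothetical result of a scrape with Markdown and Links both selected:
raw = serialize_results({
    "markdown": "# Product 42\nGreat widget.",
    "links": ["https://example.com/reviews"],
})
parsed = json.loads(raw)  # downstream nodes parse the string back into fields
```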
Dependencies
- Requires an API key credential for the Firecrawl web scraping service.
- Optionally requires credentials for a Postgres database if caching is enabled with Postgres.
- Uses external libraries for caching and Firecrawl API interaction.
- Must be connected to an AI agent node in n8n to operate correctly.
Troubleshooting
- Misconfigured placeholders: If a placeholder is defined but not used in the URL or other parameters, the node throws an error indicating the unused placeholder. To fix, either remove the unused placeholder definition or ensure it is referenced properly.
- Operation not implemented: The node currently supports only the "Scrape Url" operation. Selecting any other operation will cause an error.
- API errors from Firecrawl: If the scraping request fails, the node throws an error with the message returned by Firecrawl. Check API key validity, URL correctness, and network connectivity.
- Caching issues: When using Postgres caching, ensure the database credentials are correct and the database is accessible. Misconfiguration may lead to cache failures or slower performance.
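The Cache TTL semantics described in the Properties table (with -1 meaning no expiration) can be sketched as a simple staleness check. This is an illustrative model using timestamps in seconds; the node's actual Postgres cache logic may differ.

```python
import time

def is_expired(stored_at, ttl_seconds, now=None):
    """Return True if a cached scrape result is stale.

    ttl_seconds == -1 disables expiry entirely (the default).
    """
    if ttl_seconds == -1:
        return False
    now = time.time() if now is None else now
    return (now - stored_at) > ttl_seconds
```

Under this model, a fresh entry is served from Postgres and an expired one triggers a new Firecrawl request, so a misconfigured TTL shows up as either stale data or unnecessary API calls.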