Overview
The Firecrawl node scrapes a web page by fetching a specified URL through the Firecrawl API and extracting its content. Users can customize the scrape by selecting output formats, filtering HTML tags, setting request headers, emulating a mobile device, interacting with dynamic pages before extraction, and more.
This node is beneficial in scenarios such as:
- Extracting main content or specific parts of a webpage for data analysis.
- Collecting links or structured data (JSON) from websites.
- Taking screenshots of webpages programmatically.
- Automating interaction with dynamic content before scraping (e.g., clicking buttons, scrolling).
- Blocking ads and cookie popups during scraping.
Practical examples:
- Scraping blog posts in Markdown format excluding navigation bars and footers.
- Gathering all links from a news website homepage.
- Capturing a screenshot of a product page on an e-commerce site.
- Extracting JSON data embedded in a webpage after interacting with a dropdown menu.
Properties
| Name | Meaning |
|---|---|
| Url | The URL of the webpage to scrape. |
| Scrape Options | A collection of options controlling the scraping behavior: |
| - Formats | Output format(s) for the scraped data. Options include: HTML, JSON, Links, Markdown, Raw HTML, Screenshot. |
| - Only Main Content | Whether to return only the main content of the page, excluding headers, navigation bars, footers, etc. |
| - Include Tags | List of HTML tags to explicitly include in the output. |
| - Exclude Tags | List of HTML tags to exclude from the output. |
| - Headers | Custom HTTP headers to send with the scraping request, specified as key-value pairs. |
| - Wait For (Ms) | Number of milliseconds to wait for the page to load before fetching content. |
| - Mobile | Whether to emulate scraping from a mobile device. |
| - Skip TLS Verification | Whether to skip TLS certificate verification when making requests. |
| - Timeout (Ms) | Timeout duration in milliseconds for the scraping request. |
| - Actions | List of actions to interact with dynamic content before scraping. Actions can be click, press key, take screenshot, scroll, wait, or write text, each with relevant parameters like selector, text, direction, etc. |
| - Location | Settings for geolocation of the request, including country (ISO 3166-1 alpha-2 code) and preferred languages/locales. |
| - Remove Base64 Images | Whether to remove base64 encoded images from the output. |
| - Block Ads | Enables ad-blocking and cookie popup blocking during scraping. |
| - Proxy | Type of proxy to use for the request. Options are Basic or Stealth. |
| Use Custom Body | Whether to use a custom request body instead of the default scraping options. |
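As a rough illustration of how these properties might translate into a request body, the sketch below maps a few of them onto JSON fields, using the "blog post in Markdown without navigation bars and footers" scenario. The field names (`onlyMainContent`, `excludeTags`, `waitFor`, `timeout`, `actions`) and the shape of each action are assumptions about the Firecrawl API, not something this page confirms, and `build_scrape_body` is a hypothetical helper rather than part of the node.

```python
import json


def build_scrape_body(url, formats=("markdown",), only_main_content=True,
                      exclude_tags=(), wait_for_ms=0, timeout_ms=30000,
                      actions=()):
    """Assemble a scrape request body from node-style options.

    All JSON field names here are assumed, not taken from this page.
    """
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")

    body = {
        "url": url,
        "formats": list(formats),
        "onlyMainContent": only_main_content,
        "timeout": timeout_ms,
    }
    # Optional fields are only included when the user set them.
    if exclude_tags:
        body["excludeTags"] = list(exclude_tags)
    if wait_for_ms > 0:
        body["waitFor"] = wait_for_ms
    if actions:
        body["actions"] = list(actions)
    return body


body = build_scrape_body(
    "https://example.com/blog/post",
    formats=("markdown", "links"),
    exclude_tags=("nav", "footer"),
    actions=(
        {"type": "wait", "milliseconds": 1000},      # assumed action shape
        {"type": "click", "selector": "#load-more"},  # assumed action shape
    ),
)
print(json.dumps(body, indent=2))
```

Leaving unset options out of the body, rather than sending empty values, keeps the request close to whatever defaults the service applies.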
Output
The node outputs a JSON object containing the scraped content according to the selected formats and options. The structure varies depending on the chosen output formats but generally includes:
- Extracted content in HTML, Markdown, or raw HTML form.
- JSON data if requested.
- An array of links if the "Links" format is selected.
- Screenshot data if the "Screenshot" format is selected (likely as binary or base64-encoded image data).
When a screenshot is included, it captures either the full page or only the visible viewport, depending on the user's settings.
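Because the output structure depends on the selected formats, downstream steps usually need to check which fields are present. The sketch below shows one way to do that. The response shape (a `success` flag with the content under `data`, keyed by `markdown`, `links`, and `metadata`) is an assumption, and `sample` is made-up data for illustration only.

```python
def summarize_result(response):
    """Return a short summary of which formats a scrape result contains."""
    if not response.get("success", False):
        raise RuntimeError(response.get("error", "scrape failed"))

    data = response.get("data", {})
    # Treat every key except the (assumed) metadata block as a format.
    summary = {"formats": sorted(k for k in data if k != "metadata")}
    if "links" in data:
        summary["link_count"] = len(data["links"])
    if "markdown" in data:
        summary["markdown_chars"] = len(data["markdown"])
    return summary


# Hypothetical response, not real Firecrawl output.
sample = {
    "success": True,
    "data": {
        "markdown": "# Title\n\nBody text.",
        "links": ["https://example.com/a", "https://example.com/b"],
        "metadata": {"title": "Title"},
    },
}
summary = summarize_result(sample)
print(summary)
```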
Dependencies
- Requires access to the Firecrawl API endpoint at https://api.firecrawl.dev/v1.
- Needs an API authentication token credential configured in n8n to authorize requests to the Firecrawl service.
- No other external dependencies are indicated.
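To make the two requirements concrete, the following sketch prepares (but does not send) an authenticated request against the endpoint above using only the Python standard library. The `/scrape` path and the `Bearer` authorization scheme are assumptions, `build_request` is a hypothetical helper, and `fc-YOUR-API-KEY` is a placeholder.

```python
import json
import urllib.request

API_BASE = "https://api.firecrawl.dev/v1"


def build_request(api_key, body, path="/scrape"):
    """Prepare an authenticated POST request without sending it.

    The path and the Bearer scheme are assumed, not confirmed by this page.
    """
    return urllib.request.Request(
        API_BASE + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request("fc-YOUR-API-KEY", {"url": "https://example.com"})
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=30)
```

Inside n8n the credential handles this step automatically; the sketch is only meant to show what the credential is for.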
Troubleshooting
- Timeouts: If the request times out, consider increasing the "Timeout (Ms)" property or checking network connectivity.
- TLS Errors: If TLS certificate errors occur, enabling "Skip TLS Verification" may help. Use it cautiously, because it disables certificate validation and leaves the request open to interception.
- Incorrect Content Extraction: Adjust "Only Main Content", "Include Tags", and "Exclude Tags" to fine-tune what parts of the page are scraped.
- Dynamic Content Not Loaded: Use "Actions" to interact with the page (e.g., clicking buttons, waiting) before scraping to ensure dynamic content is loaded.
- Proxy Issues: If scraping fails due to IP restrictions, try switching between "Basic" and "Stealth" proxy types.
- Invalid URL: Ensure the URL is correctly formatted and accessible.
- API Authentication Failures: Verify that the API key credential is correctly set up and has the necessary permissions.
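The timeout and proxy suggestions above can be combined into a simple retry strategy. The sketch below only generates the option sets to try, leaving the actual request to the caller. The `timeout` and `proxy` keys, the 30000 ms fallback, and the 120000 ms cap are all assumptions chosen for illustration, and `retry_plan` is a hypothetical helper.

```python
def retry_plan(options, max_timeout_ms=120000):
    """Yield progressively more tolerant option sets for retrying a scrape."""
    # First attempt: the options exactly as configured.
    yield dict(options)

    # Second attempt: double the timeout, capped at max_timeout_ms.
    longer = dict(options)
    longer["timeout"] = min(options.get("timeout", 30000) * 2, max_timeout_ms)
    yield longer

    # Third attempt: keep the longer timeout and switch the proxy type.
    switched = dict(longer)
    switched["proxy"] = "basic" if options.get("proxy") == "stealth" else "stealth"
    yield switched


attempts = list(retry_plan({"timeout": 30000, "proxy": "basic"}))
for number, opts in enumerate(attempts, start=1):
    print(number, opts)
```

Changing one setting per attempt makes it easier to tell which adjustment resolved the failure.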
Links and References
- Firecrawl API Documentation: https://firecrawl.dev/docs
- MDN Web Docs on Accept-Language Header: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Language