# Smart Web Scraper

## Overview
The Smart Web Scraper node scrapes web pages by attempting multiple scraping methods in a prioritized order determined by the chosen strategy. It extracts clean, main content from the URLs you provide, removing extraneous elements such as navigation menus and ads when desired, and fails over between different scraping services and APIs to maximize success rates.
Common scenarios where this node is beneficial include:
- Extracting article content or blog posts from multiple URLs for content aggregation.
- Automating data collection workflows that require reliable extraction of textual content from web pages.
- Using different scraping strategies to balance cost, speed, and quality depending on project needs.
- Handling websites with varying levels of complexity or anti-scraping measures by switching between HTTP requests, AI-powered readers, and premium APIs.
Practical examples:
- Scraping news articles from a list of URLs, outputting clean markdown summaries including metadata like author and publish date.
- Collecting product descriptions from e-commerce pages using a speed-first strategy to minimize latency.
- Using a quality-first approach to extract highly accurate content via premium APIs for research purposes.
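To make the strategy and failover behavior concrete, here is a minimal TypeScript sketch of how a strategy could map to a priority list of methods. It is illustrative only: the type names, the `tryMethod` helper, and the exact orderings for Speed First and Quality First are assumptions, not the node's actual internals.

```typescript
// Illustrative sketch only: type names, the Speed First / Quality First
// orderings, and the tryMethod helper are assumptions.
type ScrapeMethod = 'httpGet' | 'jina' | 'firecrawl';
type Strategy = 'costEffective' | 'speedFirst' | 'qualityFirst';

const METHOD_ORDER: Record<Strategy, ScrapeMethod[]> = {
  costEffective: ['httpGet', 'jina', 'firecrawl'], // free methods first (documented order)
  speedFirst: ['httpGet', 'jina', 'firecrawl'],    // assumed: plain HTTP is usually fastest
  qualityFirst: ['firecrawl', 'jina', 'httpGet'],  // premium APIs first (documented intent)
};

// Try each method in priority order; the first success wins.
async function scrapeWithFailover(
  url: string,
  strategy: Strategy,
  tryMethod: (method: ScrapeMethod, url: string) => Promise<string>,
): Promise<{ content: string; scrapingMethod: ScrapeMethod }> {
  for (const method of METHOD_ORDER[strategy]) {
    try {
      return { content: await tryMethod(method, url), scrapingMethod: method };
    } catch {
      // Fall through to the next method in the list.
    }
  }
  throw new Error(`Failed to scrape URL ${url} with all available methods`);
}
```

Presumably, failover methods you disable in the node's options would simply be filtered out of the list before the loop runs.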
## Properties
| Name | Meaning |
|---|---|
| URLs | The URL(s) to scrape. Multiple URLs can be separated by commas or new lines. |
| Scraping Strategy | Strategy for attempting different scraping methods:<br>- Cost Effective: try free methods first (HTTP GET → Jina → Firecrawl)<br>- Speed First: use the fastest available method<br>- Quality First: start with premium APIs for the best extraction |
| Failover Options | Collection of options to enable, disable, and configure failover scraping methods:<br>- Enable Firecrawl (boolean)<br>- Firecrawl API Key (string, password)<br>- Firecrawl API Host (string)<br>- Enable Jina AI (boolean)<br>- Jina API Key (string, password)<br>- Jina API Host (string)<br>- Enable Proxy (boolean)<br>- Proxy Host, Port, Protocol, Username, Password (for proxy configuration) |
| Output Options | Collection of options controlling output format and content:<br>- Output Format: Markdown, Text, HTML, or JSON<br>- Max Content Length (characters; 0 = unlimited)<br>- Include Metadata (boolean)<br>- Extract Main Content Only (boolean) |
| Advanced Options | Collection of advanced HTTP request settings:<br>- User Agent string<br>- Timeout in milliseconds<br>- Retry Count per method<br>- Custom Headers (JSON object) |
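For instance, the Custom Headers field expects a plain JSON object mapping header names to values. The headers below are illustrative examples, not defaults:

```json
{
  "Accept-Language": "en-US,en;q=0.9",
  "Referer": "https://example.com/",
  "Cache-Control": "no-cache"
}
```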
## Output
The node outputs an array of items, each corresponding to one scraped URL. Each item contains a `json` field with the following structure:

- `content`: The extracted content from the webpage, formatted according to the selected output format (Markdown, plain text, cleaned HTML, or structured JSON).
- `metadata` (optional): An object containing metadata such as title, author, excerpt, site name, and content length, if enabled.
- `scrapingMethod`: A string indicating which scraping method succeeded (e.g., "HTTP GET with content extraction", "Jina AI Reader", or "Firecrawl API").
- `url`: The original URL scraped.
- `timestamp`: ISO timestamp of when the scraping occurred.
If the output format is JSON, the metadata fields are merged into the top-level output object alongside the content.
The node does not output binary data.
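A hypothetical output item for a Markdown scrape with metadata enabled might look like the following. All values, and the exact metadata key names, are illustrative:

```json
{
  "content": "# Example Article\n\nBody text converted to Markdown...",
  "metadata": {
    "title": "Example Article",
    "author": "Jane Doe",
    "excerpt": "A short summary of the article...",
    "siteName": "Example News",
    "contentLength": 5321
  },
  "scrapingMethod": "HTTP GET with content extraction",
  "url": "https://example.com/article",
  "timestamp": "2024-01-01T12:00:00.000Z"
}
```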
## Dependencies
- External APIs/services optionally used for scraping:
  - Jina AI Reader: An AI-powered content extraction service; an API key is optional.
- Firecrawl API: A premium scraping API requiring an API key.
- HTTP requests are made using Axios with configurable headers, user agent, timeout, and optional proxy support.
- Uses Mozilla Readability and jsdom libraries internally to parse and extract main content from raw HTML.
- Optional proxy server configuration for routing requests through a proxy.
To use the external APIs, users must provide valid API keys and optionally configure API hosts if different from defaults.
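As a rough sketch of how these libraries fit together for the plain HTTP GET method (not the node's actual code; retries, proxy support, and most options are omitted, and the user agent string is a placeholder):

```typescript
import axios from 'axios';
import { JSDOM } from 'jsdom';
import { Readability } from '@mozilla/readability';
import TurndownService from 'turndown';

// Fetch raw HTML, extract the main article with Readability, and
// convert the result to Markdown with Turndown.
async function httpGetExtract(url: string): Promise<string> {
  const response = await axios.get<string>(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 (placeholder user agent)' },
    timeout: 30000, // milliseconds, as in the node's Advanced Options
    responseType: 'text',
  });

  // Passing the URL lets jsdom resolve relative links in the document.
  const dom = new JSDOM(response.data, { url });
  const article = new Readability(dom.window.document).parse();
  if (!article || !article.content) {
    throw new Error(`Could not extract main content from ${url}`);
  }

  return new TurndownService().turndown(article.content);
}
```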
## Troubleshooting

### Common Issues
- No valid URLs provided: If the URLs input is empty or improperly formatted, the node will throw an error.
- Missing API keys: When enabling Firecrawl or Jina AI, missing or invalid API keys will cause failures.
- Request timeouts or network errors: Can occur due to slow responses, incorrect proxy settings, or network issues.
- Scraping failures for all methods: If none of the scraping methods succeed, the node throws an error unless "Continue On Fail" is enabled.
### Error Messages and Resolutions
"No valid URLs provided": Ensure the URLs parameter contains at least one valid URL separated by commas or new lines."Firecrawl API key is required": Provide a valid Firecrawl API key when enabling Firecrawl failover."Failed to scrape URL ... with all available methods": Check network connectivity, API keys, and consider adjusting the scraping strategy or enabling/disabling failover options.- JSON parsing errors in custom headers: Verify that the JSON entered in the "Custom Headers" field is valid.
Enabling "Continue On Fail" allows the workflow to proceed even if some URLs fail to scrape, returning error details in the output.
## Links and References
- [Mozilla Readability](https://github.com/mozilla/readability): used internally to extract main content from HTML.
- [jsdom](https://github.com/jsdom/jsdom): JavaScript implementation of the DOM and HTML standards, used for parsing.
- [Axios](https://axios-http.com/): HTTP client library used for making requests.
- [Turndown](https://github.com/mixmark-io/turndown): converts HTML to Markdown.
- Firecrawl API documentation: refer to your Firecrawl provider for official docs.
- Jina AI Reader API documentation: refer to your Jina AI provider for official docs.