Overview
This node scrapes content from a specified webpage URL using an external web crawling API. It can return the page content as Markdown, cleaned plain text, or raw HTML, supports removing unwanted elements via CSS selectors, and can optionally extract only the main content of the page. It can also run a custom prompt on the scraped content to extract specific information.
Common scenarios where this node is beneficial include:
- Preparing webpage content for use with language models or retrieval-augmented generation (RAG) workflows.
- Extracting clean textual data from complex webpages by removing ads, navigation bars, or other irrelevant sections.
- Automating data extraction from websites without needing to write custom scrapers.
- Running targeted queries on scraped content to retrieve structured insights.
Practical example: Scraping a news article URL to get its main content in Markdown format, removing sidebar ads and footer links, then applying a prompt to extract the article's key points.
Properties
| Name | Meaning |
|---|---|
| URL to Scrape | The full URL of the webpage to scrape content from. |
| Output Format | The format of the scraped content output. Options: Markdown, Cleaned (plain text), or HTML. |
| CSS Selectors to Remove | Comma-separated list of CSS selectors whose matching elements will be removed from the content. |
| Prompt | A custom prompt string to run on the scraped content to extract specific information. |
| Extract Main Content Only | Boolean flag indicating whether to extract only the main content section of the webpage. |
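An illustrative configuration of these properties (all values are examples, not defaults):

```json
{
  "URL to Scrape": "https://example.com/news/article",
  "Output Format": "Markdown",
  "CSS Selectors to Remove": ".sidebar-ads, footer .links",
  "Prompt": "List the article's key points.",
  "Extract Main Content Only": true
}
```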
Output
The node outputs an array of JSON objects, each representing the result of scraping one input item. Each JSON object contains the full response from the web crawling API, which includes:
- `success`: Boolean indicating whether the scraping was successful.
- `status`: HTTP status code returned by the API.
- `error_message`: Error message if the scraping failed.
- The scraped content in the requested format (Markdown, cleaned text, or HTML).
- Any additional metadata provided by the API.
If the node encounters an error during scraping and "Continue On Fail" is enabled, it outputs an error object with the error message paired to the corresponding input item.
The node does not output binary data.
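A minimal sketch of consuming the node's output downstream. The keys `success`, `status`, and `error_message` come from the output description above; the `markdown` content key is an assumption about the API's response shape.

```python
# Sketch: partition the node's output items into successes and failures.
# Keys `success`, `status`, and `error_message` follow the output
# description; the "markdown" content key is an assumption.

def partition_results(items):
    """Split scrape results into (successes, failures)."""
    successes, failures = [], []
    for item in items:
        if item.get("success"):
            successes.append(item)
        else:
            failures.append({
                "status": item.get("status"),
                "error": item.get("error_message", "Unknown error"),
            })
    return successes, failures

example = [
    {"success": True, "status": 200, "markdown": "# Title"},
    {"success": False, "status": 403, "error_message": "Forbidden"},
]
ok, failed = partition_results(example)
```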
Dependencies
- Requires an API key credential for authenticating with the external Web Crawler API service.
- The node makes authenticated POST requests to `https://api.webcrawlerapi.com/v2/scrape`.
- Proper configuration of the API key credential within n8n is necessary for operation.
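For reference, the underlying request can be reproduced outside n8n roughly as follows. The endpoint is the one listed above; the `Authorization` header scheme and the payload field names are assumptions, so check the WebCrawlerAPI documentation for the exact parameter names.

```python
# Sketch: building the scrape request with only the standard library.
# The Authorization scheme and payload field names are assumptions.
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # placeholder for the credential configured in n8n

def build_scrape_request(url, output_format="markdown"):
    """Construct (but do not send) a POST request to the scrape endpoint."""
    payload = json.dumps({"url": url, "output_format": output_format}).encode()
    return urllib.request.Request(
        "https://api.webcrawlerapi.com/v2/scrape",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",  # assumed auth scheme
        },
        method="POST",
    )

req = build_scrape_request("https://example.com/article")
```

Sending the request with `urllib.request.urlopen(req)` would return the JSON response described in the Output section.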
Troubleshooting
Common issues:
- Invalid or missing API key credential will cause authentication failures.
- Incorrect or malformed URLs may lead to request errors.
- Specifying invalid CSS selectors could result in unexpected content removal or no effect.
- Network connectivity problems can prevent reaching the API endpoint.
Error messages:
- `[status] error_message`: Indicates an HTTP error status and message returned by the API. Check the URL and API key validity.
- `Unknown error`: Generic failure; verify network and API service status.
- `NodeOperationError` with a message from a caught exception: Usually indicates internal or request-related issues.
Resolutions:
- Ensure the API key credential is correctly set up and valid.
- Verify the URL is accessible and properly formatted.
- Test CSS selectors independently to confirm they match intended elements.
- Enable "Continue On Fail" to handle partial failures gracefully.
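Before testing selectors against a live page, it helps to catch empty or malformed entries in the comma-separated selector string; a minimal sketch (the splitting behavior shown here is an assumption about how the node parses the field):

```python
# Sketch: sanity-check the value for "CSS Selectors to Remove".
# How the node itself splits the field is an assumption.
def parse_selector_list(raw):
    """Split a comma-separated selector string, trimming whitespace
    and dropping empty entries (e.g. from a trailing comma)."""
    return [s.strip() for s in raw.split(",") if s.strip()]

selectors = parse_selector_list(".sidebar-ads, footer .links, nav,")
```

Browser developer tools (e.g. `document.querySelectorAll`) can then confirm each parsed selector matches the intended elements on the target page.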
Links and References
- WebCrawlerAPI Documentation (for detailed API usage and parameters)
- n8n Documentation on Credentials (for setting up API keys)
- CSS Selector Reference (to craft selectors for cleaning content)