Actions
- Data API Actions
- Web Scraping Actions
Overview
The node provides web scraping capabilities by interacting with Dumpling AI's API. Specifically, the "Scrape URL" operation allows users to fetch content from a specified webpage URL. It supports different output formats such as raw HTML, cleaned markdown, or a screenshot of the page. Users can choose whether to clean the output (removing elements like navigation bars and footers) and whether to render JavaScript on the page before scraping, which affects the completeness and speed of the scrape.
This node is beneficial in scenarios where automated extraction of web content is needed for further processing, analysis, or integration into workflows. For example:
- Extracting article content from news websites in markdown format for content aggregation.
- Capturing screenshots of webpages for visual monitoring or reporting.
- Retrieving raw HTML for custom parsing or data extraction.
Properties
| Name | Meaning |
|---|---|
| URL | The webpage URL to scrape. Must be a valid URL string. |
| Output Format | The format of the scraped output. Options: HTML (raw HTML content), Markdown (clean Markdown), or Screenshot (image capture of the page). |
| Clean Output | Whether to clean the output by removing common webpage elements like navigation bars and footers. Boolean value. |
| Render JavaScript | Whether to render JavaScript on the page before scraping. Enabling this ensures dynamic content loads but may slow down the process. Boolean value. |
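To make the four properties concrete, here is a minimal sketch of how they might be validated and turned into a request body. The wire field names (`url`, `format`, `cleaned`, `renderJs`) and the function name `build_scrape_payload` are assumptions for illustration; this page only documents the UI labels, so check the Dumpling AI API reference for the real names.

```python
from urllib.parse import urlparse

# Hypothetical mapping from the node's UI labels to wire values.
OUTPUT_FORMATS = {"HTML": "html", "Markdown": "markdown", "Screenshot": "screenshot"}


def build_scrape_payload(url, output_format, clean_output, render_js):
    """Validate the node properties and return a request body as a dict."""
    parsed = urlparse(url)
    # "Must be a valid URL string": require an http(s) scheme and a host.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not a valid http(s) URL: {url!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"Unsupported output format: {output_format!r}")
    return {
        "url": url,
        "format": OUTPUT_FORMATS[output_format],  # assumed field name
        "cleaned": bool(clean_output),            # assumed field name
        "renderJs": bool(render_js),              # assumed field name
    }
```

Validating before sending means a malformed URL fails locally rather than as an API error.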
Output
The node outputs JSON data containing the scraped content based on the selected format:
- For HTML and Markdown formats, the output will include a text field with the page content either as raw HTML or cleaned markdown.
- For the Screenshot format, the output includes binary data representing an image capture of the webpage.
The exact structure typically contains a json property with the scraped content and, if applicable, a binary property holding the screenshot image data.
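A downstream step could branch on that structure roughly as follows. This is a sketch under stated assumptions: the text field name `content` and the binary property key `data` are hypothetical defaults, and the binary payload is assumed to be held in memory as a base64 string, which is how n8n represents binary data by default.

```python
import base64


def extract_scrape_result(item, text_field="content", binary_key="data"):
    """Return ("image", bytes) for a screenshot item, else ("text", str)."""
    binary = item.get("binary") or {}
    if binary_key in binary:
        # Decode the base64 string held under the binary property.
        return "image", base64.b64decode(binary[binary_key]["data"])
    # HTML and Markdown results are read from the json property.
    return "text", item.get("json", {}).get(text_field, "")
```

Returning a tagged tuple lets the caller decide whether to write the bytes to a file or pass the text on for parsing.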
Dependencies
- Requires an active API key credential for Dumpling AI to authenticate requests.
- Depends on Dumpling AI's web scraping API endpoint at https://app.dumplingai.com/api/v1.
- No additional environment variables are explicitly required beyond the API authentication.
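For readers calling the API outside n8n, the dependencies above might translate into a request target and headers like the sketch below. Only the base URL comes from this page; the `/scrape` path, the `Bearer` authorization scheme, and the helper name `build_request` are assumptions to verify against the Dumpling AI documentation.

```python
BASE_URL = "https://app.dumplingai.com/api/v1"  # stated in the Dependencies list


def build_request(api_key, path="/scrape"):
    """Return the target URL and headers for an authenticated call."""
    if not api_key or not api_key.strip():
        raise ValueError("Missing Dumpling AI API key")
    return {
        "url": BASE_URL + path,  # "/scrape" is an assumed path
        "headers": {
            # Assumed scheme: API key sent as a bearer token.
            "Authorization": f"Bearer {api_key.strip()}",
            "Content-Type": "application/json",
        },
    }
```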
Troubleshooting
Common Issues:
- Invalid or unreachable URL: Ensure the URL is correct and accessible from the network where n8n runs.
- API authentication errors: Verify that the provided API key credential is valid and has necessary permissions.
- Timeout or slow responses when rendering JavaScript: Rendering JS can significantly increase scraping time; disable it if not needed.
- Unexpected output format: Confirm the selected output format matches your intended use case.
Error Messages:
- Authentication failures usually indicate invalid or missing API credentials.
- Network errors suggest connectivity issues or blocked access to the target URL or Dumpling AI API.
- Parsing errors might occur if the page content is malformed or the cleaning process fails; try switching output formats or disabling cleaning.
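The troubleshooting advice above can be expressed as a small fallback rule. This is an illustrative sketch, not part of the node: the error labels (`"timeout"`, `"parsing"`, and so on), the option keys, and the choice of HTML as the fallback format are all assumptions.

```python
def next_attempt(options, error_kind):
    """Return adjusted options to retry with, or None when a retry won't help."""
    opts = dict(options)  # copy so the caller's options stay unchanged
    if error_kind == "timeout" and opts.get("render_js"):
        # Rendering JavaScript is the slow path; retry without it.
        opts["render_js"] = False
        return opts
    if error_kind == "parsing":
        if opts.get("clean_output"):
            # First try disabling the cleaning step.
            opts["clean_output"] = False
            return opts
        if opts.get("output_format") != "HTML":
            # Then fall back to raw HTML (assumed to need the least processing).
            opts["output_format"] = "HTML"
            return opts
    # Authentication and network errors need a credential or connectivity fix.
    return None
```

A caller would apply the returned options to one more attempt and stop when the function returns `None`.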