Overview
This node crawls and scrapes web pages using the Crawlee library with a Cheerio-based, server-side HTML parser (it does not launch a headless browser). It can extract links, text content, or raw HTML from a specified URL, making it useful for web data collection, link discovery, content aggregation, or monitoring changes on websites.
For example:
- Extracting all hyperlinks from a webpage to map site structure.
- Scraping the main textual content of an article for analysis.
- Retrieving the full HTML markup of a page for further processing.
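To make the three operations concrete, here is a rough, standalone sketch of what each one extracts, written against the cheerio API that Crawlee's CheerioCrawler exposes. It is an illustration only, not the node's actual implementation, and the sample HTML is made up.

```typescript
// Illustration only: roughly what each operation extracts, using cheerio
// (the server-side parser behind Crawlee's CheerioCrawler).
import * as cheerio from 'cheerio';

const html = '<html><body><a href="https://example.com/a">A</a><p>Hello world</p></body></html>';
const $ = cheerio.load(html);

// "Extract Links": collect href attributes of anchors and deduplicate them.
const links = [...new Set($('a[href]').map((_, el) => $(el).attr('href')).get())];

// "Extract Text": the visible text of the body, trimmed.
const text = $('body').text().trim();

// "Extract HTML": the full markup of the page.
const fullHtml = $.html();

console.log({ links, text, fullHtml });
```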
Properties
| Name | Meaning |
|---|---|
| URL | The web address to crawl or scrape. |
| Operation | The type of extraction to perform: "Extract Links", "Extract Text", or "Extract HTML". |
| Max Depth | Maximum depth of crawling (only applicable when extracting links). |
Output
The output is an array of JSON objects; the structure depends on the selected operation:
Extract Links
{ "status": "success", "message": "Crawling finished", "data": { "url": "<input URL>", "links": ["<list of unique extracted URLs>"] } }Contains the original URL and a deduplicated list of all hyperlinks found on the page.
Extract Text
{ "status": "success", "message": "Text extraction finished", "data": { "url": "<input URL>", "text": "<extracted visible text content>" } }Contains the original URL and the trimmed textual content of the page body.
Extract HTML
{ "status": "success", "message": "HTML extraction finished", "data": { "url": "<input URL>", "html": "<raw HTML source code>" } }Contains the original URL and the full HTML markup of the page.
No binary data output is produced by this node.
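As a usage note, a downstream n8n Code node can pick these fields apart. The sketch below assumes the standard Code node helper `$input.all()` and the field names documented above; fanning out one item per link is just one possible pattern.

```typescript
// In an n8n Code node placed directly after this node.
// Field names follow the output structure documented above.
const results = [];

for (const item of $input.all()) {
  const { status, data } = item.json;
  if (status !== 'success') continue;

  // For "Extract Links": emit one item per discovered link.
  for (const link of data.links ?? []) {
    results.push({ json: { source: data.url, link } });
  }
}

return results;
```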
Dependencies
- Uses the crawlee library for crawling and scraping functionality (see the sketch after this list).
- Requires network access to the target URLs.
- No explicit API keys or external service credentials are needed.
- Runs within the n8n environment with internet connectivity.
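For orientation, this is a minimal standalone sketch of Crawlee's CheerioCrawler, the library this node builds on. It is not the node's source code, and the option values (request cap, timeout, enqueue strategy) are placeholders chosen for the example.

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: 50,        // cap total requests so large sites do not run away
  requestHandlerTimeoutSecs: 30,  // give up on pages that take too long to process
  async requestHandler({ request, $, enqueueLinks, log }) {
    const links = $('a[href]').map((_, el) => $(el).attr('href')).get();
    log.info(`${request.url}: found ${links.length} links`);

    // Follow links on the same hostname, roughly what a depth-limited crawl does.
    await enqueueLinks({ strategy: 'same-hostname' });
  },
});

await crawler.run(['https://example.com']);
```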
Troubleshooting
Common issues:
- Invalid or unreachable URLs will cause errors during crawling.
- Pages heavily reliant on JavaScript might not render fully since the crawler uses Cheerio (a server-side DOM parser) rather than a full browser engine.
- Large or deeply nested sites may exceed the max requests limit or timeout.
Error messages:
- Timeout errors if the page takes too long to respond (default 30 seconds).
- URL parsing errors if malformed links are encountered.
- Network errors if the target site is down or blocked.
Resolutions:
- Verify URLs are well formed and accessible before running the node (a quick validation sketch follows this list).
- Adjust the "Max Depth" or reduce the number of requests if crawling large sites.
- Use a browser-based scraping tool if the target page requires JavaScript rendering.
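Since malformed URLs are one of the listed error sources, a simple pre-check with the standard WHATWG URL constructor can filter them out before they reach the node. The helper name below is made up for the example.

```typescript
// Hypothetical helper: reject values the URL constructor cannot parse,
// and anything that is not plain http(s).
function isCrawlableUrl(value: string): boolean {
  try {
    const url = new URL(value);
    return url.protocol === 'http:' || url.protocol === 'https:';
  } catch {
    return false;
  }
}

console.log(isCrawlableUrl('https://example.com')); // true
console.log(isCrawlableUrl('not a url'));           // false
```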
Links and References
- Crawlee GitHub Repository – underlying crawling library used.
- Cheerio Documentation – for understanding how HTML parsing works in this context.
- n8n Documentation – general guidance on creating and using custom nodes.