
Crawl and Scrape

Crawl websites and extract data

Overview

This node crawls and scrapes web pages using the crawlee library's Cheerio-based crawler (a fast server-side DOM parser rather than a full headless browser). It can extract links, text content, or raw HTML from a specified URL. Typical uses include web data collection, link discovery, content aggregation, and monitoring changes on websites.

For example:

  • Extracting all hyperlinks from a webpage to map site structure.
  • Scraping the main textual content of an article for analysis.
  • Retrieving the full HTML markup of a page for further processing.
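
Under the hood this is essentially a thin wrapper around a crawlee crawler. The sketch below is illustrative only, not the node's actual source: it shows link extraction with crawlee's CheerioCrawler, using request userData to enforce the "Max Depth" limit (the variable names and the depth convention are assumptions).

    import { CheerioCrawler } from 'crawlee';

    // Illustrative sketch; the node's real implementation may differ.
    const links = new Set<string>();
    const maxDepth = 2; // corresponds to the node's "Max Depth" property

    const crawler = new CheerioCrawler({
      maxRequestsPerCrawl: 50,
      async requestHandler({ request, $, enqueueLinks }) {
        // Collect every hyperlink on the current page, resolved to absolute form.
        $('a[href]').each((_, el) => {
          const href = $(el).attr('href');
          if (href) links.add(new URL(href, request.loadedUrl ?? request.url).href);
        });

        // Follow links only while below the depth limit.
        const depth = (request.userData.depth as number) ?? 0;
        if (depth < maxDepth) {
          await enqueueLinks({ userData: { depth: depth + 1 } });
        }
      },
    });

    await crawler.run(['https://example.com']);
    console.log([...links]); // deduplicated, as in the "Extract Links" output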

Properties

Name        Meaning
----        -------
URL         The web address to crawl or scrape.
Operation   The type of extraction to perform: "Extract Links", "Extract Text", or "Extract HTML".
Max Depth   Maximum crawl depth (only applies to the "Extract Links" operation).
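
For orientation, this is how an n8n node typically reads such properties inside its execute() method. The internal property names ('url', 'operation', 'maxDepth') are assumptions for illustration, not confirmed identifiers from this node's source:

    // Inside execute() of an n8n node (IExecuteFunctions context).
    // Property names below are assumed, not taken from the node's source.
    const url = this.getNodeParameter('url', 0) as string;
    const operation = this.getNodeParameter('operation', 0) as string;
    // "Max Depth" only matters for the "Extract Links" operation.
    const maxDepth =
      operation === 'extractLinks'
        ? (this.getNodeParameter('maxDepth', 0, 1) as number)
        : 0;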

Output

The output is an array of JSON objects; the structure depends on the selected operation:

  • Extract Links

    {
      "status": "success",
      "message": "Crawling finished",
      "data": {
        "url": "<input URL>",
        "links": ["<list of unique extracted URLs>"]
      }
    }
    

    Contains the original URL and a deduplicated list of all hyperlinks found on the page.

  • Extract Text

    {
      "status": "success",
      "message": "Text extraction finished",
      "data": {
        "url": "<input URL>",
        "text": "<extracted visible text content>"
      }
    }
    

    Contains the original URL and the trimmed textual content of the page body.

  • Extract HTML

    {
      "status": "success",
      "message": "HTML extraction finished",
      "data": {
        "url": "<input URL>",
        "html": "<raw HTML source code>"
      }
    }
    

    Contains the original URL and the full HTML markup of the page.

No binary data output is produced by this node.
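
Because every operation returns plain JSON, downstream nodes can consume the result directly. As a hedged example, an n8n Code node ("Run Once for All Items") could fan out an "Extract Links" result into one item per URL:

    // n8n Code node, mode "Run Once for All Items", placed after this node.
    // Assumes the previous node ran the "Extract Links" operation.
    const links = $input.first().json.data.links;

    // Emit one item per discovered URL so later nodes process them individually.
    return links.map((url) => ({ json: { url } }));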

Dependencies

  • Uses the crawlee library (CheerioCrawler) for crawling and scraping.
  • Requires network access to the target URLs.
  • No explicit API keys or external service credentials are needed.
  • Runs within the n8n environment and requires internet connectivity.

Troubleshooting

  • Common issues:

    • Invalid or unreachable URLs will cause errors during crawling.
    • Pages heavily reliant on JavaScript might not render fully since the crawler uses Cheerio (a server-side DOM parser) rather than a full browser engine.
    • Large or deeply nested sites may exceed the maximum request limit or time out.
  • Error messages:

    • Timeout errors if the page takes too long to respond (default 30 seconds).
    • URL parsing errors if malformed links are encountered.
    • Network errors if the target site is down or blocked.
  • Resolutions:

    • Verify URLs are correct and accessible.
    • Lower the "Max Depth" value or reduce the number of requests when crawling large sites (see the configuration sketch after this list).
    • Use a tool with a full browser engine (for example, Playwright or Puppeteer) if JavaScript rendering is required.
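
If you need to tune these limits in a custom build, crawlee exposes options for them on the crawler constructor. A minimal sketch, assuming crawlee's documented CheerioCrawler options (whether this node surfaces them as settings is an assumption):

    import { CheerioCrawler } from 'crawlee';

    const crawler = new CheerioCrawler({
      maxRequestsPerCrawl: 100,  // cap total pages on large or deeply nested sites
      maxRequestRetries: 2,      // retry transient network failures
      navigationTimeoutSecs: 30, // HTTP response timeout, matching the 30s default above
      async requestHandler({ request, log }) {
        log.info(`Fetched ${request.url}`);
      },
    });

    await crawler.run(['https://example.com']);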
