
Crawl and Scrape

Crawl websites and extract data

Overview

This node crawls and scrapes web pages by fetching a specified URL and extracting data from it. It supports three operations: extracting all links on the page, extracting the visible text content, or extracting the raw HTML source. This is useful for gathering URLs for further crawling, collecting textual information for analysis, or capturing full HTML snapshots for archival or later processing.

Practical examples include:

  • Extracting all outbound links from a news article to analyze referenced sources.
  • Scraping the main text content of a blog post for sentiment analysis.
  • Downloading the complete HTML of a product page for offline processing or comparison.
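
To make the link-extraction flow concrete, the sketch below shows how a single page might be fetched and its unique absolute links collected with crawlee's CheerioCrawler. It is a minimal standalone example rather than the node's actual implementation; the start URL and the maxRequestsPerCrawl limit are placeholder choices.

    import { CheerioCrawler } from 'crawlee';

    // Collect unique absolute links from a single page.
    const links = new Set<string>();

    const crawler = new CheerioCrawler({
      maxRequestsPerCrawl: 1, // fetch only the start URL
      async requestHandler({ request, $ }) {
        $('a[href]').each((_, el) => {
          const href = $(el).attr('href');
          if (!href) return;
          try {
            // Resolve relative hrefs against the page URL; the Set de-duplicates.
            links.add(new URL(href, request.loadedUrl ?? request.url).href);
          } catch {
            // Skip hrefs that are not valid URLs (e.g. "javascript:void(0)").
          }
        });
      },
    });

    await crawler.run(['https://example.com']);
    console.log({ url: 'https://example.com', links: [...links] });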

Properties

  • URL: The URL to crawl or scrape. Must be a valid web address.
  • Operation: The type of extraction to perform. Options: "Extract Links", "Extract Text", "Extract HTML".
  • Max Depth: Maximum crawling depth, measured in link hops. Used only by the "Extract Links" operation.
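
As an illustration, a configuration that collects the links from a blog post might look like the following. The parameter keys shown here (url, operation, maxDepth) are assumed internal names; in the n8n editor they appear as the fields listed above.

    {
      "url": "https://example.com/blog/post-123",
      "operation": "Extract Links",
      "maxDepth": 1
    }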

Output

The node outputs JSON objects whose structure depends on the selected operation:

  • Extract Links

    {
      "status": "success",
      "message": "Crawling finished",
      "data": {
        "url": "<input URL>",
        "links": ["<list of unique absolute URLs extracted from the page>"]
      }
    }
    

    Contains the original URL and an array of all unique hyperlinks found on that page (a sketch for processing these links downstream appears after the output examples).

  • Extract Text

    {
      "status": "success",
      "message": "Text extraction finished",
      "data": {
        "url": "<input URL>",
        "text": "<all visible text content inside the <body> tag>"
      }
    }
    

    Contains the original URL and the concatenated visible text content of the page.

  • Extract HTML

    {
      "status": "success",
      "message": "HTML extraction finished",
      "data": {
        "url": "<input URL>",
        "html": "<raw HTML source of the page>"
      }
    }
    

    Contains the original URL and the full raw HTML markup of the page.

The node does not output binary data.
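
To continue working with the "Extract Links" result, a following Code node can fan the links array out into one item per URL. The snippet below is a sketch that assumes the output shape shown above; the sourceUrl field name is illustrative.

    // n8n Code node ("Run Once for All Items"): emit one item per extracted link.
    const out = [];
    for (const item of $input.all()) {
      const { url, links = [] } = item.json.data ?? {};
      for (const link of links) {
        out.push({ json: { sourceUrl: url, link } });
      }
    }
    return out;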

Dependencies

  • Uses the crawlee library for crawling and scraping functionality.
  • Requires network access to the target URLs.
  • No explicit API keys or authentication tokens are needed unless accessing protected resources.
  • Runs within the n8n environment with standard HTTP request capabilities.

Troubleshooting

  • Common issues:

    • Invalid or malformed URLs may cause errors during crawling.
    • Target websites blocking automated requests can result in empty or failed responses.
    • Network timeouts if the site is slow or unreachable.
    • Large pages or deep crawls may exceed timeout limits or resource constraints.
  • Error messages:

    • Errors related to URL parsing indicate invalid input URLs; verify and correct them.
    • Timeout errors suggest increasing crawler timeout settings or checking network connectivity.
    • If the node fails but "Continue On Fail" is enabled, error details will be included in the output item.
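
When "Continue On Fail" is enabled, a later Code node can separate failed crawls from successful ones. The sketch below assumes the error details are exposed under an error property on the output item; inspect a failed item in your own workflow to confirm the exact field name.

    // n8n Code node: keep successful crawls, set failures aside.
    const succeeded = [];
    const failed = [];
    for (const item of $input.all()) {
      if (item.json.error) {
        failed.push(item); // error details included when "Continue On Fail" is on
      } else {
        succeeded.push(item);
      }
    }
    // Route "failed" to logging or a retry branch as needed; pass the rest through.
    return succeeded;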

Links and References

  • crawlee library: https://crawlee.dev
