
Crawl and Scrape

Crawl websites and extract data

Overview

This node enables crawling and scraping of websites by fetching a specified URL and extracting data from it. It supports three main operations: extracting all links on the page, extracting the visible text content, or extracting the raw HTML content. This node is useful for web data collection, link discovery, content analysis, or preparing data for further processing.

Practical examples:

  • Extracting all hyperlinks from a webpage to build a sitemap or discover related pages.
  • Extracting the main textual content of an article for sentiment analysis or summarization.
  • Retrieving the full HTML source of a page for custom parsing or archiving.
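
Every operation fetches the page over plain HTTP and parses the static HTML; no browser rendering is involved. As an illustration only (not the node's actual source code), the sketch below shows roughly what each operation produces, using cheerio, the HTML parser that crawlee builds on; the function names are hypothetical.

    // Hypothetical sketch of the three operations, using cheerio directly.
    import * as cheerio from 'cheerio';

    // "Extract Links": unique absolute URLs found in <a href> attributes.
    function extractLinks(html: string, baseUrl: string): string[] {
      const $ = cheerio.load(html);
      const links = new Set<string>();
      $('a[href]').each((_, el) => {
        const href = $(el).attr('href');
        if (!href) return;
        try {
          links.add(new URL(href, baseUrl).href); // resolve relative URLs
        } catch {
          // skip hrefs that cannot be resolved to absolute URLs
        }
      });
      return [...links];
    }

    // "Extract Text": the trimmed visible text of the page body.
    function extractText(html: string): string {
      return cheerio.load(html)('body').text().trim();
    }

    // "Extract HTML": the raw HTML source exactly as fetched.
    function extractHtml(html: string): string {
      return html;
    }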

Properties

Name         Meaning
URL          The URL of the webpage to crawl or scrape.
Operation    The type of extraction to perform. Options: "Extract Links", "Extract Text", "Extract HTML".
Max Depth    Maximum depth of crawling (number of link levels). Only applies to the "Extract Links" operation.

Output

The output is an array of JSON objects. The structure of each object depends on the selected operation:

  • Extract Links

    {
      "status": "success",
      "message": "Crawling finished",
      "data": {
        "url": "<input URL>",
        "links": ["<list of unique absolute URLs extracted>"]
      }
    }
    

    Contains the original URL and a list of unique absolute links found on the page.

  • Extract Text

    {
      "status": "success",
      "message": "Text extraction finished",
      "data": {
        "url": "<input URL>",
        "text": "<extracted visible text content>"
      }
    }
    

    Contains the original URL and the trimmed visible text content of the page body.

  • Extract HTML

    {
      "status": "success",
      "message": "HTML extraction finished",
      "data": {
        "url": "<input URL>",
        "html": "<raw HTML content of the page>"
      }
    }
    

    Contains the original URL and the raw HTML source code of the page.

No binary data output is produced by this node.
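
Downstream nodes read these fields with ordinary n8n expressions, for example {{ $json.data.links }} after an "Extract Links" run. As a hedged sketch, a Code node placed after this node could fan the link list out into one item per URL; the field names follow the output structure shown above.

    // Sketch of an n8n Code node ("Run Once for All Items") placed after
    // Crawl and Scrape running the "Extract Links" operation.
    // It turns the single links array into one output item per URL.
    const { url, links } = $input.first().json.data;

    return links.map((link) => ({
      json: {
        sourceUrl: url, // the page the links were found on
        link,           // one absolute URL from data.links
      },
    }));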

Dependencies

  • Uses the crawlee library for crawling and scraping (see the sketch at the end of this section).
  • Requires network access to the target URLs.
  • No explicit API keys or authentication tokens are required by default.
  • Runs within the n8n environment and requires internet connectivity.
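
For orientation, crawlee is usually driven through its CheerioCrawler, which fetches pages over HTTP and parses them statically. The snippet below is a minimal sketch of an "Extract Links"-style crawl, not the node's actual implementation; the 30-second handler timeout mirrors the default mentioned under Troubleshooting, and the URL is a placeholder.

    // Minimal sketch of a static link crawl with crawlee's CheerioCrawler.
    // Illustrative only; this is not the node's source code.
    import { CheerioCrawler } from 'crawlee';

    const collected = new Set<string>();

    const crawler = new CheerioCrawler({
      maxRequestsPerCrawl: 50,        // safety cap on how many pages are fetched
      requestHandlerTimeoutSecs: 30,  // matches the node's default timeout
      async requestHandler({ request, $, enqueueLinks }) {
        // Record every absolute link found on the current page.
        $('a[href]').each((_, el) => {
          const href = $(el).attr('href');
          if (!href) return;
          try {
            collected.add(new URL(href, request.loadedUrl ?? request.url).href);
          } catch {
            // skip hrefs that cannot be resolved to absolute URLs
          }
        });
        // Follow same-host links; this is how deeper levels ("Max Depth" > 1)
        // of a crawl would be reached.
        await enqueueLinks({ strategy: 'same-hostname' });
      },
    });

    await crawler.run(['https://example.com']);
    console.log([...collected]);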

Troubleshooting

  • Common issues:

    • Invalid or unreachable URLs will cause errors during crawling.
    • Pages that require JavaScript rendering may not be fully processed, since the node parses static HTML and does not execute scripts.
    • Network timeouts or slow responses can cause request handler timeouts (default 30 seconds).
    • If the node fails but "Continue On Fail" is enabled, error details will be included in the output instead of stopping the workflow; see the sketch at the end of this section for one way to handle such items.
  • Error messages:

    • Errors mentioning invalid URLs or network failures usually point to a malformed URL (for example, a missing protocol such as https://) or a connectivity problem.
    • Timeout errors suggest increasing timeout settings or checking target server responsiveness.
    • Parsing errors might occur if the page structure is unexpected or malformed.
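
When "Continue On Fail" is enabled, failed items typically carry an error field in place of the usual data object; the exact field name is an assumption based on n8n's usual convention, not something this documentation guarantees. A downstream Code node can separate successes from failures along these lines:

    // Hypothetical guard in a Code node after Crawl and Scrape with
    // "Continue On Fail" enabled. Items whose JSON carries an "error"
    // field (assumed field name) are treated as failures.
    const succeeded = [];
    const failed = [];

    for (const item of $input.all()) {
      if (item.json.error) {
        failed.push(item);   // e.g. unreachable URL or timeout
      } else {
        succeeded.push(item);
      }
    }

    // Pass only successful scrapes downstream; the failed items could
    // instead be routed to a notification or retry branch.
    return succeeded;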
