Crawl and Scrape

Crawl websites and extract data

Overview

This node crawls and scrapes web pages by fetching a specified URL and extracting data from it. It supports three operations: extracting all links on the page, extracting the visible text content, or extracting the raw HTML source. This is useful for gathering URLs for further crawling, scraping text for analysis, or obtaining full HTML markup for custom parsing; a sketch of a typical link crawl follows the examples below.

Practical examples:

  • Extracting all outbound links from a news article to discover related content.
  • Scraping product descriptions or reviews from an e-commerce page.
  • Downloading the full HTML of a webpage for offline processing or archiving.
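
The node uses the Crawlee library under the hood (see Dependencies). As a rough sketch of what single-page link extraction can look like with Crawlee's CheerioCrawler — the node's actual source is not shown here, so the handler logic below is an assumption:

  import { CheerioCrawler } from 'crawlee';

  // Sketch: collect unique absolute links from a single page.
  const links = new Set<string>();

  const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 1, // fetch only the start URL
    async requestHandler({ request, $ }) {
      $('a[href]').each((_, el) => {
        const href = $(el).attr('href');
        if (!href) return;
        try {
          // Resolve relative hrefs against the loaded page URL.
          links.add(new URL(href, request.loadedUrl ?? request.url).href);
        } catch {
          // Skip malformed hrefs.
        }
      });
    },
  });

  await crawler.run(['https://example.com']);
  console.log([...links]);

For deeper crawls (see Max Depth below), the handler would presumably enqueue the discovered links and repeat; Crawlee's enqueueLinks helper supports exactly this pattern.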

Properties

Name       Meaning
URL        The URL to crawl or scrape. Must be a valid web address.
Operation  The type of extraction to perform. Options: "Extract Links", "Extract Text", "Extract HTML".
Max Depth  Maximum crawl depth (number). Only used by the "Extract Links" operation.
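
For illustration, a shallow link crawl could be configured with values like these (the JSON keys are hypothetical; they are shown only to make the property/value pairing concrete, not to document the node's internal schema):

  {
    "url": "https://example.com/news",
    "operation": "Extract Links",
    "maxDepth": 2
  }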

Output

The output is a JSON object whose structure depends on the selected operation:

  • Extract Links

    {
      "status": "success",
      "message": "Crawling finished",
      "data": {
        "url": "<input URL>",
        "links": ["<list of unique absolute URLs extracted>"]
      }
    }
    

    Contains the original URL and an array of all unique hyperlinks found on that page (a consumption sketch appears at the end of this section).

  • Extract Text

    {
      "status": "success",
      "message": "Text extraction finished",
      "data": {
        "url": "<input URL>",
        "text": "<all visible text content from the body>"
      }
    }
    

    Contains the original URL and the concatenated visible text content from the page's body (a sketch of how such text can be derived appears at the end of this section).

  • Extract HTML

    {
      "status": "success",
      "message": "HTML extraction finished",
      "data": {
        "url": "<input URL>",
        "html": "<raw HTML source of the page>"
      }
    }
    

    Contains the original URL and the full raw HTML markup of the page.

No binary data output is produced by this node.
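
Downstream nodes can fan an "Extract Links" result out into one item per URL. A minimal sketch for an n8n Code node in "Run Once for All Items" mode, assuming the output shape shown above:

  // n8n Code node: emit one item per extracted link.
  const out = [];
  for (const item of $input.all()) {
    for (const url of item.json.data.links) {
      out.push({ json: { url } });
    }
  }
  return out;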
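
How the "Extract Text" operation derives visible text is an implementation detail of the node; one plausible approach using cheerio (an assumption, not confirmed internals) is:

  import * as cheerio from 'cheerio';

  // Assumption: drop non-visible elements, then collapse whitespace.
  function extractVisibleText(html: string): string {
    const $ = cheerio.load(html);
    $('script, style, noscript').remove();
    return $('body').text().replace(/\s+/g, ' ').trim();
  }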

Dependencies

  • Requires internet access to fetch the target URLs.
  • Uses the crawlee library internally for crawling and scraping.
  • No special API keys or credentials are required.
  • The node relies on n8n’s standard HTTP request capabilities and environment.

Troubleshooting

  • Common issues:

    • Invalid or unreachable URLs will cause errors.
    • Pages requiring authentication or JavaScript rendering may not return expected results.
    • Network timeouts may occur if the target server is slow or unresponsive.
  • Error messages:

    • Errors during crawling or extraction throw exceptions that include the index of the failed input item.
    • If "Continue On Fail" is enabled, the error is instead included in the output JSON for the affected input item (the typical pattern is sketched after this list).
    • To resolve, verify the URL and network connectivity, and consider proxies or authenticated sessions where needed.
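
For node authors, the "Continue On Fail" behavior described above typically follows n8n's standard per-item pattern. This is a sketch of that convention (items, returnData, and the surrounding execute() context are assumed), not this node's verbatim source:

  import { NodeOperationError } from 'n8n-workflow';

  // Inside execute(): handle each input item, honoring Continue On Fail.
  for (let i = 0; i < items.length; i++) {
    try {
      // ...fetch the item's URL and run the selected extraction...
    } catch (error) {
      if (this.continueOnFail()) {
        // Attach the error to this item's output and keep processing.
        returnData.push({ json: { error: (error as Error).message }, pairedItem: { item: i } });
        continue;
      }
      throw new NodeOperationError(this.getNode(), error as Error, { itemIndex: i });
    }
  }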
