Overview
This node performs a web crawling operation starting from a specified root URL. It is designed to gather content from web pages by following links up to a defined depth and breadth, with options to filter URLs by domain or path patterns, include images, and format the extracted content. This node is useful for scenarios such as data scraping, content aggregation, SEO analysis, or research where automated extraction of web page data is required.
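The depth- and breadth-limited link following described above can be sketched as a small breadth-first frontier. Everything here is an illustrative assumption, not this node's actual implementation: the caller supplies a `fetch_links` function, and `select`/`exclude` stand in for the URL filter options.

```python
import re
from collections import deque

def crawl(root_url, fetch_links, max_depth=2, max_breadth=5,
          select=None, exclude=None):
    """Visit pages breadth-first starting from root_url.

    fetch_links(url) -> list of URLs found on that page (caller-supplied).
    max_depth limits how many links away from the root a page may be;
    max_breadth limits how many links are followed per page;
    select / exclude are optional regex strings applied to each candidate URL.
    """
    visited = []
    seen = {root_url}
    queue = deque([(root_url, 0)])
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # page is at the depth limit; do not expand its links
        for link in fetch_links(url)[:max_breadth]:
            if link in seen:
                continue
            if select and not re.search(select, link):
                continue
            if exclude and re.search(exclude, link):
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return visited
```

With `max_depth=1` the crawler fetches the root page and its directly linked pages only; the `select` pattern below keeps the crawl on one domain.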
Use Case Examples
- Crawling a website to collect product descriptions for market analysis.
- Extracting blog posts from a specific domain with content formatted in markdown.
- Gathering images and text from a set of related web pages for content curation.
Properties
| Name | Meaning |
|---|---|
| URL | The root URL to begin the crawl, serving as the starting point for the web crawler. |
| Options | A collection of optional parameters to customize the crawl behavior and output. |
Output
JSON
- results - Array of objects, one per crawled page:
    - url - The URL of the crawled page.
    - content - Extracted content from the page, formatted as specified (markdown or text).
    - images - Optional array of image URLs, included if 'Include Images' is enabled.
    - favicon - Optional favicon URL, included if 'Include Favicon' is enabled.
- usage - Optional credit usage information, included if 'Include Usage' is enabled.
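A payload with this shape can be consumed as shown below; the field names follow the structure above, while the sample values are invented for illustration.

```python
# Sample payload matching the documented output shape (values are invented).
payload = {
    "results": [
        {
            "url": "https://example.com/blog/post-1",
            "content": "# Post 1\n\nBody text...",
            "images": ["https://example.com/img/hero.png"],
            "favicon": "https://example.com/favicon.ico",
        }
    ],
    "usage": {"credits": 1},
}

def summarize(payload):
    """Return (url, content length) pairs, tolerating absent optional fields."""
    return [(page["url"], len(page.get("content", "")))
            for page in payload["results"]]
```

Because `images`, `favicon`, and `usage` only appear when their options are enabled, downstream code should treat them as optional, as `page.get(...)` does here.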
Dependencies
- This node requires access to a web crawling service or API capable of fetching and parsing web pages. It may require API authentication credentials to operate.
Troubleshooting
- An invalid or unreachable root URL causes the crawl to fail.
- Incorrect regex patterns in the select or exclude options can filter URLs in unexpected ways.
- Setting very high max depth or breadth values can cause long execution times or timeouts.
- If instructions are provided but chunks per source is set outside the allowed range (1-5), the node may throw an error.
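A pre-flight check can catch the pitfalls above before the crawl runs. The 1-5 range and the option names come from this page; the function itself is an illustrative sketch, not part of the node.

```python
import re
from urllib.parse import urlparse

def validate_crawl_options(url, select=None, exclude=None,
                           chunks_per_source=None, instructions=None):
    """Collect human-readable problems instead of failing mid-crawl."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append(f"root URL is not a valid http(s) URL: {url!r}")
    # Compile each regex option up front so a bad pattern is reported by name.
    for name, pattern in (("select", select), ("exclude", exclude)):
        if pattern is not None:
            try:
                re.compile(pattern)
            except re.error as exc:
                problems.append(f"{name} pattern does not compile: {exc}")
    if instructions and chunks_per_source is not None \
            and not 1 <= chunks_per_source <= 5:
        problems.append("chunks per source must be in the range 1-5 "
                        "when instructions are provided")
    return problems
```

An empty return value means the options passed every check; otherwise each entry describes one problem to surface to the user.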
Links
- HTTP Overview - MDN - General information about web protocols relevant to crawling.
- Web Crawler - Wikipedia - Background information on web crawling technology and techniques.