
Spider

爬虫 (Crawler)


Overview

This node, named "Spider," is designed for web scraping and data extraction. It supports two main resources: URL collection rules ("网址采集规则") and data processing rules. With the "网址采集规则" resource and the "列表页" (list page) operation, it crawls the given starting URLs, collects the links it finds, and filters them by substrings that must or must not appear in each link.

Common scenarios where this node is beneficial include:

  • Crawling multiple web pages starting from given URLs to gather lists of target links.
  • Filtering collected URLs by specifying substrings that must or must not be present.
  • Extracting structured data from web pages for further processing or automation workflows.

Practical example:

  • You want to scrape product listing pages from an e-commerce site. You provide starting URLs (one per line), specify HTML regions to start and end scraping, and filter links to only those containing certain keywords (e.g., "product") while excluding others (e.g., "ads"). The node returns a deduplicated list of URLs matching these criteria.

Properties

| Name | Meaning |
| --- | --- |
| 采集网址 (url) | Starting URLs for scraping; multiple URLs can be provided, separated by new lines. |
| 开始采集区域 (start) | Optional string marking where in the HTML content link collection begins. |
| 结束采集区域 (end) | Optional string marking where in the HTML content link collection stops. |
| 链接包含 (exist) | Include only links containing any of the specified substrings (newline-separated). |
| 链接不包含 (noExist) | Exclude links containing any of the specified substrings (newline-separated). |

Note: The last four properties are shown only when the resource is "url" and the operation is "conventional" (which corresponds to the "列表页" operation).
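
As a rough illustration, a list-page configuration might look like the sketch below. All values are hypothetical; the field names (url, start, end, exist, noExist) are the internal names from the table above, and multi-value fields are newline-separated strings.

// Hypothetical values for the "列表页" (list page) operation; not taken from a real site.
const listPageParams = {
  url: "https://shop.example.com/list?page=1\nhttps://shop.example.com/list?page=2",
  start: '<div class="product-grid">', // begin collecting links after this marker
  end: '<div class="pagination">',     // stop collecting links at this marker
  exist: "product",                    // keep only links whose URL contains "product"
  noExist: "ads",                      // drop links whose URL contains "ads"
};

With this configuration the node would visit both listing pages and return a deduplicated list of product links that do not contain "ads".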

Output

The node outputs an array of JSON objects representing the collected URLs after filtering and deduplication. Each object contains at least a currentUrl field holding the scraped URL string.

Example output structure:

[
  {
    "currentUrl": "https://example.com/page1"
  },
  {
    "currentUrl": "https://example.com/page2"
  }
]

No binary data output is produced by this operation.
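
The output can be consumed like any array of objects keyed by currentUrl. A minimal, self-contained JavaScript sketch (the variable name results is hypothetical and simply stands for the output items shown above):

// Collect the plain URL strings from the node's output items.
const results = [
  { currentUrl: "https://example.com/page1" },
  { currentUrl: "https://example.com/page2" },
];
const urls = results.map((item) => item.currentUrl);
console.log(urls); // ["https://example.com/page1", "https://example.com/page2"]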

Dependencies

  • Uses concurrency-controlled asynchronous URL fetching and processing.
  • Relies on the following external libraries (a rough sketch of how they might fit together appears after this list):
    • lodash for deep cloning and unique filtering.
    • p-map for concurrent promise mapping.
    • cheerio for HTML parsing and manipulation.
    • Node.js built-in modules like url and path.
  • Requires network access to fetch URLs.
  • No explicit API keys or credentials are needed for this operation.
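
The node's actual implementation is not reproduced here, but the dependency list suggests a pipeline along these lines: fetch each starting URL with a concurrency limit, narrow the HTML to the configured region, extract and resolve anchor links, apply the include/exclude substring filters, and deduplicate. The following is a minimal sketch under those assumptions; all function and variable names are illustrative, not the node's own code.

import pMap from "p-map";
import * as cheerio from "cheerio";
import { uniqBy } from "lodash";

interface CollectedUrl { currentUrl: string; }

async function collectLinks(
  startUrls: string[],
  opts: { start?: string; end?: string; exist?: string[]; noExist?: string[] }
): Promise<CollectedUrl[]> {
  // Fetch every starting page, at most five at a time (the limit noted under Troubleshooting).
  const pages = await pMap(
    startUrls,
    async (url) => ({ url, html: await (await fetch(url)).text() }),
    { concurrency: 5 }
  );

  const found: CollectedUrl[] = [];
  for (const { url, html } of pages) {
    // Optionally narrow the HTML to the region between the start and end markers.
    let region = html;
    if (opts.start && region.includes(opts.start)) region = region.slice(region.indexOf(opts.start));
    if (opts.end && region.includes(opts.end)) region = region.slice(0, region.indexOf(opts.end));

    const $ = cheerio.load(region);
    $("a[href]").each((_, el) => {
      const href = $(el).attr("href");
      if (!href) return;
      let absolute: string;
      try {
        absolute = new URL(href, url).toString(); // resolve relative links against the page URL
      } catch {
        return; // skip malformed hrefs
      }
      const included = !opts.exist?.length || opts.exist.some((s) => absolute.includes(s));
      const excluded = opts.noExist?.some((s) => absolute.includes(s)) ?? false;
      if (included && !excluded) found.push({ currentUrl: absolute });
    });
  }
  // Return one item per unique URL, matching the output structure described above.
  return uniqBy(found, "currentUrl");
}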

Troubleshooting

  • Empty results: If no URLs are returned, verify that the starting URLs are correct and accessible, and that the "开始采集区域" and "结束采集区域" selectors correctly delimit the HTML region containing links.
  • Incorrect filtering: Ensure that the "链接包含" and "链接不包含" fields contain substrings that actually appear in the desired links. These filters are case-sensitive and applied as plain substring matches (a tiny illustration follows this list).
  • Concurrency limits: The node uses a concurrency limit of 5 for fetching URLs. If you experience timeouts or slow performance, consider network conditions or server rate limiting.
  • Malformed URLs: Input URLs should be valid and properly formatted; otherwise, the node may skip or fail to process them.
  • Node errors: Common error messages may relate to network failures or invalid HTML parsing. Check connectivity and input correctness.
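
To illustrate the case-sensitive substring matching used by the "链接包含" / "链接不包含" filters (the URL and filter values below are hypothetical):

const link = "https://example.com/Product/123";
link.includes("product");             // false: matching is a case-sensitive substring test
link.includes("Product");             // true
link.includes("example.com/Product"); // true: any substring of the URL can be used as a filter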
