
Spider

爬虫 (Crawler)

Overview

This node, named "Spider," is designed for web scraping and data extraction. It supports two main resources: URL collection rules ("网址采集规则") and data processing rules. For the URL collection resource with the "列表页" (list page) operation, it takes one or more starting URLs and extracts links from those pages based on the specified inclusion and exclusion criteria.

Common scenarios where this node is beneficial include:

  • Automatically gathering lists of URLs from paginated web pages.
  • Filtering collected URLs by keywords they must contain or must not contain.
  • Preparing a set of URLs for further scraping or processing in subsequent workflow steps.

For example, you can input multiple start URLs separated by newlines, specify HTML regions to start and end extraction, and define substrings that links should or should not contain. The node will then crawl these pages concurrently, extract matching URLs, and output a deduplicated list.
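A minimal sketch of that extraction step, assuming a cheerio-based implementation; the ListPageRule interface and extractLinks helper are illustrative names rather than the node's actual code, and treating "链接包含" as "the link must contain at least one of the listed substrings" is an assumption.

```typescript
// Illustrative sketch only (not the node's source). Assumes cheerio-style parsing.
import * as cheerio from 'cheerio';

interface ListPageRule {
  start?: string;    // 开始采集区域: substring marking where extraction begins
  end?: string;      // 结束采集区域: substring marking where extraction stops
  exist: string[];   // 链接包含: substrings a link must contain (any-match is an assumption)
  noExist: string[]; // 链接不包含: substrings a link must not contain
}

function extractLinks(html: string, rule: ListPageRule): string[] {
  // Narrow the HTML to the configured region, if any.
  let region = html;
  if (rule.start) {
    const i = region.indexOf(rule.start);
    if (i >= 0) region = region.slice(i);
  }
  if (rule.end) {
    const j = region.indexOf(rule.end);
    if (j >= 0) region = region.slice(0, j);
  }

  // Collect href values and apply the include/exclude filters.
  const $ = cheerio.load(region);
  const links: string[] = [];
  $('a[href]').each((_, el) => {
    const href = $(el).attr('href') ?? '';
    const included = rule.exist.length === 0 || rule.exist.some((s) => href.includes(s));
    const excluded = rule.noExist.some((s) => href.includes(s));
    if (href && included && !excluded) links.push(href);
  });

  // Deduplicate, matching the documented output behaviour.
  return [...new Set(links)];
}
```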

Properties

  • 采集网址 (url): One or more starting URLs to begin crawling. Multiple URLs can be entered, separated by newlines.
  • 开始采集区域 (start): Optional string marking where in the HTML content link extraction begins.
  • 结束采集区域 (end): Optional string marking where in the HTML content link extraction stops.
  • 链接包含 (exist): Substrings that extracted links must contain. Multiple values can be separated by newlines.
  • 链接不包含 (noExist): Substrings that extracted links must NOT contain. Multiple values can be separated by newlines.

Note: The last four properties are only shown when the resource is "url" and the operation is "conventional" (which corresponds to the "列表页" operation).
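As a hypothetical example of how these fields might be filled in (the URLs, region markers, and substrings below are invented for illustration; the object keys mirror the internal names shown in parentheses above):

```typescript
// Hypothetical parameter values for the list-page operation.
const listPageParameters = {
  url: [
    'https://example.com/news/page/1',
    'https://example.com/news/page/2',
  ].join('\n'),                            // 采集网址: newline-separated start URLs
  start: '<div class="article-list">',     // 开始采集区域: region start marker
  end: '<div class="pagination">',         // 结束采集区域: region end marker
  exist: '/article/',                      // 链接包含: links must contain this
  noExist: ['/tag/', '/login'].join('\n'), // 链接不包含: links must not contain these
};
```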

Output

The node outputs one item per unique URL found during crawling. Each item's json field contains at least a currentUrl property holding the extracted URL string.

The output is deduplicated by URL to avoid repeated entries.

No binary data output is produced by this operation.
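For illustration, the items returned by this operation might look like the following (the URLs are made up):

```typescript
// Illustrative output shape: one item per deduplicated URL, each under a json field.
const exampleOutput = [
  { json: { currentUrl: 'https://example.com/article/123' } },
  { json: { currentUrl: 'https://example.com/article/456' } },
];
```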

Dependencies

  • The node uses concurrency-limited asynchronous crawling to fetch and parse URLs (see the sketch after this list).
  • It depends on external libraries such as:
    • p-map for concurrent promise mapping.
    • cheerio for HTML parsing and DOM manipulation.
    • lodash for utility functions like cloning and flattening arrays.
  • No direct external API keys or credentials are required for this URL collection operation.
  • Network access is necessary to fetch the target web pages.
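The sketch below illustrates the kind of concurrency-limited crawl described above, combining p-map and cheerio. The fetchPage and crawlListPages helpers are hypothetical; only the concurrency limit of 5 is taken from this document.

```typescript
// Illustrative sketch of concurrency-limited crawling (not the node's source).
import pMap from 'p-map';
import * as cheerio from 'cheerio';

async function fetchPage(url: string): Promise<string> {
  // Requires network access; uses the global fetch available in Node.js 18+.
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Request failed: ${res.status} ${url}`);
  return res.text();
}

async function crawlListPages(startUrls: string[]): Promise<string[]> {
  // Fetch at most 5 pages at a time, mirroring the documented limit.
  const pages = await pMap(startUrls, fetchPage, { concurrency: 5 });

  // Extract hrefs from every page and flatten into one deduplicated list.
  const links = pages.flatMap((html) => {
    const $ = cheerio.load(html);
    return $('a[href]')
      .map((_, el) => $(el).attr('href') ?? '')
      .get()
      .filter((href) => href.length > 0);
  });
  return [...new Set(links)];
}
```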

Troubleshooting

  • Empty output or no URLs found: Check that the start URLs are correct and accessible. Verify that the "开始采集区域" and "结束采集区域" selectors correctly delimit the HTML region containing links. Also ensure that the "链接包含" and "链接不包含" filters are not too restrictive.
  • Errors related to network requests: Ensure the node has internet access and the target websites are reachable. Some sites may block automated scraping.
  • Performance issues: The node limits concurrency to 5 simultaneous requests; increasing concurrency would require code changes and might risk IP blocking.
  • Malformed URLs or unexpected results: Confirm that the input URLs are valid and that the page structure matches expectations for the extraction rules.

Links and References

  • Cheerio Documentation – Used for HTML parsing and DOM traversal.
  • p-map GitHub – Controls concurrency of asynchronous operations.
  • General web scraping best practices and legal considerations should be reviewed before use.