
Spider

爬虫 (Crawler)


Overview

This node, named "Spider," provides data processing capabilities focused on transforming input JSON data according to user-defined rules. It is especially useful for cleaning, extracting, replacing, or augmenting text fields within structured data. Common scenarios include:

  • Extracting specific substrings or patterns from text fields using regular expressions or delimiters.
  • Replacing or removing unwanted content such as line breaks, hyperlinks, or HTML tags.
  • Adding prefixes, suffixes, or default/fixed values to fields.
  • Downloading and rewriting image URLs embedded in HTML content.

For example, a user might use this node to clean scraped web data by removing HTML tags, extract email addresses from text fields, or download images referenced in the data while updating their URLs accordingly.
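The kind of transformation described above can be illustrated with a small sketch. Note that `stripHtml` and `extractEmail` here are simplified stand-ins for the node's stripHtmlTags and extractEmail rules, not its actual implementation:

```typescript
// Simplified stand-ins for two of the node's rules (illustration only).
function stripHtml(value: string): string {
  // Remove anything that looks like an HTML tag.
  return value.replace(/<[^>]*>/g, "");
}

function extractEmail(value: string): string {
  // Return the first email-like substring, or "" if none is found.
  const match = value.match(/[\w.+-]+@[\w-]+\.[\w.-]+/);
  return match ? match[0] : "";
}

// A scraped item before and after processing:
const item = {
  title: "<h1>Hello</h1>",
  contact: "Reach us at sales@example.com for details",
};

const cleaned = {
  title: stripHtml(item.title),        // "Hello"
  contact: extractEmail(item.contact), // "sales@example.com"
};
```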

Properties

Name: 字段处理 (map)
Meaning: A collection of field processing rules. Each entry specifies:

  • 字段名称 (Field Name): The name of the field to process.
  • 数据处理规则 (Data Processing Rules): One or more rules applied sequentially to the field's value.

Each 数据处理规则 (rule) supports the following processing methods:

  • 正则提取 (extract): Extract text matching a regular expression.
  • 正则替换 (replace): Replace text matching a pattern with a specified replacement.
  • 文本提取 (extractText): Extract text between specified start and end strings.
  • 文本替换 (replaceText): Replace occurrences of specified text with another.
  • 清除换行符 (removeLineBreaks): Remove all line breaks.
  • 清除超链接 (removeLinks): Remove hyperlinks.
  • 添加前缀 (addPrefix): Add a prefix string to the field content.
  • 添加后缀 (addSuffix): Add a suffix string to the field content.
  • 设置默认值 (setDefault): Set a default value if the field is empty.
  • 设置固定值 (setFixedText): Set a fixed value regardless of current content.
  • 清除Html标签 (stripHtmlTags): Remove all HTML tags.
  • 提取Url链接 (extractUrl): Extract URL links.
  • 提取电子邮件 (extractEmail): Extract email addresses.
  • 图片下载 (imageDown): Download images referenced in the field's HTML content and update their URLs.
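Because rules run sequentially, each rule receives the output of the previous one. A minimal sketch of this pipeline pattern (the rule and option names below are assumptions modeled on the labels above, not the node's real code):

```typescript
type Rule = (value: string) => string;

// Hypothetical implementations of a few of the processing methods.
const rules: Record<string, (opts: any) => Rule> = {
  extract: ({ reg }) => (v) => (v.match(new RegExp(reg)) ?? [""])[0],
  replaceText: ({ find, replace }) => (v) => v.split(find).join(replace),
  removeLineBreaks: () => (v) => v.replace(/\r?\n/g, ""),
  addPrefix: ({ prefix }) => (v) => prefix + v,
  setDefault: ({ value }) => (v) => (v === "" ? value : v),
};

// Apply the configured rules in order to a single field value.
function applyRules(
  value: string,
  configured: Array<{ name: string; opts?: any }>,
): string {
  return configured.reduce(
    (acc, { name, opts }) => rules[name](opts ?? {})(acc),
    value,
  );
}
```

For example, `applyRules("line1\nline2", [{ name: "removeLineBreaks" }, { name: "addPrefix", opts: { prefix: ">> " } }])` yields `">> line1line2"`.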

Additional parameters are available depending on the chosen rule, such as:

  • 查找内容 (find) and 替换为 (replace) for replace/replaceText.
  • 开始字符串 (start) and 结束字符串 (end) for extractText.
  • 正则表达式 (reg) for extract.
  • 下载地址 (downUrl) and 内容地址前缀修改 (downName) for imageDown.
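Putting the pieces together, one field entry with its rules might look like the following. The property names (`name`, `rules`, `method`, etc.) are illustrative assumptions based on the parameter labels above, not the node's exact internal schema:

```typescript
// Hypothetical shape of one 字段处理 (map) entry.
const fieldRule = {
  name: "content", // 字段名称: which input field to process
  rules: [
    // 文本替换: replace non-breaking spaces with regular spaces
    { method: "replaceText", find: "\u00a0", replace: " " },
    // 文本提取: keep only the text between the start and end strings
    { method: "extractText", start: "<p>", end: "</p>" },
    // 图片下载: download images and rewrite their URL prefix
    { method: "imageDown", downUrl: "https://cdn.example.com/img/", downName: "/static/" },
  ],
};
```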

Output

The node outputs an array of JSON objects corresponding to the processed input items. Each item's JSON contains the original fields with modifications applied according to the configured rules.

If the "图片下载" (imageDown) rule is used, the node downloads images referenced in the HTML content of the specified fields, replaces the image URLs with new paths, and manages these downloads internally. The binary data of downloaded images is handled by the node but not directly output; instead, the JSON fields contain updated URLs pointing to the downloaded images.
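The URL-rewriting half of this behavior can be sketched as follows. This is a simplified regex-based stand-in (the node itself parses the HTML with cheerio, and additionally downloads the image binaries, which is omitted here); `newPrefix` is an illustrative parameter:

```typescript
// Rewrite every <img src="..."> so it points at a downloaded copy
// under newPrefix, keeping only the original file name.
function rewriteImageUrls(html: string, newPrefix: string): string {
  return html.replace(
    /(<img\b[^>]*\bsrc=")([^"]+)(")/g,
    (_m: string, pre: string, src: string, post: string) => {
      const file = src.split("/").pop() ?? src;
      return pre + newPrefix + file + post;
    },
  );
}
```

For example, `rewriteImageUrls('<img src="https://a.com/pic.png">', "/static/")` returns `'<img src="/static/pic.png">'`.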

Dependencies

  • Uses lodash for deep cloning and utility functions.
  • Uses @cjs-exporter/p-map for concurrent asynchronous processing.
  • Uses cheerio for parsing and manipulating HTML content.
  • Requires network access to download images when using the image download feature.
  • No explicit external API keys or credentials are required for the data processing itself.
  • Node expects standard n8n environment with support for async execution and file system path utilities.

Troubleshooting

  • Empty or unchanged fields: Ensure that the field names specified in the rules exactly match those in the input JSON. Matching is case-sensitive, and a mismatched or missing field is simply left unchanged.
  • Regular expression errors: An invalid regex pattern in the extract or replace rules may cause failures or unexpected results. Validate regex syntax before use.
  • Image download failures: If image URLs are invalid or inaccessible, downloads may fail silently or raise errors. Verify the URLs and network connectivity.
  • HTML parsing issues: Fields expected to contain HTML must be valid HTML strings for cheerio to parse correctly. Malformed HTML may lead to incomplete processing.
  • Concurrency limits: The node uses a concurrency limit of 5 for downloads and URL fetching; very large batches may require adjusting the concurrency or splitting the input.

Common error messages relate to invalid parameters or network errors during image downloads. Reviewing the configuration of each rule and ensuring proper input data format usually resolves these issues.
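The concurrency pattern the node gets from p-map can be sketched with a minimal worker-pool implementation (a simplified stand-in, not the p-map library itself; `concurrency` must be at least 1):

```typescript
// Map an async function over items with at most `concurrency` calls in flight,
// preserving the input order in the results.
async function pMapLite<T, R>(
  items: T[],
  mapper: (item: T) => Promise<R>,
  concurrency: number,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    // Safe without locks: index claiming happens synchronously on the event loop.
    while (next < items.length) {
      const i = next++;
      results[i] = await mapper(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(concurrency, items.length) }, worker),
  );
  return results;
}
```

With a limit of 5, as the node uses, no more than 5 downloads run at once regardless of batch size.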

Links and References

  • Cheerio Documentation – For understanding HTML parsing and manipulation.
  • Lodash Documentation – Utility functions used for data cloning and manipulation.
  • p-map GitHub – For concurrency control in asynchronous operations.
  • Regular expressions tutorials for crafting extraction and replacement patterns.
