Spider

爬虫 (Crawler)

Actions (2)

网址采集规则 (URL Collection Rules) Actions
- 列表页 (List Page)
数据处理规则 (Data Processing Rules) Actions
- 数据处理 (Data Processing)

Overview

The "Spider" node transforms input JSON data according to user-defined rules. Specifically, the "数据处理规则" (Data Processing Rules) resource with the "数据处理" (Data Processing) operation applies text and content manipulation rules to specified fields of the input data items.

Common scenarios where this node is beneficial include:

- Cleaning and normalizing scraped or imported textual data by removing unwanted characters, HTML tags, or links.
- Extracting specific substrings or patterns using regular expressions or delimiters.
- Replacing or modifying text content based on search-and-replace rules.
- Adding prefixes or suffixes to field values for formatting purposes.
- Setting default or fixed values when certain data is missing or needs standardization.
- Downloading images referenced in the data and updating their URLs accordingly.

Practical example:
Suppose you have a dataset of product descriptions containing HTML tags and embedded image URLs. You can configure this node to strip HTML tags from the description fields, extract email addresses or URLs, replace certain keywords, and download images to a local folder while updating the image source paths in the data.
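
The cleaning steps in this example can be sketched outside the node. The following is a minimal illustration in Python using only the standard library, not the node's own implementation; the helper names (`strip_html_tags`, `extract_emails`), the sample description, and the email pattern are all assumptions made for the sketch.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and <img src> values while discarding the tags."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)

    def handle_data(self, data):
        self.text_parts.append(data)


def strip_html_tags(html):
    """Return (plain_text, image_urls) for an HTML fragment."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by removed tags.
    text = " ".join("".join(parser.text_parts).split())
    return text, parser.image_urls


# Deliberately simple pattern (an assumption); real email matching may need more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return every substring that looks like an email address."""
    return EMAIL_RE.findall(text)


# Hypothetical product description, as it might arrive from a scraper.
description = (
    '<p>Sturdy <b>steel</b> kettle.<img src="/img/kettle.jpg"> '
    'Questions? Mail sales@example.com</p>'
)
plain, images = strip_html_tags(description)
emails = extract_emails(plain)
```

Here `plain` holds the description without markup, `images` lists the image sources that an image download rule would fetch, and `emails` holds the extracted addresses.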

Properties

Name	Meaning
字段处理 (Field Processing)	A collection of field processing rules. Each entry specifies 字段名称 (Field Name), the name of the field to process, and 数据处理规则 (Data Processing Rules), one or more rules applied sequentially to the field's value.
处理方式 (Processing Method)	The type of processing to apply. Options: 正则提取 (extract): extract using regex; 正则替换 (replaceText): replace using regex; 文本提取 (extractText): extract the substring between start and end strings; 文本替换 (replace): replace a substring; 清除换行符 (removeLineBreaks): remove line breaks; 清除超链接 (removeLinks): remove hyperlinks; 添加前缀 (addPrefix): add a prefix string; 添加后缀 (addSuffix): add a suffix string; 设置默认值 (setDefault): set a default value if empty; 设置固定值 (setFixedText): set a fixed value; 清除Html标签 (stripHtmlTags): remove HTML tags; 提取Url链接 (extractUrl): extract URLs; 提取电子邮件 (extractEmail): extract emails; 图片下载 (imageDown): download images and update URLs.
查找内容 (Find)	Text to find (used with replaceText and replace).
替换为 (Replace With)	Replacement text (used with replaceText and replace).
开始字符串 (Start String)	Start string for text extraction (used with extractText).
结束字符串 (End String)	End string for text extraction (used with extractText).
正则表达式 (Regular Expression)	Regular expression pattern (used with extract). Default is `.*`.
字段内容 (Field Content)	Content string used for prefix, suffix, default, or fixed text settings.
下载地址 (Download URL)	Base URL for downloading images (used with imageDown).
内容地址前缀修改 (Content URL Prefix Modification)	Prefix modification for downloaded content URLs (used with imageDown).
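
To make the "rules applied sequentially" behaviour concrete, here is a rough Python sketch of a rule pipeline covering a subset of the processing types above. It is not the node's source code: the rule dictionary keys (`type`, `regex`, `start`, `end`, `find`, `replaceWith`, `content`) are hypothetical names chosen for the sketch, and returning an empty string when nothing matches is an assumption.

```python
import re


def apply_rule(value, rule):
    """Apply one rule dict to a string value and return the new value."""
    kind = rule["type"]
    if kind == "extract":  # 正则提取: first regex match, "" if none (assumed)
        match = re.search(rule.get("regex", ".*"), value)
        return match.group(0) if match else ""
    if kind == "extractText":  # 文本提取: text between start and end strings
        start = value.find(rule["start"])
        if start == -1:
            return ""
        start += len(rule["start"])
        end = value.find(rule["end"], start)
        return value[start:end] if end != -1 else ""
    if kind == "replace":  # 文本替换: literal find and replace
        return value.replace(rule["find"], rule["replaceWith"])
    if kind == "removeLineBreaks":  # 清除换行符
        return value.replace("\r", "").replace("\n", "")
    if kind == "addPrefix":  # 添加前缀
        return rule["content"] + value
    if kind == "addSuffix":  # 添加后缀
        return value + rule["content"]
    if kind == "setDefault":  # 设置默认值: only fills in empty values
        return value if value else rule["content"]
    if kind == "setFixedText":  # 设置固定值
        return rule["content"]
    raise ValueError(f"unsupported rule type: {kind}")


def process_item(item, field_rules):
    """Return a copy of item with each field's rules applied in order."""
    result = dict(item)
    for field, rules in field_rules.items():
        if field not in result:
            continue  # fields absent from the input are left alone
        value = result[field]
        for rule in rules:
            value = apply_rule(value, rule)
        result[field] = value
    return result


# Hypothetical input item and rule configuration.
item = {"title": "Price:\n [USD 19.90] today", "brand": ""}
field_rules = {
    "title": [
        {"type": "removeLineBreaks"},
        {"type": "extractText", "start": "[", "end": "]"},
        {"type": "replace", "find": "USD ", "replaceWith": "$"},
        {"type": "addSuffix", "content": " incl. tax"},
    ],
    "brand": [{"type": "setDefault", "content": "Unknown"}],
    "sku": [{"type": "setFixedText", "content": "n/a"}],
}
processed = process_item(item, field_rules)
```

Because each rule receives the output of the previous one, order matters: in this sketch the line break is removed before the bracketed text is extracted, and the suffix is added last. The `sku` rules have no effect because the input item has no such field.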

Output

The node outputs an array of JSON objects representing the processed data items. Each item's fields are transformed according to the configured rules. For example, text fields may have cleaned or extracted content, replaced substrings, or updated URLs for images.

If image downloading is involved, the node downloads images concurrently and updates the corresponding field values to point to the new local or prefixed URLs.

Binary data output is not explicitly handled; instead, image files are downloaded as side effects, and their references in JSON are updated accordingly.
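
The download-then-rewrite behaviour can be approximated as follows. This is a sketch under stated assumptions rather than the node's actual code: it assumes that 下载地址 acts as the base for resolving relative image paths (`base_url`) and that 内容地址前缀修改 supplies the prefix of the rewritten URL (`new_prefix`). The `fetch` coroutine is a hypothetical caller-supplied downloader, replaced here by a stand-in so the example runs without network access.

```python
import asyncio
import posixpath
from urllib.parse import urljoin, urlparse

MAX_CONCURRENT = 5  # the simultaneous download limit stated in this document


async def download_all(image_srcs, base_url, new_prefix, fetch):
    """Download every image with at most MAX_CONCURRENT requests in flight.

    Returns a dict mapping each original src to its rewritten URL.
    """
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def download_one(src):
        absolute = urljoin(base_url, src)  # resolve relative srcs against the base
        async with semaphore:  # waits here once 5 downloads are running
            await fetch(absolute)
        filename = posixpath.basename(urlparse(absolute).path)
        return src, new_prefix.rstrip("/") + "/" + filename

    pairs = await asyncio.gather(*(download_one(s) for s in image_srcs))
    return dict(pairs)


# Stand-in downloader that records how many calls overlap (no real HTTP).
stats = {"in_flight": 0, "peak": 0}


async def fake_fetch(url):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # simulate network latency
    stats["in_flight"] -= 1


# Hypothetical image paths, base URL, and target prefix.
srcs = [f"/img/p{i}.jpg" for i in range(12)]
mapping = asyncio.run(
    download_all(srcs, "https://shop.example.com/", "/media/products", fake_fetch)
)
```

The semaphore is what keeps the number of overlapping downloads at five: the remaining tasks wait until a slot frees up, which is also why inputs with many images take longer to process.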

Dependencies

- Requires internet access if image downloading is enabled.
- Uses concurrency control for HTTP requests (up to 5 simultaneous downloads).
- Relies on external libraries bundled with the node for:
  - URL parsing and path handling.
  - HTML parsing and manipulation.
  - HTTP requests for downloading images.
- Requires configuration of base download URL and optional prefix for image URLs.

Troubleshooting

- Image download failures: If images fail to download, check network connectivity and ensure the provided base download URL is correct and accessible.
- Invalid regular expressions: Incorrect regex patterns may cause errors or unexpected results. Validate regex syntax before use.
- Empty or missing fields: If a specified field does not exist in the input data, no processing occurs on that field.
- Unsupported HTML content: Complex or malformed HTML might not be fully cleaned by the HTML tag stripping rule.
- Concurrency limits: Large numbers of images may slow down processing because downloads are limited to 5 at a time; adjust the workload if necessary.
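
Two of the checks above, validating a pattern and spotting missing fields, can be run before data reaches the node. The sketch below uses hypothetical helper names. It validates against Python's `re` engine, whose syntax can differ from the engine the node uses (presumably JavaScript's, which is an assumption), so it illustrates the pre-check idea rather than guaranteeing identical behaviour.

```python
import re


def validate_regex(pattern):
    """Return None if the pattern compiles, else a readable error message."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return f"invalid regex {pattern!r}: {exc}"
    return None


def missing_fields(item, configured_fields):
    """List configured field names that are absent from an input item."""
    return [name for name in configured_fields if name not in item]


# Hypothetical checks on a pattern and on one input item.
problem = validate_regex("(unclosed")
absent = missing_fields({"title": "Kettle"}, ["title", "price"])
```

A non-`None` result from `validate_regex` signals a pattern to fix before configuring the rule, and a non-empty list from `missing_fields` explains why a configured field was left unprocessed.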

Common error messages may relate to:

- Network timeouts during image downloads.
- Invalid parameters for text processing rules.
- Missing required properties in the input data.

Resolving these typically involves verifying input correctness, adjusting rule configurations, and ensuring network availability.