Overview
This node links records between two datasets using fuzzy matching. Use it to find approximate matches between records in a source dataset and a target dataset, for example in deduplication, data integration, or record reconciliation. The node supports multiple matching algorithms, customizable field mappings with weights and thresholds, preprocessing options, keyword-based score boosting, and three match modes (best match, all matches above threshold, and one-to-one matching). It outputs matched records, optionally with detailed scoring information, and can include unmatched records for debugging.
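The three match modes can be illustrated with a minimal sketch. It assumes candidate pairs have already been scored. The function name `select_matches`, the mode strings, and the greedy strategy for one-to-one matching are illustrative assumptions, not the node's actual implementation.

```python
from typing import Dict, List, Tuple

# (source index, target index, score on a 0-100 scale)
Pair = Tuple[int, int, float]


def select_matches(pairs: List[Pair], threshold: float, mode: str) -> List[Pair]:
    """Pick matches from scored candidate pairs according to a match mode."""
    # Keep only candidates at or above the threshold, strongest first.
    kept = sorted((p for p in pairs if p[2] >= threshold), key=lambda p: -p[2])

    if mode == "all":
        return kept

    if mode == "best":
        best: Dict[int, Pair] = {}
        for p in kept:
            # The first pair seen for a source is its highest-scoring one.
            best.setdefault(p[0], p)
        return list(best.values())

    if mode == "one_to_one":
        # Greedy assignment (an assumption): each source and each target
        # is used at most once, strongest pairs claimed first.
        used_sources, used_targets, chosen = set(), set(), []
        for p in kept:
            if p[0] not in used_sources and p[1] not in used_targets:
                chosen.append(p)
                used_sources.add(p[0])
                used_targets.add(p[1])
        return chosen

    raise ValueError(f"unknown mode: {mode}")
```

With greedy one-to-one selection, a source can end up unmatched when a stronger pair has already claimed its only viable target, so this mode may return fewer matches than best-match mode.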
Use Case Examples
- Matching customer names and addresses between two databases to identify duplicates or related records.
- Linking product records from two different sources where exact matches are rare due to variations in spelling or formatting.
- Reconciling vendor lists by fuzzy matching company names and IDs with configurable thresholds and weights.
Properties
| Name | Meaning |
|---|---|
| Matching Algorithm | The fuzzy matching algorithm to use for comparing fields between source and target records. Options include Dice Coefficient, Jaro-Winkler Distance, Levenshtein Distance, Partial Ratio, Token Set Ratio, Token Sort Ratio, and Weighted Ratio. |
| Match Threshold | Minimum similarity score (0-100) to consider a match. |
| Field Mappings | Mappings between source and target dataset fields, including source field name, target field name, weight for the field in overall score, field-specific threshold, optional override of the matching algorithm for the field, and whether the field is required to pass its threshold for the overall match to succeed. |
| Preprocessing Options | Options to preprocess text fields before matching: converting to lowercase, normalizing Vietnamese characters, removing diacritics, removing extra spaces, removing special characters and symbols, tokenizing, and trimming whitespace. |
| Keyword Boost | Configuration to boost match scores when specific keywords are found in both source and target records, including enabling boost, keyword-weight pairs, maximum total boost, and case sensitivity. |
| Match Mode | How to handle multiple potential matches: best match only, all matches above threshold, or one-to-one matching where each target record can only match once. |
| Output Options | Options controlling output details such as including individual field scores, including overall match score, including unmatched records, maximum unmatched records to output, merging source and target records, and prefixes for merged fields. |
| Advanced Options | Advanced matching options including case sensitivity, ignoring previous matches, length filter ratio to reject matches with large length differences, maximum candidates to evaluate, maximum string length for comparison, parallel batch size for performance, partial threshold for partial matching algorithms, and whether all fields with thresholds must pass for a match to succeed. |
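The sketch below shows one plausible way that field weights, field-specific thresholds, and the overall match threshold could interact. The weighted average, the `FieldMapping` class, and the `score_pair` function are assumptions made for illustration. The similarity function uses Python's `difflib` ratio as a stand-in and is not one of the algorithms listed above.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional


@dataclass
class FieldMapping:
    # Hypothetical structure mirroring the Field Mappings property.
    source_field: str
    target_field: str
    weight: float = 1.0
    threshold: float = 0.0
    required: bool = False


def similarity(a: str, b: str) -> float:
    """Stand-in similarity on a 0-100 scale (difflib ratio after lowercasing and trimming)."""
    return 100.0 * SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def score_pair(
    source: Dict[str, str],
    target: Dict[str, str],
    mappings: List[FieldMapping],
    match_threshold: float,
) -> Optional[dict]:
    """Return scores for a source/target pair, or None if the pair is rejected."""
    field_scores: Dict[str, float] = {}
    weighted_sum = 0.0
    total_weight = 0.0
    for m in mappings:
        score = similarity(source.get(m.source_field, ""), target.get(m.target_field, ""))
        # A required field that misses its own threshold rejects the whole pair.
        if m.required and score < m.threshold:
            return None
        field_scores[m.source_field] = score
        weighted_sum += m.weight * score
        total_weight += m.weight
    if total_weight == 0:
        return None
    # Assumed combination rule: weighted average of the field scores.
    overall = weighted_sum / total_weight
    if overall < match_threshold:
        return None
    return {"match_score": overall, "field_scores": field_scores}
```

In this sketch a non-required field with a low score only lowers the weighted average, while a required field acts as a hard gate regardless of how well the other fields match.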
Output
JSON
- `matched` - Indicates whether the record is matched (true) or unmatched (false, if unmatched records are included).
- `match_score` - Overall similarity score of the match (0-100).
- `field_scores` - Individual similarity scores for each mapped field, if included in output options.
- `targets` - For unmatched records, an array of top candidate matches with their scores and optional field scores.
- `source_*` - Fields from the source record, optionally prefixed if merging is enabled.
- `target_*` - Fields from the target record, optionally prefixed if merging is enabled.
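As a rough illustration of how a merged output record with these fields might be assembled, consider the sketch below. The `build_output` helper, its parameters, and the default prefixes are hypothetical, and the node's actual record layout may differ. The `targets` array for unmatched records is left out for brevity.

```python
from typing import Any, Dict, Optional


def build_output(
    source: Dict[str, Any],
    target: Optional[Dict[str, Any]],
    match_score: Optional[float] = None,
    field_scores: Optional[Dict[str, float]] = None,
    source_prefix: str = "source_",
    target_prefix: str = "target_",
) -> Dict[str, Any]:
    """Merge a source record and its matched target (if any) into one flat record."""
    record: Dict[str, Any] = {"matched": target is not None}
    # Source fields are always present, prefixed to avoid name collisions.
    record.update({f"{source_prefix}{k}": v for k, v in source.items()})
    if target is not None:
        record["match_score"] = match_score
        if field_scores is not None:
            record["field_scores"] = field_scores
        record.update({f"{target_prefix}{k}": v for k, v in target.items()})
    return record
```

Prefixing both sides keeps fields with the same name in source and target (for example `name`) from overwriting each other in the merged record.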
Dependencies
- No external API dependencies; uses internal fuzzy matching algorithms and utilities.
Troubleshooting
- Missing or invalid field mappings cause the node to throw an error stating that at least one valid field mapping is required.
- If source or target datasets are empty, the node throws an error indicating no data in the respective input.
- Performance may degrade with very large datasets; using advanced options like maxCandidates, maxStringLength, lengthFilterRatio, and parallelBatchSize can help optimize performance.
- Match thresholds and field-specific thresholds need careful tuning to avoid too few or too many matches.
- Enabling 'Require All Fields Pass' can cause matches to be rejected if any required field fails its threshold, which might be unexpected if not configured properly.
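To show why the performance-related options help, here is a minimal sketch of a cheap pre-filter that runs before any expensive similarity computation. The exact semantics of `lengthFilterRatio`, `maxCandidates`, and `maxStringLength` in the node are not documented here, so the interpretation below (shorter length divided by longer length must reach the ratio, strings truncated before comparison, candidate list capped) is an assumption, as are the function names.

```python
from typing import List


def passes_length_filter(a: str, b: str, min_ratio: float) -> bool:
    """True when the shorter string is at least min_ratio of the longer one."""
    longer = max(len(a), len(b))
    if longer == 0:
        return True  # two empty strings trivially have the same length
    return min(len(a), len(b)) / longer >= min_ratio


def prefilter(
    source_value: str,
    target_values: List[str],
    min_ratio: float,
    max_candidates: int,
    max_string_length: int,
) -> List[int]:
    """Return indices of targets worth scoring, capped at max_candidates."""
    s = source_value[:max_string_length]
    candidates: List[int] = []
    for i, t in enumerate(target_values):
        # Length comparison is O(1), so it is much cheaper than fuzzy scoring.
        if passes_length_filter(s, t[:max_string_length], min_ratio):
            candidates.append(i)
            if len(candidates) >= max_candidates:
                break
    return candidates
```

Only the surviving indices would then be passed to the fuzzy matching step, which reduces the number of pairwise comparisons on large target datasets.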