Overview
This node implements a semantic text splitter that divides input text into meaningful chunks using a double-pass merging strategy based on semantic similarity. It first splits the text into sentences, then combines sentences with context buffers, creates embeddings for these combined segments, and calculates distances between them to identify natural breakpoints. After an initial chunking, it performs a second pass to merge chunks that are semantically similar above a configurable threshold, optimizing chunk boundaries for better contextual coherence.
This node is beneficial in scenarios where you need to preprocess large texts for downstream AI tasks such as embedding generation, semantic search, or summarization, ensuring that chunks maintain semantic integrity rather than arbitrary length-based splitting. For example, it can be used to prepare documents for vector databases or to segment long articles into coherent parts for question answering systems.
Properties
| Name | Meaning |
|---|---|
| Options | A collection of parameters controlling the splitting behavior; the options below are set inside this collection. |
| Buffer Size | Number of sentences combined around each sentence to provide context when creating embeddings. |
| Breakpoint Threshold Type | Method to determine chunk boundaries based on distance metrics between sentence embeddings. Options: Percentile, Standard Deviation, Interquartile, Gradient (maximum gradient change). |
| Breakpoint Threshold Amount | Manual numeric threshold (0-1) overriding the breakpoint threshold type to decide chunk boundaries. |
| Number of Chunks | Target number of chunks to create. If set to a positive number, overrides threshold-based chunking. Set to 0 to use threshold methods instead. |
| Second Pass Threshold | Similarity threshold (0-1) for merging chunks during the second pass. Higher values require chunks to be more similar to merge. |
| Min Chunk Size | Minimum number of characters per chunk after splitting and merging. |
| Max Chunk Size | Maximum number of characters allowed per chunk. Large chunks will be further split by sentences to respect this limit. |
| Sentence Split Regex | Regular expression used to split the input text into sentences. Default splits on punctuation marks followed by whitespace (e.g., periods, question marks, exclamation points). |
Output
The node outputs an array of text chunks (json field), each representing a semantically coherent segment of the original input text. Each chunk is a string containing one or more sentences merged based on semantic similarity and size constraints.
If multiple documents are processed, the output is an array of document objects, each containing:
- `pageContent`: The chunked text segment.
- `metadata`: Metadata copied from the original document.
No binary data output is produced by this node.
Dependencies
- Requires an embeddings provider passed as input to generate semantic embeddings of text segments. This typically involves an API key credential for an embedding service.
- Uses regular expressions and cosine distance calculations internally.
- No additional external services beyond the embeddings provider are required.
Troubleshooting
Issue: Output chunks are too small or too large.
- Cause: Improper configuration of `minChunkSize` and `maxChunkSize`.
- Solution: Adjust these properties to the desired character limits to control chunk sizes.
Issue: Chunk boundaries do not align well with semantic breaks.
- Cause: Inappropriate `breakpointThresholdType` or `breakpointThresholdAmount`.
- Solution: Experiment with different threshold types or manually set the threshold amount to better fit your text characteristics.
Issue: Second-pass merging merges too aggressively or not enough.
- Cause: `secondPassThreshold` value too low or too high.
- Solution: Increase the threshold to require higher similarity for merging, or decrease it to allow more merging.
Error: Embeddings input missing or invalid.
- Cause: The node requires a valid embeddings input connection.
- Solution: Ensure the node receives embeddings from a compatible upstream node configured with a valid API key credential.
Links and References
- n8n Documentation: Semantic Double-Pass Merging Text Splitter
- Concepts related to text chunking and semantic embeddings can be explored in LangChain documentation and general NLP resources on text segmentation.