Overview
This node implements a semantic text splitter that divides input text into meaningful chunks using a double-pass merging strategy based on semantic similarity. It first splits the text into sentences, then combines sentences with context buffers, creates embeddings for these combined segments, and calculates distances between them to identify natural breakpoints. After an initial chunking, it performs a second pass to merge chunks that are semantically similar above a configurable threshold, optimizing chunk boundaries for better contextual coherence.
This node is beneficial in scenarios where you need to preprocess large texts for downstream AI tasks such as embedding generation, semantic search, or summarization, ensuring that chunks maintain semantic integrity rather than arbitrary length-based splitting. For example, it can be used to prepare documents for vector databases or to segment long articles into coherent parts for question answering systems.
Properties
| Name | Meaning |
|---|---|
| Options | A collection of parameters controlling the splitting behavior; the options below are set inside this collection. |
| Buffer Size | Number of sentences combined around each sentence to provide context when creating embeddings. |
| Breakpoint Threshold Type | Method to determine chunk boundaries based on distance metrics between sentence embeddings. Options: Percentile, Standard Deviation, Interquartile, Gradient (maximum gradient change). |
| Breakpoint Threshold Amount | Manual numeric threshold (0-1) overriding the breakpoint threshold type to decide chunk boundaries. |
| Number of Chunks | Target number of chunks to create. If set to a positive number, overrides threshold-based chunking. Set to 0 to use threshold methods instead. |
| Second Pass Threshold | Similarity threshold (0-1) for merging chunks during the second pass. Higher values require chunks to be more similar to merge. |
| Min Chunk Size | Minimum number of characters per chunk after splitting and merging. |
| Max Chunk Size | Maximum number of characters allowed per chunk. Large chunks will be further split by sentences to respect this limit. |
| Sentence Split Regex | Regular expression used to split the input text into sentences. Default splits on punctuation marks followed by whitespace (e.g., periods, question marks, exclamation points). |
Output
The node outputs an array of text chunks (json field), each representing a semantically coherent segment of the original input text. Each chunk is a string containing one or more sentences merged based on semantic similarity and size constraints.
If multiple documents are processed, the output is an array of document objects, each containing:
- `pageContent`: The chunked text segment.
- `metadata`: Metadata copied from the original document.
No binary data output is produced by this node.
Dependencies
- Requires an embeddings provider passed as input to generate semantic embeddings of text segments. This typically involves an API key credential for an embedding service.
- Uses regular expressions and cosine distance calculations internally.
- No additional external services beyond the embeddings provider are required.
Troubleshooting
Issue: Output chunks are too small or too large.
- Cause: Improper configuration of `minChunkSize` and `maxChunkSize`.
- Solution: Adjust these properties to the desired character limits to control chunk sizes.
Issue: Chunk boundaries do not align well with semantic breaks.
- Cause: Inappropriate `breakpointThresholdType` or `breakpointThresholdAmount`.
- Solution: Experiment with different threshold types or manually set the threshold amount to better fit your text characteristics.
Issue: Second-pass merging merges too aggressively or not enough.
- Cause: `secondPassThreshold` value too low or too high.
- Solution: Increase the threshold to require higher similarity for merging, or decrease it to allow more merging.
Error: Embeddings input missing or invalid.
- Cause: The node requires a valid embeddings input connection.
- Solution: Ensure the node receives embeddings from a compatible upstream node configured with a valid API key credential.
Links and References
- n8n Documentation: Semantic Double-Pass Merging Text Splitter
- Concepts related to text chunking and semantic embeddings can be explored in LangChain documentation and general NLP resources on text segmentation.