Overview
This node integrates with a Zilliz vector database to build and manage Retrieval-Augmented Generation (RAG) knowledge bases. Specifically, the Process and Store Documents operation cleans, chunks, vectorizes, and stores documents into a specified Zilliz collection. It is useful for scenarios where large text documents need to be split into manageable pieces, embedded as vectors, and stored for efficient semantic search or AI-driven retrieval.
Practical examples include:
- Preparing a corpus of articles or manuals by chunking and embedding them for later semantic search.
- Storing customer support transcripts in vector form to enable fast retrieval of relevant past conversations.
- Processing research papers or legal documents to create a searchable knowledge base.
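Conceptually, the operation behaves like the short script below. This is a minimal sketch assuming the Python pymilvus client; the collection name, field names, and naive splitter are illustrative, and the node's internal implementation may differ.
```python
from pymilvus import MilvusClient

# Connect using the Zilliz cluster endpoint and API key (placeholders).
client = MilvusClient(uri="https://<cluster-endpoint>", token="<api-key>")

document = {"title": "Getting Started Guide", "content": "..."}  # one input item

# Naive fixed-size split; the real chunker also applies overlap, cleaning,
# and a minimum size (see the chunking sketch under Properties).
chunks = [document["content"][i:i + 1000]
          for i in range(0, len(document["content"]), 1000)]

# Placeholder vectors: real embeddings must be precomputed upstream.
vectors = [[0.0] * 1536 for _ in chunks]

rows = [{"title": document["title"], "text": chunk, "embedding": vector}
        for chunk, vector in zip(chunks, vectors)]

# Recent pymilvus versions return a dict with insert_count and ids.
result = client.insert(collection_name="kb_articles", data=rows)
print(result["insert_count"], result["ids"])
```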
Properties
| Name | Meaning |
|---|---|
| Database Name | Name of the Zilliz database to use (default: "default"). |
| Collection Name | Name of the Zilliz collection where documents will be stored. Must start with a letter and contain only letters, numbers, and underscores. |
| Document Content Field | The field name in the input data that contains the document's main textual content. |
| Document Title Field | The field name in the input data that contains the document's title. Defaults to "title". |
| Text Processing Options | Options controlling how the text is processed before storage (illustrated in the chunking sketch below the table): |
| - Chunk Overlap | Number of characters overlapping between consecutive chunks (default 200). |
| - Chunk Size | Maximum number of characters per text chunk (default 1000). |
| - Clean Text | Whether to remove extra whitespace and normalize text (default true). |
| - Min Chunk Size | Minimum number of characters required for a chunk to be included (default 50). |
| - Remove HTML Tags | Whether to strip HTML tags from the content (default true). |
| Embedding Settings | Settings related to vector embeddings (see the collection-setup sketch below the table): |
| - Vector Dimension | Dimensionality of the embedding vectors (default 1536). Must match the embedding model used. |
| - Embedding Field | Field name containing the precomputed vector embeddings in the input data (default "embedding"). |
| - Metric Type | Distance metric used for similarity calculations. Options: Cosine, Euclidean (L2), Inner Product (IP). |
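The Text Processing Options map onto a character-based chunker roughly like the one below. This is a sketch using the defaults from the table; the regular expressions and stepping logic are assumptions, not the node's exact code.
```python
import re

def process_text(text: str,
                 chunk_size: int = 1000,
                 chunk_overlap: int = 200,
                 min_chunk_size: int = 50,
                 clean_text: bool = True,
                 remove_html: bool = True) -> list[str]:
    """Clean a document and split it into overlapping character chunks."""
    if remove_html:
        text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    if clean_text:
        text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace
    # Guard against an overlap >= chunk size, which would never advance.
    step = max(1, chunk_size - chunk_overlap)
    chunks, start = [], 0
    while start < len(text):
        piece = text[start:start + chunk_size]
        if len(piece) >= min_chunk_size:           # drop undersized fragments
            chunks.append(piece)
        start += step
    return chunks

print(len(process_text("<p>" + "lorem ipsum " * 200 + "</p>")))  # 3 chunks
```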
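Vector Dimension and Metric Type correspond to how the target collection is created. Below is a sketch of pymilvus's quick-setup path; the keyword arguments follow the MilvusClient API, and the collection name is hypothetical.
```python
from pymilvus import MilvusClient

client = MilvusClient(uri="https://<cluster-endpoint>", token="<api-key>")

client.create_collection(
    collection_name="kb_articles",
    dimension=1536,                 # must match the embedding model's output
    metric_type="COSINE",           # or "L2" / "IP", per the node setting
    vector_field_name="embedding",  # match the node's Embedding Field
    auto_id=True,                   # let Zilliz assign primary keys
)
```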
Output
The output JSON object includes detailed information about the processing and storage result:
- success: Boolean indicating whether the operation succeeded.
- processed_document: The title of the processed document.
- chunks_created: Number of text chunks created from the original document.
- total_characters: Total character count of the processed text after cleaning.
- average_chunk_size: Average size of each chunk in characters.
- insert_count: Number of vectors successfully inserted into the collection.
- insert_ids: Array of IDs assigned to the inserted vectors.
- collection: Name of the collection where the data was stored.
- database: Name of the database used.
Note: The node expects the input data to already contain vector embeddings under the specified embedding field. It does not compute embeddings itself but requires them to be precomputed by another node.
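As an illustration of such an upstream step, the snippet below computes embeddings with the OpenAI Python SDK; text-embedding-3-small returns 1536-dimensional vectors, matching the node's default dimension. The model choice is an example, not a requirement of the node.
```python
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

chunks = ["first chunk of text", "second chunk of text"]
response = openai_client.embeddings.create(
    model="text-embedding-3-small",  # produces 1536-dimensional vectors
    input=chunks,
)
embeddings = [item.embedding for item in response.data]
assert len(embeddings[0]) == 1536   # must equal the node's Vector Dimension
```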
Dependencies
- Requires an API key credential and cluster endpoint for authenticating with the Zilliz vector database service (a quick connectivity check is sketched after this list).
- Depends on a separate embedding generation step/node to provide vector embeddings for each document chunk.
- Uses the Zilliz client library bundled within the node for communication with the vector database.
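One quick way to confirm the credentials work is to connect and list collections; a sketch using pymilvus:
```python
from pymilvus import MilvusClient

client = MilvusClient(
    uri="https://<cluster-endpoint>",  # from the Zilliz console
    token="<api-key>",
)
print(client.list_collections())  # raises if authentication fails
```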
Troubleshooting
- Missing Content Field: If the specified content field is missing or empty in the input data, the node throws an error indicating the content field was not found.
- Missing Embeddings: The node requires precomputed vector embeddings in the input data. If these are absent or not an array, it throws an error instructing users to compute embeddings first.
- Invalid Collection Name: Collection names must start with a letter and contain only letters, numbers, and underscores. Violations cause validation errors; a validation sketch follows this list.
- API Authentication Errors: Ensure the API key and cluster endpoint credentials are correctly configured and valid.
- Chunking Issues: A Chunk Overlap equal to or larger than Chunk Size, or a Min Chunk Size close to Chunk Size, can produce too few chunks or silently discard content. Keep the overlap well below the chunk size and the minimum chunk size small relative to both.
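The collection-name rule can be checked up front with a simple pattern. The regular expression below is an assumption that matches the stated rule, not necessarily the node's exact validation:
```python
import re

COLLECTION_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

for name in ("kb_articles", "2024_docs", "kb-articles"):
    status = "valid" if COLLECTION_NAME.match(name) else "invalid"
    print(f"{name}: {status}")
# kb_articles passes; 2024_docs starts with a digit and
# kb-articles contains a hyphen, so both are rejected.
```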