Document Preprocessor

Preprocess documents from any source (PDFs, LLM output, etc.) for vector stores, with intelligent chunking.

Overview

The Document Preprocessor node prepares documents from sources such as PDFs or large language model output for use in vector stores. It splits the input text into chunks that respect a configured maximum and minimum character count. Two processing modes are supported: combine all input items into a single document, or process each input item separately. This is useful when preparing one large document, or many separate documents, for downstream tasks such as vector embedding or search indexing.

Use Case Examples

  1. Combining multiple text inputs into one document and splitting it into chunks for vector storage.
  2. Processing each document separately to maintain individual document context while chunking.
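
The two use cases above correspond to the two processing modes. The following is a minimal sketch of the difference, not the node's actual implementation: the mode names `"combine"` and `"each"`, the `text` field, and the `group_inputs` helper are all assumptions made for illustration.

```python
from typing import Dict, List


def group_inputs(items: List[Dict[str, str]], mode: str) -> List[str]:
    """Return the documents to chunk, one string per document.

    Hypothetical helper: the mode names and the "text" field are assumed.
    """
    texts = [item.get("text", "") for item in items]
    if mode == "combine":
        # All input items become one document, joined by blank lines.
        return ["\n\n".join(t for t in texts if t)]
    if mode == "each":
        # Every input item stays its own document.
        return texts
    raise ValueError(f"unknown mode: {mode}")


items = [{"text": "First report."}, {"text": "Second report."}]
print(len(group_inputs(items, "combine")))  # one combined document
print(len(group_inputs(items, "each")))     # one document per input item
```

In combined mode the chunk indices run across the whole merged text, while in per-item mode each input keeps its own context and can be identified through item_index.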

Properties

Name | Meaning
Chunk Size | Maximum number of characters per chunk.
Min Chunk Size | Minimum number of characters per chunk, so that chunks are not too small.
Processing Mode | Whether to combine all input items into one document or to process each input item separately.
Namespace Max Length | Maximum length of the namespace string used in metadata.
Additional Metadata Fields | Custom metadata fields added to each chunk to enrich it with extra information.
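
To illustrate how Chunk Size and Min Chunk Size might interact, here is a minimal sketch of sentence-based chunking. It is an assumption about the general technique, not the node's source code: the `chunk_text` function, its default values, and the sentence-splitting regular expression are hypothetical. In this sketch a single sentence longer than the maximum is kept whole, and merging a short final chunk can push the previous chunk past the maximum.

```python
import re
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, min_chunk_size: int = 100) -> List[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size characters."""
    # Split after '.', '!' or '?' when followed by whitespace (assumed delimiter rule).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        if chunks and len(current) < min_chunk_size:
            # The trailing chunk is too short: fold it into the previous one.
            chunks[-1] = f"{chunks[-1]} {current}"
        else:
            chunks.append(current)
    return chunks


sample = "One two three. Four five six. Seven eight nine. Ten."
for chunk in chunk_text(sample, chunk_size=30, min_chunk_size=10):
    print(len(chunk), chunk)
```

A larger Chunk Size yields fewer, longer chunks; a larger Min Chunk Size makes it more likely that a short tail is merged rather than emitted on its own.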

Output

JSON

  • pageContent - The text content of the chunk.
  • text - Duplicate of the chunk text content for compatibility.
  • metadata
    • document_title - Title of the document the chunk belongs to.
    • chapter - Chapter name, defaulted to 'Main Content'.
    • section - Section name, defaulted to 'Document Section'.
    • content_type - Type of content, defaulted to 'general'.
    • chunk_index - Index of the chunk within the entire document.
    • local_chunk_index - Local index of the chunk within the current processing context.
    • chapter_index - Index of the chapter, defaulted to 0.
    • total_chunks - Total number of chunks created from the document.
    • namespace - Namespace string derived from the document title, truncated to the max length.
    • source_file - Original source file name without extension.
    • character_count - Number of characters in the chunk.
    • processing_timestamp - Timestamp when the chunk was processed.
    • item_index - Index of the input item when processing each item separately.
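
The output structure above can be pictured with a short sketch that assembles one item per chunk. This is illustrative only: the `build_items` helper is hypothetical, the ISO 8601 timestamp format is an assumption, and chunk_index and local_chunk_index are simply set to the same value here because the sketch handles a single document.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def build_items(
    chunks: List[str],
    document_title: str,
    source_file: str,
    namespace: str,
    item_index: int = 0,
) -> List[Dict[str, Any]]:
    """Wrap each chunk in the documented output shape (hypothetical helper)."""
    timestamp = datetime.now(timezone.utc).isoformat()  # assumed format
    items = []
    for index, chunk in enumerate(chunks):
        items.append({
            "pageContent": chunk,
            "text": chunk,  # duplicate kept for compatibility
            "metadata": {
                "document_title": document_title,
                "chapter": "Main Content",      # documented default
                "section": "Document Section",  # documented default
                "content_type": "general",      # documented default
                "chunk_index": index,
                "local_chunk_index": index,     # same as chunk_index in this sketch
                "chapter_index": 0,             # documented default
                "total_chunks": len(chunks),
                "namespace": namespace,
                "source_file": source_file,
                "character_count": len(chunk),
                "processing_timestamp": timestamp,
                "item_index": item_index,
            },
        })
    return items


result = build_items(["First chunk.", "Second one."], "My Doc", "my_doc", "my-doc")
print(result[0]["metadata"]["total_chunks"])
```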

Troubleshooting

  • If the input documents have no text content, the node outputs an empty chunk with metadata indicating an empty document. Ensure input data contains valid text fields.
  • Namespace strings are sanitized and truncated; if the namespace appears incorrect, check the document titles and their formatting.
  • Chunking is based on sentence delimiters; very short or malformed input text may result in fewer or uneven chunks.
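
When a namespace looks wrong, it can help to reproduce the sanitizing step by hand. The exact rules the node applies are not documented here, so the sketch below assumes one plausible scheme (lowercase, runs of non-alphanumeric characters replaced by a hyphen, then truncation); the `make_namespace` function and its default length are hypothetical.

```python
import re


def make_namespace(title: str, max_length: int = 50) -> str:
    """Derive a namespace from a document title (assumed sanitizing rules)."""
    # Lowercase, then collapse anything that is not a-z or 0-9 into a hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    # Truncate to the maximum length and drop a hyphen left dangling at the cut.
    return slug[:max_length].rstrip("-")


print(make_namespace("Annual Report: 2024 (Final)"))
print(make_namespace("Annual Report: 2024 (Final)", max_length=14))
```

Under these assumed rules, titles that differ only in punctuation or case map to the same namespace, and a small Namespace Max Length can make long, similar titles collide.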