
Agentic RAG Supabase

Handle RAG operations with Supabase pgvector for PDF, TXT, and DOCX files

Overview

The node "Agentic RAG Supabase" provides a comprehensive pipeline for handling Retrieval-Augmented Generation (RAG) operations using Supabase's pgvector extension, primarily focused on processing PDF, TXT, and DOCX files. It supports parsing files, extracting structured data, generating embeddings for text chunks, and completing full processing workflows that combine these steps.

For the File resource with the Complete Processing operation, the node:

  • Parses the input file to extract raw text.
  • Extracts structured data from the file in JSON format.
  • Generates vector embeddings for chunks of the extracted text.
  • Returns a combined result containing parsed text, structured data, embeddings, and a completion flag.

This node is useful when you want to ingest documents into a vector database for semantic search or AI-assisted querying. For example, it can prepare documents for a knowledge base so that downstream AI models can answer questions based on their content.
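The Complete Processing flow can be pictured with the short sketch below. This is an illustrative outline only, not the node's actual source: the injected helpers (parseFile, extractStructured, embedChunks) are hypothetical, and the chunking parameters simply mirror the ~200-word-with-overlap default described under Output.

```typescript
import * as path from 'path';

// Split extracted text into overlapping word chunks.
// The defaults mirror the documented ~200-word chunks; the overlap value is an assumption.
function chunkText(text: string, chunkSize = 200, overlap = 20): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(' '));
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}

// Assemble the combined Complete Processing result from injected (hypothetical) helpers.
async function completeProcessing(
  filePath: string,
  deps: {
    parseFile: (p: string) => Promise<string>;          // e.g. pdf-parse / mammoth / fs
    extractStructured: (p: string) => Promise<unknown>; // tables/arrays as JSON
    embedChunks: (chunks: string[]) => Promise<object[]>; // thenlper/gte-small vectors
  },
) {
  const text = await deps.parseFile(filePath);
  const chunks = chunkText(text);
  return {
    parsed: { text, chunks, fileType: path.extname(filePath), fileName: path.basename(filePath) },
    structured: { structuredData: await deps.extractStructured(filePath), format: 'json', fileName: path.basename(filePath) },
    embeddings: { embeddings: await deps.embedChunks(chunks), totalChunks: chunks.length, embeddingModel: 'thenlper/gte-small' },
    processingComplete: true,
  };
}
```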


Properties

  • File Path: The local path to the file to be processed. Supported file types include PDF, TXT, DOCX.

Output

The output JSON object for the Complete Processing operation includes:

  • parsed: An object containing:

    • text: The full extracted text content from the file.
    • chunks: An array of text chunks derived from the full text (default chunk size ~200 words with overlap).
    • fileType: The file extension/type (e.g., .pdf, .txt, .docx).
    • fileName: The base name of the file.
  • structured: An object containing:

    • structuredData: Structured representation of the file content (for PDFs, TXT, CSV), typically as arrays or tables.
    • format: The output format requested (JSON in this case).
    • fileName: The base name of the file.
  • embeddings: An object containing:

    • embeddings: An array of objects, each representing a chunk with its embedding vector and metadata.
    • totalChunks: Total number of chunks created from the text.
    • embeddingModel: The name of the embedding model used (thenlper/gte-small).
  • processingComplete: A boolean flag indicating the processing finished successfully (true).

No binary data is output by this operation.
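Expressed as an approximate TypeScript shape (field names are taken from the list above; the per-chunk metadata fields on each embedding entry are assumptions):

```typescript
// Approximate shape of the Complete Processing output item (illustrative only).
interface CompleteProcessingOutput {
  parsed: {
    text: string;            // full extracted text
    chunks: string[];        // ~200-word chunks with overlap
    fileType: string;        // e.g. ".pdf", ".txt", ".docx"
    fileName: string;
  };
  structured: {
    structuredData: unknown; // arrays/tables extracted from the file
    format: 'json';
    fileName: string;
  };
  embeddings: {
    embeddings: Array<{
      embedding: number[];   // vector from thenlper/gte-small
      chunk?: string;        // chunk text/metadata fields are assumptions
      index?: number;
    }>;
    totalChunks: number;
    embeddingModel: string;  // "thenlper/gte-small"
  };
  processingComplete: boolean; // true on success
}
```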


Dependencies

  • Supabase: Used as the vector database backend with pgvector extension for storing and querying embeddings.
  • Hugging Face Inference API: Used to generate text embeddings via the thenlper/gte-small model (see the sketch at the end of this section).
  • OpenAI API: Used internally for advanced query answering and evaluation in other operations (not directly in Complete Processing).
  • Node.js libraries:
    • pdf-parse and pdf2json for PDF parsing.
    • mammoth for DOCX text extraction.
    • csv-parser for CSV structured extraction.
    • axios for HTTP requests to external APIs.
    • fs and path for file system operations.

Required n8n configurations:

  • Credentials providing:
    • Supabase project URL and API key.
    • Hugging Face API key for embedding generation.
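To illustrate how these dependencies fit together, here is a minimal standalone sketch that generates embeddings through the Hugging Face Inference API and stores them in a pgvector-backed Supabase table. The table name (documents), its column names, and the exact inference endpoint path are assumptions, not details taken from the node.

```typescript
import axios from 'axios';
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

// Request sentence embeddings from the Hugging Face Inference API.
// The endpoint path is an assumption; consult the current Inference API docs.
async function embed(texts: string[]): Promise<number[][]> {
  const res = await axios.post(
    'https://api-inference.huggingface.co/pipeline/feature-extraction/thenlper/gte-small',
    { inputs: texts },
    { headers: { Authorization: `Bearer ${process.env.HF_API_KEY}` } },
  );
  return res.data as number[][];
}

// Store each chunk with its embedding in a pgvector-backed table.
// Assumes a "documents" table with "content" (text) and "embedding" (vector(384)) columns.
async function storeChunks(chunks: string[]): Promise<void> {
  const vectors = await embed(chunks);
  const rows = chunks.map((content, i) => ({ content, embedding: vectors[i] }));
  const { error } = await supabase.from('documents').insert(rows);
  if (error) throw new Error(`Supabase insert failed: ${error.message}`);
}
```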

Troubleshooting

  • Unsupported file type error: If the file extension is not .pdf, .txt, or .docx, the node will throw an error. Ensure the file path points to a supported file type.

  • File read errors: If the file path is incorrect or inaccessible, the node will fail to read the file. Verify that the file exists and that permissions are correct.

  • Embedding generation errors: Failures in calling the Hugging Face API may occur due to invalid or missing API keys, network issues, or rate limits. Check the API key validity and network connectivity.

  • Structured extraction errors: Structured extraction only supports CSV, TXT, and PDF files; passing any other format will cause an error.

  • General error handling: Errors during processing are caught and returned with an error message and operation context.
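The catch-and-return behaviour can be pictured with a small generic wrapper. This is a sketch of the pattern, not the node's actual code, and the safeRun name is hypothetical:

```typescript
// Run an operation and, on failure, return the error message plus operation context
// instead of aborting (mirrors the documented behaviour; names are illustrative).
async function safeRun<T>(
  operation: string,
  fn: () => Promise<T>,
): Promise<{ json: T } | { json: { error: string; operation: string } }> {
  try {
    return { json: await fn() };
  } catch (err) {
    return { json: { error: err instanceof Error ? err.message : String(err), operation } };
  }
}

// Example: const item = await safeRun('completeProcessing', () => completeProcessing('/data/report.pdf', deps));
```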

