GitLab Code Splitter

Split GitLab repository code into manageable chunks

Overview

The "Stem4 Integration - Split All Files with GCP Upload" operation processes all files in a specified GitLab repository: it splits their code content into manageable chunks and uploads those chunks to Google Cloud Platform (GCP) Storage. This node is useful when a large codebase needs to be analyzed, indexed, or processed in smaller parts, for example for code search engines, AI code analysis, or documentation generation.

Typical use cases include:

  • Splitting an entire GitLab project’s source code into token-limited chunks for downstream processing.
  • Filtering files by extension and excluding certain paths to focus on relevant code.
  • Automatically uploading the processed chunks to GCP Storage for scalable storage and further cloud-based workflows.

For example, a developer might use this node to split all .go files in a GitLab repo's main branch, excluding vendor directories, and upload the results to GCP for integration with a code intelligence platform.
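The filtering described above can be sketched as follows. This is a minimal illustration, not the node's actual implementation: the function names `is_excluded` and `should_process` are hypothetical, the segment-based matching of exclude paths is an assumption, and the 2 MB default is assumed here to mean 2 * 1024 * 1024 bytes.

```python
from pathlib import PurePosixPath

# Assumption: "2MB" is interpreted as 2 * 1024 * 1024 bytes.
DEFAULT_MAX_FILE_SIZE = 2 * 1024 * 1024


def is_excluded(path, exclude_paths):
    """Return True if any excluded path appears as a whole segment run in `path`."""
    padded = "/" + path.strip("/") + "/"
    return any("/" + ex.strip("/") + "/" in padded for ex in exclude_paths)


def should_process(path, size_bytes, extensions, exclude_paths,
                   max_file_size=DEFAULT_MAX_FILE_SIZE):
    """Return True if a repository file passes the path, extension, and size filters."""
    # Skip files that live under an excluded directory such as vendor/.
    if is_excluded(path, exclude_paths):
        return False
    # Keep only files whose extension is in the allowed list (e.g. ".go").
    if PurePosixPath(path).suffix not in extensions:
        return False
    # Skip files larger than the configured limit.
    return size_bytes <= max_file_size
```

With extensions `[".go"]` and exclude paths `["vendor"]`, this sketch keeps `cmd/main.go` and skips `vendor/lib/a.go`.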


Properties

Name | Meaning
GitLab URL | The base URL of the GitLab instance hosting the repository (e.g., https://gitlab.com).
Project ID | Identifier of the GitLab project in the format group/project-name.
GitLab Token | Personal access token for authenticating with the GitLab API. Must start with glpat-, gldt-, or gloas-.
Branch | The Git branch to process (default: main).
Target Path | Optional prefix path to prepend to the output files after processing.
Service | Optional service identifier to include in the metadata of the output chunks.
File Extensions | List of file extensions to include in processing (e.g., .go, .js). Only files matching these extensions are processed.
Exclude Paths | List of directory paths to exclude from processing (e.g., .git, vendor, node_modules).
Split Options | Collection of options controlling how code is split into chunks:
  • Max Tokens: Maximum tokens per chunk (default 800).
  • Overlap: Number of overlapping tokens between chunks (default 50).
  • Min Chunk Size: Minimum tokens per chunk (default 100).
  • Preserve Newlines: Whether to keep newline characters (default true).
Max File Size (Bytes) | Maximum size of files to process, in bytes (default 2 MB). Files larger than this are skipped.
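To show how Max Tokens, Overlap, and Min Chunk Size might interact, here is a minimal sliding-window sketch. The actual splitting happens in the external API and its algorithm and tokenizer are not documented here, so everything below is an assumption: `split_tokens` is a hypothetical name, the input is an already tokenized list, and folding an undersized final chunk into its predecessor is just one plausible reading of Min Chunk Size. Preserve Newlines is not modeled.

```python
def split_tokens(tokens, max_tokens=800, overlap=50, min_chunk_size=100):
    """Split a token list into overlapping chunks (illustrative sketch only)."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")

    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the input
        start += step

    # Assumption: a trailing chunk below min_chunk_size is merged into the
    # previous chunk, which may then exceed max_tokens.
    if len(chunks) > 1 and len(chunks[-1]) < min_chunk_size:
        tail = chunks.pop()
        # The first `overlap` tokens of the tail are already in the previous chunk.
        chunks[-1] = chunks[-1] + tail[overlap:]
    return chunks
```

In this sketch each chunk starts with the last `overlap` tokens of the chunk before it, so context that straddles a chunk boundary appears in both chunks.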

Output

The node outputs JSON data representing the result of the splitting and uploading operation. The structure typically includes metadata about the processed files and the generated chunks. Each output item corresponds to the response from the backend API that performs the splitting and GCP upload.

  • The json output contains details such as file paths, chunk information, and possibly URLs or identifiers related to the uploaded chunks in GCP Storage.
  • No binary data output is produced by this node.

Dependencies

  • Requires a valid API key credential for the external GitLab Code Splitter API service.
  • Requires a GitLab personal access token with appropriate permissions to read the target repository.
  • The node communicates with an external API endpoint (configured via credentials) that handles the actual splitting and uploading logic.
  • The external API must have access to Google Cloud Storage for uploading the split chunks.
  • Network connectivity to GitLab and the external API service is necessary.
  • No direct configuration of GCP credentials is done within the node; it relies on the external API managing GCP uploads.

Troubleshooting

  • Invalid API URL or Key: If the API URL or API key credential is missing or malformed, the node throws an error indicating invalid credentials. Ensure the API URL is a valid URL and the API key meets length and character requirements.
  • Invalid GitLab URL: The node validates the GitLab URL format. An incorrect URL will cause an error.
  • Invalid GitLab Token Format: The GitLab token must start with one of the accepted prefixes (glpat-, gldt-, or gloas-). Using an incorrect token format will cause the node to fail.
  • File Size Limits: Files exceeding the configured max file size (default 2MB) are skipped. Adjust the limit if needed.
  • Timeouts: The request to the external API has a long timeout (1 hour). Network issues or very large repositories may cause delays or failures.
  • Permission Issues: Ensure the GitLab token has sufficient permissions to access the repository and its files.
  • Excluding Paths: Misconfiguration of exclude paths might lead to unexpected files being processed or skipped.
  • API Errors: Any errors returned by the external API will be surfaced. Check the API logs or responses for detailed diagnostics.
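Several of the failures above can be caught before running the node. The following is a rough pre-flight check under stated assumptions: `validate_inputs` is a hypothetical helper, the URL check (absolute http or https URL with a host) is a guess at what the node's own validation accepts, and only the token prefixes come from this documentation.

```python
from urllib.parse import urlparse

# Prefixes listed in the Properties section of this document.
ACCEPTED_TOKEN_PREFIXES = ("glpat-", "gldt-", "gloas-")


def validate_inputs(gitlab_url, gitlab_token):
    """Return a list of problems; an empty list means the inputs look plausible."""
    problems = []

    # Assumption: a valid GitLab URL is an absolute http(s) URL with a host.
    parsed = urlparse(gitlab_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("GitLab URL must be an absolute http(s) URL")

    # The token must start with one of the accepted prefixes.
    if not gitlab_token.startswith(ACCEPTED_TOKEN_PREFIXES):
        problems.append("GitLab token must start with glpat-, gldt-, or gloas-")

    return problems
```

A check like this only covers format problems; permission issues and API errors still surface only when the request is actually made.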
