Overview
The "Stem4 Integration" resource with the "Split by Path with GCP Upload" operation is designed to process source code files from a GitLab repository, split them into manageable chunks based on specified paths, and upload the processed chunks to Google Cloud Platform (GCP) Storage. This node is particularly useful for developers or teams who want to analyze, index, or transform large codebases stored in GitLab by breaking down files into smaller pieces while filtering by directory paths. It supports token-based chunking with overlap and size controls, enabling efficient handling of large repositories.
Practical examples:
- Splitting all Go source files under specific directories in a GitLab project to prepare them for code analysis or search indexing.
- Processing only certain architectural layers or folders within a repository and uploading the results to GCP for further processing or storage.
- Excluding vendor or third-party directories to focus on proprietary code during splitting and upload.
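The filtering described in these examples can be sketched as follows. This is a minimal illustration, not the node's actual implementation: the function name `should_process`, the matching of excluded directories by path component, and the reading of "2MB" as 2 * 1024 * 1024 bytes are all assumptions made for the example.

```python
from pathlib import PurePosixPath

# Assumption: "2MB" is interpreted as 2 * 1024 * 1024 bytes.
DEFAULT_MAX_FILE_SIZE = 2 * 1024 * 1024


def should_process(path, size_bytes, extensions, exclude_paths,
                   max_file_size=DEFAULT_MAX_FILE_SIZE):
    """Return True if a repository file passes the size, extension,
    and exclude-path filters (hypothetical helper)."""
    p = PurePosixPath(path)

    # Files larger than the limit are skipped.
    if size_bytes > max_file_size:
        return False

    # Only files with a listed extension (e.g. ".go") are kept.
    if p.suffix not in extensions:
        return False

    # Assumption: a file is excluded when any of its parent directory
    # names matches an excluded name such as "vendor".
    if any(part in exclude_paths for part in p.parts[:-1]):
        return False

    return True
```

For instance, `should_process("vendor/lib/x.go", 1000, {".go"}, {"vendor"})` returns `False` under these assumptions, while the same call for `internal/api/server.go` returns `True`.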
Properties
| Name | Meaning |
|---|---|
| GitLab URL | The URL of the GitLab instance hosting the repository (e.g., https://gitlab.com). |
| Project ID | Identifier of the GitLab project in the format group/project-name. |
| GitLab Token | Personal access token for authenticating with GitLab API. Must start with glpat-, gldt-, or gloas-. |
| Branch | The Git branch to process (default is main). |
| Target Path | Optional prefix path to prepend to the processed files' output location. |
| Service | Optional service identifier to include in the metadata of the output chunks. |
| File Extensions | List of file extensions to include in processing (e.g., .go, .js). Only files matching these extensions will be processed. |
| Exclude Paths | List of directory paths to exclude from processing (e.g., .git, vendor, node_modules). |
| Split Options | Collection of options controlling how files are split into chunks. Max Tokens: maximum tokens per chunk (default 800). Overlap: number of overlapping tokens between consecutive chunks (default 50). Min Chunk Size: minimum tokens per chunk (default 100). Preserve Newlines: whether to keep newline characters (default true). |
| Max File Size | Maximum file size in bytes to process (default 2MB). Files larger than this size will be skipped. |
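To make the interaction between Max Tokens, Overlap, and Min Chunk Size concrete, here is a minimal sliding-window sketch. It is not the splitter service's algorithm: the function name `chunk_tokens` is hypothetical, the input is any pre-tokenized list (the service's tokenizer is not documented here), and merging an undersized final chunk into the previous one is only one plausible way to honor a minimum chunk size.

```python
def chunk_tokens(tokens, max_tokens=800, overlap=50, min_chunk_size=100):
    """Split a token list into overlapping windows (illustrative only)."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")

    step = max_tokens - overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once the current window reaches the end of the input.
        if start + max_tokens >= len(tokens):
            break
        start += step

    # Assumption: a trailing chunk below min_chunk_size is merged into
    # the previous chunk. Its first `overlap` tokens already appear
    # there, so only the remaining tokens are appended. The merged
    # chunk may then exceed max_tokens.
    if len(chunks) > 1 and len(chunks[-1]) < min_chunk_size:
        tail = chunks.pop()
        chunks[-1] = chunks[-1] + tail[overlap:]

    return chunks
```

With the defaults, a 1000-token file yields two chunks of 800 and 250 tokens, where the last 50 tokens of the first chunk repeat as the first 50 tokens of the second.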
Output
The node outputs an array of JSON objects representing the result of the splitting and upload operation. Each output item contains metadata and content chunks corresponding to the processed files from the GitLab repository paths specified.
- The JSON structure typically includes details about each chunk such as its text content, associated file path, and any included metadata like service identifiers or target paths.
- The node does not output binary data directly; instead, it handles uploading the processed chunks to GCP Storage internally.
Dependencies
- Requires a valid API key credential for the external GitLab Code Splitter API service.
- Requires a GitLab personal access token with appropriate permissions to read the repository.
- The node interacts with GitLab's API to fetch repository files.
- The node uploads processed chunks to Google Cloud Platform (GCP) Storage, so proper GCP credentials and configuration must be set up externally.
- Network connectivity to both GitLab and the external splitter API endpoint is necessary.
Troubleshooting
- Invalid API URL or Key: The node validates the API URL and API key format before execution. Errors like "Invalid API URL provided in credentials" or "Valid API key is required" indicate misconfiguration of the external API credentials.
- Invalid GitLab URL or Token: The node checks that the GitLab URL is a valid URL and that the token starts with one of the expected prefixes (glpat-, gldt-, or gloas-). Errors here mean the user should verify the GitLab instance URL and token correctness.
- Timeouts: The request to the external API has a long timeout (1 hour). If the repository is very large or the network is slow, operations may time out or fail. Consider reducing the scope or increasing the timeout if possible.
- File Size Limits: Files exceeding the configured max file size (default 2MB) are skipped. Large files might not be processed unless the limit is increased.
- Empty or Incorrect Paths: If include or exclude paths are misconfigured, no files may be processed or unexpected files included. Verify path patterns carefully.
- API Rate Limits: GitLab API rate limits or permission issues can cause failures fetching repository files. Ensure the token has sufficient rights and usage is within limits.
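The GitLab URL and token checks described above can be reproduced locally before running the node. The sketch below is an approximation under stated assumptions: `validate_gitlab_settings` is a hypothetical helper, the error strings are illustrative rather than the node's actual messages, and "valid URL" is taken to mean an absolute http or https URL.

```python
from urllib.parse import urlparse

# Prefixes listed in the GitLab Token property.
TOKEN_PREFIXES = ("glpat-", "gldt-", "gloas-")


def validate_gitlab_settings(gitlab_url, token):
    """Return a list of problems found; an empty list means both checks pass."""
    errors = []

    # Assumption: a valid URL has an http(s) scheme and a host.
    parsed = urlparse(gitlab_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        errors.append("GitLab URL must be an absolute http(s) URL")

    # The token must start with one of the documented prefixes.
    if not token.startswith(TOKEN_PREFIXES):
        errors.append("GitLab token must start with glpat-, gldt-, or gloas-")

    return errors
```

Calling `validate_gitlab_settings("gitlab.com", "glpat-abc")` reports the URL problem only, since the scheme is missing but the token prefix is accepted.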