GPT-Tokenizer

Encode or decode BPE tokens, or check token limits, before working with OpenAI GPT models.

Overview

This node, "GPT-Tokenizer," provides operations for working with GPT-style tokenization based on Byte Pair Encoding (BPE): encoding strings into tokens, decoding tokens back into strings, counting the tokens in a string, checking whether a string fits within a specified token limit, and slicing a string into chunks that each fit within a maximum token count.

Common scenarios include:

  • Preparing text inputs for OpenAI GPT models by encoding them into tokens.
  • Validating whether input text exceeds token limits before sending it to an API.
  • Splitting long texts into smaller parts that comply with token limits.
  • Decoding token arrays back into human-readable strings.

Practical examples:

  • Before calling an OpenAI GPT model, use the "Count Tokens" operation to ensure your prompt does not exceed the model's token limit.
  • Use "Slice to Max Token Limit" to split a large document into manageable pieces for batch processing.
  • Convert user input into tokens with "Encode" for custom token-based processing or analysis (see the sketch below).
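
These operations map directly onto the gpt-tokenizer package the node is built on. The following is a minimal standalone sketch, assuming gpt-tokenizer is installed; its default export targets a recent OpenAI encoding, and model-specific import paths are available if you need to pin a particular one:

// Sketch of the node's core operations using gpt-tokenizer directly.
import { encode, decode, isWithinTokenLimit } from "gpt-tokenizer";

const text = "The quick brown fox jumps over the lazy dog.";

const tokens = encode(text);          // Encode: string -> BPE token ids
const tokenCount = tokens.length;     // Count Tokens
const roundTripped = decode(tokens);  // Decode: token ids -> string

// Is Within Token Limit: returns false when the limit is exceeded,
// otherwise the token count (a truthy number).
const fits = isWithinTokenLimit(text, 128);

console.log({ tokenCount, roundTripped, fits });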

Properties

  • Input String: The string of text to process. Required for the encode, countTokens, isWithinTokenLimit, and sliceMatchingTokenLimit operations.
  • Destination Key: The JSON key under which the result is stored. If left empty, a default key is used, depending on the operation.

Note: For the "Count Tokens" operation specifically, only "Input String" and "Destination Key" apply.

Output

The output is a JSON object with a key containing the token count:

  • By default, the token count is stored under the key tokenCount unless a custom destination key is provided.
  • The value is a number representing how many tokens the input string produces according to the GPT tokenizer.

Example output JSON snippet:

{
  "tokenCount": 42
}
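
If a custom Destination Key is set, the count appears under that key instead. For example, with a hypothetical key of promptTokens:

{
  "promptTokens": 42
}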

No binary data output is produced by this operation.

Dependencies

  • Uses the gpt-tokenizer package for encoding/decoding tokens.
  • Uses the js-tiktoken/lite package to perform token counting.
  • Fetches a remote JSON file from https://tiktoken.pages.dev/js/o200k_base.json to initialize the tokenizer for counting tokens.
  • Requires internet access at runtime to fetch the tokenizer data for token counting (see the sketch after this list).
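
The counting path can be wired up as follows. This sketch follows the fetch-and-initialize pattern listed above, using the URL the node fetches; it assumes a runtime with a global fetch (Node 18+) and omits error handling:

import { Tiktoken } from "js-tiktoken/lite";

// Download the o200k_base rank file at runtime, then build a
// lightweight encoder from it and count tokens by encoding.
async function countTokens(input: string): Promise<number> {
  const res = await fetch("https://tiktoken.pages.dev/js/o200k_base.json");
  const ranks = await res.json();
  const encoder = new Tiktoken(ranks);
  return encoder.encode(input).length;
}

countTokens("Hello, world").then((n) => console.log({ tokenCount: n }));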

Troubleshooting

  • Input String is not a string: This error occurs if the provided input is not a valid string type. Ensure the input is a proper string.
  • Input String field is empty: The input string must not be empty; provide valid text to process.
  • Provide Max Tokens (bigger than 0): When using token limit checks, the max tokens parameter must be a positive number.
  • String exceeds token limit: If configured to throw when the token limit is exceeded, this error means the input is too long. Increase the limit or shorten the input (a chunking sketch follows this list).
  • Network issues fetching tokenizer data: Since the token counting relies on downloading a JSON file, network problems may cause failures. Ensure stable internet connectivity.
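
To avoid the token-limit error altogether, you can pre-chunk the input the way the "Slice to Max Token Limit" operation does conceptually. The helper below is a hypothetical illustration built on gpt-tokenizer's encode/decode, not the node's exact algorithm:

import { encode, decode } from "gpt-tokenizer";

// Hypothetical chunker: split a string into pieces of at most
// maxTokens tokens each by slicing the encoded token array.
// Chunk boundaries may fall mid-word, since they follow token
// boundaries rather than whitespace.
function sliceToMaxTokens(text: string, maxTokens: number): string[] {
  if (maxTokens <= 0) throw new Error("Provide Max Tokens (bigger than 0)");
  const tokens = encode(text);
  const chunks: string[] = [];
  for (let i = 0; i < tokens.length; i += maxTokens) {
    chunks.push(decode(tokens.slice(i, i + maxTokens)));
  }
  return chunks;
}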
