Overview
This node, named "GPT-Tokenizer," provides various operations to work with GPT-style tokenization based on Byte Pair Encoding (BPE). It is useful for encoding strings into tokens, decoding tokens back into strings, counting tokens in a string, checking if a string fits within a specified token limit, and slicing a string into chunks that each fit within a maximum token count.
Common scenarios include:
- Preparing text inputs for OpenAI GPT models by encoding them into tokens.
- Validating whether input text exceeds token limits before sending it to an API.
- Splitting long texts into smaller parts that comply with token limits.
- Decoding token arrays back into human-readable strings.
Practical examples:
- Before calling an OpenAI GPT model, use the "Count Tokens" operation to ensure your prompt does not exceed the model's token limit.
- Use "Slice to Max Token Limit" to split a large document into manageable pieces for batch processing.
- Convert user input into tokens with "Encode" for custom token-based processing or analysis.
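The "Slice to Max Token Limit" idea can be sketched as a greedy chunking loop. This is not the node's actual implementation: a real setup would tokenize with the `gpt-tokenizer` package, while here a simple whitespace splitter stands in for the BPE tokenizer so the example is self-contained.

```javascript
// Stand-in tokenizer: whitespace words instead of real BPE tokens.
const countTokens = (text) => text.split(/\s+/).filter(Boolean).length;

// Greedily pack "tokens" into chunks of at most maxTokens each.
function sliceToMaxTokens(text, maxTokens) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += maxTokens) {
    chunks.push(words.slice(i, i + maxTokens).join(' '));
  }
  return chunks;
}

const chunks = sliceToMaxTokens('one two three four five', 2);
console.log(chunks); // [ 'one two', 'three four', 'five' ]
```

Each resulting chunk fits within the limit, so every piece can be sent to a model or API call independently.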
Properties
| Name | Meaning |
|---|---|
| Input String | The string of text to process. Required for operations: encode, countTokens, isWithinTokenLimit, sliceMatchingTokenLimit. |
| Destination Key | The JSON key where the result will be stored. If left empty, default keys are used depending on the operation. |
Note: For the "Count Tokens" operation specifically, only "Input String" and "Destination Key" are relevant here.
Output
The output is a JSON object containing the token count:
- By default, the token count is stored under the key `tokenCount`, unless a custom destination key is provided.
- The value is a number representing how many tokens the input string produces according to the GPT tokenizer.
Example output JSON snippet:
```json
{
  "tokenCount": 42
}
```
No binary data output is produced by this operation.
Dependencies
- Uses the `gpt-tokenizer` package for encoding/decoding tokens.
- Uses the `js-tiktoken/lite` package to perform token counting.
- Fetches a remote JSON file from https://tiktoken.pages.dev/js/o200k_base.json to initialize the tokenizer for counting tokens.
- Requires internet access at runtime to fetch the tokenizer data for counting tokens.
Troubleshooting
- Input String is not a string: This error occurs if the provided input is not a valid string type. Ensure the input is a proper string.
- Input String field is empty: The input string must not be empty; provide valid text to process.
- Provide Max Tokens (bigger than 0): When using token limit checks, the max tokens parameter must be a positive number.
- String exceeds token limit: If configured to throw errors when exceeding token limits, this error indicates the input is too long. Either increase the limit or shorten the input.
- Network issues fetching tokenizer data: Since the token counting relies on downloading a JSON file, network problems may cause failures. Ensure stable internet connectivity.
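The input-related checks above can be sketched as simple pre-flight validation. The error messages mirror those listed, but the node's internal validation logic is an assumption here.

```javascript
// Pre-flight validation mirroring the documented error messages.
function validate(inputString, maxTokens) {
  if (typeof inputString !== 'string') {
    throw new Error('Input String is not a string');
  }
  if (inputString.length === 0) {
    throw new Error('Input String field is empty');
  }
  // maxTokens only applies to token-limit operations, so it is optional here.
  if (maxTokens !== undefined && (typeof maxTokens !== 'number' || maxTokens <= 0)) {
    throw new Error('Provide Max Tokens (bigger than 0)');
  }
}
```

Running such checks before calling the tokenizer surfaces configuration mistakes early, instead of failing mid-workflow.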