GPT-Tokenizer

Encode or decode BPE tokens, or check token limits, before working with OpenAI GPT models.

Overview

This node, "GPT-Tokenizer," provides operations for working with GPT-style tokenization based on Byte Pair Encoding (BPE): encoding strings into tokens, decoding tokens back into strings, counting the tokens in a string, checking whether a string fits within a specified token limit, and slicing a string into chunks that each fit within a maximum token count.

Common scenarios include:

  • Preparing text inputs for OpenAI GPT models by encoding them into tokens.
  • Validating whether input text exceeds token limits before sending it to an API.
  • Splitting long texts into smaller parts that comply with token limits.
  • Decoding token arrays back into human-readable strings.

Practical examples:

  • Before calling an OpenAI GPT model, use the "Count Tokens" operation to ensure your prompt does not exceed the model's token limit.
  • Use "Slice to Max Token Limit" to split a large document into manageable pieces for batch processing.
  • Convert user input into tokens with "Encode" for custom token-based processing or analysis (see the sketch below).
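
These operations map directly onto the gpt-tokenizer package the node is built on. The following is a minimal standalone sketch, assuming gpt-tokenizer is installed; its default export targets a recent OpenAI encoding, and model-specific import paths are available if you need to pin a particular one:

// Sketch of the node's core operations using gpt-tokenizer directly.
import { encode, decode, isWithinTokenLimit } from "gpt-tokenizer";

const text = "The quick brown fox jumps over the lazy dog.";

const tokens = encode(text);          // Encode: string -> BPE token ids
const tokenCount = tokens.length;     // Count Tokens
const roundTripped = decode(tokens);  // Decode: token ids -> string

// Is Within Token Limit: returns false when the limit is exceeded,
// otherwise the token count (a truthy number).
const fits = isWithinTokenLimit(text, 128);

console.log({ tokenCount, roundTripped, fits });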

Properties

  • Input String: The string of text to process. Required for the encode, countTokens, isWithinTokenLimit, and sliceMatchingTokenLimit operations.
  • Destination Key: The JSON key under which the result is stored. If left empty, a default key is used, depending on the operation.

Note: For the "Count Tokens" operation specifically, only "Input String" and "Destination Key" apply.

Output

The output is a JSON object with a key containing the token count:

  • By default, the token count is stored under the key tokenCount unless a custom destination key is provided.
  • The value is a number representing how many tokens the input string produces according to the GPT tokenizer.

Example output JSON snippet:

{
  "tokenCount": 42
}
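
If a custom Destination Key is set, the count appears under that key instead. For example, with a hypothetical key of promptTokens:

{
  "promptTokens": 42
}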

No binary data output is produced by this operation.

Dependencies

  • Uses the gpt-tokenizer package for encoding/decoding tokens.
  • Uses the js-tiktoken/lite package to perform token counting.
  • Fetches a remote JSON file from https://tiktoken.pages.dev/js/o200k_base.json to initialize the tokenizer for counting tokens.
  • Requires internet access at runtime to fetch the tokenizer data for token counting (see the sketch after this list).
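
The counting path can be wired up as follows. This sketch follows the fetch-and-initialize pattern listed above, using the URL the node fetches; it assumes a runtime with a global fetch (Node 18+) and omits error handling:

import { Tiktoken } from "js-tiktoken/lite";

// Download the o200k_base rank file at runtime, then build a
// lightweight encoder from it and count tokens by encoding.
async function countTokens(input: string): Promise<number> {
  const res = await fetch("https://tiktoken.pages.dev/js/o200k_base.json");
  const ranks = await res.json();
  const encoder = new Tiktoken(ranks);
  return encoder.encode(input).length;
}

countTokens("Hello, world").then((n) => console.log({ tokenCount: n }));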

Troubleshooting

  • Input String is not a string: This error occurs if the provided input is not a valid string type. Ensure the input is a proper string.
  • Input String field is empty: The input string must not be empty; provide valid text to process.
  • Provide Max Tokens (bigger than 0): When using token limit checks, the max tokens parameter must be a positive number.
  • String exceeds token limit: If configured to throw when the token limit is exceeded, this error means the input is too long. Increase the limit or shorten the input (a chunking sketch follows this list).
  • Network issues fetching tokenizer data: Since the token counting relies on downloading a JSON file, network problems may cause failures. Ensure stable internet connectivity.
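
To avoid the token-limit error altogether, you can pre-chunk the input the way the "Slice to Max Token Limit" operation does conceptually. The helper below is a hypothetical illustration built on gpt-tokenizer's encode/decode, not the node's exact algorithm:

import { encode, decode } from "gpt-tokenizer";

// Hypothetical chunker: split a string into pieces of at most
// maxTokens tokens each by slicing the encoded token array.
// Chunk boundaries may fall mid-word, since they follow token
// boundaries rather than whitespace.
function sliceToMaxTokens(text: string, maxTokens: number): string[] {
  if (maxTokens <= 0) throw new Error("Provide Max Tokens (bigger than 0)");
  const tokens = encode(text);
  const chunks: string[] = [];
  for (let i = 0; i < tokens.length; i += maxTokens) {
    chunks.push(decode(tokens.slice(i, i + maxTokens)));
  }
  return chunks;
}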
