
GPT-Tokenizer

Encode or decode BPE tokens, or check token limits, before working with OpenAI GPT models.

Overview

This node provides various operations to work with GPT-style tokenization, specifically using Byte Pair Encoding (BPE) tokens compatible with OpenAI GPT models. It allows encoding strings into tokens, decoding tokens back into strings, counting tokens in a string, checking if a string fits within a specified token limit, and slicing a string into chunks that each fit within a token limit.

Common scenarios where this node is useful include:

  • Preparing text inputs for GPT models by encoding them into tokens.
  • Validating whether input text exceeds model token limits before sending requests.
  • Splitting long texts into smaller parts that comply with token limits.
  • Decoding token arrays back into readable text.

Practical examples:

  • Before calling an OpenAI GPT API, use "Check Token Limit" to ensure the prompt does not exceed the model's max tokens.
  • Use "Slice to Max Token Limit" to split a large document into manageable chunks for sequential processing.
  • Encode user input into tokens for custom token-based processing or analysis.
  • Decode tokens received from a GPT model back into human-readable text.
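
These operations map directly onto the gpt-tokenizer package the node uses. A minimal sketch of the same checks done in plain code (assuming the package's standard encode, decode, and isWithinTokenLimit exports; the node's internal wiring may differ):

```typescript
import { encode, decode, isWithinTokenLimit } from 'gpt-tokenizer';

const prompt = 'Summarize the following report in three bullet points.';

// Encode the prompt into BPE token IDs.
const tokens: number[] = encode(prompt);

// Decode the token IDs back into the original string.
const roundTripped: string = decode(tokens);

// Check the prompt against a model's context window before calling the API.
// isWithinTokenLimit returns false when the limit is exceeded, otherwise the token count.
const withinLimit = isWithinTokenLimit(prompt, 4096);

console.log(tokens.length, roundTripped === prompt, withinLimit);
```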

Properties

Name | Meaning
Operation | The action to perform: Encode, Decode, Count Tokens, Check Token Limit, or Slice to Max Token Limit.
Input String | The string text to process (required for encode, countTokens, isWithinTokenLimit, sliceMatchingTokenLimit).
Input Tokens | An array of BPE tokens to decode (required for the decode operation).
Max Tokens | The maximum number of tokens allowed (required for isWithinTokenLimit and sliceMatchingTokenLimit).
Error When Exceeding Token Limit | Whether to throw an error if the input string exceeds the max token limit (only for isWithinTokenLimit).
Destination Key | The key name under which to store the result in the output JSON. If empty, default keys are used.

Output

The node outputs JSON data with different structures depending on the operation:

  • Encode: Outputs an array of tokens under the key "tokens" (or custom destination key).
  • Decode: Outputs the decoded string under the key "data" (or custom destination key).
  • Count Tokens: Outputs an object under "stats" containing:
    • resume: Number of tokens counted.
    • tokens: Array of token IDs.
  • Check Token Limit: Outputs a boolean (true or false) indicating if the input string is within the token limit under the key "isWithinTokenLimit" (or custom destination key).
  • Slice to Max Token Limit: Outputs an array of string slices under the key "slices" (or custom destination key), where each slice fits within the max token limit.

The node does not output binary data.
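
For illustration, the default output shapes might look like the following (token IDs and strings are invented; a custom Destination Key would replace the default key names):

```typescript
// Hypothetical per-operation outputs with made-up token IDs.
const encodeOutput = { tokens: [9906, 11, 1917] };                        // Encode
const decodeOutput = { data: 'Hello, world' };                            // Decode
const countOutput  = { stats: { resume: 3, tokens: [9906, 11, 1917] } };  // Count Tokens
const limitOutput  = { isWithinTokenLimit: true };                        // Check Token Limit
const sliceOutput  = { slices: ['first chunk …', 'second chunk …'] };     // Slice to Max Token Limit
```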

Dependencies

  • Uses the gpt-tokenizer package for encoding, decoding, and token limit checks.
  • Uses the js-tiktoken/lite package to count tokens accurately.
  • Fetches a token encoding base JSON from a remote URL (https://tiktoken.pages.dev/js/o200k_base.json) at runtime to initialize the tokenizer.
  • Requires internet access to fetch the encoding base JSON during execution.
  • No internal credential or API key is required for the node itself.
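
The token-counting path roughly follows the pattern documented for js-tiktoken/lite: fetch the encoding base JSON, construct a tokenizer from it, and encode the text. A sketch under that assumption (not necessarily the node's exact implementation):

```typescript
import { Tiktoken } from 'js-tiktoken/lite';

// Fetch the o200k_base encoding definition at runtime, as the node does.
// This step requires internet access.
async function countTokens(text: string): Promise<number> {
  const res = await fetch('https://tiktoken.pages.dev/js/o200k_base.json');
  const o200kBase = await res.json();
  const encoder = new Tiktoken(o200kBase);
  return encoder.encode(text).length;
}

countTokens('How many tokens is this?').then(console.log);
```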

Troubleshooting

Common Issues

  • Input String is empty or not a string: The node requires a valid, non-empty string for every operation except Decode. Ensure the input is correctly provided.
  • Input Tokens is not an array or is empty: For the Decode operation, Input Tokens must be a non-empty array of numbers.
  • Max Tokens not provided or not positive: For the token-limit operations (Check Token Limit and Slice to Max Token Limit), Max Tokens must be a positive number.
  • Exceeding token limit without error flag: If the input exceeds the token limit and the error flag is false, the node returns false but does not throw an error.
  • Network issues fetching encoding base JSON: Since the node fetches a remote JSON file to initialize the tokenizer, network problems can cause failures.
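
Most of these issues can be caught before the node runs, for example in a preceding Code node. A minimal sketch of such guards (the helper below is hypothetical and simply mirrors the checks described above):

```typescript
// Hypothetical pre-flight validation mirroring the node's own checks.
function validateTokenizerInput(input: unknown, maxTokens?: number): string {
  if (typeof input !== 'string') {
    throw new Error('Input String is not a string');
  }
  if (input.length === 0) {
    throw new Error('Input String field is empty');
  }
  if (maxTokens !== undefined && (!Number.isFinite(maxTokens) || maxTokens <= 0)) {
    throw new Error('Provide Max Tokens. (bigger then 0)');
  }
  return input;
}
```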

Error Messages and Resolutions

  • "Input String is not a string": Provide a valid string input.
  • "Input String field is empty": Ensure the input string is not empty.
  • "Input Tokens is not an array": Provide a valid array of tokens for decoding.
  • "Input Tokens field is empty": Provide a non-empty array of tokens.
  • "Provide Max Tokens. (bigger then 0)": Set a positive number for max tokens.
  • "String exceeds token limit": Enable the error flag to throw an error or handle the false return value gracefully.

Links and References
