GPT-Tokenizer

Encode or decode BPE tokens, or check token limits, before working with OpenAI GPT models.

Overview

This node, named "GPT-Tokenizer," provides operations for working with GPT-style Byte Pair Encoding (BPE) tokens: encoding strings into tokens, decoding tokens back into strings, counting the tokens in a string, checking whether a string fits within a token limit, and slicing a string into chunks that each fit within a specified token limit.

Common scenarios include:

  • Preparing text input for OpenAI GPT models by encoding it into tokens.
  • Validating whether a prompt or input text stays within model token limits before sending it to the API.
  • Splitting long texts into smaller parts that comply with token limits.
  • Decoding token arrays back into human-readable strings.

Practical examples:

  • Encoding a user message into tokens before passing it to an AI completion endpoint.
  • Checking if a blog post draft exceeds the maximum allowed tokens for summarization.
  • Automatically splitting a large document into token-sized chunks for batch processing.
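
For reference, these operations map onto the underlying gpt-tokenizer library. The sketch below shows encoding, decoding, and the token-limit check outside of n8n; the sample text and limit are arbitrary examples, not node defaults:

    import { encode, decode, isWithinTokenLimit } from 'gpt-tokenizer';

    const text = 'This is cool';

    // Encode: string -> array of BPE token IDs
    const tokens = encode(text);

    // Decode: array of token IDs -> original string
    const decoded = decode(tokens);

    // Check Token Limit: falsy when the text exceeds the limit
    const maxTokens = 128;
    const withinLimit = isWithinTokenLimit(text, maxTokens);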

Properties

  • Input String: The string of text to process (encode, count tokens, check token limit, or slice).
  • Destination Key: The key name under which the result is stored in the output JSON. If left empty, the operation's default key is used.

Additional properties used by the node's other operations:

  • Max Tokens: Maximum number of tokens allowed (used in token limit checks and slicing).
  • Error When Exceeding Token Limit: Whether to throw an error if the string exceeds the max token limit.
  • Input Tokens: An array of BPE tokens to decode (used in decode operation).
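
The Destination Key only controls the property name the result is written to. Below is a rough, hypothetical sketch of that behaviour for the Encode operation; the variable names and the n8n item shape are illustrative, not the node's source:

    import { encode } from 'gpt-tokenizer';

    const inputString = 'This is cool';
    const destinationKey = '';                 // empty -> fall back to the default key
    const key = destinationKey || 'tokens';    // "tokens" is the Encode default
    const outputItem = { json: { [key]: encode(inputString) } };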

Output

The output JSON structure varies depending on the operation:

  • Encode: Outputs an array of tokens under the specified destination key (default key: "tokens").

    {
      "tokens": [5661, 318, 1337, ...]
    }
    
  • Decode: Outputs the decoded string under the specified destination key (default key: "data").

    {
      "data": "decoded string"
    }
    
  • Count Tokens: Outputs an object with token statistics under the specified destination key (default key: "tokenCount"), e.g.:

    {
      "tokenCount": {
        "resume": 42,
        "tokens": [/* array of token IDs */]
      }
    }
    
  • Check Token Limit: Outputs a boolean indicating if the input string is within the max token limit under the specified destination key (default key: "isWithinTokenLimit").

    {
      "isWithinTokenLimit": true
    }
    
  • Slice to Max Token Limit: Outputs an array of string slices, each fitting within the max token limit, under the specified destination key (default key: "slices").

    {
      "slices": [
        "first chunk of text",
        "second chunk of text",
        ...
      ]
    }
    

The node does not output binary data.
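
The slicing output can be approximated from encode and decode alone: encode the full string, cut the token array into windows of at most Max Tokens, and decode each window back to text. This is an illustrative sketch, not necessarily the node's exact algorithm:

    import { encode, decode } from 'gpt-tokenizer';

    // Split text into chunks of at most maxTokens tokens each. Cutting the
    // token stream at arbitrary positions may split words or multi-byte
    // characters across chunk boundaries.
    function sliceByTokenLimit(text: string, maxTokens: number): string[] {
      const tokens = encode(text);
      const slices: string[] = [];
      for (let i = 0; i < tokens.length; i += maxTokens) {
        slices.push(decode(tokens.slice(i, i + maxTokens)));
      }
      return slices;
    }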

Dependencies

  • Uses the gpt-tokenizer library for encoding, decoding, and token limit checks.
  • Uses the js-tiktoken/lite package to count tokens via a tokenizer initialized with a remote JSON file fetched from https://tiktoken.pages.dev/js/o200k_base.json.
  • Requires internet access to fetch the tokenizer configuration JSON at runtime.
  • No internal credential types are required, but network connectivity is necessary for token counting.
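
The token-counting path can be reproduced outside the node roughly as follows; this mirrors the js-tiktoken/lite usage described above (the function name is ours):

    import { Tiktoken } from 'js-tiktoken/lite';

    // Count tokens with the o200k_base encoding. The rank file is fetched at
    // runtime, so this requires network access.
    async function countO200kTokens(text: string): Promise<number> {
      const res = await fetch('https://tiktoken.pages.dev/js/o200k_base.json');
      const ranks = await res.json();
      const encoder = new Tiktoken(ranks);
      return encoder.encode(text).length;
    }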

Troubleshooting

  • Input String is not a string: This error occurs if the provided input is not a valid string. Ensure the input data type is correct.
  • Input String field is empty: The node requires a non-empty string for most operations; provide valid text input.
  • Input Tokens is not an array: For decoding, the input must be an array of tokens. Check the format of your token input.
  • Provide Max Tokens (bigger than 0): When checking or slicing by token limit, the max tokens value must be a positive integer.
  • String exceeds token limit: If enabled, the node throws this error when the input string's token count surpasses the max token limit. Either increase the limit or disable the error flag.
  • Network errors fetching tokenizer config: Counting tokens depends on downloading a JSON configuration file at runtime. Network issues can cause failures; ensure a stable internet connection.
