
GPT-Tokenizer

Encode and decode BPE tokens, or check token limits, before working with OpenAI GPT models.

Overview

The GPT-Tokenizer node provides operations for working with the Byte Pair Encoding (BPE) tokens used by OpenAI GPT models. It can encode strings into BPE tokens, decode arrays of tokens back into strings, count the tokens in a string, check whether a string stays within a specified token limit, and slice a string into smaller parts that each fit within a maximum token limit.
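For orientation, here is a minimal sketch of the same operations using the gpt-tokenizer package directly (the node wraps equivalent logic; its exact internals may differ):

    import { encode, decode, isWithinTokenLimit } from 'gpt-tokenizer';

    const text = 'Hello, world!';

    const tokens = encode(text);        // Encode: string -> BPE token IDs
    const restored = decode(tokens);    // Decode: token IDs -> string
    const tokenCount = tokens.length;   // Count Tokens

    // Check Token Limit: returns the token count if within the limit,
    // or false if the text exceeds it.
    const ok = isWithinTokenLimit(text, 50);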

Common scenarios where this node is useful include:

  • Preparing text input for GPT models by encoding it into tokens.
  • Decoding token arrays back into readable text.
  • Validating or enforcing token limits before sending data to GPT APIs.
  • Splitting large texts into manageable chunks that comply with token limits.

For example, before sending a prompt to an OpenAI GPT model, you might use this node to ensure the prompt does not exceed the model's token limit or to split a long prompt into smaller pieces.
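A rough equivalent of the Slice to Max Token Limit operation, written against gpt-tokenizer (sliceToMaxTokens is a hypothetical helper, not the node's actual code):

    import { encode, decode } from 'gpt-tokenizer';

    // Naive token-boundary chunking; chunk edges may split multi-byte
    // characters, so treat this as an illustration rather than production code.
    function sliceToMaxTokens(text: string, maxTokens: number): string[] {
      const tokens = encode(text);
      const slices: string[] = [];
      for (let i = 0; i < tokens.length; i += maxTokens) {
        slices.push(decode(tokens.slice(i, i + maxTokens)));
      }
      return slices;
    }

    const chunks = sliceToMaxTokens('a very long prompt ...', 4096);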

Properties

  • Input Tokens: An array of BPE tokens to decode (used only in the Decode operation).
  • Destination Key: The key in the output JSON where the result will be stored. If left empty, the operation's default key is used (listed under Output below).

Output

The node outputs JSON data whose fields depend on the selected operation (illustrative item shapes are sketched after this list):

  • Decode: Outputs the decoded string under the specified destination key (default key: data).
  • Encode: Outputs an array of BPE tokens under the specified destination key (default key: tokens).
  • Count Tokens: Outputs an object with token statistics under the specified destination key (default key: tokenCount). The object includes:
    • resume: Number of tokens counted.
    • tokens: Array of token IDs.
  • Check Token Limit: Outputs a boolean (true or false) indicating whether the input string is within the max token limit under the specified destination key (default key: isWithinTokenLimit).
  • Slice to Max Token Limit: Outputs an array of string slices, each fitting within the max token limit, under the specified destination key (default key: slices).
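Assuming the default destination keys, the per-operation output items look roughly like this (token IDs and strings are placeholder values):

    // Illustrative output items, one per operation, using the default keys.
    const decodeOut = { data: 'hello world' };
    const encodeOut = { tokens: [1234, 5678] };   // placeholder token IDs
    const countOut = { tokenCount: { resume: 2, tokens: [1234, 5678] } };
    const limitOut = { isWithinTokenLimit: true };
    const sliceOut = { slices: ['first chunk', 'second chunk'] };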

The node does not output binary data.

Dependencies

  • Uses the gpt-tokenizer package for encoding and decoding BPE tokens.
  • Uses the js-tiktoken/lite package to count tokens and handle token limits (the typical initialization pattern is sketched after this list).
  • Fetches a remote JSON file from https://tiktoken.pages.dev/js/o200k_base.json to initialize the tokenizer for counting tokens.
  • Requires internet access to fetch the tokenizer base JSON during execution.
  • No credentials or API keys are required for basic tokenization operations.
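The token-counting path follows the standard js-tiktoken/lite setup, roughly as below (a sketch under that assumption, not the node's exact code):

    import { Tiktoken } from 'js-tiktoken/lite';

    // Fetch the o200k_base ranks once, then build a lightweight encoder.
    const res = await fetch('https://tiktoken.pages.dev/js/o200k_base.json');
    const o200kBase = await res.json();
    const encoder = new Tiktoken(o200kBase);

    const tokenIds = encoder.encode('Count these tokens');
    console.log(tokenIds.length); // token count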

Troubleshooting

  • Input String is not a string: The input for an operation that expects a string is not of type string. Provide a valid string.
  • Input String field is empty: Every operation except Decode requires a non-empty string. Provide a valid string input.
  • Input Tokens is not an array: The Decode operation requires an array of tokens; passing anything else causes this error.
  • Input Tokens field is empty: The Decode operation requires a non-empty array of tokens.
  • Provide Max Tokens (bigger than 0): For operations involving token limits, the max tokens parameter must be greater than zero.
  • String exceeds token limit: When checking token limits with error throwing enabled, this error indicates the input string is too long. Either increase the limit or disable the error flag.
  • Network issues fetching tokenizer base JSON: Counting tokens depends on downloading a JSON file. Network problems may cause failures here.

To resolve these errors, verify input types and values, ensure required parameters are set, and confirm network connectivity if counting tokens.
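If you prepare data in a preceding Code node, a guard along these lines catches most of the errors above before they reach the tokenizer (a hypothetical helper, not part of the node):

    // Hypothetical pre-flight checks mirroring the node's error conditions.
    function validateTokenizerInput(input: unknown, maxTokens: number): void {
      if (typeof input !== 'string') throw new Error('Input String is not a string');
      if (input.length === 0) throw new Error('Input String field is empty');
      if (maxTokens <= 0) throw new Error('Provide Max Tokens (bigger than 0)');
    }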

Links and References
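
  • gpt-tokenizer on npm: https://www.npmjs.com/package/gpt-tokenizer
  • js-tiktoken on npm: https://www.npmjs.com/package/js-tiktoken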
