Overview
The GPT-Tokenizer node provides operations for working with GPT-style Byte Pair Encoding (BPE) tokens. It can encode strings into tokens, decode tokens back into strings, count the tokens in a string, check whether a string fits within a token limit, and slice a string into chunks that each fit within a specified token limit.
Common scenarios include:
- Preparing text input for OpenAI GPT models by encoding it into tokens.
- Validating whether a prompt or input text stays within model token limits before sending it to the API.
- Splitting long texts into smaller parts that comply with token limits.
- Decoding token arrays back into human-readable strings.
Practical examples:
- Encoding a user message into tokens before passing it to an AI completion endpoint.
- Checking if a blog post draft exceeds the maximum allowed tokens for summarization.
- Automatically splitting a large document into token-sized chunks for batch processing.
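The first two examples map directly onto the gpt-tokenizer library this node builds on. A minimal sketch, assuming the library's default encoding is acceptable for the target model (the message and limit below are made up for illustration):

```ts
import { encode, isWithinTokenLimit } from 'gpt-tokenizer';

// Encode a user message into BPE token IDs before passing it to a completion endpoint.
const message = 'Summarize the attached blog post draft.';
const tokens: number[] = encode(message);
console.log(`${tokens.length} tokens`);

// Check that the text stays within a model's token budget before calling the API.
// In gpt-tokenizer, isWithinTokenLimit returns a falsy value when the limit is exceeded.
const MAX_TOKENS = 4096; // hypothetical limit, for illustration only
if (!isWithinTokenLimit(message, MAX_TOKENS)) {
  throw new Error('Prompt exceeds the token budget');
}
```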
Properties
| Name | Meaning |
|---|---|
| Input String | The string of text to process (encode, count tokens, check token limit, or slice). |
| Destination Key | The key name under which the result is stored in the output JSON. If empty, the operation's default key is used (see Output below). |
Additional properties used by the node's other operations:
- Max Tokens: Maximum number of tokens allowed (used in token limit checks and slicing).
- Error When Exceeding Token Limit: Whether to throw an error if the string exceeds the max token limit.
- Input Tokens: An array of BPE tokens to decode (used in decode operation).
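For the decode operation, the corresponding gpt-tokenizer call takes an array of BPE token IDs and returns the original text. A minimal sketch (the token array here comes from a prior encode call rather than real node output):

```ts
import { encode, decode } from 'gpt-tokenizer';

// Input Tokens would be an array of BPE token IDs, e.g. produced by an earlier encode step.
const inputTokens: number[] = encode('This is cool');

// The decoded string is what the node would store under the destination key (default "data").
const data: string = decode(inputTokens);
console.log({ data }); // { data: 'This is cool' }
```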
Output
The output JSON structure varies depending on the operation:
- Encode: Outputs an array of BPE token IDs under the specified destination key (default key `tokens`), e.g. `{ "tokens": [5661, 318, 1337, ...] }`.
- Decode: Outputs the decoded string under the specified destination key (default key `data`), e.g. `{ "data": "decoded string" }`.
- Count Tokens: Outputs an object with token statistics under the specified destination key (default key `tokenCount`), e.g. `{ "tokenCount": { "resume": 42, "tokens": [/* array of token IDs */] } }`.
- Check Token Limit: Outputs a boolean indicating whether the input string is within the max token limit, under the specified destination key (default key `isWithinTokenLimit`), e.g. `{ "isWithinTokenLimit": true }`.
- Slice to Max Token Limit: Outputs an array of string slices, each fitting within the max token limit, under the specified destination key (default key `slices`), e.g. `{ "slices": ["first chunk of text", "second chunk of text", ...] }`.
The node does not output binary data.
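The slicing behaviour can be approximated with the same encode/decode primitives. This is only a sketch of the idea; the node's actual chunking algorithm may choose boundaries differently (for example, to avoid splitting multi-byte characters):

```ts
import { encode, decode } from 'gpt-tokenizer';

// Approximation of "Slice to Max Token Limit": cut the token stream into
// windows of at most maxTokens tokens and decode each window back to text.
function sliceToMaxTokens(text: string, maxTokens: number): string[] {
  const tokens = encode(text);
  const slices: string[] = [];
  for (let i = 0; i < tokens.length; i += maxTokens) {
    slices.push(decode(tokens.slice(i, i + maxTokens)));
  }
  return slices;
}

const longDocument = 'Lorem ipsum dolor sit amet. '.repeat(500);
const output = { slices: sliceToMaxTokens(longDocument, 512) };
console.log(output.slices.length, 'chunks');
```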
Dependencies
- Uses the `gpt-tokenizer` library for encoding, decoding, and token limit checks.
- Uses the `js-tiktoken/lite` package to count tokens via a tokenizer initialized with a remote JSON file fetched from https://tiktoken.pages.dev/js/o200k_base.json (see the sketch after this list).
- Requires internet access to fetch the tokenizer configuration JSON at runtime.
- No internal credential types are required, but network connectivity is necessary for token counting.
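Token counting follows the usage pattern documented for js-tiktoken/lite: fetch the encoding's rank file once and build a Tiktoken instance from it. A sketch of that flow (caching and error handling kept minimal):

```ts
import { Tiktoken } from 'js-tiktoken/lite';

// Download the o200k_base ranks once and reuse the tokenizer for every count.
let tokenizer: Tiktoken | undefined;

async function countTokens(text: string): Promise<number> {
  if (!tokenizer) {
    const res = await fetch('https://tiktoken.pages.dev/js/o200k_base.json');
    if (!res.ok) throw new Error(`Failed to fetch tokenizer config: ${res.status}`);
    tokenizer = new Tiktoken(await res.json());
  }
  return tokenizer.encode(text).length;
}
```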
Troubleshooting
- Input String is not a string: This error occurs if the provided input is not a valid string. Ensure the input data type is correct.
- Input String field is empty: The node requires a non-empty string for most operations; provide valid text input.
- Input Tokens is not an array: For decoding, the input must be an array of tokens. Check the format of your token input.
- Provide Max Tokens (bigger than 0): When checking or slicing by token limit, the max tokens value must be a positive integer.
- String exceeds token limit: If enabled, the node throws this error when the input string's token count surpasses the max token limit. Either increase the limit or disable the error flag.
- Network errors fetching tokenizer config: Counting tokens depends on downloading a JSON file; network issues can cause failures, so ensure a stable internet connection.