This article makes it super easy to understand how tokens work in the LLM context!
Tokens are words, character sets, or combinations of words and punctuation that large language models (LLMs) decompose text into. Tokenization is the first step in training. The LLM analyzes the semantic relationships between tokens, such as how commonly they’re used together or whether they’re used in similar contexts. After training, the LLM uses those patterns and relationships to generate a sequence of output tokens based on the input sequence.
The set of unique tokens that an LLM is trained on is known as its vocabulary.
For example, consider the following sentence:
I heard a dog bark loudly at a cat
This text could be tokenized as:
- I
- heard
- a
- dog
- bark
- loudly
- at
- a
- cat
Given a sufficiently large set of training text, tokenization can compile a vocabulary of many thousands of tokens.
The specific tokenization method varies by LLM. Common tokenization methods include:
- Word tokenization (text is split into individual words based on a delimiter)
- Character tokenization (text is split into individual characters)
- Subword tokenization (text is split into partial words or character sets)
For example, the GPT models, developed by OpenAI, use a type of subword tokenization that's known as Byte-Pair Encoding (BPE). OpenAI provides a tool to visualize how text will be tokenized.
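To see the difference between the first two methods, here's a naive sketch in Python (real tokenizers handle punctuation and casing more carefully, so this is just an illustration; subword tokenization like BPE needs a learned merge table and shows up later with tiktoken):

# Naive word and character tokenization of the example sentence.
sentence = "I heard a dog bark loudly at a cat"

word_tokens = sentence.split(" ")   # word tokenization: split on a delimiter
char_tokens = list(sentence)        # character tokenization: every character is a token

print(word_tokens)                         # ['I', 'heard', 'a', 'dog', 'bark', 'loudly', 'at', 'a', 'cat']
print(len(word_tokens), len(char_tokens))  # 9 34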
After the LLM completes tokenization, it assigns an ID to each unique token. Consider our example sentence:
I heard a dog bark loudly at a cat
After the model uses a word tokenization method, it could assign token IDs as follows:
- I (1)
- heard (2)
- a (3)
- dog (4)
- bark (5)
- loudly (6)
- at (7)
- a (the “a” token is already assigned an ID of 3)
- cat (8)
By assigning IDs, text can be represented as a sequence of token IDs. The example sentence would be represented as [1, 2, 3, 4, 5, 6, 7, 3, 8]. The sentence “I heard a cat” would be represented as [1, 2, 3, 8].
As training continues, the model adds any new tokens in the training text to its vocabulary and assigns each an ID. For example:
- meow (9)
- run (10)
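To make sure I follow the ID assignment, here's a toy word-level sketch I put together; real models build vocabularies with subword algorithms like BPE, so this only illustrates the mapping:

# Toy sketch: build a word-level vocabulary and encode sentences as token IDs.
vocab = {}  # token -> ID

def encode(text):
    ids = []
    for token in text.split(" "):
        if token not in vocab:
            vocab[token] = len(vocab) + 1  # new tokens get the next free ID
        ids.append(vocab[token])
    return ids

print(encode("I heard a dog bark loudly at a cat"))  # [1, 2, 3, 4, 5, 6, 7, 3, 8]
print(encode("I heard a cat"))                       # [1, 2, 3, 8]
print(encode("meow run"))                            # [9, 10]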
The semantic relationships between the tokens can be analyzed by using these token ID sequences. Multi-valued numeric vectors, known as embeddings, are used to represent these relationships. An embedding is assigned to each token based on how commonly it’s used together with, or in similar contexts to, the other tokens.
After it's trained, a model can calculate an embedding for text that contains multiple tokens. The model tokenizes the text, then calculates an overall embedding value based on the learned embeddings of the individual tokens. This technique can be used for semantic document search or for adding memories to an AI.
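I haven't studied embeddings properly yet, but my rough understanding is that similarity between embeddings is measured with something like cosine similarity. The vectors below are made up purely to show the arithmetic; real models use hundreds or thousands of dimensions:

import numpy as np

# Made-up 4-dimensional embeddings, just to show the similarity arithmetic.
dog = np.array([0.8, 0.1, 0.6, 0.2])
cat = np.array([0.7, 0.2, 0.5, 0.3])
run = np.array([0.1, 0.9, 0.2, 0.8])

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(dog, cat))  # close to 1: used in similar contexts
print(cosine_similarity(dog, run))  # smaller: less related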
During output generation, the model predicts a vector value for the next token in the sequence. The model then selects the next token from its vocabulary based on this vector value. In practice, the model calculates multiple vectors by using various elements of the previous tokens' embeddings. The model then evaluates all potential tokens from these vectors and selects the most probable one to continue the sequence.
Output generation is an iterative operation. The model appends the predicted token to the sequence so far and uses that as the input for the next iteration, building the final output one token at a time.
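My mental model of that loop, with a hypothetical predict_next_token() standing in for the actual model, looks roughly like this:

# Sketch of greedy autoregressive generation. predict_next_token() is a
# hypothetical stand-in for the model; it would score every token in the
# vocabulary and return the most probable next token ID.
def generate(input_ids, predict_next_token, end_of_text_id, max_new_tokens=50):
    sequence = list(input_ids)
    for _ in range(max_new_tokens):
        next_id = predict_next_token(sequence)  # uses the whole sequence so far
        sequence.append(next_id)                # append and feed back in
        if next_id == end_of_text_id:
            break
    return sequence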
I don't quite understand this generation part. I'll study embeddings tomorrow, hoping that gives me more context.
LLMs have limits on the maximum number of tokens that can be used as input or generated as output. This limit often applies to the input and output tokens combined, a size known as the maximum context window.
For example, GPT-4 supports up to 8,192 tokens of context. The combined size of the input and output tokens can't exceed 8,192.
Taken together, a model's token limit and tokenization method determine the maximum length of text that can be provided as input or generated as output.
For example, consider a model that has a maximum context window of 100 tokens. The model processes our example sentence as input text:
I heard a dog bark loudly at a cat
By using a word-based tokenization method, the input is nine tokens. This leaves 91 word tokens available for the output.
By using a character-based tokenization method, the input is 34 tokens (including spaces). This leaves only 66 character tokens available for the output.
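That arithmetic is easy to sanity-check with the same naive splits:

# Sanity check of the context-window arithmetic with naive tokenization.
max_context = 100
sentence = "I heard a dog bark loudly at a cat"

word_input = len(sentence.split(" "))  # 9 word tokens
char_input = len(sentence)             # 34 character tokens (including spaces)

print(max_context - word_input)  # 91 tokens left for the output
print(max_context - char_input)  # 66 tokens left for the output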
It's interesting that the token limits are for the combination of input and output. I misunderstood that.
Generative AI services often use token-based pricing. The cost of each request depends on the number of input and output tokens. The pricing might differ between input and output.
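So a rough cost estimate per request would look something like this; the per-1,000-token prices below are placeholders I made up, not actual OpenAI rates:

# Rough cost estimate for one request. Prices are placeholders per 1,000 tokens;
# input and output are often priced differently.
input_price_per_1k = 0.01
output_price_per_1k = 0.03

def estimate_cost(input_tokens, output_tokens):
    return (input_tokens / 1000) * input_price_per_1k + \
           (output_tokens / 1000) * output_price_per_1k

print(estimate_cost(input_tokens=9, output_tokens=91))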
OpenAI provides a convenient Python package, tiktoken, which is a fast BPE (Byte-Pair Encoding, mentioned in the Microsoft page) tokeniser.
In [1]: import tiktoken
In [2]: encoding = tiktoken.get_encoding("cl100k_base")
In [3]: text = "This is a sample string that is converted to a set of tokens."
In [4]: tokens = encoding.encode(text)
In [5]: tokens
Out[5]: [2028, 374, 264, 6205, 925, 430, 374, 16489, 311, 264, 743, 315, 11460, 13]
In [6]: len(tokens)
Out[6]: 14
I can see that “is” is assigned an ID of 374 and “a” is assigned 264, confirming the content of the Microsoft page. I can also count the number of tokens by checking the length of the returned list.
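Going the other way also works, which let me confirm which text each ID maps to (note that the leading space is part of the token):

In [7]: encoding.decode(tokens)
Out[7]: 'This is a sample string that is converted to a set of tokens.'
In [8]: encoding.decode([374])
Out[8]: ' is'
In [9]: encoding.decode([264])
Out[9]: ' a'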
I saw that some people check the number of tokens before sending requests to the OpenAI API to avoid rate limits, but I guess I can just send requests and handle 429 Too Many Requests instead. That way I don't have to write long logic, but I'm not 100% confident.
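If I go that route, a simple retry with backoff would probably be enough. Here send_chat_request() is a hypothetical stand-in for the actual API call; the only assumption is that a rate-limit error exposes its HTTP status code:

import time

# Retry a request when it fails with 429 Too Many Requests, backing off
# exponentially between attempts. send_chat_request() is hypothetical.
def send_with_retry(send_chat_request, prompt, max_retries=5):
    for attempt in range(max_retries):
        try:
            return send_chat_request(prompt)
        except Exception as error:
            if getattr(error, "status_code", None) != 429:
                raise                    # not a rate-limit error, re-raise it
            time.sleep(2 ** attempt)     # back off: 1s, 2s, 4s, ...
    raise RuntimeError("still rate limited after retries")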