On Mar 01, 2025, I took...

Into LLM Tokenization

How do they create tokens?

So instead of feeding raw text into the model, the text is converted into tokens

Tokenization

Example: FineWeb-Edu dataset [1]

  1. With the cleaned dataset, the tokenizer first breaks the raw text down into smaller pieces, called tokens. These can be words, parts of words, punctuation, etc., depending on the tokenizer
  2. Each token is assigned a unique ID, usually a number. For example, the word cat might be one token, while running could be split into run and ning (see the sketch after this list)
  3. Tokenization reduces raw text to numbers that a language model can process, so when a model is introduced as trained on 15 trillion tokens, that means 15 trillion of these units were created after cleaning and filtering.
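
A minimal sketch of steps 1 and 2 using the tiktoken library (my choice for illustration; the GPT-2 encoding is just one example tokenizer):

```python
import tiktoken

# Load a pretrained BPE tokenizer; GPT-2's vocabulary is an example choice.
enc = tiktoken.get_encoding("gpt2")

text = "The cat is running"
token_ids = enc.encode(text)
print(token_ids)  # a short list of integer IDs, one per token

# Map each ID back to its text piece; note the leading spaces on most tokens.
for tid in token_ids:
    print(tid, repr(enc.decode([tid])))
```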

The chat

How the UI displays it
The chat under the hood, using tiktokenizer.vercel.app

Note: Did you notice the asterisk (*) is recognized as 2 different tokens? The first one has a space before it.
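
You can check this yourself; a quick sketch with tiktoken (again my assumed tooling, using the GPT-2 encoding):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# "*" on its own and " *" with a leading space map to different token IDs.
print(enc.encode("*"), enc.encode(" *"))
```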

What LLMs see

The mess

Some tokenizers apply Byte Pair Encoding (BPE, or digram coding) [2] to help the model generalize across the messy dataset.

Starting from individual tokens, BPE repeatedly finds the most frequent pair of tokens in the raw data and merges them into a single new token.

For example: "you" and "are" in dataset obviously appear together a lot, the BPE algorithm might create "you are" as one token instead of two.

This method reduces the number of tokens, speeding up training and inference.
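
Here is a from-scratch sketch of a single BPE merge step on a toy word-level corpus (my own illustration, not any library's implementation):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair, new_token):
    """Replace every occurrence of `pair` with `new_token`."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_token)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Toy corpus, already split into word-level tokens.
tokens = ["you", "are", "what", "you", "are", "and", "you", "are", "here"]

pair = most_frequent_pair(tokens)            # ("you", "are")
tokens = merge_pair(tokens, pair, " ".join(pair))
print(tokens)  # ['you are', 'what', 'you are', 'and', 'you are', 'here']
```

Real tokenizers run this merge loop over bytes or characters within words until they reach a target vocabulary size; the word-level toy above just makes the "you are" merge easy to see.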

Footnotes

  1. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu

  2. https://en.wikipedia.org/wiki/Byte_pair_encoding