How do they create tokens?
- It starts with crawling data from the internet to build a massive dataset
- The data is filtered with a classifier to remove the low-quality parts (noise, duplicate content, low-quality text, and irrelevant information); a toy sketch of this filtering step follows below
- The cleaned data is then compressed into something the machine can actually use
So instead of feeding raw text into the model, the text is converted into tokens
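A toy sketch of that filtering step, assuming a hypothetical quality_score classifier and simple hash-based deduplication (the real FineWeb pipeline is far more elaborate):

```python
import hashlib

def quality_score(text: str) -> float:
    """Hypothetical classifier: returns a 0..1 quality estimate for a document."""
    # Placeholder heuristic; a real pipeline would use a trained classifier.
    return 0.0 if len(text.split()) < 20 else 1.0

def clean_corpus(documents: list[str], threshold: float = 0.5) -> list[str]:
    """Drop exact duplicates and documents the classifier scores as low quality."""
    seen_hashes = set()
    kept = []
    for doc in documents:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:           # duplicate filter
            continue
        if quality_score(doc) < threshold:  # low-quality filter
            continue
        seen_hashes.add(digest)
        kept.append(doc)
    return kept
```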
Tokenization
Example: FineWeb-Edu dataset1
- With the cleaned dataset, the tokenizer first breaks the raw text down into smaller pieces, the tokens. These can be words, parts of words, or punctuation, depending on the tokenizer
- Each token is assigned a unique ID, commonly a number. For example, the word cat might be one token, while the word running could be split into run and ning (see the sketch after this list)
- Tokenization reduces raw text to numbers that a language model can process, so when a model is introduced as being trained on 15 trillion tokens, that means 15 trillion of these units were created after cleaning and filtering.
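A minimal sketch of this step using the tiktoken library (my choice of tokenizer here; the exact splits and IDs depend on which encoding a model uses):

```python
import tiktoken  # pip install tiktoken

# GPT-4-style encoding; other models ship different vocabularies.
enc = tiktoken.get_encoding("cl100k_base")

text = "The cat is running"
ids = enc.encode(text)

print(ids)                             # a short list of integers, one per token
print([enc.decode([i]) for i in ids])  # the text piece each ID stands for

# Decoding the IDs reproduces the original text exactly.
assert enc.decode(ids) == text
```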
The chat


Note: Did you notice that the asterisks (*) are recognized as two different tokens? The first one has a space before it.
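You can check this yourself; a quick sketch with tiktoken (again assuming the cl100k_base encoding, so the exact IDs may differ from what you see in the screenshot):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# A bare asterisk and a space-prefixed asterisk usually map to different IDs,
# because the leading space is folded into the token itself.
print(enc.encode("*"))
print(enc.encode(" *"))
```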

The mess
Some tokenizers apply Byte Pair Encoding (BPE, or digram coding)2 to help the model generalize better across the messy dataset.
Starting from individual tokens, it repeatedly finds the most frequent pair of tokens in the raw data and merges them into a single new token.
For example, "you" and "are" obviously appear together a lot in the dataset, so the BPE algorithm might create "you are" as one token instead of two.
This method reduces the number of tokens, speeding up training and inference.
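Here is a toy sketch of that merge loop (heavily simplified; production tokenizers like GPT-2's work on bytes, keep a merge table, and cap the vocabulary size):

```python
from collections import Counter

def most_frequent_pair(tokens: list[str]) -> tuple[str, str]:
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def bpe_train(text: str, num_merges: int) -> list[str]:
    """Start from single characters and repeatedly merge the most frequent pair."""
    tokens = list(text)
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        a, b = most_frequent_pair(tokens)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # the pair becomes one new token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# The repeated phrase collapses into fewer, longer tokens after a few merges.
print(bpe_train("you are what you are", num_merges=10))
```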
References
- Deep Dive into LLMs like ChatGPT by Andrej Karpathy
- 🍷 FineWeb: decanting the web for the finest text data at scale