Large Language Models (LLMs) are among the most transformative technologies ever applied to language, but to truly understand how they work—and how they’re priced—you need to understand one simple but powerful concept: the token. Tokens are the basic currency of language models. They define what the model can read, remember, and output, and they shape everything from cost to performance.
Let’s unpack what tokens are, how they relate to real words, and how they limit or enable what an LLM can do.
Why Tokens Matter in LLMs
Every time you type a prompt into ChatGPT, Claude, or Gemini, the model doesn’t see your message as words—it sees it as tokens. These tokens are numerical representations of language, the tiny slices of meaning that an LLM uses to understand and generate text.
Tokens sit at the intersection of language and computation. They’re abstract enough to represent flexible linguistic structures, yet concrete enough to be processed by mathematical models. Understanding them isn’t just academic curiosity: it gives you real control over cost, performance, and prompt design.
What Are Tokens in LLMs?
A token is a chunk of text—maybe a whole word, a piece of a word, or even just a space or punctuation mark.
For example, the word “apple” might be a single token, while the word “unbelievable” could become two or three: “un”, “believ”, and “able”.
LLMs don’t operate directly on words because language is too varied. Instead, they use tokenizers—algorithms like Byte-Pair Encoding (BPE) or SentencePiece—that break text into statistically frequent chunks. This allows models to learn patterns efficiently across languages and dialects.
So, when you feed a hundred words to a model, it doesn’t count them as a hundred—it counts the number of tokens after tokenization, which might be a few hundred depending on the language.
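To make the idea concrete, here is a minimal sketch of greedy longest-match subword tokenization. The vocabulary below is invented for illustration; real tokenizers like BPE learn vocabularies of tens of thousands of pieces from data, so actual splits will differ.

```python
# Toy greedy subword tokenizer: repeatedly match the longest piece
# found in a fixed vocabulary, falling back to single characters.
# This vocabulary is made up for illustration; real BPE vocabularies
# are learned from large corpora.
VOCAB = {"un", "believ", "able", "apple"}

def tokenize(word: str) -> list[str]:
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate first, down to a single character.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in VOCAB or j - i == 1:  # single-char fallback
                tokens.append(piece)
                i = j
                break
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(tokenize("apple"))         # ['apple']
```

Note how a common word stays whole while a longer one splits into several pieces, which is exactly why token counts exceed word counts.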
Tokenization: from Words to Tokens
Tokenization is the process of splitting text into smaller units called tokens, which can represent full words, subwords, characters, or even punctuation. It’s the first step in converting natural language into a form a model can understand—structured sequences of numbers. Tokenization ensures that text of arbitrary length and complexity can be systematically encoded for computation.
Take this sentence: “Cats chase mice.”
- Words: ["Cats", "chase", "mice", "."]
- Tokens (using subword tokenization): ["Cat", "s", " chase", " mice", "."]
- Numeric representation: [4832, 318, 4531, 1129, 13]
Each token corresponds to a number in the model’s internal dictionary (called a vocabulary). These numbers become embeddings—dense vectors of floating‑point values—that the LLM processes mathematically.
The mapping from tokens to numbers (from step two to step three above) is usually treated as part of the broader tokenization pipeline, but conceptually it is a distinct step. First, the text is segmented into tokens (subwords or characters); then each token is looked up in a fixed vocabulary to obtain a corresponding integer ID. There is no universally standardized name for that second step, but it is most often called “token ID assignment” or simply “encoding to token IDs.” Many libraries and APIs bundle both stages together and casually call the whole process “tokenization,” yet it helps to keep them apart: tokenization defines the units of text, while ID assignment encodes those units into the numeric format the model can process.
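The two stages can be sketched in a few lines. The vocabulary and IDs below reuse the toy numbers from the example above and are invented for illustration; real vocabularies contain tens of thousands of entries.

```python
# Stage 1 (tokenization) produces token strings; stage 2 (ID
# assignment) looks each one up in a fixed vocabulary. The IDs here
# are illustrative, not from any real model's vocabulary.
vocab = {"Cat": 4832, "s": 318, " chase": 4531, " mice": 1129, ".": 13}

def encode(tokens: list[str]) -> list[int]:
    return [vocab[t] for t in tokens]

tokens = ["Cat", "s", " chase", " mice", "."]  # stage 1: token strings
ids = encode(tokens)                           # stage 2: token IDs
print(ids)  # [4832, 318, 4531, 1129, 13]
```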
Tokens as a Pricing Model
Because computation in LLMs happens per token, most providers charge by token usage.
- Input tokens are what you send to the model (your prompts, context, or instructions).
- Output tokens are what the model generates in response.
For example, using OpenAI’s GPT‑4‑Turbo might cost a small fraction of a cent per thousand input tokens and slightly more for output tokens. Anthropic, Mistral, and Google follow similar patterns.
This token-based pricing isn’t arbitrary—it reflects the actual computational cost of processing and predicting each token. That’s why understanding how text breaks into tokens can help you budget intelligently and structure prompts more efficiently.
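A quick back-of-the-envelope calculator makes the billing model tangible. The per-token prices below are placeholders, not any provider’s actual rates, so always check current pricing pages before budgeting.

```python
# Hypothetical prices in USD per 1,000 tokens -- placeholders only.
PRICE_PER_1K_INPUT = 0.01
PRICE_PER_1K_OUTPUT = 0.03

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# A 2,000-token prompt that yields a 500-token reply:
print(f"${estimate_cost(2000, 500):.4f}")  # $0.0350
```

Notice that output tokens are priced higher than input tokens in this sketch, mirroring the common pattern that generation costs more than reading.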
Language Structure: Headwords, Unique Words, and Their Differences
To better grasp what tokens capture (and what they lose), it helps to revisit some linguistic concepts.
Headwords
A headword (or lemma) is the base dictionary form of a word. For instance, run, running, runs, and ran all share the headword run. Dictionaries are organized by headwords, not every inflected form.
Unique Words
Unique words in a document refer to different word forms that actually appear in text. If your document includes run and runs, that counts as two unique words.
Total Word Count
Total words refer to the complete count of all words in a text, including repetitions. For example, in the phrase “run and run again”, total words = 4, even though there are only 3 unique words.
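The distinction between total and unique words is easy to compute. Here is the phrase from above, counted both ways (a simple whitespace split; real word counting would also handle punctuation and case):

```python
# Count total vs. unique word forms in a phrase.
text = "run and run again"
words = text.split()

total_words = len(words)        # every occurrence counts
unique_words = len(set(words))  # each distinct form counts once

print(total_words, unique_words)  # 4 3
```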
The Relationship Between Headwords, Unique Words and Tokens
Headwords compress language forms into core meanings, while unique words capture how those meanings vary by tense, case, or usage. LLMs, however, see neither—only tokens. One token might represent part of a root word, or sometimes merge entire phrases.
Understanding this difference helps explain why “token counts” don’t perfectly track word counts and why that matters for both performance and interpretability.
In any given text, total words represent the overall length of the writing, unique words capture vocabulary diversity, and headwords show the underlying lexical roots. Together they describe how language is used on the surface and beneath it.
Tokens, meanwhile, sit one step below all of these—they are how a language model internally encodes the text. A single headword like run may produce several unique word forms (runs, running, ran), and each of these might split into multiple tokens during tokenization. Conversely, frequent short words like and or the might map one‑to‑one with single tokens. This mismatch between linguistic units and computational tokens is why two texts with similar word counts can yield very different token counts—and therefore different costs or context usage in an LLM.
How Tokens Relate to Words and Headwords
Tokens aren’t linguistically tidy. They’re built for efficiency, not for grammar. A single English word can split into multiple tokens if it’s long or uncommon, while a short and frequent phrase might be a single token.
Token-per-word Ratios
Token-per-word ratios vary widely across languages. In English, one word averages about 1.3–1.5 tokens. In Spanish, with its morphological richness, the same text might produce 1.6–1.8 tokens per word.
Languages like Chinese compress meaning densely into characters—each often corresponds to a token, giving a very different ratio.
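Estimating the ratio for your own text is straightforward: divide the token count by the word count. In the sketch below, the counts are hypothetical stand-ins; in practice you would get the token count from a real tokenizer (such as OpenAI’s tiktoken library) and the word count from simple splitting.

```python
# Token-per-word ratio from a token count and a word count.
# The counts below are hypothetical, chosen to match the ranges
# discussed above.
def tokens_per_word(token_count: int, word_count: int) -> float:
    return token_count / word_count

print(tokens_per_word(140, 100))  # 1.4  (English-like density)
print(tokens_per_word(170, 100))  # 1.7  (Spanish-like density)
```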
Implications for Multilingual Models
Because tokens don’t map evenly to words across languages, multilingual models face different space and cost trade-offs. A prompt that fits within 4,000 tokens in English might overflow that limit in Spanish or German due to slightly higher tokenization density.
So when we speak of an LLM’s “context window,” it’s not a matter of how many words it can see, but how many tokens—and those differ by language and structure.
The Context Window Explained
The context window is effectively the LLM’s working memory, measured in tokens. It sets an upper bound on how much it can process at once.
Input context
All your input—system instructions, history, user queries—must fit inside the model’s input token window. If you exceed it, older parts are truncated or ignored.
Output context
Models also reserve space for output tokens—the generated response. If you push the input too close to the total limit, there’s less room left for output.
Combined budgeting
Most LLMs express their context limit as “total tokens.” For instance, a model might support 128k tokens total, meaning the sum of your input and the model’s output can’t exceed 128k. Tools like tiktoken (for OpenAI models) help estimate this before you send a request.
This design matters because once an LLM’s token memory fills up, it can no longer “recall” earlier text. A longer context window extends coherent reasoning and memory, but at higher computational cost.
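Budgeting against a shared limit can be reduced to simple arithmetic. The sketch below assumes a single combined input-plus-output limit, as described above; the 128k figure is illustrative.

```python
# How many output tokens remain once the prompt is counted against a
# combined context limit? The limit here is illustrative.
CONTEXT_LIMIT = 128_000

def output_budget(input_tokens: int, context_limit: int = CONTEXT_LIMIT) -> int:
    remaining = context_limit - input_tokens
    if remaining <= 0:
        raise ValueError("Prompt alone exceeds the context window")
    return remaining

print(output_budget(120_000))  # 8000 tokens left for the response
```

In a real application you would compute `input_tokens` with the model’s own tokenizer before sending the request, since over-filling the input silently squeezes the response.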
The Problem: Larger Input than the Context Window
When the input text is larger than a model’s input context window, the extra tokens simply cannot be considered: they are either truncated (usually from the oldest parts of the text) or compressed/omitted by whatever application logic wraps the model, which means the model may ignore important earlier sections and produce incoherent or incorrect answers.
A common workaround is chunking: splitting long documents into smaller segments that fit into the context window, then using techniques like retrieval‑augmented generation (RAG), semantic search, or sliding windows to select only the most relevant chunks to send with each query.
Another strategy is summarization or compression, where earlier parts of a conversation or document are periodically summarized into shorter representations so they continue to “fit” in the limited context. All of these workarounds have computational costs: more chunking and retrieval steps mean extra embedding, indexing, and query operations, and larger effective context windows increase memory usage, latency, and attention computation cost, which typically grows at least quadratically with the number of tokens processed.
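The sliding-window variant of chunking can be sketched in a few lines. Integers stand in for token IDs here; in practice the sequence would come from a tokenizer, and the chunk size would be chosen to fit the model’s context window with room to spare.

```python
# Sliding-window chunking: split a token sequence into overlapping
# chunks so that context carries over between adjacent chunks.
def chunk_tokens(tokens: list[int], chunk_size: int, overlap: int) -> list[list[int]]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

doc = list(range(10))  # stand-in for 10 token IDs
print(chunk_tokens(doc, chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

Each chunk repeats the last token(s) of the previous one, which helps preserve continuity at chunk boundaries at the price of some duplicated computation.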
Comparing Frontier Models’ Context Windows
Here’s a snapshot of modern LLMs’ context windows (approximate data as of early 2026):
| Model | Total Context Window (Tokens) | Open-Source Availability |
|---|---|---|
| GPT-5.2 (OpenAI) | 400k | Closed |
| Claude Sonnet 4.5 (Anthropic) | 200k | Closed |
| Gemini 3 Pro (Google DeepMind) | 1M | Closed |
| LLaMA 4 Scout (Meta) | 10M | Open weights |
| DeepSeek-v3 | 128k | Open |
Longer windows allow you to query entire books, datasets, or even an entire codebase, but they also increase context noise and latency. There’s a delicate balance between window size, cost, and model attentiveness.
Conclusion: Understanding Tokens Is Understanding LLMs
Tokens are to language models what atoms are to matter—they form the indivisible units from which everything else emerges.
By seeing your text not just as words but as tokens, you start thinking like the model itself. That perspective reveals why multilingual cost differs, why context cuts off when it does, and why even small prompt changes can shift total token usage and output quality.
In short: mastering tokens isn’t just about optimizing cost—it’s about understanding the very architecture of language inside these models.
