
Tokens, Words, and Windows: Understanding LLMs' Token-Based Pricing Model

Large Language Models, Context Windows, and Word Tokenization

Large Language Models (LLMs) are among the most transformative technologies ever applied to language, but to truly understand how they work—and how they’re priced—you need to understand one simple but powerful concept: the token. Tokens are the basic currency of language models. They define what the model can read, remember, and output, and they shape everything from cost to performance.

Let’s unpack what tokens are, how they relate to real words, and how they limit or enable what an LLM can do.

Why Tokens Matter in LLMs

Every time you type a prompt into ChatGPT, Claude, or Gemini, the model doesn’t see your message as words—it sees it as tokens. These tokens are numerical representations of language, the tiny slices of meaning that an LLM uses to understand and generate text.

Tokens sit at the intersection of language and computation. They’re abstract enough to represent flexible linguistic structures, yet concrete enough to be processed by mathematical models. Understanding them isn’t just academic curiosity: it gives you real control over cost, performance, and prompt design.

What Are Tokens in LLMs?

A token is a chunk of text: maybe a whole word, a piece of a word, or even just a space or punctuation mark.
For example, the word “apple” might be a single token, while the word “unbelievable” could become three: “un”, “believ”, and “able”.

LLMs don’t operate directly on words because language is too varied. Instead, they use tokenizers—algorithms like Byte-Pair Encoding (BPE) or SentencePiece—that break text into statistically frequent chunks. This allows models to learn patterns efficiently across languages and dialects.

So, when you feed a hundred words to a model, it doesn’t count them as a hundred: it counts the tokens after tokenization, which for English text is typically around 130 to 150 and can run higher in other languages.
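To see this concretely, here is a minimal sketch using OpenAI’s tiktoken library, assuming the cl100k_base encoding (exact splits and IDs vary from encoding to encoding):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # an encoding used by several OpenAI models

# A long, less frequent word usually splits into several subword tokens
ids = enc.encode("unbelievable")
print(ids, [enc.decode([i]) for i in ids])

# Word count and token count rarely match
text = "The quick brown fox jumps over the lazy dog."
print(len(text.split()), "words ->", len(enc.encode(text)), "tokens")
```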

Tokenization: from Words to Tokens

Tokenization is the process of splitting text into smaller units called tokens, which can represent full words, subwords, characters, or even punctuation. It’s the first step in converting natural language into a form a model can understand—structured sequences of numbers. Tokenization ensures that text of arbitrary length and complexity can be systematically encoded for computation.

Take this sentence: “Cats chase mice.”

  1. Words: ["Cats", "chase", "mice", "."]
  2. Tokens (using subword tokenization): ["Cat", "s", " chase", " mice", "."]
  3. Numeric representation: [4832, 318, 4531, 1129, 13]

Each token corresponds to a number in the model’s internal dictionary (called a vocabulary). These numbers become embeddings—dense vectors of floating‑point values—that the LLM processes mathematically.

The mapping from tokens to numbers (from step two to step three above) is usually treated as part of the broader tokenization pipeline, but conceptually it is a distinct step. First, the text is segmented into tokens (subwords, characters, or punctuation marks); then each token is looked up in a fixed vocabulary to obtain an integer ID. There is no universally standardized name for this second step, but it is most often called “token ID assignment” or simply “encoding to token IDs.” Many libraries and APIs bundle both stages together and casually call the whole process “tokenization,” but it is helpful to keep them apart: tokenization defines the units of text, while ID assignment encodes those units into the numeric format the model can process.
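As a sketch of the two stages, Hugging Face’s transformers tokenizers expose them separately (shown here with the GPT-2 tokenizer; the exact pieces printed depend on the vocabulary):

```python
# pip install transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Stage 1: tokenization -- split the string into token strings
tokens = tok.tokenize("Cats chase mice.")
print(tokens)  # subword pieces; 'G' with a breve marks a leading space in GPT-2's vocabulary

# Stage 2: token ID assignment -- look each token up in the vocabulary
ids = tok.convert_tokens_to_ids(tokens)
print(ids)     # the integers the model actually consumes
```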

Tokens as a Pricing Model

Because computation in LLMs happens per token, most providers charge by token usage.

  • Input tokens are what you send to the model (your prompts, context, or instructions).
  • Output tokens are what the model generates in response.

For example, OpenAI’s GPT‑4‑Turbo is priced per input token, with output tokens costing more. Anthropic, Mistral, and Google follow similar patterns.

This token-based pricing isn’t arbitrary—it reflects the actual computational cost of processing and predicting each token. That’s why understanding how text breaks into tokens can help you budget intelligently and structure prompts more efficiently.
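As an illustration, a back-of-the-envelope cost estimator might look like this (the per-token rates below are made-up placeholders, not any provider’s actual pricing):

```python
# Placeholder rates for illustration only -- check your provider's price list.
PRICE_PER_1K_INPUT = 0.01   # USD per 1,000 input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.03  # USD per 1,000 output tokens (assumed)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    return (input_tokens / 1_000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1_000) * PRICE_PER_1K_OUTPUT

# A 2,000-token prompt with a 500-token reply:
print(f"${estimate_cost(2_000, 500):.4f}")  # -> $0.0350
```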

Language Structure: Headwords, Unique Words, and Their Differences

To better grasp what tokens capture (and what they lose), it helps to revisit some linguistic concepts.

Headwords

A headword (or lemma) is the base dictionary form of a word. For instance, run, running, runs, and ran all share the headword run. Dictionaries are organized by headwords, not every inflected form.

Unique Words

Unique words in a document refer to different word forms that actually appear in text. If your document includes run and runs, that counts as two unique words.

Total Word Count

Total words refer to the complete count of all words in a text, including repetitions. For example, in the phrase “run and run again”, total words = 4, even though there are only 3 unique words.
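A few lines of Python make the distinction concrete:

```python
text = "run and run again"
words = text.split()

print("total words:", len(words))        # 4
print("unique words:", len(set(words)))  # 3: 'run', 'and', 'again'
```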

The Relationship Between Headwords, Unique Words and Tokens

Headwords compress language forms into core meanings, while unique words capture how those meanings vary by tense, case, or usage. LLMs, however, see neither—only tokens. One token might represent part of a root word, or sometimes merge entire phrases.

Understanding this difference helps explain why “token counts” don’t perfectly track word counts and why that matters for both performance and interpretability.

In any given text, total words represent the overall length of the writing, unique words capture vocabulary diversity, and headwords show the underlying lexical roots. Together they describe how language is used on the surface and beneath it.

Tokens, meanwhile, sit one step below all of these—they are how a language model internally encodes the text. A single headword like run may produce several unique word forms (runs, running, ran), and each of these might split into multiple tokens during tokenization. Conversely, frequent short words like and or the might map one‑to‑one with single tokens. This mismatch between linguistic units and computational tokens is why two texts with similar word counts can yield very different token counts—and therefore different costs or context usage in an LLM.
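You can observe the mismatch directly by counting tokens per word form (again with tiktoken’s cl100k_base encoding; the counts differ across encodings):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Inflected forms of one headword, plus two frequent function words
for form in ["run", "runs", "running", "ran", "and", "the"]:
    print(f"{form!r} -> {len(enc.encode(form))} token(s)")
```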

How Tokens Relate to Words and Headwords

Tokens aren’t linguistically tidy. They’re built for efficiency, not for grammar. A single English word can split into multiple tokens if it’s long or uncommon, while a short and frequent phrase might be a single token.

Token-per-word Ratios

Token-per-word ratios vary widely across languages. In English, one word averages about 1.3–1.5 tokens. In Spanish, with its morphological richness, the same text might produce 1.6–1.8 tokens per word.
Languages like Chinese compress meaning densely into characters—each often corresponds to a token, giving a very different ratio.
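A rough way to compare ratios is to tokenize parallel sentences. Tiny samples like the ones below only hint at the trend; real ratios depend on the corpus and the encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The cat sat quietly by the window watching the rain.",
    "Spanish": "El gato se sentó tranquilamente junto a la ventana mirando la lluvia.",
}

for lang, text in samples.items():
    words, tokens = len(text.split()), len(enc.encode(text))
    print(f"{lang}: {words} words, {tokens} tokens, {tokens / words:.2f} tokens/word")
```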

Implications for Multilingual Models

Because tokens don’t map evenly to words across languages, multilingual models face different space and cost trade-offs. A prompt that fits within 4,000 tokens in English might overflow that limit in Spanish or German due to slightly higher tokenization density.

So when we speak of an LLM’s “context window,” it’s not a matter of how many words it can see, but how many tokens—and those differ by language and structure.

The Context Window Explained

The context window is effectively the LLM’s working memory, measured in tokens. It sets an upper bound on how much it can process at once.

Input context

All your input—system instructions, history, user queries—must fit inside the model’s input token window. If you exceed it, older parts are truncated or ignored.

Output context

Models also reserve space for output tokens—the generated response. If you push the input too close to the total limit, there’s less room left for output.

Combined budgeting

Most LLMs express their context limit as “total tokens.” For instance, a model might support 128k tokens total, meaning the sum of your input and the model’s output can’t exceed 128k. Tools like tiktoken (for OpenAI models) help estimate this before you send a request.
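For example, a pre-flight check might count input tokens and verify that enough of the window remains for the response (the window size and output reservation below are assumptions for illustration):

```python
import tiktoken

TOTAL_WINDOW = 128_000   # assumed total context (input + output) for the model
RESERVED_OUTPUT = 4_000  # tokens we choose to keep free for the response

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_window(prompt: str) -> bool:
    """True if the prompt leaves room for the reserved output budget."""
    return len(enc.encode(prompt)) + RESERVED_OUTPUT <= TOTAL_WINDOW

print(fits_in_window("Summarize the following report: ..."))  # True for short prompts
```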

This design matters because once an LLM’s token memory fills up, it can no longer “recall” earlier text. A longer context window extends coherent reasoning and memory, but at higher computational cost.

The Problem: Larger Input than the Context Window

When the input text is larger than a model’s input context window, the extra tokens simply cannot be considered: they are either truncated (usually from the oldest parts of the text) or compressed/omitted by whatever application logic wraps the model. As a result, the model may ignore important earlier sections and produce incoherent or incorrect answers.

A common workaround is chunking: splitting long documents into smaller segments that fit into the context window, then using techniques like retrieval‑augmented generation (RAG), semantic search, or sliding windows to select only the most relevant chunks to send with each query.
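A naive token-based chunker could look like the sketch below; real systems usually split on sentence or paragraph boundaries rather than raw token offsets:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_by_tokens(text: str, max_tokens: int = 1_000) -> list[str]:
    """Split text into pieces of at most max_tokens tokens each."""
    ids = enc.encode(text)
    return [enc.decode(ids[i:i + max_tokens])
            for i in range(0, len(ids), max_tokens)]

long_document = "All work and no play makes Jack a dull boy. " * 500  # stand-in text
chunks = chunk_by_tokens(long_document)
print(len(chunks), "chunks, each small enough to fit in the context window")
```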

Another strategy is summarization or compression, where earlier parts of a conversation or document are periodically summarized into shorter representations so they continue to “fit” in the limited context. All of these workarounds have computational costs: more chunking and retrieval steps mean extra embedding, indexing, and query operations, and larger effective context windows increase memory usage, latency, and attention computation cost, which typically grows at least quadratically with the number of tokens processed.

Comparing Frontier Models' Context Windows

Here’s a snapshot of the context windows of modern LLMs (approximate data as of early 2026):

  • GPT-5.2 (OpenAI): 400k tokens total, closed
  • Claude Sonnet 4.5 (Anthropic): 200k tokens total, closed
  • Gemini 3 Pro (Google DeepMind): 1M tokens total, closed
  • LLaMA 4 Scout (Meta): 10M tokens total, open model weights
  • DeepSeek-v3: 128k tokens total, open

Longer windows allow you to query entire books, datasets, or even an entire codebase, but they also increase context noise and latency. There’s a delicate balance between window size, cost, and model attentiveness.

Conclusion: Understanding Tokens Is Understanding LLMs

Tokens are to language models what atoms are to matter—they form the indivisible units from which everything else emerges.

By seeing your text not just as words but as tokens, you start thinking like the model itself. That perspective reveals why multilingual cost differs, why context cuts off when it does, and why even small prompt changes can shift total token usage and output quality.

In short: mastering tokens isn’t just about optimizing cost—it’s about understanding the very architecture of language inside these models.
