The Architecture and Evolution of Large Language Models: From Transformer to ChatGPT, an Interactive Deep Dive
An interactive, in-depth exploration of LLM evolution from Word2Vec to GPT-4, covering Transformer architecture, self-attention, text generation, and RLHF.
Introduction
Since the advent of ChatGPT, Large Language Models (LLMs) have transformed society. Writing code, summarizing papers, answering complex questions: all of this emerges from a remarkably simple principle, predicting the next token.
This article explains the complete picture of LLMs through six pillars:
- History and genealogy: Evolution from Word2Vec to GPT-4
- Tokenization: How text is converted to numbers
- Transformer architecture: The engine of LLMs
- Self-attention: The core algorithm of Transformers
- Text generation: Temperature, top-k, and sampling strategies
- Training pipeline: From pre-training to RLHF
The LLM Genealogy: A Decade of Evolution
Let's trace the evolution of language models.
The Word2Vec Era (2013)
Mikolov et al. (2013) proposed Word2Vec, which converts words into fixed-dimensional vectors (embeddings). The famous example king - man + woman ≈ queen demonstrated that these vectors capture semantic relationships. However, Word2Vec produced static embeddings: each word gets a single vector regardless of its context.
The Transformer Revolution (2017)
Vaswani et al. (2017) changed everything with "Attention Is All You Need." By replacing RNN's sequential processing with self-attention, they enabled parallel computation. This innovation became the foundation for all modern LLMs.
The Pre-training Era (2018 onward)
Three architecture families evolved from the Transformer base:
Encoder-only (BERT family)
BERT (2018) is pre-trained with Masked Language Modeling (MLM), predicting masked tokens. By capturing bidirectional context, it excels at understanding tasks like classification, NER, and similarity computation.
Encoder-Decoder (T5 family)
T5 (2019) unifies all tasks as "text-to-text." Translation, summarization, and QA all share the same format.
Decoder-only (GPT family)
The GPT series uses autoregressive language models, generating one token at a time from left to right. As parameter counts increased, capabilities improved dramatically: GPT-3 (175B parameters) demonstrated few-shot learning. Today's ChatGPT, Claude, and Gemini all descend from this lineage.
Scaling Laws
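Kaplan et al. (2020) discovered that language model loss $L$ follows a power law with respect to parameters $N$, data $D$, and compute $C$ (each holding when the other two are not the bottleneck):

$$L(N) \propto N^{-\alpha_N}, \quad L(D) \propto D^{-\alpha_D}, \quad L(C) \propto C^{-\alpha_C}$$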
Furthermore, Hoffmann et al. (2022) (the Chinchilla paper) showed that, for compute-optimal training, parameters and training tokens should be scaled in equal proportion. For example, a 70B parameter model needs approximately 1.4 trillion training tokens, which is about 20 tokens per parameter.
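The arithmetic behind the 70B example can be checked with a small sketch. The 20-tokens-per-parameter ratio is simply the one implied by the figures above (1.4T / 70B), and the training-cost estimate $C \approx 6ND$ is a commonly used approximation rather than something stated in this article; both are assumptions, and the function names are illustrative.

```python
def compute_optimal_tokens(n_params, tokens_per_param=20.0):
    """Training tokens for a given parameter count (assumed ~20:1 ratio)."""
    return n_params * tokens_per_param


def training_flops(n_params, n_tokens):
    """Rough training cost using the common approximation C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens


n_params = 70e9
n_tokens = compute_optimal_tokens(n_params)
flops = training_flops(n_params, n_tokens)
print(f"{n_tokens:.2e} tokens, about {flops:.2e} training FLOPs")
```

Doubling the parameter count under this rule doubles the token budget as well, so the estimated compute grows by roughly four times.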
Tokenization: Converting Text to Numbers
LLMs don't process text directly. They operate on tokens: integer IDs that represent pieces of text.
BPE (Byte Pair Encoding)
The most widely used tokenization method in modern LLMs is BPE (Sennrich et al., 2016).
Algorithm:
- Split text into character (or byte) units
- Find the most frequently adjacent pair in the corpus
- Merge that pair into a new token
- Repeat until vocabulary reaches target size

    # Simplified BPE pseudocode
    # (most_frequent_pair and merge_in_corpus are placeholder helpers)
    vocab = set(all_characters)
    for i in range(num_merges):
        # Find the most frequent adjacent pair
        pair = most_frequent_pair(corpus)
        # Merge into a new token
        new_token = pair[0] + pair[1]
        vocab.add(new_token)
        corpus = merge_in_corpus(corpus, pair, new_token)

GPT-4 has a vocabulary of approximately 100,000 tokens. For languages like Japanese, a single character often maps to more than one token, because byte-level BPE starts from UTF-8 bytes and less frequent characters are not fully merged.
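The BPE merge loop described in this section can be made runnable in a few lines. This is a minimal character-level sketch on a toy word list, not the tokenizer of any particular model: `learn_bpe` and `merge_word` are hypothetical names, pairs are counted within words only, and real implementations work on bytes and a far larger corpus.

```python
from collections import Counter


def merge_word(word, pair, new_token):
    """Replace each adjacent occurrence of `pair` in `word` with `new_token`."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges; returns (vocab, ordered merge list)."""
    corpus = [list(w) for w in words]           # start from single characters
    vocab = {ch for w in corpus for ch in w}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            pairs.update(zip(w, w[1:]))         # count adjacent pairs
        if not pairs:
            break                               # nothing left to merge
        pair = max(pairs, key=pairs.get)        # most frequent pair
        new_token = pair[0] + pair[1]
        vocab.add(new_token)
        merges.append(pair)
        corpus = [merge_word(w, pair, new_token) for w in corpus]
    return vocab, merges


vocab, merges = learn_bpe(["abab", "abab", "abc"], num_merges=3)
print(merges)  # [('a', 'b'), ('ab', 'ab'), ('ab', 'c')]
```

The ordered merge list is what a tokenizer replays at encoding time: new text is split into characters and the same merges are applied in the same order.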
Why Subwords?
- No out-of-vocabulary (OOV) problem: Unknown words can be represented as combinations of subwords
- Vocabulary control: A middle ground between character-level (small vocab, long sequences) and word-level (large vocab, OOV issues)
- Multilingual support: The same BPE tokenizer can handle multiple languages
Transformer Architecture
Let's examine the structure of the Transformer (decoder variant), the engine of LLMs.
A decoder-only Transformer consists of five core components. Let's explore each in detail.
1. Input Embedding + Positional Encoding
Converts token IDs into high-dimensional vectors (e.g., 4096 dimensions). Unlike RNNs, Transformers process inputs in parallel, so positional information must be explicitly injected via positional encoding.
Sinusoidal Positional Encoding (Original)
The original Transformer (Vaswani et al., 2017) used sinusoidal positional encoding:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

In a heatmap of these values, lower dimensions (left) form high-frequency waves, while higher dimensions (right) form low-frequency waves.
Why sinusoids? For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, making it easy for the model to learn relative positions.
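The linearity claim can be checked numerically. The sketch below builds the encoding with sine in even dimensions and cosine in odd ones (the layout of the original paper), then derives the encoding at position `pos + k` from the one at `pos` using only `k`. The function names and the small `d_model` are illustrative assumptions.

```python
import math


def sinusoidal_pe(pos, d_model, base=10000.0):
    """Encoding for one position: [sin, cos, sin, cos, ...]."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


def shift_pe(pe, k, base=10000.0):
    """Map PE(pos) to PE(pos + k) with one 2x2 rotation per (sin, cos) pair.

    The rotation depends only on the offset k, never on pos itself.
    """
    d_model = len(pe)
    out = []
    for i in range(d_model // 2):
        w = 1.0 / base ** (2 * i / d_model)
        s, c = pe[2 * i], pe[2 * i + 1]
        out.append(s * math.cos(k * w) + c * math.sin(k * w))
        out.append(c * math.cos(k * w) - s * math.sin(k * w))
    return out


direct = sinusoidal_pe(8, d_model=8)
shifted = shift_pe(sinusoidal_pe(5, d_model=8), k=3)
print(max(abs(a - b) for a, b in zip(direct, shifted)))  # tiny rounding error
```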
RoPE (Rotary Position Embedding)
Modern LLMs (LLaMA, Gemma, Mistral, etc.) use RoPE. RoPE pairs vector dimensions and applies 2D rotations at position-dependent angles.
RoPE's key property: the dot product between a query rotated for position $m$ and a key rotated for position $n$ depends only on the relative position $m - n$. This naturally captures relative positional relationships without explicitly encoding absolute positions.
RoPE also enables position extrapolation (handling sequences longer than training), forming the basis for context length extensions like NTK-aware scaling and YaRN.
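The relative-position property of RoPE can be verified with a small sketch. It rotates adjacent pairs of dimensions, which is one common convention (some implementations pair dimension `i` with `i + d/2` instead); the vectors, the `rope` helper, and the base of 10000 are illustrative assumptions.

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle pos * base**(-2i/d)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        a, b = x[2 * i], x[2 * i + 1]
        out.append(a * math.cos(theta) - b * math.sin(theta))
        out.append(a * math.sin(theta) + b * math.cos(theta))
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset (3) at two very different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(near, far)  # equal up to floating-point rounding
```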
2. Multi-Head Self-Attention
The heart of the Transformer. We'll cover the mechanics in detail in the next section, but here let's understand its role in the overall architecture.
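A single attention head can only capture one type of relationship pattern. Multiple heads learn different patterns in parallel: subject-verb agreement, adjective-noun modification, coreference, and so on.

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$$

where $\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$. GPT-3 uses 96 heads; LLaMA 2 70B uses 64 heads.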
MQA / GQA: Optimizing Inference Efficiency
Standard multi-head attention (MHA) gives each head independent K, V projections, so the KV cache grows large during inference. Two variants shrink it by sharing K, V heads:
| Method | K,V Heads | Model Examples | Characteristics |
|---|---|---|---|
| MHA | = Q heads | GPT-3 | Best quality, high memory |
| MQA | 1 | PaLM, StarCoder | All K,V shared, fastest inference |
| GQA | Q / g | LLaMA 2, Mistral | Balance of quality and efficiency |
GQA (Grouped-Query Attention) groups Q heads and shares K,V within each group. LLaMA 2 70B adopted GQA, achieving near-MHA quality with significantly faster inference.
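A back-of-the-envelope calculation shows why the number of K,V heads matters. The layer and query-head counts below are the LLaMA 2 70B figures quoted in this article; the head dimension of 128, the 8 K,V heads for GQA, the 4096-token sequence, and 16-bit storage are assumptions added for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the KV cache for one sequence (the factor 2 covers K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


config = dict(n_layers=80, head_dim=128, seq_len=4096)
sizes = {}
for name, n_kv_heads in [("MHA", 64), ("GQA", 8), ("MQA", 1)]:
    sizes[name] = kv_cache_bytes(n_kv_heads=n_kv_heads, **config)
    print(f"{name}: {sizes[name] / 2**30:.2f} GiB per sequence")
```

Under these assumptions the cache shrinks in direct proportion to the number of K,V heads, which is the memory saving the table summarizes.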
3. Residual Connections & Layer Normalization
Each sublayer (self-attention and FFN) output is added to its input (residual connection), followed by normalization.
Why Residual Connections Are Essential
Without residual connections, backpropagation gradients exponentially decay (or explode) passing through layers. Residual connections provide a gradient shortcut path, allowing gradients to reach deep layers directly.
This enables training very deep networks (GPT-3 has 96 layers, LLaMA 2 70B has 80 layers).
Post-LN vs Pre-LN
| | Post-LN (Original) | Pre-LN (Modern) |
|---|---|---|
| Norm placement | After sublayer | Before sublayer |
| Norm method | LayerNorm | LayerNorm (GPT-3) → RMSNorm (LLaMA+) |
| Training stability | Requires warmup | Stable; converges easily even at scale |
| Adopted by | Original Transformer, BERT | GPT-3, LLaMA, Mistral |
RMSNorm (Root Mean Square Layer Normalization) is a simplified LayerNorm that skips mean subtraction:

$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot \gamma$$
It's computationally cheaper while matching LayerNorm's performance, making it standard in post-LLaMA models.
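To make the difference concrete, here is a minimal sketch of both normalizations on a single vector. The learned LayerNorm scale and bias are omitted for brevity, and the RMSNorm gain defaults to ones; the function names and `eps` value are illustrative assumptions.

```python
import math


def layer_norm(x, eps=1e-6):
    """LayerNorm without learned scale/bias: subtract mean, divide by std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm: divide by the root mean square only, then apply the gain."""
    if gain is None:
        gain = [1.0] * len(x)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


x = [3.0, 4.0]
print(rms_norm(x))    # direction preserved, mean square close to 1
print(layer_norm(x))  # mean removed first, so the result differs
```

When the input already has zero mean, the two functions return the same values, which is one way to see that RMSNorm only drops the re-centering step.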
4. Feed-Forward Network (FFN)
A fully-connected network applied independently to each token. While attention captures "relationships between tokens," FFN performs "feature transformation of each token." FFN parameters comprise roughly 2/3 of a Transformer's total parameters, acting as a massive "knowledge store."
Standard FFN
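The original Transformer (Vaswani et al., 2017) used ReLU; GPT-2 onward switched to GELU:

$$\text{FFN}(x) = \text{GELU}(x W_1 + b_1) W_2 + b_2$$

The intermediate dimension is typically $d_{ff} = 4 d_{model}$ (e.g., $d_{model} = 4096 \rightarrow d_{ff} = 16384$).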
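SwiGLU: The Modern Standard
Shazeer (2020) proposed SwiGLU, combining Gated Linear Units with Swish activation. Adopted by LLaMA, PaLM, Gemma, and others.

$$\text{SwiGLU}(x) = \left(\text{Swish}(x W_{gate}) \odot x W_{up}\right) W_{down}$$

Why SwiGLU is superior:
- Gating mechanism: $\text{Swish}(x W_{gate})$ acts as a gate, controlling selective information flow
- Swish activation: $\text{Swish}(x) = x \cdot \sigma(x)$ is smooth unlike ReLU and passes small negative values, improving gradient flow
- SwiGLU uses 3 matrices ($W_{gate}, W_{up}, W_{down}$), so $d_{ff} = \frac{8}{3} d_{model}$ is used to maintain the same parameter count (different from the standard $4 d_{model}$)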
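A small sketch of a gated FFN and the parameter-count argument follows. The matrix names `W_gate`, `W_up`, and `W_down` are illustrative labels for the three matrices (the article does not fix a naming), biases are omitted, and matrices are stored as lists of rows with one row per output unit.

```python
import math


def swish(z):
    """Swish / SiLU: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(W, x):
    """Multiply matrix W (one row per output unit) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [swish(g) for g in matvec(W_gate, x)]   # soft gate per hidden unit
    up = matvec(W_up, x)                           # candidate features
    hidden = [g * u for g, u in zip(gate, up)]     # element-wise gating
    return matvec(W_down, hidden)                  # project back to d_model


# Parameter count per layer, biases ignored.
d_model = 4096
standard_params = 2 * d_model * (4 * d_model)        # two matrices, d_ff = 4d
swiglu_params = 3 * d_model * (8 * d_model // 3)     # three matrices, d_ff = 8d/3
print(standard_params, swiglu_params)

y = swiglu_ffn([1.0, 2.0], W_gate=[[1.0, 0.0]], W_up=[[0.0, 1.0]], W_down=[[1.0], [2.0]])
print(y)
```

The two counts differ only by the rounding of $8d/3$ to an integer; implementations typically round the hidden size to a hardware-friendly multiple.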
5. Output Layer
Apply RMSNorm to the final transformer layer's output, then project to vocabulary size via a linear transformation (unembedding). Convert to a probability distribution via softmax to predict the next token.
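Many LLMs use weight tying: $W_{out} = W_{emb}^\top$, sharing the embedding matrix $W_{emb}$ with the output weight $W_{out}$. This reduces parameters by $|V| \times d_{model}$ (where $|V|$ is the vocabulary size) and aligns the input/output semantic spaces.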
Causal Mask in Decoders
In decoder-only models like GPT, a causal mask blocks attention to future tokens. The token at position $i$ can only attend to positions $j \le i$.
Implementation-wise, the upper triangular portion of the score matrix is set to $-\infty$, which becomes zero after softmax. This ensures consistency with autoregressive generation (left-to-right, one token at a time).
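The masking step can be sketched in plain Python as follows. The function name is illustrative, and the input is assumed to be a square matrix of already-scaled attention scores.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax over an n x n score matrix with future positions masked."""
    weights = []
    for i, row in enumerate(scores):
        # Positions j > i lie in the future: set their score to -infinity.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        peak = max(masked)                           # finite: j == i is always kept
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


weights = causal_softmax([[0.0, 0.0, 0.0] for _ in range(3)])
for row in weights:
    print(row)
```

With equal scores, the first token attends only to itself, the second splits its weight over two positions, and the third over all three, which is the lower-triangular pattern the mask is meant to enforce.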
Full Transformer Block Data Flow
Now let's trace how all these components work together by stepping through the data flow of a single Transformer block.
Interactive demo: Transformer Block Forward Pass. The demo traces data flow through a single Transformer block on "I love cats", with the following 4-dimensional vectors shown for each token:
| Token | dim 1 | dim 2 | dim 3 | dim 4 |
|---|---|---|---|---|
| I | 0.82 | -0.15 | 0.44 | 0.91 |
| love | 0.33 | 0.78 | -0.22 | 0.56 |
| cats | 0.61 | 0.42 | 0.73 | -0.08 |
Self-Attention: The Core of Transformers
Self-attention dynamically computes relevance scores between all tokens, producing context-aware representations.
Formula

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

- $Q$ (Query): What each token is "looking for"
- $K$ (Key): What each token "contains" (how it wants to be found)
- $V$ (Value): The actual information each token passes along
- $\sqrt{d_k}$: Scaling factor to prevent dot products from growing too large, which would saturate softmax
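Interactive demo: Self-Attention Mechanism. The demo steps through the self-attention computation on "The cat sat", with the following 4-dimensional vectors shown for each token:
| Token | dim 1 | dim 2 | dim 3 | dim 4 |
|---|---|---|---|---|
| The | 1.0 | 0.2 | 0.5 | 0.8 |
| cat | 0.3 | 0.9 | 0.1 | 0.6 |
| sat | 0.7 | 0.4 | 0.8 | 0.2 |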
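The computation can be reproduced on the three vectors above with a short sketch. Because the demo's projection matrices are not given in the text, this example assumes identity projections, so that Q, K, and V all equal the token vectors; that simplification, and the helper names, are assumptions for illustration only.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d_k = len(K[0])
    scores = [
        [sum(q * k for q, k in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in K]
        for q_i in Q
    ]
    weights = [softmax(row) for row in scores]
    outputs = [
        [sum(w * v[c] for w, v in zip(w_row, V)) for c in range(len(V[0]))]
        for w_row in weights
    ]
    return outputs, weights


X = [
    [1.0, 0.2, 0.5, 0.8],  # The
    [0.3, 0.9, 0.1, 0.6],  # cat
    [0.7, 0.4, 0.8, 0.2],  # sat
]
outputs, weights = attention(X, X, X)  # assumed: Q = K = V = X
for token, row in zip(["The", "cat", "sat"], weights):
    print(token, [round(w, 3) for w in row])
```

Each output row is a weighted average of the value vectors, so every token's new representation mixes in information from the others in proportion to the attention weights.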
Why Self-Attention Is Revolutionary
- Long-range dependencies: Unlike RNNs where distant token information decays, self-attention directly accesses all tokens
- Parallel computation: RNNs process sequentially; self-attention uses matrix operations that GPUs parallelize efficiently
- Interpretability: Visualizing attention weights reveals what the model "focuses on"
Computational Complexity
Self-attention has $O(n^2)$ complexity with respect to sequence length $n$, which becomes problematic for long contexts. Active research addresses this:
| Technique | Complexity | Summary |
|---|---|---|
| Standard attention | $O(n^2)$ | Computes all token pairs |
| Flash Attention | $O(n^2)$ (memory $O(n)$) | Optimizes IO efficiency, reduces HBM access |
| Multi-Query Attention | $O(n^2)$ | Shares K,V heads to speed up inference |
| Grouped-Query Attention | $O(n^2)$ | Between MQA and MHA; used in LLaMA 2+ |
| Ring Attention | $O(n^2 / P)$ per device ($P$ devices) | Distributes context across devices |
Text Generation: Next Token Prediction
LLM inference is simple: given a context (input token sequence), output a probability distribution over the next token, sample from it, append it, and repeat. This loop is autoregressive generation.
Generation Strategies
Greedy decoding: Always pick the highest-probability token. Deterministic but tends toward monotonous output.
Temperature scaling: Divide logits $z_i$ by temperature $T$ before softmax:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

- $T < 1$: Distribution sharpens → more deterministic (good for factual tasks)
- $T > 1$: Distribution flattens → more diverse/creative (good for creative writing)
- $T \to 0$: Converges to greedy decoding
Top-k sampling: Keep only the top $k$ tokens by probability, zero out the rest, and renormalize.
Top-p (Nucleus) sampling: Keep the highest-probability tokens until their cumulative probability exceeds $p$ (e.g., 0.95). The candidate count varies dynamically, making it more flexible than top-k.
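The three strategies above compose naturally, as this sketch shows. It operates on a plain list of logits rather than a real model's output; the function names and the example logits are illustrative assumptions, and production libraries usually apply these filters to logits in batched tensor form.

```python
import math
import random


def softmax_t(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def sample(probs, rng=random):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax_t(logits, temperature=0.7)
filtered = top_p_filter(top_k_filter(probs, k=3), p=0.95)
next_token = sample(filtered, random.Random(0))
print(filtered, next_token)
```

Greedy decoding corresponds to `top_k_filter(probs, k=1)`, which leaves all probability mass on the single most likely token.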
Interactive demo: Text Generation. The demo shows an LLM generating text one token at a time using temperature and top-k sampling.
Training Pipeline: From Raw Model to Conversational AI
Stage 1: Pre-training
Goal: Acquire general language knowledge
Train on next-token prediction using a massive text corpus (web text, books, code) with trillions of tokens.
At this stage, the model can generate text continuations but cannot "answer questions" or "follow instructions."
Stage 2: Supervised Fine-Tuning (SFT)
Goal: Teach the model to follow instructions
Fine-tune on a dataset of high-quality (instruction, response) pairs (~100K examples). This teaches the model the format of "responding to user queries."

    [Instruction] What is the capital of Japan?
    [Response] The capital of Japan is Tokyo.

Stage 3: RLHF (Reinforcement Learning from Human Feedback)
Goal: Align with human preferences
SFT alone can still produce factually incorrect or harmful outputs. RLHF improves this:
- Train a reward model: Human annotators rank multiple responses, and a reward model is trained to reproduce those rankings
- PPO (Proximal Policy Optimization): Update the policy (LLM) to maximize the reward model's score, with a KL penalty to prevent excessive divergence from the original model
DPO (Direct Preference Optimization) simplifies RLHF by optimizing the policy directly from preference data without explicitly training a reward model. LLaMA 2 itself used PPO-based RLHF, but DPO has since been widely adopted in open models such as Zephyr and LLaMA 3.
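For a single preference pair, the DPO objective reduces to a logistic loss on how much more the policy favors the chosen response than the reference model does. The sketch below assumes the four sequence log-probabilities have already been computed by the policy and the frozen reference model; the argument names and the `beta` value of 0.1 are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed log-probability of a full response under
    the policy (logp_*) or under the frozen reference model (ref_logp_*).
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))


untrained = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy identical to reference
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)    # policy now prefers the chosen reply
print(untrained, improved)
```

The reference log-probabilities play the role of the KL penalty in PPO-based RLHF: the loss rewards moving toward the chosen response only relative to where the original model already stood.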
Modern LLM Techniques
KV Cache
In autoregressive generation, recomputing all previous tokens' K and V at each step is wasteful. KV caching stores and reuses K, V from previous steps. This dramatically speeds up inference but increases memory consumption linearly with sequence length.
Quantization
Convert LLM parameters from 32/16-bit floating point to 4/8-bit integers to reduce memory and compute costs. At 4 bits per weight, a 70B model needs roughly 35 GB for its weights, so methods like GPTQ and AWQ can run it on a single high-memory GPU with minimal accuracy loss.
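The basic idea can be illustrated with the simplest possible scheme: symmetric round-to-nearest quantization with one scale per weight group. This is not GPTQ or AWQ, which choose the quantized values more carefully to limit accuracy loss; the function names and sample weights are assumptions for illustration.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: floats -> int8 codes plus one float scale."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0                      # all-zero input: any scale works
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
print(codes, scale)
print(max(abs(a - b) for a, b in zip(weights, restored)))  # at most scale / 2
```

Each weight now costs one byte instead of two or four, at the price of a rounding error bounded by half the scale.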
MoE (Mixture of Experts)
MoE, exemplified by Mixtral, splits FFN layers into multiple "experts," with a router selecting a few (e.g., 2 of 8) per token. Total parameters are large, but only a fraction of them is active for each token during inference, enabling efficient scaling.
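Top-2 routing can be sketched as follows. In a real model the router logits come from a learned linear layer applied to the token and each expert is a full FFN; here the logits are passed in directly and the experts are toy scaling functions, all of which are assumptions for illustration. Normalizing the gate weights with a softmax over the selected logits is one common choice, not the only one.

```python
import math


def top_k_route(router_logits, k=2):
    """Select the k highest-scoring experts; softmax over their logits only."""
    top = sorted(range(len(router_logits)), key=lambda e: router_logits[e], reverse=True)[:k]
    peak = max(router_logits[e] for e in top)
    exps = {e: math.exp(router_logits[e] - peak) for e in top}
    total = sum(exps.values())
    return {e: w / total for e, w in exps.items()}


def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    gates = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for expert_id, gate in gates.items():
        y = experts[expert_id](x)        # the other experts are never evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out


# Toy experts: expert s multiplies its input by s.
experts = [lambda x, s=s: [s * v for v in x] for s in range(8)]
router_logits = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 5.0, 0.0]
gates = top_k_route(router_logits)
y = moe_layer([1.0, 2.0], router_logits, experts)
print(gates, y)
```

Only two of the eight experts run for this token, which is why the active parameter count stays far below the total.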
Context Length Extension
Early Transformers handled ~512 tokens; models today process over 1 million tokens. Techniques include RoPE frequency scaling, YaRN, and Ring Attention.
Summary
We've explored LLM internals through six pillars:
- History: Word2Vec → Transformer → BERT/GPT → scaling laws enabling massive models
- Tokenization: BPE subword splitting converts arbitrary text to numerical sequences
- Transformer: N layers of self-attention + FFN + residual connections
- Self-attention: $\text{softmax}(QK^\top / \sqrt{d_k})\,V$ dynamically computes relevance across all tokens
- Text generation: Temperature, top-k, top-p sampling for stochastic autoregressive generation
- Training pipeline: Pre-training → SFT → RLHF/DPO transforms a "next-token predictor" into a "conversational AI"
The seemingly simple principle of "predict the next token," when scaled sufficiently, produces the remarkable capabilities that power modern AI.
References
- A. Vaswani et al. Attention Is All You Need. NeurIPS 2017.
- J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers. NAACL 2019.
- T. Brown et al. Language Models are Few-Shot Learners. NeurIPS 2020 (GPT-3).
- J. Kaplan et al. Scaling Laws for Neural Language Models. 2020.
- J. Hoffmann et al. Training Compute-Optimal Large Language Models. NeurIPS 2022 (Chinchilla).
- L. Ouyang et al. Training language models to follow instructions with human feedback. NeurIPS 2022 (InstructGPT / RLHF).
- R. Rafailov et al. Direct Preference Optimization. NeurIPS 2023.