โ†All posts

The Architecture and Evolution of Large Language Models: From Transformer to ChatGPT, an Interactive Deep Dive

An interactive, in-depth exploration of LLM evolution from Word2Vec to GPT-4, covering Transformer architecture, self-attention, text generation, and RLHF.

Tags: LLM, Transformer, NLP, Deep Learning, Interactive

Introduction

Since the advent of ChatGPT, Large Language Models (LLMs) have transformed society. Writing code, summarizing papers, answering complex questions: all of this emerges from a remarkably simple principle, predicting the next token.

This article explains the complete picture of LLMs through six pillars:

  1. History and genealogy: Evolution from Word2Vec to GPT-4
  2. Tokenization: How text is converted to numbers
  3. Transformer architecture: The engine of LLMs
  4. Self-attention: The core algorithm of Transformers
  5. Text generation: Temperature, top-k, and sampling strategies
  6. Training pipeline: From pre-training to RLHF

The LLM Genealogy: A Decade of Evolution

[Timeline: 2013 Word2Vec (word embeddings) → 2017 Transformer ("Attention Is All You Need") → 2018 BERT / GPT-1 (pre-training era begins) → 2020 GPT-3 (175B params, few-shot) → 2022 ChatGPT (RLHF + instruction tuning) → 2024– GPT-4o / Gemini / Claude (multimodal, reasoning)]

Let's trace the evolution of language models.

The Word2Vec Era (2013)

Mikolov et al. (2013) proposed Word2Vec, which converts words into fixed-dimensional vectors (embeddings). The famous example king - man + woman ≈ queen demonstrated semantic relationships. However, Word2Vec produced static embeddings that didn't consider context.

The Transformer Revolution (2017)

Vaswani et al. (2017) changed everything with "Attention Is All You Need." By replacing the sequential processing of RNNs with self-attention, they enabled parallel computation across a sequence. This innovation became the foundation for virtually all modern LLMs.

The Pre-training Era (2018–)

Three architecture families evolved from the Transformer base:

[Figure: Transformer family tree (Vaswani et al., 2017). Encoder-only: BERT (2018, masked LM), RoBERTa (2019, robust BERT), DeBERTa (2020, disentangled attention); used for classification, NER, embeddings. Encoder-decoder: T5 (2019, text-to-text), BART (2019, denoising autoencoder), Flan-T5 (2022, instruction tuned); used for translation, summarization. Decoder-only: GPT-2/3 (2019–20, autoregressive LM), LLaMA (2023, open-weight), GPT-4 / Claude (2023–, frontier models); used for chat, code generation, reasoning.]

Encoder-only (BERT family)

BERT (2018) is pre-trained with Masked Language Modeling (MLM), predicting masked tokens. By capturing bidirectional context, it excels at understanding tasks like classification, NER, and similarity computation.

Encoder-Decoder (T5 family)

T5 (2019) unifies all tasks as "text-to-text." Translation, summarization, and QA all share the same format.

Decoder-only (GPT family)

The GPT series uses autoregressive language models, generating one token at a time from left to right. As parameter counts increased, capabilities improved dramatically: GPT-3 (175B parameters) demonstrated few-shot learning. Today's ChatGPT, Claude, and Gemini all descend from this lineage.

Scaling Laws

[Figure: log-log plot of loss against parameters (N), data (D), and compute (C), each following $L(x) \propto x^{-\alpha}$. Sources: Kaplan et al. 2020, Hoffmann et al. 2022]

Kaplan et al. (2020) discovered that language model loss follows a power law with respect to parameters $N$, data $D$, and compute $C$:

$$L(x) \propto x^{-\alpha}$$

Furthermore, Hoffmann et al. (2022), the Chinchilla paper, showed that for compute-optimal training, parameters and training tokens should be scaled in equal proportion. For example, a 70B-parameter model needs approximately 1.4 trillion training tokens.
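As a rough arithmetic check, the 70B / 1.4T example above works out to about 20 training tokens per parameter. The sketch below applies that ratio to other model sizes. The constant and the function name `chinchilla_tokens` are illustrative assumptions derived from this single example, not a formula taken from the paper.

```python
# Ratio implied by the example above: 1.4e12 tokens / 70e9 parameters = 20.
TOKENS_PER_PARAM = 1.4e12 / 70e9  # assumption: treat this ratio as constant


def chinchilla_tokens(n_params):
    """Approximate compute-optimal training tokens for n_params parameters."""
    return TOKENS_PER_PARAM * n_params


for n in (7e9, 70e9):
    tokens = chinchilla_tokens(n)
    print(f"{n / 1e9:.0f}B params -> {tokens / 1e12:.2f}T tokens")
```

Treat this as a rule of thumb only; the real optimum depends on the compute budget and the data.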

Tokenization: Converting Text to Numbers

LLMs don't process raw text directly; they operate on tokens, integer IDs that each stand for a piece of text.

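[Figure: the input text "The cat sat on the mat" passes through a BPE tokenizer and becomes subword tokens with IDs: "The" (464), "Ġcat" (3797), "Ġsat" (3520), "Ġon" (319), "Ġthe" (262), "Ġmat" (2603). The "Ġ" prefix marks a leading space.]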

BPE (Byte Pair Encoding)

The most widely used tokenization method in modern LLMs is BPE (Sennrich et al., 2016).

Algorithm:

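  1. Split text into character (or byte) units
  2. Find the most frequent adjacent pair in the corpus
  3. Merge that pair into a new token
  4. Repeat until the vocabulary reaches the target size

# Simplified BPE (runnable sketch; real tokenizers add byte handling and pre-tokenization)
from collections import Counter
def bpe_merges(words, num_merges):
    corpus = [list(w) for w in words]            # each word as a list of symbols
    vocab = {ch for w in corpus for ch in w}     # start from single characters
    for _ in range(num_merges):
        pairs = Counter(p for w in corpus for p in zip(w, w[1:]))
        if not pairs:
            break                                # nothing left to merge
        a, b = pairs.most_common(1)[0][0]        # most frequent adjacent pair
        vocab.add(a + b)                         # merge into a new token
        for w in corpus:                         # apply the merge in place
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                i += 1
    return vocab, corpus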

GPT-4's tokenizer has a vocabulary of approximately 100,000 tokens. For languages such as Japanese, whose characters take several bytes in UTF-8 and are less common in the training corpus, a single character often maps to more than one token.

Why Subwords?

  • No OOV problem: Unknown words can be represented as combinations of subwords
  • Vocabulary control: A middle ground between character-level (small vocab, long sequences) and word-level (large vocab, OOV issues)
  • Multilingual support: The same BPE tokenizer can handle multiple languages

Transformer Architecture

Let's examine the structure of the Transformer (decoder variant), the engine of LLMs.

[Figure: decoder stack. Token IDs → Input Embedding + Positional Encoding → (Multi-Head Self-Attention → Add & Layer Norm → Feed-Forward Network → Add & Layer Norm) × N → Linear + Softmax → Probabilities (next token prediction)]

A decoder-only Transformer consists of five core components. Let's explore each in detail.

1. Input Embedding + Positional Encoding

Converts token IDs into high-dimensional vectors (e.g., 4096 dimensions). Unlike RNNs, Transformers process inputs in parallel, so positional information must be explicitly injected via positional encoding.

Sinusoidal Positional Encoding (Original)

The original Transformer (Vaswani et al., 2017) used sinusoidal positional encoding:

$$\text{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \text{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

The heatmap below shows sinusoidal PE patterns. Lower dimensions (left) form high-frequency waves, while higher dimensions (right) form low-frequency waves.

[Figure: heatmap of sinusoidal positional encoding, position (0–7) against dimension (0–15). Blue = positive, red = negative; sin for even dimensions, cos for odd dimensions; frequency runs from high (low dimensions) to low (high dimensions).]

Why sinusoids? For any fixed offset $k$, $\text{PE}(pos + k)$ can be expressed as a linear function of $\text{PE}(pos)$, making it easy for the model to learn relative positions.
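To make the linearity claim concrete, here is a minimal pure-Python sketch. `sinusoidal_pe` follows the formula above, and `shift_pe` (a hypothetical helper name) rebuilds the encoding of position `pos + k` from the encoding of `pos` using only a fixed rotation per dimension pair. It assumes an even `d_model`.

```python
import math


def sinusoidal_pe(pos, d_model):
    """Sinusoidal encoding for one position (d_model assumed even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])  # dimensions 2i and 2i+1
    return pe


def shift_pe(pe, k):
    """Map PE(pos) to PE(pos + k) with a fixed per-pair rotation (linear in pe)."""
    d_model = len(pe)
    out = []
    for i in range(d_model // 2):
        w = k / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        # Angle-addition identities for sin(a + w) and cos(a + w).
        out.extend([s * math.cos(w) + c * math.sin(w),
                    c * math.cos(w) - s * math.sin(w)])
    return out


direct = sinusoidal_pe(5, 8)
shifted = shift_pe(sinusoidal_pe(3, 8), 2)
print(max(abs(a - b) for a, b in zip(direct, shifted)))  # floating-point noise only
```

The rotation in `shift_pe` depends on the offset `k` but not on `pos`, which is what lets a model pick up relative offsets with a linear map.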

RoPE (Rotary Position Embedding)

Modern LLMs (LLaMA, Gemma, Mistral, etc.) use RoPE. RoPE pairs vector dimensions and applies 2D rotations at position-dependent angles.

[Figure: RoPE rotates each pair of dimensions (d₀, d₁) by a position-dependent angle: pos = 0 gives θ = 0°, pos = 1 gives θ = 30°, pos = 2 gives θ = 60°. The product $q_m \cdot k_n$ depends only on the relative position $(m - n)$.]

RoPE's key property: the dot product $q_m \cdot k_n$ depends only on the relative position $(m - n)$. This naturally captures relative positional relationships without explicitly encoding absolute positions.

$$\text{RoPE}(x, pos) = \begin{pmatrix} x_0 \cos(\theta \cdot pos) - x_1 \sin(\theta \cdot pos) \\ x_0 \sin(\theta \cdot pos) + x_1 \cos(\theta \cdot pos) \\ \vdots \end{pmatrix}$$

RoPE also lends itself to context length extension: methods such as NTK-aware scaling and YaRN rescale its rotation frequencies so that a model can handle sequences longer than those seen in training.
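The relative-position property is easy to check numerically. The sketch below rotates consecutive pairs of a toy query and key and compares the scores for two position pairs with the same offset. The per-pair frequency `base ** (-2 * i / d)` is a common choice and is an assumption here, as are the toy vectors.

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle pos * theta_i."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)  # assumed frequency schedule
        c, s = math.cos(theta * pos), math.sin(theta * pos)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [1.0, 0.5, -0.3, 0.8]   # toy query
k = [0.2, -0.7, 0.9, 0.1]   # toy key
# Same offset m - n = 2 at two different absolute positions.
s1 = dot(rope(q, 5), rope(k, 3))
s2 = dot(rope(q, 12), rope(k, 10))
print(abs(s1 - s2) < 1e-9)
```

Because each pair is only rotated, vector norms are unchanged as well; only the angle between query and key carries position.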

2. Multi-Head Self-Attention

The heart of the Transformer. We'll cover the mechanics in detail in the next section, but here let's understand its role in the overall architecture.

A single attention head can only capture one type of relationship pattern. Multiple heads learn different patterns in parallel: subject-verb agreement, adjective-noun modification, coreference, and so on.

[Figure: input X (d_model = 512) is split across four heads with d_k = 128 each, every head computing Q·Kᵀ/√d_k → V (illustrative roles: subject–verb, adjective–noun, coreference, syntax). The head outputs are concatenated (d_k × h = 512) and projected by W_O to the multi-head output (d_model = 512).]

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \cdot W_O$$

where $\text{head}_i = \text{Attention}(Q W_Q^{(i)}, K W_K^{(i)}, V W_V^{(i)})$. GPT-3 uses 96 heads; LLaMA 2 70B uses 64 heads.

MQA / GQA: Optimizing Inference Efficiency

Standard multi-head attention (MHA) gives each head independent K, V projections, which makes the KV cache very large during inference. Two variants shrink it by sharing K and V across heads:

| Method | K,V heads | Model examples | Characteristics |
|---|---|---|---|
| MHA | = Q heads | GPT-3 | Best quality, high memory |
| MQA | 1 | PaLM, StarCoder | All K,V shared, fastest inference |
| GQA | Q / g | LLaMA 2, Mistral | Balance of quality and efficiency |

GQA (Grouped-Query Attention) groups Q heads and shares K,V within each group. LLaMA 2 70B adopted GQA, achieving near-MHA quality with significantly faster inference.
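To get a feel for the savings, here is a back-of-the-envelope sketch of KV cache size per sequence. The shape is an assumption loosely modelled on LLaMA 2 70B (80 layers, 64 query heads, head dimension 128, 8 K,V heads for GQA), with 16-bit values and a 4096-token context; `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache K and V for one sequence (the factor 2 is K plus V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed model shape; only the number of K,V heads changes between variants.
shape = dict(n_layers=80, head_dim=128, seq_len=4096)
for name, kv_heads in [("MHA", 64), ("GQA", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(n_kv_heads=kv_heads, **shape) / 2**30
    print(f"{name}: {gib:.2f} GiB per sequence")
```

Under these assumptions, going from 64 to 8 K,V heads cuts the cache by a factor of 8 while keeping all 64 query heads.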

3. Residual Connections & Layer Normalization

Each sublayer (self-attention and FFN) output is added to its input (residual connection), followed by normalization.

Why Residual Connections Are Essential

Without residual connections, backpropagation gradients exponentially decay (or explode) as they pass through $N$ layers. Residual connections provide a gradient shortcut path, allowing gradients to reach deep layers directly.

$$x_{l+1} = x_l + F(x_l)$$

This enables training very deep networks (GPT-3 has 96 layers, LLaMA 2 70B has 80 layers).

Post-LN vs Pre-LN

[Figure: Post-LN (original): x → Self-Attention → Add (residual) → LayerNorm → FFN → Add (residual) → LayerNorm. Pre-LN (modern): x → RMSNorm → Self-Attention → Add (residual) → RMSNorm → FFN → Add (residual).]

| | Post-LN (Original) | Pre-LN (Modern) |
|---|---|---|
| Norm placement | After sublayer | Before sublayer |
| Norm method | LayerNorm | LayerNorm (GPT-3) → RMSNorm (LLaMA+) |
| Training stability | Requires warmup | Stable; converges easily even at scale |
| Adopted by | Original Transformer, BERT | GPT-3, LLaMA, Mistral |

RMSNorm (Root Mean Square Layer Normalization) is a simplified LayerNorm that skips mean subtraction:

$$\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma, \quad \text{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2}$$

It's computationally cheaper while matching LayerNorm's performance, making it standard in post-LLaMA models.
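The formula translates almost line for line into code. This is a minimal sketch for a single vector; the small `eps` term is an assumption added for numerical safety and does not appear in the formula above.

```python
import math


def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: divide by the root mean square, then scale by gamma."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)  # no mean subtraction
    return [g * v / rms for v, g in zip(x, gamma)]


x = [3.0, 4.0]
y = rms_norm(x, gamma=[1.0, 1.0])
print(y)
```

With `gamma` set to ones, the output has a root mean square of approximately 1, whatever the scale of the input.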

4. Feed-Forward Network (FFN)

A fully-connected network applied independently to each token. While attention captures "relationships between tokens," FFN performs "feature transformation of each token." FFN parameters comprise roughly 2/3 of a Transformer's total parameters, acting as a massive "knowledge store."

Standard FFN

The original Transformer (Vaswani et al., 2017) used ReLU; GPT-2 onward switched to GELU:

$$\text{FFN}(x) = \text{GELU}(x W_1 + b_1) W_2 + b_2$$

The intermediate dimension $d_{\text{ff}}$ is typically $4 \times d_{\text{model}}$ (e.g., $d = 4096$ → $d_{\text{ff}} = 16384$).

SwiGLU: The Modern Standard

Shazeer (2020) proposed SwiGLU, combining Gated Linear Units with Swish activation. Adopted by LLaMA, PaLM, Gemma, and others.

[Figure: SwiGLU feed-forward network. The input x (d_model) feeds two branches: x · W₁ (d_ff), passed through Swish(x) = x · σ(x), and x · W₃ (d_ff), the gate. The branches are multiplied element-wise (⊙) and projected by W₂ back to d_model.]

$$\text{SwiGLU}(x) = \bigl(\text{Swish}(x W_1) \odot (x W_3)\bigr) W_2$$

Why SwiGLU is superior:

  • Gating mechanism: $x W_3$ acts as a gate, controlling selective information flow
  • Swish activation: $\text{Swish}(x) = x \cdot \sigma(x)$ is smooth, unlike ReLU, and passes small negative values, improving gradient flow
  • Parameter budget: SwiGLU uses 3 matrices ($W_1, W_2, W_3$), so $d_{\text{ff}} = \frac{8}{3} d_{\text{model}}$ keeps the parameter count equal to that of a standard FFN with $d_{\text{ff}} = 4 \times d_{\text{model}}$
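The points above can be sketched in a few lines of pure Python. The matrix helper `matvec` and the toy weights are assumptions for illustration; biases are omitted, as in the SwiGLU formula. The last lines check the parameter budget for $d_{\text{model}} = 4096$.

```python
import math


def swish(z):
    return z / (1.0 + math.exp(-z))  # z * sigmoid(z)


def matvec(x, w):
    """Row vector x (length n) times matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def swiglu_ffn(x, w1, w3, w2):
    activated = [swish(v) for v in matvec(x, w1)]        # Swish(x W1)
    gate = matvec(x, w3)                                 # x W3
    hidden = [a * g for a, g in zip(activated, gate)]    # element-wise product
    return matvec(hidden, w2)                            # back to d_model


identity = [[1.0, 0.0], [0.0, 1.0]]                      # toy weights
print(swiglu_ffn([1.0, 2.0], identity, identity, identity))

# Parameter budget, biases ignored: 2 matrices at 4d versus 3 matrices at (8/3)d.
d = 4096
standard_params = 2 * d * (4 * d)
swiglu_params = 3 * d * (8 * d // 3)
print(standard_params, swiglu_params)
```

The two counts differ only because $\frac{8}{3} \times 4096$ is rounded down to an integer; real models often round $d_{\text{ff}}$ to a hardware-friendly multiple instead.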

5. Output Layer

Apply RMSNorm to the final transformer layer's output, then project to vocabulary size $|V|$ via a linear transformation (unembedding). Convert to a probability distribution via softmax to predict the next token.

Many LLMs use weight tying: $W_U = W_E^T$, sharing the embedding matrix $W_E$ with the output weight $W_U$. This reduces parameters by $|V| \times d$ and aligns the input/output semantic spaces.

Causal Mask in Decoders

In decoder-only models like GPT, a causal mask blocks attention to future tokens. Token at position $i$ can only attend to positions $1, 2, \ldots, i$.

[Figure: causal mask (attention matrix) for "The cat sat on the"; ✓ = attend, −∞ = masked]

| Query \ Key | The | cat | sat | on | the |
|---|---|---|---|---|---|
| The | ✓ | −∞ | −∞ | −∞ | −∞ |
| cat | ✓ | ✓ | −∞ | −∞ | −∞ |
| sat | ✓ | ✓ | ✓ | −∞ | −∞ |
| on | ✓ | ✓ | ✓ | ✓ | −∞ |
| the | ✓ | ✓ | ✓ | ✓ | ✓ |

Token "sat" (row 3) can see "The", "cat", "sat" but not "on", "the".

Implementation-wise, the entries of the score matrix above the diagonal are set to $-\infty$, which become zero after softmax. This ensures consistency with autoregressive generation (left-to-right, one token at a time).
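A minimal sketch of that masking step, using plain lists and 0-based positions (the text counts from 1). `causal_softmax` is a hypothetical helper name, and the all-zero scores are a toy input chosen so the resulting weights are easy to read.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax over an n x n score matrix with future positions masked."""
    n = len(scores)
    out = []
    for i in range(n):
        # Query i may attend to keys 0..i; later keys are set to -inf.
        row = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(row)
        exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


weights = causal_softmax([[0.0] * 3 for _ in range(3)])
for row in weights:
    print([round(w, 3) for w in row])
```

With equal scores, each row spreads its weight evenly over the positions it is allowed to see and assigns exactly zero to the rest.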

Full Transformer Block Data Flow

Now let's trace how all these components work together by stepping through the data flow of a single Transformer block.

Transformer Block Forward Pass

Trace data flow through a single Transformer block on "I love cats".

Steps: Embed + PE → RMSNorm → Q, K, V → attention weights → attention output → residual add → RMSNorm → FFN → residual add → output

Step 0: Input Embedding + Positional Encoding

Tokens: I, love, cats

X = Embed + PE
I: 0.82, −0.15, 0.44, 0.91
love: 0.33, 0.78, −0.22, 0.56
cats: 0.61, 0.42, 0.73, −0.08

Self-Attention: The Core of Transformers

Self-attention dynamically computes relevance scores between all tokens, producing context-aware representations.

Formula

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

  • $Q$ (Query): What each token is "looking for"
  • $K$ (Key): What each token "contains" (how it wants to be found)
  • $V$ (Value): The actual information each token passes along
  • $\sqrt{d_k}$: Scaling factor that keeps dot products from growing too large and saturating the softmax
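Here is the formula as a small pure-Python sketch, without masking or multiple heads. The toy input `X` has three tokens and four dimensions; using it directly as Q, K, and V (that is, identity projections) is an assumption made to keep the example short.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)  # one weight per key, summing to 1
        out.append([sum(w[j] * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out


# Toy input: three tokens, four dimensions; Q = K = V = X (assumed).
X = [[1.0, 0.2, 0.5, 0.8],
     [0.3, 0.9, 0.1, 0.6],
     [0.7, 0.4, 0.8, 0.2]]
for row in attention(X, X, X):
    print([round(v, 3) for v in row])
```

Each output row is a weighted average of the value rows, so every entry stays within the range spanned by the corresponding column of `V`.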

Interactive Demo

Self-Attention Mechanism

Step through the self-attention computation on "The cat sat".

Tokens: The, cat, sat

X (Input)
The: 1.0, 0.2, 0.5, 0.8
cat: 0.3, 0.9, 0.1, 0.6
sat: 0.7, 0.4, 0.8, 0.2

Why Self-Attention Is Revolutionary

  1. Long-range dependencies: Unlike RNNs where distant token information decays, self-attention directly accesses all tokens
  2. Parallel computation: RNNs process sequentially; self-attention uses matrix operations that GPUs parallelize efficiently
  3. Interpretability: Visualizing attention weights reveals what the model "focuses on"

Computational Complexity

Self-attention has $O(n^2)$ complexity with respect to sequence length $n$, which becomes problematic for long contexts. Active research addresses this:

| Technique | Complexity | Summary |
|---|---|---|
| Standard attention | $O(n^2)$ | Computes all token pairs |
| Flash Attention | $O(n^2)$ (memory $O(n)$) | Optimizes IO efficiency, reduces HBM access |
| Multi-Query Attention | $O(n^2)$ | Shares K,V heads to speed up inference |
| Grouped-Query Attention | $O(n^2)$ | Between MQA and MHA; used in LLaMA 2+ |
| Ring Attention | $O(n^2 / p)$ | Distributes context across $p$ devices |

Text Generation: Next Token Prediction

LLM inference is simple: given a context (input token sequence), output a probability distribution over the next token, sample from it, append it, and repeat. This is autoregressive generation.

Generation Strategies

Greedy decoding: Always pick the highest-probability token. Deterministic but tends toward monotonous output.

Temperature scaling: Divide logits by temperature $T$ before softmax:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

  • $T < 1$: Distribution sharpens → more deterministic (good for factual tasks)
  • $T > 1$: Distribution flattens → more diverse/creative (good for creative writing)
  • $T \to 0$: Converges to greedy decoding

Top-k sampling: Keep only the top $k$ tokens by probability, zero out the rest, and renormalize.

Top-p (Nucleus) sampling: Keep tokens until cumulative probability exceeds $p$ (e.g., 0.95). The candidate count varies dynamically, making it more flexible than top-k.
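The three strategies compose naturally: scale the logits, filter the distribution, then sample. The sketch below is one straightforward way to write them in pure Python; the candidate tokens and logits are hypothetical, and `top_p_filter` stops as soon as the cumulative mass reaches `p`.

```python
import math
import random


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable entries, zero the rest, renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


tokens = ["sat", "is", "ran", "the", "very"]   # hypothetical candidates
logits = [3.2, 2.8, 1.5, 0.9, 0.4]             # hypothetical logits
probs = top_k_filter(softmax(logits, temperature=0.7), k=3)
print(random.choices(tokens, weights=probs, k=1)[0])
```

Running the last two lines repeatedly gives different tokens, but never "the" or "very", since top-k with `k=3` zeroes them out.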

Interactive Demo

Text Generation Demo

Watch how LLMs generate text one token at a time using temperature and top-k sampling.

Context: "The cat"

Candidate next tokens (logits): sat 3.2, is 2.8, ran 1.5, the 0.9, very 0.4

Training Pipeline: From Raw Model to Conversational AI

[Figure: training pipeline. Pre-training (next-token prediction on a massive corpus, trillions of tokens) → SFT (supervised fine-tuning on instructions, ~100K examples) → RLHF / DPO (align with human preferences; reward model + PPO, or DPO) → Deployment (API / chat interface; quantization, serving)]

Stage 1: Pre-training

Goal: Acquire general language knowledge

Train on next-token prediction using a massive text corpus (web text, books, code) with trillions of tokens.

$$\mathcal{L}_{\text{pretrain}} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t}; \theta)$$
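In code, this loss is just a sum of negative log-probabilities, one per position. The sketch assumes we already have the probability the model assigned to each correct next token; the three values are hypothetical.

```python
import math


def next_token_nll(token_probs):
    """Sum of -log P(x_t | x_<t) over the positions of one sequence."""
    return -sum(math.log(p) for p in token_probs)


# Hypothetical probabilities the model gave to the three correct next tokens.
loss = next_token_nll([0.5, 0.25, 0.125])
print(loss)
```

The loss is zero only if the model puts probability 1 on every correct token, and it grows as those probabilities shrink.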

At this stage, the model can continue text fluently but does not reliably answer questions or follow instructions.

Stage 2: Supervised Fine-Tuning (SFT)

Goal: Teach the model to follow instructions

Fine-tune on a dataset of high-quality (instruction, response) pairs (~100K examples). This teaches the model the format of "responding to user queries."

[Instruction] What is the capital of Japan?
[Response] The capital of Japan is Tokyo.

Stage 3: RLHF (Reinforcement Learning from Human Feedback)

Goal: Align with human preferences

SFT alone can still produce factually incorrect or harmful outputs. RLHF improves this:

  1. Train a reward model: Human annotators rank multiple responses, and a reward model is trained on those rankings
  2. PPO (Proximal Policy Optimization): Update the policy (LLM) to maximize the reward model's score, with a KL penalty to prevent excessive divergence from the original model

$$\mathcal{J}_{\text{RLHF}} = \mathbb{E}\bigl[R(x, y) - \beta \, \text{KL}\bigl(\pi_\theta \| \pi_{\text{ref}}\bigr)\bigr]$$

DPO (Direct Preference Optimization) simplifies RLHF by optimizing the policy directly from preference data without explicitly training a reward model. LLaMA 2 itself used PPO-based RLHF, but DPO has since been widely adopted in open models such as Zephyr and LLaMA 3.
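For intuition, here is a sketch of the DPO loss for a single preference pair. It takes the summed log-probabilities of the chosen and rejected responses under the policy and under the frozen reference model; the argument names, the example values, and `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy favours the chosen response than the reference does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Equivalent to -log(sigmoid(beta * margin)).
    return math.log(1.0 + math.exp(-beta * margin))


# Hypothetical log-probabilities: the policy has moved toward the chosen response.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

When the policy equals the reference, the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the preferred response.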

Modern LLM Techniques

KV Cache

In autoregressive generation, recomputing all previous tokens' K and V at each step is wasteful. KV caching stores and reuses K, V from previous steps. This dramatically speeds up inference but increases memory consumption linearly with sequence length.
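A toy single-head version makes the idea concrete. The hypothetical `KVCache` class below appends each new token's key and value once and attends over everything stored so far, so earlier keys and values are never recomputed. Using the same toy vector as query, key, and value is an assumption for brevity.

```python
import math


class KVCache:
    """Append-only store of per-token key and value vectors for one head."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        """Cache the new token's k and v, then attend from q over the cache."""
        self.keys.append(k)
        self.values.append(v)
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, key)) / scale for key in self.keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        return [sum(e / total * val[c] for e, val in zip(exps, self.values))
                for c in range(len(v))]


cache = KVCache()
for vec in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):  # toy vectors, q = k = v
    out = cache.step(vec, vec, vec)
print(len(cache.keys), out)
```

Each step does work proportional to the current cache length, while the cache itself grows by one key and one value per generated token.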

Quantization

Convert LLM parameters from 32/16-bit floating point to 4/8-bit integers to reduce memory and compute costs. At 4 bits per weight, a 70B model needs roughly 35 GB for its weights, so methods like GPTQ and AWQ can fit it on a single high-memory GPU with minimal accuracy loss.
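The basic idea can be shown with the simplest scheme, symmetric absmax rounding to 8-bit integers. This is deliberately not GPTQ or AWQ, which choose the quantized values far more carefully; the weights are toy values and the sketch assumes they are not all zero.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    return [v * scale for v in q]


w = [0.12, -0.5, 0.33, 0.01]          # toy weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q, max(abs(a - b) for a, b in zip(w, w_hat)))
```

The round-trip error per weight is at most half a quantization step (`scale / 2`), which is why a single large outlier, by inflating `scale`, hurts the precision of every other weight.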

MoE (Mixture of Experts)

Exemplified by Mixtral. Split FFN layers into multiple "experts," with a router selecting a few (e.g., 2 of 8) per token. Total parameters are large, but active parameters during inference are small, enabling efficient scaling.
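A minimal sketch of top-2 routing for one token follows. The experts are toy stand-ins for FFNs (expert `i` just scales its input by `i`), the router scores are hypothetical, and normalizing the selected scores with a softmax is one common choice, assumed here.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    out = [0.0] * len(x)
    for i, gate in top_k_route(router_logits, k).items():
        y = experts[i](x)
        out = [o + gate * v for o, v in zip(out, y)]
    return out


# Eight toy experts; expert i multiplies its input by i.
experts = [lambda x, s=i: [s * v for v in x] for i in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]  # hypothetical scores
print(top_k_route(router_logits), moe_layer([1.0, 1.0], experts, router_logits))
```

Only two of the eight experts are evaluated for this token, which is where the gap between total and active parameters comes from.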

Context Length Extension

Early Transformers handled ~512 tokens; some recent models process over 1 million tokens. Techniques include RoPE frequency scaling, YaRN, and Ring Attention.

Summary

We've explored LLM internals through six pillars:

  1. History: Word2Vec → Transformer → BERT/GPT → scaling laws enabling massive models
  2. Tokenization: BPE subword splitting converts arbitrary text to numerical sequences
  3. Transformer: N layers of self-attention + FFN + residual connections
  4. Self-attention: $\text{softmax}(QK^T/\sqrt{d_k})V$ dynamically computes relevance across all tokens
  5. Text generation: Temperature, top-k, top-p sampling for stochastic autoregressive generation
  6. Training pipeline: Pre-training → SFT → RLHF/DPO transforms a "next-token predictor" into a "conversational AI"

The seemingly simple principle of "predict the next token," when scaled sufficiently, produces the remarkable capabilities that power modern AI.

References