The Architecture and Evolution of Large Language Models — From Transformer to ChatGPT, an Interactive Deep Dive

Introduction

Since the advent of ChatGPT, Large Language Models (LLMs) have transformed society. Writing code, summarizing papers, answering complex questions — all of this emerges from a remarkably simple principle: predicting the next token.

This article explains the complete picture of LLMs through six pillars:

History and genealogy — Evolution from Word2Vec to GPT-4
Tokenization — How text is converted to numbers
Transformer architecture — The engine of LLMs
Self-attention — The core algorithm of Transformers
Text generation — Temperature, top-k, and sampling strategies
Training pipeline — From pre-training to RLHF

The LLM Genealogy — A Decade of Evolution

Let's trace the evolution of language models.

The Word2Vec Era (2013)

Mikolov et al. (2013) proposed Word2Vec, which converts words into fixed-dimensional vectors (embeddings). The famous example king - man + woman ≈ queen demonstrated semantic relationships. However, Word2Vec produced static embeddings that didn't consider context.

The Transformer Revolution (2017)

Vaswani et al. (2017) introduced the Transformer in "Attention Is All You Need." By replacing the sequential processing of RNNs with self-attention, they enabled parallel computation across a sequence. This innovation became the foundation for modern LLMs.

The Pre-training Era (2018–)

Three architecture families evolved from the Transformer base:

Encoder-only (BERT family)

BERT (2018) is pre-trained with Masked Language Modeling (MLM), predicting masked tokens. By capturing bidirectional context, it excels at understanding tasks like classification, NER, and similarity computation.

Encoder-Decoder (T5 family)

T5 (2019) unifies all tasks as "text-to-text." Translation, summarization, and QA all share the same format.

Decoder-only (GPT family)

The GPT series uses autoregressive language models, generating one token at a time from left to right. As parameter counts increased, capabilities improved dramatically — GPT-3 (175B parameters) demonstrated few-shot learning. Today's ChatGPT, Claude, and Gemini all descend from this lineage.

Scaling Laws

Kaplan et al. (2020) found that language model loss $L$ follows a power law in each of parameters $N$, data $D$, and compute $C$ when the other two are not the limiting factor. Writing $x$ for any one of them:

L(x) \propto x^{-\alpha}

Furthermore, Hoffmann et al. (2022) (the Chinchilla paper) showed that, for a fixed compute budget, parameters and training tokens should be scaled in equal proportion. For example, a 70B-parameter model calls for approximately 1.4 trillion training tokens.
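As a rough illustration, the 70B / 1.4T example above implies about 20 training tokens per parameter. The sketch below treats that ratio as a rule of thumb; the constant `TOKENS_PER_PARAM` and the function name are assumptions made for this example, not values quoted from the paper.

```python
# Ratio implied by the 70B -> 1.4T example (a rule of thumb, assumed here;
# not an exact constant from the Chinchilla paper).
TOKENS_PER_PARAM = 20

def compute_optimal_tokens(num_params: float) -> float:
    """Approximate compute-optimal number of training tokens for a model size."""
    return TOKENS_PER_PARAM * num_params

for params in (7e9, 70e9):
    tokens = compute_optimal_tokens(params)
    print(f"{params / 1e9:.0f}B params -> {tokens / 1e12:.2f}T tokens")
```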

Tokenization — Converting Text to Numbers

LLMs don't understand text directly — they process tokens, numerical sequences representing text.

BPE (Byte Pair Encoding)

The most widely used tokenization method in modern LLMs is BPE (Sennrich et al., 2016).

Algorithm:

Split text into character (or byte) units
Find the most frequent adjacent pair in the corpus
Merge that pair into a new token
Repeat until vocabulary reaches target size

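# Simplified BPE training (runnable sketch; `words` is a list of strings)
from collections import Counter
def train_bpe(words, num_merges):
    corpus = [list(w) for w in words]             # each word as a list of symbols
    vocab = {ch for w in corpus for ch in w}      # start from single characters
    for _ in range(num_merges):
        pairs = Counter(p for w in corpus for p in zip(w, w[1:]))
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)          # most frequent adjacent pair
        vocab.add(a + b)                          # merge into a new token
        for w in corpus:                          # apply the merge in place
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == (a, b):
                    w[i:i + 2] = [a + b]
                i += 1
    return vocab, corpus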

GPT-4's tokenizer has a vocabulary of approximately 100,000 tokens. For languages such as Japanese, whose characters are encoded as multiple bytes and appear less often in the data used to learn merges, a single character often maps to more than one token.

Why Subwords?

No OOV problem: Unknown words can be represented as combinations of subwords
Vocabulary control: A middle ground between character-level (small vocab, long sequences) and word-level (large vocab, OOV issues)
Multilingual support: The same BPE tokenizer can handle multiple languages
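To make the "no OOV" point concrete, here is a minimal sketch of encoding with an already-learned merge list. The `encode` function and the three merges are hypothetical, chosen only for illustration; real tokenizers also handle bytes, whitespace, and merge priorities more carefully.

```python
def encode(word, merges):
    """Apply learned merges, in order, to split a word into subword tokens."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # replace the pair with one token
            else:
                i += 1
    return symbols

# Hypothetical merge list, as if learned from a corpus containing "low" and "lower".
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(encode("lower", merges))    # ['low', 'er']
print(encode("lowest", merges))   # ['low', 'e', 's', 't']
```

The word "lowest" was never part of the assumed training corpus, yet it still encodes as a known subword plus single characters.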

Transformer Architecture

Let's examine the structure of the Transformer (decoder variant), the engine of LLMs.

A decoder-only Transformer has 5 core components. Let's explore each in detail.

1. Input Embedding + Positional Encoding

Converts token IDs into high-dimensional vectors (e.g., 4096 dimensions). Unlike RNNs, Transformers process inputs in parallel, so positional information must be explicitly injected via positional encoding.

Sinusoidal Positional Encoding (Original)

Vaswani et al. (2017)'s Transformer used sinusoidal positional encoding:

\text{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \text{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)

The heatmap below shows sinusoidal PE patterns. Lower dimensions (left) form high-frequency waves, while higher dimensions (right) form low-frequency waves.

Why sinusoids? For any fixed offset $k$, $\text{PE}(pos + k)$ can be expressed as a linear function of $\text{PE}(pos)$, making it easy for the model to learn relative positions.
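The linearity claim can be checked numerically. The sketch below computes the sinusoidal encoding and then derives $\text{PE}(pos + k)$ from $\text{PE}(pos)$ using only $k$, via the angle-addition identities. The function names `sinusoidal_pe` and `shift` are assumptions for this example.

```python
import math

def sinusoidal_pe(pos: int, d_model: int) -> list[float]:
    """PE vector for one position, following the sin/cos formula above."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])   # dims 2i and 2i+1
    return pe

def shift(pe: list[float], k: int, d_model: int) -> list[float]:
    """Map PE(pos) to PE(pos + k) with a rotation that depends only on k."""
    out = []
    for i in range(d_model // 2):
        w = k / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        out.extend([s * math.cos(w) + c * math.sin(w),   # sin(a + w)
                    c * math.cos(w) - s * math.sin(w)])  # cos(a + w)
    return out

shifted = shift(sinusoidal_pe(5, 8), k=3, d_model=8)
direct = sinusoidal_pe(8, 8)
print(all(abs(x - y) < 1e-9 for x, y in zip(shifted, direct)))
```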

RoPE (Rotary Position Embedding)

Modern LLMs (LLaMA, Gemma, Mistral, etc.) use RoPE. RoPE pairs vector dimensions and applies 2D rotations at position-dependent angles.

RoPE's key property: the dot product $q_m \cdot k_n$ depends only on the relative position $(m - n)$. This naturally captures relative positional relationships without explicitly encoding absolute positions.

\text{RoPE}(x, pos) = \begin{pmatrix} x_0 \cos(\theta \cdot pos) - x_1 \sin(\theta \cdot pos) \\ x_0 \sin(\theta \cdot pos) + x_1 \cos(\theta \cdot pos) \\ \vdots \end{pmatrix}

Because RoPE encodes position through rotation frequencies, those frequencies can be rescaled to handle sequences longer than those seen in training. This is the basis for context-length extensions such as NTK-aware scaling and YaRN.
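A small numerical check of the relative-position property follows. This is a sketch under assumptions: the per-pair frequency schedule `base ** (-2 * i / d)` mirrors the sinusoidal encoding and is assumed here, and the vectors `q` and `k` are made-up values.

```python
import math

def rope(x: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate each pair (x[2i], x[2i+1]) by the angle theta_i * pos."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)          # per-pair frequency (assumed schedule)
        c, s = math.cos(theta * pos), math.sin(theta * pos)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]   # made-up query vector
k = [1.0, 0.4, -0.7, 0.9]   # made-up key vector

# Same offset (m - n = 3) at different absolute positions gives the same score.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(abs(near - far) < 1e-9)
```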

2. Multi-Head Self-Attention

The heart of the Transformer. We'll cover the mechanics in detail in the next section, but here let's understand its role in the overall architecture.

A single attention head produces only one attention pattern per token. Multiple heads can learn different patterns in parallel, such as subject-verb agreement, adjective-noun modification, and coreference.

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \cdot W_O

where $\text{head}_i = \text{Attention}(Q W_Q^{(i)}, K W_K^{(i)}, V W_V^{(i)})$. GPT-3 uses 96 heads; LLaMA 2 70B uses 64 heads.

MQA / GQA — Optimizing Inference Efficiency

Standard multi-head attention (MHA) gives each head independent K, V projections, which makes the KV cache very large during inference. MQA and GQA shrink it by sharing K, V projections across heads:

Method	K,V Heads	Model Examples	Characteristics
MHA	= Q heads	GPT-3	Best quality, high memory
MQA	1	PaLM, StarCoder	All K,V shared, fastest inference
GQA	Q / g	LLaMA 2, Mistral	Balance of quality and efficiency

GQA (Grouped-Query Attention) groups Q heads and shares K,V within each group. LLaMA 2 70B adopted GQA, achieving near-MHA quality with significantly faster inference.
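To see why fewer K,V heads matter, the following sketch estimates KV cache size for one sequence. The 80 layers and 64 query heads come from the text; the head dimension of 128, the 8 K,V heads for GQA, the 4096-token context, and 16-bit storage are assumptions chosen for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache K and V for one sequence (factor 2 = K plus V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# head_dim, seq_len, and the GQA head count below are illustrative assumptions.
cfg = dict(n_layers=80, head_dim=128, seq_len=4096)
for name, kv_heads in [("MHA", 64), ("GQA", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(n_kv_heads=kv_heads, **cfg) / 2**30
    print(f"{name}: {gib:.3f} GiB")
```

Under these assumptions the cache shrinks in direct proportion to the number of K,V heads.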

3. Residual Connections & Layer Normalization

Each sublayer's output (self-attention and FFN) is added to its input (residual connection), and normalization is applied either after the addition (Post-LN) or to the sublayer's input (Pre-LN).

Why Residual Connections Are Essential

Without residual connections, backpropagated gradients can decay (or explode) exponentially as they pass through $N$ layers. Residual connections provide a shortcut path, so gradients can flow directly back to earlier layers.

x_{l+1} = x_l + F(x_l)

This enables training networks approaching 100 layers (GPT-3 has 96 layers, LLaMA 2 70B has 80 layers).

Post-LN vs Pre-LN

	Post-LN (Original)	Pre-LN (Modern)
Norm placement	After sublayer	Before sublayer
Norm method	LayerNorm	LayerNorm (GPT-3) → RMSNorm (LLaMA+)
Training stability	Requires warmup	Stable; converges easily even at scale
Adopted by	Original Transformer, BERT	GPT-3, LLaMA, Mistral

RMSNorm (Root Mean Square Layer Normalization) is a simplified LayerNorm that skips mean subtraction:

\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma, \quad \text{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2}

It's computationally cheaper while matching LayerNorm's performance, making it standard in post-LLaMA models.
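A minimal sketch of both normalizations makes the difference visible: RMSNorm drops the mean subtraction and the bias. The small `eps` term for numerical safety is an assumption not shown in the formula above, and the input vector is a made-up example.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: rescale by the root mean square; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

def layer_norm(x, gamma, beta, eps=1e-6):
    """LayerNorm for comparison: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

x = [0.82, -0.15, 0.44, 0.91]            # made-up activations
y = rms_norm(x, gamma=[1.0] * 4)
z = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print(sum(v * v for v in y) / len(y))    # mean square, close to 1.0
print(sum(z) / len(z))                   # mean, close to 0.0
```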

4. Feed-Forward Network (FFN)

A fully connected network applied independently to each token. While attention captures "relationships between tokens," the FFN performs "feature transformation of each token." With $d_{\text{ff}} = 4 d_{\text{model}}$, FFN weights make up roughly 2/3 of each Transformer block's parameters, acting as a massive "knowledge store."

Standard FFN

The original Transformer (Vaswani et al., 2017) used ReLU; BERT and the GPT series switched to GELU:

\text{FFN}(x) = \text{GELU}(x W_1 + b_1) W_2 + b_2

The intermediate dimension $d_{\text{ff}}$ is typically $4 \times d_{\text{model}}$ (e.g., $d = 4096$ → $d_{\text{ff}} = 16384$).

SwiGLU — The Modern Standard

Shazeer (2020) proposed SwiGLU, combining Gated Linear Units with Swish activation. Adopted by LLaMA, PaLM, Gemma, and others.

\text{SwiGLU}(x) = \bigl(\text{Swish}(x W_1) \odot (x W_3)\bigr) W_2

Why SwiGLU is superior:

Gating mechanism: $x W_3$ acts as a gate, controlling selective information flow
Swish activation: $\text{Swish}(x) = x \cdot \sigma(x)$ is smooth unlike ReLU and passes small negative values, improving gradient flow
Parameter budget: SwiGLU uses 3 matrices ($W_1, W_2, W_3$), so $d_{\text{ff}}$ is set to $\frac{8}{3} d_{\text{model}}$ (instead of the standard $4\times$) to keep the parameter count the same
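The SwiGLU formula and the parameter arithmetic can be sketched as follows. The helper names (`matvec`, `swiglu_ffn`, `ffn_params`) are assumptions for this example, biases are ignored, and flooring $\frac{8}{3} d$ to an integer is a simplification.

```python
import math

def swish(z):
    return z / (1.0 + math.exp(-z))          # z * sigmoid(z)

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swiglu_ffn(x, W1, W3, W2):
    """(Swish(x W1) * (x W3)) W2, with * taken elementwise."""
    hidden = [swish(a) * b for a, b in zip(matvec(x, W1), matvec(x, W3))]
    return matvec(hidden, W2)

def ffn_params(d_model, d_ff, n_matrices):
    return n_matrices * d_model * d_ff       # biases ignored

d = 4096
standard = ffn_params(d, 4 * d, n_matrices=2)        # W1, W2 with d_ff = 4d
gated = ffn_params(d, 8 * d // 3, n_matrices=3)      # W1, W2, W3 with d_ff ~ 8d/3
print(standard, gated, round(gated / standard, 4))
```

Both configurations come out at roughly $8 d^2$ weights, which is the sense in which the parameter count is preserved.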

5. Output Layer

Apply RMSNorm to the final transformer layer's output, then project to vocabulary size $|V|$ via a linear transformation (unembedding). Convert to a probability distribution via softmax to predict the next token.

Many LLMs use weight tying: $W_U = W_E^T$, sharing the embedding matrix $W_E$ with the output weight $W_U$. This reduces parameters by $|V| \times d$ and aligns the input/output semantic spaces.

Causal Mask in Decoders

In decoder-only models like GPT, a causal mask blocks attention to future tokens. Token at position $i$ can only attend to positions $1, 2, \ldots, i$.

Implementation-wise, the upper triangular portion of the score matrix is set to $-\infty$, which becomes zero after softmax. This ensures consistency with autoregressive generation (left-to-right, one token at a time).
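The masking step can be sketched in a few lines. The score matrix below is made up for illustration, and the code uses 0-based positions where the text uses 1-based ones.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over scores[i][j], masking every j > i with -inf."""
    n = len(scores)
    weights = []
    for i in range(n):
        row = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(row)                              # subtract max for stability
        exps = [math.exp(v - m) for v in row]     # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.5, 1.0, 2.0],      # made-up attention scores
          [0.1, 0.3, 0.9],
          [1.2, 0.4, 0.7]]
for row in causal_softmax(scores):
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its row is all weight on position 0, and every entry above the diagonal is exactly zero.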

Full Transformer Block Data Flow

Now let's trace how all these components work together by stepping through the data flow of a single Transformer block.

Transformer Block Forward Pass

Trace data flow through a single Transformer block on "I love cats".

Embed+PE → RMSNorm → Q,K,V → Attn W → Attn Out → Res+ → RMSNorm → FFN → Res+ → Output

Step 0: Input Embedding + Positional Encoding

Tokens: I, love, cats

X = Embed + PE

I	0.82	−0.15	0.44	0.91
love	0.33	0.78	−0.22	0.56
cats	0.61	0.42	0.73	−0.08

Token IDs are first converted to vectors via the embedding matrix, then positional encoding is added. Each token becomes a d_model=4 vector (real LLMs use 4096+ dimensions).
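The overall wiring of a Pre-LN block can be sketched with the sublayers passed in as functions. This is a structural sketch only: the learned scale in RMSNorm is omitted, and the zero-returning placeholder sublayers are assumptions used to show that the residual path carries the input through unchanged.

```python
import math

def rms_norm(row, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
    return [v / rms for v in row]          # learned scale gamma omitted

def add(a, b):
    """Elementwise sum of two (tokens x d_model) matrices."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def transformer_block(x, attention, ffn):
    """Pre-LN block: h = x + Attn(Norm(x)); output = h + FFN(Norm(h))."""
    h = add(x, attention([rms_norm(r) for r in x]))   # RMSNorm, attention, Res+
    return add(h, ffn([rms_norm(r) for r in h]))      # RMSNorm, FFN, Res+

# X = Embed + PE for "I love cats", copied from the Step 0 table.
X = [[0.82, -0.15, 0.44, 0.91],
     [0.33, 0.78, -0.22, 0.56],
     [0.61, 0.42, 0.73, -0.08]]

# Placeholder sublayers (assumptions): they return zeros, so only the
# residual path contributes. A real block plugs in attention and an FFN here.
zeros = lambda m: [[0.0] * len(r) for r in m]
print(transformer_block(X, attention=zeros, ffn=zeros) == X)   # True
```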

Self-Attention — The Core of Transformers

Self-attention dynamically computes relevance scores between all tokens, producing context-aware representations.

Formula

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V

$Q$ (Query): What each token is "looking for"
$K$ (Key): What each token "contains" (how it wants to be found)
$V$ (Value): The actual information each token passes along
$\sqrt{d_k}$: Scaling factor to prevent dot products from growing too large, which would saturate softmax
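The formula translates almost line for line into code. The sketch below uses plain Python lists; feeding the same toy matrix in as Q, K, and V (that is, identity projections) is a simplifying assumption, since real models first multiply by learned $W_Q$, $W_K$, $W_V$.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)                               # subtract max for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for (tokens x d_k) matrices."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V), weights

# Toy input: 3 tokens, 4 dimensions each (made-up values).
X = [[1.0, 0.2, 0.5, 0.8],
     [0.3, 0.9, 0.1, 0.6],
     [0.7, 0.4, 0.8, 0.2]]
out, weights = attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output row is a weighted average of the value vectors.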

Interactive Demo

Self-Attention Mechanism

Step through the self-attention computation on "The cat sat".

Tokens: The, cat, sat

X (Input)

The	1.0	0.2	0.5	0.8
cat	0.3	0.9	0.1	0.6
sat	0.7	0.4	0.8	0.2

Input embedding matrix X for "The cat sat" (each token is a 4-dim vector). We will compute Q, K, V matrices from this.

Why Self-Attention Is Revolutionary

Long-range dependencies: Unlike RNNs where distant token information decays, self-attention directly accesses all tokens
Parallel computation: RNNs process sequentially; self-attention uses matrix operations that GPUs parallelize efficiently
Interpretability: Visualizing attention weights reveals what the model "focuses on"

Computational Complexity

Self-attention has $O(n^2)$ complexity with respect to sequence length $n$, which becomes problematic for long contexts. Active research addresses this:

Technique	Complexity	Summary
Standard attention	$O(n^2)$	Computes all token pairs
Flash Attention	$O(n^2)$ (memory $O(n)$)	Optimizes IO efficiency, reduces HBM access
Multi-Query Attention	$O(n^2)$	Shares K,V heads to speed up inference
Grouped-Query Attention	$O(n^2)$	Between MQA and MHA; used in LLaMA 2+
Ring Attention	$O(n^2 / p)$	Distributes context across devices

Text Generation — Next Token Prediction

LLM inference is simple: given a context (input token sequence), output a probability distribution over the next token, sample from it, append it, and repeat — this is autoregressive generation.

Generation Strategies

Greedy decoding: Always pick the highest-probability token. Deterministic, but tends toward repetitive output.

Temperature scaling: Divide logits by temperature $T$ before softmax:

p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}

$T < 1$: Distribution sharpens → more deterministic (good for factual tasks)
$T > 1$: Distribution flattens → more diverse/creative (good for creative writing)
$T \to 0$: Converges to greedy decoding

Top-k sampling: Keep only the top $k$ tokens by probability, zero out the rest, and renormalize.

Top-p (Nucleus) sampling: Keep tokens until cumulative probability exceeds $p$ (e.g., 0.95). The candidate count varies dynamically, making it more flexible than top-k.
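The strategies above can be sketched together. The logits are illustrative values, and the function names are assumptions for this example; the top-p filter here stops as soon as the cumulative mass reaches `p`.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

logits = [3.2, 2.8, 1.5, 0.9, 0.4]                   # illustrative values
probs = top_k_filter(softmax(logits, temperature=0.7), k=3)
next_token = random.choices(range(len(logits)), weights=probs)[0]
print(next_token, [round(p, 3) for p in probs])
```

Greedy decoding corresponds to replacing the final `random.choices` call with an argmax over `probs`.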

Interactive Demo

Text Generation Demo

Watch how LLMs generate text one token at a time using temperature and top-k sampling.

Context: The cat ▌

Token	Logit
sat	3.2
(not shown)	2.8
ran	1.5
the	0.9
very	0.4

Input context "The cat". The model's final layer outputs raw logits (scores) for the next token.

Training Pipeline — From Raw Model to Conversational AI

Stage 1: Pre-training

Goal: Acquire general language knowledge

Train on next-token prediction using a massive text corpus (web text, books, code) with trillions of tokens.

\mathcal{L}_{\text{pretrain}} = -\sum_{t=1}^{T} \log P(x_t | x_{<t}; \theta)

At this stage, the model can generate text continuations but does not yet reliably "answer questions" or "follow instructions."
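The pre-training loss is simply the summed negative log-probability of each actual next token. A minimal sketch, with made-up probabilities standing in for a model's outputs:

```python
import math

def next_token_loss(step_probs):
    """Negative log-likelihood: -sum_t log P(x_t | x_<t).

    step_probs[t] is the probability the model assigned to the token that
    actually came next at step t.
    """
    return -sum(math.log(p) for p in step_probs)

# Made-up probabilities for a 3-token continuation (illustration only).
confident = next_token_loss([0.9, 0.8, 0.95])
uncertain = next_token_loss([0.2, 0.1, 0.3])
print(confident < uncertain)
```

Assigning higher probability to the tokens that actually occur lowers the loss, which is all that pre-training optimizes.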

Stage 2: Supervised Fine-Tuning (SFT)

Goal: Teach the model to follow instructions

Fine-tune on a dataset of high-quality (instruction, response) pairs (often on the order of 10K to 100K examples). This teaches the model the format of "responding to user queries."

[Instruction] What is the capital of Japan?
[Response] The capital of Japan is Tokyo.

Stage 3: RLHF (Reinforcement Learning from Human Feedback)

Goal: Align with human preferences

SFT alone can still produce factually incorrect or harmful outputs. RLHF improves this:

Train a reward model: Human annotators rank multiple responses → train a reward model
PPO (Proximal Policy Optimization): Update the policy (LLM) to maximize the reward model's score, with a KL penalty to prevent excessive divergence from the original model

\mathcal{J}_{\text{RLHF}} = \mathbb{E}\bigl[R(x, y) - \beta \, \text{KL}\bigl(\pi_\theta \| \pi_{\text{ref}}\bigr)\bigr]

DPO (Direct Preference Optimization) simplifies RLHF by optimizing the policy directly from preference data without explicitly training a reward model. LLaMA 2 itself used PPO-based RLHF, but DPO has since been widely adopted in open models such as Zephyr and LLaMA 3.
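For a single preference pair, the DPO objective from Rafailov et al. can be sketched as below. The inputs are sequence log-probabilities under the policy and the frozen reference model; the argument names and the default `beta=0.1` are assumptions for this example.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    chosen_ratio = logp_chosen - ref_logp_chosen          # log pi/pi_ref, preferred response
    rejected_ratio = logp_rejected - ref_logp_rejected    # log pi/pi_ref, rejected response
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log1p(math.exp(-margin))                  # equals -log(sigmoid(margin))

# Made-up log-probabilities: the policy starts equal to the reference...
start = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# ...then shifts probability toward the preferred response.
later = dpo_loss(-4.0, -9.0, -5.0, -7.0)
print(start > later)
```

When the policy equals the reference the loss is $\log 2$; it falls as the policy favors the preferred response more strongly than the reference does.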

Modern LLM Techniques

KV Cache

In autoregressive generation, recomputing all previous tokens' K and V at each step is wasteful. KV caching stores and reuses K, V from previous steps. This dramatically speeds up inference but increases memory consumption linearly with sequence length.
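A minimal sketch of the idea: each decoding step appends only the new token's K and V, then attends over everything cached so far. The `KVCache` class is hypothetical, and using the same toy vector as q, k, and v for each token is a simplifying assumption.

```python
import math

def attend(q, keys, values):
    """Attention output for one query over the cached keys and values."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

class KVCache:
    """Stores K and V rows from earlier steps so they are computed once."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)       # only the new token's K, V are added
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Toy per-token vectors (made-up values).
tokens = [[1.0, 0.2], [0.3, 0.9], [0.7, 0.4]]
cache = KVCache()
outputs = [cache.step(t, t, t) for t in tokens]
print(len(cache.keys), outputs[-1])
```

The cache grows by one K row and one V row per generated token, which is the linear memory growth described above.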

Quantization

Convert LLM parameters from 32/16-bit floating point to 4/8-bit integers to reduce memory and compute costs. At 4 bits per weight, a 70B model needs about 35 GB for its weights, so methods like GPTQ and AWQ can fit it on a single high-memory GPU with minimal accuracy loss.
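The basic float-to-integer mapping can be illustrated with plain round-to-nearest absmax quantization. This is deliberately the simplest scheme and is not GPTQ or AWQ, which add error-aware weight adjustments on top; the weights are made-up values.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0   # fall back if all zeros
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    return [q * scale for q in q_weights]

weights = [0.82, -0.15, 0.44, 0.91, -0.08]              # made-up values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error <= scale / 2)
```

Each weight is stored as one small integer plus a shared scale, and the round-trip error is bounded by half a quantization step.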

MoE (Mixture of Experts)

Exemplified by Mixtral, MoE splits each FFN layer into multiple "experts," with a router selecting a few (e.g., 2 of 8) per token. Total parameters are large, but only a fraction are active for any given token, which enables efficient scaling.
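Top-2-of-8 routing can be sketched as follows. The toy "experts" simply scale their input and stand in for FFNs, the router logits are made up, and normalizing the gates with a softmax over only the selected experts is one common choice assumed here.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax their logits into gates."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i],
                 reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs for one token."""
    gates = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for i, gate in gates.items():
        y = experts[i](x)                    # only k experts run for this token
        out = [o + gate * v for o, v in zip(out, y)]
    return out

# Eight toy experts that scale the input by 1..8 (stand-ins for FFNs).
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]   # made-up scores
gates = top_k_route(router_logits)
print(sorted(gates))                         # [1, 4]
print(moe_layer([1.0], experts, router_logits))
```

Only two of the eight expert functions are called for this token, which is why active parameters stay small even when the total is large.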

Context Length Extension

Early Transformers handled ~512 tokens; some models today accept contexts of over 1 million tokens. Techniques include RoPE frequency scaling, YaRN, and Ring Attention.

Summary

We've explored LLM internals through six pillars:

History: Word2Vec → Transformer → BERT/GPT → scaling laws enabling massive models
Tokenization: BPE subword splitting converts arbitrary text to numerical sequences
Transformer: N layers of self-attention + FFN + residual connections
Self-attention: $\text{softmax}(QK^T/\sqrt{d_k})V$ dynamically computes relevance across all tokens
Text generation: Temperature, top-k, top-p sampling for stochastic autoregressive generation
Training pipeline: Pre-training → SFT → RLHF/DPO transforms a "next-token predictor" into a "conversational AI"

The seemingly simple principle of "predict the next token" — when scaled sufficiently — produces the remarkable capabilities that power modern AI.

References

A. Vaswani et al. Attention Is All You Need. NeurIPS 2017.
J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers. NAACL 2019.
T. Brown et al. Language Models are Few-Shot Learners. NeurIPS 2020 (GPT-3).
J. Kaplan et al. Scaling Laws for Neural Language Models. 2020.
J. Hoffmann et al. Training Compute-Optimal Large Language Models. NeurIPS 2022 (Chinchilla).
L. Ouyang et al. Training language models to follow instructions with human feedback. NeurIPS 2022 (InstructGPT / RLHF).
R. Rafailov et al. Direct Preference Optimization. NeurIPS 2023.