
Chapter 2: Foundation Models for Social Text — Adaptation, Alignment, and Deployment

About This Chapter

The analytics pipeline that defined the earlier chapters of this book was built on a simple architecture: turn words into sparse counts (TF-IDF), discover latent topics with probabilistic models (LDA), and train task-specific classifiers on labelled examples. That architecture worked. It produced publishable results and deployable systems. But it carried a hidden cost: every new task required its own labelled dataset, its own feature engineering decisions, and its own evaluation cycle. A brand monitoring operation tracking sentiment, named entities, topic membership, crisis signals, and competitor mentions needed, in effect, five separate models — each one fragile to domain shift and each one expensive to maintain.

The field has converged on a different architecture. A single foundation model — a large neural network pretrained on web-scale data — can replace twenty task-specific classifiers, generalise to tasks it has never explicitly seen, and continue to improve every time the underlying base model is upgraded. The migration from TF-IDF plus LDA to fine-tuned foundation models is not a marginal improvement; it is the same kind of transition as moving from manual ledgers to spreadsheets. The mechanism changes; the analyst’s job description changes with it.

This chapter develops the mechanics of that transition. We cover why certain models qualify as “foundation models” and what distinguishes the open-weight landscape of 2026. We derive, in full mathematical detail, the two techniques that make large-model adaptation practical at limited compute: LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA). We then turn to alignment — the process of transforming a raw pretrained model into one that follows instructions reliably — developing the mathematics of RLHF (Reinforcement Learning from Human Feedback), PPO (Proximal Policy Optimization), and the elegant simplification known as DPO (Direct Preference Optimization). Throughout, live demonstrations in pure numpy and scikit-learn illustrate the key mechanisms. Production-ready code for the full HuggingFace stack appears in non-executing blocks, clearly marked for use in Colab or a local GPU environment.

The 2024–2030 frontier is defined by three simultaneous shifts: adaptation cost dropping from millions of dollars to tens of dollars per task; alignment quality improving with synthetic rather than human feedback; and deployment moving from cloud APIs toward on-device inference. By the end of this chapter, you will have the conceptual and mathematical vocabulary to participate in that frontier — to read the current literature, to specify a fine-tuning workflow, and to evaluate what a vendor or colleague means when they claim their model has been “aligned.”

Reference

This chapter builds on the transformer foundations introduced earlier in this book and assumes familiarity with the attention mechanism and the pretraining/fine-tuning/prompting paradigm. Key external references are: Bommasani et al. (2021), “On the Opportunities and Risks of Foundation Models” (Stanford CRFM); Hu et al. (2021), “LoRA: Low-Rank Adaptation of Large Language Models” (ICLR 2022); Dettmers et al. (2023), “QLoRA: Efficient Finetuning of Quantized LLMs” (NeurIPS 2023); Ouyang et al. (2022), “Training language models to follow instructions with human feedback” (InstructGPT); Rafailov et al. (2023), “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” (NeurIPS 2023). The production Python blocks in this chapter require transformers, peft, trl, bitsandbytes, and torch; these cannot run in the browser and are clearly marked throughout.


Table of Contents

  1. What Makes a Model a Foundation Model
  2. Full Fine-Tuning vs. Parameter-Efficient Fine-Tuning
  3. LoRA: Low-Rank Adaptation
  4. QLoRA: Quantized Fine-Tuning on Consumer Hardware
  5. Alignment — From Pretrained to Useful
  6. RLHF and the PPO Objective
  7. DPO: Direct Preference Optimization
  8. Constitutional AI and RLAIF
  9. Fine-Tuning Workflow in Practice
  10. Distillation and Small Models
  11. Quantization for Inference
  12. Deployment: vLLM, Llama.cpp, Ollama, MLX
  13. Mini Case Study — Brand-Voice Generation
  14. Evaluation: Benchmarks to Production Metrics
  15. Closing — The 5–10 Year Outlook

What Makes a Model a Foundation Model

The definition

The term foundation model was coined by Bommasani et al. (2021) in a 200-page report from the Stanford Center for Research on Foundation Models. The authors defined a foundation model as any model that (1) is trained at scale on broad data using self-supervision, and (2) can be adapted to a wide range of downstream tasks. Three properties jointly constitute the definition.

Scale — parameter count above roughly one billion, trained on at least hundreds of billions of tokens of web-scale text. Scale is not merely a quantitative property; it is qualitative. Models below roughly 1B parameters tend to specialise or fail gracefully. Above that threshold, a phase transition occurs: the model exhibits emergent capabilities — zero-shot reasoning, chain-of-thought, few-shot generalisation to entirely new task formats — that were not present at smaller scale and were not explicitly trained for. Wei et al. (2022) documented dozens of such emergent abilities across arithmetic, symbolic reasoning, and natural language inference.

Unsupervised pretraining on web-scale data — the model learns representations from raw text, not from labels. The dominant objective for decoder models (GPT family, LLaMA, Mistral) is causal language modelling: maximise \(\sum_t \log p_\theta(x_t \mid x_{<t})\) over a corpus of trillions of tokens. The model never receives a label; it simply learns to predict the next token. At scale, predicting the next token in a diverse corpus requires the model to implicitly learn grammar, facts, commonsense reasoning, and stylistic conventions — all without supervision.
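As a toy illustration of the objective (the per-token probabilities below are made-up numbers, not outputs of a real model), the log-likelihood is a simple sum, and its exponentiated negative mean is the familiar perplexity:

import numpy as np

p_next = np.array([0.21, 0.08, 0.55, 0.30])      # p(x_t | x_<t) for 4 tokens
log_likelihood = np.log(p_next).sum()            # the pretraining objective
perplexity = np.exp(-log_likelihood / len(p_next))
print(f"log-likelihood {log_likelihood:.3f}, perplexity {perplexity:.2f}")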

Downstream adaptability — the same pretrained checkpoint can be adapted to dozens of different tasks: sentiment analysis, question answering, code generation, translation, summarisation, entity extraction. This adaptability is what gives foundation models their economic leverage: the pretraining cost is amortised across every downstream application. A model pretrained for \(\$10{,}000{,}000\) but fine-tuned and deployed for fifty distinct tasks effectively costs \(\$200{,}000\) per task in pretraining amortisation — a figure that will only fall as base models improve on the open-weight releases every six to twelve months.

The open-weight landscape as of 2026

The foundation model ecosystem has bifurcated into a closed-weight tier (OpenAI GPT-4o and o-series, Anthropic Claude 3.5 and 4 family, Google Gemini Ultra/Pro) and an open-weight tier that democratises adaptation.

The open-weight tier as of 2026 is anchored by five model families:

LLaMA 3 / LLaMA 4 (Meta AI). LLaMA 3, released April 2024, was the first open-weight model to challenge proprietary APIs on standard benchmarks. The 8B and 70B parameter variants became the dominant base for community fine-tuning. LLaMA 4 Scout (2025, 17B active parameters, mixture-of-experts with 16 experts, 10M token context) and LLaMA 4 Maverick (17B active / 400B total) extended the family to multimodal inputs.

Mistral / Mixtral (Mistral AI, Paris). Mistral 7B (2023) demonstrated that aggressive architectural choices — sliding window attention, grouped query attention — could produce a 7B model competitive with much larger predecessors. Mixtral 8×7B introduced sparse mixture-of-experts to the open-weight space: 8 expert FFN blocks per layer with 2 activated per token, giving 47B total parameters but only 13B active at inference time.

Qwen 3 (Alibaba Cloud). Qwen 3, released May 2025, is a hybrid thinking-mode architecture spanning 0.6B to 235B parameters. The 235B model uses a 22B-active mixture-of-experts design. Qwen 3 is the dominant open-weight choice for Chinese-language tasks and competitive on multilingual benchmarks.

Gemma 3 (Google DeepMind). Gemma 3, released March 2025, offers 1B, 4B, 12B, and 27B variants with a 128K token context window. It is optimised for on-device and single-GPU deployment, making it a practical choice for researchers without access to A100 clusters.

DeepSeek-V3 (DeepSeek AI, China). Released December 2024, DeepSeek-V3 is a 671B-parameter mixture-of-experts model (37B active) trained for $6 million — roughly 10–20× cheaper than comparable-scale proprietary training runs. Its reported cost efficiency triggered significant re-evaluation of scaling economics across the industry.

The strategic split between closed and open is fundamentally about business model. OpenAI, Anthropic, and Google monetise through API access and consumer products; releasing weights would commoditise their core asset. Meta, Mistral, and Alibaba treat open weights as a platform strategy: lowering the cost of AI adoption expands the ecosystem around their other products and recruits alignment research from the global community. DeepSeek’s transparency about training costs introduced a new pressure on the entire industry.

In practice: Meta LLaMA 3 release dynamics

When Meta released LLaMA 3 8B and 70B in April 2024, the HuggingFace Hub registered over 1,200 derivative fine-tuned models within 30 days of release — specialised variants for medical note summarisation, code generation, customer service scripting, and financial analysis. The speed of community adaptation illustrates the economics of the foundation model era: Meta absorbs a multi-billion-dollar pretraining cost; the community generates thousands of specialised models for free. For a brand analytics team, this means that a strong open-weight base model fine-tuned on your domain data is now a realistic alternative to an OpenAI API subscription — at substantially lower per-query cost once the fixed fine-tuning investment is made.


Full Fine-Tuning vs. Parameter-Efficient Fine-Tuning

The cost of full fine-tuning

Full fine-tuning updates every weight in the model. A transformer with \(L\) layers, hidden dimension \(d\), and attention dimension \(d_a\) has, per layer, roughly four weight matrices of size \(d \times d\) (the query, key, value, and output projections) and two FFN matrices of size \(d \times 4d\) and \(4d \times d\). Ignoring biases and layer norms, the total number of trainable parameters per layer is approximately:

\[\text{params per layer} = 4d^2 + 8d^2 = 12d^2\]

For LLaMA 3 8B, \(d = 4096\) and \(L = 32\), giving approximately \(12 \times 4096^2 \times 32 \approx 6.4 \times 10^9\) parameters — the full 8B. During fine-tuning, the AdamW optimiser stores not only the weights but also the first and second moment estimates for each parameter, tripling the memory requirement. In mixed-precision (fp16/bf16) training, each parameter requires 2 bytes; Adam requires 2 additional 4-byte tensors per parameter:

\[\text{GPU memory} \approx N_\text{params} \times (2 + 4 + 4) \text{ bytes} = N_\text{params} \times 10 \text{ bytes}\]

For the 8B model, this is approximately 80 GB before gradient and activation memory is counted — beyond the capacity of a single A100 80GB GPU once those are included. Full fine-tuning of a 70B model requires a cluster of 8–16 high-end GPUs. For a research group or a corporate analytics team without dedicated ML infrastructure, this is prohibitive.

The table below makes the cost hierarchy concrete for three widely used model sizes:

| Model | Params | Full FT memory | LoRA (r=8) memory | LoRA trainable params |
|---|---|---|---|---|
| Mistral 7B | 7.2B | ~72 GB | ~16 GB | ~4.2M (0.06%) |
| LLaMA 3 8B | 8.0B | ~80 GB | ~18 GB | ~4.7M (0.06%) |
| LLaMA 3 70B | 70.6B | ~706 GB | ~140 GB | ~41M (0.06%) |

The LoRA memory column assumes QLoRA (NF4 base, bf16 adapters), which reduces the base model footprint by roughly 4× from the full-FT baseline. The LLaMA 3 70B QLoRA case fits on a single 80GB A100 with gradient checkpointing.

The adapter insight

The key observation motivating parameter-efficient methods is this: most of the model’s knowledge is already in the pretrained weights. A downstream task typically requires adjusting the model’s behaviour in a low-dimensional way — shifting the sentiment of a financial text classifier, adapting the style of a tweet generator, focusing the attention of a question-answering model on domain-specific vocabulary. This adjustment does not require relearning everything; it requires a small, targeted intervention.

Adapter layers, introduced by Houlsby et al. (2019), operationalise this insight by inserting small trainable bottleneck modules inside each transformer block. The pretrained weights are frozen; only the adapter parameters are updated. Each adapter is a two-layer MLP: a down-projection \(W_\text{down} \in \mathbb{R}^{m \times d}\) that reduces the hidden dimension from \(d\) to a small bottleneck \(m \ll d\), followed by a nonlinearity (typically ReLU or GELU), followed by an up-projection \(W_\text{up} \in \mathbb{R}^{d \times m}\) that restores the dimension. A residual connection adds the original representation back to the adapter output:

\[h_\text{out} = h + W_\text{up} \cdot \text{ReLU}(W_\text{down} \cdot h)\]

With \(m = 64\) and \(d = 4096\), each adapter adds \(2 \times 4096 \times 64 = 524{,}288\) parameters. Over 32 layers with two adapters each, the total addition is approximately 33M parameters — less than 0.5% of the 8B base model. Training cost drops proportionally.
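A minimal numpy sketch of this forward pass at the dimensions above — the weights are random stand-ins, with the up-projection zero-initialised so the adapter starts as an exact no-op:

import numpy as np

rng = np.random.default_rng(0)
d, m = 4096, 64
W_down = rng.normal(scale=0.02, size=(m, d))     # down-projection: d -> m
W_up = np.zeros((d, m))                          # up-projection: m -> d
h = rng.normal(size=d)                           # hidden state from frozen block

h_out = h + W_up @ np.maximum(W_down @ h, 0.0)   # residual + ReLU bottleneck
print(np.allclose(h_out, h))                     # True: starts as identity
print("adapter parameters:", W_down.size + W_up.size)   # 524288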

Prompt tuning (Lester et al., 2021) takes a different approach: instead of modifying the model architecture, it prepends a small set of trainable “soft prompt” tokens to the input. These tokens have no fixed word meaning; they are learnable vectors in the embedding space. The rest of the model is frozen. Prompt tuning introduces as few as tens of thousands of parameters per task, but it underperforms adapter methods at small model scales and requires gradient access to the model’s embeddings during training (making it unsuitable for adapting closed models served only through an API).

Both adapters and prompt tuning are now mostly superseded in practice by LoRA, which achieves comparable or better performance with cleaner mathematics and no inference-time overhead.


LoRA: Low-Rank Adaptation

The key insight

Hu et al. (2021) observed an empirical regularity: the weight updates produced by full fine-tuning have low intrinsic rank. That is, when you train a pretrained model on a downstream task and compute \(\Delta W = W_\text{fine-tuned} - W_\text{pretrained}\), the matrix \(\Delta W\) can be well-approximated by a product of two low-rank matrices. The effective rank of the update is far smaller than the ambient dimension \(d\).

If \(\Delta W \approx BA\) where \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times d}\) with \(r \ll d\), then instead of training \(\Delta W\) directly (which has \(d^2\) parameters), we can train \(B\) and \(A\) (which have \(2dr\) parameters). The cost reduction is:

\[\frac{d^2}{2dr} = \frac{d}{2r}\]

With \(r = 8\) and \(d = 4096\), this is a \(\frac{4096}{16} = 256\times\) reduction in trainable parameters per weight matrix. Across the full model, LoRA with \(r = 8\) applied to the query and value projections of LLaMA 3 8B adds approximately 4.2M trainable parameters — 0.05% of the base model.

The forward pass

During forward computation, the original pretrained weight \(W \in \mathbb{R}^{d \times d}\) is frozen — its values never change. The LoRA matrices \(A\) and \(B\) are trained. The forward pass for a hidden state \(x \in \mathbb{R}^d\) is:

\[h = Wx + \underbrace{BAx}_{\text{LoRA correction}}\]

The pretrained path \(Wx\) is computed as in inference; the LoRA correction \(BAx\) is a rank-\(r\) perturbation applied on top. At initialisation, \(A\) is drawn from \(\mathcal{N}(0, \sigma^2)\) and \(B\) is set to zero, so the LoRA correction starts at zero — the model begins fine-tuning from the exact pretrained behaviour. This is important: it prevents the early training steps from destabilising the pretrained representations.

The LoRA scaling factor introduces a hyperparameter \(\alpha\) (defaulting to \(r\) in most implementations): the actual correction applied is \(\frac{\alpha}{r} BAx\). Setting \(\alpha = r\) recovers the original formulation; setting \(\alpha = 2r\) amplifies the adapter’s influence. In practice, \(\alpha\) and \(r\) are chosen together to control the effective learning rate of the adapter relative to the base model.

Memory and storage

After fine-tuning, the LoRA correction can be merged into the base weight: \(W_\text{merged} = W + \frac{\alpha}{r} BA\). The merged model is identical to a fully fine-tuned model at inference time — no extra compute or latency. This is the key advantage of LoRA over adapters: adapters add a sequential bottleneck at every inference step; LoRA’s correction can be folded away.

For multi-task deployment, multiple sets of LoRA weights can be stored and swapped: the same base model in GPU memory serves multiple fine-tuned personalities simply by adding the appropriate \(B_i A_i\) correction at inference time. This is the architecture underlying services like Together AI and Predibase, which serve dozens of fine-tuned variants of a single base model with near-zero overhead per variant.
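A quick numerical check of the merge identity on small random stand-in matrices, with the \(\frac{\alpha}{r}\) scaling from the previous subsection included:

import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 8
W = rng.normal(size=(d, d))                      # frozen pretrained weight
B = rng.normal(size=(d, r))                      # trained LoRA factors
A = rng.normal(size=(r, d))
x = rng.normal(size=d)

two_path = W @ x + (alpha / r) * (B @ (A @ x))   # unmerged forward pass
merged = (W + (alpha / r) * B @ A) @ x           # after folding BA into W
print(np.allclose(two_path, merged))             # True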

Predict: parameter savings across adapter ranks

Before running the cell below, estimate: for a weight matrix with \(d = 512\) and LoRA rank \(r = 16\), what fraction of parameters does LoRA train relative to full fine-tuning of that matrix? Now compute for \(d = 4096\) and \(r = 8\). Do these match the \(256\times\) saving claimed above?
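The following sketch stands in for the cell described above: it prints, for both settings in the prediction and several other ranks, the fraction \(\frac{2dr}{d^2} = \frac{2r}{d}\) of parameters LoRA trains relative to full fine-tuning of that matrix.

# Parameter savings of LoRA (2*d*r trainable) vs full fine-tuning (d^2)
# for a single d x d weight matrix, across illustrative ranks.
for d in (512, 4096):
    for r in (2, 4, 8, 16, 32):
        full, lora = d * d, 2 * d * r
        print(f"d={d:5d}  r={r:2d}  LoRA params={lora:9,d}  "
              f"fraction={lora / full:8.4%}  saving={full / lora:5.0f}x")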

Interpretation. Read each row of the output as: “For a \(d \times d\) weight matrix with this rank choice, LoRA trains this many parameters instead of \(d^2\).” The \(256\times\) figure cited earlier is the \(d = 4096\), \(r = 8\) row. Even at \(r = 32\) (a generous rank for most fine-tuning tasks), LoRA trains only about 1.6% of the targeted weight parameters at \(d = 4096\). The practical consequence: the adapter adds negligible GPU memory and negligible training time per iteration; all the cost reduction comes for free relative to the approximation quality.

Live demo: LoRA on toy linear regression

The cell below constructs a synthetic “pretrained” linear map \(W_0 \in \mathbb{R}^{d \times d}\) and a true weight update \(\Delta W^*\) with known rank. We then show that:

  1. When \(\Delta W^*\) has rank exactly \(r^* \leq r\), LoRA with adapter rank \(r\) recovers the update perfectly.
  2. When \(\Delta W^*\) has full rank, LoRA with small \(r\) approximates but does not recover the update — and the approximation error is measured.
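Below is a self-contained numpy sketch of this experiment. The dimensions, learning rate, and step count are illustrative assumptions; plain gradient descent on \(B\) and \(A\) stands in for a full training loop.

import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 500
W0 = rng.normal(size=(d, d)) / np.sqrt(d)        # synthetic "pretrained" map
X = rng.normal(size=(n, d))                      # probe inputs

def lora_fit(dW_true, r, steps=1500, lr=0.1):
    """Gradient descent on B, A only (W0 frozen) to fit y = (W0 + dW_true) x."""
    Y = X @ (W0 + dW_true).T
    A = rng.normal(scale=0.01, size=(r, d))      # A ~ N(0, sigma^2) at init
    B = np.zeros((d, r))                         # B = 0: correction starts at zero
    for _ in range(steps):
        E = X @ (W0 + B @ A).T - Y               # residuals, shape (n, d)
        gB = E.T @ (X @ A.T) / n                 # gradient of ||E||_F^2 / (2n) wrt B
        gA = B.T @ (E.T @ X) / n                 # ... wrt A
        B -= lr * gB
        A -= lr * gA
    return np.linalg.norm(B @ A - dW_true) / np.linalg.norm(dW_true)

dW_low = rng.normal(size=(d, 4)) @ rng.normal(size=(4, d)) / d   # rank-4 delta
dW_full = rng.normal(size=(d, d)) / d                            # full-rank delta
for name, dW in (("rank-4 delta   ", dW_low), ("full-rank delta", dW_full)):
    errs = "  ".join(f"r={r}: {lora_fit(dW, r):.3f}" for r in (2, 4, 8, 16))
    print(name, errs)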

Interpretation. The first scenario is the low-rank case: when the true delta has rank 4, LoRA with \(r = 4\) or \(r = 8\) recovers it with near-zero relative error after convergence; \(r = 2\) (underfit) achieves a meaningful but imperfect approximation. The second scenario is the full-rank case: LoRA with small \(r\) cannot recover a truly full-rank delta, and increasing \(r\) helps — but never reaches zero error unless \(r = d\). The practical implication is exactly the empirical claim of Hu et al. (2021): fine-tuning updates in practice tend to be low-rank (intrinsic rank typically in the range 1–8), so LoRA’s 256× parameter savings come essentially for free.

In practice: Bloomberg BloombergGPT and domain LoRA

Bloomberg’s 2023 BloombergGPT paper (Wu et al., 2023) trained a 50B-parameter model from scratch on 363B tokens of financial text. The result was a model that outperformed general-purpose LLMs on financial NLP benchmarks (FPB, FiQA, NER) — but the training cost was estimated at several million dollars. The LoRA alternative, demonstrated by multiple subsequent papers, achieves 90–95% of BloombergGPT’s domain benchmark performance by LoRA-fine-tuning LLaMA 3 8B on a curated 5B-token financial corpus — at a training cost of approximately $200–500 on a single A100 80GB node. For practitioners, this arithmetic is decisive: LoRA-adapted open-weight models have made domain specialisation accessible to teams without hyperscaler compute budgets.


QLoRA: Quantized Fine-Tuning on Consumer Hardware

The memory barrier

LoRA reduces the number of trainable parameters from 8B to roughly 4M. But the base model’s weights still occupy GPU memory in fp16 or bf16 — the 8B model requires approximately 16 GB just for the frozen weights, before accounting for activations, the LoRA parameters, and the Adam moments for the LoRA parameters. On a single NVIDIA RTX 3090 (24 GB), a 13B-parameter model is out of reach with LoRA alone.

Dettmers et al. (2023) introduced QLoRA (Quantized LoRA) to break this barrier. The key idea: quantize the frozen base model to 4-bit precision to slash its memory footprint, while keeping the trainable LoRA adapters in full bf16. Gradients flow back through the quantized frozen weights via dequantization, and the LoRA adapters absorb the fine-tuning signal. The result: a 65B-parameter model can be fine-tuned on a single NVIDIA A100 80GB GPU — something that previously required at least 8 A100s running full fine-tuning.

NF4: Normal Float 4-bit quantization

The standard 4-bit integer (INT4) quantization maps floating-point values to 16 equally-spaced bins. This is suboptimal for neural network weights, which are approximately normally distributed. QLoRA introduces NF4 (Normal Float 4-bit), a quantization datatype whose bin boundaries are chosen to be the quantiles of a standard normal distribution.

Let \(q_1 < q_2 < \cdots < q_{16}\) be the 16 quantiles of \(\mathcal{N}(0,1)\) at levels \(\left\{\frac{i-0.5}{16}\right\}_{i=1}^{16}\). Given a weight tensor \(w \in \mathbb{R}^n\), QLoRA normalises to the range \([-1, 1]\) by dividing by the absolute maximum:

\[\hat{w}_i = \frac{w_i}{\max_j |w_j|}\]

The quantized code is then:

\[q(w_i) = \arg\min_{k \in \{1,\ldots,16\}} |q_k - \hat{w}_i|\]

Dequantization recovers an approximation:

\[\tilde{w}_i = q_{q(w_i)} \cdot \max_j |w_j|\]

NF4 is information-theoretically optimal for normally distributed weights in the sense that each of the 16 quantization bins receives equal expected probability mass under the normal distribution, minimising the expected quantization error at a fixed 4-bit budget. The practical benefit is a meaningful improvement in quantization error over INT4, without any change in storage cost.
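The following numpy/scipy sketch implements the simplified NF4 scheme defined above and compares its round-trip error against 16 equally spaced INT4-style bins. The production bitsandbytes codebook differs in detail (for instance, it represents zero exactly and uses small per-block constants).

import numpy as np
from scipy.stats import norm

# Simplified NF4 codebook: 16 normal quantiles, rescaled into [-1, 1].
levels = norm.ppf((np.arange(1, 17) - 0.5) / 16)
levels /= np.abs(levels).max()

def quantize(w, codebook):
    scale = np.abs(w).max()                              # absmax constant
    codes = np.abs(codebook[None, :] - (w / scale)[:, None]).argmin(axis=1)
    return codebook[codes] * scale                       # dequantized values

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=20_000)                  # weight-like tensor

int4_levels = np.linspace(-1.0, 1.0, 16)                 # 16 equally spaced bins
for name, cb in (("NF4 ", levels), ("INT4", int4_levels)):
    err = np.sqrt(np.mean((w - quantize(w, cb)) ** 2))
    print(f"{name} RMS quantization error: {err:.6f}")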

Double quantization

QLoRA applies a further trick called double quantization: the quantization constants themselves (the per-block \(\max_j |w_j|\) values, one per block of 64 weights) are stored in fp32, costing \(\frac{32}{64} = 0.5\) bits per weight. Double quantization re-quantizes these constants in fp8, reducing this overhead to \(\frac{8}{64} = 0.125\) bits per weight — a saving of approximately 0.37 bits per parameter, which across 8B parameters amounts to roughly 370MB of recovered GPU memory. The QLoRA paper reports that a LLaMA 65B model fine-tuned with QLoRA requires approximately 48 GB of GPU memory — fitting on a single A100 80GB with room for activations.

Production setup: QLoRA in HuggingFace

This block does not run in your browser

The code below requires bitsandbytes, peft, transformers, and torch. It assumes a CUDA-capable GPU with at least 24 GB VRAM (e.g., RTX 3090, A10G, A100). Run this in Google Colab Pro (A100 runtime) or a local GPU machine. The bitsandbytes library performs the NF4 quantization automatically during model loading.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch

# ── Step 1: define 4-bit quantization configuration ────────────
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,    # double quantization of constants
    bnb_4bit_quant_type="nf4",         # Normal Float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # LoRA adapters in bf16
)

# ── Step 2: load base model in 4-bit ───────────────────────────
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=False,
)
model.config.use_cache = False

# ── Step 3: define LoRA configuration ─────────────────────────
lora_config = LoraConfig(
    r=8,                          # adapter rank
    lora_alpha=16,                # scaling factor alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# ── Step 4: wrap the model with PEFT ──────────────────────────
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expected output (approximate — exact counts depend on quantization
# accounting; with r=8 on all seven projection types, roughly):
# trainable params: ~21,000,000 || all params: ~8,050,000,000 || trainable%: ~0.26

What this achieves. The 8B model, in bf16, would require 16 GB for weights alone. After NF4 4-bit quantization with double quantization, the frozen base weights occupy approximately 5.5 GB. The LoRA adapters add roughly 40 MB in bf16. Total GPU footprint for fine-tuning (including activations with gradient checkpointing) fits inside 12 GB — bringing QLoRA fine-tuning of LLaMA 3 8B within reach of a single consumer RTX 4090.


Alignment — From Pretrained to Useful

What pretraining does not give you

A model trained solely on the next-token prediction objective learns to complete text in the statistical style of its training corpus. Asked “How do I break into a car?”, a raw pretrained model will happily continue with a tutorial, because lock-picking discussions exist in its training data. Asked “What is the capital of France?”, it will sometimes answer “Paris” and sometimes continue with a passage about French history rather than stopping at the answer. Raw pretraining produces a powerful distribution-matching machine; it does not produce a helpful, honest, harmless assistant.

Alignment is the process of shaping a pretrained model’s behaviour to be consistent with human intent and human values. The term covers a wide range of techniques, from simple supervised fine-tuning on instruction-response pairs to sophisticated reinforcement learning from human feedback. The standard pipeline that emerged from OpenAI’s InstructGPT work (Ouyang et al., 2022) has three stages.

Stage 1 — Supervised Fine-Tuning (SFT). Construct a dataset of (instruction, desired response) pairs. Instructions cover a diverse range of tasks: question answering, creative writing, code generation, refusal of harmful requests. Desired responses are written by human contractors following a quality rubric. Fine-tune the pretrained model on this dataset using standard cross-entropy:

\[\mathcal{L}_\text{SFT} = -\sum_t \log p_\theta(y_t \mid x, y_{<t})\]

where \(x\) is the instruction and \(y\) is the desired response. SFT alone substantially improves instruction-following behaviour: the model learns the conversational turn structure and the expected output format. But SFT is limited by the quality and coverage of the demonstration dataset. Rare task types and subtle value trade-offs are hard to cover with a finite set of human-written examples.

Stage 2 — Reward Modeling. To go beyond explicit demonstrations, we collect preference data: for a given instruction \(x\), ask human raters to rank multiple model completions from best to worst. Given a pair \((y_w, y_l)\) where \(y_w\) is preferred over \(y_l\), train a reward model \(r_\phi(x, y)\) — a language model with a scalar output head — to satisfy:

\[r_\phi(x, y_w) > r_\phi(x, y_l)\]

The Bradley-Terry model gives the probability that \(y_w\) is preferred:

\[p(y_w \succ y_l \mid x) = \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\]

The reward model is trained by minimising the negative log-likelihood of the human preference labels:

\[\mathcal{L}_\text{RM} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]\]
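To make this objective concrete, the numpy sketch below fits a linear reward \(r_\phi(x, y) = \phi^\top f(x, y)\) on synthetic preference pairs by gradient descent on \(\mathcal{L}_\text{RM}\). The feature vectors and the hidden “rater” direction are illustrative stand-ins for LLM representations.

import numpy as np

rng = np.random.default_rng(1)
n, dim = 500, 16
phi_true = rng.normal(size=dim)                  # hidden "human preference"
F_a = rng.normal(size=(n, dim))                  # features of completion a
F_b = rng.normal(size=(n, dim))                  # features of completion b
pref = (F_a @ phi_true) >= (F_b @ phi_true)      # rater picks the better one
F_w = np.where(pref[:, None], F_a, F_b)          # chosen
F_l = np.where(pref[:, None], F_b, F_a)          # rejected

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
phi = np.zeros(dim)                              # reward model parameters
for _ in range(300):
    margin = (F_w - F_l) @ phi                   # r(x, y_w) - r(x, y_l)
    grad = -((1.0 - sigmoid(margin))[:, None] * (F_w - F_l)).mean(axis=0)
    phi -= 0.5 * grad                            # descend the Bradley-Terry NLL
print(f"pairs ranked correctly: {((F_w - F_l) @ phi > 0).mean():.3f}")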

Stage 3 — Reinforcement Learning. Use the trained reward model as a proxy for human preference and fine-tune the SFT model to maximise expected reward. This is the RLHF stage, described in detail in the next section.

The three-stage InstructGPT pipeline is now the canonical template for proprietary model alignment. The scale required was significant: Ouyang et al. (2022) reported approximately 40,000 human-labelled instruction-response pairs for SFT and 33,000 human comparison labels for reward modelling. Running each stage required weeks of GPU time at the scale of the original GPT-3 (175B parameters). The practical consequence: RLHF as implemented in InstructGPT was economically accessible only to organisations with significant ML infrastructure budgets. This is precisely what motivated the simpler alternatives described in the DPO section.


RLHF and the PPO Objective

The policy gradient formulation

In the RL framing, the language model is a policy \(\pi_\theta\) that maps an instruction (state) \(x\) to a completion (action sequence) \(y\). The reward signal comes from the reward model: \(r(x, y) = r_\phi(x, y)\). The objective is to maximise expected reward:

\[\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot|x)}\left[r_\phi(x, y)\right]\]

Unconstrained, this objective is pathological: the policy will overoptimise the reward model, discovering inputs that produce high reward scores but degenerate text — repetitive, formulaic completions that exploit weaknesses in the reward model’s generalisation rather than genuinely satisfying human preferences. This is the classic reward hacking problem.

The standard fix is to add a KL divergence penalty that constrains the fine-tuned policy \(\pi_\theta\) to stay close to the original SFT policy \(\pi_\text{ref}\):

\[\max_\pi \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot|x)}\left[r_\phi(x, y) - \beta \, \text{KL}\!\left(\pi(\cdot|x) \;\big\|\; \pi_\text{ref}(\cdot|x)\right)\right]\]

The hyperparameter \(\beta > 0\) governs the trade-off between reward maximisation and staying close to the reference. When \(\beta = 0\), there is no constraint and reward hacking is unchecked. As \(\beta \to \infty\), the policy cannot move from the reference — alignment has no effect. In practice, \(\beta \in [0.01, 0.5]\) is calibrated on a validation set.

The KL term expands per-token:

\[\text{KL}(\pi(\cdot|x) \| \pi_\text{ref}(\cdot|x)) = \sum_t \sum_v \pi(v \mid x, y_{<t}) \log \frac{\pi(v \mid x, y_{<t})}{\pi_\text{ref}(v \mid x, y_{<t})}\]

summed over all tokens \(t\) in the generation and all vocabulary items \(v\).

PPO in language model fine-tuning

Proximal Policy Optimization (Schulman et al., 2017) optimises the KL-penalised objective by maintaining a value function \(V_\psi(x, y_{<t})\) that estimates the expected future reward from the current state, computing advantage estimates \(\hat{A}_t = r(x,y) - V_\psi(x, y_{<t})\), and updating the policy with a clipped surrogate objective that prevents large updates. Implementing PPO for language model fine-tuning requires at minimum four models simultaneously in GPU memory: the actor (policy being trained), the critic (value function), the reference model (frozen SFT checkpoint for KL), and the reward model. This is engineering-intensive and memory-intensive — PPO fine-tuning of a 7B model requires at least 2–4 A100 80GB GPUs.
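The clipped surrogate itself is compact. The sketch below evaluates it on a handful of made-up per-token log-probabilities and advantage estimates (\(\epsilon\), the values, and the shapes are all illustrative):

import numpy as np

eps = 0.2
logp_old = np.array([-1.2, -0.7, -2.1, -0.3])   # log pi_old(y_t | ...)
logp_new = np.array([-0.9, -0.9, -1.2, -0.2])   # log pi_theta(y_t | ...)
adv      = np.array([ 0.5, -0.3,  1.0,  0.1])   # advantage estimates A_t

ratio = np.exp(logp_new - logp_old)             # importance ratio per token
unclipped = ratio * adv
clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
surrogate = np.minimum(unclipped, clipped).mean()   # maximised in the update
print(f"ratios: {ratio.round(3)}  surrogate: {surrogate:.4f}")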

The HuggingFace trl library provides a PPOTrainer that wraps these details. In practice, PPO is what was used to align the original InstructGPT and early ChatGPT models. Its complexity motivated the development of simpler alternatives, the most elegant of which is DPO.

Why PPO is hard: the four-model problem

A practical appreciation of why DPO was such a welcome simplification requires understanding the engineering overhead of PPO-based RLHF. At each training step, the PPO loop must:

  1. Sample a batch of instructions \(x\) from the SFT dataset.
  2. Generate completions \(y\) using the current actor policy \(\pi_\theta\) — this requires a full autoregressive decoding pass.
  3. Score each completion with the reward model \(r_\phi(x, y)\) — a separate forward pass through a second model.
  4. Compute the KL penalty by evaluating \(\log \pi_\text{ref}(y|x)\) — a third forward pass through the frozen reference model.
  5. Estimate advantages using the value function \(V_\psi(x, y_{<t})\) — a fourth model, updated simultaneously.
  6. Clip and update the actor and critic using the PPO surrogate loss.

Steps 2–4 alone require three simultaneous forward passes. In practice, the actor and reference are kept in the same GPU memory with the reference weights frozen; the reward model and critic occupy separate memory. For a 7B model, this typically requires 4 × A100 80GB GPUs. Maintaining numerical stability requires careful tuning of the PPO clip ratio \(\epsilon\), the KL coefficient \(\beta\), the learning rate, and the number of generations per update step. The trl library wraps this complexity, but debugging a diverging PPO run — where the policy collapses to repetitive outputs or the reward model is being hacked — requires hands-on expertise with reinforcement learning.

The key contrast with DPO: DPO replaces all of steps 2–6 with a single supervised forward pass on pre-collected (chosen, rejected) pairs. There is no online sampling, no reward model, no value function, and no PPO hyperparameters to tune. The cost is that DPO operates offline — it cannot explore new completions during training. In practice, this is rarely a limitation for domain fine-tuning: the SFT model is already a strong generator, and the preference dataset can be collected once from the SFT model before the DPO run begins.


DPO: Direct Preference Optimization

The surprising closed-form solution

Rafailov et al. (2023) made a remarkable observation: the optimal policy under the KL-penalised RLHF objective has a closed-form expression in terms of the reward function. Specifically, the solution to:

\[\max_\pi \; \mathbb{E}_{y \sim \pi(\cdot|x)}\left[r(x, y) - \beta \log \frac{\pi(y|x)}{\pi_\text{ref}(y|x)}\right]\]

is:

\[\pi^*(y|x) = \frac{\pi_\text{ref}(y|x) \exp\!\left(\frac{1}{\beta} r(x,y)\right)}{Z(x)}\]

where \(Z(x) = \sum_{y'} \pi_\text{ref}(y'|x) \exp\!\left(\frac{1}{\beta} r(x,y')\right)\) is the normalising partition function. Inverting this relationship, the reward can be expressed in terms of the optimal policy:

\[r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_\text{ref}(y|x)} + \beta \log Z(x)\]

Now substitute this expression into the Bradley-Terry preference probability:

\[p(y_w \succ y_l \mid x) = \sigma\!\left(r(x, y_w) - r(x, y_l)\right) = \sigma\!\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_\text{ref}(y_l|x)}\right)\]

The partition functions \(Z(x)\) cancel. This means the preference probability depends only on policy log-ratios, not on the intractable partition function. We can therefore train a policy \(\pi_\theta\) directly by maximising the likelihood of the human preference labels under this model — no separate reward model required, no RL, no value function. The DPO loss is:

\[\mathcal{L}_\text{DPO}(\pi_\theta; \pi_\text{ref}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}\right)\right]\]

This is a binary cross-entropy loss over preference pairs. The model learns to assign relatively higher log-probability (compared to the reference) to preferred completions \(y_w\) than to rejected completions \(y_l\). No RL loop, no reward model, no PPO implementation required — just a dataset of (instruction, chosen, rejected) triples and a standard supervised training loop.

Live demo: DPO loss on a toy preference dataset

Before running the cell: the DPO loss function pushes \(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}\) to be positive and large. At convergence, what sign do you expect the average preference margin to take? And what initial value should the DPO loss start near if the policy begins identical to the reference?
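The numpy sketch below stands in for the missing cell: a linear “policy log-ratio” model trained on synthetic preference pairs. All dimensions, features, and hyperparameters are illustrative; in production the log-ratios come from the LLM’s sequence log-probabilities.

import numpy as np

rng = np.random.default_rng(0)
n, dim, beta, lr = 256, 8, 0.1, 10.0
u = rng.normal(size=dim)                         # hidden preference direction
F_a = rng.normal(size=(n, dim))
F_b = rng.normal(size=(n, dim))
pref = (F_a @ u) >= (F_b @ u)                    # synthetic rater labels
F_w = np.where(pref[:, None], F_a, F_b)          # chosen completion features
F_l = np.where(pref[:, None], F_b, F_a)          # rejected completion features

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
theta = np.zeros(dim)   # theta = 0 reproduces the reference exactly, so the
                        # log-ratios (F @ theta) start at zero for every pair
for step in range(401):
    margin = beta * ((F_w - F_l) @ theta)        # beta * (log-ratio difference)
    loss = -np.log(sigmoid(margin)).mean()
    if step % 100 == 0:
        print(f"step {step:3d}  DPO loss {loss:.4f}  mean margin {margin.mean():+.4f}")
    grad = -beta * ((1.0 - sigmoid(margin))[:, None] * (F_w - F_l)).mean(axis=0)
    theta -= lr * grad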

Interpretation. As predicted, the margin begins near zero (the policy and reference are identical at initialisation, so both log-ratios are zero) and converges to a positive value — the model has learned a preference ordering. The initial DPO loss of \(-\log(\sigma(0)) = \log 2 \approx 0.693\) confirms this. The loss drops monotonically as the policy learns to assign higher relative probability (relative to the reference) to chosen completions than to rejected ones. In the production setting, the “logits” would be the LLM’s log-probability of entire response sequences, and the gradient would be computed via backpropagation through the transformer. The mathematics is identical; only the scale differs.

DPO’s simplicity has made it the dominant alignment method for open-source models since 2024. The trl library provides a DPOTrainer that handles all bookkeeping; building a preference dataset and running DPO on a LoRA adapter requires roughly 200 lines of Python.

DPO variants and successors

DPO’s success prompted a wave of follow-on work that either tightened the theoretical analysis or addressed practical failure modes identified in deployment.

SimPO (Simple Preference Optimization, Meng et al., 2024) removes the reference model entirely. Instead of log-ratios against a frozen reference, SimPO uses the average log-probability of the completion as an implicit reward, normalised by sequence length to avoid length bias:

\[\mathcal{L}_\text{SimPO} = -\mathbb{E}\left[\log \sigma\!\left(\frac{\beta}{|y_w|} \log \pi_\theta(y_w|x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l|x) - \gamma\right)\right]\]

where \(\gamma > 0\) is a margin hyperparameter. Eliminating the reference model halves GPU memory during alignment training — important for large models — and reportedly improves instruction-following scores on AlpacaEval.

KTO (Kahneman-Tversky Optimization, Ethayarajh et al., 2024) draws on prospect theory to motivate an asymmetric loss that treats upweighting preferred completions and downweighting dispreferred completions with different coefficients, matching the empirical observation that humans are more sensitive to losses than equivalent gains. This is useful when the preference dataset is heavily imbalanced (many rejected, few chosen) — a common situation when constructing safety-relevant datasets.

IPO (Identity Preference Optimization, Azar et al., 2024) identifies a theoretical failure mode in DPO: when the policy becomes highly confident (logits very large), the sigmoid in the DPO loss saturates and the gradient vanishes, stalling training. IPO replaces the sigmoid with a squared loss on the preference margin:

\[\mathcal{L}_\text{IPO} = \mathbb{E}\!\left[\left(\log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)} - \frac{1}{2\beta}\right)^2\right]\]

In practice, the choice between DPO, SimPO, KTO, and IPO makes a small but measurable difference on specific benchmarks. The consensus as of 2026 is: use DPO as the baseline; switch to SimPO if you need to avoid storing a reference model; consider KTO if your dataset is heavily imbalanced.

In practice: Anthropic Constitutional AI

Anthropic’s Constitutional AI (CAI, Bai et al., 2022) uses a set of written principles — a “constitution” — to generate AI preference feedback without human raters at scale. In the RLAIF (RL from AI Feedback) variant, the model is shown a candidate completion and asked, for each constitutional principle, whether the completion is helpful, harmless, and honest. The model’s own critiques generate synthetic (chosen, rejected) pairs that are then used to train a reward model, which is then used for RLHF. Anthropic’s Claude family is trained with a combination of human feedback and constitutional self-critique. The practical significance for practitioners: the bottleneck in alignment is increasingly not human labelling — it is the quality and coverage of the preference dataset, which can now be largely synthetic.


Constitutional AI and RLAIF

Using AI to generate preference labels

The most expensive component of the RLHF pipeline is human preference labelling. InstructGPT required approximately 33,000 human-labelled comparisons to train its reward model. At a realistic labelling cost of $1–5 per comparison, this is a $33,000–165,000 data cost per training run, before accounting for quality control and inter-annotator disagreement. For organisations that want to align models on proprietary domains — legal document drafting, financial advisory, medical note summarisation — collecting tens of thousands of domain-expert preference labels is often simply not feasible.

RLAIF (Reinforcement Learning from AI Feedback), formalised by Bai et al. (2022) and extended by Lee et al. (2023), replaces human labellers with an AI annotator. A critique model — typically a stronger or separately aligned LLM — is prompted to evaluate a pair of completions according to a written rubric and output a preference label. These AI-generated labels are used exactly as human labels would be: to train a reward model or, in the DPO setting, as (chosen, rejected) pairs.

The quality of RLAIF depends on the alignment and capability of the critique model. A critique model that is itself poorly aligned will generate biased preference labels. But at sufficient scale and with careful rubric design, AI feedback has been shown to produce reward models competitive with human-feedback reward models on standard benchmarks (Lee et al., 2023). The practical takeaway: for many domain-adaptation use cases, a well-prompted GPT-4 or Claude can generate the preference dataset needed to align a smaller open-weight model, dramatically reducing the data collection bottleneck.

The constitutional approach also introduces a natural mechanism for value specification: instead of relying on human raters’ implicit preferences (which are noisy and variable across raters), the alignment team writes explicit principles. This makes the alignment process auditable — a regulator or ethics board can inspect the constitution and evaluate whether the principles are adequate, in a way that inspecting 40,000 individual human preference labels is not. This auditability property is likely to become increasingly important as AI deployments face formal regulatory scrutiny under the EU AI Act and similar frameworks.


Fine-Tuning Workflow in Practice

The six-step process

The following workflow covers production-grade LoRA fine-tuning of an instruction-following base model. Each step has specific decisions that affect the final model’s quality, cost, and deployability.

Step 1 — Define the task and assemble labelled data. Specify the exact input-output format the fine-tuned model will produce. For a tweet sentiment classifier, this means deciding: will the output be a single label, a label plus confidence, a label plus rationale? For a brand-voice generator, specify the exact prompt template and the expected response format. Assemble a training set of (instruction, response) pairs — ideally 1,000 to 10,000 examples for LoRA fine-tuning, though 500 can be sufficient for narrow tasks. Data quality dominates data quantity: 500 excellent examples outperform 5,000 noisy ones.

Step 2 — Pick a base model. For English-dominant tasks, LLaMA 3.1 8B Instruct (instruction-tuned) or Mistral 7B Instruct v0.3 are the default starting points. For multilingual or Chinese tasks, Qwen 3 8B or 14B. For extremely resource-constrained deployment (edge, mobile), Gemma 3 4B. Prefer an already-instruction-tuned base (the “-Instruct” or “-Chat” variants) over a raw base model unless you plan to run the full SFT + alignment pipeline.

Step 3 — Tokenize and pack. Tokenize the dataset using the model’s own tokenizer. Pack short examples into fixed-length chunks (typically 2,048 tokens) to maximise GPU utilisation. Use an attention mask to prevent cross-contamination between packed examples. Apply a data collator that shifts labels by one position for causal language modelling. The packing step is often the most underappreciated source of training efficiency gains: unpacked training on short examples leaves the GPU 70–80% idle between sequences; packing brings utilisation to 95%+.
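A pure-Python sketch of the packing step (the token IDs and EOS id are made up, and the cross-example attention mask described above is omitted for brevity):

def pack(tokenized_examples, chunk_len=2048, eos_id=2):
    """Concatenate token-id lists into fixed-length chunks, EOS-separated."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids + [eos_id])
    return [stream[i:i + chunk_len]
            for i in range(0, len(stream) - chunk_len + 1, chunk_len)]

print(pack([[5, 9, 7], [3, 4], [8, 8, 8, 1]], chunk_len=4))
# -> [[5, 9, 7, 2], [3, 4, 2, 8], [8, 8, 8, 1]]  (trailing remainder dropped)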

Step 4 — Choose LoRA hyperparameters. The key decisions are:

  • \(r\) (rank): 4, 8, 16, or 32. Higher rank captures more complex adaptations but uses more memory. For narrow tasks (single sentiment label), \(r = 4\) or \(r = 8\) is typically sufficient. For broad style transfer, \(r = 16\) or \(r = 32\).
  • \(\alpha\) (scaling): Commonly set to \(2r\) or \(r\). Acts as a learning rate multiplier for the adapter.
  • Target modules: Which weight matrices to adapt. Applying LoRA to q_proj, v_proj only (as in the original paper) is conservative. Applying to all linear projections (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj) maximises expressivity at moderate memory cost.
  • Dropout: 0.05–0.1 on the LoRA layers as regularisation.

Step 5 — Train with transformers.Trainer or axolotl. For standard supervised fine-tuning, use trl.SFTTrainer. For DPO alignment, use trl.DPOTrainer. Set gradient checkpointing to reduce activation memory. Use cosine learning rate scheduling with a warm-up of 3–5% of total steps. Monitor validation loss; stop when it plateaus or starts increasing.

Step 6 — Evaluate against a held-out benchmark. For classification tasks, report precision, recall, F1 by class, and a confusion matrix on a held-out test set. For generation tasks, run ROUGE and BLEU on a reference set, and if budget allows, use LLM-as-judge on a 100-sample random draw. Compare against the zero-shot baseline to quantify the improvement from fine-tuning.

The canonical SFT + DPO setup

This block does not run in your browser

This code requires trl>=0.9 (for SFTConfig and DPOConfig), peft>=0.7, transformers>=4.40, datasets, and torch. Run on Google Colab Pro (A100 runtime) or a machine with at least 24 GB VRAM.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, AutoPeftModelForCausalLM
from trl import SFTTrainer, SFTConfig, DPOTrainer, DPOConfig
import torch

model_id  = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# ── Stage 1: SFT ──────────────────────────────────────────────
sft_dataset = load_dataset("json", data_files="sft_data.jsonl", split="train")
# Expected format: {"instruction": "...", "response": "..."}
# SFTTrainer reads the column named by dataset_text_field ("text"), so render
# each example into a single string (this prompt template is illustrative):
def to_text(example):
    return {"text": f"### Instruction:\n{example['instruction']}\n\n"
                    f"### Response:\n{example['response']}"}
sft_dataset = sft_dataset.map(to_text)

lora_cfg = LoraConfig(
    r=8, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05, bias="none",
)

sft_args = SFTConfig(
    output_dir="./sft_output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    dataset_text_field="text",
    max_seq_length=2048,
)

sft_trainer = SFTTrainer(
    model=model_id,
    args=sft_args,
    train_dataset=sft_dataset,
    peft_config=lora_cfg,
    tokenizer=tokenizer,
)
sft_trainer.train()
sft_trainer.save_model("./sft_adapter")

# ── Stage 2: DPO alignment ────────────────────────────────────
dpo_dataset = load_dataset("json", data_files="dpo_data.jsonl", split="train")
# Expected format: {"prompt": "...", "chosen": "...", "rejected": "..."}

dpo_args = DPOConfig(
    output_dir="./dpo_output",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    learning_rate=5e-5,
    bf16=True,
    beta=0.1,                  # KL penalty coefficient
    max_length=2048,
    max_prompt_length=512,
)

dpo_model = AutoPeftModelForCausalLM.from_pretrained(
    "./sft_adapter",          # start from the SFT adapter checkpoint
    torch_dtype=torch.bfloat16,
)
dpo_trainer = DPOTrainer(
    model=dpo_model,
    ref_model=None,           # with a PEFT model, trl recovers the reference
                              # policy by disabling the adapter
    args=dpo_args,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()
dpo_trainer.save_model("./dpo_adapter")

Distillation and Small Models

Why size matters at inference time

A fine-tuned LLaMA 3 70B model may achieve excellent quality on a domain task, but it costs roughly $2–3 per million tokens to serve on a cloud A100 cluster, with p50 latency around 800 ms per response. For a production system handling a million user interactions per day at roughly a thousand tokens each, that is a $2,000–3,000 daily compute cost and latency incompatible with real-time applications. Knowledge distillation transfers the capabilities of a large, expensive teacher model into a smaller, cheaper student model.

Response-level distillation

The simplest form is response-level distillation (also called “imitation learning” or “offline distillation”): run the teacher on a large unlabelled prompt set, collect the teacher’s outputs, and use these as supervised training data for the student. This requires no access to the teacher’s internal weights or logits — only its text outputs. The resulting dataset can then be used to SFT a smaller model (3B, 7B) using exactly the pipeline described in the previous section.

Stanford’s Alpaca project (Taori et al., 2023) showed that a LLaMA 7B model fine-tuned on 52,000 instruction-following demonstrations generated by OpenAI’s text-davinci-003 substantially outperformed the raw base model on instruction-following benchmarks. The community has since produced dozens of datasets distilled from GPT-4, Claude, and Gemini that are freely available on HuggingFace for training smaller open-weight students.

Token-level distillation

More precise is token-level cross-entropy distillation: the student is trained to minimise the KL divergence between its own next-token distribution and the teacher’s:

\[\mathcal{L}_\text{distill} = \sum_t \text{KL}\!\left(p_\text{teacher}(\cdot \mid x, y_{<t}) \;\big\|\; p_\text{student}(\cdot \mid x, y_{<t})\right)\]

This requires access to the teacher’s logits at each decoding step — possible when both models are available locally (e.g., a 70B teacher and a 7B student running side by side), but not when the teacher is a closed-API model. Hinton et al.’s (2015) original distillation paper used a temperature \(T > 1\) to soften the teacher’s probability distribution before taking the KL, amplifying signal from non-modal tokens:

\[p_\text{teacher}^T(v) = \frac{\exp(z_v / T)}{\sum_{v'} \exp(z_{v'} / T)}\]

At \(T = 1\), this recovers the standard softmax. At \(T > 1\), the distribution becomes flatter, revealing more of the teacher’s uncertainty about secondary candidates — a richer training signal for the student.

The combined distillation loss in practice often blends hard-label (ground truth) supervision with soft-label (teacher distribution) supervision:

\[\mathcal{L}_\text{combined} = (1 - \lambda) \mathcal{L}_\text{CE}(y_\text{true}) + \lambda T^2 \cdot \mathcal{L}_\text{distill}^T\]

where \(\lambda \in [0, 1]\) and the \(T^2\) factor compensates for the softening of gradients at high temperature. In practice, \(\lambda = 0.5\) and \(T = 2{-}4\) are typical starting points. The student trained with combined loss typically outperforms one trained on hard labels alone and one trained on soft labels alone.
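The numpy sketch below evaluates the combined loss at a single token position over a toy six-token vocabulary; the logits, the ground-truth label, and the \(T\), \(\lambda\) values are made-up numbers within the typical ranges above.

import numpy as np

def softmax(z):
    z = z - z.max()                # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

z_teacher = np.array([2.0, 1.1, 0.4, -0.5, -1.0, -2.0])
z_student = np.array([1.5, 0.2, 0.9, -0.8, -0.2, -1.5])
y_true = 0                         # index of the ground-truth token
T, lam = 3.0, 0.5

p_t = softmax(z_teacher / T)       # softened teacher distribution
p_s = softmax(z_student / T)       # softened student distribution
kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))     # KL(teacher || student)
ce = -np.log(softmax(z_student)[y_true])           # hard-label cross-entropy
loss = (1 - lam) * ce + lam * T**2 * kl            # combined distillation loss
print(f"CE {ce:.4f}  softened KL {kl:.4f}  combined {loss:.4f}")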

In practice: Stitch Fix recommender distillation

Stitch Fix, the personalised styling service, runs a pipeline where a fine-tuned GPT-4 class model generates detailed style rationales and outfit recommendations for each customer profile. These rationales are expensive to produce at scale. Their reported approach (from engineering blog posts, 2024) uses GPT-4 as a teacher to generate style rationales for 500,000 customer profiles; these are then used to fine-tune a 7B open-weight model that produces comparable rationales at 1/20th the inference cost. The 7B student runs on-premise, keeping customer preference data out of third-party APIs — a critical privacy constraint for a fashion subscription service. This is the canonical template for enterprise distillation: closed-model teacher for high-quality data generation, open-weight student for cost-efficient, privacy-preserving inference.


Quantization for Inference

Why inference quantization matters

Training quantization (QLoRA) reduces the memory needed to fit the model during gradient computation. Inference quantization addresses a different problem: throughput and cost at serving time. A 7B model in bf16 occupies 14 GB of GPU memory. In INT8, it occupies 7 GB; in INT4, approximately 3.5 GB. Halving the model’s memory footprint roughly doubles the batch size that can be processed in parallel, doubling throughput and halving per-query cost.

INT8 quantization maps the continuous range of a weight tensor to 256 discrete levels. For a weight tensor \(W\), the quantization-dequantization cycle is:

\[W_\text{int8} = \text{round}\!\left(\frac{W}{s}\right), \quad \tilde{W} = W_\text{int8} \cdot s\]

where \(s = \frac{\max|W|}{127}\) is the per-tensor scale factor. The quantization error is \(\epsilon = W - \tilde{W}\); its magnitude depends on the dynamic range of \(W\) relative to the granularity of the 256 bins.

The maximum quantization error for INT8 with per-tensor scaling is bounded by \(|\epsilon| \leq \frac{s}{2} = \frac{\max|W|}{254}\). For a typical transformer FFN weight matrix with \(\max|W| \approx 0.5\), this is approximately 0.002 — smaller than the numerical noise introduced by bf16 arithmetic in many contexts. INT4 with 16 bins has a maximum error of \(\frac{\max|W|}{14} \approx 0.036\) — roughly 18× larger, and the quality impact on downstream tasks becomes measurable.

Live demo: INT8 and INT4 quantization error and storage savings
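The numpy sketch below stands in for the demo cell: a per-tensor symmetric round-trip at 8 and 4 bits on a synthetic, roughly normal weight tensor (sizes and scales are illustrative).

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(1024, 1024)).astype(np.float32)

def roundtrip(W, n_bits):
    qmax = 2 ** (n_bits - 1) - 1                 # 127 for INT8, 7 for INT4
    s = np.abs(W).max() / qmax                   # per-tensor scale factor
    return np.clip(np.round(W / s), -qmax, qmax) * s

for bits in (8, 4):
    W_hat = roundtrip(W, bits)
    rel = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
    print(f"INT{bits}: relative error {rel:.5f}, storage {bits / 32:.4f}x of fp32")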

Interpretation. Per-tensor INT8 quantization introduces a small relative error (on the order of 1% for a roughly normal weight tensor) while halving storage relative to fp16. INT4 increases the error by roughly the 18× factor derived above while halving storage a second time — a total 8× reduction from fp32. For most inference tasks, INT8 degradation is imperceptible; INT4 introduces measurable but often acceptable quality loss, especially when combined with NF4’s normal-quantile binning. Production inference engines (vLLM, llama.cpp, TensorRT-LLM) implement quantization with per-channel or per-group scales, which achieve far lower error than the per-tensor scheme shown here.


Deployment: vLLM, Llama.cpp, Ollama, MLX

The inference runtime landscape

A fine-tuned model weight file is a static artifact. The inference runtime is the software layer that loads those weights, manages memory, batches requests, and executes the forward pass efficiently. The runtime choice governs latency, throughput, hardware requirements, and deployment complexity. As of 2026, four runtimes cover the vast majority of practical deployment scenarios.

vLLM (UC Berkeley, 2023) is the dominant choice for high-throughput cloud serving. Its key innovation is PagedAttention: the KV (key-value) cache that accumulates during autoregressive generation is managed with a paging system analogous to virtual memory in operating systems. Instead of pre-allocating a contiguous block of GPU memory for the maximum sequence length (most of which sits unused), vLLM allocates KV cache memory in small non-contiguous pages and maps them as needed. This reduces KV cache fragmentation by ~70%, enabling 2–4× higher throughput than naive HuggingFace generation at the same hardware cost. vLLM supports tensor parallelism across multiple GPUs and serves an OpenAI-compatible REST API, making migration from the OpenAI API to a self-hosted model a matter of changing the endpoint URL.
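As a hedged sketch of that migration (the model name, port, and launch command are assumptions, and this requires a running vLLM server rather than the browser):

# Launch the server first, e.g.:
#   vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarise this week's brand mentions."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)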

llama.cpp (Georgi Gerganov, 2023) is the standard for CPU inference and quantized edge deployment. Written in pure C++ with no Python dependency, llama.cpp implements GGUF format models (a compact quantization container format) and can run a 7B INT4 model on a MacBook Pro M3 at 20–30 tokens per second — sufficient for interactive use. It supports GPU offloading (split the model across CPU RAM and GPU VRAM) and runs on Linux, macOS, and Windows. For deployments where cloud API costs are prohibitive and the application can tolerate 2–5 second response latency, llama.cpp on a well-specced laptop or mini server is a viable production setup.

Ollama is a developer-friendly wrapper around llama.cpp that simplifies model management: ollama pull llama3.1 downloads a GGUF model and serves it locally on a REST API in one command. Ollama is the recommended entry point for prototyping and for non-production deployments where operational simplicity outweighs throughput optimisation.
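A minimal sketch of calling the local Ollama endpoint from Python — the default port 11434 and the /api/generate schema are Ollama’s documented REST API; the model tag and prompt are illustrative:

import json, urllib.request

# Assumes `ollama pull llama3.1` has been run and the server is listening.
payload = {"model": "llama3.1",
           "prompt": "Draft a friendly product-launch tweet.",
           "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as r:
    print(json.loads(r.read())["response"])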

MLX (Apple, 2023) is a framework for on-device inference on Apple Silicon (M-series Macs and, prospectively, iPhone/iPad chips). MLX exploits the unified memory architecture of Apple Silicon — CPU and GPU share the same physical RAM — allowing large models to run without explicit memory transfers between CPU and GPU. A 7B model in INT4 runs at 35–50 tokens per second on an M3 Max MacBook Pro. For privacy-sensitive applications (healthcare, legal, finance) where data cannot leave the device, MLX-based deployment is increasingly the preferred architecture.
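A sketch via the mlx-lm package (pip install mlx-lm; Apple Silicon only; the quantized community checkpoint name is illustrative, and the generate signature reflects the package as of this writing):

from mlx_lm import load, generate

# Load a 4-bit checkpoint directly into unified memory (name is illustrative)
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
text = generate(model, tokenizer,
                prompt="Draft one tweet about our spring launch.",
                max_tokens=80)
print(text)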

The table below summarises the trade-offs:

Runtime   | Primary use case          | Throughput (tok/s, 7B) | Latency profile                 | Hardware
vLLM      | Cloud serving, multi-user | 500–2,000 (batch)      | Low per-token, high first-token | GPU (A10G, A100, H100)
llama.cpp | Edge, laptop, CPU         | 20–80 (single user)    | Moderate, consistent            | CPU / partial GPU
Ollama    | Local dev, prototyping    | 20–60                  | Moderate                        | CPU / GPU
MLX       | Apple Silicon, on-device  | 35–80                  | Low, consistent                 | Apple M-series

In practice: Bumble brand-voice systems

Bumble, the dating and social networking app, uses a tiered inference architecture for its in-app AI features as of 2025. Real-time features (opening line suggestions, profile review, live chat assistance) use quantized 7B models served via vLLM on cloud GPUs, targeting a p95 latency budget of 600 ms. For features that run asynchronously (weekly compatibility analysis, profile improvement reports), a 70B model is called via API, where latency constraints are relaxed to several seconds. A subset of features deployed in markets with strong data-sovereignty requirements (notably Germany and South Korea) runs Gemma 3 4B in INT8 via llama.cpp on edge servers co-located with regional data centres, avoiding cross-border data transfers entirely. This three-tier architecture — real-time GPU serving, async large-model API, privacy-sovereign edge inference — is increasingly the template for production AI deployments in regulated consumer markets.


Mini Case Study — Brand-Voice Generation

The business problem

A consumer brand has accumulated 5,000 high-engagement tweets spanning three years of its social media presence — a carefully curated corpus that reflects the brand’s voice: irreverent, warm, culturally aware, with a specific pattern of hashtag usage and emoji deployment. The social media team wants a model that produces new tweet drafts in this exact voice, given a topic brief. The brief might be: “Spring product launch, target audience 18–28 in Southeast Asia, emphasis on sustainability.”

The naive approach — prompting GPT-4 with “write a tweet in our brand voice” plus a few examples — produces recognisably generic output. The brand-voice idiosyncrasies that make the tweets distinctive (specific vocabulary, sentence rhythm, the consistent use of exactly two hashtags, the particular type of cultural reference) are not reliably captured in a handful of in-context examples. This is precisely the regime where fine-tuning outperforms prompting.

The business case is also straightforward. A social media manager at this brand produces approximately 15 tweets per week for approval. If a LoRA fine-tuned model generates three on-brand drafts per brief and the manager selects and edits one, the labour per tweet drops from 45 minutes (research + draft + revise) to approximately 12 minutes (brief + review + minor edit). For a team of five social media managers, that saving of 33 minutes per tweet recaptures roughly 2,000 hours of creative capacity per year — time that can be redirected to strategy, community management, and campaign planning.

Step-by-step plan

Step 1 — Data preparation. The 5,000 tweets are split into training (4,200), validation (500), and test (300) sets. Each training example is formatted as an instruction-response pair:

Instruction: Write a tweet for the spring product launch.
             Target: 18-28, Southeast Asia. Theme: sustainability.
Response: [the actual high-performing tweet]

Tweets with fewer than 30 characters or more than 250 are filtered out. Replies and retweets are excluded. The result is approximately 3,800 usable training pairs.
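A sketch of this preparation step. The field names follow the common instruction/response JSONL convention, and the one-element corpus is illustrative; the real list comes from the brand's archive.

import json
import random

corpus = [  # illustrative (tweet, brief) pairs; the real list has ~5,000 entries
    ("Just launched! Our new sustainable line is here 🌱 #EcoFirst #SS26",
     "Spring product launch. Target: 18-28, Southeast Asia. Theme: sustainability."),
]

def keep(tweet: str) -> bool:
    """Apply the filters from the text: length bounds, no replies, no retweets."""
    return (30 <= len(tweet) <= 250
            and not tweet.startswith("@")       # drop replies
            and not tweet.startswith("RT "))    # drop retweets

pairs = [
    {"instruction": f"Write a tweet for the following brief:\n{brief}",
     "response": tweet}
    for tweet, brief in corpus if keep(tweet)
]
random.seed(42)
random.shuffle(pairs)
with open("brand_voice_sft.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p, ensure_ascii=False) + "\n")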

Step 2 — Base model selection. LLaMA 3.1 8B Instruct is selected. It is instruction-tuned (so the SFT phase can be lightweight — we are adapting style, not teaching instruction-following from scratch), open-weight (no per-query API cost), and small enough to serve via vLLM on a single A10G GPU (24 GB VRAM) at reasonable cost.

Step 3 — LoRA configuration. Given the narrow task (style adaptation, not knowledge acquisition), \(r = 8\) and \(\alpha = 16\). LoRA is applied to q_proj and v_proj only, the Hu et al. default that suffices for style adaptation. Training runs for 3 epochs with a peak learning rate of \(2 \times 10^{-4}\) and cosine decay.
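In peft, this configuration is a few lines. A sketch (the dropout value is an assumption — a common default not stated in the text):

from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                   # adapter rank
    lora_alpha=16,                         # scaling: alpha / r = 2
    target_modules=["q_proj", "v_proj"],   # query and value projections only
    lora_dropout=0.05,                     # assumption: common default
    task_type="CAUSAL_LM",
)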

Step 4 — Training. QLoRA (NF4, double quantization) is used so the run fits on a single A10G. Training on 3,800 examples for 3 epochs takes approximately 25 minutes. Total cost on an AWS spot instance: approximately $3.
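The corresponding model-loading sketch with bitsandbytes (does not run in your browser; requires a CUDA GPU):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-quantile 4-bit bins
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)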

Step 5 — Evaluation. Three evaluation axes:

  • BLEU-4 and ROUGE-L against a reference set of 300 held-out tweets (automated).
  • Brand-style rubric: a checklist of 8 observable brand conventions (hashtag count, presence of cultural reference, sentence structure, specific vocabulary patterns) scored by a human reviewer on 50 random outputs.
  • A/B engagement test: 20 AI-generated tweets posted alongside 20 human-written tweets; engagement rate (likes + reposts per impression) compared after 48 hours.

Step 5 is the decisive gate. Automated metrics can improve while qualitative brand fidelity degrades; the rubric and the engagement test catch this failure mode.

The production code pattern

This block does not run in your browser

The following snippet shows the inference path for the fine-tuned brand-voice model. It requires transformers, peft, and a CUDA-capable GPU. Run on Google Colab or a cloud GPU instance.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import PeftModel
import torch

base_model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
adapter_path   = "./brand_voice_lora_adapter"

# ── Load base model + adapter ──────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()   # merge LoRA into base weights for faster inference
model.eval()

# ── Inference ─────────────────────────────────────────────────
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=80,
    do_sample=True,
    temperature=0.85,
    top_p=0.92,
    repetition_penalty=1.1,
)

brief = "Spring product launch. Target: 18-28, Southeast Asia. Theme: sustainability."
prompt = f"Write a tweet for the following brief:\n{brief}\n\nTweet:"

output = generator(prompt, num_return_sequences=3)
for i, o in enumerate(output):
    generated = o['generated_text'].split("Tweet:")[-1].strip()
    print(f"\n--- Draft {i+1} ---")
    print(generated)

The merge_and_unload() call folds the LoRA correction into the base weights, eliminating any inference-time overhead. The fine-tuned model can then be served exactly like any standard HuggingFace model — with vLLM, Ollama, or llama.cpp.


Evaluation: From Benchmarks to Production Metrics

Academic benchmarks

The model evaluation literature has developed several standardised benchmarks that allow direct comparison across model families. Understanding what these benchmarks measure — and what they miss — is essential for interpreting vendor claims.

MMLU (Massive Multitask Language Understanding, Hendrycks et al., 2020) tests factual knowledge across 57 academic subjects (law, medicine, mathematics, history) via multiple-choice questions. A 70B model with domain fine-tuning typically scores 80–88% on MMLU. MMLU measures breadth of knowledge but not instruction-following, generation quality, or task adaptability. A model that scores 88% on MMLU may still produce poorly formatted outputs, hallucinate on domain-specific queries, or fail at multi-turn dialogue.

MT-Bench (Zheng et al., 2023) evaluates instruction-following and multi-turn reasoning across 80 challenging questions in eight categories (coding, math, writing, roleplay, extraction, reasoning, STEM, humanities). Each question has a follow-up; the model must maintain consistency across turns. MT-Bench uses GPT-4 as the judge, rating each response on a 1–10 scale. It measures capabilities that MMLU does not, but it is expensive to run (GPT-4 API calls) and subject to the judge’s biases.

AlpacaEval (Li et al., 2023) measures the fraction of model outputs that GPT-4 prefers over a reference model (currently GPT-4 Turbo). The metric is “win rate against the reference.” AlpacaEval is strongly correlated with human preferences on instruction-following tasks but, like MT-Bench, inherits the judge model’s biases.

The LLM-as-judge problem

LLM-as-judge evaluations — where a powerful model rates the outputs of a smaller one — have become the operational standard because they scale where human evaluation does not. But they carry known failure modes: length bias (longer responses are preferred even when they are less accurate); positional bias (the first response in a pairwise comparison is favoured); and sycophancy (the judge prefers responses that match its own style and knowledge, penalising correct but unconventionally phrased answers). Mitigation strategies include: randomising presentation order and checking for consistency, asking the judge to evaluate specific dimensions (accuracy, relevance, format) separately rather than as a holistic score, and calibrating the judge’s ratings against a human-labelled anchor set.
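The order-randomisation check is simple to implement. A sketch — call_judge is a hypothetical wrapper around whatever LLM API you use (it takes a prompt string and returns "A" or "B"), not a library function:

from typing import Optional

def judge_pair(instruction: str, y1: str, y2: str, call_judge) -> Optional[str]:
    """Query the judge in both presentation orders; keep only consistent verdicts."""
    template = ("Which response better satisfies the rubric?\n"
                "Instruction: {inst}\nResponse A: {a}\nResponse B: {b}\n"
                "Reply with only 'A' or 'B'.")
    first = call_judge(template.format(inst=instruction, a=y1, b=y2))
    second = call_judge(template.format(inst=instruction, a=y2, b=y1))
    if first == "A" and second == "B":
        return "y1"   # judged better in both presentation orders
    if first == "B" and second == "A":
        return "y2"
    return None       # inconsistent verdict: discard or route to a human rater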

The deeper problem with LLM-as-judge is circularity: you are using a frontier model to evaluate a model trained to imitate frontier models. This means that stylistic choices that differ from the judge’s preferences — but may genuinely be better for the task — are systematically penalised. For brand-voice generation, where the correct output is idiosyncratic by design, LLM-as-judge is particularly unreliable. Use it only as a sanity check; rely on the brand-style rubric and engagement data for the authoritative evaluation.

Production metrics

For a deployed brand-analytics or content-generation system, the academic benchmarks are largely irrelevant. The metrics that matter in production are:

Cost per output token. For cloud API deployments, this is the direct invoice. For self-hosted deployments, it translates to GPU-hours per million tokens, which converts to a dollar figure given the instance cost. Benchmark your system’s cost per 1,000 words of useful output, and track it monthly — model providers routinely cut prices.

p95 latency. The 95th percentile of response latency measured across actual production traffic. Averages hide tail latency; the user experience is dominated by the worst 5%. For real-time applications, p95 must be within the interactive response budget (typically 1–3 seconds for a chat interface). Monitor separately for prompt lengths at the 10th, 50th, and 90th percentiles — latency scales non-linearly with context length for some architectures.

Hallucination rate on domain data. For fact-sensitive applications, manually audit a random sample of outputs monthly. Define hallucination operationally: a claim that cannot be verified in the source documents provided in the prompt or retrieved context. Track the rate over time; it tends to drift as the model is updated or as the distribution of inputs shifts.

Customer task success rate. The ultimate downstream metric: does the model’s output accomplish the task the user needed? For a brand-voice generator, this might be the fraction of AI-generated drafts that reach the social media calendar without requiring substantial editing. For a customer service bot, it is the fraction of tickets resolved without escalation. This metric is hard to measure automatically and requires periodic human audit, but it is the only metric that directly captures business value.

In practice: MT-Bench and its biases

When Mistral AI released Mistral 7B in September 2023, it reported MT-Bench scores that placed the 7B model above LLaMA 2 13B and competitive with LLaMA 2 34B — a remarkable result given the 2–5× size difference. Subsequent analysis showed that part of the gap was attributable to MT-Bench’s length bias: Mistral’s outputs tended to be longer and more elaborately structured, which the GPT-4 judge rewarded even when the content quality was comparable. The lesson is not that Mistral’s results were fraudulent — the model is genuinely strong — but that MT-Bench score differences below roughly 0.5 points (on a 1–10 scale) are not reliable indicators of quality differences. Always evaluate on your own task distribution before making deployment decisions based on published benchmark tables.


Closing — The 5–10 Year Outlook

The trajectory from Chapter 1 (TF-IDF, LDA, supervised classifiers) to this chapter is not merely a succession of better tools. It is a structural change in how text analysis is done and who can do it. In 2018, building a sentiment classifier for a new domain required labelled data, feature engineering expertise, and a machine learning background. In 2026, it requires a dataset of a few hundred labelled examples, access to a cloud GPU for a few hours, and familiarity with the LoRA + DPO workflow described in this chapter. By 2030, the extrapolation of current trends suggests the following.

Foundation models replace task-specific classifiers wholesale. The twenty-classifier architecture that was standard in enterprise NLP — one model per task, each trained independently, each breaking when the data distribution shifts — is being retired. A single fine-tuned 7B model, updated quarterly, handles sentiment, entity extraction, topic classification, crisis detection, and brand-voice generation from a unified architecture. Maintenance cost collapses; task generalisation improves automatically as the base model is upgraded.

Agents orchestrate fine-tuned models for end-to-end workflows. The agent architectures surveyed in earlier chapters — LLMs that use tools, plan multi-step processes, and produce structured outputs — will increasingly orchestrate networks of specialised fine-tuned models. An analytics agent might route incoming social media data to a multilingual sentiment model, then to a named-entity extractor, then to a topic classifier, then synthesise the results into an executive report — without human intervention. The analyst’s role shifts from implementing each of these models to designing the evaluation framework that certifies each component is working correctly.

Alignment shifts from RLHF to DPO-family and synthetic feedback. The PPO-based RLHF pipeline that aligned GPT-3 into InstructGPT was a research-scale operation requiring significant human labour for preference labelling. DPO and its successors (SimPO, KTO, IPO) have made the alignment step accessible with small preference datasets and standard fine-tuning infrastructure. Simultaneously, synthetic preference data — generated by constitutional AI and RLAIF pipelines — is reducing the bottleneck on human labelling. By 2028, aligning a domain-specific model to follow safe, helpful, domain-appropriate instructions will be a routine step in any fine-tuning workflow, not a research project.

The cost of intelligence drops approximately 10× every two years. From 2020 to 2026, the inference cost of GPT-3 quality fell roughly 100-fold; the cost of GPT-4 quality fell roughly 30-fold from 2023 to 2025. At this pace, classifying a million social media posts with a frontier-quality model will cost less than $1 by 2028. The strategic implication: competitive advantage will not accrue to organisations that can afford to run large models — everyone will be able to afford them. Advantage will accrue to organisations that have accumulated high-quality domain data, that have invested in evaluation frameworks capable of certifying model quality, and that have the operational discipline to systematically curate training datasets and preference pairs as their AI systems encounter the real world.

The analyst’s role shifts to data curation and eval design. Feature engineering — constructing TF-IDF matrices, tuning LDA topic counts, hand-crafting sentiment lexicons — required deep domain knowledge and statistical expertise. That work is now largely automated. The skills that are scarce and durable are: defining what “good output” means for a specific business context, building evaluation datasets that can detect quality degradation before it reaches users, curating training data that covers the long tail of edge cases, and designing preference pairs that encode the organisation’s values. These are, at their core, analytical and institutional knowledge tasks, not engineering tasks. The text analytics practitioner of 2030 is less a programmer and more a curator, evaluator, and domain expert — someone who knows enough about the models to direct them and enough about the business to judge their outputs.

The chapters of this book have traced a progression from sparse counts to dense embeddings, from bag-of-words to transformer attention, from hand-crafted sentiment lexicons to learned preference alignment. Each step widened the class of questions that could be asked of social text data, and each step shifted the analyst’s responsibility from building the mechanism to directing it. Foundation model adaptation is the current frontier of that progression — not its end point. The next decade will see this trajectory continue: more capable base models, cheaper adaptation, richer alignment, and increasingly autonomous agents that close the loop between observation and action in social media analytics without human intervention at each step.


Social Media Text: Pre-Processing Before Fine-Tuning

Why social text is different

Social media text is the most morphologically irregular corpus that any NLP system will encounter in practice. A tweet written in 2023 might contain: a URL, an @mention, three emoji, a branded hashtag, a retweet prefix, an abbreviation from internet slang, a single letter substituting for a word, and a sarcastic exclamation point at the end. Each of these features interacts with fine-tuning in specific ways that must be handled deliberately.

URLs in tweets contain no semantic content for sentiment or topic models, but they do consume tokens. A URL like https://t.co/xKb2mVq9Rp tokenizes to roughly 15–25 tokens — wasted context window on every example. Replace all URLs with a special token [URL] or remove them entirely, depending on whether URL presence is a relevant signal for your task (it often is for bot detection and spam filtering, less so for brand sentiment).

@mentions present a privacy and generalisation trade-off. Keeping them in the training data trains the model on a specific distribution of mentioned accounts; a model trained on data mentioning @HKUST will perform differently on data mentioning @NTU. For most downstream tasks, replace mentions with [USER]. Exception: when the identity of the mentioned brand is the classification signal (e.g., distinguishing complaints about @united from complaints about @delta), keep the entity and treat it as a feature.

Hashtags carry dense semantic content. #MeToo, #BlackLivesMatter, #WWDC2024 are not decoration; they signal topic, community, and temporality. The naive approach of stripping the # character and treating the hashtag as a regular word is usually correct. The segmentation problem — #AppleLaunchEvent must be read as “Apple Launch Event” not “appleLaunchEvent” — is handled automatically by most subword tokenisers because the camel-case boundaries produce recognisable subwords.

Emoji are natively supported by modern tokenisers (LLaMA, Mistral, Qwen all include emoji in their vocabulary). For sentiment analysis, emoji carry strong polarity signal: keep them. For tasks where emoji are irrelevant noise (named-entity recognition, intent classification in customer service), removing them reduces sequence length without hurting accuracy.

Retweet prefixes (RT @user:) indicate that the tweet is a repost, not an original statement. For sentiment classification, a retweet of a negative opinion is not necessarily an endorsement of that opinion — the retweeter may disagree. If your task requires original authorial stance, filter out retweets at the data collection stage.

The pre-processing pipeline in practice

The following sequence is the standard production pipeline for cleaning social media text before tokenizing for fine-tuning. The code is non-executing because it depends on the emoji package, which is not in the Pyodide package set; copy it to a Colab notebook.

This block does not run in your browser

Run in Google Colab after pip install emoji. This function is used in the brand-voice generation case study’s data preparation step.

import re
import emoji

def clean_tweet(text: str,
                remove_urls: bool = True,
                replace_mentions: bool = True,
                remove_rt_prefix: bool = True,
                remove_emoji: bool = False) -> str:
    """
    Clean a tweet for fine-tuning.

    Parameters
    ----------
    text : raw tweet string
    remove_urls : replace t.co URLs with [URL] token
    replace_mentions : replace @user with [USER] token
    remove_rt_prefix : strip 'RT @user: ' retweet prefix
    remove_emoji : if True, strip emoji; if False, keep them (recommended for sentiment)

    Returns
    -------
    Cleaned tweet string.
    """
    if remove_rt_prefix:
        text = re.sub(r'^RT @\w+:\s*', '', text)

    if remove_urls:
        # Match http/https URLs including t.co shortened links
        text = re.sub(r'https?://\S+', '[URL]', text)

    if replace_mentions:
        text = re.sub(r'@\w+', '[USER]', text)

    if remove_emoji:
        text = emoji.replace_emoji(text, replace='')

    # Normalise whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Collapse repeated punctuation (e.g. '!!!' -> '!')
    text = re.sub(r'([!?.])\1+', r'\1', text)

    return text

# ── Example ───────────────────────────────────────────────────
raw = "RT @brandX: Just launched! Check out our new sustainable line 🌱 #EcoFirst https://t.co/abc123 #SS26 @fashionista"
print("Raw:    ", raw)
print("Cleaned:", clean_tweet(raw))
# Output:
# Cleaned: Just launched! Check out our new sustainable line 🌱 #EcoFirst [URL] #SS26 [USER]

The cleaned text is then fed into the tokenizer at Step 3 of the fine-tuning workflow. A batch quality check — printing the first five tokenized examples and their token counts — is strongly recommended before starting training: it confirms that the template formatting is correct, the special tokens ([BOS], [EOS], system prompt wrapper) are in the expected positions, and that no example exceeds the maximum sequence length.
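A sketch of that batch quality check, continuing from the cleaning block above (does not run in your browser; the tokenizer name and the 512-token limit are illustrative assumptions):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
examples = [clean_tweet(raw)]   # in practice: the first few formatted training strings

MAX_LEN = 512                   # assumption: the run's maximum sequence length
for text in examples[:5]:
    ids = tokenizer(text)["input_ids"]
    print(f"{len(ids):4d} tokens | {tokenizer.convert_ids_to_tokens(ids)[:12]} ...")
    assert len(ids) <= MAX_LEN, "example exceeds the maximum sequence length"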


Appendix: Singular Value Decomposition and the Low-Rank Hypothesis

Why weight updates are approximately low-rank

The claim that fine-tuning updates have low intrinsic rank is empirical, but it has a theoretical motivation rooted in the structure of gradient updates in over-parameterised networks. Understanding it gives the practitioner intuition about when LoRA will work well and when increasing the rank is necessary.

Recall that a gradient descent step on a weight matrix \(W\) has the form:

\[\Delta W = -\eta \, \frac{\partial \mathcal{L}}{\partial W}\]

where \(\mathcal{L}\) is the loss and \(\eta\) is the learning rate. The gradient \(\frac{\partial \mathcal{L}}{\partial W}\) is a sum of outer products over the batch:

\[\frac{\partial \mathcal{L}}{\partial W} = \frac{1}{B} \sum_{i=1}^B \frac{\partial \ell_i}{\partial W} = \frac{1}{B} \sum_{i=1}^B \delta_i h_i^\top\]

where \(\delta_i \in \mathbb{R}^d\) is the backpropagated error signal and \(h_i \in \mathbb{R}^d\) is the input activation for the \(i\)-th example. Each term \(\delta_i h_i^\top\) is a rank-1 matrix; the sum of \(B\) rank-1 matrices has rank at most \(\min(B, d)\). For a typical fine-tuning step with batch size \(B = 4{-}16\) and \(d = 4096\), each gradient step produces a matrix of rank at most 16 — far below the full rank of 4096. Accumulated over many steps, the total delta \(\Delta W = \sum_t \Delta W^{(t)}\) has at most \(T \times B\) nonzero singular values, but because many updates are correlated (the model is learning a coherent task), the effective rank tends to be much lower in practice. This is the geometric underpinning of Hu et al.’s empirical finding.

SVD of the weight update: a visualisation

The cell below constructs a rank-4 “true” weight update (as in the LoRA demo), runs 100 steps of gradient descent using batches of size 4, and visualises the singular value spectrum of the accumulated delta. The spectrum confirms that only a few singular values are large; the rest are negligible.
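A pure-numpy version of that cell, with the dimension reduced to \(d = 64\) so it runs in seconds; the learning rate is an illustrative choice.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
d, r_true, B, steps, lr = 64, 4, 4, 100, 0.02

W0 = rng.normal(0, 0.1, (d, d))                       # "pretrained" weights
delta_true = (rng.normal(0, 1, (d, r_true)) @
              rng.normal(0, 1, (r_true, d))) / d      # rank-4 "true" update
W_target = W0 + delta_true

W = W0.copy()
for _ in range(steps):
    X = rng.normal(0, 1, (B, d))                      # batch of input activations
    err = (W - W_target) @ X.T                        # error signal per example
    grad = err @ X / B                                # sum of B rank-1 outer products
    W -= lr * grad

sv = np.linalg.svd(W - W0, compute_uv=False)          # spectrum of accumulated delta
energy = np.cumsum(sv**2) / np.sum(sv**2)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.2))
ax1.semilogy(np.arange(1, d + 1), sv, "o-")
ax1.set(title="Singular values of ΔW (log scale)", xlabel="index")
ax2.plot(np.arange(1, d + 1), energy, "o-")
ax2.axhline(0.95, ls="--", color="grey")
ax2.set(title="Cumulative spectral energy", xlabel="number of singular values")
plt.tight_layout(); plt.show()
print("Rank for 95% energy:", int(np.argmax(energy >= 0.95)) + 1)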

Interpretation. The log-scale plot on the left reveals the characteristic “elbow” of a low-rank matrix: the first four singular values are large and grow with training; beyond index 4, values drop sharply toward numerical noise. The right panel confirms that four singular values capture over 95% of the total spectral energy — consistent with the true rank of the update being four. In a real fine-tuning run on a 4096-dimensional transformer weight matrix, Hu et al. observed similarly steep decay: the first 4–8 singular values account for the bulk of the update, justifying LoRA’s choice of small \(r\).

This analysis also suggests a practical diagnostic: after full fine-tuning on a validation task, compute the SVD of each weight delta and plot the cumulative spectral energy. The rank at which you capture 95% energy is a good estimate of the LoRA rank needed for that task. If this rank is above 32, reconsider whether LoRA is the right tool — a task requiring genuinely high-rank updates may benefit from adapter layers or full fine-tuning.


Appendix: Building a Preference Dataset for DPO

From raw data to (prompt, chosen, rejected) triples

The DPO workflow requires a dataset of triples \((x, y_w, y_l)\) where \(x\) is an instruction, \(y_w\) is a preferred completion, and \(y_l\) is a rejected completion. Constructing this dataset is the most operationally intensive step of the alignment pipeline. Three approaches are common.

Approach 1 — Human ranking. Collect \(k\) completions from the SFT model for each instruction. Show each pair to a human rater and ask: which completion is better according to your rubric? The rubric must be specific: for a brand-voice generator, it might be “which tweet better matches the brand’s tone, hashtag convention, and cultural references?” For a customer service bot, it might be “which response is more accurate, more concise, and more polite?” Human ranking is expensive but produces the highest-quality labels.

Approach 2 — LLM annotation (RLAIF). Replace human raters with a powerful LLM judge. For each pair \((y_1, y_2)\) given instruction \(x\), prompt GPT-4 or Claude: “Which of the following responses better satisfies [rubric]? Response A: [y_1]. Response B: [y_2]. Reply with only ‘A’ or ‘B’.” Use majority voting over three independent calls to reduce variance. The resulting (chosen, rejected) labels are noisier than human labels but can be collected at scale for effectively zero marginal cost once the rubric is defined.

Approach 3 — Rule-based construction. For tasks where quality has an objective dimension, construct preference pairs automatically. For a summarisation task: \(y_w\) = the extractive summary (high ROUGE, factually grounded); \(y_l\) = a hallucinated paraphrase (low ROUGE, factually incorrect). For a code generation task: \(y_w\) = code that passes all unit tests; \(y_l\) = code that fails one or more. Rule-based construction is cheap and scalable but may not capture the subjective quality dimensions (style, tone, voice) that matter most for brand applications.

In practice, production DPO datasets are hybrid: rule-based construction for the easily measurable dimensions, LLM annotation for the subjective dimensions, and a small human-annotated validation set to catch systematic errors in the LLM judge.
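Whatever the mix of approaches, the output format is the same. A sketch of the triples file and of loading it with the datasets library, using the column names trl’s DPOTrainer expects; the example contents are illustrative, and this block does not run in your browser:

import json
from datasets import load_dataset

triples = [
    {
        "prompt": "Write a tweet for the spring product launch. Theme: sustainability.",
        "chosen": "Spring looks good on you 🌱 New line, 100% recycled fibres. #EcoFirst #SS26",
        "rejected": "We are pleased to announce the launch of our new sustainable product line.",
    },
    # ... one dict per preference pair
]
with open("preferences.jsonl", "w") as f:
    for t in triples:
        f.write(json.dumps(t, ensure_ascii=False) + "\n")

dpo_dataset = load_dataset("json", data_files="preferences.jsonl", split="train")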

Dataset size and quality guidelines

The DPO loss is stable with as few as 500–1,000 high-quality preference pairs for narrow tasks. For broad instruction-following, 5,000–20,000 pairs are typical. The following quality checks should be applied before training:

  • Label consistency: for each instruction, verify that no pair has \(y_w = y_l\). Deduplicate instructions with identical chosen/rejected pairs.
  • Length balance: if the chosen completions are systematically longer than the rejected ones, the model will learn to prefer verbosity — not the intended signal. Check the mean and median lengths of chosen vs. rejected; if they differ by more than 20%, consider adding a length-normalisation step or using SimPO instead of DPO (a sketch of this check follows the list).
  • Rubric coverage: audit a random 100-pair sample against the written rubric. If more than 15% of the audited pairs are labelled inconsistently with the rubric (human or LLM annotator made an obvious error), the label quality is insufficient and the annotation process needs refinement.
  • Distribution coverage: verify that the instructions cover the full range of input types the deployed model will encounter. A dataset of 2,000 pairs all sampled from one product category will produce a model that aligns well on that category and poorly on others.
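A sketch of the first two checks, operating on a list of {"prompt", "chosen", "rejected"} dicts such as the triples built above (pure standard library; the 20% threshold is the one stated in the list):

from statistics import mean, median

def check_preferences(triples):
    # Label consistency: drop pairs where chosen and rejected are identical
    clean = [t for t in triples if t["chosen"] != t["rejected"]]
    # Deduplicate identical (prompt, chosen, rejected) triples
    seen, deduped = set(), []
    for t in clean:
        key = (t["prompt"], t["chosen"], t["rejected"])
        if key not in seen:
            seen.add(key)
            deduped.append(t)
    # Length balance: flag a systematic verbosity signal
    cl = [len(t["chosen"]) for t in deduped]
    rl = [len(t["rejected"]) for t in deduped]
    ratio = mean(cl) / mean(rl)
    print(f"{len(deduped)} pairs after cleaning; "
          f"chosen/rejected mean-length ratio {ratio:.2f} "
          f"(median {median(cl)} vs {median(rl)})")
    if abs(ratio - 1) > 0.20:
        print("WARNING: >20% length imbalance; consider length normalisation or SimPO.")
    return deduped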

Prof. Xuhu Wan  ·  HKUST  ·  Modern AI Stack for Social Data  ·  2026 Edition