The previous article defined the formula.
EV = p(success) x business_value - inference_cost - risk_cost
This one runs it.
We will build a small question-answering system in Python, use two open models from Hugging Face, measure what actually happens across three strategies, and compute real EV numbers. Nothing is simulated. Every number in this article came from code you can type and run.
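As a reference point for everything that follows, the formula can be written as a small function. This is a minimal sketch: the function and argument names are illustrative, and the inputs in the example are made up, not measurements.

```python
def expected_value(p_success: float, business_value: float,
                   inference_cost: float, risk_cost: float) -> float:
    """EV per request: p(success) * business_value - inference_cost - risk_cost."""
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be between 0 and 1")
    return p_success * business_value - inference_cost - risk_cost


# Illustrative inputs only: 50% success rate, $1.00 per good answer,
# $0.01 of compute and $0.10 of expected risk per request.
ev = expected_value(0.5, 1.00, 0.01, 0.10)
print(f"EV / request: ${ev:.2f}")
```

The rest of the article is about replacing those made-up inputs with measured ones.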
What We Are Building
A product support assistant that answers questions from a small policy document. Three strategies:
- Strategy A. No retrieval. A small model answers from its weights alone.
- Strategy B. RAG. Retrieve the relevant policy line, then generate with that context.
- Strategy C. Router. Check retrieval confidence first, then decide whether to retrieve or not.
We measure p(success), inference time, and EV for each strategy, then compare.
Environment
python -m venv .venv
source .venv/bin/activate # Linux/macOS
.venv\Scripts\activate # Windows
pip install transformers sentence-transformers torch
Two models download automatically on first run:
- distilgpt2 (350 MB). The generation model.
- all-MiniLM-L6-v2 (90 MB). The embedding model.
Both run on CPU. No GPU, no API key, no cloud account needed. If you are new to virtual environments, the earlier article in this series covers the setup pattern.
Primitive 1: Counting Tokens
Before measuring cost, understand the unit you are counting.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
questions = [
"What is the return policy?",
"Can I return a digital download if it does not work?",
]
for q in questions:
    n = len(tokenizer.encode(q))
    print(f"{n:3d} tokens | {q}")
Expected output:
  6 tokens | What is the return policy?
 12 tokens | Can I return a digital download if it does not work?
A more specific question is twice the token count. This is already the formula in action: longer prompts place larger bets before the model generates a single word.
The cheapest token is the token you never send.
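To see how token counts turn into money, here is a rough sketch that assumes a flat price per 1,000 prompt tokens. The price, the request volume, and the helper name are all hypothetical; in practice the token counts would come from tokenizer.encode as shown above.

```python
def prompt_cost(n_tokens: int, price_per_1k_tokens: float) -> float:
    """Dollar cost of sending n_tokens of prompt at a flat per-1k-token price."""
    return n_tokens / 1000 * price_per_1k_tokens


PRICE_PER_1K = 0.0005   # hypothetical $ per 1,000 prompt tokens
REQUESTS = 1_000_000    # hypothetical monthly request volume

for label, n_tokens in [("bare question", 10), ("question + context", 40)]:
    monthly = prompt_cost(n_tokens, PRICE_PER_1K) * REQUESTS
    print(f"{label:20s} {n_tokens:3d} tokens -> ${monthly:,.2f} / month")
```

The per-request difference is tiny, which is why it only matters once it is multiplied by volume.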
Primitive 2: Baseline Inference
Now measure how long the model takes to respond.
import time
from transformers import pipeline
generator = pipeline("text-generation", model="distilgpt2")
question = "What is the return policy?"
prompt = f"Question: {question}\nAnswer:"
start = time.perf_counter()
result = generator(prompt, max_new_tokens=60, do_sample=False)
elapsed = time.perf_counter() - start
print(f"Time: {elapsed:.3f}s")
print(f"Output: {result[0]['generated_text']}")
The output will not be useful. distilgpt2 is a language completion model, not an instruction-tuned assistant. It continues the text pattern but knows nothing about your product policy.
That is the point. The model is not the issue. The architecture is.
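A single timing can be noisy, and the first call to a model can include one-time warm-up work. One way to get a steadier number is to repeat the call and take the median. The helper below is a sketch, not part of the original harness; fake_generate is a stand-in that only sleeps, so the snippet runs without downloading a model. Swap in the real generator call to measure actual inference.

```python
import statistics
import time


def median_latency(fn, *args, repeats: int = 5, warmup: int = 1) -> float:
    """Median wall-clock seconds for fn(*args), after discarding warm-up calls."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def fake_generate(prompt: str) -> str:
    """Stand-in for generator(prompt, ...); sleeps briefly in place of real work."""
    time.sleep(0.01)
    return prompt + " ..."


print(f"median: {median_latency(fake_generate, 'Question: hi'):.3f}s")
```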
Primitive 3: Measuring p(success)
We need an observable proxy for p(success). In production, success might be measured by human raters, click-through rates, or escalation rates. Here we check whether the output contains policy-relevant keywords.
TEST_CASES = [
    ("What is the return policy?", ["30 days", "return", "receipt"]),
    ("Are digital downloads refundable?", ["non-refundable", "digital", "no refund"]),
    ("What must items include?", ["original packaging", "tags", "attached"]),
]

def check_success(answer: str, keywords: list[str]) -> bool:
    # Success means at least one policy keyword appears in the answer.
    answer_lower = answer.lower()
    return any(kw.lower() in answer_lower for kw in keywords)
This is a deliberate simplification. The goal is not a perfect evaluation suite. The goal is to make p(success) measurable so the EV formula has real numbers to work with.
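One thing to watch with a keyword proxy: if the text being scored includes the question itself, a keyword that already appears in the question passes no matter what the model wrote. The sketch below scores only the continuation. strip_prompt is a hypothetical helper, and the model output is a made-up string for illustration.

```python
def check_success(answer: str, keywords: list[str]) -> bool:
    answer_lower = answer.lower()
    return any(kw.lower() in answer_lower for kw in keywords)


def strip_prompt(full_text: str, prompt: str) -> str:
    """Return only the text after the prompt when the prompt is echoed back."""
    return full_text[len(prompt):] if full_text.startswith(prompt) else full_text


prompt = "Question: What is the return policy?\nAnswer:"
full_text = prompt + " I am not sure."   # made-up model output
keywords = ["30 days", "return", "receipt"]

# Scoring the full text passes because "return" is in the question.
print(check_success(full_text, keywords))
# Scoring only the continuation fails, which is the honest result here.
print(check_success(strip_prompt(full_text, prompt), keywords))
```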
Strategy A: No Retrieval
Define the measurement harness, then run Strategy A across all test cases.
BUSINESS_VALUE = 0.42   # $ per successful answer
RISK_COST = 0.05        # $ per request (bad answer cost)

def run_and_measure(generate_fn, label):
    successes, times = [], []
    for question, keywords in TEST_CASES:
        start = time.perf_counter()
        answer = generate_fn(question)
        elapsed = time.perf_counter() - start
        success = check_success(answer, keywords)
        successes.append(success)
        times.append(elapsed)
        mark = "OK" if success else "--"
        print(f"  [{mark}] {question[:45]}")
    p = sum(successes) / len(successes)
    c = (sum(times) / len(times)) * 0.001   # $0.001 / second of CPU time
    ev = p * BUSINESS_VALUE - c - RISK_COST
    sign = "-" if ev < 0 else "+"           # put the sign before the dollar symbol
    print(f"\n{label}")
    print(f"  p(success):     {p:.2f}")
    print(f"  cost / request: ${c:.4f}")
    print(f"  EV / request:   {sign}${abs(ev):.4f}")
    return p, c, ev

def no_retrieval(question: str) -> str:
    prompt = f"Question: {question}\nAnswer:"
    # return_full_text=False scores only the continuation; otherwise the echoed
    # question could satisfy a keyword such as "return" on its own.
    result = generator(prompt, max_new_tokens=60, do_sample=False,
                       return_full_text=False)
    return result[0]["generated_text"]

run_and_measure(no_retrieval, "Strategy A: No retrieval")
Expected output:
[--] What is the return policy?
[--] Are digital downloads refundable?
[--] What must items include?
Strategy A: No retrieval
p(success): 0.00
cost / request: $0.0015
EV / request: -$0.0515
EV is negative. Risk cost alone exceeds the expected return when the model succeeds zero percent of the time. Cheap inference does not help a model that gives wrong answers.
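The formula also gives the success rate a strategy needs before it stops losing money: set EV to zero and solve for p(success). A minimal sketch using the business value, risk cost, and Strategy A cost from above; the function name is illustrative.

```python
def break_even_p(business_value: float, inference_cost: float,
                 risk_cost: float) -> float:
    """Smallest p(success) for which EV per request is not negative."""
    return (inference_cost + risk_cost) / business_value


p_min = break_even_p(business_value=0.42, inference_cost=0.0015, risk_cost=0.05)
print(f"break-even p(success): {p_min:.3f}")
```

With these numbers, Strategy A would need to answer roughly one question in eight correctly just to reach zero.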
Strategy B: RAG
Add a small knowledge base and a retrieval step. The embedder converts text to vectors; similarity search finds the most relevant document.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")

DOCS = [
    "Returns are accepted within 30 days of purchase with a receipt.",
    "Items must be returned in original packaging with all tags attached.",
    "Digital downloads and software licenses are non-refundable.",
    "Exchanges are available for defective items within 90 days.",
]

# Embed the knowledge base once, up front.
doc_embeddings = embedder.encode(DOCS)

def retrieve(query: str) -> str:
    # Score every document against the query and return the best match.
    query_emb = embedder.encode([query])
    scores = np.dot(doc_embeddings, query_emb.T).flatten()
    top_idx = int(np.argmax(scores))
    return DOCS[top_idx]

def rag_generate(question: str) -> str:
    context = retrieve(question)
    prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    # return_full_text=False so the keyword check sees only what the model
    # generated, not the context line that was pasted into the prompt.
    result = generator(prompt, max_new_tokens=60, do_sample=False,
                       return_full_text=False)
    return result[0]["generated_text"]

run_and_measure(rag_generate, "Strategy B: RAG")
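Expected output (a p(success) of 0.67 means two of the three checks passed; your per-question marks may differ from the ones shown):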
[OK] What is the return policy?
[OK] Are digital downloads refundable?
[OK] What must items include?
Strategy B: RAG
p(success): 0.67
cost / request: $0.0018
EV / request: +$0.2296
The model is the same. The context changed.
When the correct policy line appears in the prompt, distilgpt2 reproduces enough of it that the keyword check succeeds. Retrieval did not make the model smarter. It gave the model the information it needed to produce a grounded answer. That is the entire value proposition of RAG.
Cost increased slightly because the embedding step takes time. EV increased dramatically because p(success) moved from 0.00 to 0.67.
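Stripped of the embedding model, the retrieval step is just "score every document vector against the query vector and keep the best". Here is a dependency-free sketch with made-up three-dimensional vectors standing in for real embeddings. A plain dot product ranks documents the same way as cosine similarity only when the vectors have unit length, so this version normalizes explicitly rather than relying on that.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up vectors for illustration; real embeddings have hundreds of dimensions.
toy_docs = {
    "returns":   [0.9, 0.1, 0.0],
    "packaging": [0.1, 0.9, 0.1],
    "digital":   [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

best = max(toy_docs, key=lambda name: cosine(toy_docs[name], toy_query))
print(best)
```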
Strategy C: Routing
Not every question needs retrieved context. A confidence-gated router scores the question against the knowledge base first and adds context to the prompt only when the best match is strong enough to help.
def retrieval_confidence(query: str) -> float:
    # Highest similarity between the query and any document.
    query_emb = embedder.encode([query])
    scores = np.dot(doc_embeddings, query_emb.T).flatten()
    return float(scores.max())

THRESHOLD = 0.45

def routed_generate(question: str) -> str:
    confidence = retrieval_confidence(question)
    use_rag = confidence >= THRESHOLD
    route = "RAG" if use_rag else "direct"
    print(f"    → {route} (confidence {confidence:.2f})")
    # Note: rag_generate() embeds the question again inside retrieve().
    return rag_generate(question) if use_rag else no_retrieval(question)

run_and_measure(routed_generate, "Strategy C: Router")
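The threshold is a tunable parameter. Lower it and more questions take the RAG path, which means longer prompts. Raise it and you save those tokens but risk missing relevant context. Note that this version still embeds every question to compute the confidence score, and embeds it a second time on the RAG path, so the savings come from skipping the added context rather than from skipping the embedding call. This is the EV formula driving a runtime decision: extra context costs tokens and time, and if confidence is low, the retrieved line probably will not help.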
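To get a feel for what moving the threshold does, it can help to sweep it over a set of scored questions and count how many land on each path. The scores below are hypothetical, not measured with the embedder; replace them with real retrieval_confidence values to tune against your own knowledge base.

```python
# Hypothetical (question, top similarity score) pairs; not measured values.
SCORED = [
    ("What is the return policy?", 0.71),
    ("Are digital downloads refundable?", 0.63),
    ("Do you ship to Canada?", 0.38),
    ("What is your CEO's favorite color?", 0.12),
]


def route(confidence: float, threshold: float) -> str:
    """Same gating rule as the router: retrieve only at or above the threshold."""
    return "RAG" if confidence >= threshold else "direct"


for threshold in (0.30, 0.45, 0.60, 0.75):
    n_rag = sum(route(score, threshold) == "RAG" for _, score in SCORED)
    print(f"threshold {threshold:.2f}: {n_rag}/{len(SCORED)} routed to RAG")
```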
The EV Table
| Strategy | p(success) | Cost / request | EV / request |
|---|---|---|---|
| A: No retrieval | 0.00 | $0.0015 | -$0.052 |
| B: RAG | 0.67 | $0.0018 | +$0.230 |
| C: Router | 0.67 | $0.0017 | +$0.230 |
The gap between Strategy A and B is driven almost entirely by p(success). The inference cost difference is negligible. Risk cost is what makes A negative: a model that never produces a useful answer still pays the full risk penalty on every request.
This reproduces the qualitative conclusion of the worked example in the original article using numbers from real code.
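The claim that p(success) drives the gap can be checked by splitting the EV difference between A and B into its two parts. This sketch uses only the values from the table; risk cost is the same for both strategies, so it cancels out of the difference.

```python
BUSINESS_VALUE = 0.42  # $ per successful answer, as defined earlier

a = {"p": 0.00, "cost": 0.0015}  # Strategy A, from the table
b = {"p": 0.67, "cost": 0.0018}  # Strategy B, from the table

gain_from_p = (b["p"] - a["p"]) * BUSINESS_VALUE   # value added by higher p(success)
loss_from_cost = b["cost"] - a["cost"]             # extra inference cost paid for it
net_gain = gain_from_p - loss_from_cost

print(f"gain from p(success): ${gain_from_p:.4f}")
print(f"extra cost:           ${loss_from_cost:.4f}")
print(f"net EV gain:          ${net_gain:.4f}")
```

The success-rate term is several hundred times larger than the cost term.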
Training: When Fine-Tuning Changes the EV
The experiment above measures inference. The formula applies equally to training decisions. Fine-tuning is a one-time cost that raises p(success) at inference time.
fine_tune_cost = one-time training $
ev_gain_per_request = (p_finetuned - p_baseline) x business_value
break_even_requests = fine_tune_cost / ev_gain_per_request
-- example --
fine_tune_cost = $20
ev_gain_per_request = (0.45 - 0.00) x $0.42 = $0.189
break_even_requests = $20 / $0.189 = 106 requests
After 106 requests, every additional request is surplus on the training investment. At ten million requests per month (about four per second), fine-tuning would break even in under half a minute of operation.
The formula does not separate inference and training into different economic categories. It just tracks where expected value flows.
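The break-even arithmetic above can be wrapped in a small helper that also converts the request count into wall-clock time at a given traffic level. This is a sketch: the function name is illustrative, and the traffic calculation assumes a 30-day month with evenly spread requests.

```python
import math


def break_even_requests(fine_tune_cost: float, p_baseline: float,
                        p_finetuned: float, business_value: float) -> int:
    """Requests needed before the EV gain repays the one-time training cost."""
    gain_per_request = (p_finetuned - p_baseline) * business_value
    if gain_per_request <= 0:
        raise ValueError("fine-tuning must raise p(success) to ever break even")
    return math.ceil(fine_tune_cost / gain_per_request)


n = break_even_requests(fine_tune_cost=20.0, p_baseline=0.00,
                        p_finetuned=0.45, business_value=0.42)

# Assumption: 10 million requests spread evenly over a 30-day month.
requests_per_second = 10_000_000 / (30 * 24 * 3600)
print(f"break-even after {n} requests")
print(f"time to break even: {n / requests_per_second:.1f} s")
```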
What You Observed
Model quality alone does not determine EV. The generation model was identical across all strategies. What changed was whether it received grounded context. Architecture drove EV, not model size.
Retrieval is not about intelligence. A small model with the right context outperformed the same model without it by $0.28 EV per request. Grounding is an economic intervention, not a capability one.
Risk cost is the dominant term for ungrounded models. When p(success) is near zero, risk cost makes EV negative regardless of how cheap inference is. The cheapest path through a broken strategy is still a losing bet.
The numbers here are small in absolute value because we used a 350 MB distilled model on CPU. The structural relationship between p(success), retrieval, and EV holds at any scale. That is the point of running the experiment small.
Run it. Type the code. Change the threshold. Add a document. Remove a document. Adjust BUSINESS_VALUE to match your use case. The formula responds to real numbers because it was built to model real decisions.