Inference Economics

The EV+ of GenAI

Agentic AI, poker theory, and the economics of inference. Production AI systems win by making profitable decisions repeatedly, not by using the most powerful model every time.

Dan Stativa

Capability is not enough

An AI system can be impressive and still be economically wrong. The production question is whether each request creates enough expected value to justify retrieval, tokens, model calls, latency and risk.

  • Expected value model for inference
  • Worked EV calculations
  • Optimal routing strategy
  • Break-even cost analysis
  • Equilibrium play for GenAI systems

[Figure: EV-positive GenAI inference calculation]

There is something slightly misleading about the way the AI industry presents itself. Most demonstrations focus on intelligence: the prompt, the agent, the reasoning chain, the tool call, the workflow, the autonomy. The industry loves showing what a system can do.

But production systems are not evaluated only by capability. They are evaluated by economics.

An AI agent that performs beautifully once is not automatically a successful product. The real question is whether that same agent can execute ten million times per month without destroying the economics of the company operating it.

That makes modern AI engineering resemble professional poker more than science fiction.

Expected Value

In poker, one of the central ideas is Expected Value, usually abbreviated as EV.

Do not ask: will I win this hand? Ask: is this decision profitable over the long run?

You can lose an individual hand and still make the correct decision statistically. That is the philosophy behind EV-positive play.

Modern AI systems increasingly operate under the same logic. Every generated response consumes computational resources. Every token processed by a language model has a cost. Every extra second of latency affects user experience and infrastructure utilization.

The EV Formula for GenAI

For a production AI request, a useful simplified EV model is:

expected-value.model
EV = p(success) x business_value - inference_cost - risk_cost

| Term | Meaning |
|---|---|
| p(success) | Probability the system produces an acceptable outcome |
| business_value | Margin, retention value, support deflection value, conversion value, or user utility created by the answer |
| inference_cost | Model calls, tokens, retrieval, vector search, orchestration, GPU/CPU time, and platform overhead |
| risk_cost | Expected cost of hallucination, escalation, bad UX, compliance exposure, or human correction |

The important point: model quality is only one part of the equation. An answer can be impressive and still be EV-negative.
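
The formula translates directly into a small function. The sketch below is a minimal illustration, not a production cost model; the function name and the two sets of input numbers are hypothetical and chosen only to show one EV-positive and one EV-negative request.

```python
def expected_value(p_success: float, business_value: float,
                   inference_cost: float, risk_cost: float) -> float:
    """EV per request: p(success) x business_value - inference_cost - risk_cost."""
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be a probability in [0, 1]")
    return p_success * business_value - inference_cost - risk_cost


# Hypothetical inputs (assumptions, not figures from the article).
solid = expected_value(p_success=0.80, business_value=1.00,
                       inference_cost=0.30, risk_cost=0.10)

# A very capable answer that still loses money: high p(success), but the
# request is worth less than it costs to serve.
impressive_but_negative = expected_value(p_success=0.95, business_value=0.20,
                                         inference_cost=0.25, risk_cost=0.02)

print(f"solid request:           EV = ${solid:.2f}")
print(f"impressive but negative: EV = ${impressive_but_negative:.2f}")
```

The second call is the case the formula is meant to expose: a 95% success rate does not help when the value created per request is smaller than the cost of producing it.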

A Worked Example

Imagine a WhatsApp support assistant answering a product policy question.

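  • Business value: $0.42 per request
  • Risk reserve: $0.05 per request
  • Volume: 10M requests per month

| Strategy | p(success) | Cost | Calculation | EV |
|---|---|---|---|---|
| Small model, no retrieval | 0.55 | $0.14 | 0.55 x 0.42 - 0.14 - 0.05 | $0.041 |
| RAG + mid model | 0.78 | $0.09 | 0.78 x 0.42 - 0.09 - 0.05 | $0.188 |
| Large agentic workflow | 0.92 | $0.215 | 0.92 x 0.42 - 0.215 - 0.05 | $0.121 |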

The largest model is the most capable strategy in isolation. It has the highest success probability. But it is not the highest EV strategy.

The best strategy in this example is the middle route: good retrieval, enough model quality, controlled token use, and limited orchestration. This is the AI equivalent of not overbetting a medium-strength hand.
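
The worked example can be reproduced in a few lines. This is a sketch that uses the article's own figures; the `Strategy` class and variable names are illustrative choices, not part of any library.

```python
from dataclasses import dataclass

BUSINESS_VALUE = 0.42  # USD per successful request (from the worked example)
RISK_RESERVE = 0.05    # USD per request (from the worked example)


@dataclass(frozen=True)
class Strategy:
    name: str
    p_success: float
    cost: float  # inference cost per request, USD

    def ev(self) -> float:
        return self.p_success * BUSINESS_VALUE - self.cost - RISK_RESERVE


STRATEGIES = [
    Strategy("Small model, no retrieval", 0.55, 0.14),
    Strategy("RAG + mid model", 0.78, 0.09),
    Strategy("Large agentic workflow", 0.92, 0.215),
]

most_capable = max(STRATEGIES, key=lambda s: s.p_success)
best_ev = max(STRATEGIES, key=Strategy.ev)

for s in STRATEGIES:
    print(f"{s.name:<28} EV = ${s.ev():.4f}")
print("Most capable:", most_capable.name)
print("Highest EV:  ", best_ev.name)
```

Ranking by `p_success` and ranking by `ev()` pick different winners, which is the whole point of the example.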

Tokens as Economic Units

In traditional software systems, engineers often think in CPU cycles, memory allocations, network overhead, or database queries. In generative AI systems, tokens become a primary economic unit.

A verbose system is not merely "more intelligent." It is placing larger bets. Long prompts, excessive context retrieval, recursive reasoning loops, and unnecessary agent interactions all increase inference cost.

The cheapest token is the token you never send.
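
To make the "larger bets" idea concrete, here is a rough sketch of per-request token cost. The per-token prices, token counts, and call counts are hypothetical placeholders, not any provider's actual pricing, and the model assumes every call in a workflow sends a similar number of tokens.

```python
# Hypothetical prices in USD per 1,000 tokens. Real prices vary by provider
# and model; substitute your own.
PRICE_PER_1K_INPUT = 0.0005
PRICE_PER_1K_OUTPUT = 0.0015


def request_cost(input_tokens: int, output_tokens: int, model_calls: int = 1) -> float:
    """Approximate token cost of one request, assuming each model call
    sends and receives roughly the same number of tokens."""
    per_call = (input_tokens / 1000) * PRICE_PER_1K_INPUT \
             + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    return per_call * model_calls


# A lean single-call request versus a verbose multi-step agent loop
# (both token profiles are made up for illustration).
lean = request_cost(input_tokens=1_200, output_tokens=250)
verbose = request_cost(input_tokens=12_000, output_tokens=900, model_calls=4)

print(f"lean:    ${lean:.6f} per request")
print(f"verbose: ${verbose:.6f} per request")
print(f"verbose is {verbose / lean:.1f}x the cost of lean")
```

The absolute numbers are meaningless outside these assumptions. What carries over is the shape: prompt length, retrieved context, and the number of model calls multiply together.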

Why Retrieval Matters

Many people imagine generative AI as a model simply "knowing things." Serious AI applications often work differently. When a user asks a question, the system searches for relevant information first. Only afterward does the model generate an answer using that retrieved context.

Retrieval-augmented generation (RAG), the retrieve-then-generate pattern described above, changes the EV equation because it can increase p(success) while reducing both risk and token waste. But retrieval introduces its own strategic question: how much context should be retrieved?

Too little context and the model becomes inaccurate. Too much context and the system becomes expensive, slow, and noisy. Good chunking is not only an information retrieval problem. It is an economic optimization problem.
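
One way to treat context size as an economic choice is to estimate EV for each candidate number of retrieved chunks and pick the maximum. The sketch below assumes you have measured p(success) per chunk count on an evaluation set; the success rates and cost parameters here are invented placeholders, not measured data, and only the business value and risk cost come from the article's worked example.

```python
BUSINESS_VALUE = 0.42  # USD, from the article's worked example
RISK_COST = 0.05       # USD, from the article's worked example

# Hypothetical cost model: a fixed per-request cost plus a cost per retrieved chunk.
BASE_COST = 0.05
COST_PER_CHUNK = 0.01

# Hypothetical success rates by number of retrieved chunks. In practice these
# would come from your own evaluation set. Note the dip at 16: more context
# can add noise as well as cost.
P_SUCCESS_BY_K = {0: 0.55, 2: 0.72, 4: 0.78, 8: 0.80, 16: 0.79}


def ev_for_k(k: int) -> float:
    """EV per request when retrieving k chunks."""
    cost = BASE_COST + k * COST_PER_CHUNK
    return P_SUCCESS_BY_K[k] * BUSINESS_VALUE - cost - RISK_COST


best_k = max(P_SUCCESS_BY_K, key=ev_for_k)

for k in sorted(P_SUCCESS_BY_K):
    print(f"k={k:<2} p(success)={P_SUCCESS_BY_K[k]:.2f}  EV=${ev_for_k(k):.4f}")
print("EV-maximizing chunk count:", best_k)
```

Under these made-up numbers the highest success rate (k=8) is not the highest EV (k=4), because the extra accuracy costs more than it returns.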

Optimal Strategy: Route by EV, Not by Ego

[Figure: Optimal GenAI routing frontier]

The optimal strategy is a frontier: route each request to the cheapest policy that preserves enough success probability.

A mature AI system does not use the most powerful model for every request. It routes.

routing.rule
use the cheapest policy whose expected value remains positive

| Request class | Default route | Escalate when | Why |
|---|---|---|---|
| FAQ / policy lookup | Retrieval + small model | Retrieval confidence is low | Most value comes from grounding, not raw reasoning |
| Product comparison | Retrieval + mid model | Ambiguity or high purchase intent | Better synthesis can increase conversion value |
| Legal / compliance-sensitive answer | Retrieval + constrained high-quality model | Almost always | Risk cost dominates inference cost |
| Creative ideation | Mid model | User asks for depth or novelty | Success is subjective, token budget can be flexible |
| Agentic workflow with tools | Gated planner + tool executor | Task value exceeds orchestration cost | Tool loops are expensive bets |
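
The routing rule itself ("use the cheapest policy whose expected value remains positive") can be sketched as follows. This is a minimal illustration of that one rule only; it does not model the per-class escalation triggers in the table. The `Policy` class and `route` function are hypothetical names, and the policy figures are borrowed from the earlier worked example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Policy:
    name: str
    p_success: float  # estimated, e.g. from an offline evaluation set
    cost: float       # estimated inference cost per request, USD


def expected_value(policy: Policy, business_value: float, risk_cost: float) -> float:
    return policy.p_success * business_value - policy.cost - risk_cost


def route(policies: Sequence[Policy], business_value: float,
          risk_cost: float) -> Optional[Policy]:
    """Return the cheapest policy with positive EV, or None if no policy
    is worth running for this request."""
    for policy in sorted(policies, key=lambda p: p.cost):
        if expected_value(policy, business_value, risk_cost) > 0:
            return policy
    return None


POLICIES = [
    Policy("Small model, no retrieval", 0.55, 0.14),
    Policy("RAG + mid model", 0.78, 0.09),
    Policy("Large agentic workflow", 0.92, 0.215),
]

# The support question from the worked example.
chosen = route(POLICIES, business_value=0.42, risk_cost=0.05)

# A hypothetical low-value request: no policy clears break-even.
skipped = route(POLICIES, business_value=0.10, risk_cost=0.05)

print("Worked example routes to:", chosen.name if chosen else "no model call")
print("Low-value request routes to:", skipped.name if skipped else "no model call")
```

Returning `None` is a deliberate design choice in this sketch: when every policy is EV-negative, the caller can fall back to a canned answer or a human queue instead of calling a model.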

Break-Even Inference

The break-even point is where EV equals zero:

break-even.calc
0 = p(success) x business_value - inference_cost - risk_cost

maximum_affordable_cost = p(success) x business_value - risk_cost

maximum_affordable_cost = 0.78 x $0.42 - $0.05
maximum_affordable_cost = $0.2776

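Under these assumptions, a route that succeeds 78% of the time is EV-positive at any cost below $0.2776 per request. The ceiling is specific to that success rate: each route has its own break-even cost because its p(success) differs. But positive is not optimal.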
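
Because the ceiling depends on p(success), each route has its own break-even cost. The sketch below applies the same rearranged formula to all three strategies from the worked example; the function and variable names are illustrative.

```python
def max_affordable_cost(p_success: float, business_value: float, risk_cost: float) -> float:
    """Inference cost at which EV is exactly zero."""
    return p_success * business_value - risk_cost


BUSINESS_VALUE = 0.42
RISK_COST = 0.05

# name -> (p_success, actual cost per request), from the worked example
ROUTES = {
    "Small model, no retrieval": (0.55, 0.14),
    "RAG + mid model": (0.78, 0.09),
    "Large agentic workflow": (0.92, 0.215),
}

headroom = {}
for name, (p, cost) in ROUTES.items():
    ceiling = max_affordable_cost(p, BUSINESS_VALUE, RISK_COST)
    headroom[name] = ceiling - cost  # equals the EV per request
    print(f"{name:<28} ceiling ${ceiling:.4f}  actual ${cost:.3f}  "
          f"headroom ${headroom[name]:.4f}")
```

The headroom column is simply EV per request viewed from the cost side: how much more a route could spend before it stops paying for itself.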

| Strategy | EV per request | Monthly expected surplus |
|---|---|---|
| Small model, no retrieval | $0.041 | $410,000 |
| RAG + mid model | $0.188 | $1,880,000 |
| Large agentic workflow | $0.121 | $1,210,000 |

The gap between the best-looking demo and the best economic strategy is $670,000 per month in this toy model. That is why inference architecture is product strategy.
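
The monthly figures are the per-request EVs scaled by volume. A short sketch of that arithmetic, using the rounded EVs and the 10M monthly volume from the worked example (variable names are illustrative):

```python
MONTHLY_REQUESTS = 10_000_000

# Rounded EV per request, as shown in the table (USD).
EV_PER_REQUEST = {
    "Small model, no retrieval": 0.041,
    "RAG + mid model": 0.188,
    "Large agentic workflow": 0.121,
}

# round() removes floating-point noise so the totals are whole dollars.
monthly_surplus = {name: round(ev * MONTHLY_REQUESTS)
                   for name, ev in EV_PER_REQUEST.items()}

gap = monthly_surplus["RAG + mid model"] - monthly_surplus["Large agentic workflow"]

for name, surplus in monthly_surplus.items():
    print(f"{name:<28} ${surplus:,}")
print(f"Gap between best EV and most capable: ${gap:,}")
```

These totals start from EVs rounded to three decimals. Starting from the unrounded values (0.1876 and 0.1214) gives a gap of roughly $662,000, which does not change the conclusion.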

Equilibrium Play

[Figure: Equilibrium play of GenAI systems]

As users, competitors and model providers adapt, durable advantage moves into retrieval, routing, telemetry and evaluation policy.

Poker also has the idea of equilibrium: a strategy that cannot be easily exploited when opponents adapt. GenAI markets develop their own equilibrium.

Users adapt. Competitors adapt. Model providers adapt. Application builders adapt. They route, cache, retrieve, compress, fine-tune, and evaluate.

In that environment, sustainable advantage does not come from using "AI" in the abstract. Everyone can call an API. Sustainable advantage comes from the policy around inference: retrieval quality, chunking discipline, routing thresholds, evaluation sets, caching behavior, latency budgets, fallback design, and knowing when not to call the model at all.

The New AI Engineer

The next generation of AI engineering will not be defined only by smarter models. It will be defined by economic efficiency, inference optimization, retrieval quality, throughput engineering, latency reduction, evaluation discipline, and systems architecture.

The modern AI engineer is becoming less of a prompt designer and more of a computational economist: someone who understands not only intelligence, but also the cost of intelligence.

In poker, the winner is rarely the player with the single most spectacular hand. The winner is usually the player capable of making profitable decisions repeatedly, managing risk carefully, understanding probabilities deeply, and surviving long enough for statistical advantage to compound.

Production AI systems are beginning to follow the same logic. The future belongs to systems that are not merely intelligent. It belongs to systems that are EV-positive.
