Enterprise AI Cost Control: Token Budgets, Per-Team Limits, and Real-Time Budget Alerts
AI inference is priced by consumption, not by seat — making costs unpredictable by default. Large frontier models cost 17–25× more per token than small efficient models. Without per-user limits, per-team budgets, and model routing policies, a 250-person enterprise can spend 3–5× its intended AI budget by month two. Four controls fix this: per-user token limits, per-team monthly budgets, model access policies by team, and automated threshold alerts at 50/80/100%. A BYOK deployment keeps all cost visibility at the organisation level, with no markup and full provider-level controls.
Enterprise AI deployments fail financially in a predictable pattern. The platform launches, adoption is enthusiastic, and at the end of month one someone opens the API invoice and discovers the cost was three times the budget. There is no prior art for AI cost governance in most organisations — and the defaults of most AI platforms and APIs are entirely permissive.
Unlike SaaS software — where costs scale by seat, predictably — AI inference costs scale by consumption. Two employees can generate wildly different token volumes doing nominally similar work if one routinely uses large frontier models and the other uses small efficient ones. Neither employee is doing anything wrong. The cost governance problem is architectural, and it requires architectural solutions.
Why AI Costs Are Structurally Different From Software Costs
Traditional software is priced by seat or by tier — fixed or subscription costs that finance teams can model accurately. AI inference is priced per token: tokens in, tokens out, multiplied by the per-token rate for the model selected. For a 250-person organisation where every employee uses AI at different intensities, with different models, for tasks of different complexity, the resulting cost is genuinely difficult to forecast without active monitoring.
Three structural factors compound this unpredictability:
Model cost variance is extreme. GPT-4o output tokens cost approximately $15 per million at published 2025 pricing; GPT-4o-mini costs $0.60 per million output tokens. The same output costs 25× more when generated on the frontier model. An employee who defaults to the most powerful available model for every task — a natural tendency when capability options are visible — generates dramatically more cost than one who matches model to task complexity.
Usage intensity varies by individual. A power user who runs large documents through AI daily might consume 50× more tokens than a light user who asks one or two questions per week. Without per-user limits, a handful of power users can consume the majority of an organisation's AI budget, leaving everyone else with little or no allocation.
Automated and scripted usage is unbounded by default. When employees build automations, scripts, or workflows on top of an enterprise AI platform, those automations can generate thousands of requests in minutes. A single misconfigured script running overnight can exhaust a month's token budget before 9 a.m.
"Infrastructure costs — including AI inference — are the biggest threat to sustainable generative AI deployment in enterprise. Companies that do not implement systematic cost governance in the first six months of deployment consistently find themselves in a position where they either overspend significantly or restrict access to the point where adoption collapses. Neither is the outcome anyone planned for."
— Andreessen Horowitz (a16z), "The Cost of AI Infrastructure in the Enterprise," a16z Research, 2024. The report identified cost governance as the leading operational challenge for enterprise generative AI programmes at scale.
The Model Cost Spectrum: Matching Capability to Task
Before implementing token budgets, the most impactful cost lever is model routing — matching each category of task to the model appropriate for it. The cost differential across model tiers makes this a significant financial decision, not just a technical preference.
A routing policy that reserves frontier models for genuinely complex tasks and routes everyday queries to mid-size or small models reduces cost by 70–80% for the routed volume, with no measurable quality difference for those task categories. For a 250-employee organisation running 100,000 queries per month, this routing decision is worth tens of thousands of dollars annually. For the carbon governance dimension of the same decision, see our guide to CSRD AI emissions reporting — model routing reduces both cost and carbon simultaneously.
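To make the routing arithmetic concrete, here is a minimal sketch using the per-token prices cited in this article. The query volume, average response length, and routed share are illustrative assumptions, and the resulting saving shifts with each of them.

```python
# Illustrative routing-savings arithmetic. Prices are the per-million-token
# output rates cited in this article; volumes are assumed for the example.
FRONTIER_PER_M = 15.00   # USD per million output tokens (frontier model)
SMALL_PER_M = 0.60       # USD per million output tokens (small model)

queries_per_month = 100_000
avg_output_tokens = 500          # assumed average response length
routed_share = 0.60              # assumed share of queries moved to the small model

tokens_m = queries_per_month * avg_output_tokens / 1_000_000  # 50.0 M tokens

cost_all_frontier = tokens_m * FRONTIER_PER_M
cost_with_routing = (tokens_m * (1 - routed_share) * FRONTIER_PER_M
                     + tokens_m * routed_share * SMALL_PER_M)

print(f"all-frontier: ${cost_all_frontier:,.2f}/month")
print(f"with routing: ${cost_with_routing:,.2f}/month")
print(f"overall saving: {1 - cost_with_routing / cost_all_frontier:.0%}")
```

Note that the overall saving depends heavily on the routed share: the routed volume itself becomes roughly 25× cheaper, while the remaining frontier queries continue to dominate the residual bill.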
The Four Cost Control Mechanisms
1. Per-User Monthly Token Limits
Every user account carries a monthly token limit — a ceiling on total tokens consumed in each calendar month. Input and output tokens both count toward the limit. When a user approaches their limit, a visible progress indicator appears in the chat interface. When the limit is reached, the user receives a clear message and cannot generate further responses until the next billing period or until an admin increases their allowance.
Token limits prevent the heavy-user problem: the 5% of employees whose usage accounts for 40–50% of total AI costs. They also create a natural incentive for users to think about model selection — a user who has consumed 80% of their monthly token limit is motivated to route routine queries to a smaller model to preserve budget for high-value tasks.
Admins can set different limits for different user tiers. Technical staff with high-volume needs receive higher limits. Light users get conservative defaults. Exception requests are handled through a simple admin approval workflow — no need to rebuild access controls from scratch for every edge case.
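The limit-check logic can be sketched as follows. The field names, the 80% warning threshold for the progress indicator, and the example limits are assumptions for illustration, not the actual Govarix API.

```python
# Minimal sketch of a per-user monthly token limit check (illustrative;
# field names and the 80% warning threshold are assumptions).
from dataclasses import dataclass

@dataclass
class UserQuota:
    monthly_limit: int      # tokens allowed per calendar month
    used: int = 0           # input + output tokens consumed so far

    def check(self, requested: int) -> str:
        """Return 'ok', 'warn' (>= 80% after this request), or 'blocked'."""
        if self.used + requested > self.monthly_limit:
            return "blocked"            # deny before any tokens are spent
        self.used += requested
        if self.used >= 0.8 * self.monthly_limit:
            return "warn"               # drives the UI progress indicator
        return "ok"

quota = UserQuota(monthly_limit=1_000_000)
print(quota.check(500_000))   # ok
print(quota.check(400_000))   # warn (90% consumed)
print(quota.check(200_000))   # blocked (would exceed the ceiling)
```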
2. Per-Team Monthly Token Budgets
Teams have independent monthly token budgets tracked against actual consumption. The admin dashboard shows, per team: tokens consumed, estimated cost at current model pricing, percentage of budget used, and number of active users. Budget tracking happens in real time — the dashboard reflects the last query, not last night's batch update.
Team budgets enable departmental cost accountability. The finance team's AI budget is separate from the product team's. When a department approaches its budget, the department head receives an alert and can make a reallocation decision before the limit is reached — not after the invoice arrives. This transforms AI cost from a centralised IT expense into a measurable, per-department operational cost.
Chargeback reporting — showing each team's exact monthly AI spend — makes AI costs visible at the business unit level, enabling ROI conversations grounded in actual data rather than estimates.
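A chargeback report of this kind reduces to pricing each team's tokens at the rate of the model that consumed them. The aggregation below is an illustrative sketch; the model prices are the figures cited earlier, and the usage records are invented for the example.

```python
# Sketch of a per-team chargeback report: each team's output tokens priced
# at the consuming model's own rate (usage records are example data).
from collections import defaultdict

PRICE_PER_M_OUTPUT = {"gpt-4o": 15.00, "gpt-4o-mini": 0.60}  # USD / M tokens

# (team, model, output_tokens) records for one billing period
usage = [
    ("finance", "gpt-4o", 2_000_000),
    ("finance", "gpt-4o-mini", 10_000_000),
    ("product", "gpt-4o", 5_000_000),
]

spend = defaultdict(float)
for team, model, tokens in usage:
    spend[team] += tokens / 1_000_000 * PRICE_PER_M_OUTPUT[model]

for team, cost in sorted(spend.items()):
    print(f"{team}: ${cost:,.2f}")   # finance: $36.00 / product: $75.00
```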
3. Model Access Controls by Team
Model access controls determine which models each team can select. The control is configured at the team level during onboarding and can be updated at any time in the admin console.
A typical configuration: the legal and product teams have access to the full model range including large frontier models. The operations and support teams, whose tasks are predominantly routine, have access only to mid-size and small models. The IT team has access to all models plus any self-hosted or experimental models configured in the platform.
The governance decision about which model is appropriate for each team is made once, by the AI governance lead, and then enforced automatically at every interaction. Individual employees do not need to think about cost governance — the appropriate model range for their role is their available option set. For a broader framework on enterprise AI model selection strategy, see our enterprise AI platform selection guide.
Because Govarix supports any LLM through its multi-provider architecture — OpenAI, Anthropic, self-hosted, or any provider accessible via API — model access controls apply uniformly across all providers. A team restricted from GPT-4o is equally restricted from Claude 3 Opus if both are in the restricted tier. For a guide to multi-model and BYOK architectures, see our multi-model enterprise AI and BYOK guide.
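Uniform, provider-agnostic enforcement can be sketched as a tier lookup against a per-team allow-list. The team names and tier labels follow the example configuration above; the data structures are assumptions, not the Govarix schema.

```python
# Sketch of a team-level model allow-list applied uniformly across providers
# (team names and tier labels follow the example configuration in the text).
MODEL_ACCESS = {
    "legal":      {"frontier", "mid", "small"},
    "operations": {"mid", "small"},
}

MODEL_TIER = {
    "gpt-4o": "frontier",
    "claude-3-opus": "frontier",
    "gpt-4o-mini": "small",
}

def can_use(team: str, model: str) -> bool:
    """Checked at every request, regardless of which provider hosts the model."""
    return MODEL_TIER.get(model) in MODEL_ACCESS.get(team, set())

print(can_use("operations", "gpt-4o"))         # False: frontier tier blocked
print(can_use("operations", "gpt-4o-mini"))    # True
print(can_use("operations", "claude-3-opus"))  # False: same tier, same rule
```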
4. Real-Time Budget Alerts
Budget alerts fire when consumption crosses configured thresholds. The standard configuration triggers alerts at 50%, 80%, and 100% of the monthly budget. Each alert is delivered through two channels:
- Email: Configurable recipients per alert type. A 50% alert might go to the department head. An 80% alert might copy finance. A 100% alert notifies the CIO. Recipients are set per threshold, per team.
- Webhook: Fire-and-forget POST to any webhook endpoint — Slack channel, Microsoft Teams channel, PagerDuty incident, or a custom operations dashboard. Payload includes team name, current consumption, budget total, percentage, and timestamp.
Deduplication prevents alert storms. Each threshold fires at most once per billing period regardless of how many subsequent queries push usage further above the threshold. An 80% alert fires once when the team first crosses 80% — it does not re-fire on every query thereafter.
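The threshold-plus-deduplication behaviour can be sketched as follows. The payload fields mirror those listed above; the function shape, the fired-threshold set, and the commented-out POST are illustrative assumptions.

```python
# Sketch of threshold alerting with per-period deduplication. Payload fields
# mirror those listed in the text; delivery details are placeholders.
import json
from datetime import datetime, timezone

THRESHOLDS = (0.5, 0.8, 1.0)

def check_alerts(team, used, budget, fired):
    """Fire each threshold at most once per billing period ('fired' set)."""
    pct = used / budget
    for t in THRESHOLDS:
        if pct >= t and t not in fired:
            fired.add(t)              # dedup: never re-fire this threshold
            payload = {
                "team": team, "tokens_used": used, "budget": budget,
                "percent": round(pct * 100, 1),
                "timestamp": datetime.now(timezone.utc).isoformat(),
            }
            print(f"ALERT {int(t * 100)}% -> webhook: {json.dumps(payload)}")
            # requests.post(WEBHOOK_URL, json=payload)  # fire-and-forget POST

fired = set()
check_alerts("finance", 850_000, 1_000_000, fired)  # fires 50% and 80%
check_alerts("finance", 900_000, 1_000_000, fired)  # no re-fire: deduplicated
```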
Rate Limiting: The Real-Time Safety Valve
Monthly token limits control aggregate consumption. Rate limiting controls burst consumption — the scenario where an automated script or API misuse generates thousands of requests in seconds before any monthly limit can respond.
Govarix implements per-user rate limiting at 15 requests per minute. Any request beyond this limit is rejected with a clear error before the model processes it and before tokens are consumed. Rate limiting fires synchronously at the API layer — it is a prevention mechanism, not a cost reporting mechanism. A misconfigured automation that sends 1,000 requests in 60 seconds consumes 15 of them, not 1,000.
Rate limits are configurable per user tier. A legitimate high-volume technical user running batch processing workflows can be granted a higher rate limit through the admin console, with a documented business justification in the access log.
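A sliding-window limiter at 15 requests per minute can be sketched as below. The 15 rpm figure comes from the text; the specific windowing algorithm is an illustrative assumption.

```python
# Minimal sliding-window rate limiter at 15 requests/minute, rejecting
# synchronously before any tokens are consumed (algorithm is illustrative).
from collections import deque

class RateLimiter:
    def __init__(self, max_requests=15, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()   # request times within the current window

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            return False            # rejected before the model is called
        self.timestamps.append(now)
        return True

limiter = RateLimiter()
results = [limiter.allow(t * 0.05) for t in range(1_000)]  # burst of 1,000
print(sum(results))  # 15: only 15 requests admitted within the window
```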
The Admin Cost Dashboard
Cost governance requires visibility. The Govarix admin cost dashboard provides real-time data across the organisation:
| Dashboard Panel | What It Shows | Primary Use |
|---|---|---|
| Organisation Overview | Total tokens today / this month, estimated cost, active users | CFO / CIO executive view |
| Per-Team Breakdown | Token budget, usage, cost, % consumed, user count per team | Department cost accountability |
| Top Models by Cost | Ranked model usage by token spend, with cost per model | Model routing optimisation |
| Top Users by Consumption | Ranked user list by monthly tokens, with cost estimate | Heavy-user identification and limit adjustment |
| Budget Alert Status | Which thresholds have fired, for which teams, on which dates | Alert audit trail for finance review |
| Carbon Ledger | CO₂e per model, per team, running monthly total | CSRD ESRS E1 reporting data source |
Model pricing is tracked at actual published rates per model. Every cost figure in the dashboard is calculated from the current pricing of the specific model used for each query — not from a blended average. Finance teams get numbers that match the provider invoice, not an approximation.
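Per-query pricing at each model's own rates, rather than a blended average, reduces to a small lookup. The output rates below are the article's figures; the input rates of $2.50 and $0.15 per million are assumptions consistent with the roughly 17× input-token ratio cited earlier.

```python
# Sketch of per-query cost at the specific model's own input and output
# rates (output rates are the article's; input rates are assumptions).
PRICING = {   # USD per million tokens: (input, output)
    "gpt-4o":      (2.50, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def query_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    rate_in, rate_out = PRICING[model]
    return tokens_in / 1e6 * rate_in + tokens_out / 1e6 * rate_out

# The same 2,000-token-in / 800-token-out query on each tier
print(f"{query_cost('gpt-4o', 2_000, 800):.5f}")       # 0.01700
print(f"{query_cost('gpt-4o-mini', 2_000, 800):.5f}")  # 0.00078
```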
BYOK: Full Cost Visibility at the Provider Level
Govarix operates on a bring-your-own-keys model. Your organisation provides its own OpenAI, Anthropic, and other provider API keys. All token consumption is billed directly to your accounts at the provider's published rates — there is no Govarix markup on token costs.
BYOK creates a second layer of cost control at the provider level. OpenAI's API allows per-key spending caps. Anthropic's API provides usage dashboards per key. These provider-level controls sit above Govarix's platform-level controls — a spending cap on your OpenAI key is an absolute ceiling that no platform configuration can exceed.
The combination of provider-level key caps and Govarix platform-level team budgets creates defence-in-depth for cost governance. Govarix catches overruns at the team and user level. The provider key cap is the final backstop if any platform-level control fails to fire correctly.
A practical checklist for layering the two:
- Set a monthly spending cap on each API key at the provider level (OpenAI, Anthropic)
- Assign separate API keys to different environments (production vs. dev/test) so development usage does not count against production budgets
- Configure Govarix team budgets at 80% of the corresponding provider key cap — leaving 20% headroom for legitimate overages before the hard stop
- Set 80% and 100% threshold alerts in Govarix to fire before approaching the provider key cap
- Review the top-models-by-cost dashboard weekly in the first two months to calibrate model routing policies against actual usage patterns
Building a Cost-Aware AI Culture
Technical controls are necessary but not sufficient for sustainable AI cost governance. The organisations that maintain healthy AI cost profiles over time also build cost awareness into how employees think about AI use.
Showing users their current token consumption in the chat interface is more effective than invisible limits. When a user can see "you have used 65% of your monthly AI budget," they make different model choices for routine tasks than they would if consumption were invisible. Transparency at the user level makes cost governance a shared responsibility rather than a central restriction.
Monthly team cost reports — showing each department its AI spend alongside its AI output metrics — create the conditions for ROI conversations. A team that spent €800 on AI last month and can show measurable time savings from that usage is in a fundamentally different position than one that cannot connect cost to value. For the framework connecting AI cost visibility to organisational ROI, see our analysis of the context tax and AI productivity recovery.
Frequently Asked Questions
How do you control AI costs in an enterprise deployment?
Four mechanisms working together: per-user monthly token limits capping individual consumption, per-team budgets with real-time tracking, model access controls routing teams to cost-appropriate models, and automated threshold alerts at 50/80/100% delivered via email and webhook. BYOK deployment adds provider-level spending caps as a final backstop.
How much cheaper are small AI models compared to large frontier models?
17–25× cheaper per token at 2025 pricing. GPT-4o output tokens cost approximately $15 per million; GPT-4o-mini costs $0.60 per million. For tasks that do not require frontier capability — summarisation, simple drafting, FAQ responses — routing to a smaller model reduces per-query cost by 80–95%.
What is a token budget in enterprise AI?
A monthly consumption ceiling assigned to a user or team. When the budget is reached, the user cannot generate further responses until the next billing period or an admin grants an exception. Token budgets turn AI usage from an unbounded variable cost into a predictable, plannable expense.
What is model routing in enterprise AI cost governance?
A governance policy that matches task complexity to model capability. Simple tasks go to efficient small models. Complex tasks are permitted on large frontier models. A team routing 60% of queries from large to small models reduces AI costs by 70–80% with no quality loss on routine tasks.
What is BYOK in enterprise AI?
Bring Your Own Keys — the organisation provides its own OpenAI, Anthropic, or other API keys. All token consumption is billed directly at provider rates with no platform markup, full cost visibility at the provider level, and the ability to set provider-level spending caps as an absolute ceiling above platform controls.
How do AI budget alerts work?
Alerts fire when team consumption crosses 50%, 80%, and 100% of the monthly budget. Delivery is via email to configurable recipients and via webhook to Slack, Teams, PagerDuty, or any endpoint. Each threshold fires at most once per billing period — deduplication prevents alert storms from repeated queries above the threshold.
