In a recent episode of the Dwarkesh Patel Podcast, Ilya Sutskever described today’s AI models with a word that immediately resonated with anyone who has tried to put them into real workflows: jagged. Not “bad.” Not “overhyped.” Jagged – capable in ways that still feel uncanny, yet unreliable in ways that feel disproportionate. Brilliant on a hard evaluation, strangely brittle on a nearby task. Impressive in a demo, costly in production.
That framing matters because it explains a tension many leaders feel but struggle to name: the technology is clearly powerful, investment is clearly massive, and yet the economy still mostly feels… normal. Under the headlines, the infrastructure buildout is real. J.P. Morgan Asset Management argues that in the first half of 2025, AI-related capex contributed roughly 1.1 percentage points to U.S. GDP growth, and hyperscalers were projected to allocate $342B in capex in 2025. It’s a major macro signal.
And still, at the workflow level, the lived experience in many organizations is uneven: pockets of acceleration, pockets of friction, lots of waiting for “reliability” to catch up to “capability.”
Our view at Sphere is that this isn’t a contradiction. It’s what general-purpose technologies look like mid-diffusion. The first phase is infrastructure and abstract capex. The second phase is uneven productivity gains inside workflows. The third phase is broad behavioral change – when patterns harden, adoption becomes repeatable, and “AI-native” operating habits spread.
Where the story gets practical – where it becomes an enterprise roadmap rather than an observation – is here: jaggedness is a system problem. And the discipline that turns jagged capability into dependable output is increasingly a combination of:
- LLM observability (seeing what the system actually did, step-by-step, across prompts, retrieval, tools, latency, and failures),
- LLM evaluation / evaluation-driven development (measuring what matters before production punishes you),
- AI guardrails / GenAI guardrails (constraining behavior and catching failure modes),
- RAG for enterprise / grounded generation (anchoring output to trusted sources),
- agentic AI in the enterprise (orchestration + checkpoints, not “autonomy theater”),
- and generative AI governance / GenAI monitoring (because reliability and traceability are becoming procurement requirements, not nice-to-haves).
Sutskever’s “jagged” word is the right entry point because it names what teams see when they stop talking about models and start shipping systems.
Jaggedness is a market signal – LLM observability is the response
Jaggedness shows up the moment you ask a model to do work the way work actually happens: messy inputs, shifting requirements, multiple systems, partial context, and real costs for subtle errors. A system can score highly on standardized benchmarks and still fail in production, because benchmarks measure performance in controlled conditions while production measures end-to-end reliability under constraint.
This is why “doing well on evals” so often coexists with executive disappointment. In production, firms buy predictable throughput: a system that completes the task correctly, repeatedly, with acceptable risk.
When that doesn’t happen, the fix is rarely “a better prompt” alone. It’s usually visibility + measurement + constraints – and visibility is where LLM observability becomes foundational.
In classic software, observability means you can reconstruct what happened: traces, logs, metrics, errors. In LLM systems, you need that plus new primitives: the exact prompt, retrieved passages, tool calls, model outputs, token usage, refusal paths, safety filters, and user feedback. Some observability platforms describe this explicitly as tracing that captures prompts, responses, usage, latency, and tool/retrieval steps.
Without that visibility, you can’t debug jaggedness – you can only argue about it.
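The trace primitives described above can be sketched as a minimal in-memory store. This is an illustration, not any particular observability platform’s API; all class and field names are invented for the example:

```python
import math
from dataclasses import dataclass, field

@dataclass
class LLMTrace:
    """One structured record per model call: the primitives needed
    to reconstruct what the system actually did."""
    trace_id: str
    prompt: str
    response: str = ""
    retrieved_passages: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: float = 0.0
    error: str = ""  # refusal, timeout, safety-filter hit; empty means success

class TraceStore:
    """In-memory stand-in for a real tracing backend."""
    def __init__(self):
        self.traces = []

    def record(self, trace):
        self.traces.append(trace)

    def failures(self):
        # "We can see exactly where it fails" starts with a query like this.
        return [t for t in self.traces if t.error]

    def p95_latency(self):
        lat = sorted(t.latency_ms for t in self.traces)
        return lat[math.ceil(0.95 * len(lat)) - 1] if lat else 0.0
```

Once every request produces a record like this, “the system feels flaky” becomes an answerable query: filter the failures, group by retrieval step or tool call, and look at the latency distribution.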
And jaggedness is not theoretical. We already have empirical evidence that the same “broad technology family” can produce radically different outcomes depending on task shape and environment.
A well-known study of a generative AI assistant deployed to customer support agents found substantial productivity gains – around 15% more issues resolved per hour – with benefits concentrated among less experienced workers.
But a 2025 randomized controlled trial focused on experienced open-source developers working in their own mature codebases found the opposite: allowing AI tools increased completion time by 19%, largely due to time spent prompting, waiting, verifying, and correcting.
Same era. Same category of tools. Different environments, different constraints, different failure costs – different outcomes.
That’s jaggedness made measurable. And it’s also why LLM observability is an adoption accelerant: it turns “we feel like it’s flaky” into “we can see exactly where it fails, how often, and why.”
“Benchmark reward hacking” becomes an enterprise failure mode
Sutskever also pointed to a dynamic that matters far beyond research labs: once you start optimizing models with post-training and reinforcement learning, improvement pressure can drift toward what’s easiest to score – benchmarks – rather than what users need in the wild. In RL language, that’s reward hacking: not malicious, just predictable optimization.
Enterprise teams see the equivalent as the demo trap: a prototype looks fantastic on curated examples, then collapses in production where inputs are messy, edge cases dominate, and the cost of being wrong is real.
This is exactly where LLM evaluation becomes more than “testing.” It becomes product discipline.
AWS’s generative AI operational guidance is blunt about the need to evaluate the quality and reliability of generated outputs using combinations of automated metrics, model-based evaluation, and human review. And in Amazon Bedrock, evaluation workflows explicitly include using one model “as a judge” to evaluate another – formalizing what many teams already do informally.
That’s the shape of evaluation-driven development: you treat model behavior like a product surface you can measure, regress, and improve – not an unpredictable artifact you demo once and hope holds.
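The shape of an evaluation harness is simple enough to sketch. Here the generator and the judge are passed in as plain functions; in practice the judge might be a stronger model scoring a weaker one (as in the Bedrock pattern above), a rule, or a human rubric. Names and the threshold are illustrative:

```python
def evaluate_with_judge(cases, generate_fn, judge_fn, threshold=0.8):
    """Evaluation-driven development sketch: score each generated answer
    against a reference and fail the run when quality regresses.

    cases: list of {"input": ..., "reference": ...} dicts
    judge_fn: returns a score in [0, 1] for (input, answer, reference)
    """
    scores = []
    for case in cases:
        answer = generate_fn(case["input"])
        scores.append(judge_fn(case["input"], answer, case["reference"]))
    mean = sum(scores) / len(scores)
    # Treat the eval like a CI gate: a drop below threshold blocks the change.
    return {"mean_score": mean, "passed": mean >= threshold, "scores": scores}
```

Run this on every prompt change, model upgrade, or retrieval tweak, exactly as you would run a regression test suite on a code change.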
The economic punchline is simple: jaggedness raises adoption costs because organizations must build extra layers – verification, review, monitoring, rollback paths – to make outputs dependable enough to ship or to base decisions on. Benchmarks can rise smoothly while real-world reliability stays uneven. That gap is where budgets and timelines disappear.
The AI adoption value gap: smart models, slow organizations
The most useful way to interpret “AI looks smarter than the economic impact would imply” is not “hype vs. reality.” It’s “capability vs. organizational readiness.”
Gartner has been unusually direct about why expectations dip: high failure rates in early proof-of-concept work and dissatisfaction with current GenAI results – while foundation model providers continue investing billions. That pattern is a reminder that organizations are social and technical systems. Even if the model is competent, the company might not be ready:
- data is fragmented or legally inaccessible,
- processes are undocumented (or political),
- risk and compliance require audit trails,
- incentives punish failure more than they reward experimentation,
- and behavior change takes time.
This is why the takeoff can feel slow until it doesn’t. Once a company crosses a threshold – cleaner data interfaces, reusable evaluation harnesses, stable “AI product” patterns – diffusion accelerates inside that firm, then across its competitors.
The “age of scaling” vs. the “age of research” is mirrored in enterprise GenAI implementation
When Sutskever talks about a shift from an “age of scaling” to an “age of research,” the deeper point is about what produces progress. Scaling levers still matter, but they don’t guarantee the next step-change in reliability.
Enterprise AI is living the same transition.
Phase 1: scaling access.
Roll out copilots. Run pilots. Give teams a model in chat. Prove there’s potential. This phase optimizes for exposure: “Can people use it?” not “Can we trust it?”
Phase 2: scaling reliability.
Once AI touches revenue, compliance, engineering systems, or customer-facing work, the question becomes: “Can we make it behave predictably in messy reality?” That’s when you stop needing more demos and start needing research-like discipline inside your org: measurement, feedback loops, error analysis, and systematic improvement.
This is the moment where “LLM observability” stops being tooling and becomes strategy.
Sutskever’s “value function” analogy is useful here. A value function is an intermediate signal telling you whether you’re on a good path before you reach the final outcome. In enterprise terms, the equivalent is intermediate signals that prevent you from discovering failure only at the end (or in production). For example:
- Intermediate correctness signals: unit tests, factuality checks against trusted sources, schema/format validation, policy checks, “did we cite the right document,” etc.
- Drift detection: dashboards that catch when answer quality, refusal rate, latency, or tool-call success changes after a model update or data change.
- Caution incentives: workflows that reward the system for asking for clarification, abstaining, or escalating to a human when confidence is low – instead of confidently guessing.
That is evaluation-driven development in practice: you build the rails that keep the system stable as the world changes.
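The intermediate signals above can be wired together as a pre-answer checkpoint. This is a sketch of the pattern, not a prescribed implementation; the confidence and citation checks are stand-ins for whatever signals your system actually produces:

```python
def checked_answer(question, generate_fn, confidence_fn, cite_check_fn,
                   min_confidence=0.7):
    """Value-function-style gate: use intermediate signals to catch a bad
    path before the output reaches a user, and reward abstention over
    confident guessing."""
    draft = generate_fn(question)
    if not cite_check_fn(draft):
        # Intermediate correctness signal: did we cite a trusted document?
        return {"status": "escalate", "reason": "missing citation", "answer": None}
    if confidence_fn(draft) < min_confidence:
        # Caution incentive: low confidence routes to a human, not a guess.
        return {"status": "escalate", "reason": "low confidence", "answer": None}
    return {"status": "ok", "answer": draft}
```

The point is architectural: failure is detected mid-path, cheaply, rather than discovered at the end by a customer or an auditor.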
Where value is made: guardrails, measurement, and production reliability
This is the part that tends to get under-described in public AI conversations. Everyone talks about capabilities. Fewer talk about the work required to make capability usable inside real operations.
From our perspective, productionizing generative AI is fundamentally a reliability program. You’re converting jagged capability into predictable throughput.
That conversion usually includes four pillars that reinforce each other:
1) LLM observability (the “what happened” layer).
Tracing is the core: structured logs of each request including the prompt, response, token usage, latency, and the retrieval/tools in between. Without traces, you can’t reproduce failures. Without reproducibility, you can’t improve.
2) LLM evaluation (the “how good was it” layer).
This is where teams stop arguing about vibe and start measuring behavior. AWS explicitly frames evaluation of generated outputs as a lifecycle practice, blending automated methods, model-based evaluation, and human review.
3) AI guardrails / GenAI guardrails (the “what is allowed” layer).
Guardrails define constraints and enforcement: content policies, sensitive-data redaction, safe completion behavior, and sometimes domain rules. Amazon Bedrock Guardrails, for example, is designed to provide configurable safeguards that can be applied across multiple foundation models, supporting consistent safety and privacy controls.
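A minimal guardrail layer can be expressed as a function that sits between the model and the user. The rules below (one PII pattern, one blocked topic) are deliberately toy examples, not a real policy set:

```python
import re

# Illustrative rules only; a production policy set would be far richer.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
BLOCKED_TOPICS = ("wire transfer instructions",)

def apply_guardrails(text):
    """The 'what is allowed' layer: block disallowed content, redact
    sensitive data, and report which rules fired (for audit trails)."""
    if any(topic in text.lower() for topic in BLOCKED_TOPICS):
        return {"allowed": False, "fired": ["blocked_topic"], "text": None}
    fired = []
    redacted, n = EMAIL.subn("[REDACTED_EMAIL]", text)
    if n:
        fired.append("pii_redaction")
    return {"allowed": True, "fired": fired, "text": redacted}
```

Note the `fired` field: guardrails that silently rewrite output are hard to govern, so each enforcement should leave a trace.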
4) Grounding (the “what is true” layer).
In enterprise settings, the most pragmatic reliability lever is often RAG for enterprise – not because it’s fashionable, but because it constrains generation to trusted sources and makes correctness auditable. Grounded generation doesn’t eliminate errors, but it changes the failure surface: “hallucinated answer” becomes “retrieval gap,” which you can measure and fix.
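That shift in failure surface can be made concrete. In the sketch below (retrieval and generation are pluggable stand-ins), an empty retrieval result short-circuits generation, so the failure is recorded as a measurable retrieval gap rather than emitted as a confident hallucination:

```python
def grounded_answer(question, retrieve_fn, generate_fn, min_docs=1):
    """RAG sketch: refuse to generate when retrieval comes back empty,
    and attach source IDs so correctness is auditable."""
    docs = retrieve_fn(question)
    if len(docs) < min_docs:
        # The failure is now a countable event you can fix in the corpus
        # or the retriever, not a plausible-sounding wrong answer.
        return {"status": "retrieval_gap", "answer": None, "sources": []}
    answer = generate_fn(question, docs)
    return {"status": "grounded", "answer": answer,
            "sources": [d["id"] for d in docs]}
```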
These pillars are also why “LLM observability” is more than monitoring. It’s how you connect failures to root causes across the full system: retrieval quality, prompt construction, tool-call behavior, safety filters, and model regressions.
Example of an architecture proposed by NVIDIA
Agentic AI in the enterprise: orchestration + checkpoints, not autonomy theater
Agentic systems raise the stakes because they turn one model call into a chain of decisions: planning steps, tool invocations, retrieval, execution, and synthesis.
That’s also why agentic AI makes observability non-negotiable. You can’t manage what you can’t see – especially when the system is calling tools, touching systems of record, or taking actions that carry business risk.
McKinsey’s 2025 global survey captured the adoption pattern clearly: 23% of respondents report scaling an agentic AI system somewhere in their enterprise, with an additional 39% experimenting. That’s meaningful momentum – but it’s also early enough that many programs are still learning what “scaling” really costs.
This is where the “jaggedness” concept becomes operational: agentic value is not “the agent can do more.” It’s “the agent can do more with checkpoints.” Orchestration plus verification. Delegation plus control. Autonomy plus auditability.
And this is also where evaluation-driven development becomes a moat. If you can instrument, evaluate, and govern multi-step flows – if you can prove reliability at each checkpoint – you can scale agentic systems without betting the business on a demo.
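“Orchestration plus checkpoints” has a simple control-flow shape: verify each step’s output before the next step runs, and halt rather than plow ahead when verification keeps failing. The executor and verifier below are placeholders for real tool calls and real checks:

```python
def run_agent(steps, execute_fn, verify_fn, max_retries=1):
    """Agent loop sketch: delegation plus control. Every step passes a
    checkpoint (verify_fn) before its result is trusted; repeated failure
    halts the run with an auditable partial trace instead of bad actions."""
    results = []
    for step in steps:
        for _attempt in range(max_retries + 1):
            out = execute_fn(step, results)
            if verify_fn(step, out):  # checkpoint: explicit pass/fail
                results.append(out)
                break
        else:
            # No attempt passed verification: stop, don't compound the error.
            return {"status": "halted", "at_step": step, "results": results}
    return {"status": "completed", "results": results}
```

The returned partial trace is what makes the halt useful: a human can see exactly which checkpoint failed and with what intermediate state.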
Governance and GenAI monitoring: alignment talk becomes procurement requirements
One reason the impact can feel delayed is that many organizations won’t scale high-stakes use cases until governance expectations are clearer and enforceable.
In Europe, the EU AI Act timeline is a concrete forcing function. The European Commission’s implementation timeline shows phased applicability, including general-purpose AI obligations starting August 2, 2025, and a full roll-out foreseen by August 2, 2027. Reuters also reported the Commission’s position that there would be no “stop the clock” pause to this rollout.
In practice, regulation turns “trust” from a vibe into a checklist: documentation, risk management, traceability, controls, monitoring, and accountability. That is why generative AI governance and GenAI monitoring are no longer optional add-ons. They’re becoming table stakes for procurement, especially in regulated industries and cross-border deployments.
The market filter is straightforward: it rewards teams that can build systems with traceability and controls – and punishes teams that only know how to produce clever outputs.
The physical layer: power, energy, and a hard ceiling on waste
There’s also a constraint that forces discipline whether or not an organization feels “ready”: energy.
The International Energy Agency reports that data centres accounted for about 415 TWh of electricity consumption in 2024 (~1.5% of global electricity) and projects consumption could rise to around 945 TWh by 2030 in its base case.
That matters because jaggedness is expensive. Every point of unreliability means retries, human review, duplicated work, and overprovisioned infrastructure. In a world where compute and power are strategic constraints, advantage goes to whoever can make models usefully correct with fewer attempts – which brings us back to measurement, evaluation, and the system-level engineering required to reduce error rates in the workflows that matter.
From Model-Centric to Architecture-Centric: A New Competitive Edge
In the early days of the AI boom, many companies vied for the biggest, flashiest models – chasing a trillion-parameter model or the latest OpenAI release. But as enterprise AI matures, it’s becoming clear that the next competitive edge won’t be about who has the largest model; it will be about who has the smartest infrastructure and strategy around that model. In other words, context is the new scale. The integrity and design of your AI architecture will matter more than raw model horsepower.
Why? Because state-of-the-art models are increasingly accessible to all (through APIs or open source). What differentiates success stories is how they are applied. The bottleneck to AI performance in real business tasks is no longer the base model’s IQ – it’s whether the model is being fed the right information and parameters to apply that IQ effectively. An organization with a smaller model but a superior context-engineering pipeline can outperform one with the most advanced model but poor data integration. We see this in practice: a fine-tuned medium-sized model given high-quality, domain-specific context can answer an expert question better than a giant generic model flying blind.
Leading enterprises are already shifting their focus this way. Instead of asking “Should we use GPT-4 or a competitor model?”, forward-looking teams are asking “How do we orchestrate our truth (our data and knowledge) into whichever model we use?” They are building “memory-first, purpose-built expert systems,” not model-first systems. This often involves taking an ensemble approach: using large general models for some tasks, but also training smaller domain-specific models (or using retrieval) for others – all coordinated by an overarching architecture. For instance, an AI workflow might use a general LLM for natural language understanding, but rely on a domain-specific rules engine or knowledge graph to ensure the answer is compliant and contextually correct. The “secret sauce” is in how those components interact.
There’s also an emerging idea of AI architecture integrity. This refers to having alignment and consistency across the AI system’s components. If one part of the system “knows” something, the rest of the system should not contradict it. Achieving this requires careful design. For example, a bank deploys an AI assistant across multiple channels (branch kiosks, mobile app, call center). If a customer asks a mortgage rate question in the mobile app and later asks the call center bot the same question, will they get consistent answers? They should – but that will only happen if both channels pull from the same context source and follow the same rules. If each was built in isolation (maybe by different vendors, with different knowledge bases), the answers might differ, undermining trust. Thus, a coherent architecture becomes a competitive advantage by delivering a unified, high-quality customer experience.
Security and IP considerations also make architecture critical. Companies have proprietary data that they cannot leak into third-party models. Those that devise architectures to use AI without exposing sensitive data (through on-prem deployments, encryption, federated learning, etc.) will have an advantage. For example, a healthcare provider that builds a secure medical GPT on its own patient data (fully compliant with HIPAA) can achieve insights no general model can, and do so safely – giving it a leg up in patient service and research. The integrity of how data flows through the AI system – from secure storage to model and back – becomes a selling point. It’s not just about being compliant; it’s about enabling AI to work with more valuable data because you’ve made the architecture trustworthy and robust.
Another aspect is adaptability. The business world changes rapidly – new products launch, regulations update, market conditions shift. An AI model is static unless retrained, but an AI architecture can be built to be dynamic, continuously pulling in new information. Those who set up pipelines for continuous ingestion of fresh data (say, automatically integrating new documentation or metrics into the AI’s context) will find their AI stays relevant and accurate far longer. In contrast, an organization that treated AI as a one-time model deployment might find its system’s knowledge stagnating. Adaptability is part of architecture: it’s designing for change.
Finally, consider measurement – an often-overlooked but vital part of AI systems. The competitive firms will be those who instrument their AI architecture to measure real outcomes (accuracy, resolution time, customer satisfaction) and feed that back into improvements. Rather than boasting about model size or benchmark scores, they’ll talk about how their AI reduced call handling time by 30% or increased cross-sell revenue by 10%, because their architecture was tailored to optimize those metrics. They will have set up the feedback loops needed to track these things. That kind of ROI-focused iteration is itself an advantage that comes from thinking architecture-first.
2026: The Year of Context – 2027: The Year of Coherence
All signs point to 2026 being the Year of Context in AI. Over the next year, we expect to see a major shift in the industry toward context-centric solutions. Companies that have been dabbling with AI pilots will refocus on building the data foundations and pipelines needed for context-rich AI. Gartner analysts and other experts are already stressing that within the next 12–18 months, “context engineering will move from an innovation differentiator to a foundational element of enterprise AI infrastructure.” In practical terms, this means that having a solid context strategy will be considered table stakes for any serious AI deployment. Much like mobile-first design became a given in the 2010s, context-first AI design will become the norm in the latter 2020s.
What will the Year of Context look like? We’ll see organizations:
- Standardize context pipelines – Firms will invest in tools and platforms that curate and deliver context to AI systems in a repeatable way. This could mean enterprise adoption of vector databases, unified knowledge graphs, or context broker services that sit between data sources and AI models. The emphasis will be on ensuring every AI application has access to the relevant, up-to-date information it needs.
- Break down data silos for AI – To provide rich context, data silos within companies must be bridged. 2026 will drive more integration projects: connecting CRM, ERP, HR, and other systems so AI agents can draw on a 360-degree view. The winners will treat “enterprise context unification” as a strategic initiative, not just an IT task. This may also spur more adoption of data fabric and mesh architectures aligned with AI needs.
- Elevate knowledge management – Content and knowledge teams will find themselves in the spotlight. There will be efforts to clean up and streamline knowledge bases, as the quality of AI output is directly tied to the quality of data it’s given. Companies might launch “knowledge spring cleaning” projects, archive obsolete content, and improve taxonomy design so AIs don’t get confused by ROT (redundant, obsolete, or trivial) data.
- Focus on context governance – Hand in hand with providing context, enterprises will set rules for context usage. For example, defining which sources are trusted for certain queries, or how an AI should flag uncertainty if context is insufficient. Auditability will be key: 2026’s context systems will increasingly log what information was used to generate each answer (for traceability and debugging). This responds to the call for “provenance-controlled inputs” to ensure safer, more reliable AI behavior.
- Vendor solutions will pivot – We can anticipate that AI solution providers will market “context-centric” features heavily. Already, Anthropic has been talking about “constitutional AI” and context management; OpenAI is working on fine-tuning and memory features. New startups will emerge promising to be the “context layer” for enterprises. And consulting companies like Sphere will emphasize frameworks to infuse context in all AI projects from day one.
If 2026 is about establishing context, 2027 will be the Year of Coherence. Coherence is the natural next step – it’s what you get when all your context pieces come together and remain consistent over time. An AI system that is truly coherent will deliver seamless experiences and insights that are almost indistinguishable from what a well-informed human team member would provide. Achieving coherence means not only having context, but maintaining continuity and consistency in how that context is applied. Here’s what we might expect in the Year of Coherence:
- End-to-End AI Workflows – By 2027, we’ll see AI agents capable of handling entire processes across multiple departments, coherently. For instance, an AI could handle an employee onboarding from IT setup to payroll enrollment to training scheduling, without dropping context or needing a human to bridge gaps. As Leena AI describes in their vision, stages 2 and 3 of autonomous operations involve AI agents owning complete workflows and even proactively optimizing them based on accumulated institutional memory. Coherence is when an AI doesn’t just answer questions but can carry out a chain of tasks with contextual awareness throughout.
- Multi-Agent Collaboration – Coherence refers to multiple AI agents working together without confusing each other or the user. If you have a team of specialized AI agents (one for finance, one for legal, one for customer support), coherence means they can pass context among themselves. The finance bot can call on the legal bot’s knowledge when needed, and they won’t contradict one another. We might see standard protocols for agent-to-agent communication (some early work, such as the “Agent Operating Protocol,” hints at that). The outcome will be a more orchestrated intelligence that feels like one coherent assistant, even if under the hood, it’s many components.
- Consistent Multi-Channel Experiences – By 2027, customers and employees should get coherent AI assistance whether they’re interacting via chat, email, phone IVR, or AR glasses. Coherence here means the AI remembers context across channels. If you told the chatbot something yesterday, the phone’s voice assistant today should know it. Achieving this will require unified back-end memory (back to context unification) and real-time state synchronization. The payoff is tremendous – truly personalized, frictionless service.
- Temporal Coherence and Learning – Continuity over time is another angle. Coherence implies the AI not only recalls past interactions but also learns and adapts. By 2027, the systems that have been in place since 2025/2026 will have accumulated a couple of years of interaction history and refinements. We’ll start seeing the compound benefits of this learning: AI recommendations getting more precise, responses becoming more aligned with company tone and policies (because the AI has effectively been “seasoned” with experience). Leaders will measure how often the AI can handle an issue this year vs last year without escalation – an upward trend indicating it’s becoming more coherent and capable.
- Enterprise Coherence = Strategic Alignment – At a higher level, coherence means that your AI and business strategies move in lockstep. By 2027, the enterprises that invested early in context and governance will find that their AI systems consistently drive towards their business goals (because they were designed to do so). The AI won’t feel like a pilot or a side project; it will be deeply embedded and coherent with business processes. This is where we anticipate finally seeing real ROI payoffs. As the initial quote that inspired this piece suggests, those leaders who treat context, consent, and continuity as first-class data will “finally see return on their AI investments.” By the end of 2027, they’ll have a coherent AI infrastructure that competitors who waited simply cannot catch up to easily. The gap becomes structural.
To use an analogy: 2026 is about assembling all the musical instruments (context sources) and tuning them. 2027 is about having them play in harmony, following the same score. The concert of enterprise AI will sound coherent, not like a chaotic rehearsal. And just as in music, when all sections play together, the result can be powerful. We foresee that organizations reaching the coherence stage will unlock efficiencies and innovations that were unattainable when AI was just a patchwork of pilot projects.
Jaggedness is the price of being early – and the roadmap for what comes next
So what does jaggedness really mean?
It means we’re dealing with systems that are already powerful enough to surprise us, but not yet robust enough to be trusted by default. The market isn’t irrational for hesitating. It’s responding to the true cost of unreliability.
The frontier opportunity isn’t only “who has the biggest model.” It’s “who can make AI reliably useful at scale” – and who can do it while power, regulation, and trust constraints tighten.
In that world, LLM observability is not a DevOps detail. It’s the precondition for scaling. It’s what turns jaggedness from an unavoidable annoyance into an engineering surface you can measure, reduce, and eventually make boring.
And boring – predictably correct, traceable, governed, monitored – is exactly what enterprise value looks like.