This passage is an extended excerpt from *Sources*, a weekly newsletter created and written by Alex Heath that provides in-depth analysis and sharp commentary on artificial intelligence and the broader technology sector. It is distributed exclusively to subscribers of *The Verge*, giving readers an inside view of tech innovation and corporate strategy.
In this edition, Heath relays a pointed message from Rohit Prasad, Amazon's Senior Vice President of Artificial General Intelligence, who is urging the AI community to reconsider its fixation on model benchmarks and leaderboard scores. His blunt directive is simple yet provocative: stop chasing the numbers. In a conversation preceding Amazon's announcements at the annual AWS re:Invent conference in Las Vegas, Prasad emphasized that the constant race to dominate benchmark charts has little to do with how AI performs in the messy complexity of the real world. "What I want," he said, "is genuine, measurable utility. None of the current benchmarks reflect that reality." He elaborated that, in his view, performance comparisons only carry real meaning if all models are trained on identical data and the evaluation sets are fully held out from training. That ideal, he noted, is far from being met today. As a result, evaluation metrics have become noisy, inconsistent, and increasingly poor indicators of true model capability.
This contrarian position runs against the grain of an industry that routinely celebrates every new model’s ascent on public leaderboards. For most AI labs, outperforming rivals by even a fraction of a point is treated as a victory worth publicizing. Amazon’s indifference, therefore, appears both philosophically distinct and strategically convenient, particularly given that its prior flagship model, Nova, was ranked only seventy-ninth on the popular LMArena benchmark at the time of Heath’s interview. Minimizing the importance of leaderboards might seem self-serving, but for Amazon, it also signals an alternative vision of what meaningful progress looks like—one that shifts attention away from raw numerical dominance toward applied, problem-specific success.
The heart of Amazon’s re:Invent announcements is Nova Forge, a new service that, according to the company, enables organizations to train customized AI models in ways that were previously feasible only for corporations willing to invest billions in infrastructure and expertise. The challenge Forge addresses is widely acknowledged: most businesses seeking to adapt general AI systems to their unique needs face an unattractive set of choices. They can attempt to fine-tune a closed proprietary model, but such access is often limited to superficial layers of modification. Alternatively, they can train an open-weight model, but without access to the original training data, they risk catastrophic forgetting—a phenomenon where the model becomes highly specialized on new data but loses its general reasoning or linguistic coherence. The final option, building a model entirely from scratch, demands extraordinary computational and financial resources, putting it out of reach for all but the largest players.
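Catastrophic forgetting can be seen even in the simplest possible setting. The toy sketch below (entirely illustrative, and in no way Amazon's or anyone's actual training stack) trains a one-parameter linear model with gradient descent on "task A" (data drawn from y = 2x), then continues training it only on "task B" (y = -x). Because the new data contains nothing from task A, the updates overwrite what was previously learned, and error on task A collapses back to being large:

```python
# Toy illustration of catastrophic forgetting (hypothetical, not any
# vendor's method): a one-parameter model y = w * x trained by plain SGD.

def train(w, data, lr=0.05, steps=200):
    """Run SGD on squared error (w * x - y)^2 over the dataset."""
    for _ in range(steps):
        for x, y in data:
            pred = w * x
            w -= lr * 2 * (pred - y) * x  # gradient of (pred - y)^2 w.r.t. w
    return w

def mse(w, data):
    """Mean squared error of y = w * x on the dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

task_a = [(x, 2 * x) for x in (-2, -1, 1, 2)]   # task A: y = 2x
task_b = [(x, -x) for x in (-2, -1, 1, 2)]      # task B: y = -x

w = train(0.0, task_a)            # learn task A; w converges toward 2
err_a_before = mse(w, task_a)     # essentially zero
w = train(w, task_b)              # continue training on task B ONLY
err_a_after = mse(w, task_a)      # task A performance is destroyed

print(f"task A error before: {err_a_before:.6f}, after: {err_a_after:.2f}")
```

The standard mitigation, mixing some of the original training data back in during continued training, is exactly what companies adapting an open-weight model cannot do without access to that original data, which is the gap Forge's checkpoint access is meant to close.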
Forge promises to redefine that trade-off. Instead of those restrictive paths, it grants companies structured access to Amazon’s own Nova model checkpoints at multiple stages—during pre-training, mid-training, and post-training. This stepwise exposure allows businesses to embed their proprietary data during the earliest, most receptive phases of learning, a moment when, as Prasad explained, the model’s capacity to internalize domain-specific nuances is at its strongest. Rather than merely fine-tuning behavior at the end, Forge opens the opportunity to shape the system’s conceptual foundation itself.
Prasad described this as a democratization of frontier AI development, asserting that Forge effectively places the same powerful tools used by Amazon’s own research teams into the hands of its customers—at a fraction of the historical cost. In his words, “We have made it possible for you to develop models specifically tailored to your needs without rebuilding from the ground up.” The origin story of Forge mirrors a familiar Amazon pattern: a tool designed internally to solve Amazon’s own engineering problems later evolves into a full-fledged product offered to the world. Just as Amazon Web Services began as a way to manage the company’s internal retail infrastructure before becoming its most lucrative enterprise segment, Forge emerged because Amazon’s teams required an efficient way to enrich base models with their specialized knowledge.
One compelling example of this approach in action is Reddit's collaboration with Amazon. Reddit has used Forge to construct custom safety and moderation models informed by twenty-three years of accumulated community data. Chris Slowe, Reddit's Chief Technology Officer and its first employee, was plainly enthusiastic about the project, noting that one of the company's senior engineers had been "like a kid in a candy shop" while experimenting with Forge's capabilities. Last week, Reddit executed a continued pre-training job that yielded highly encouraging results. The company's ambition is to consolidate several separate safety systems into a single unified model that deeply understands the cultural and behavioral subtleties of Reddit's diverse communities. That includes grasping the intangible yet essential meaning of recurring moderation principles such as the ubiquitous refrain: "Don't be a jerk."
Slowe explained that an “expert” model derived from Reddit’s history will inherently comprehend the context behind such rules. In practical terms, that means the model should develop an intuitive, machine-learned sense of what constitutes courteous, boundary-respecting engagement versus toxic or disruptive conduct—a distinction that even humans often interpret differently across Reddit’s thousands of subcommunities. By developing in-house expertise through Forge, Reddit also gains a degree of operational independence. According to Slowe, the company now enjoys control over model weights, freedom from unannounced API changes, and full ownership of sensitive training data—advantages that are nearly impossible to secure when relying on third-party model providers. Reddit is already exploring extending this method to new products like Reddit Answers and other user-support features.
When Heath asked whether it concerned him that Amazon’s Nova model itself does not sit near the top of public benchmarks, Slowe’s answer was forthright. “In this context, what matters is the level of Reddit-specific expertness the model has achieved,” he said without hesitation. For him, and for Amazon by extension, the focus is shifting away from the pursuit of abstract superiority in generalized intelligence and toward control, domain precision, and reliability. Prasad and Amazon hope that more developers will follow that line of thinking—that they will value the ability to shape a model deeply tied to their operational needs over simply acquiring one that scores slightly higher on a comparative chart.
Viewed strategically, Amazon's Forge initiative reflects a calculated bet on the commoditization of foundational AI models. Instead of competing in an increasingly crowded race dominated by OpenAI and Anthropic—contests largely determined by scale, computational power, and benchmark bragging rights—Amazon proposes another path. It positions itself as the indispensable infrastructure layer where businesses can build, personalize, and continuously refine models tailored to their specific operational realities. This philosophy is quintessentially AWS: prioritize scalable infrastructure over abstract intelligence, customization over raw general capability, and genuine applied value over symbolic leaderboard victory.
Whether Forge ultimately proves to be a paradigm-shifting innovation or primarily a savvy reframing of Amazon’s competitive posture will depend entirely on developer adoption and long-term outcomes. Amazon insists that the traditional model race—the one that equates progress with benchmark performance—is missing the point. If the company’s argument holds, the real scoreboard will migrate from public online leaderboards to boardrooms, production systems, and direct user experiences, where the measure of success will no longer be numerical rank but authentic, verifiable impact. In that future, what truly counts will be whether artificial intelligence produces tangible, durable utility for the organizations and people who use it.
Source: https://www.theverge.com/column/836902/amazons-ai-benchmarks-dont-matter