Jun 2, 2026

QCon AI Boston 2026: From Token Maxing to Trust Engineering

I just got back from QCon AI Boston 2026, and if I had to summarize the conference in one sentence, it would be this: the industry has figured out how to generate code with AI, but trusting it, affording it, and scaling it are still wide-open problems. Trust, cost, quality, organizational adoption, security, career identity: these were the real conversations happening in the talks and hallways.

Here’s my talk-by-talk breakdown of the two days.

Day 1

The Scale of What’s Happening

The conference opened with Martin Spier from OpenAI sharing the story of keeping ChatGPT fast as development accelerates. The numbers are staggering: 900 million weekly active users (roughly 11% of humanity), and the growth hasn’t slowed. But the talk wasn’t really about scale. It was about a fundamental shift in software engineering.

At OpenAI, developers are now “multi-threaded.” They work on 7-10 things in parallel using Codex, and PR volume has increased by 70%. But here’s the uncomfortable truth Martin highlighted: nobody fully understands all the code being shipped anymore. The abstraction layer is higher, and developers are delegating more to agents without knowing every detail of what’s going out the door.

His team’s response is to build “always-on” performance agents that continuously profile, detect regressions, and propose optimizations without human intervention. The reactive loop (find regression, profile, fix) and the proactive loop (continuously hunt for optimizations) both need to be automated. Martin’s parting wisdom: “If it compiles, great, but if it’s slow, your users will leave anyway. Performance is not just about GPUs, it’s about the entire request path.”
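To make the reactive loop concrete, here is a minimal sketch of the "detect regressions" step, assuming it boils down to comparing a latency percentile between a baseline build and a candidate build. The function names, the p95 choice, and the 10% tolerance are my own illustrative assumptions, not anything Martin described.

```python
from statistics import quantiles


def p95(samples_ms):
    """95th percentile of latency samples (stdlib 'inclusive' method)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return quantiles(samples_ms, n=100, method="inclusive")[94]


def detect_regression(baseline_ms, candidate_ms, tolerance=0.10):
    """Flag a regression when the candidate p95 exceeds the baseline p95
    by more than `tolerance` (a hypothetical 10% budget by default)."""
    base = p95(baseline_ms)
    cand = p95(candidate_ms)
    return {
        "baseline_p95": base,
        "candidate_p95": cand,
        "regressed": cand > base * (1 + tolerance),
    }


# Illustrative data: the candidate build has a slow tail the baseline lacks.
baseline = [100] * 100
candidate = [100] * 90 + [200] * 10
report = detect_regression(baseline, candidate)
```

A real always-on agent would feed this from continuous profiling data and then go on to profile and propose a fix; the sketch only covers the detection gate.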

Context Engineering: The Real Work Behind AI Applications

Ricardo Ferreira from Redis gave one of the most practical talks of the conference. He built an Alexa skill (“My Jarvis”) backed by an LLM and hit every single context engineering problem you can imagine: LLMs don’t know what day it is, they’re stateless, conversations lose context after TTL expiration, vector search returns irrelevant memories, and costs climb steeply with conversation length because the whole history is resent on every call.

His journey through fixing these problems one by one was basically a crash course in context engineering: tool calling for time awareness, short-term vs. long-term memory with Redis, query compression for pronoun resolution, few-shot examples for behavioral guidance, reranking with Cohere for relevance, summarization for cost control, and semantic caching to avoid redundant LLM calls.
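Of the techniques above, semantic caching is the easiest to sketch. The version below is a toy under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the cache is an in-memory list rather than Redis, and the 0.8 threshold and the `llm` callable are hypothetical. It only shows the shape of the idea: reuse an earlier answer when a new question is close enough to an old one.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold  # hypothetical similarity cutoff
        self.entries = []           # list of (embedding, answer) pairs

    def get(self, query):
        """Return the cached answer for the most similar query, or None."""
        q = embed(query)
        best_score, best_answer = 0.0, None
        for vector, answer in self.entries:
            score = cosine(q, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query, answer):
        self.entries.append((embed(query), answer))


def answer(query, cache, llm):
    """Check the cache first; only call the (expensive) LLM on a miss."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    result = llm(query)
    cache.put(query, result)
    return result
```

Usage: pass any callable as `llm`. A near-duplicate question such as "what is the weather today please" after "what is the weather today" is served from the cache without a second call.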

The key lesson: “Context engineering is not a feature you bolt on. It’s architecture.” Every fix he applied required deliberate, intentional design. And changing your model breaks your reranking calibration. Changing your tokenizer changes your cost profile. Nothing comes for free.
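A quick back-of-the-envelope sketch shows why summarization matters for the cost profile. It assumes every turn adds the same number of tokens and that the summary has a fixed size; both are simplifications, and all the numbers are made up for illustration.

```python
def cumulative_tokens_full_history(turns, tokens_per_turn):
    """Total input tokens when every call resends the whole conversation."""
    # Turn k sends k turns' worth of tokens, so the total grows with turns squared.
    return sum(turn * tokens_per_turn for turn in range(1, turns + 1))


def cumulative_tokens_with_summary(turns, tokens_per_turn, summary_tokens, window):
    """Total input tokens when each call sends only the last `window` turns
    plus a fixed-size summary of everything older."""
    total = 0
    for turn in range(1, turns + 1):
        recent = min(turn, window) * tokens_per_turn
        summary = summary_tokens if turn > window else 0
        total += recent + summary
    return total


full = cumulative_tokens_full_history(turns=100, tokens_per_turn=100)
summarized = cumulative_tokens_with_summary(
    turns=100, tokens_per_turn=100, summary_tokens=200, window=5
)
```

Under these assumptions the full-history total is 505,000 input tokens over 100 turns versus 68,000 with a five-turn window and a 200-token summary. The sketch ignores the extra calls needed to produce the summary itself.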

Democratizing Agent Building

Ben Maraney from Forter (slides) told the story of turning 200 people into agent builders in just two weeks, including non-technical analysts with backgrounds in law, psychology, and lab science. His approach was refreshingly pragmatic: sidestep the hard problems.

Three things made it work:

A single in-house MCP server (“Toolchain”) as a central tool registry. It went from 20 tools to nearly 100 during the sprint. Everyone could discover tools, create connections, and add new ones in a single repo.
Multiple agent platforms for different needs: LibreChat for no-code experimentation, template repos for customizable code-based agents, and Strands for event-driven non-interactive agents.
Removing organizational roadblocks: getting legal comfortable with AWS Bedrock (data stays in AWS), agreeing on high-risk vs. low-risk use cases with security, and proactively managing cost dashboards.
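The central tool registry idea can be sketched in a few lines. This is plain Python, not the MCP SDK and not Forter's Toolchain; `ToolRegistry`, the two example tools, and their canned return values are all hypothetical. It only illustrates the "one place to register, discover, and call tools" pattern.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    name: str
    description: str
    handler: Callable[..., object]


class ToolRegistry:
    def __init__(self):
        self._tools: Dict[str, Tool] = {}

    def register(self, name, description):
        """Decorator that adds a function to the registry under `name`."""
        def decorator(fn):
            if name in self._tools:
                raise ValueError(f"tool already registered: {name}")
            self._tools[name] = Tool(name, description, fn)
            return fn
        return decorator

    def discover(self, keyword=""):
        """List tool names whose name or description mentions `keyword`."""
        kw = keyword.lower()
        return sorted(
            tool.name
            for tool in self._tools.values()
            if kw in tool.name.lower() or kw in tool.description.lower()
        )

    def call(self, name, **kwargs):
        return self._tools[name].handler(**kwargs)


registry = ToolRegistry()


@registry.register("lookup_transaction", "Fetch a transaction record by id")
def lookup_transaction(transaction_id: str):
    # Canned data; a real tool would query an internal system.
    return {"id": transaction_id, "status": "approved"}


@registry.register("search_docs", "Search internal documentation")
def search_docs(query: str):
    return [f"doc mentioning {query}"]
```

Adding a tool is then a single decorated function in a single repo, which matches how the talk described the registry growing during the sprint.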

The best agents that came out of the sprint were the ones with deep access to company data. Their “Leila” agent, connected to code, config, and transaction data, knows the system better than any individual employee and is now used by customer support.

The funniest story: their incident response agent seemed brilliant at root cause analysis until they realized it was just finding and paraphrasing human-written postmortems from their docs.

Ben’s analogy for explaining agents to non-technical people is perfect: “Imagine hiring an intern from a good university, giving them instructions and tools, getting one task done, and then immediately firing them. Then hiring a new intern for the next task.”

Rust and AI: An Unexpected Love Story

Niko Matsakis (slides), one of the lead designers of Rust, made a compelling case that AI agents and Rust are a natural fit. The old objection to Rust (“it takes too long to learn”) evaporates when agents do the coding. As Greg Brockman put it: “Rust is a perfect language for agents given that if it compiles, it’s approximately correct.”

Niko’s Symposium project addresses the gap where LLMs don’t know about new libraries or use outdated versions. Library authors can ship “skills” alongside their crates to teach agents how to use them correctly. He also introduced the Agent Client Protocol (ACP) as a path toward portability across different agent tools.

The security angle is important too: with AI finding vulnerabilities faster, memory-safe languages like Rust become essential. As one Linux kernel maintainer put it: “Rust will save Linux from AI.”

Building Your Own Multi-Agent Workstation

Robert Brennan (CEO, OpenHands) walked through why building your own custom AI development workstation matters. His key insight: agents are still new, nobody has figured out the optimal workflow, and every developer works differently. He built a custom workstation with a chat UI, vibe-coding interface, scheduled agents (daily repo summaries, tweet suggestions), voice input, and even a Spotify integration for music recommendations.

The most interesting technical advice: run agents in the cloud (not your laptop), inside Docker or your company’s VPC. Always-on cloud agents can respond to GitHub events, Slack messages, Datadog alerts, and scheduled triggers. And maintain forks of open-source agent platforms. With AI handling merge conflicts, long-lived forks are now feasible.

Autonomous SDLC at Roblox

Andrew Swerdlow from Roblox (slides) described their “Prompt to Prod” initiative: going from a prompt directly to production with no human intervention. The gap he identified is critical: we’ve solved typing (code generation), but we haven’t solved trust.

Their approach centered on three pillars:

Alignment guardrails: They extracted “exemplars” from 1.75 million code review comments (700K PRs over 3 years), turning institutional knowledge into testable YAML rules. Their AI code reviewer now has a 60-80% suggestion acceptance rate, compared to 55% for human reviewers.
Security and access: Sandboxed agents, just-in-time permissions, separate auditable identities for agents, no long-running secrets.
Rethinking metrics: Lines of code and PR counts are meaningless now. They measure feature velocity (22% increase in median features per engineer), agent quality via evals, and “long-running turn time,” which measures how long an agent can work autonomously before needing human input (current P99.9: only 2.1 hours; goal: 8 hours).
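I don't know what Roblox's rule schema looks like, so the following is only a guess at the general shape of "review knowledge encoded as testable rules." The rules are shown as Python dicts, roughly what a YAML file might look like once loaded; the rule ids, patterns, and messages are invented for illustration.

```python
import re

# Hypothetical rules, as they might look after loading from a YAML file.
RULES = [
    {
        "id": "no-bare-except",
        "pattern": r"^\s*except\s*:",
        "message": "Catch a specific exception type.",
    },
    {
        "id": "no-print-debug",
        "pattern": r"^\s*print\(",
        "message": "Use the logger instead of print.",
    },
]


def review(added_lines, rules=RULES):
    """Return (line number, rule id, message) for every rule violation."""
    findings = []
    for lineno, line in enumerate(added_lines, start=1):
        for rule in rules:
            if re.search(rule["pattern"], line):
                findings.append((lineno, rule["id"], rule["message"]))
    return findings


diff = ["try:", "    x = 1", "except:", "    print('oops')"]
findings = review(diff)
```

The point of making rules data rather than prose is that each one can be unit tested against known good and bad snippets before an AI reviewer starts citing it.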

The biggest surprise: the hardest work wasn’t AI. It was plumbing. Adding CLIs, APIs, and MCP access to 18 different internal tools that had no programmatic interfaces. They used Playwright to convert UIs into CLIs. And they had to change policy: “Every code change requires a human review” is no longer sustainable when you’re producing 10x more code.

Day 2

The AI Maturity Model: Where Organizations Get Stuck

Lizzie Matusov (slides) opened Day 2 with a research-backed framework for AI maturity in engineering organizations. The core insight comes from the Theory of Constraints: speeding up code generation doesn’t help if your bottleneck is code review, testing, or deployment.
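The Theory of Constraints point is easy to see with a toy model. Assume a strictly serial pipeline where each stage has a weekly capacity; the stage names and numbers below are made up.

```python
def pipeline_throughput(stage_capacity):
    """Throughput of a serial pipeline is capped by its slowest stage."""
    bottleneck = min(stage_capacity, key=stage_capacity.get)
    return bottleneck, stage_capacity[bottleneck]


# Hypothetical capacities, in changes per week.
before = {"coding": 10, "review": 6, "testing": 8, "deploy": 12}

# Tripling code generation speed leaves the bottleneck untouched.
after_faster_coding = dict(before, coding=30)

# Improving review moves the constraint to the next-slowest stage.
after_faster_review = dict(before, review=9)

print(pipeline_throughput(before))
print(pipeline_throughput(after_faster_coding))
print(pipeline_throughput(after_faster_review))
```

In this model, tripling coding capacity still ships 6 changes per week, while improving review lifts throughput to 8 and makes testing the new constraint.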

Her five stages: Ad-hoc Adoption, Assisted Development, Standardized Workflows, Supervised Automation, and End-to-End Autonomy. Most big enterprises are at stage 2-3. Growth-stage companies are at 3-4. Stage 5 is aspirational and risky because you’re giving agents the same autonomy as a senior engineer.

The most cutting observation was about measurement. Token maxing is the new “lines of code,” a dangerous vanity metric. She shared a quote from an engineer at a FAANG company who admitted to deliberately inflating token usage: asking AI questions already in the docs, prototyping features he’d never ship, and defaulting to agents even when manual work would be faster. Amazon just shut down its token leaderboard in response.

Her recommendation: measure outcomes using frameworks like SPACE (Satisfaction, Performance, Activity, Communication, Efficiency) rather than raw activity alone. When usage goes up but delivery outcomes don’t, that’s your signal of a bottleneck.

AI-First, Quality Always: The Red Hat Story

Catherine Weeks from Red Hat (slides) gave a grounded, practical talk about how Red Hat is approaching AI adoption across its engineering organization. Her framing was honest: executives want speed, but quality is what protects customers and brand reputation.

She described three layers of adoption:

Organizational: Automating the SDLC with tiger teams. Their RFE pipeline automation was a standout: 12 oversized RFEs automatically split into 44 well-scoped children, saving months of back-and-forth between PM and engineering.
Team: Enabling teams to bring their own context. Their QE tiger team published 15+ skills that went viral across the organization.
Individual: Encouraging experimentation. One engineer went from “YOLO mode” to building his own remote orchestrator so agents keep running during his commute.

Catherine’s most important message: make validation and testing your highest-priority investment, not your lowest. Build evaluation harnesses, think carefully about human-in-the-loop vs. human-on-the-loop, and treat AI like any other production software with sandboxing, guardrails, and deterministic CI gates.
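As a rough illustration of an evaluation harness feeding a deterministic CI gate, here is a minimal sketch. It is not Red Hat's tooling; `toy_system`, the test cases, and the 90% pass-rate threshold are placeholders I made up.

```python
def run_evals(system, cases):
    """Run each case through `system` and apply its deterministic check."""
    results = []
    for case in cases:
        output = system(case["input"])
        results.append({"input": case["input"], "passed": bool(case["check"](output))})
    return results


def ci_gate(results, min_pass_rate=0.9):
    """Return (gate_passed, pass_rate); the threshold is a hypothetical default."""
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate >= min_pass_rate, pass_rate


def toy_system(text):
    # Stand-in for an AI-backed component under test.
    return text.upper()


CASES = [
    {"input": "a", "check": lambda out: out == "A"},
    {"input": "b", "check": lambda out: out == "B"},
    {"input": "c", "check": lambda out: out == "c"},  # deliberately failing case
]

results = run_evals(toy_system, CASES)
gate_passed, pass_rate = ci_gate(results)
```

With one of three cases failing, the pass rate is about 67% and the gate blocks the change. The gate itself is ordinary deterministic code, even when the system under test is not.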

Her Martin Fowler quote landed perfectly: “To go faster, you need higher quality.”

The AI Cost Crisis

Erik Peterson (CloudZero) delivered a reality check wrapped in dark humor. His company processes cloud and AI billing data, and what they’re seeing is alarming:

14x growth in GenAI spend in the last 5 months
80% of that appears to be overhead (R&D, internal usage, Claude Code)
One customer went from a $4M annual budget to $40M spent in a single quarter, and people got promoted, not fired
Token maxing is estimated at $5-15 billion in wasted spend industry-wide
Jevons paradox in full effect: tokens are roughly 10x cheaper, but enterprise spend is up 222%

His predictions are bold: by 2027, the primary buyer of cloud resources will be autonomous AI systems, not humans. The “fully autonomous cloud,” where AI writes, deploys, and manages code, will be reality by 2028.

The practical advice: stop measuring cost per token as a KPI. Instead, measure unit economics (cost per transaction, cost per feature, cost per customer outcome). Your CFO is already comparing engineer salaries to token costs on a spreadsheet. Get ahead of that conversation.
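Here is a small sketch of why cost per token can mislead. The two months of figures are invented for illustration; only the two ratios matter.

```python
def kpis(cost_usd, tokens, transactions):
    """Compute a token-level KPI and a unit-economics KPI from the same bill."""
    return {
        "cost_per_million_tokens": cost_usd / (tokens / 1_000_000),
        "cost_per_transaction": cost_usd / transactions,
    }


# Hypothetical months: token prices fall, but usage grows faster than business volume.
month_a = kpis(cost_usd=10_000, tokens=1_000_000_000, transactions=400_000)
month_b = kpis(cost_usd=18_000, tokens=3_000_000_000, transactions=450_000)
```

In this made-up example the cost per million tokens improves from $10 to $6, while the cost per transaction worsens from $0.025 to $0.04. A dashboard tracking only the first number would report progress while the unit economics degrade.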

Closing Keynote: Predictions for QCon AI 2030

Meryem Arik closed the conference with predictions for 2030, drawing on her track record (she predicted the GPU inference bottleneck in 2023, and Nvidia’s stock has gone up 10x since).

Her predictions:

Token spend will be a massive pain point. Token usage by agents will multiply 24x by 2030. Every new generation of AI applications (chatbots, RAG, reasoning, agents, parallel agents) is more token-hungry.
Everyone will be a builder. The cost of software is trending toward zero. IT becomes a platform team enabling the rest of the organization, not the team that writes code.
Agents are making our buying decisions. Supabase grew to 7M users, almost all from agents choosing it as the default database. We’ll need new frameworks for managing implicit vendor lock-in.
Parallel agents will dominate. Linear agents are the current enterprise default, but parallel (wide) agents offer better latency, cost, and accuracy. This will bring massive infrastructure complexity.
AI regulation is coming. 50% of Americans are more worried about AI than excited. She predicts the 2028 US election will feature anti-AI sentiment, leading to real regulation by 2030.
Career leveling changes. Deep technical skills matter less. Product thinking, requirements engineering, and “conductor” skills (managing 5-10 agents simultaneously) become the key differentiators.
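The latency half of the parallel agents prediction can be sketched with `asyncio`. This assumes the sub-tasks are independent and uses a short sleep as a stand-in for an LLM or tool call; `sub_agent` and the task names are hypothetical. It says nothing about the cost or accuracy claims.

```python
import asyncio
import time


async def sub_agent(task, delay=0.05):
    """Stand-in for one agent step; the sleep simulates an LLM or tool call."""
    await asyncio.sleep(delay)
    return f"{task}: done"


async def run_linear(tasks):
    """Run sub-agents one after another; latency is the sum of the steps."""
    return [await sub_agent(task) for task in tasks]


async def run_parallel(tasks):
    """Fan out all sub-agents at once; latency is roughly the slowest step."""
    return await asyncio.gather(*(sub_agent(task) for task in tasks))


def timed(runner, tasks):
    """Run an async runner to completion and return (results, seconds)."""
    start = time.perf_counter()
    results = asyncio.run(runner(tasks))
    return list(results), time.perf_counter() - start


TASKS = ["search", "summarize", "draft", "verify"]
linear_results, linear_seconds = timed(run_linear, TASKS)
parallel_results, parallel_seconds = timed(run_parallel, TASKS)
```

Both runs return the same results, but the fan-out version finishes in about one step's time instead of four. The infrastructure complexity she warned about starts where this sketch stops: rate limits, partial failures, and merging conflicting outputs.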

Her final prediction: “There won’t be a QCon AI in 2030. There will just be QCon. AI will be part of all software engineering.”

My Overall Takeaways

The bottleneck has shifted. Generating code is solved. The hard problems are now trust, cost management, organizational adoption, and quality assurance.
Context engineering is architecture. It’s not prompt engineering or a nice-to-have. It’s the core design challenge of production AI systems.
Quality over speed, always. Every speaker who’s actually shipping AI to production emphasized this. Start with evaluation harnesses, guardrails, and validation, not velocity metrics.
The cost conversation is coming for all of us. CFOs are comparing token costs to salaries. Measure unit economics, not tokens consumed.
Make it easy for people to build. The organizations winning with AI aren’t the ones with the best models. They’re the ones that removed friction for their people to experiment.
The plumbing matters more than the AI. Multiple speakers noted that the hardest work wasn’t AI at all. It was adding APIs, CLIs, tests, and observability to existing systems.