Founder Mode Episode 27 - From AI Prototype to Production with Ankur Goyal

In this episode, we talked to Ankur Goyal, the founder of Braintrust. He has built AI systems across multiple generations, from structured data and search to modern AI agents.

We all love a slick AI demo. But many teams find that turning that demo into a real product is much harder than it seems.

Ankur showed us how to close that gap. He walked through the systems most teams are missing: evals, observability, and feedback loops. These are what keep AI working in the real world.

“The real trick is building something that matters—something the business actually needs. That’s how you stay in the 5% who succeed.”

— Kevin Henrikson

Why Most AI Projects Fail

An MIT report recently found that 95% of enterprise AI projects return zero ROI. In other words, most teams are building AI but not getting results from it.

Ankur explained why: teams don’t have feedback loops. They ship an AI model once but don’t keep testing or improving it, or they tune it for one user and break it for everyone else.

That’s where evals (short for evaluations) come in. They help you track quality, compare models, and make sure your product stays good as it grows.

5 Key Takeaways

1. Evals Should Start as Soon as You Ship

Early AI projects often break in surprising ways. Evals help you catch regressions before users do.
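
To make the idea concrete, here is a minimal sketch of an eval that catches a regression before release. It is not Braintrust's API or anything shown in the episode. The case list, the exact-match scoring, and the names run_eval, has_regressed, shipped_version, and candidate_version are all assumptions made up for illustration, and the two "versions" are hard-coded stand-ins for real model calls.

```python
from typing import Callable

# Hypothetical eval cases: each pairs an input with the output we expect.
EVAL_CASES = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "opposite of hot", "expected": "cold"},
]


def run_eval(app: Callable[[str], str], cases: list) -> float:
    """Return the fraction of cases where the app's output matches exactly."""
    passed = sum(1 for case in cases if app(case["input"]).strip() == case["expected"])
    return passed / len(cases)


def has_regressed(new_score: float, baseline_score: float) -> bool:
    """Flag a regression when the new version scores below the baseline."""
    return new_score < baseline_score


# Stand-ins for two versions of an AI feature (a real app would call a model).
def shipped_version(prompt: str) -> str:
    return {"2 + 2": "4", "capital of France": "Paris", "opposite of hot": "cold"}[prompt]


def candidate_version(prompt: str) -> str:
    return {"2 + 2": "4", "capital of France": "Lyon", "opposite of hot": "cold"}[prompt]


baseline = run_eval(shipped_version, EVAL_CASES)
candidate = run_eval(candidate_version, EVAL_CASES)
print(f"baseline={baseline:.2f} candidate={candidate:.2f} "
      f"regressed={has_regressed(candidate, baseline)}")
```

Exact match is the simplest possible scorer. Real outputs are usually free-form text, so a production eval would likely swap in a fuzzier scoring function while keeping the same compare-against-baseline shape.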

2. Observability = Quality, Not Just Uptime

In traditional apps, observability is about keeping the site live. In AI apps, it’s about keeping the results useful and relevant.

3. Connect Feedback to Testing in One Click

Top teams make it easy to turn a user complaint into an eval. That helps the team learn and fix things fast.
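
One way to picture the "one click" is a single function call that copies a logged interaction, plus the answer the user actually wanted, into the eval dataset. The sketch below is only an illustration under assumed names: LoggedInteraction, EvalDataset, and add_from_feedback are hypothetical and do not come from the episode or from any specific tool.

```python
from dataclasses import dataclass, field


@dataclass
class LoggedInteraction:
    """One production request as it might appear in the logs (hypothetical shape)."""
    request_id: str
    user_input: str
    model_output: str


@dataclass
class EvalDataset:
    """An in-memory list of eval cases; a real system would persist these."""
    cases: list = field(default_factory=list)

    def add_from_feedback(self, interaction: LoggedInteraction,
                          expected: str, note: str = "") -> dict:
        """Turn a complaint about a logged interaction into a new eval case."""
        case = {
            "input": interaction.user_input,
            "expected": expected,
            "bad_output": interaction.model_output,  # kept for debugging
            "source_request": interaction.request_id,
            "note": note,
        }
        self.cases.append(case)
        return case


# Usage: a user reports that a summary dropped the key date.
dataset = EvalDataset()
complaint = LoggedInteraction(
    request_id="req-001",
    user_input="Summarize: the launch moved to March 3.",
    model_output="The launch was rescheduled.",
)
new_case = dataset.add_from_feedback(
    complaint,
    expected="The launch moved to March 3.",
    note="Summary must keep the date.",
)
print(len(dataset.cases), new_case["source_request"])
```

Keeping the original request ID on the case means anyone looking at a failing eval later can trace it back to the real conversation that motivated it.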

4. Rethink Model Selection Monthly

Old infrastructure changed slowly. AI moves fast. The best teams re-test models every 1–2 months to stay ahead.
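
A periodic re-test can be as simple as running every candidate model over the same eval cases and sorting by score. The following is a rough sketch with invented pieces: the cases, the accuracy and rank_models helpers, and the two stand-in "models" (plain Python functions named current_model and newer_model) are assumptions for illustration, not real model clients.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical eval cases as (prompt, expected answer) pairs.
EVAL_CASES: List[Tuple[str, str]] = [
    ("ping", "pong"),
    ("hello", "world"),
    ("foo", "bar"),
]


def accuracy(model: Callable[[str], str], cases: List[Tuple[str, str]]) -> float:
    """Fraction of cases the model answers exactly right."""
    return sum(model(prompt) == expected for prompt, expected in cases) / len(cases)


def rank_models(models: Dict[str, Callable[[str], str]],
                cases: List[Tuple[str, str]]) -> List[Tuple[str, float]]:
    """Score every model on the same cases and return them best-first."""
    scored = [(name, accuracy(fn, cases)) for name, fn in models.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Stand-ins for real model calls.
def current_model(prompt: str) -> str:
    return {"ping": "pong", "hello": "world"}.get(prompt, "")


def newer_model(prompt: str) -> str:
    return {"ping": "pong", "hello": "world", "foo": "bar"}.get(prompt, "")


ranking = rank_models(
    {"current_model": current_model, "newer_model": newer_model},
    EVAL_CASES,
)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Because the cases stay fixed while the models change, the scores are directly comparable from one re-test to the next.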

5. Models Can Now Improve Each Other

New models like Claude 3 and GPT-5 can review and improve the output of other models. This changes how we run evals and build agents.
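
This pattern is often called "LLM-as-a-judge": one model grades another model's answer. Below is a minimal sketch of the plumbing only. The prompt wording, the 1 to 5 scale, and the judge_output function are assumptions, and fake_judge is a hard-coded stand-in so the example runs without calling any real model API.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; real rubrics are usually more detailed.
JUDGE_PROMPT = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with a single integer from 1 (poor) to 5 (excellent)."
)


def judge_output(judge: Callable[[str], str], question: str, answer: str) -> Optional[int]:
    """Ask the judge model for a 1-5 score; return None if no score is found."""
    reply = judge(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def fake_judge(prompt: str) -> str:
    """Stand-in for a real judge model: rewards answers that start with 'Paris'."""
    return "Score: 5" if "Answer: Paris" in prompt else "Score: 2"


good = judge_output(fake_judge, "What is the capital of France?",
                    "Paris is the capital of France.")
bad = judge_output(fake_judge, "What is the capital of France?", "Lyon")
unparseable = judge_output(lambda prompt: "I cannot grade this.", "q", "a")
print(good, bad, unparseable)
```

Returning None for an unparseable reply, rather than guessing a score, keeps judge failures visible instead of silently skewing the results.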

Final Thoughts

Evals aren’t extra—they’re essential. They help you scale AI without breaking trust. As Ankur put it, evals should be a time-saver, not just a scoreboard.

Shipping AI that works at scale isn’t just about adding more tools. It’s about building the right feedback loops, staying close to real users, and checking your models regularly.

If you’re building with LLMs, think like a systems engineer. Logs, feedback, and evals are your new stack.

🎧 Listen to Episode 27 here: