Any model in your CLI

MCP tools, secret models and an evals debate

Sep 09, 2025

The newsletter for AI builders of all levels. Mini-tutorials, tool reviews, and the lay of the land from an exited founder turned investor and forever tinkerer.

Hey folks,

Over the weekend, the “evals” debate really took off on Twitter. Debates like this have a ton of “is a burger a sandwich?” nonsense, but the big question is: should you, as an AI product builder, have a testing process for how well your product performs on certain tasks? Of course, the answer is yes, and of course, the usual caveat applies: both too little and too much of this testing are bad. I plucked some snippets from people who wrote good stuff about evals this weekend:

Shreya (In defence of evals): Expertise lets you avoid static metrics. Dogfooding (using your own product) regularly as an expert and updating it based on the vibes is a form of evals. To paraphrase: saying “we don’t do evals” is mostly a misnomer, like “our sci-fi movie has no CGI.”
Alex (Evals are a scam): You don’t outsource evals to someone with zero expertise about your product. You can use their tooling to measure something, but don’t let them tell you what to measure. Most evals companies sell logging, observability and complexity beyond that.
Other posts:
- What are evals, and who needs them? - Ben Hylak, CTO Raindrop (A/B testing product)
- A/B testing can’t keep up with AI - Ankur Goyal, CEO Braintrust (evals product)
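
If "a testing process for how well your product performs on certain tasks" sounds abstract, here is a minimal sketch of what the smallest possible version could look like. Everything in it is an assumption for illustration: `stub_product` is a hypothetical stand-in for your real model call, and the substring check is the crudest grader available, not a recommendation from any of the posts above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    must_contain: str  # substring a passing answer has to include


def stub_product(prompt: str) -> str:
    """Hypothetical stand-in for your AI product; swap in a real call."""
    canned = {
        "What is 2 + 2?": "The answer is 4.",
        "Capital of France?": "Paris is the capital of France.",
        "Opposite of hot?": "I'm not sure.",
    }
    return canned.get(prompt, "")


def run_evals(product: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    passed = 0
    for case in cases:
        output = product(case.prompt)
        ok = case.must_contain.lower() in output.lower()
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case.prompt}")
    return passed / len(cases) if cases else 0.0


# Illustrative cases only; real ones should come from someone who knows the product.
CASES = [
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("Capital of France?", "Paris"),
    EvalCase("Opposite of hot?", "cold"),
]

if __name__ == "__main__":
    print(f"pass rate: {run_evals(stub_product, CASES):.0%}")
```

The part that matters is the list of cases, not the harness: choosing what to measure is the bit you keep in-house.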

You can now branch chats in ChatGPT. Click on the three dots after any response to create a copy of your chat up until that message, where you can talk about something different. I assume this will be very useful for hashing out a product idea in a chat and then branching it once you want the model to code, create PRDs, etc.

We keep hearing about the mass movement from Claude Code to Codex, though no doubt that’ll reverse with another Claude model. If you want to get your hands on a new CLI tool that supports GPT/Claude/Gemini models, add a reply in this thread and I’ll get you access.

An early tester had this to say: “am i allowed to tweet about the cli yet? it definitely feels much nicer than codex for gpt-5 agents”.

Kimi K2 has a new variant that’s better than Opus 4.1 on Terminal-Bench and as good as Sonnet 4 across other software engineering benchmarks.

OpenRouter has two new stealth models - Sonoma Sky Alpha and Sonoma Dusk Alpha. Both have a 2M context window and are likely as good as Opus and Sonnet. The guess is that they are from xAI.

MCPs now have an official Registry - an open catalogue and API for publicly available MCP servers to improve discoverability and implementation. Smithery (portco) now supports MCPs hosted from anywhere. Listen to Henry (founder) talk about MCP.

How are you evaluating your AI outputs? Learn how the experts quickly and accurately evaluate AI using LLM judges. Enjoy 70 pages of content on how to automate evaluations using advanced techniques, including practical frameworks for building your own LLM judges. Get the free eBook!*

*sponsored

🌐 What I’m consuming

How to code with Droids - step-by-step guide for how artists, designers, writers, and more can create software.
How we built an interpreter for Swift.
Build an AI life co-pilot with Claude Code in 25 minutes.
Using linters to direct agents.
In the age of AI, young founders aren’t waiting to grow up.
The bear and bull case for local models in just 4 basic graphs.

⚙️ Tools to tinker with

Fenic - OSS PySpark-inspired DataFrame library for LLMs. Run semantic joins and batch inference, and transform markdown & transcripts into insights.*
Operate - Precision-built CRM designed for sales and built for founders.
Whisper - Desktop AI that sees your screen and delivers everything proactively.
Notte - Build and deploy agents that work on the web without breaking.
Compound by Groq - Use open models with a complete set of agentic tooling, including web search, code execution, browser automation and more.
Oasis 2.0 - Re-skin Minecraft in real-time, 1080p, 30 fps.
NotebookLM got Flashcards, Quizzes, more report templates and new voices for Audio Overviews.
Rork for iOS - make apps from your phone (launched on Product Hunt)
List of mini tools for everyday work (by Simon Willison)

*sponsored

🥣 Dev dish

BuildKit 2.0 - shadcn for AI tools. Build AI tools and MCPs in minutes.
Twiggy - Let your Cursor agent see your entire codebase's structure in real-time.
OpenAPI to MCP - Convert any server described with OpenAPI into an MCP endpoint!
Open-source example of an end-to-end vibe-coding platform. (demo)
SemTools - a toolkit for parsing and semantic search in the CLI. (read more)
Stagehand Agent (browser automation) can now use MCP tools. (examples)
Cursor will soon support custom /slash commands.
Codex CLI now has web search. Enable it with the --search flag.

🍦 Afters

Story Arc Engine - Tweak parts of a story to see how plot changes trickle down to the rest of the story.
Nano-banana browser - Generate websites (screenshots) based on the URLs.
OpenAI is
a) planning to build a job platform to match talent with businesses using AI
b) backing a full-length animated film to tap into Hollywood.
sfcompute is hiring systems & networking engineers.
I hate my friend - review of the “friend” pendant by Wired.
A new startup, Alterego, claims it can capture silent speech (mouthing) and let you take notes, set reminders or talk to other people using their device.

Enjoy this newsletter? Forward it to a friend.

That’s it for today. Feel free to comment and share your thoughts. 👋

Find me on X, LinkedIn, or Instagram
Read about me and Ben’s Bites

📷 thumbnail creds: @keshavatearth
