agents need context engineers

the lethal trifecta and a simple way to boost performance

Jun 17, 2025

I write a newsletter about startups and investing—for ai builders of all levels.

I record mini-tutorials, review tools I’m testing, and share my insights as an exited founder turned investor.

Hey folks,

I’m just about ‘caught up’ from my week away and kids bday - so just the regular swarm of updates to cover today 🙃

I met Jeff for dinner a couple months ago and he’s one of the clearest thinkers on AI computing imo. He wrote a short piece: why AI is a new computer. Being a new computer means everything computers do today can be re-built (related pod with Marc Andreessen).

I’m still using Dia from The Browser Company. Nick works there and showed off their ‘script’ prototype, which lets you take actions on sites, e.g. play 4 YouTube videos at once or hide the sidebar on Twitter. Reminds me of Arc’s ‘boosts’ feature but better. They still need MCPs and tool-use but it’s easy to see where this ‘ai-native’ browser is going.

Anthropic and Cognition both put out posts on building multi-agent systems. Cognition said it’s hard for coding, Anthropic said it’s good for research. Both are worth a read, and one important thing to note is context engineering:

In 2025, the models out there are extremely intelligent. But even the smartest human won’t be able to do their job effectively without the context of what they’re being asked to do. “Prompt engineering” was coined as a term for the effort needing to write your task in the ideal format for a LLM chatbot. “Context engineering” is the next level of this. It is about doing this automatically in a dynamic system. It takes more nuance and is effectively the #1 job of engineers building AI agents.
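To make "doing this automatically in a dynamic system" a bit more concrete, here is a minimal sketch of one way to assemble context for an agent. It is not taken from either post: `ContextItem`, `build_context`, the priority scheme and the character budget are all hypothetical names and choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    source: str    # hypothetical label, e.g. "task", "doc", "history"
    text: str
    priority: int  # lower number = more important


def build_context(items, budget_chars=2000):
    """Pack the most important items into one prompt string.

    Assumption: a plain character budget stands in for a real token
    budget, and the separators between blocks are not counted.
    """
    chosen = []
    used = 0
    for item in sorted(items, key=lambda i: i.priority):
        block = f"[{item.source}]\n{item.text}"
        if used + len(block) > budget_chars:
            continue  # this item does not fit, try the next one
        chosen.append(block)
        used += len(block)
    return "\n\n".join(chosen)


items = [
    ContextItem("history", "x" * 50, priority=3),
    ContextItem("task", "Fix the login bug", priority=1),
    ContextItem("doc", "y" * 5000, priority=2),
]
prompt = build_context(items, budget_chars=200)
```

In this toy run the task goes first, the oversized doc is dropped, and the short history still fits. A real system would decide what to retrieve, summarise or drop at every step, which is where the nuance comes in.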

What happens when agents have access to your private data, untrusted web content, and the ability to talk to external services (i.e. using MCPs)? A lethal trifecta for security. We need to take prompt injection seriously before it becomes this era’s version of the computer virus nightmare of the 2000s. Here’s Karpathy’s take.
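As a rough illustration of the trifecta, here is a small sketch that flags an agent whose combined tools cover all three capabilities. The `Tool` class, its flags and the example tool names are hypothetical. This is a way to think about the risk, not a real security control.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    # All fields are hypothetical labels you would assign by hand.
    name: str
    reads_private_data: bool = False
    ingests_untrusted_content: bool = False
    communicates_externally: bool = False


def has_lethal_trifecta(tools):
    """True if the tool set, taken together, covers all three capabilities."""
    return (
        any(t.reads_private_data for t in tools)
        and any(t.ingests_untrusted_content for t in tools)
        and any(t.communicates_externally for t in tools)
    )


agent_tools = [
    Tool("inbox_reader", reads_private_data=True),
    Tool("web_browser", ingests_untrusted_content=True),
    Tool("http_post", communicates_externally=True),
]
risky = has_lethal_trifecta(agent_tools)       # all three present
safer = has_lethal_trifecta(agent_tools[:2])   # no way to send data out
```

The point of checking the set rather than each tool: no single tool here looks dangerous, but together they let injected instructions read private data and send it somewhere else.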

🔎 News worth knowing

Codex can now generate multiple responses for a single task. This "best of N" approach is a big deal. You see it in benchmarks all the time—a model's performance jumps if it gets 8 or 64 attempts at a problem. This is now rolling out directly in the product. And the rumour is that o3-pro is “best of 10 responses from o3”.
- The general trend of better performance with the best of N method is a signal that we're not hitting a wall with model progress; we can still squeeze out better performance and then distil that capability into a new model that gets it right on the first try.
- OpenAI also shipped some improvements to Projects and Search in ChatGPT. You can run Deep Research in a project now, and even in normal chats, ChatGPT references past chats from that project. And ChatGPT can search using images as well now.
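For the best-of-N item above, here is a minimal sketch of the idea. It is not how Codex or o3-pro work internally, which isn't public. `generate` and `score` are hypothetical stand-ins for a model call and a grader, and `pass_at_n` assumes independent attempts plus a grader that always recognises a correct answer.

```python
def best_of_n(generate, score, n):
    """Sample n candidates and keep the one the scorer likes most."""
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=score)


def pass_at_n(p, n):
    """Chance that at least one of n independent attempts succeeds,
    given a single-attempt success rate p."""
    return 1 - (1 - p) ** n


# Toy stand-ins: "answers" are numbers, the grader prefers values near 10.
pick = best_of_n(lambda i: i * i, lambda x: -abs(x - 10), n=5)

# A 20% single-shot success rate becomes roughly 83% with 8 attempts.
rate = pass_at_n(0.2, 8)
```

The arithmetic shows why benchmark scores jump with more attempts. The hard part in practice is the scorer, since picking the best candidate is only as good as whatever judges it.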
Keynotes from the AI Engineer World Fair - the best place to get a grip on the current state of AI engineering.
Box, a leader in Intelligent Content Management, recently surveyed 1,300+ IT leaders to see what is truly happening in AI and packaged its findings into Box’s State of AI in the Enterprise report. Download the report to learn more about how AI is rapidly transforming businesses across industries.*

*sponsored

want to partner with us? Click here

🌐 What I’m consuming

A breakdown of bad AI writing patterns and what gets wrongly flagged as AI-generated.
If you’re still not on the Claude Code train, give this guide a read, but if you’re already burning tokens, here’s how to push it to its limits for more complex tasks.
How OpenAI's head of business products uses ChatGPT to save time at work.
Cursor’s CEO with Garry Tan. I like the part where Michael talks about niche software opportunities.
A conversation with the creators of the Model Context Protocol (MCP).
Why we want robots at work, but humans in art.
Future of work with AI agents, based on a study of 1500 workers across 104 occupations.
According to a new Gallup poll, the number of workers who say they use AI at work has nearly doubled in the past year.

⚙️ Tools I’m looking into

new.website - Text-to-app is hard, but text-to-website creation is a relatively easier problem. This tool is going after website creation, with the tricky parts where AI still struggles patched with features like built-in functional forms, Zapier integration, SEO, and a CMS.
Granola for Windows - the AI note-taking app that everyone loves and uses (incl me) is now available on Windows (the link will only show if viewing on a Windows device fyi)
Sketch - An agentic coding tool that runs in your terminal but also has a web UI. It understands your codebase and helps you get work done, with the best support for Go.
RunwayML just introduced a chat mode. You can now generate and edit images, add references, and even work with video through a conversational interface instead of clicking around.
Hailuo 02 - A new text-to-video model that's up there with Veo 3 and Seedance, and it's especially good at image-to-video tasks. (try here)
Tool idea: Figma, but for data visualization. Upload a CSV, describe the design you want, and get a chart. A tool like Julius AI could do this, but the market is big enough for dedicated players.
Chorus - You can now import your chat history from ChatGPT and Claude into this macOS AI app. This is really interesting. Now, who is going to make a tool to get editable/portable memories out of your ChatGPT history?

🥣 dev dish

Native container support in macOS - This is a big deal because it could replace the need for tools like Docker for many developers (and vibe coders), simplifying local development workflows.
Claude Squad - A terminal app to manage multiple Claude Code instances in separate workspaces, so you can work on several tasks at once.
Zen MCP Server - An MCP server that lets Claude orchestrate other models like Gemini, o3, and anything on OpenRouter.
Kimi-Dev-72B - A 72B open-source model from Moonshot AI, specifically for resolving GitHub issues.
JAN nano - A tiny 4B parameter model built for deep search tasks using MCP.
A free course on how to build AI agents from Mastra AI, available via MCP.
miniDiffusion - re-implementation of Stable Diffusion 3.5 in pure PyTorch.
cursorkleosr - Memory for Cursor.

🍦 Afters

A few portfolio-related shout-outs:
- SpeedTrials.ai is holding an event for software engineers in SF on June 28th, to see how quickly developers can ship with AI.
- Pointer is building the interface between humans and software: a copilot that lives inside apps to guide, answer, and act. They just closed their seed led by Amplify and great angels (incl me!), and are already helping fast-growing companies like Rho and Delve. They’re looking for a founding engineer to join them in SF and help change how people interact with software. More info here.
- General Bionix is building software for programming robots (Cursor for robotics, if you will). They’re looking for a cracked engineer in SF with experience in deep robotic manipulation and 3D vision, evidenced by work history, side projects and education. Reach out to Vaishak.
- Ali Rowghani (former Pixar, Twitter COO & YC growth program) is launching First Harmonic, an 8-week cohort to help early-stage companies nail their go-to-market strategy. Ali has been an awesome LP and I’ve recommended several portco founders join his program.
Dalton Caldwell is leaving YC to launch Standard Capital, an AI-native Series A firm, with Paul Buchheit and Brian Berg.
A16Z speedrun is coming to London on July 2nd. I might be there.
The Xeno Demo Day is giving $15,000 grants to autonomous AI agents in a four-week, AI-native program.
A member of our community is running free AI workshops for educators this summer.
Tensions are reportedly reaching a boiling point between OpenAI and Microsoft.

That’s it for today. Feel free to hit reply and share your thoughts. 👋

Enjoy this newsletter? Please forward to a friend.

Find me on X, LinkedIn, or Instagram
Read about me and ben’s bites