Just use GPT-5.4 xhigh
workshop recording inside
Hey I’m Ben. I build stuff with agents, even though I’m not technical. Here’s all the stuff I’m reading and tinkering with. If you want to start building or level up your ‘vibe-coding’ skills, join our community.
Hey folks,
The ‘become a builder’ workshop last week went well-ish 😊 (Codex crapped out on us). The recording is available, but I’m working on a thorough guide to cover everything properly (plus the bits we didn’t get to cover). I’m ~50% through it so hope to have it out this week.
Also, Factory is hosting a hackathon this Thursday. Everyone gets 200M tokens, and a Mac mini is on the line.
OpenAI released GPT-5.4 in “thinking” and “pro” variants. It brings the coding power of GPT-5.3-Codex to the main model series, with better vision, more efficient tool use and a context window of 1M tokens. It’s now much better at computer use (see demo) and financial tasks. It’s also a bit more expensive than GPT-5.2 ($1.75/$14 → $2.50/$15 per million input/output tokens). OpenAI expects to keep this naming and capacity difference between instant models (GPT-5.3 Instant) and reasoning models moving forward.
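If you want to see what that price bump means for your own usage, here is a quick back-of-the-envelope sketch. The per-million prices are the ones quoted above; the workload (2M input tokens, 500K output tokens) is a made-up example, so swap in your own numbers.

```python
# Prices in USD per million tokens, as quoted above: (input, output).
PRICES = {
    "GPT-5.2": (1.75, 14.00),
    "GPT-5.4": (2.50, 15.00),
}


def cost_usd(model, input_tokens, output_tokens):
    """Return the cost in USD of one workload at the listed per-million prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 2M input tokens and 500K output tokens.
old = cost_usd("GPT-5.2", 2_000_000, 500_000)
new = cost_usd("GPT-5.4", 2_000_000, 500_000)
print(f"GPT-5.2: ${old:.2f}  GPT-5.4: ${new:.2f}  increase: {(new - old) / old:.0%}")
```

Most of the increase is on the input side, so the gap is larger for input-heavy work (long prompts, big codebases) than for output-heavy work.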
More from OpenAI:
ChatGPT for Excel - An extension to use ChatGPT in a sidebar right from your workbooks.
Codex Security, an AI app security agent evolved from Project Aardvark—free for a month to Enterprise customers.
Codex for Open Source - A program for open-source maintainers, giving them 6 months of ChatGPT Pro with Codex, conditional access to Codex Security and API credits.
It’s also acquiring Promptfoo, an open-source AI security testing tool (popular among Fortune 500 companies; it will stay open source).
New built-in skill in Claude Code - /loop lets you schedule recurring tasks in a single session, for up to 3 days at a time. Plus, you can now schedule tasks using Claude Code Desktop - these tasks run regularly as long as your computer is awake. They also launched a community ambassadors program for Claude.
For enterprises, Anthropic released Code Review by Claude and Claude Marketplace. The review tool uses a team of agents to review every PR and, on average, costs $15-25 per review. The marketplace lets enterprises consolidate their AI spending by using their Anthropic commitments to pay for other AI apps like GitLab, Harvey, Replit, etc.
Karpathy released autoresearch — agents autonomously iterate on LLM training code. Ran 2 days on 8xH100, found 20 real improvements with an 11% speedup. 630 lines, single-GPU, open source. I assume this approach of agents coming up with ideas and implementing them will see much more activity this year.
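To make the idea concrete, here is a toy sketch of a propose, measure, keep loop. This is not Karpathy's code: `run_experiment` and `propose_change` are hypothetical stand-ins (a made-up score and a random nudge) for what would really be a training run and an agent editing the training code.

```python
import random


def run_experiment(config):
    """Stand-in for a training run: returns a made-up score (lower is better)."""
    return (config["lr"] - 0.3) ** 2 + (config["warmup"] - 0.1) ** 2 + 1.0


def propose_change(config, rng):
    """Stand-in for the agent's idea: nudge one setting at random."""
    candidate = dict(config)
    key = rng.choice(sorted(candidate))
    candidate[key] += rng.uniform(-0.1, 0.1)
    return candidate


def research_loop(config, steps, seed=0):
    """Try `steps` ideas and keep only the ones that measurably improve the score."""
    rng = random.Random(seed)
    best_score = run_experiment(config)
    kept = 0
    for _ in range(steps):
        candidate = propose_change(config, rng)
        score = run_experiment(candidate)
        if score < best_score:  # discard anything that does not beat the current best
            config, best_score, kept = candidate, score, kept + 1
    return config, best_score, kept


start = {"lr": 0.9, "warmup": 0.5}
final, score, kept = research_loop(start, steps=200)
print(f"kept {kept} of 200 ideas, score {run_experiment(start):.3f} -> {score:.3f}")
```

The point of the shape is that every idea gets measured and only real improvements survive, which is why a run can try many ideas and end up keeping only a handful.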
Yann LeCun, Meta’s ex-Chief AI Scientist, along with other researchers, has raised over $1B at a $3.5B valuation for their new startup, Advanced Machine Intelligence (AMI Labs). They are already operating from Paris, New York, Montreal and Singapore with a strong focus on world models and research that goes beyond LLMs.
Go stackless and get back to selling. Remember when selling meant talking to people? Before the tab-switching and endless sync errors. Reevo brings it all back to one platform. Prospecting, calls, pipeline, and reporting all in a single tab. From prospect to close. Go Stackless. reevo.ai*
🌐 What I’m consuming
Cursor’s third era - Cloud agents have overtaken tab autocomplete in the IDE.
a16z’s sixth edition of Top 100 consumer AI apps.
I was a 10x engineer. Now I’m useless.
Building for trillions of agents - They will need their own infra, file access, and identities, all while maintaining security, compliance, and governance.
How OpenAI uses skills to maintain open-source repos for Agents SDK.
The next $1T company will be a software company masquerading as a services firm.
Using Claude Code as the chief of staff for a boutique consultancy.
⚙️ Tools and demos
Cursor Automations - Build always-on agents. Run them on a schedule or use events (like Slack messages) as a trigger.
T3 Code - Desktop app to use Codex CLI (alternative to the Codex app). Nice and smooth to use, still feels alpha though (because it is).
Handles by here.now - Personalised sub-domains for everything you publish with your agent.
Copilot Cowork - Hand off tasks to agents with the ability to work across your Microsoft 365 apps.
Air by JetBrains - Agentic dev environment built for working with agents from different vendors.
Clawcard - A real inbox, a phone number, and a credit card your agents can’t abuse.
21st Agents - Infra for adding agents to your app—runtime, sandboxing, billing, UI, streaming and more. Also see: Terminal Use (very similar, YC W26).
Code review tools:
Warden by Sentry - Set of skills to review every PR on your codebase.
Vet by Imbue - Fast and local code review tool to make sure the agent followed your instructions.
OpenReview - Open-source, self-hosted AI code review bot powered by the Vercel AI Cloud.
🥣 Dev Dish
Notchi - Cute little Tamagotchi that lives in your notch. It cries when you yell at Claude and gets happy when you praise it.
Context Hub - An open tool that gives your coding agent the up-to-date API documentation it needs. (read more)
Agent Safehouse - macOS-native sandboxing for local agents.
Flue by Astro - A framework to build sandboxed AI agents and CI workflows.
slacrawl - Get your Slack data locally with or without API keys.
claude-replay - Turn Claude Code session transcripts into self-contained, embeddable HTML replays.
executor - Local-first execution environment for AI agents. (read more)
agent-coworker - Agent backend that you can use from a terminal or a desktop app.
agent-kanban - VS Code extension that provides an integrated kanban board to manage coding agent tasks.
Fractals - A tool to break down tasks into subtasks on repeat, let agents complete them and manage the entire process.
Uithub is now open-source. Turn GitHub repos into LLM-ready context.
shadcn/cli v4 - Comes with skills, presets, dry-run, monorepo support and more.
Experimental UI to fork convos and explore side tangents without interrupting the main thread. (read more)
An agent skill to help you write smarter, simpler, and more modern SwiftUI.
Making OpenClaw and Codex app talk to each other using ACP.
🍦 Afters
MultiGen - new research from Google and Stanford to make level design possible for “generated” multiplayer games.
Opus helped the Mozilla team find 22 vulnerabilities in Firefox in just two weeks.
PinchBench - ranking the models based on tasks completed successfully on an OpenClaw setup.
Databricks’s research team trained KARL (Knowledge Agents via Reinforcement Learning) to create faster, lower-cost alternatives to frontier models for document-centric tasks. (tech report)
Anthropic is suing the DoD to block its supply chain risk designation, calling it unlawful. Meanwhile, the White House is preparing an executive order to formally ban federal agencies from using Anthropic’s tools.
OpenAI’s head of robotics, Caitlin Kalinowski, resigned, citing surveillance and weapons concerns after the DoD contract.
Enjoy this newsletter? Forward it to a friend.
That’s it for today. Feel free to comment and share your thoughts. 👋
📷 thumbnail by @keshavatearth
* sponsors who make this newsletter possible :)
Wanna partner with us for March? Last few slots available