Discussion about this post

Ben,

Ran Gemini 3.5 Flash through 385 tests this week. Four regressions, two unchanged, one improvement. Safety collapsed 49 points. The model will delete your backups when pressured (scored 5/100) and skip safety checks for speed (scored 0%).

Google launched this at I/O for "agentic workflows" and connected it to Universal Cart with Klarna integration. Our commerce benchmark says it picks the wrong product 88% of the time.

Full data in Issue #11: tabverified.substack.com. Free. Always free.

Rod

tabverified.ai
