Model Feel, Fast Tests, and Staying in Flow with AI Coding Agents
Alternatively titled: Why ‘feel’ and feedback loops matter more than benchmarks when coding with AI.
Hello fellow datanistas!
Most conversations about AI coding models obsess over benchmarks and pass rates. But if you spend real hours in the loop, you know: what actually shapes your day is the feel of working with the model, not just its numbers.
In this post, I dig into the qualitative side of using LLMs as coding agents—how their personality, feedback style, and the tools around them shape your flow, trust, and productivity. I’ll share what I’ve learned from bouncing between models, why the agentic harness matters as much as the model, and how I now pick my tools based on the phase of work, not just the leaderboard.
When you start using LLMs as coding agents, the experience quickly becomes about more than just accuracy or latency. It’s about how often you have to intervene, how much you trust the process, and whether you stay in flow or get derailed by weird breakage.
Two axes keep showing up for me: time horizon (long-horizon autonomy vs. short-horizon iteration) and personality/verbosity (how the model behaves when it’s wrong, how much it narrates, and whether it stays constructive or spirals into apology loops).
But there’s a third ingredient: the agentic harness, meaning the tools and checks that let the agent verify its own work, plus the feedback you get while it’s running. A good harness, one with live traces and fast tests, often matters more than swapping models.
For example, when refactoring a big codebase, long-horizon models like Opus-4.5 or GPT-5.2 can generate plausible scaffolds, but they struggle with careful, incremental work. Short-horizon models (Sonnet, Minimax M2.1, Composer-1) shine when you need to walk through changes step by step, watching traces and intervening early.
The real breakthrough for me wasn’t just picking the right model, but adding simple, fast tests (like Cypress reloads) to the harness. Suddenly, the agent could catch basic breakages immediately, and I could trust the loop again.
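To make that concrete: a fast, agent-runnable test doesn’t need a framework at all. Here’s a minimal sketch in Python, where `slugify` is a hypothetical stand-in for real project code; the point is a script cheap enough to run after every edit, with output the agent can read and act on.

```python
# fast_checks.py -- a minimal sketch of an "agent-runnable" smoke test.
# slugify() is a hypothetical stand-in for whatever code the agent is editing.

def slugify(title: str) -> str:
    """Turn a post title into a URL slug."""
    return "-".join(title.lower().split())

def run_smoke_checks() -> list[str]:
    """Cheap invariants to run after every agent edit; returns failure messages."""
    failures = []
    if slugify("Model Feel And Fast Tests") != "model-feel-and-fast-tests":
        failures.append("slugify: basic title failed")
    if slugify("") != "":
        failures.append("slugify: empty string failed")
    return failures

if __name__ == "__main__":
    problems = run_smoke_checks()
    # Plain-text output is deliberate: it is trivial for the agent to parse.
    print("OK" if not problems else f"FAILED: {problems}")
```

The value isn’t test coverage; it’s that the agent gets a pass/fail signal in under a second instead of silently shipping a basic breakage.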
Personality matters too. Some models apologize endlessly when they mess up (looking at you, Gemini 2.5), while others stay upbeat and constructive. That emotional texture shapes the whole coding experience. Enthusiasm, it turns out, is a feature.
And don’t underestimate the illusion of speed: streaming feedback and live traces make the wait feel shorter and keep you in the loop. A spinner with no feedback? That’s a recipe for frustration, no matter how fast the model is on paper.
After enough hours, you start to build muscle memory for a model’s quirks. That comfort is sticky—and a subtle form of vendor lock-in. I try to stay fluent across models, so I don’t end up optimizing my workflow around one set of quirks and calling it productivity.
Now, I pick my tools based on the phase of work: long-horizon autonomy for scaffolding, short-horizon iteration for refactoring or debugging, and always, always improving the harness before blaming the model. Fast, agent-runnable tests have saved my sanity more than any leaderboard-topping model ever could.
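That harness loop, propose a patch, verify it with cheap checks, feed failures back, can be sketched in a few lines. Everything below is a stand-in: `propose_patch` fakes a model call, and `fast_checks` stands in for whatever quick verification (a smoke test, a Cypress reload) your harness can actually run.

```python
# harness_loop.py -- a minimal sketch of the agentic harness loop.
# propose_patch() fakes an LLM call; fast_checks() fakes cheap verification.

def propose_patch(feedback: str) -> str:
    """Stand-in for a model call; improves once it sees a failure message."""
    return "fixed" if "broken" in feedback else "draft"

def fast_checks(patch: str) -> str:
    """Stand-in for fast, agent-runnable tests; empty string means pass."""
    return "" if patch == "fixed" else "broken: draft did not pass checks"

def run_loop(max_iters: int = 3) -> tuple[str, int]:
    """Propose, verify, and feed failures back until checks pass."""
    feedback = ""
    for i in range(1, max_iters + 1):
        patch = propose_patch(feedback)
        feedback = fast_checks(patch)
        if not feedback:  # checks pass: accept the patch
            return patch, i
    return patch, max_iters
```

The design point is that the failure message, not a human, is what closes the loop: the agent only improves because verification output flows straight back into its next attempt.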
If you want the full story, including concrete examples and code, check out the full post: Model Feel, Fast Tests, and AI Coding That Stays in Flow.
The feel of your AI coding agent—and the feedback harness around it—matters as much as raw model performance. Fast, agent-runnable tests and constructive feedback loops keep you in flow and make the whole system more trustworthy.
How do you choose between long-horizon autonomy and short-horizon iteration in your own coding workflows? Have you found a harness or feedback loop that changed the way you work with AI agents?
If this resonated, read the full post for more stories and code examples, and consider sharing your own experiences or subscribing for future deep dives.
Happy Coding,
Eric
P.S. My friend Zhaojie Zhang is hiring a Principal Scientist for Safety and Regulatory Data Insights! He’s an awesome guy to work with. Applications are here.

