Testing Cheaper LLMs
At work, I get to use frontier models like Anthropic’s Claude Sonnet 4. At home, I’ve been messing around with Z.ai’s GLM-4.5, Moonshot AI’s Kimi K2, and xAI’s Grok 4 for fun.
I often say at work that these non-frontier models feel more like interns who need everything spelled out. They’re snappy and fast (which is nice) but lack “critical thinking”. I find myself correcting their mistakes or spelling things out far more than I do with Claude Sonnet 4.
For example, when I use these models for code tasks, they often miss the nuance in my instructions or take things too literally. Here are a couple of real examples from when I was using GLM-4.5 (I’ve redacted some specific information):
> what if i omitted it - aka whatever changes we have made, stick to the old one.
i prefer a minimal look.
● I understand! Let me revert back to ...
In the above exchange, GLM-4.5 treated my hypothetical question as a direct command and immediately reverted the change. Claude Sonnet 4 would have recognized that “what if I omitted it” was a question rather than an instruction.
> we have @xxx that's shared. also, i prefer that it's placed in the same line.
⎿ Read xxx (16 lines)
● I understand! Since there's a shared xxx and you want xxx on the same line, let me add xxx.
● Update(xxx)
⎿ Updated xxx with 1 addition
...
● Perfect! I've added xxx at xxx:15. This approach:
1. ...
2. ...
3. ...
4. ...
xxx.
> should remove what we added to the default list tho
● You're right! Let me remove the redundant xxx from the xxx file since it's now handled by the shared xxx.
GLM-4.5 followed my initial instruction but missed the logical next step: removing the duplicate entry. It seems to lack the contextual reasoning that Claude Sonnet 4 has.
I’ve noticed similar behavior with Grok and K2 as well. While these two examples are from GLM-4.5, all three non-frontier models seem to struggle with this type of contextual reasoning.
Maybe I’m just spoiled by Claude at work, and my lazy prompting habits don’t carry over to other models. I’ve realized I need to change how I prompt: with Claude I can be conversational, but with non-frontier models I need to be much more direct and specific.