AI is a winner-take-all race. A day might come when it's a monopoly. But even today, the best model is what everyone wants to use, at any given time.
And it is usually the case that the smartest AIs are better at almost everything than any AI that came before them.
People usually learn which model is the smartest through benchmarking sites such as LMArena, which ranks models based on user votes cast during head-to-head interactions with them.
As this essay goes out, Gemini 2.5 Pro is the best model, and if you've used it, you know it. It is remarkably more intelligent than any model that came before it, and if you've ever found a problem an LLM couldn't solve, especially in maths or programming, throw it at Gemini 2.5 Pro. As lots of people have noticed, Gemini may just be able to solve it.
The problem with believing the benchmarks is that there exists a minority of people in the world with no morals and no taste. Mark Zuckerberg is definitely on that list, and thus, despite Meta being one of the first to popularise open source releases of frontier models, its recent models are clearly not good. Use them for more than a few seconds and you'll understand not only why they're bad, but why they shouldn't be anywhere in the top 5, let alone hold the #2 position.
The practice is called "benchmark-maxxing": models are specifically trained during post-training to excel on benchmarks, while remaining mediocre at anything that doesn't closely resemble the benchmarks' problems.
There are rumours that Meta's models have been doing exactly that, which would explain why the models are bad while the benchmark numbers are good.
Dunking on Zuck's moral failings, even though there is a lot of substance there, isn't the point of this argument. The point is that benchmarks should act as a filter for the models you try yourself. The use cases of LLMs are so diverse that no benchmark can know exactly what your use case is, or how a model performs specifically for it.
The best model at any given time will often also be the best model for your use case: it's likely the biggest recently launched model, likely the most intelligent, and thus likely to outperform whatever you were using previously for that particular task.
There is no guarantee that the above will hold for your use case, though, and surprisingly, I'm writing this essay to convince you that older models, even models released last year, may just be better for you than the latest and greatest. The best way to test that is to go and talk to the models I mention in this essay.
II.
Model diversity is now high enough that some models are good at some things and bad at others, and the best model right now is worse at some things that a model from last year is still good at!
For a long time, I didn't believe that this was true. I believed that the most intelligent model should be able to beat every model that came before it at all tasks. For the most part, this is true. Bigger models are better than smaller models, which is why open source models haven’t caught up to the billion dollar training runs of big AI labs.
But the more I used LLMs, the more I felt this might not be the whole truth. There was something different about some LLMs, something unique in specific areas that none of the others matched, no matter how much better those others were on every metric.
Now I finally have proof that the lingering feeling was right.
There's a new benchmark called EQ-Bench which measures creative writing ability, emotional intelligence, long-form writing, etc., and looking at it feels like confirmation of what I've been suspecting all along.
You can visit the benchmark yourself to see how the models are tested and how they're evaluated across capabilities. To me, it feels like it should be as important as LMArena's rankings, because models aren't just supposed to be programming agents; they should be widely useful and pleasant to talk to.
The rest of this essay focuses on three models I think you should always check against your use case, whatever it is.
You may be surprised at what you see, so keep an open mind. The first model we'll talk about sits at the top of EQ-Bench, and anyone who has used it extensively can already guess which one it is.
III.
Claude 3.5 Sonnet.
Not Claude 3.7 Sonnet, which is supposed to be an improvement on Claude 3.5 Sonnet; it's good, but it's not as great as 3.5 itself.
While everyone was using ChatGPT, the smartest people in the world were tweeting about always using Claude, going to it for life advice and treating it as a therapist.
For a long time, it was the best model for programming, and Claude 3.7, to Anthropic's credit, is even better at programming than Claude 3.5.
But here, I'm talking specifically about Claude 3.5 Sonnet, which sits at #1 on EQ-Bench.
This means it has the highest emotional quotient of any model, despite being an older model from last year.
If you have ever used an AI to just talk through your feelings, or figure something out that required some emotional depth, even if it's just for brainstorming potential solutions, nothing comes close to Claude 3.5, in my experience.
GPT-4.5, a larger model with more world knowledge, might come close, as is evident from its similarly high ranking on the benchmark, but in my experience, GPT-4.5 seems too eager to please.
There was one moment when I asked both models how to respond to someone I was mildly upset with. My immediate, unthinking instinct was to just ignore the person, and GPT-4.5 gave me an intricate multi-day plan for properly executing this avoidance behaviour. Claude 3.5, a smaller model from a year ago, immediately told me that avoidance was not the right solution and that I should talk through the issue instead.
I'd expect GPT-4.5 to be better here, since it's a joy to use and objectively smarter than Sonnet, but I honestly don't know what the fix for this people-pleasing is.
One thing that definitely isn't the fix, for GPT-4.5 or any model that chooses to please you over telling you what's objectively best even when it's hard in the short term, is adding custom instructions like "Be disagreeable. Don't be afraid to disagree with me when you know I'm wrong." This doesn't work at all: it makes the model default-disagreeable, disagreeing with everything. Emotional depth and maturity, it seems, is not something custom instructions can fix.
It’s a shame that Claude 3.5 Sonnet may never be open source. Anthropic, despite having the talent and taste to build such a great model, is also ideologically an AI safety organisation that believes AIs may someday kill everyone on Earth, so I wouldn’t hold my breath for them releasing any open source models anytime soon.
What should be surprising from the benchmark is that an open source model sits at #3, and that’s actually my other recommendation, the model you should really consider checking out:
IV.
DeepSeek R1.
This is a gem of a model, available very cheaply, and simply reading its reasoning traces is fascinating on its own. It's not the best LLM if correctness is what you're after (though it's still the best open source model), but it's incredibly fun to talk with, and it's #1 in creative writing on EQ-Bench!
There's a natural playfulness to it that's not there in any other LLM, and reading its reasoning traces feels like actually reading someone's mind. If you've heard someone mention that LLMs are just predicting tokens, have them read through R1's reasoning traces. Yes, the model overthinks, a lot, but it still feels like someone deliberately reasoning through to arrive at an answer.
Maybe this uniqueness comes from all the Chinese data it was trained on, but it feels oddly powerful for its size, oddly more intelligent than it should be, even if it doesn’t get everything right.
Grok 3 (with thinking) is the model to use if you're actually after intelligent answers with fully visible reasoning traces, but those traces aren't very fun to read: Grok, despite being one of the best models right now, still produces content that "feels" like slop, overusing the words LLMs generally overuse. I don’t think this “slop” problem is going away anytime soon, so getting used to it is a good idea; it’s highly likely people will soon be reading so much AI-generated content that they’ll start speaking in the same “GPT-isms” themselves.
Anyway, Claude 3.5 and DeepSeek, alongside your model of choice (which should be Gemini 2.5 Pro right now, because of how great it is), should be on your virtual council of advisors. The council isn't needed for tasks like making a mermaid diagram of your codebase, programming problems, logic puzzles, or anything else that requires raw intelligence; for those, Gemini 2.5 Pro, OpenAI's o1 Pro and Grok 3 (with thinking) should be your go-to.
But if it's anything that requires emotional maturity or playful creativity - asking for personal advice, figuring out how to handle a conflict, or brainstorming the next possible plot lines or story beats in your story - that's where a council of Claude 3.5, DeepSeek and Gemini 2.5 Pro earns its place. I'd also be remiss not to at least mention GPT-4.5 and GPT-4o: the latter is still being updated to be more intelligent, and the former is a genuinely big model with more world knowledge than most LLMs on this list.
If you do want to see how unique a model's personality and capabilities can be, I’d recommend checking out Claude Opus and GPT-4 (Turbo): they’re still uniquely interesting on their own, even if they’re not as intelligent as even smaller open source models today.
The best way to use these models is through OpenRouter. You can even buy credits there using crypto, which makes it unique in the AI space. (Fun fact: the founder of OpenRouter was also the cofounder of OpenSea!) If you’d rather talk to LLMs from your phone, use Pal Chat on iOS with your OpenRouter key and you can reach any model on the go.
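Since OpenRouter exposes an OpenAI-compatible HTTP API behind a single key, polling a small council of models is a short script. Here's a minimal sketch in Python; the model slugs and the `OPENROUTER_API_KEY` environment variable are my assumptions, so check OpenRouter's model list for the exact IDs before using them:

```python
import json
import os
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

# Assumed model slugs for the council; verify the exact IDs on OpenRouter.
COUNCIL = [
    "anthropic/claude-3.5-sonnet",
    "deepseek/deepseek-r1",
    "google/gemini-2.5-pro-preview",
]

def build_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(model: str, prompt: str, api_key: str) -> str:
    """Send one prompt to one model via OpenRouter; return the reply text."""
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__" and os.environ.get("OPENROUTER_API_KEY"):
    key = os.environ["OPENROUTER_API_KEY"]
    question = "How should I handle a conflict with a friend I'm upset with?"
    # Ask every council member the same question and compare the answers.
    for model in COUNCIL:
        print(f"--- {model} ---")
        print(ask(model, question, key))
```

The nice part is that swapping a council member is just editing one string in the list; the key, endpoint and request shape stay the same across providers.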
Finally, keep an open mind!
What works for my use case may not work for yours, and the models I mentioned here may not be what you're looking for. These are my picks at the time of writing, and I'm sure a year from now things might look radically different. But what I’ve been observing is that humans, naturally tribal, tend to stick to one model and only ever move from the best model to the next, disregarding every other model.
As the pace of AI accelerates, it’s a much better strategy to consult a council of AIs for problems you face, and only then make the informed decision of which model (or ideally, groups of diverse models) is better for you.