Deepseek R1 vs OpenAI o1

Maxim Saplin

Deepseek R1 is out, available via the Deepseek API or the free Deepseek chat. If you follow the LLM/Gen AI space you've likely seen the headlines, read the posts, or watched the videos praising the model: a 671B MoE model, open weights, lots of detail on the training process. It challenges OpenAI's reasoning models (o1/o1-mini) across many benchmarks at a fraction of the cost... There are even smaller "distilled" versions of R1 available for local use (via llama.cpp/ollama/LM Studio etc.).

I have been stress-testing models with LLM Chess since autumn, and so far none of the "reasoning" (or "thinking") models has impressed me, except OpenAI's o1. I launched the benchmark right away, but I had to wait a few days to collect enough data (the API seemed throttled; it was extremely slow).

LLM Chess simulates multiple games of a random bot playing against an LLM: thousands of prompts, millions of tokens, and every game is unique (unlike most evals, which use fixed sets of prompts and pass conditions). Several metrics are collected and aggregated across multiple runs. A model is evaluated on reasoning (% of wins/draws) and steerability/durability (how often it fails to follow instructions or drops out of the game due to repeated erroneous replies).
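The harness itself isn't reproduced here, but the loop described above can be sketched roughly like this (the function names, retry limit, and outcome labels are my assumptions for illustration, not the actual LLM Chess code):

```python
import random

MAX_MOVES = 200     # hitting this cap scores an automatic draw
MAX_RETRIES = 3     # erroneous replies tolerated before the LLM drops out

def play_game(llm_move_fn, legal_moves_fn, apply_move_fn, game_over_fn):
    """One random-bot-vs-LLM game; returns (outcome, mistake count)."""
    mistakes, moves = 0, 0
    while moves < MAX_MOVES and not game_over_fn():
        apply_move_fn(random.choice(legal_moves_fn()))  # random bot moves first
        if game_over_fn():
            break
        for _attempt in range(MAX_RETRIES):             # ask the LLM, retrying
            reply = llm_move_fn(legal_moves_fn())
            if reply in legal_moves_fn():
                apply_move_fn(reply)
                break
            mistakes += 1                               # illegal or garbled reply
        else:
            return "llm_forfeit", mistakes              # fell out of the game loop
        moves += 1
    return "finished", mistakes
```

The key property this structure captures: a model can lose not only over the board but also by repeatedly replying with something unusable, which is exactly the failure mode the Mistakes metric counts.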

Reasoning Models

Before o1, LLMs couldn't beat a random player at chess. GPT-4o? Zero wins. Claude 3.5? Zero. They'd either collapse early or drag games to the 200-move limit (at which point an automatic draw is assigned).

Then came o1. OpenAI’s "reasoning" models broke the record:

  • o1-preview: 46.67% wins
  • o1-mini: 30% wins

Other "reasoning" models? After the o1 release in late 2024 came the controversy around OpenAI's secrecy: the hidden "reasoning" tokens (invisible yet billed) and reports of users being banned because OpenAI suspected them of trying to uncover its secrets. Around that time, AI labs started trying to reproduce o1's success with their own "reasoning" models, e.g. Qwen's QwQ and Sky T1. Even Google released its experimental Gemini Thinking model in December 2024.

None of the alternative "reasoning" or "thinking" models came close to OpenAI's. They struggled even with basic instruction following, drowning in verbosity and dropping out of the game loop after just a few moves: games lasted 2 to 14 moves on average. Compare that to the old, out-of-fashion, non-reasoning GPT-4 Turbo, which lasted 192 moves on average (before losing to the random player by checkmate :).

Those late-2024 non-OpenAI reasoning models turned out to be surrogates, which set my expectations for R1 low...

R1

Deepseek's reasoning model turned out to be the real deal: it scored a meaningful number of wins while keeping the number of mistakes modest.

| Model       | Wins   | Draws  | Mistakes | Tokens/move |
| ----------- | ------ | ------ | -------- | ----------- |
| o1-preview  | 46.67% | 43.33% | 3.74     | 2660        |
| o1-mini     | 30.00% | 50.00% | 2.34     | 1221        |
| Deepseek-R1 | 22.58% | 19.35% | 18.63    | 4585        |

Mistakes: the number of erroneous LLM replies per 1,000 moves.

R1 did well, but not great. Note how few draws it had compared to the o1 models. That's due to R1 breaking the protocol, violating prompt instructions, or hallucinating illegal moves (each scored as a loss). It struggles with instruction following, is susceptible to prompt variations, and falls out of the game loop at random.

For reference, here are the top non-reasoning models as of January 2025:

| Model                           | Wins ▼ | Draws  | Mistakes | Tokens/move |
| ------------------------------- | ------ | ------ | -------- | ----------- |
| anthropic.claude-v3-5-sonnet-v1 | 6.67%  | 80.00% | 0.27     | 80.42       |
| gpt-4o-2024-11-20               | 4.23%  | 87.32% | 0.15     | 50.58       |
| gpt-4-turbo-2024-04-09          | 0.00%  | 93.33% | 0.00     | 6.03        |
| anthropic.claude-v3-opus        | 0.00%  | 83.33% | 1.61     | 72.86       |

Reasoning Models - a League of their Own

Besides a significant number of wins, the reasoning models maintained a positive average material difference. Material count in chess is a weighted score of all the pieces on the board (e.g. a pawn is 1 unit of material and a queen is 9). Each player starts the game with a material count of 39. The eval computes the difference in material at the end of each game: if a player loses more material than it captures, the difference is negative. Non-reasoning models (and reasoning "surrogates") typically end up with a negative material diff, or one around 0 when they fail to progress in the game and break the loop.
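As a quick illustration of the metric, using the standard piece values with kings excluded (my own sketch, not the eval's code):

```python
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9}  # kings are not counted

def material(pieces):
    """Total material for an iterable of piece letters, e.g. 'pppnbrq'."""
    return sum(PIECE_VALUES.get(p.lower(), 0) for p in pieces)

def material_diff(my_pieces, opponent_pieces):
    """Positive means this player ended the game ahead on material."""
    return material(my_pieces) - material(opponent_pieces)

# A full starting side: 8 pawns, 2 knights, 2 bishops, 2 rooks, 1 queen = 39
START = "p" * 8 + "n" * 2 + "b" * 2 + "r" * 2 + "q"
```

So a model that, say, loses its queen while capturing nothing ends the game at -9: consistently positive averages near +10, as in the table below, mean the reasoning models were genuinely winning material over the course of their games.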

Here are the average material diffs at the end of the game:

| Model                              | Material Diff | Avg Game Duration (moves) |
| ---------------------------------- | ------------- | ------------------------- |
| o1-preview-2024-09-12              | 9.99          | 124.8                     |
| o1-mini-2024-09-12                 | 10.77         | 142.73                    |
| deepseek-reasoner-r1               | 10.83         | 91.77                     |
| anthropic.claude-v3-5-sonnet-v1    | -4.48         | 183.38                    |
| gpt-4o-2024-11-20                  | -8.23         | 189.72                    |
| qwq-32b-preview@q4_k_m             | -0.07         | 7.97                      |
| gemini-2.0-flash-thinking-exp-1219 | 0.00          | 2.33                      |

Distilled R1

I have also tested a few quantized versions of distilled R1. Deepseek fine-tuned several smaller Qwen 2.5 and Llama 3.1 models (70B, 14B, 8B, etc.) on the outputs of the full-size R1 model; supposedly they gained reasoning skills in the process. There's also a dedicated <think></think> section in the output that keeps all the reasoning tokens isolated from the final answer (something important the earlier thinking models missed).

They didn't do well:

| Model                               | Wins ▼ | Draws | Mistakes | Tokens  |
| ----------------------------------- | ------ | ----- | -------- | ------- |
| deepseek-r1-distill-qwen-32b@q4_k_m | 0.00%  | 0.00% | 727.27   | 2173.83 |
| deepseek-r1-distill-qwen-14b@q8_0   | 0.00%  | 0.00% | 1153.85  | 3073.06 |

Besides, I noticed these models sometimes failed to properly open and close the think tags (e.g. missing the opening <think>).
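Any code consuming such replies has to tolerate the missing opener. A minimal sketch of a tolerant parser (my own illustration, not Deepseek's or the benchmark's code):

```python
import re

def split_think(reply):
    """Split an R1-style reply into (reasoning, answer).

    Tolerates a missing opening <think> tag, which some distilled
    checkpoints emit: the reasoning then starts immediately and only
    the closing </think> is present."""
    closer = reply.find("</think>")
    if closer == -1:
        return "", reply.strip()  # no reasoning section at all
    # Strip the opening tag if it is there; keep the text either way.
    reasoning = re.sub(r"^\s*<think>\s*", "", reply[:closer])
    return reasoning.strip(), reply[closer + len("</think>"):].strip()
```

Keying the split on the closing tag rather than the pair is what makes malformed replies recoverable instead of being dropped as errors.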

P.S.

Google also dropped an update to Gemini Thinking the day after the R1 release!

It did much better than the December version! At least it is now steerable and can last ~40 moves in a game. They have also separated out the thinking part, so the response is no longer bloated with reasoning tokens. And yet, it is still a thinking surrogate...

| Model                               | Wins ▼ | Draws | Mistakes | Tokens |
| ----------------------------------- | ------ | ----- | -------- | ------ |
| gemini-2.0-flash-thinking-exp-01-21 | 0.00%  | 6.06% | 5.97     | 17.77  |
| gemini-2.0-flash-thinking-exp-1219  | 0.00%  | 0.00% | 1285.71  | 724.54 |

Curiously, most of the game dropouts happened due to server errors (e.g. some copyright filters) or empty completions - there are definitely stability issues with the model.


Top comments (14)

MarkAurit

I performed a moderately complex, real-life query against the US stock market using ChatGPT and DeepSeek. I only had one measure of success: an accurate return of information. ChatGPT did extremely well - it added a chart (not requested) and much more information than just the closing price; in other words, about what you get from Yahoo Finance stock queries. DeepSeek merely replied with "that information isn't in my system yet". I realize that DeepSeek is a tool in its relative infancy, and in time it will be just as powerful and useful as ChatGPT. But as of now, it is useless to me.
Query: what is the stock market price of AAPL

Sweet Knives

I used that same query in deepseek and this is the answer I get:

As of the latest available data in the search results (dated January 28, 2025), the closing stock price of Apple Inc. (AAPL) was $238.26, reflecting a +3.65% increase from the previous trading session . During extended trading hours on January 28, the price slightly fluctuated to $238.34 (+0.04%) . Pre-market data for January 29, 2025, shows a slight decline to $237.34 (-0.39%) as of 4:29 AM EST .

Key Details:

  1. Recent Performance:

    • The stock reached a daily high of $240.19 and a low of $230.81 on January 28, 2025 .
    • Over the past month, Apple’s stock has faced volatility, with a 1-month decline of -6.78% but a 1-year gain of +23.82% .
  2. Market Context:

    • Apple’s market capitalization stands at $3.58 trillion, maintaining its position as one of the largest publicly traded companies .
    • Analysts have expressed mixed sentiments, with concerns about iPhone demand in China and slower-than-expected adoption of its AI features ("Apple Intelligence") .
  3. Upcoming Catalyst:

    • The company is set to release its December 2024 quarter earnings on January 30, 2025, which could significantly impact the stock price depending on iPhone sales and AI-related updates .

For real-time updates, refer to financial platforms like Yahoo Finance or Google Finance.

Sounds stupid, but... are you sure you activated web search?

Rana Masoud

What I've experienced is that if you ask something political, especially regarding China, you'll observe that it gives a biased answer.

Whereas ChatGPT, even if you ask the same about America, gives neutral replies.

I think it won't last long if it stays biased towards a country or developer.

Mikhail

As a user I am annoyed. The Deepseek server is always busy.

AmtrakGuy2

Glad to know that I'm not the only one lol

Reeizer

^^^^ THIS

Better performance /= stability

I have the same problem with "always busy"

Ryan Pretorius

Host it locally :-) No more issues

Peter Truchly

My first thought was: why would anyone even try to play chess with an LLM? There are better "algorithms" for that.
But then again, in today's world where everybody expects AGI from every new LLM, why shouldn't it play chess, after all?

Where I see the problem (and a limitation) of the current approach is the misuse of the general-purpose reasoning capabilities of LLMs, which are undeniably there, but only in emergent form. What would the average person do if confronted with this task? Most of us would just use some software, a smaller portion would implement their own, and only a handful (of chess masters) would play by themselves.

Unless we equip these 'AI' models with a complete set of tools, an environment, and the ability to use workflow patterns to model and execute a workflow designed for a given task, results are going to be quite disappointing, at least for the near future.

Vinayak Mishra

Nice post. Had a question for you after seeing this: how good is Deepseek compared to others w.r.t. hallucinations? Just last night I was reading a piece on LLM hallucination detection.

Maxim Saplin

Assuming that hallucinations include cases where the model does not do what it is asked (e.g. skips an action) or decides to make a move that is not legal - R1 is quite bad. You can tell by the large number of mistakes and the low average game duration (you can see it by hovering over a row in the leaderboard).

Oscar

To be honest, I'm just happy there's a contender for "best AI model" that isn't made by a multi-billion-dollar tech corporation from the US (well, Deepseek's company might be worth billions now lol). All that being said, for the few times I actually use AI in my own work, I still use ChatGPT.

AmtrakGuy2

For me currently, DeepSeek has been extremely unhelpful and annoying. The servers are always busy. I understand that DeepSeek is still in its early stages, but the developers should have tested their product before publishing it. In my personal opinion, ChatGPT is actually the best tool to use professionally, and it has become my favorite. I do hope that DeepSeek continues to advance though. Maybe some day, DeepSeek will be just as powerful, if not more powerful, than ChatGPT. Cheers to a promising future!

Maxim Saplin

If the only complaint is Deepseek's web app availability (not the quality of responses), it's hard to blame the folks for not burning millions to serve the avalanche of user requests for free. Don't you think so?)

Juana Daniel

This is such an interesting topic! I'm really happy to hear about it. I absolutely Love it, and you should definitely check out more topics like this.

