Deepseek R1 vs OpenAI o1

Maxim Saplin

Deepseek R1 is out, available via the Deepseek API or the free Deepseek chat. If you follow the LLM/Gen AI space you've likely seen the headlines, read the posts, or watched the videos praising the model: a 671B MoE model, open weights, lots of detail on the training process. It challenges OpenAI's reasoning models (o1/o1-mini) across many benchmarks at a fraction of the cost... There are even smaller "distilled" versions of R1 available for local use (via llama.cpp/ollama/LM Studio etc.).

I have been stress-testing models with LLM Chess since autumn, and so far none of the "reasoning" (or "thinking") models has impressed me, except OpenAI's o1. I launched the benchmark right away, but I had to wait a few days to collect enough data (the API seemed throttled; it was extremely slow).

LLM Chess simulates multiple games of a random bot playing against an LLM: thousands of prompts, millions of tokens, and every game is unique (unlike most evals, which use fixed sets of prompts and pass conditions). Several metrics are collected and aggregated across multiple runs. A model is evaluated on reasoning (% of wins/draws) and steerability/durability (how often it fails to follow instructions or drops out of the game due to repeated erroneous replies).
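The harness itself isn't reproduced here, but the loop described above can be sketched roughly like this (the function names, retry limit, and outcome labels are my assumptions for illustration, not the actual LLM Chess code):

```python
import random

MAX_MOVES = 200     # hitting this cap scores an automatic draw
MAX_RETRIES = 3     # erroneous replies tolerated before the LLM drops out

def play_game(llm_move_fn, legal_moves_fn, apply_move_fn, game_over_fn):
    """One random-bot-vs-LLM game; returns (outcome, mistake count)."""
    mistakes, moves = 0, 0
    while moves < MAX_MOVES and not game_over_fn():
        apply_move_fn(random.choice(legal_moves_fn()))  # random bot moves first
        if game_over_fn():
            break
        for _attempt in range(MAX_RETRIES):             # ask the LLM, retrying
            reply = llm_move_fn(legal_moves_fn())
            if reply in legal_moves_fn():
                apply_move_fn(reply)
                break
            mistakes += 1                               # illegal or garbled reply
        else:
            return "llm_forfeit", mistakes              # fell out of the game loop
        moves += 1
    return "finished", mistakes
```

The key property this structure captures: a model can lose not only over the board but also by repeatedly replying with something unusable, which is exactly the failure mode the Mistakes metric counts.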

Reasoning Models

Before o1, LLMs couldn't beat a random player at chess. GPT-4o? Zero wins. Claude 3.5? Zero. They'd either collapse early or drag games to the 200-move limit (at which point an automatic draw is assigned).

Then came o1. OpenAI’s "reasoning" models broke the record:

  • o1-preview: 46.67% wins
  • o1-mini: 30% wins

Other "reasoning" models? After the o1 release in late 2024 came the controversy around OpenAI's secrecy: the hidden "reasoning" tokens (invisible yet billed) and reports of users being banned because OpenAI suspected them of trying to uncover its secrets. Around that time, AI labs started trying to reproduce o1's success with their own "reasoning" models, e.g. Qwen's QwQ and Sky T1. Even Google released its experimental Gemini Thinking model in December 2024.

None of the alternative "reasoning" or "thinking" models came close to OpenAI's. They struggled even with basic instruction following, drowning in verbosity and dropping out of the game loop after just a few moves: games lasted 2 to 14 moves on average. Compare that to the old, out-of-fashion, non-reasoning GPT-4 Turbo, which lasted 192 moves on average (before losing to the random player by checkmate :).

Those late-2024 non-OpenAI reasoning models turned out to be surrogates, which set my expectations for R1 low...

R1

Deepseek's reasoning model turned out to be the real deal: it scored a meaningful number of wins while keeping the number of mistakes modest.

| Model       | Wins   | Draws  | Mistakes | Tokens/move |
| ----------- | ------ | ------ | -------- | ----------- |
| o1-preview  | 46.67% | 43.33% | 3.74     | 2660        |
| o1-mini     | 30.00% | 50.00% | 2.34     | 1221        |
| Deepseek-R1 | 22.58% | 19.35% | 18.63    | 4585        |

Mistakes: the number of erroneous LLM replies per 1,000 moves.

R1 did well, but not great. Note how few draws it had compared to the o1 models. That's due to R1 breaking the protocol, violating prompt instructions, or hallucinating illegal moves (each scored as a loss). It struggles with instruction following, is susceptible to prompt variations, and falls out of the game loop at random.

For reference, here are the top non-reasoning models as of January 2025:

| Model                           | Wins ▼ | Draws  | Mistakes | Tokens/move |
| ------------------------------- | ------ | ------ | -------- | ----------- |
| anthropic.claude-v3-5-sonnet-v1 | 6.67%  | 80.00% | 0.27     | 80.42       |
| gpt-4o-2024-11-20               | 4.23%  | 87.32% | 0.15     | 50.58       |
| gpt-4-turbo-2024-04-09          | 0.00%  | 93.33% | 0.00     | 6.03        |
| anthropic.claude-v3-opus        | 0.00%  | 83.33% | 1.61     | 72.86       |

Reasoning Models - a League of their Own

Besides a significant number of wins, the reasoning models maintained a positive average material difference. Material count in chess is a weighted score of all the pieces on the board (e.g. a pawn is 1 unit of material and a queen is 9). Each player starts the game with a material count of 39. The eval computes the difference in material at the end of each game: if a player loses more material than it captures, the difference is negative. Non-reasoning models (and reasoning "surrogates") typically end up with a negative material diff, or one around 0 when they fail to progress in the game and break the loop.
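As a quick illustration of the metric, using the standard piece values with kings excluded (my own sketch, not the eval's code):

```python
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9}  # kings are not counted

def material(pieces):
    """Total material for an iterable of piece letters, e.g. 'pppnbrq'."""
    return sum(PIECE_VALUES.get(p.lower(), 0) for p in pieces)

def material_diff(my_pieces, opponent_pieces):
    """Positive means this player ended the game ahead on material."""
    return material(my_pieces) - material(opponent_pieces)

# A full starting side: 8 pawns, 2 knights, 2 bishops, 2 rooks, 1 queen = 39
START = "p" * 8 + "n" * 2 + "b" * 2 + "r" * 2 + "q"
```

So a model that, say, loses its queen while capturing nothing ends the game at -9: consistently positive averages near +10, as in the table below, mean the reasoning models were genuinely winning material over the course of their games.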

Here are the average material diffs at the end of the game:

| Model                              | Material Diff | Avg Game Duration (moves) |
| ---------------------------------- | ------------- | ------------------------- |
| o1-preview-2024-09-12              | 9.99          | 124.8                     |
| o1-mini-2024-09-12                 | 10.77         | 142.73                    |
| deepseek-reasoner-r1               | 10.83         | 91.77                     |
| anthropic.claude-v3-5-sonnet-v1    | -4.48         | 183.38                    |
| gpt-4o-2024-11-20                  | -8.23         | 189.72                    |
| qwq-32b-preview@q4_k_m             | -0.07         | 7.97                      |
| gemini-2.0-flash-thinking-exp-1219 | 0.00          | 2.33                      |

Distilled R1

I have also tested a few quantized versions of distilled R1. Deepseek fine-tuned several smaller Qwen 2.5 and Llama 3.1 models (70B, 14B, 8B, etc.) on the outputs of the full-size R1 model; supposedly they gained reasoning skills in the process. There's also a dedicated <think></think> section in the output that keeps all the reasoning tokens isolated from the final answer (something important the earlier thinking models missed).

They didn't do well:

| Model                               | Wins ▼ | Draws | Mistakes | Tokens  |
| ----------------------------------- | ------ | ----- | -------- | ------- |
| deepseek-r1-distill-qwen-32b@q4_k_m | 0.00%  | 0.00% | 727.27   | 2173.83 |
| deepseek-r1-distill-qwen-14b@q8_0   | 0.00%  | 0.00% | 1153.85  | 3073.06 |

Besides, I noticed these models sometimes failed to properly open and close the think tags (e.g. missing the opening <think>).
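Any code consuming such replies has to tolerate the missing opener. A minimal sketch of a tolerant parser (my own illustration, not Deepseek's or the benchmark's code):

```python
import re

def split_think(reply):
    """Split an R1-style reply into (reasoning, answer).

    Tolerates a missing opening <think> tag, which some distilled
    checkpoints emit: the reasoning then starts immediately and only
    the closing </think> is present."""
    closer = reply.find("</think>")
    if closer == -1:
        return "", reply.strip()  # no reasoning section at all
    # Strip the opening tag if it is there; keep the text either way.
    reasoning = re.sub(r"^\s*<think>\s*", "", reply[:closer])
    return reasoning.strip(), reply[closer + len("</think>"):].strip()
```

Keying the split on the closing tag rather than the pair is what makes malformed replies recoverable instead of being dropped as errors.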

P.S.

Google also dropped an update to Gemini Thinking the day after the R1 release!

It did much better than the December version! At least it is now steerable and can last ~40 moves in a game. They have also separated out the thinking part, so the response is no longer bloated with reasoning tokens. And yet, it is still a thinking surrogate...

| Model                               | Wins ▼ | Draws | Mistakes | Tokens |
| ----------------------------------- | ------ | ----- | -------- | ------ |
| gemini-2.0-flash-thinking-exp-01-21 | 0.00%  | 6.06% | 5.97     | 17.77  |
| gemini-2.0-flash-thinking-exp-1219  | 0.00%  | 0.00% | 1285.71  | 724.54 |

Curiously, most of the game dropouts happened due to server errors (e.g. some copyright filters) or empty completions - there are definitely stability issues with the model.


Top comments (14)

MarkAurit

I performed a moderately complex, real-life query against the US stock market using ChatGPT and DeepSeek. I only had one measure of success: an accurate return of information. ChatGPT did extremely well - it added a chart (not requested) and much more information than just the closing price; in other words, about what you get from Yahoo Finance stock queries. DeepSeek merely replied with "that information isn't in my system yet". I realize that DeepSeek is a tool in its relative infancy, and in time it will be just as powerful and useful as ChatGPT. But as of now, it is useless to me.
Query: what is the stock market price of AAPL

Sweet Knives

I used that same query in deepseek and this is the answer I get:

As of the latest available data in the search results (dated January 28, 2025), the closing stock price of Apple Inc. (AAPL) was $238.26, reflecting a +3.65% increase from the previous trading session . During extended trading hours on January 28, the price slightly fluctuated to $238.34 (+0.04%) . Pre-market data for January 29, 2025, shows a slight decline to $237.34 (-0.39%) as of 4:29 AM EST .

Key Details:

  1. Recent Performance:

    • The stock reached a daily high of $240.19 and a low of $230.81 on January 28, 2025 .
    • Over the past month, Apple’s stock has faced volatility, with a 1-month decline of -6.78% but a 1-year gain of +23.82% .
  2. Market Context:

    • Apple’s market capitalization stands at $3.58 trillion, maintaining its position as one of the largest publicly traded companies .
    • Analysts have expressed mixed sentiments, with concerns about iPhone demand in China and slower-than-expected adoption of its AI features ("Apple Intelligence") .
  3. Upcoming Catalyst:

    • The company is set to release its December 2024 quarter earnings on January 30, 2025, which could significantly impact the stock price depending on iPhone sales and AI-related updates .

For real-time updates, refer to financial platforms like Yahoo Finance or Google Finance.

Sounds stupid, but... are you sure you activated web search?

Rana Masoud

What I've experienced is that if you ask something political, especially regarding China, you'll observe that it gives a biased answer.

Whereas ChatGPT, even if you ask the same about America, gives neutral replies.

I think it won't last long if it stays biased towards a country or developer.

Mikhail

As a user I am annoyed. The Deepseek server is always busy.

AmtrakGuy2

Glad to know that I'm not the only one lol

Reeizer

^^^^ THIS

Better performance /= stability

I have the same problem with "always busy"

Ryan Pretorius

Host it locally :-) No more issues

Peter Truchly

My first thought was: why would anyone even try to play chess with an LLM? There are better "algorithms" for that.
But then again, in today's world where everybody expects AGI from every new LLM, why shouldn't it play chess, after all?

Where I see the problem (and a limitation) of the current approach is the misuse of the general-purpose reasoning capabilities of LLMs, which are undeniably there, but only in emergent form. What would the average person do if confronted with this task? Most of us would just use some software, a smaller portion would implement their own, and only a handful (of chess masters) would play by themselves.

Unless we equip these 'AI' models with a complete set of tools, an environment, and the ability to use workflow patterns to model and execute a workflow designed for a given task, results are going to be quite disappointing, at least for the near future.

Vinayak Mishra

Nice post. Had a question for you after seeing this: how good is Deepseek compared to others w.r.t. hallucinations? Just last night I was reading a piece on LLM hallucination detection.

Maxim Saplin

Assuming that hallucinations include cases where the model does not do what it is asked (e.g. skips an action) or decides to make a move that is not legal - R1 is quite bad. You can tell by the large number of mistakes and the low average game duration (you can see it by hovering over a row in the leaderboard).

Oscar

To be honest, I'm just happy there's a contender for "best AI model" that isn't made by a multi-billion-dollar tech corporation from the US (well, Deepseek's company might be worth billions now lol). All that being said, for the few times I actually use AI in my own work, I still use ChatGPT.

AmtrakGuy2

For me currently, DeepSeek has been extremely unhelpful and annoying. The servers are always busy. I understand that DeepSeek is still in its early stages, but the developers should have tested their product before publishing it. In my personal opinion, ChatGPT is actually the best tool to use professionally, and it has become my favorite. I do hope that DeepSeek continues to advance though. Maybe some day, DeepSeek will be just as powerful, if not more powerful, than ChatGPT. Cheers to a promising future!

Maxim Saplin

If the only complaint is Deepseek's web app availability (not the quality of responses), it's hard to blame the folks for not burning millions to serve the avalanche of user requests for free. Don't you think so?)

Juana Daniel

This is such an interesting topic! I'm really happy to hear about it. I absolutely Love it, and you should definitely check out more topics like this.

