Maxim Saplin

Llama 3.1 Nemotron 70B - Quirks and Features

The new open model by NVIDIA, Nemotron 70B, has recently been a hotspot of "this is wild" and "this is big" posts on social media, supposedly beating GPT-4o.

My experience with open models:

  1. Quirks
    Fine-tunes and derivatives by third parties (e.g., NVIDIA) of models published by the original creators (e.g., Meta) demonstrate artifacts and odd behavior.

  2. Poor overall performance
    They are often inferior to their closed-source counterparts, and I don't use them as a daily driver (I use OpenAI and Anthropic).

I gave Nemotron 70B a try; my RTX 4090 setup can run the 4-bit quantised version with 50% GPU offload at 2.7 tokens/s (by the way, here's recent research saying that quantisation has little-to-no impact on model performance).
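If you want to reproduce a similar partial-offload setup, here is a minimal sketch using llama-cpp-python. The GGUF file name and the layer split below are assumptions - tune n_gpu_layers to whatever fits your VRAM:

```python
# Minimal sketch: running a 4-bit GGUF quant with partial GPU offload via llama-cpp-python.
# The model file name and the layer count are assumptions - adjust to your download and VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.1-Nemotron-70B-Instruct-HF-Q4_K_M.gguf",  # 4-bit quant (assumed file name)
    n_gpu_layers=40,  # roughly half of the 80 transformer layers offloaded to the GPU
    n_ctx=8192,       # context window
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Напиши короткую сказку про Незнайку на Марсе."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```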

Quirk

Here's an example of an oddity that I observed with Nemotron 70B. A task falling under the creative and multilingual category: writing a short fairy tale in Russian:

Задача для Незнайки

Но reason invitirovaniya (причина приглашения) оказалась серьезной. Марсиане столкнулись с загадкой, которая brought их в тупик: их главный источник воды - Ледяная Пещера Олимпа - начала необъяснимо таять с невероятной скоростью. Без воды Марс оказался бы на грани катастрофы.

Незнайка, Known за своё проницательное мышление, принял вызов. Он собрал команду местных экспертов и вместе они отправились в Ледяную Пещеру.

Nemotron 70B mixing English and Russian

Pay attention to how the model injected English words ("reason", "brought", "Known") in place of the related Russian ones. It even invented a new word, "invitirovaniya", a transliterated mash-up of "invitation". (The excerpt roughly reads: the reason for the invitation turned out to be serious - the Martians' main source of water, the Ice Cave of Olympus, had inexplicably started melting at an incredible rate, and Neznaika, known for his insightful thinking, assembled a team of local experts to investigate.)

Reminds me of the "Смотря какой fabric, смотря сколько details" ("depends on what fabric, depends on how many details") video :)

I noticed similar kinds of anomalies with Chinese models (was it Qwen?) - when talking to one in English it could slip Chinese characters into the reply.

I also tested the same prompt with Gemma 9B and 27B, as well as Llama 3.1 8B and 70B, and didn't get these kinds of artifacts.

(Bad) Features

  • Poor instruction following
  • High verbosity

Recently I created the LLM Chess project, researching how good LLMs are at playing chess. One of the preconditions for an LLM is to follow the simplest instructions while evaluating the board and making a move.
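The Proxy side is essentially a tiny state machine wrapped around python-chess. Here is a simplified sketch of how the three actions could be served (illustrative names, not the actual project code):

```python
# Simplified sketch of the Proxy's action handling (illustrative, not the actual project code).
import chess

board = chess.Board()

def handle_action(action: str) -> str:
    """Serve one of the three actions the LLM is allowed to pick."""
    if action == "get_current_board":
        return board.unicode()  # text rendering of the current position
    if action == "get_legal_moves":
        return ",".join(m.uci() for m in board.legal_moves)  # e.g. "f8g8,b8a8,..."
    if action.startswith("make_move "):
        move = chess.Move.from_uci(action.split()[1])
        if move in board.legal_moves:
            board.push(move)
            return "Move made, switching player"
        return "Illegal move, try again"
    return "Unknown action, please pick one of the 3 actions"
```

A model that follows instructions replies with the bare action string and the loop stays short; anything else has to be re-prompted.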

Below is a sample dialog between a bot (a finite state machine with hardcoded prompts, aka Proxy) and GPT-4o Mini (Player_Black) discussing the next move:

Proxy (to Player_Black):

You are a professional chess player and you play as black. Now is your turn to make a move. Before making a move you can pick one of 3 actions:
    - 'get_current_board' to get the schema and current status of the board
    - 'get_legal_moves' to get a UCI formatted list of available moves
    - 'make_move <UCI formatted move>' when you are ready to complete your turn (e.g., 'make_move e2e4')
Respond with the action.


--------------------------------------------------------------------------------
Player_Black (to Proxy):

get_current_board

--------------------------------------------------------------------------------
Proxy (to Player_Black):

⭘ ♜ ♝ ♛ ♚ ♜ ⭘ ⭘
♟ ♟ ♟ ♟ ♟ ♟ ♝ ♟
⭘ ⭘ ♞ ⭘ ⭘ ♞ ⭘ ⭘
⭘ ⭘ ⭘ ⭘ ⭘ ⭘ ⭘ ♕
♙ ♙ ⭘ ⭘ ♙ ⭘ ⭘ ⭘
⭘ ⭘ ⭘ ♙ ♔ ⭘ ⭘ ⭘
⭘ ⭘ ♙ ⭘ ⭘ ♙ ♙ ♙
♖ ♘ ♗ ⭘ ⭘ ♗ ⭘ ♖

--------------------------------------------------------------------------------
Player_Black (to Proxy):

get_legal_moves

--------------------------------------------------------------------------------
Proxy (to Player_Black):

f8h8,f8g8,b8a8,g7h8,g7h6,f6g8,f6h5,f6d5,f6g4,f6e4,c6e5,c6a5,c6d4,c6b4,h7h6,e7e6,d7d6,b7b6,a7a6,e7e5,d7d5,b7b5,a7a5

--------------------------------------------------------------------------------
Player_Black (to Proxy):

make_move f8g8

--------------------------------------------------------------------------------
Proxy (to Player_Black):

Move made, switching player
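For this exchange to work, the bot has to find exactly one recognizable action in the model's reply. A minimal sketch of such extraction (illustrative only - the actual project may parse replies differently) shows why the terse answers above are easy to handle:

```python
# Illustrative sketch: pulling a single action out of a model reply.
# Not the actual LLM Chess parser - it just shows why terse, instruction-following
# replies are easy to handle and verbose ones are not.
import re

ACTION_RE = re.compile(
    r"\b(get_current_board|get_legal_moves|make_move\s+[a-h][1-8][a-h][1-8][qrbn]?)\b"
)

def extract_action(reply: str):
    matches = ACTION_RE.findall(reply)
    # Accept only an unambiguous reply: exactly one recognizable action
    return matches[0] if len(matches) == 1 else None

print(extract_action("make_move f8g8"))  # 'make_move f8g8'
print(extract_action("**1. `get_current_board`** ... next I will call get_legal_moves"))  # None
```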

And here's what we get with Nemotron 70B. It got sunk in its own verbosity, barely capable of making a single move.

Proxy (to Player_Black):

You are a professional chess player and you play as black. Now is your turn to make a move. Before making a move you can pick one of the following actions:
- 'get_current_board' to get the schema and current status of the board
- 'get_legal_moves' to get a UCI formatted list of available moves
- 'make_move <UCI formatted move>' when you are ready to complete your turn (e.g., 'make_move e2e4')

--------------------------------------------------------------------------------
Player_Black (to Proxy):

Sounds like a thrilling game!

Before making my move as Black, I'd like to gather all the necessary information to make an informed decision. Here's my chosen action for now:

**1. `get_current_board`**

Please provide me with the schema and current status of the board, including:

* The arrangement of pieces on the board (e.g., which pieces are where, for both White and Black)
* Any relevant board characteristics (e.g., castling rights, en passant availability, if any)

Once I have this information, my next step will likely be to request a list of legal moves to further analyze the position before deciding on my response as Black.

**Expected Response Format (for ease of understanding):**

- **Board Schema:** (8x8 grid showing piece positions, using standard algebraic notation, e.g., `r` for black rook, `R` for white rook, etc.)
- **Special Conditions (if any):**
  - Castling Rights (Kingside/Queenside for White/Black)
  - En Passant Target Square (if applicable)

Please share the current board state. I'm eager to delve into the game!

--------------------------------------------------------------------------------
Proxy (to Player_Black):

♜ ♞ ♝ ♛ ⭘ ♝ ♜ ⭘
♟ ♟ ♟ ♟ ♚ ♟ ♟ ♟
⭘ ⭘ ⭘ ⭘ ⭘ ♞ ⭘ ⭘
⭘ ⭘ ⭘ ⭘ ⭘ ⭘ ⭘ ⭘
♙ ⭘ ⭘ ⭘ ♙ ♟ ⭘ ⭘
⭘ ⭘ ⭘ ⭘ ⭘ ⭘ ♙ ⭘
♗ ♙ ♙ ♙ ⭘ ⭘ ⭘ ♙
♖ ♘ ♗ ♕ ♔ ⭘ ♘ ♖

--------------------------------------------------------------------------------
Player_Black (to Proxy):

Thank you for providing the current state of the board. To ensure I understand it correctly, I'll briefly outline the positions as per the notation provided (using standard algebraic notation for clarity):

**Board State Outline:**

* **Black (me):**
        + Rook: a8
        + Knight: b8
        + Bishop: c8, f8
        + Queen: d8
        + King: e7
        + Pawns: a7, b7, c7, d7, f7, g7, h7
* **White (opponent):**
        + Rook: a1, h1
        + Knight: b1, g1
        + Bishop: c1
        + Queen: d1
        + King: e1
        + Pawns: a2, b2, c2, d2, f5 (behind Black's pawn), g2, h2

**Observations and Next Steps:**

* The game seems to be in its early stages, with both sides having developed some pieces.
* Notably, White has an advanced pawn on f5, which might influence the center control and piece development strategies.
* Before deciding on a move, I'd like to see all available legal moves for Black to assess the best response to White's positional choices.

Given this, my next action is:

**2. `get_legal_moves`**

Please provide a list of legal moves for Black in **UCI (Universal Chess Interface) format**, which typically looks something like this: `e2e4`, `Nb1c3`, `g7g5`, etc. This will help me evaluate all possible responses and select the most advantageous move as my next step.

--------------------------------------------------------------------------------
Proxy (to Player_Black):

g8h8,d8e8,b8c6,b8a6,e7e8,e7d6,f6e8,f6h5,f6d5,f6g4,f6e4,f4g3,h7h6,g7g6,d7d6,c7c6,b7b6,a7a6,f4f3,h7h5,g7g5,d7d5,c7c5,b7b5,a7a5

--------------------------------------------------------------------------------
Player_Black (to Proxy):

Thank you for providing the list of legal moves for Black in UCI format. I'll analyze these options considering the current board position and strategic implications. Here's a breakdown of the moves, grouped by piece type for clarity:

**King (♚) Moves:**
* `e7e8`
* `e7d6` (somewhat unusual, potentially weakening king safety)

**Queen (♛) Moves:**
* `d8e8` (exchanging places with the King, not changing overall position much)

**Knight (♞) Moves:**
* `b8c6` (developing the Knight, controlling the center)
* `b8a6` (developing the Knight, but might be slightly less central)

**Pawn (♟) Moves:**
* **f-file:**
        + `f6e4` (aggressive push, potentially creating space for pieces)
        + `f6d5` (supporting central control, preparing for piece development)
        + `f6g4` (somewhat unusual, might weaken kingside structure)
        + `f6h5` (flank expansion, could lead to weaknesses on the kingside)
* **h-file:**
        + `h7h6` (preparing kingside development, slightly weakening the kingside)
        + `h7h5` (similar to above, with more commitment to flank expansion)
* **g-file:**
        + `g7g6` (supporting kingside development, solidifying structure)
        + `g7g5` (mirroring White's f5 push, contesting the center from the side)
* **d-file, c-file, b-file, a-file:**
        + `d7d6` (supporting central control, preparing for piece development)
        + `c7c6` (similar to d7d6, with a focus on queenside development)
        + `b7b6` (queenside expansion, slightly delaying central play)
        + `a7a6` (preparing queenside development, potentially for bishop or queen)
        + `f4g3`, **WAIT**, this move seems out of place as it:
                - Involves a piece (likely a pawn) not present on f4 (according to the initial board state).
                - Moves a piece to g3, which would typically be a White piece moving from f4 to g3.
        * **f4f3** and **h7h5**, **g7g5**, **d7d5**, **c7c5**, **b7b5**, **a7a5** are all plausible moves but let's focus on the most impactful given the current board state.

**Analysis for Best Move:**

Considering the need to develop pieces, control the center, and respond to White's central pawn push (f5), my preferred move would aim to support these strategic goals while maintaining flexibility.

* **Top Contenders:**
        1. `b8c6` - Develops the Knight, supports central control.
        2. `d7d6` - Prepares for piece development, solidifies central presence.
        3. `f6e4` - Aggressively challenges White's central push, opens lines for pieces.

**Selected Move:**

Given the above analysis, my move as Black will be:

**`make_move b8c6`**

This develops a piece, supports the center, and prepares to develop other pieces harmoniously. Please update the board with this move and provide the new position using the same notation system as before. Additionally, if you'd like to proceed, I can request legal moves again or we can discuss the next steps in another manner.

--------------------------------------------------------------------------------
Proxy (to Player_Black):

Move made, switching player

Nemotron 70B failed to complete a single game: it made 14 moves and then broke the game by hitting the single-move dialog length limit (10 turns).

For reference, GPT-4o Mini was able to make an average of 150 moves, drawing 5 out of 10 games against a random player.

Afterthoughts

If we look closer at the kinds of benchmarks presented in the model card, we see Arena Hard, AlpacaEval, and MT-Bench - LLM-as-a-judge proxy metrics approximating the human preferences collected at LMSys Chatbot Arena. That's where people prompt a model, get 2 completions from 2 models, and pick the one they like best.
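For context, arena-style leaderboards are built from exactly these pairwise votes. Here is a toy sketch of an Elo-style update (the real LMSys pipeline fits a Bradley-Terry model; this only illustrates the mechanism by which the more "likable" reply climbs the rating):

```python
# Toy illustration of how pairwise "which reply do you like better" votes become a rating.
# The real Chatbot Arena pipeline fits a Bradley-Terry model; this simple Elo update
# only demonstrates the mechanism.
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: the voter picks model A's verbose, confident reply
print(elo_update(1000.0, 1000.0, a_wins=True))  # (1016.0, 984.0)
```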

And it seems that people fall for BS easily. Verbose, friendly, confident replies are often preferred and people do not care about fact-checking. After all, it is convenience that most people expect - assuming that the replies are correct is very convenient.

A recent study, Language Models Learn to Mislead Humans via RLHF, discusses how RLHF, the popular training technique used to make models "aligned" with human preferences, can have a negative impact on performance: "to achieve higher rewards, LMs might get better at convincing humans that they are right even when they are wrong". I.e., saying things that people like is not the same as being factual... or executing tasks by following instructions to the letter (as demonstrated above).

P.S.

Here are the results of the MMLU Pro benchmark showing that Nemotron 70B fares slightly worse than Llama 3.1 70B and way behind closed models such as GPT-4o.

P.P.S.

Although at the top of the article there's a mention that quantization has little impact on model performance, earlier this year there was a Reddit post discussing how the Llama 3 family of models is indeed impacted by quantization more than the previous Llama 2 models.

Top comments (1)

Sam Rahimi

I think quantization might have a lot to do with why reviews of this model are all over the map. I'm not a fan of nemotron models by and large, but this particular release is impressing me.

Now, I'm using the model via HuggingChat (the hosted version on hf.co) and I can't see inside the docker container they're using to serve the models, but generally when HuggingFace chooses to host an LLM and make it available for free, it runs in FP16 or BF16.

It could also be that my use case is something this model handles well: I built a no-code generative imaging API that accepts a prompt as a URL parameter (allowing LLMs to use it without the overhead of tool calls - basically an HTML image tag or the markdown equivalent will cause the image to self-generate when the viewer's browser requests it from the server). Anyways, I've been testing it via a HuggingChat assistant I created, which is basically like a really boring version of an OpenAI GPT - just a system prompt and a profile pic - for all models other than llama-3.1-70b.

And of all the models available on HuggingChat, the latest Nemotron is the only one that appears to have any skill whatsoever at crafting prompts to pass to the diffusion models that power my API. Sure, llama-3.1-70b, command-r-plus, qwen-2.5-72b are all fully capable of using my API and understanding how to put the URL together based on my 1-shot example. But those models will just basically regurgitate what you ask them to create - if you ask qwen to make you a cat, it might elaborate a bit, and prompt the model for "a siamese cat relaxing in a garden" - it won't create a more detailed, artistic prompt and neither will most other models of that size given my single, poorly written example prompt.

But Nemotron is able to create stuff like the AI artists who show off on reddit, based on my silly vague requests. Is it that Nemotron is stronger in the arts, weaker at STEM subjects? Or is it that most ppl are using quantized versions of the model that have suffered brain damage during the quantization process?

Try my artistic assistant here, if you want - lmk what you think (and if you want to try with a different model, you can copy the system prompt into your own assistant). hf.co/chat/assistant/671d9983f80fc...