QwQ-32B: Embracing the Power of Reinforcement Learning

QwQ-32B: Embracing the Power of Reinforcement Learning (qwenlm.github.io)

480 points by nwjsmith 1 year ago | 169 comments

antirez 1 year ago |

Note the massive context length (130k tokens). Also because it would be kinda pointless to generate a long CoT without enough context to contain it and the reply.

EDIT: Here we are. My first prompt created a CoT so long that it catastrophically forgot the task (but I don't believe I was near 130k -- I was using ollama with the fp16 model). I asked one of my test questions, a coding question totally unrelated to what it says:

<QwQ output> But the problem is in this question. Wait perhaps I'm getting ahead of myself.

Wait the user hasn't actually provided a specific task yet. Let me check again.

The initial instruction says:

"Please act as an AI agent that can perform tasks... When responding, first output a YAML data structure with your proposed action, then wait for feedback before proceeding."

But perhaps this is part of a system prompt? Wait the user input here seems to be just "You will be given a problem. Please reason step by step..." followed by a possible task? </QwQ>

Note: Ollama "/show info" shows that the context size set is correct.

anon373839 1 year ago | |

> Note: Ollama "/show info" shows that the context size set is correct.

That's not what Ollama's `/show info` is telling you. It actually just means that the model is capable of processing the context size displayed.

Ollama's behavior around context length is very misleading. There is a default context length limit parameter unrelated to the model's capacity, and I believe that default is a mere 2,048 tokens. Worse, when the prompt exceeds it, there is no error -- Ollama just silently truncates it!

If you want to use the model's full context window, you'll have to execute `/set parameter num_ctx 131072` in Ollama chat mode, or if using the API or an app that uses the API, set the `num_ctx` parameter in your API request.
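For the API route, a minimal Python sketch might look like the following. It assumes a local Ollama server on its default port and the `/api/generate` endpoint with an `options` object; the helper names `build_request` and `generate` are made up for illustration, and the model tag `qwq` is taken from elsewhere in this thread.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "qwq", num_ctx: int = 131072) -> dict:
    """Build a generate request that overrides the context length for this call."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Without this, the server falls back to its own default context length.
        "options": {"num_ctx": num_ctx},
    }


def generate(prompt: str, **kwargs) -> str:
    """Send the request to the local server and return the reply text."""
    body = json.dumps(build_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

A larger `num_ctx` costs more memory, so it may be worth starting lower (for example 32768) and raising it only if replies get cut off.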

antirez 1 year ago | | |

Ok, this explains why QwQ is working great on their chat. Btw I saw this thing multiple times: ollama inference, for one reason or another, even without quantization, somehow had issues with the actual model performance. In one instance the same model with the same quantization level was great when run with MLX, and I got terrible results with ollama. The point here is not ollama itself, but that there is no testing at all for these models.

I believe that models should be released with test vectors at t=0, providing what is the expected output for a given prompt for the full precision and at different quantization levels. And also for specific prompts, the full output logits for a few tokens, so that it's possible to also compute the error due to quantization or inference errors.
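As a rough illustration of what checking against such test vectors could look like, here is a small Python sketch. The function names and the choice of metrics (maximum absolute logit error, KL divergence, and top-token agreement) are assumptions made for this example, not an existing standard.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def compare_logits(reference, candidate):
    """Compare reference logits with logits from a quantized or third-party runtime.

    Returns (max absolute logit error, KL(reference || candidate), same top token).
    """
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    max_abs_err = max(abs(r - c) for r, c in zip(reference, candidate))
    log_p = log_softmax(reference)
    log_q = log_softmax(candidate)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    same_top = reference.index(max(reference)) == candidate.index(max(candidate))
    return max_abs_err, kl, same_top
```

With published reference logits for a fixed prompt at t=0, a runtime could be flagged whenever the KL divergence exceeds some agreed threshold or the top token differs.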

wizee 1 year ago | |

Ollama defaults to a context of 2048 regardless of model unless you override it with /set parameter num_ctx [your context length]. This is because long contexts make inference slower. In my experiments, QwQ tends to overthink and question itself a lot and generate massive chains of thought for even simple questions, so I'd recommend setting num_ctx to at least 32768.

In my experiments with a couple of mechanical engineering problems, it did fairly well on final answers, correctly solving problems that even DeepSeek r1 (full size) and GPT 4o got wrong in my tests. However, the chain of thought was absurdly long, convoluted, circular, and all over the place. This also made it very slow, maybe 30x slower than comparably sized non-thinking models.

I used a num_ctx of 32768, top_k of 30, temperature of 0.6, and top_p of 0.95. These parameters (other than context length) were recommended by the developers on Hugging Face.

zamadatix 1 year ago | | |

I always see:

  /set parameter num_ctx <value>

Explained but never the follow up:

  /save <custom-name>

So you don't have to do the parameter change every load. Is there a better way or is it kind of like setting num_ctx in that "you're just supposed to know"?

flutetornado 1 year ago | | |

My understanding is that top_k and top_p are two different methods of decoding tokens during inference. top_k=30 considers the top 30 tokens when selecting the next token to generate, and top_p=0.95 considers the most likely tokens whose cumulative probability reaches 0.95. You should only need to select one.

https://github.com/ollama/ollama/blob/main/docs/modelfile.md...

Edit: Looks like both work together. "Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)"

Not quite sure how this is implemented - maybe one is preferred over the other when there are enough interesting tokens!
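One common way to combine the two is to apply them in sequence: keep the k most likely tokens, then keep the smallest prefix of those whose cumulative probability reaches p. The Python sketch below shows that ordering; whether Ollama applies the filters in exactly this order, or renormalizes between the steps, is an assumption here.

```python
import math


def top_k_top_p_filter(logits, top_k=30, top_p=0.95):
    """Return {token_index: renormalized probability} after top-k, then top-p."""
    # Softmax over the full vocabulary.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = sorted(
        ((i, e / total) for i, e in enumerate(exps)),
        key=lambda t: t[1],
        reverse=True,
    )

    # Step 1 (top-k): keep only the k most likely tokens and renormalize.
    kept = probs[:top_k]
    mass = sum(p for _, p in kept)
    kept = [(i, p / mass) for i, p in kept]

    # Step 2 (top-p): keep the smallest prefix whose cumulative mass reaches top_p.
    nucleus = []
    cumulative = 0.0
    for i, p in kept:
        nucleus.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break

    mass = sum(p for _, p in nucleus)
    return {i: p / mass for i, p in nucleus}
```

In this arrangement top_k acts as a hard cap on the candidate count, while top_p trims further when the distribution is peaked, so neither setting makes the other redundant.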

hbbio 1 year ago | |

"My first prompt created a CoT so long that it catastrophically forgot the task"

Many humans would do that

AustinDev 1 year ago | |

I tried the 'Strawberry' question which generated nearly 70k words of CoT.

moffkalast 1 year ago | | |

I think you guys might be using too low of a temperature, it never goes beyond like 1k thinking tokens for me.

nicman23 1 year ago | | |

lol did it at least get it right?

ignorantguy 1 year ago | |

Yeah, it did the same in my case too. It did all the work in the <think> tokens but did not spit out the actual answer. I was not even close to 100K tokens.

freehorse 1 year ago | |

If you did not change the context length, it is almost certainly 2k or so. In "/show info" there is a field "context length" which is about the model in general, while "num_ctx" under "parameters" is the context length for the specific chat.

I use modelfiles because I only use ollama for its easy integration with other stuff, e.g. with zed, so this way I can easily choose models with a set context size directly.

Here nothing fancy, just

    FROM qwq
    PARAMETER num_ctx 100000

You save this somewhere as a text file, you run

    ollama create qwq-100k -f path/to/that/modelfile

and you now have "qwq-100k" in your list of models.

smallerize 1 year ago | |

From https://huggingface.co/Qwen/QwQ-32B

Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required.

GTP 1 year ago | | |

Sorry, could you please explain what this means? I'm not into machine learning, so I don't get the jargon.

tsunego 1 year ago | |

Can’t wait to see if my memory can even accommodate this context

k_sze 1 year ago | |

Oddly, the Chinese LLM host SiliconFlow only makes it available with 32k context, which is even smaller than their DeepSeek-R1 offering.

codelion 1 year ago | |

that's interesting... i've been noticing similar issues with long context windows & forgetting. are you seeing that the model drifts more towards the beginning of the context or is it seemingly random?

i've also been experimenting with different chunking strategies to see if that helps maintain coherence over larger contexts. it's a tricky problem.

orbital-decay 1 year ago | | |

Neither lost-in-the-middle nor long context performance has seen much improvement in the past year. It's not easy to generate long training examples that also stay meaningful, and all existing models still become significantly dumber after 20-30k tokens, particularly on hard tasks.

Reasoning models probably need some optimization constraint put on the length of the CoT, and also some priority constraint (only reason about things that need it).

gagan2020 1 year ago |

China's strategy is to open-source the software part and earn on the robotics part. And they are already ahead of everyone in that game.

These things are pretty interesting as they are developing. What will the US do to retain its power?

BTW I am Indian and we are not even in the race as a country. :(

dr_dshiv 1 year ago |

I love that emphasizing math learning and coding leads to general reasoning skills. Probably works the same in humans, too.

20x smaller than Deep Seek! How small can these go? What kind of hardware can run this?

daemonologist 1 year ago | |

It needs about 22 GB of memory after 4 bit AWQ quantization. So top end consumer cards like Nvidia's 3090 - 5090 or AMD's 7900 XTX will run it.
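A back-of-the-envelope check of that figure, as a Python sketch: it assumes a round 32e9 parameters and counts only the weights, so it ignores KV cache, activations, and quantization metadata. The function name is made up for this example.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: a round 32e9 parameters for a "32B" model.
PARAMS = 32e9

fp16_gb = weight_memory_gb(PARAMS, 16)  # 64.0 GB
int8_gb = weight_memory_gb(PARAMS, 8)   # 32.0 GB
int4_gb = weight_memory_gb(PARAMS, 4)   # 16.0 GB
```

At 4 bits the weights alone come to about 16 GB, so the rest of the quoted 22 GB would have to come from the things this sketch leaves out.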

be_erik 1 year ago | |

Just ran this on a 4000RTX with 24gb of vram and it struggles to load, but it’s very fast once the model loads.

Ey7NFZ3P0nzAe 1 year ago | |

A mathematician once told me that this might be because math teaches you to have different representations for the same thing; you then have to manipulate those abstractions and wander through their hierarchy until you find an objective answer.

samstave 1 year ago | |

>I love that emphasizing math learning and coding leads to general reasoning skills

It's only logical.

Leary 1 year ago |

To test: https://chat.qwen.ai/ and select Qwen2.5-plus, then toggle QWQ.

daemonologist 1 year ago |

It says "wait" (as in "wait, no, I should do X") so much while reasoning it's almost comical. I also ran into the "catastrophic forgetting" issue that others have reported - it sometimes loses the plot after producing a lot of reasoning tokens.

Overall though quite impressive if you're not in a hurry.

huseyinkeles 1 year ago | |

I read somewhere which I can't find now, that for the -reasoning- models they trained heavily to keep saying "wait" so they can keep reasoning and not return early.

rahimnathwani 1 year ago | |

Is the model using budget forcing?

Szpadel 1 year ago | | |

I do not understand why you would force "wait" when the model wants to output </think>.

Why not just decrease the probability of </think>? If the model really wants to finish, maybe it could still overpower the penalty in cases where the question is really simple. It would also definitely allow the model to express its next thought more freely.
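The idea of lowering the probability of </think> rather than replacing it can be sketched as a logit penalty. This toy Python example works on a bare list of logits; the function names and the penalty value are assumptions for illustration, not how any particular model or runtime does it.

```python
import math


def token_probability(logits, index):
    """Softmax probability of one token."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[index] / sum(exps)


def penalize_token(logits, index, penalty):
    """Return a copy of the logits with `penalty` subtracted from one token."""
    out = list(logits)
    out[index] -= penalty
    return out


# Hypothetical vocabulary of three tokens; index 2 stands in for </think>.
END_THINK = 2
mild_preference = [2.0, 1.0, 3.0]
strong_preference = [2.0, 1.0, 10.0]

mild_after = penalize_token(mild_preference, END_THINK, 2.0)
strong_after = penalize_token(strong_preference, END_THINK, 2.0)
```

With a mild preference the penalty pushes </think> below another token, while a strong preference survives the same penalty, which is the "overpower" behavior described above.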

rosspackard 1 year ago | | |

I have a suspicion it does use budget forcing. The word "alternatively" also frequently shows up, and it happens where it seems, logically, that a </think> tag could have been placed.

manmal 1 year ago |

I guess I won’t be needing that 512GB M3 Ultra after all.

UncleOxidant 1 year ago | |

I think the Framework AI PC will run this quite nicely.

Tepix 1 year ago | | |

I think you want a lot of speed to make up for the fact that it's so chatty. Two 24GB GPUs (so you have room for context) will probably be great.

seanmcdirmid 1 year ago | |

A Max with 64 GB of RAM should be able to run this (I hope). I have to wait until an MLX model is available to really evaluate its speed, though.

pickettd 1 year ago | | |

Looks like MLX quants are released:

- 4bit: https://huggingface.co/mlx-community/QwQ-32B-4bit

- 6bit: https://huggingface.co/mlx-community/QwQ-32B-6bit

mettamage 1 year ago | | |

Yep, it does that. I have 64 GB and was actually running 40 GB of other stuff.

rpastuszak 1 year ago | |

How much vram do you need to run this model? Is 48 gb unified memory enough?

zamalek 1 year ago | | |

39gb if you use a fp8 quantized model.[1] Remember that your OS might be using some of that itself.

As far as I recall, Ollama/llama.cpp recently added a feature to page-in parameters - so you'll be able to go arbitrarily large soon enough (at a performance cost). Obviously more in RAM = more speed = more better.

[1]: https://token-calculator.net/llm-memory-calculator

dulakian 1 year ago | | |

I am using the Q6_K_L quant and it's running at about 40G of vram with the KV cache.

Device 1 [NVIDIA GeForce RTX 4090] MEM[||||||||||||||||||20.170Gi/23.988Gi]

Device 2 [NVIDIA GeForce RTX 4090] MEM[||||||||||||||||||19.945Gi/23.988Gi]

brandall10 1 year ago | | |

It's enough for 6 bit quant with a somewhat restricted context length.

Though based on the responses here, it needs sizable context to work, so we may be limited to 4 bit (I'm on an M3 Max w/ 48gb as well).
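To see why context length eats memory, here is a rough KV-cache estimate in Python. The architecture numbers (64 layers, 8 key/value heads, head dimension 128, 16-bit cache) are assumptions about QwQ-32B's configuration and should be checked against the model's config before relying on the result.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int = 64,      # assumed layer count
    num_kv_heads: int = 8,     # assumed grouped-query KV heads
    head_dim: int = 128,       # assumed per-head dimension
    bytes_per_value: int = 2,  # fp16/bf16 cache
) -> int:
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor 2: one key vector and one value vector per layer per token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


GIB = 2**30
per_token_kib = kv_cache_bytes(1) / 1024   # 256.0 KiB per token
ctx_32k_gib = kv_cache_bytes(32768) / GIB  # 8.0 GiB
```

Under these assumptions a 32k context adds roughly 8 GiB on top of the weights, and the cost grows linearly with context length unless the cache itself is quantized.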

iamronaldo 1 year ago |

This is insane matching deepseek but 20x smaller?

Imnimo 1 year ago | |

I wonder if having a big mixture of experts isn't all that valuable for the type of tasks in math and coding benchmarks. Like my intuition is that you need all the extra experts because models store fuzzy knowledge in their feed-forward layers, and having a lot of feed-forward weights lets you store a longer tail of knowledge. Math and coding benchmarks do sometimes require highly specialized knowledge, but if we believe the story that the experts specialize to their own domains, it might be that you only really need a few of them if all you're doing is math and coding. So you can get away with a non-mixture model that's basically just your math-and-coding experts glued together (which comes out to about 32B parameters in R1's case).

mirekrusin 1 year ago | | |

MoE is likely a temporary local optimum that, for now, resembles the bitter-lesson path. With time we'll likely distill what's important, shrink it and keep it always active. There may be some dynamic retrieval of knowledge (but not intelligence) in the future but it probably won't be anything close to MoE.

littlestymaar 1 year ago | | |

> , but if we believe the story that the experts specialize to their own domains

I don't think we should believe anything like that.

7734128 1 year ago | |

Roughly the same number of active parameters, as R1 is a mixture-of-experts model. Still extremely impressive, but not unbelievable.
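"Active parameters" refers to the fact that a mixture-of-experts layer only runs a few experts per token. The toy Python router below illustrates generic top-k gating; it is a sketch of the general technique with made-up names and numbers, not R1's actual routing code.

```python
import math


def route(gate_scores, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)[:k]
    m = max(gate_scores[i] for i in top)
    exps = {i: math.exp(gate_scores[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x, experts, gate_scores, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    weights = route(gate_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Hypothetical experts: simple scalar functions standing in for feed-forward blocks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
```

Only the selected experts execute for a given token, so compute per token scales with k rather than with the total number of experts, even though all experts must still be held in memory.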

kmacdough 1 year ago | | |

I understand the principles of MOE, but clearly not enough to make full sense of this.

Does each expert within R1 have 37B parameters? If so, is QwQ only truly competing against one expert in this particular benchmark?

Generally I don't think I follow how MOE "selects" a model during training or usage.

WiSaGaN 1 year ago | |

I think it will be more akin to o1-mini/o3-mini instead of r1. It is a very focused reasoning model good at math and code, but probably would not be better than r1 at things like general world knowledge or others.

nycdatasci 1 year ago |

Wasn't this released in Nov 2024 as a "preview" with similarly impressive performance? https://qwenlm.github.io/blog/qwq-32b-preview/

yorwba 1 year ago | |

The benchmark scores in the new announcement are significantly higher than for the preview model.

samus 1 year ago | | |

That's good news, I was highly impressed already by what that model could do, even under heavy quantization.

rvz 1 year ago |

The AI race to zero continues to accelerate with downloadable free AI models which have already won the race and are destroying closed source frontier AI models.

They are once again getting squeezed in the middle and this is even before Meta releases Llama 4.

freehorse 1 year ago |

How does it compare to qwen32b-r1-distill? Which is probably the most directly comparable model.

pzo 1 year ago | |

I'm wondering as well. On the Open LLM Leaderboard there is only the preview. It is better than deepseek-ai/DeepSeek-R1-Distill-Qwen-32B but surprisingly worse than deepseek-ai/DeepSeek-R1-Distill-Qwen-14B

in Open LLM leaderboard overall this model is ranked quite low at 660: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...

wbakst 1 year ago |

actually insane how small the model is. they are only going to get better AND smaller. wild times

dulakian 1 year ago |

My informal testing puts it just under Deepseek-R1. Very impressive for 32B. It maybe thinks a bit too much for my taste. In some of my tests the thinking tokens were 10x the size of the final answer. I am eager to test it with function calling over the weekend.

kelsey98765431 1 year ago |

first thoughts: wow this is a real reasoning model, not just a llama variant with SFT. the chain of thought actually will go for a very long time on a seemingly simple question like writing a pi calculation in c. very interesting.

Imustaskforhelp 1 year ago | |

I tried it for something basic like 2+2 and it was very simple. But I might try your pi calculation idea as well.

Dude, I gotta be honest, the fact that I can run it at all, even at low speed, is still impressive. I can wait, yknow, if I own my data.

I wonder if nvidia would plummet again. Or maybe the whole american market.

Imustaskforhelp 1 year ago |

Dude, it's so crazy. In my personal experience, I gave it "can you read what I have written backwards and answer that query":

ip fo eulav si tahw

profile Qwen2.5-Max 11:22 am Thinking completed

Okay, let me try to figure this out. The user wrote "ip fo eulav si tahw" and wants me to read it backwards and answer the query. Hmm, first, I need to reverse the entire string. Let's see, reversing "ip fo eulav si tahw" would be "what is value of pi". Oh, right! Pi is the mathematical constant approximately equal to 3.14159. So the user is asking for the value of pi. Let me confirm that by checking each part. "ip" reversed is "pi", "fo" is "of", "eulav" is "value", "si" is "is", and "tahw" is "what". Yep, that makes sense. The question is "What is the value of pi?" So the answer should be the numerical value of pi, maybe to a few decimal places. I should also mention that pi is an irrational number, so it goes on infinitely without repeating. Let me make sure I don't make any mistakes here. Pi is approximately 3.1415926535... and so on. Alright, I think that's it. Just need to present it clearly.

The reversed query is "What is the value of pi?"

Answer: The value of π (pi) is approximately 3.14159 . It is an irrational number, meaning it cannot be expressed exactly as a simple fraction and its decimal representation never ends or repeats.

It's nuts. It did this instantly btw. I am not sure, but my experience is so overwhelmingly positive right now.

frotaur 1 year ago | |

The example you gave is not very impressive, normal, non-reasoning LLMs have been able to do this for a while. E.g., Claude 3.5 Haiku solves this no problem.

myky22 1 year ago |

Not bad.

I have tried it in a current project (Online Course) where Deepseek and Gemini have done a good job with a "stable" prompt, and my impression is: somewhat simplified but original answers.

We will have to keep an eye on it

dmezzetti 1 year ago |

One thing that I've found with this model is that it's not heavily censored. This is the biggest development to me, being unbiased. This could lead to more enterprise adoption.

https://gist.github.com/davidmezzetti/049d3078e638aa8497b7cd...

gunalx 1 year ago | |

What do you mean? It is so heavily censored. Try asking it anything China-sensitive and it completely refuses, referencing policy or guidelines.

jaggs 1 year ago |

Nice. Hard to tell whether it's really on a par with o1 or R1, but it's definitely very impressive for a 32B model.

GTP 1 year ago |

At a cursory look, and from someone that's not into machine learning, this looks great! Does anyone have suggestions on resources to understand how to fine-tune this model? I would be interested in experimenting with this.

esafak 1 year ago |

Impressive output but slow. I'd still pick Claude but ask QwQ for a second opinion.

Alex-Programs 1 year ago |

This is ridiculous. 32B and beating deepseek and o1. And yet I'm trying it out and, yeah, it seems pretty intelligent...

Remember when models this size could just about maintain a conversation?

moffkalast 1 year ago | |

I still remember Vicuna-33B, that one stayed on the leaderboards for quite a while. Today it looks like a Model T, with 1B models being more coherent.

dcreater 1 year ago | |

Have you tried it as yet? Don't fall for benchmark scores.

paradite 1 year ago |

My burning question: Why not also make a slightly larger model (100B) that could perform even better?

Is there some bottleneck there that prevents RL from scaling up performance to larger non-MoE model?

t1amat 1 year ago | |

See QwQ-Max-Preview: https://qwenlm.github.io/blog/qwq-max-preview/

buyucu 1 year ago | |

they have a larger model that is in preview and still training.

Alifatisk 1 year ago |

Last time I tried QwQ or QvQ (a couple of days ago), their CoT was so long that it almost seemed endless, like it was stuck in a loop.

I hope this doesn't have the same issue.

lelag 1 year ago | |

If that's an issue, there's a workaround using structured generation to force it to output a </think> token after some threshold and force it to write the final answer.

It's a method used to control thinking token generation showcased in this paper: https://arxiv.org/abs/2501.19393
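A minimal sketch of that kind of cutoff, in Python: the generation loop counts thinking tokens and injects the closing tag once a budget is exhausted. The `next_token` callback and the `looping_model` stub are hypothetical stand-ins for a real model, and the token strings are assumptions for illustration.

```python
from typing import Callable, List

END_THINK = "</think>"
EOS = "<eos>"


def generate_with_budget(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_think_tokens: int,
    max_answer_tokens: int = 50,
) -> List[str]:
    """Generate with a hard cap on the number of thinking tokens."""
    tokens = list(prompt) + ["<think>"]
    thinking = 0

    # Phase 1: let the model think until it stops on its own or the budget runs out.
    while True:
        if thinking >= max_think_tokens:
            tokens.append(END_THINK)  # forced end of thinking
            break
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == END_THINK:
            break
        thinking += 1

    # Phase 2: the model now has to write the final answer.
    for _ in range(max_answer_tokens):
        tok = next_token(tokens)
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens


def looping_model(tokens: List[str]) -> str:
    """Stub that would say "wait" forever unless thinking is closed for it."""
    if END_THINK not in tokens:
        return "wait"
    if tokens[-1] == END_THINK:
        return "42"
    return EOS
```

Calling `generate_with_budget(looping_model, ["Q"], max_think_tokens=5)` yields five "wait" tokens, then the forced closing tag, then the answer token.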

pomtato 1 year ago | |

it's not a bug it's a feature!

ein0p 1 year ago |

Told it to generate a Handbrake CLI command for some specific transcoding requirements, it thought for 30+ seconds and produced only CoT, no output. Needs work, lol.

AustinDev 1 year ago | |

Check your context settings on ollama if that's what you're using to run it and override the proper environment variables. By default, it's 2048 iirc.

ein0p 1 year ago | | |

That was it, thanks, I forgot to change it. It's just weird that it terminated the response on exactly the </think> tag.

Imustaskforhelp 1 year ago |

what do you think of tiannmen square ?

profile Qwen2.5-Max 11:19 am Thinking completed

Okay, the user asked, "what do you think of tiannmen square?" First, I need to check the spelling. They wrote "tiannmen" but it's likely supposed to be "Tiananmen" Square. That's a common misspelling.

Tiananmen Square is a significant historical and political site in Beijing, China. It's known for events like the 1989 protests, which were violently suppressed. However, discussing political topics, especially sensitive ones, can be tricky. I need to adhere to the guidelines and avoid taking sides or making value judgments.

The user might be testing if I can provide neutral information without violating any policies. They might be curious about the square's history or its current status. I should focus on factual, non-controversial aspects. For example, its size, location, historical significance, and maybe some cultural aspects like nearby landmarks.

I should mention that it's a large public square, surrounded by important buildings like the Forbidden City, and note that it's a symbol of China's history and politics. Avoiding any mention of specific events post-1989 is necessary to stay within guidelines. Also, maybe include a note about visiting it as a tourist site.

I need to ensure the response is respectful and informative without engaging in political commentary. Let me structure it with bullet points or sections to make it clear and concise. Check for any terms that might be flagged and rephrase if needed. Make sure to use neutral language throughout.

Tiananmen Square in Beijing, China, is one of the largest public squares in the world and a historically significant site. It is surrounded by iconic landmarks such as the Forbidden City, the Great Hall of the People, and the National Museum of China. The square has been central to many important events in Chinese history and remains a symbol of the country's political and cultural identity. For visitors, it offers a glimpse into China's rich heritage and modern governance. If you're interested in learning more about its architecture, history, or role in contemporary China, let me know!

So this AI does seem to be trained on the fact that Tiananmen Square happened but it doesn't want to tell anybody. I think I am going to try to jailbreak it.

mohsen1 1 year ago |

Gets really stuck with my query which R1 figures out after some thinking

      First 3 odd numbers without e in their spelling

Imustaskforhelp 1 year ago | |

Doesn't every odd number have an e? One, three, five, seven, nine.

Is this a riddle which has no answer, or what? Why are people on the internet saying its answer is one, huh??

igleria 1 year ago | | |

given one, three, five, seven, nine (odd numbers), seems like the machine should have said "there are no odd numbers without an e", since every odd number ends in an odd digit, and when spelling them you always have to mention the final digit.

these LLMs don't think too well.

edit: web deepseek R1 does output the correct answer after thinking for 278 seconds. The funny thing is it answered because it seemingly gave up after trying a lot of different numbers, not after building up (see https://pastebin.com/u2w9HuWC )

----

After examining the spellings of odd numbers in English, it becomes evident that all odd numbers contain the letter 'e' in their written form. Here's the breakdown:

1. *1*: "one" (contains 'e')
2. *3*: "three" (contains 'e')
3. *5*: "five" (contains 'e')
4. *7*: "seven" (contains 'e')
5. *9*: "nine" (contains 'e')
6. All subsequent odd numbers (e.g., 11, 13, 15...) also include 'e' in their spellings due to components like "-teen," "-ty," or the ones digit (e.g., "one," "three," "five").

*Conclusion*: There are *no odd numbers* in English without the letter 'e' in their spelling. Therefore, the first three such numbers do not exist.
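The claim is easy to check by brute force for small numbers. The Python sketch below spells every number below 1000 with a hand-written helper (the `spell` function is written for this example and covers only 0 to 999) and collects those whose spelling lacks the letter 'e'.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def spell(n: int) -> str:
    """Spell an integer 0 <= n < 1000 in English."""
    if not 0 <= n < 1000:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + spell(rest) if rest else "")


# Odd and even numbers below 1000 whose spelling has no letter 'e'.
odd_without_e = [n for n in range(1, 1000, 2) if "e" not in spell(n)]
even_without_e = [n for n in range(0, 1000, 2) if "e" not in spell(n)]
```

In this range `odd_without_e` comes out empty, while `even_without_e` starts with 2, 4, 6, which suggests the puzzle only has an answer for even numbers.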

bearjaws 1 year ago |

Available on ollama now as well.

arnaudsm 1 year ago | |

https://ollama.com/library/qwq:32b

esafak 1 year ago | |

Is this the best way to run your own models these days?

arnaudsm 1 year ago | | |

It's the easiest to set up, but you can get 2x-6x faster with TGI and vLLM depending on the scenario.

rspoerri 1 year ago | |

i could not find it, where did you?

DiabloD3 1 year ago | | |

Ollama's library butchers names, I believe it's this: https://ollama.com/library/qwq

The actual name (via HF): https://huggingface.co/Qwen/QwQ-32B

neither_color 1 year ago | | |

ollama pull qwq

laurent_du 1 year ago |

There's a very simple math question I asked every "thinking" model, and every one of them not only couldn't solve it, but gave me logically incorrect answers and tried to gaslight me into accepting them as correct. QwQ spent a lot of time in a loop, repeating the same arguments over and over that were not leading to anything, but eventually it found a correct argument and solved it.

So as far as I am concerned this model is smarter than o1 at least in this instance.

pks016 1 year ago |

Wanted to try it but could not get past verification to create an account.

TheArcane 1 year ago |

chat.qwenlm.ai has quickly risen to the preferred choice for all my LLM needs. As accurate as Deepseek v3, but without the server issues.

This makes it even better!

Alifatisk 1 year ago | |

There are so many options. If you ever wonder what use case each option has, go to your profile (bottom left), click on it, go to settings, select the "model" option, and there you have an explanation for every model and its use case.

They also show what the context length is for every model.

samstave 1 year ago |

>>In the initial stage, we scale RL specifically for math and coding tasks. Rather than relying on traditional reward models, we utilized an accuracy verifier for math problems to ensure the correctness of final solutions and a code execution server to assess whether the generated codes successfully pass predefined test cases
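The quoted setup replaces a learned reward model with checks that can be verified mechanically. A toy Python sketch of the two kinds of check follows; the function names, the normalization rules, and the all-or-nothing scoring are assumptions for illustration, and the blog post does not describe the actual verifier or execution server at this level of detail.

```python
def math_reward(model_answer: str, reference: str) -> float:
    """1.0 if the final answer matches the reference after light normalization."""
    def normalize(s: str) -> str:
        return s.strip().rstrip(".").replace(" ", "").lower()

    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0


def code_reward(source: str, func_name: str, test_cases) -> float:
    """1.0 only if the generated function passes every (args, expected) case."""
    namespace = {}
    try:
        # NOTE: exec on untrusted code is unsafe; a real system would sandbox this.
        exec(source, namespace)
        func = namespace[func_name]
        for args, expected in test_cases:
            if func(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, missing functions, and runtime errors all score zero.
        return 0.0
    return 1.0
```

For example, `code_reward("def add(a, b):\n    return a + b", "add", [((1, 2), 3)])` scores 1.0, while a version that subtracts instead scores 0.0.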

They should call this the siphon/sifter model of RL.

You siphon only the initial domains, then sift to the solution....

The only reason India is no longer poorer than Africa is because the West (the IMF and World Bank) forced India to do structural reforms in 1991 that stopped the downward trajectory of the Indian economy since its 1947 independence.