Run DeepSeek R1 Dynamic 1.58-bit (unsloth.ai)
One thing I've been thinking about doing is combining one of those LLM models running in llama.cpp, feeding it the output of whisper.cpp and connecting its output to some TTS model. I wonder how far that would be from Wheels and Roadie from the Pole Position TV series.
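A rough sketch of that loop, assuming three hypothetical callables (`transcribe`, `generate`, `speak`) that would wrap whisper.cpp, llama.cpp and some TTS engine. None of these names come from those projects; the lambdas below are stubs so the example runs without any model installed.

```python
from typing import Callable

# Hypothetical stage signatures; real whisper.cpp / llama.cpp / TTS bindings
# would have to be wrapped to fit them.
Transcribe = Callable[[bytes], str]   # audio -> text (speech-to-text)
Generate = Callable[[str], str]       # prompt -> reply (LLM)
Speak = Callable[[str], bytes]        # text -> audio (text-to-speech)


def voice_turn(audio: bytes, transcribe: Transcribe,
               generate: Generate, speak: Speak) -> bytes:
    """Run one spoken turn: listen, think, answer out loud."""
    heard = transcribe(audio)
    reply = generate(heard)
    return speak(reply)


if __name__ == "__main__":
    # Stub stages standing in for the real models.
    out = voice_turn(
        b"fake-audio",
        transcribe=lambda a: "hello car",
        generate=lambda text: f"You said: {text}",
        speak=lambda text: text.encode("utf-8"),
    )
    print(out)
```

Keeping the stages as plain callables means any of the three models can be swapped without touching the loop.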
Not to make fun of OpenAI and the great work they've done but it's kinda like if I went out in the 90s and said I'm going to found a company to have the best REST APIs. You can always found a successful tech company, but you can't found a successful tech company on a technological architecture or pattern alone.
80%? On 2 H100 only? To get near chatgpt 4? Seriously? The 671B version??
I 100% expect some downvotes from the ccp.
And that's a really important strategic advantage China has versus America, which has such an insane fixation on pure(ish) free markets and free trade that it gives away its advantages in strategic industry after strategic industry.
Some people falsely infer from the experience with the Soviet Union that freer markets always win geopolitical competition, but that's false.
> Some people falsely infer from the experience with the Soviet Union that freer markets always win geopolitical competition, but that's false.
The data we have is 500 years of free markets in the western world and the verdict is overwhelmingly: Yes, more freedom means more winning.
Just invite some incompetent bureaucrat over your house to dictate how you should cook and you'll quickly agree.
Always happy to oblige when someone insinuates that any critics must be government agents
No, more freedom means more winning to a point. Past that point it does not, and I'd argue that's where the US is.
> Just invite some incompetent bureaucrat over your house to dictate how you should cook and you'll quickly agree.
That's supposed to be convincing, somehow? Just invite some "competent" capitalist over to your house, and he'll sell your fishing rod in exchange for a short-term discount on fish at the supermarket, and see how well you win.
They're definitely not "the nine most terrifying words in the English language." Government is a necessity and performs important functions: we'd be worse off without it. A libertarian utopia would actually be a dystopia, at least for the vast majority.
Some day, historians will ask: Why did China eclipse the United States? And the tl;dr answer will likely be: libertarians. Myopic enthusiasm for free markets has really degraded the US's ability to make strategic decisions to maintain its advantages, and it seems on track to walk down the value chain while a few people get really rich leading it that way.
Why do you need to make leaps to the future to find evidence for your claims? Anyone can simply look at the past 500 years and come to the opposite conclusion.
There's also not much evidence China will "eclipse" the United States (whatever that means). I hitchhiked mainland China in 2019 after studying the language in university precisely because I thought the country might "eclipse" mine.
I came back with the exact opposite conclusion.
If the definition of "eclipse" is more global cultural influence, I would challenge you to compare the number of American movies you've watched in the past year vs Chinese movies. Movies are just 1 dimension of this dynamic.
The country has simply too much history and insularity to propagate its influence throughout the world. The language is another key example- very few learn Chinese as a second language. Even the Chinese youth themselves often use the latin alphabet to write their own language on a keyboard.
And then there’s the whole repetition issue. Infinite loops with "Pygame’s Pygame’s Pygame’s" kind of defeats the point of quantization if you ask me. Sure, the authors have fixes like adjusting the KV cache or using min_p, but doesn’t that just patch a symptom rather than solve the actual problem? A fried model is still fried, even if it stops repeating itself.
On the flip side, I love that they’re making this accessible on Hugging Face... and the dynamic quantization approach is pretty brilliant. Using 1.58-bit for MoEs and leaving sensitive layers like down_proj at higher precision—super clever. Feels like they’re squeezing every last drop of juice out of the architecture, which is awesome for smaller teams who can’t afford OpenAI-scale hardware.
"accessible" still comes with an asterisk. Like, I get that shared memory architectures like a 192GB Mac Ultra are a big deal, but who’s dropping $6,000+ on that setup? For that price, I’d rather build a rig with used 3090s and get way more bang for my buck (though, yeah, it’d be a power hog). Cool tech—no doubt—but the practicality is still up for debate. Guess we'll see if the next-gen models can address some of these trade-offs.
Min_p = 0.05 was a way I found to counteract the 1.58bit model generating singular incorrect tokens which happen around 1 token per 8000!
I've been running Open WebUI for months now for me and some friends as a front-end to one of the API providers (deepinfra in my case, but there are many others, see https://artificialanalysis.ai/).
Having 1.58-bit is very practical for me. I'm very much looking forward to the API provider adding this model to their system. They also added a Llama turbo (also quantized) a few months back, so I have high hopes.
AMD Strix Halo APU will have quad channel memory and will launch soon, so expect these kinds of setups to be available for much less. Apple is charging an arm and a leg for memory upgrades; hopefully we get competition soon. From what I saw at CES, OEMs are paying attention to this use case as well - hopefully not following suit on RAM markups.
Here's hoping the Nvidia Digit (GB10 chip) has a 512 bit or 1024 bit wide interface, otherwise the Strix Halo will be the best you can do if you don't get the Mac Ultra.
I’m sure there’ll be some amount of undercutting but I don’t think it’ll be a huge difference on the RAM side itself.
Mistral's large 123B model works well (but slowly) at 4-bit quantisation, but if I knock it down to 2.5-bit quantisation for speed, performance drops to the point where I'm better off with a 70B 4-bit model.
This makes me reluctant to evaluate new models in heavily quantised forms, as you're measuring the quantisation more than the actual model.
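For a sense of why those three configurations end up being compared, here is a back-of-the-envelope weight-size calculation. It is only a sketch: it counts weights alone and ignores KV cache, activations and per-format overhead.

```python
def weight_gigabytes(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring KV cache and overhead."""
    return params_billion * bits_per_weight / 8


for name, params, bits in [
    ("123B @ 4-bit", 123, 4.0),
    ("123B @ 2.5-bit", 123, 2.5),
    ("70B @ 4-bit", 70, 4.0),
]:
    print(f"{name}: ~{weight_gigabytes(params, bits):.1f} GB")
```

By this estimate the 2.5-bit 123B model and the 4-bit 70B model land in roughly the same memory budget, which is what makes the quality comparison between them a fair one.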
EDIT: It seems that original authors provided a nice write-up:
https://unsloth.ai/blog/deepseekr1-dynamic#:~:text=%F0%9F%96...
Other than that, if you really need the big one you can get six 3090s and you're good to go. It's not cheap, but you're running a ChatGPT equivalent model from your basement. A year ago this was a wetdream for most enthusiasts.
This line in the stuff inside the <think> section suggests it's also been trained on YouTube clips:
>> "I'm not entirely sure if I got all the details right, but this is what I remember from watching clips and summaries online."
An excerpt from the generated summary:
>> "Set in the 23rd century during a Z-Corp invasion, the series features action sequences, strategic thinking, and humor. It explores themes of international espionage, space warfare, and humanity's role in the cosmos. The show incorporates musical numbers and catchy theme songs for an engaging viewing experience. The plot involves investigating alien warships and their secret base on Kessari planet while addressing personal conflicts and philosophical questions about space."
"It explores themes of international espionage, space warfare, and humanity's role in the cosmos" is the closest to correct line in the whole output.
Anyone who has a/the need for or understands the value of a local LLM would be OK with this kind of output.
Wishful thinking.
I'm curious, what would you use that rig for?
Random observation 2: It's time to cancel the OpenAI subscription.
I'm impressed by the 140 tokens per second speed with the 1.58-bit quantization running on dual H100s. That kind of performance makes the model practical for small or mid sized shops to use it for local applications. This is a huge win for people working on agents that require low latency that only local models could support.
Not accusing you of anything. Could be that you happen to write in a way similar to LLMs. Could be that we are influenced by LLM writing styles and are writing more and more like LLMs. Could be that the difference between LLM-generated content and human-generated content is getting smaller and harder to tell.
It’s the exclamation point in the first paragraph, the concise and consistent sentence structure, and the lack of colloquial tone.
OP, no worries if you’re real. I often read my own messages or writing and worry that people will think I’m an LLM too.
Amazing that OP confirmed you're correct (and good use of LLM @OP).
This is really interesting insight (although other works cover this as well). I am particularly amused by the process by which the authors of this blog post arrived at these particular seeds. Good work nonetheless!
It’s a very bold claim which is really shaking up the markets, so I can’t help but wonder if it was even verified at this point.
Oh nice! So I can try it in my local "low power/low cost" server at home.
My home system runs on a Ryzen 5500 + 64GB RAM + 7x RTX 3060 12GB.
So 64GB RAM plus 84GB VRAM.
I don't want to brag, but to point to solutions for us tinkerers with a small budget and high energy costs.
Such a system can be built for around 1600 euro. The power consumption is around 520 watts.
I started with an AM4 board (B450 chipset) and one used RTX 3060 12GB, which costs around 200 euro used if you are patient.
Every additional GPU is connected with a PCIe riser/extender to give the cards enough space.
After a while I replaced the PCIe cards with a single PCIe x4 to 6x PCIe x1 extender.
It runs pretty nicely. Awesome to learn and gain experience.
> We managed to selectively quantize certain layers to higher bits (like 4bit), and leave most MoE layers (like those used in GPT-4) to 1.5bit
For example, I imagine a strong MoE base with 16 billion active parameters and 6 or 7 experts would keep a good performance while being possible to run on 128GB RAM macbooks.
Maybe by using a strong reasoning model such as R1 for the next generation, even more performance can be extracted from smaller models.
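The mixed-precision idea in the quote above can be put into numbers with a weighted average. This is a sketch only: the 88% / 12% split between MoE weights and higher-precision layers is a hypothetical figure chosen for illustration, not one taken from the thread, and real quant formats add overhead that is ignored here.

```python
def average_bits(layer_groups: list[tuple[float, float]]) -> float:
    """Weighted mean bits per weight for (fraction_of_weights, bits) groups."""
    total_fraction = sum(fraction for fraction, _ in layer_groups)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * bits for fraction, bits in layer_groups)


# Hypothetical split: 88% of weights in MoE experts at 1.58 bits,
# the remaining 12% kept at 4 bits.
groups = [(0.88, 1.58), (0.12, 4.0)]
bits = average_bits(groups)
size_gb = 671 * bits / 8  # 671B parameters -> approximate GB of weights
print(f"average bits/weight: {bits:.2f}, approx size: {size_gb:.0f} GB")
```

The point is that the few layers kept at higher precision move the average only a little, because the MoE experts hold most of the parameters.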
i’ve gotten full fp8 running on 8xh100, probably going to keep doing that
Do we finally have a model with access to the training architecture and training data set, or are we still calling non-reproducible binary blobs without source form open-source?
I also like to ask the models to create a simple basic Minecraft type game where you can break pieces and store them in your inventory, but disallow building stuff
So you can load a different active subset of the MoE into each 80GB GPU, sharding it across something like 32 different GPUs (or can you get away with less? Wouldn't be surprised if they can infer on 8x H800 gpus). Some parameters are common, others are independent. Queries can be dynamically routed between GPUs, potentially bouncing between GPUs as much as once per output token, depending on which experts they need to activate.
Though, I suspect it's normal to stick on one MoE subset for several output tokens.
This has a secondary benefit that as long as the routing distribution is random, queries should be roughly load balanced across all GPUs.
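The load-balancing claim can be checked with a toy simulation. Everything here is an assumption for illustration: the router is replaced by uniformly random top-k selection, experts are placed on GPUs in contiguous blocks, and the 256-expert count is not stated in the thread.

```python
import random
from collections import Counter


def route_tokens(num_tokens: int, num_experts: int, top_k: int,
                 num_gpus: int, seed: int = 0) -> Counter:
    """Count expert activations per GPU under uniformly random top-k routing.

    Experts are assigned to GPUs in contiguous blocks (a simplifying assumption).
    """
    rng = random.Random(seed)
    experts_per_gpu = num_experts // num_gpus
    load = Counter()
    for _ in range(num_tokens):
        # Stand-in for a learned router: pick top_k distinct experts at random.
        for expert in rng.sample(range(num_experts), top_k):
            load[expert // experts_per_gpu] += 1
    return load


load = route_tokens(num_tokens=10_000, num_experts=256, top_k=8, num_gpus=32)
print("least loaded GPU:", min(load.values()))
print("most loaded GPU:", max(load.values()))
```

With a real, learned router the distribution is not uniform, so the spread between the busiest and idlest GPU would likely be wider than in this sketch.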
Then by using pipeline parallelism, if a new request comes, we simply stick them in a queue - GPUs 0, 1, 2, ..., 8. Request A is at GPU 2, Request B at GPU 1, Request C at GPU 0 and so on.
The other option is tensor parallelism, where we split the weights evenly. You could combine pipeline and tensor parallelism as well!
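The pipeline queue described above can be sketched as a tiny schedule function. It assumes every stage takes exactly one tick and that one new request enters per tick, which real systems will not match; the function name is made up for this example.

```python
def pipeline_positions(num_requests: int, num_stages: int, tick: int) -> dict[str, int]:
    """Return {request_name: stage_index} for requests in flight at a given tick.

    Request i enters stage 0 at tick i and advances one stage per tick.
    """
    positions = {}
    for i in range(num_requests):
        stage = tick - i
        if 0 <= stage < num_stages:
            positions[f"Request {chr(ord('A') + i)}"] = stage
    return positions


# GPUs 0..8 give nine stages; look at the pipeline two ticks after A arrived.
print(pipeline_positions(num_requests=3, num_stages=9, tick=2))
# {'Request A': 2, 'Request B': 1, 'Request C': 0}
```

Tensor parallelism is not shown here; it splits each layer's weights across GPUs instead of assigning whole layers to them.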
I cannot understand why "openai is dead" has legs: repurpose the hardware and data and it can be multiple instances of the more efficient model.
you invest in a 100x machine expecting a revenue of X, but now you can only charge X/100 because R1 shows that AI inference can be done much more efficiently. see the price decrease of ChatGPT and addition of free O3 etc.
this reduction of future cash flows, ceteris paribus, implies that the present value of these cash flows decrease. this then results in massive repricing to the downside as market participants update their forecasts.
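The repricing argument is a plain discounted-cash-flow calculation. A minimal sketch, with made-up cash flows and a made-up 10% discount rate:

```python
def present_value(cash_flows: list[float], discount_rate: float) -> float:
    """Discount yearly cash flows (the first arrives one year out) to today."""
    return sum(cf / (1 + discount_rate) ** year
               for year, cf in enumerate(cash_flows, start=1))


# Hypothetical numbers purely for illustration.
before = present_value([100.0] * 5, discount_rate=0.10)
after = present_value([1.0] * 5, discount_rate=0.10)  # revenue cut to X/100
print(f"PV before: {before:.2f}, PV after: {after:.2f}")
```

Because present value is linear in the cash flows, cutting every expected cash flow by 100x cuts the valuation by the same factor, unless volume grows to compensate.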
what you are missing is that to assume as you do, you must make the additional assumption that demand for additional compute is infinite. Which may very well be the case, but it is not guaranteed compared to the present realized fact that R1 means lower revenues for AI inference providers -> changes the capex justification for even more hardware -> NVDA receives less revenue.
I love the original DeepSeek model, but the distilled versions are too dumb usually. I'm excited to try my own queries on it.
> I love the original DeepSeek model, but the distilled versions are too dumb usually.
Apart from being dumber, they also don't know as much as R1. I can see how fine-tuning can improve reasoning capability (by showing examples of good CoT) but there's no reason that would improve the knowledge of facts (relative to the Qwen or Llama model on which the finetuning was based). Is there any good quick summary of what's special about DeepSeek?
Yes, section 2.3 of the DeepSeek R1 paper summarizes the training part you're asking about, in less than a page. https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSee...
youtube.com/watch?v=Nl7aCUsWykg
Don’t get me wrong, what DS did is great, but anyone thinking this reshapes the fundamental trend of scaling laws and makes compute irrelevant is dead wrong. I’m sure OpenAI doesn’t really enjoy the PR right now, but guess what OpenAI/Google/Meta/Anthropic can do if you give them a recipe for 11x more efficient training? They can scale it to their 100k-GPU clusters and still blow everything away. This will be textbook Jevons paradox.
Compute is still king and OpenAI has worked on their training platform longer than anyone.
Of course as soon as the next best model is released, we can train on its output and catch up at a fraction of the cost, and thus the infinite bunny hopping will continue.
But OpenAI is very much alive.
Need an LLM to one-shot some complex network scripting? As of last night, o1 is still where it's at.
Of course cost is incomparably higher since plus has a very low limit. Which of course is a huge deal.
2. If you have GitHub Copilot, you get o1 chat also there.
I haven't seen much value with OpenAI subscription for ages.
ChatGPT is the king of the multimodal experience still. Anthropic is a distant second, only because it lets you upload images from the clipboard and responds to them, but it can't do anything else like generate images - sometimes it will do a flowchart, which is kind of cool, GPT won't do that - but will it speak to you, have tones, listen to you? No.
And in the open source side, this area has been stagnant for like 18 months. There is no cohesive multimodal experience yet. Just a couple vision models with chat capabilities and pretty pathetic GUIs to support them. You have to still do everything yourself there.
There is huge utility for me, and many others that don't know it yet, if we could just load a couple of models at once that work together seamlessly in a single GUI like how ChatGPT works.
AFAIK you can't do that with newer consumer cards, which is why this became an annoyance. Even a RTX 4070 Ti with its 12 GB would be fine, if you could easily stack a bunch of them like you used to be able with older cards.
That's because it's Apple. It's time to start moving to AMD systems with shared memory. My Zen 3 APU system has 64GB these days and it's a mini ITX board.
It's better to get (VRAM + RAM) >= 140GB for at least 30 to 40 tokens/s, and if VRAM >= 140GB, then it can approach 140 tokens/s!
Another trick is to accept more than 8 experts per pass - it'll be slower, but might be more accurate. You could even try reducing the # of experts to say 6 or 7 for low FLOP machines!
Can you release slightly bigger quant versions? Would enjoy something that runs well on 8x32 v100 and 8x80 A100.
Apple's M chips, AMD's Strix Point/Halo chips, Intel's Arc iGPUs, Nvidia's Jetsons. The main issue with all of these though is the lack of raw compute to complement the ability to load insanely large models.
It seems that AMD Epyc CPUs support terabytes of RAM, and some are as cheap as 1000 EUR. Why not just run the full R1 model on that? It seems that it would be much cheaper than several of those insane Nvidia cards.
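One way to reason about the CPU option is a bandwidth-bound estimate: during decoding, each generated token has to read the active weights from memory once. The sketch below leans on assumptions that are not stated in the thread: roughly 37B parameters active per token for R1, 8-bit weights, and a hypothetical 400 GB/s of usable memory bandwidth.

```python
def max_tokens_per_second(bandwidth_gb_s: float, active_params_billion: float,
                          bits_per_weight: float) -> float:
    """Upper bound on decode speed when every active weight is read once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Assumed figures (see the paragraph above); swap in real measurements.
print(f"{max_tokens_per_second(400, 37, 8):.1f} tokens/s at best")
```

This is a ceiling, not a prediction: compute limits, NUMA effects and prompt processing all push real throughput lower.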
I also tried not setting the seeds, but the results are still the same - quantizing all layers seems to make the model forget and repeat everything - I put all examples here: https://docs.unsloth.ai/basics/deepseek-r1-dynamic-1.58-bit#...
Another option is to employ min_p = 0.05 to force the model not to generate low prob tokens - it can help especially in the case when the 1.58bit model generates on average 1/8000 tokens or so an "incorrect" token (for eg `score := 0`)
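For readers who have not met min_p before, here is a minimal sketch of the filter as it is usually defined: drop every token whose probability is below `min_p` times the probability of the most likely token, then renormalize. The toy distribution and token strings are made up.

```python
def min_p_filter(probs: dict[str, float], min_p: float = 0.05) -> dict[str, float]:
    """Keep tokens with prob >= min_p * max prob, then renormalize the rest."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Toy next-token distribution; the rare "score := 0" token is the kind of
# low-probability outlier the filter is meant to remove.
probs = {"score = 0": 0.90, "score += 1": 0.08, "score := 0": 0.02}
print(min_p_filter(probs, min_p=0.05))
```

Because the threshold scales with the top token's probability, the filter is strict when the model is confident and permissive when it is not.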
Based on Nvidia being down 18% yesterday I would say the claim is generally accepted.
If confirmed, Nvidia could go down even more
“I don’t believe this, but I know others will, so I’m selling”
The only part of DeepSeek-R1 I do not like. I hope it's over, but I am not holding my breath.
That said, what they did with $5 million of GPUs is impressive. Reportedly, they resorted to using PTX assembly to make it possible:
https://www.tomshardware.com/tech-industry/artificial-intell...
If they aren't lying because they have hardware they're not supposed to have, which is also a possibility.
the cost absolutely includes the cost of GPUs and data centers, they quoted a standard price for renting h800 which has all of this built in. but yes, as very explicitly noted in the paper, it does not include cost of test iterations
ryzen 5500 + 7x3060 + cooling ~= 1.6 kW off the wall, at 360 GB/s memory bandwidth, and considering your lane budget, most of it will be wasted in single PCIe lanes. After-market unit price of 3060's is 200 eur, so 1600 is not good-faith cost estimate.
From the looks of it, your setup is neither low-power, nor low-cost. You'd be better served with a refurbished mac studio (2022) at 400GB/s bandwidth fully utilised over 96 GB memory. Yes, it will cost you 50% more (considering real cost of such system closer to 2000 eur) however it would run at a fraction of power use (10x less, more or less)
I get it that hobbyists like to build PC's, but claiming that sticking seven five year out of date low-bandwidth GPU's in a box is "low power/low cost" is a silly proposition.
You're advocating for e-waste
Now add that this guy has 7x3060 = 100% miner. So you know that he is running an optimized profile (underclocked).
Fyi, my gaming 6800 draws 230W, but with a bit of undervolting and sacrificing 7% performance, it runs at 110W for the exact same load. And that is 100% taxed. This is just a simple example to show that a lot of PC hardware runs very much overclocked/unoptimized out of the box.
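Putting the numbers from that example into a performance-per-watt ratio (a simple sketch, assuming the load is identical in both cases):

```python
def perf_per_watt_gain(stock_watts: float, tuned_watts: float,
                       perf_retained: float) -> float:
    """Ratio of tuned to stock performance-per-watt."""
    return (perf_retained / tuned_watts) / (1.0 / stock_watts)


# Figures from the comment: 230 W stock, 110 W undervolted, 7% performance lost.
print(f"{perf_per_watt_gain(230, 110, 0.93):.2f}x")  # 1.94x
```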
Somebody getting down to 520W sounds perfectly normal for an undervolted card that gives up maybe 10% performance in exchange for big savings in power draw.
And no, old hardware can be extremely useful in the right hands. Add to this that the main factor influencing speed when running an LLM tends to be memory (how much you can fit, and the interconnects) rather than actual processing performance.
Being able to run a large model for 1600 sounds like a bargain to me. Also, remember, when you're not querying the models, the power will be mostly the memory wakes + power regulators. Coming back to that YouTuber, he was not constantly drawing that 130W; it only spiked when he ran prompts or did activity.
Yes, running from home will be more expensive than a $10 Copilot plan but ... nobody is looking at your data either ;)
But in either case it's going to do much better than currently available CPUs with easily upgradeable RAM. I would not be surprised to see 128GB configurations for around 3k (going off the ASUS G13 announced pricing of around 2k for the 32GB version and them saying it will go up to 128GB).
At that point, sure, it might not compete with max, but it's at a much more acceptable price point: it will not be a device you get just for the AI, but a mobile workstation that you can also run some local models on for normal money. Will need to wait and see. I know I am not buying anything from ASUS either way.
The newest Lenovo workstations that use LPDDR5x at 7467MT/s get you a 16GB jump for the price Apple charges for 8GB.
2:1 isn't "class comparable" IMHO.
There are distilled versions like Qwen 1.5B, 7B, 14B, 32B and Llama 8B, 70B, but those are distilled - if you want to run the original R1, then the quants are currently the only way.
But I agree quants do affect perf - hence the trick for MoEs is to not quantize specific areas!
Being able to do semantic diffs of the output of the two models should tell you what you need to do.
O1 vs R1 performance on specific non-benchmark problems is also not that relevant until people have replicated R1 and/or tried fine-tuning it with additional data. What would be interesting to see is whether (given the different usage of RL) there is any difference in how well R1 vs O1 generalize to reasoning capability over domains they were not specifically trained for. I'd expect that neither do that well, but not knowing details of what they were trained on makes it hard to test.
Indeed, that's posting before being fully awake.
> And no, the stronger the quantization, the more the output token probabilities diverge from the non-quantized model. With a sampler you can't recover any meaningful accuracy.
Of course you can't recover any accuracy, but LLMs are in fact prone to this kind of repetition no matter what. This is a known failure mode, which is why samplers aimed at avoiding it have been designed over the past few years.
> If you force the sampler to select tokens that won't repeat, you're just trading repetitive gibberish for non-repetitive gibberish.
But it won't necessarily be gibberish! Even a highly quantized R1 still has much more embedded information than a 14B or even 32B model, so I don't see why it should output more gibberish than smaller models.
Using LM Studio, trying to load the model throws an error of "insufficient system resources."
I disabled this error, set the context length to 1024 and was able to get 0.24 tokens per second. Comparatively, the 32B distill model gets about 20 tokens per second.
And it became incredibly flaky, using up all available ram, and crashing the whole system a few times.
While the M4 Max 128GB handles the 32B well, it seems to choke on this. Here's to hoping someone works on something in-between (or works out what the ideal settings are because nothing I fiddled with helped much).
In theory half of the model fits in RAM, so it should be GPU limited if memory management is smart.
But everyone is using the distilled models which are much smaller.
that and they just happened to be undergoing a large scale "cyber attack"
To me, this is ripe for an application of the Jevons paradox. If architectural improvements make similar models cheaper, I would expect to see more of them trained and deployed, not fewer, ultimately increasing the market for GPU-like hardware.
The benchmark comparisons are perhaps, for now, the best way to compare reasoning prowess of R1 vs O1, since it seems pretty certain they both trained for those cases.
I think the real significance of R1 isn't the released model/weights itself, but more the paper detailing (sans training data) how to replicate it, and how effective "distillation" (i.e. generate synthetic reasoning data for SFT) can be to enhance reasoning even without using RL.
Deepinfra is pretty cheap though as a deepseek provider.
As for deepseek, I couldn't even sign up because my email domain is not on their whitelist. To just try it out for now I don't mind the increased cost.
I’d guess they did quite a bit of fine-tuning to censor some of the more sensitive topics, which probably impacts the output quality for other non-technical subjects.
The people making the model probably don't really give a shit about politics and just did the minimum to avoid being embarrassed, but if people start jailbreaking it they will be forced to care.
I took a fair amount of time to get everything to a reduced power level and measured several LLM models (and hashcat for the extreme) to find the best speed per watt, which is usually around 1700-1900 MHz, or limiting the 3060 to 100-115 watts.
If I had planned it from the start, I might have gotten away with a used Mac Studio, that's right. However, I incrementally added more cards as I moved further into exploration.
I didn't want to confront anyone, but it looks like you either show off 4x 4090 or you keep silent.
I am amazed these days at people lacking knowledge about hardware and the massive benefits of undervolting/power limiting hardware. It's like people do not realize that what is sold is often overclocked or set to too high a vcore. The number of people I see buying insanely overspecced PSUs, and I go O_o ...
How is your performance with the different models on your setup?
> know Youtuber, that ran LLMs on a 4090, and the actual power draw was only 130W on the GPU.
Well, let's see his video. He must be using some really inefficient backend implementation if the GPU wasn't utilised like that.
I'm not running e-waste. My cards are L40S and even in basic inference, no batching with ggml cuda kernels they get to 70% util immediately.
Generally I set temp to 0-0.4 at absolute most.
min_p actually needs a little temperature to work effectively so with min_p I almost always use 0.2
temp=0 is also good for singular outputs. For classification tasks, it's better to actually inspect the logits.
But my go-to setting is always min_p of at least 0.01 or 0.05! It vastly suppresses incorrect rare random tokens from being created, and it helps massively!
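To make the temperature settings above concrete, here is a small sketch of temperature-scaled softmax on toy logits. Treating temperature 0 as a plain argmax is a common convention and an assumption here, not something every backend necessarily does.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities; temperature 0 is treated as greedy argmax."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # toy values
for t in (0, 0.2, 1.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At low temperatures nearly all the mass sits on the top token, which is why a setting like 0.2 behaves almost greedily while still leaving a sampler such as min_p something to work with.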
Alex (https://alexcodes.app) also does this now btw.
And it has a headphone jack, OK? I just hate Bluetooth earbuds. And yeah, it isna problem, but I digress.
When I run a 2.5B model, I get respectable output. Takes a minute or two to process the context, then output begins at somewhere on the order of 4 to 10 tokens per sec.
So, I just make a query and give it a few and I have my response.
Here is how I see it:
That little model, which is Gemma 2.2b sorry, knows a lot of stuff. It has knowledge I don't and it gives it to me in a reasonable, though predictable way. Answers are always of a certain teacher reminding student how it all goes way.
I don't care. Better is nice, but if I were stuck somewhere with no network, being able to query that model is amazing!
First aid, how to make fires, materials and uses. Fixing stuff, theories of operation, what things mean and more are in that thing ready for me to take advantage of.
I consider what I have fast. And it will get one or two orders faster over the next few years too.
I did it on a lark (ask the model what that means) and was surprised to see I gained a nice tool.
This scares me more than it should...
Please do not trust an AI in actual life and death situations... Sure if it is literally your only option, but this implies you have a device on you that could make a phone call to an emergency number where a real human with real training and actually correct knowledge can assist you.
Even as an avid hiker, the number of times I've been out of cell service is minuscule, and I absolutely refresh my knowledge on first aid regularly and any potential threats before a hike somewhere new.
It's just a tool in the toolbox. Like any tool, one must respect it and use it with care.
For choking, as an example, it outlines all the basics, affirmed it was educational info only, and urged the help of a professional.
All in all, reasonable.
Your point stands, but this little model 2.5B has clear bounds it seems.
I plan to poke at it a few more times asking for higher risk info, curious about what it may do.
Please define "lark" for me, as in "I did it on a lark."
Output began a bit before 8:41, and was complete before 8:43:
---Begin Model Output
"On a lark" means without much planning or reason, impulsively or unexpectedly.
Here's how it works:
"I did it on a lark." This means the person acted without a specific plan or reason. "A lark" refers to the lark bird, a small songbird known for its playful nature.
So, "on a lark" suggests something spontaneous, casual, and often with a touch of playful whimsy.
Here are some other ways to understand it:
Without much thought: The person was likely to do it quickly and without much consideration. Based on a whim: It's something they decided to do without a solid plan.
Let me know if you'd like more examples
---End Model Output
To me, that is pretty damn spiffy!
> British English informal used to suggest that an activity is foolish or a waste of time: he's serious about this music lark.
I can ask this model some pretty intense things and I get meaningful output back and it doesn't take very long and I don't need an internet connection. Like I said it's spiffy.
I will, have, and will use it again sans network connection and be happy I had the option.
Better will come. That is just fine. I have something very useful today, right now.
I don't give a damn about ideology I just want everything ever thought or written searchable and interactive
I had assumed the performance advantage for MoE came from minimising traffic between GPUs. But if it's per layer routing, then it's going to massively increase inter-gpu traffic compared to vertical slicing.
I guess that means the performance advantage actually comes when batching thousands of queries? The MoE routing would mean that on each MoE layer, each GPU shard gets a batch of queries that will all hit roughly the same subset of experts (and read the same weights from memory). The batches then shuffle between each MoE layer to re-optimise.
It's kind of like GPU raytracing where you get large performance gains by running coherency sorting on rays and batching similar rays together.
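The coherency-sorting analogy boils down to bucketing queries by the expert they were routed to, so each expert's weights are read once per batch rather than once per query. A toy sketch with made-up ids:

```python
from collections import defaultdict


def group_by_expert(assignments: list[tuple[str, int]]) -> dict[int, list[str]]:
    """Bucket (query_id, expert_id) pairs so each expert processes its queries together."""
    batches: dict[int, list[str]] = defaultdict(list)
    for query_id, expert_id in assignments:
        batches[expert_id].append(query_id)
    return dict(batches)


# Toy routing decisions for one MoE layer.
assignments = [("q0", 3), ("q1", 1), ("q2", 3), ("q3", 1), ("q4", 7)]
print(group_by_expert(assignments))
```

In a real system this regrouping would happen again at every MoE layer, since the router can send the same query to different experts each time.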
Probably was not r1, but one of the other models that got trained on r1, which apparently might still be quite good.
https://www.reddit.com/r/LocalLLaMA/comments/1i8ifxd/ollama_...
It's fairly clear that R1-Llama or R1-Qwen is a distill, and they're all coming directly from DeepSeek.
As an aside, at least the larger distilled models (I'm mostly running r1-llama-distill-70b) are definitely not the same thing as the base llama/qwen models. I'm getting better results locally, admittedly with the slower inference time as it does the whole "<think>" section.
Surprisingly - The content in the <think> section is actually quite useful on its own. If you're using the model to spitball or brainstorm, getting to see it do that process is just flat out useful. Sometimes more-so than the actual answer it finally produces.
>>> /show info
Model
architecture qwen2
parameters 7.6B
context length 131072
embedding length 3584
quantization Q4_K_M
I would recommend you to try to run the llama-based distill (same size, same quantization) that you can find here: https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-8...
It should take the same amount of memory as the one you currently have.
In my experience the Llama version performs much better at adhering to the prompt, understanding data in multiple languages, and going in-depth in its responses.
It's a model called Qwen, trained by Alibaba, which the DeepSeek team has used to "distill" knowledge from their own (100x bigger) model.
Think of it as forcing a junior Qwen to listen in while the smarter, PhD-level model was asked thousands of tough problems. It will acquire some of that knowledge and learn a lot of the reasoning process.
It cannot become exactly as smart, for the same reason a dog can learn lots of tricks from a human but not become human-level itself: it doesn't have enough neurons/capacity. Here, Qwen is a 7B model so it can't cram within 7 billion parameters as much data as you can cram into 671 billion. It can literally only learn 1% as much, BUT the distillation process is cleverly built and allows to focus on the "right" 1%.
Then this now-smarter Qwen is quantized. This means that we take its parameters (16-bit floats, super precise numbers) and truncate them to make them use less memory space. This also makes it less precise. Think of it as taking a super high resolution movie picture and compressing it into a small GIF. You lose some information, but the gist of it is preserved.
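To make the "less memory, less precise" trade-off concrete, here is a toy sketch. It assumes the simplest possible scheme: one shared scale and rounding to integers in [-7, 7]. Real formats such as Q4_K_M use per-block scales and are more elaborate, so treat this as an illustration of the idea and not of that format. The function names and sample weights are invented.

```python
def quantize_4bit(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integer codes in [-7, 7] using one shared scale."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Invented example weights, not taken from any real model.
weights = [0.70, -0.33, 0.12, 0.02, -0.69]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("restored:", restored)
print("largest error:", max_error)
```

Each code fits in 4 bits instead of 16, and the restored values are close to the originals but not identical, which is the GIF-versus-high-resolution effect described above.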
As a result of both of these transformations, you get something that can run on your local machine — but is a bit dumber than the original — because it's about 400 times smaller than the real deal.
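The "about 400 times" figure can be checked with back-of-the-envelope arithmetic. This sketch assumes 16 bits per weight for the full model (as stated above) and exactly 4 bits per weight for the quantized one; Q4_K_M actually averages somewhat more than 4 bits, so the real ratio would come out a bit lower.

```python
GB = 1e9  # decimal gigabytes


def size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB, ignoring any file overhead."""
    return params * bits_per_weight / 8 / GB


full = size_gb(671e9, 16)  # full model, 16-bit weights (assumed)
small = size_gb(7e9, 4)    # 7B distill, 4 bits per weight (assumed)
ratio = full / small

print(f"full model: {full:.0f} GB")
print(f"quantized distill: {small:.1f} GB")
print(f"ratio: {ratio:.0f}x")
```

Roughly 96x comes from the parameter count and another 4x from going from 16 bits to 4 bits per weight.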
And I think they, the DeepSeek team, finetuned Qwen 7B on DeepSeek's output. That is how I understood it.
Which apparently makes it quite good for a 7B model. But, again, if I understood it correctly, it is still just Qwen, without the reasoning of DeepSeek.
It may provide a financial opportunity for someone who disagrees with that aggregated opinion though.
I guess the 5090 either started ever so slightly to become compute limited as well, or hit some overhead limitation.
[1]: https://www.phoronix.com/review/nvidia-rtx5090-llama-cpp
What kind of timescale do you expect to be able to train a useful LLM with that?
The real r1 model is great, better than o1, but the distilled models are not even as good as the base models that they were distilled from.
This entire hype cycle has long been completely disconnected from reality. I've watched a lot of hype waves, and I've never seen one that oscillates so wildly.
I think you're right that OpenAI isn't as hurt by DeepSeek as the mass panic would lead one to believe, but it's also true that DeepSeek exposes how blown out of proportion the initial hype waves were and how inflated the valuations are for this tech.
Meta has been demonstrating for a while that models are a commodity, not a product you can build a business on. DeepSeek proves that conclusively. OpenAI isn't finished, but they need to continue down the path they've already started and give up the idea that "getting to AGI" is a business model that doesn't require them to think about product.
Couldn’t agree more! Nobody here read the manual. The last paragraph of DeepSeek’s R1 paper:
> Software Engineering Tasks: Due to the long evaluation times, which impact the efficiency of the RL process, large-scale RL has not been applied extensively in software engineering tasks. As a result, DeepSeek-R1 has not demonstrated a huge improvement over DeepSeek-V3 on software engineering benchmarks. Future versions will address this by implementing rejection sampling on software engineering data or incorporating asynchronous evaluations during the RL process to improve efficiency.
Just based on my evaluations so far, R1 is not even an improvement on V3 in terms of real world coding problems because it gets stuck in stupid reasoning loops like whether “write C++ code to …” means it can use a C library or has to find a C++ wrapper which doesn’t exist.
OpenAI doesn't have an advantage in compute more than Google, Microsoft or someone with a few billions of $$.
Why would anyone bet? They can just short the OpenAI / MS stocks, and see in a few months if they were right or not.
> DeepSeek's first-generation of reasoning models with comparable performance to OpenAI-o1, including six dense models distilled from DeepSeek-R1 based on Llama and Qwen.
Well I guess if you are in the Enterprise Java naming model you would expect something like "VisitorModelUtilsListGetterAdapterInterceptorMessageManagerDrivenObserverPool"
If you look at their API docs you will see:
model: name of the model to push in the form of <namespace>/<model>:<tag>
I don't think there is any reason to jump to the conclusion it is some type of conspiracy here, just naming things based on an API that probably didn't think about distillation when they created it.
[1]: https://ollama.com/library/deepseek-r1
[2]: https://github.com/deepseek-ai/DeepSeek-R1#deepseek-r1-disti...
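To illustrate the naming point: the <namespace>/<model>:<tag> form quoted above has three slots and none of them is reserved for a base architecture. The parser below is a hypothetical helper written for this comment, not Ollama's code, and "example-user" is a made-up namespace.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    namespace: str
    model: str
    tag: str


def parse_model_name(name: str) -> ModelName:
    """Split '<namespace>/<model>:<tag>' into its three parts."""
    namespace, slash, rest = name.partition("/")
    model, colon, tag = rest.partition(":")
    if not (slash and colon and namespace and model and tag):
        raise ValueError(f"expected <namespace>/<model>:<tag>, got {name!r}")
    return ModelName(namespace, model, tag)


parsed = parse_model_name("example-user/deepseek-r1:7b")
print(parsed)
```

Whether a given tag refers to a Qwen or a Llama distill has to be squeezed into the tag or looked up on the model page, because the name format itself has nowhere else to put it.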
But here's my advice: drop the fallacious arguments and try something more honest.
Still, his implied assertion that markets can often behave irrationally, and can't be used as evidence of technical matters, seems pretty valid to me.
But I suppose you could see it as a sign that something is at least temporarily "generally accepted" among investors. That doesn't mean it's generally accepted among AI researchers, though.
Although I thought it was $6M rather than $5M, and that that was only the last step, and not the total investment. What does seem to be generally accepted among investors is that this isn't good news for NVidia's profits, but that still doesn't mean that all the specific facts are generally accepted.
>Each Project DIGITS features 128GB of unified, coherent memory and up to 4TB of NVMe storage.
Even if $3k is only the starting price, it doesn't sound like spending more buys you more memory.
Efficient market hypothesis is for nerds!
How many laptops have you personally seen be stolen on a train?
Depending on the train type and configuration, many actually seem like pickpocket paradise.
FWIW I’ve used my laptop on the train plenty, I’ve never had anything stolen nor felt in any danger of it.
People get up to use the bathroom or the cafe car, the laptop is left behind for ten minutes, one of the train stops is while they're away from their seat, and someone sees an opportunity, snags it, and gets off at the stop.
This is an actual thing. And if it's worth a thousand bucks then it's very much worth getting off at an earlier stop than you'd planned, and continuing your journey on the next train.
Ticket inspectors or guards are irrelevant. There isn't one in your car 99% of the time.
I don't know why you're trying to argue laptop theft on trains in first-world countries isn't a thing. It absolutely is.
So, yes, theft on trains is a thing for people who think they are 100% safe, but applying the same idea (assuming something is 100% safe and not being cautious), I wonder how such people use the internet...
My work policies state you simply cannot leave your laptop out of sight for any period unless it's in a secure location (work|home). I feel the same way for my personal laptop as well.
Obviously, nobody steals things while the train is in motion. They wait until the train is about to leave the station, snatch a phone or handbag and jump out just as the door is closing. The train leaves, the thief blends in with other passengers leaving the station, and by the time news of the theft has made it from the passengers to the driver to the station staff the thief is long gone.
Of course people drive around $6,000+ cars all the time, so....
Something interesting: I live near a train line where the doors are not automatic (they have to be opened manually on each stop), and there have been incidents where people get pickpocketed while the train is still in motion, and the thief jumps out right before the station, when the train has slowed down significantly but is still in motion. Many people have been hurt doing this.
Make sure the laptop is insured and that full disk encryption is enabled. If it’s a Mac, make sure you have it in Find My so you can wipe it remotely if that’s something you worry about.
Yesterday's entry: "... kind of a mind flex that you noted you used Meta Stories glasses to take that photo."
But yes, you're right. I've never personally seen a laptop get stolen. In fact, most people who have their laptop get stolen never see their laptop get stolen either.
I have, however, had coworkers who've had their laptops stolen. Multiple times.
Today's baseline laptops are really good as it is. 32-64 GiB of RAM is plenty, and at least on PC laptops you can do it fairly cheaply. Apple has been a consistent year or two ahead in mobile CPU performance but it fell out of my consideration ever since I realized the M1 and 7040 were both very sufficient for any local computation I cared about. (I'm not going to say I'd specifically go for less efficiency or performance, but it has become significantly lower priority over other things like repairability.)
Not really specifically hating on Apple, here. If I was going to get another Mac it'd be a Mac Mini or Mac Studio probably, ideally with a third-party SSD upgrade to both save on costs and get a slight bit of extra drive performance too. I've definitely considered it, even though I am very far from an Apple fan, just due to the superior value and efficiency they have in many categories.
So a little trick I figured out is to close my laptop lid and then slide it into a pocket of my backpack. I can then carry it with me when I get up and move around.
So then I can take it with me to eat lunch or an extended toilet break. Maybe some day all laptops will have that feature.
The attempted thief didn't succeed in taking the phone, but did knock the laptop onto the ground, damaging it.
If I were to compile statistics on the primary cause of losing a laptop among people I know, the biggest danger would be liquids in glasses (that end up on the laptops) ...
I don't think I've ever seen a human being do that before on a train. Not to go to the toilet, nor to grab a coffee in another car.
You can't be paranoid about everything. My friend in France had put his laptop back into his bag where it wasn't visible and assumed that was good enough, but someone must have seen him do it and just took the whole bag.
You are applying a totally unreasonable standard, to suppose that the thefts were due to unreasonable carelessness. What, do you think someone should take their large luggage into the bathroom too, every time they need to pee?
Talk about victim-blaming.
The standard is mine and I follow it. The same way I find it absurd not to do it, you find it unreasonable to do it.
I find the expectation that things are not stolen (if unsupervised in public places) strange considering the huge wealth inequality that exists even in civilized countries. I do not agree with the idea of stealing, and thieves should be punished, but expecting everybody "to behave" given the situation seems unrealistic to me.
That does not mean that I think that things are stolen 100% of the time. I have a friend that forgot a laptop on a bus (Netherlands) and the driver found it at the end of the line and gave it to lost objects so my friend got it back.
If you find it absurd how 99% of people act on long-distance trains, I don't know what to tell you.