The Era of 1-bit LLMs: ternary parameters for cost-effective computing (arxiv.org)

1040 points by fgfm 2 years ago | 447 comments

cs702 2 years ago |

There are two findings I find shocking in this work:

* In existing LLMs, we can replace all parameter floating-point values representing real numbers with ternary values representing (-1, 0, 1).

* In matrix multiplications (e.g., weights by vectors), we can replace elementwise products in each dot product (a₁b₁ + a₂b₂ ...) with elementwise additions (a₁+b₁ + a₂+b₂ ...), in which signs depend on each value. See the paper for exact details.

On existing hardware, the gains in compute and memory efficiency are significant, without performance degradation (as tested by the authors).

If the proposed methods are implemented in hardware, we will see even greater gains in compute and memory efficiency.

Wow.

paul_mk1 2 years ago | |

Fun to see ternary weights making a comeback. This was hot back in 2016 with BinaryConnect and TrueNorth chip from IBM research (disclosure, I was one of the lead chip architects there).

Authors seemed to have missed the history. They should at least cite Binary Connect or Straight Through Estimators (not my work).

Helpful hint to authors: you can get down to 0.68 bits / weight using a similar technique, good chance this will work for LLMs too.

https://arxiv.org/abs/1606.01981

This was a passion project of mine in my last few months at IBM research :).

I am convinced there is a deep connection between understanding why backprop is unreasonably effective and the result that you can train low precision DNNs. For those not familiar, the technique is to compute the loss wrt the low precision parameters (eg project to ternary) but apply the gradient to a high precision copy of the parameters (known as the straight through estimator). This is a biased estimator and there is no theoretical underpinning for why this should work, but in practice it works well.

My best guess is that it is encouraging the network to choose good underlying subnetworks to solve the problem, similar to Lottery Ticket Hypothesis. With ternary weights it is just about who connects to who (ie a graph), and not about the individual weight values anymore.
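The straight-through training loop described above can be sketched in a few lines of NumPy. This is only an illustration on a toy linear model, not the paper's training code: the function names `ternarize` and `ste_step`, the absmean scaling used for the projection, and the learning rate are all assumptions made for the example.

```python
import numpy as np

def ternarize(w):
    """Project real-valued weights onto {-1, 0, +1}.

    Assumption: scale by the mean absolute value, then round and clip.
    """
    scale = np.mean(np.abs(w)) + 1e-8
    return np.clip(np.round(w / scale), -1, 1)

def ste_step(w_latent, x, y, lr=0.01):
    """One gradient step on a linear model y ~ x @ ternarize(w_latent)."""
    w_q = ternarize(w_latent)        # forward pass uses the ternary weights
    err = x @ w_q - y                # residuals, shape (batch,)
    grad_q = x.T @ err / len(y)      # gradient of 0.5 * MSE wrt the ternary weights
    # Straight-through: treat d(w_q)/d(w_latent) as the identity and apply
    # the gradient to the high-precision copy instead of the ternary one.
    return w_latent - lr * grad_q

# Toy usage: try to recover a ternary target from noiseless data.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))
w_target = rng.choice([-1.0, 0.0, 1.0], size=8)
y = x @ w_target
w = rng.normal(scale=0.1, size=8)
for _ in range(200):
    w = ste_step(w, x, y, lr=0.05)
print(ternarize(w))
```

The high-precision `w` is only needed during training; at inference time only `ternarize(w)` has to be stored.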

cs702 2 years ago | | |

Thank you. Others on this thread have addressed the citation-trail issues you raise. I just want to tell you how helpful I find your comment about why ternary weights ought to work at all without degrading performance:

> My best guess is that it is encouraging the network to choose good underlying subnetworks to solve the problem, similar to Lottery Ticket Hypothesis. With ternary weights it is just about who connects to who (ie a graph), and not about the individual weight values anymore.

Your guess sounds and feels right to me, even if currently there's no way to express it formally, with the rigor it deserves.

Thank you again for your comment!

mjcohen 2 years ago | | |

IIRC, Hamming's book "Digital Filters" (1989) has a section on FFTs with only the sign of the coefficient being used. It performed surprisingly well.

fabmilo 2 years ago | | |

They train using the Straight Through Estimator, but it is cited in the previous BitNet paper. What happened to the TrueNorth chip? I think investing in specialized hardware for AI is a good bet.

WhitneyLand 2 years ago | | |

That’s really interesting to see the breadcrumb trail goes back that far.

So what are the most important insights in this paper compared to what was previously done?

I assume there’s more context to the story and it’s not just that no one thought to apply the concepts to LLM’s until now?

eru 2 years ago | | |

You can probably apply the same techniques 'Deep neural networks are robust to weight binarization and other non-linear distortions' used to get to 0.68 bits / weight to get your ternary weights below one bit; so you can claim they are still one-bit networks.

WiSaGaN 2 years ago | | |

Could the reason that 3 states in this case be more efficient than 2 states be that 3 is closer to 2.718... (Euler's number) than 2 is?

nxobject 2 years ago | | |

As an aside, I'm curious: what was it like to work at IBM research, especially as a legacy industrial research org?

antimatter15 2 years ago | | |

They cite straight through estimators in the previous work with many of the same authors on (actual binary) BitNet

vessenes 2 years ago | |

I'd be VERY cautious about being excited here.

My priors are like this:

1. Initial training of a neural network moves all weights around a large amount at first.

2. Later training of the network adjusts them a small amount.

3. An undertrained network will therefore look a lot like figuring out "positive, negative, or 0?" for each node during early training.

If all these things are true, then

1. Early training of an fp16 network and a bitnet with 0 added will be roughly similar in results

2. Later training will yield different / worse results, as the network gets into the 'fine tuning' part of the training.

I think the paper's stats back these priors up -- they say "this works on (3B+) large networks, but not small ones." They then imply there's something about the structure of a large network that allows a bitnet to do well. It seems more likely to me it works on large networks because they have not put the compute into 3B+ networks to get past the 'gross tuning' phase.

The networks they have compute to put in to get them 'fully' trained -- those networks don't show the results.

Also, a quick reminder that Perplexity 12 is really terrible. You would not want to use such a network. Hopefully I'm wrong and we can get something for free here! But, I'm cautious - to - skeptical.

vessenes 2 years ago | | |

Update - I'm still cautious about this paper, but I had the table numbers inverted in my head while thinking about it. The paper shows better perplexity results than competing models at larger parameter sizes, so I was wrong.

svantana 2 years ago | | |

Wait, are we reading the same paper? What I'm seeing is comparable accuracy to unquantized models for <4B params, and nothing reported for larger models except resource consumption.

gradascent 2 years ago | | |

Then perhaps a method emerges out of this to make training faster (but not inference) - do early training on highly quantized (even ternary) weights, and then swap out the weights for fp16 or something and fine-tune? Might save $$$ in training large models.

cs702 2 years ago | | |

Thank you. Your key point -- that so far all models with the proposed methods may have been only "grossly trained" -- is compelling. If I understand the authors correctly, they trained the compared models on only 100B tokens, all drawn from RedPajama, to make the comparisons apples-to-apples. That seems sensible to me, and makes replication easier, but I agree we need to see more extensive testing, after more extensive pretraining, on models of larger sizes.

mise_en_place 2 years ago | | |

Intuitively I've always been a bit skeptical of quantization. Wouldn't there be a tiny loss in precision by doing this type of quantization? I could imagine the error function increasing by utilizing these types of techniques.

gliptic 2 years ago | | |

> Also, a quick reminder that Perplexity 12 is really terrible.

The 3B model had a perplexity of 9.91, less than LLaMa 1 in fp16.

nutanc 2 years ago | |

We have been experimenting with the paper(https://www.researchgate.net/publication/372834606_ON_NON-IT...).

There is a mathematical proof that binary representation is enough to capture the latent space. And in fact we don't even need to do "training" to get that representation.

The practical application we tried out for this algorithm was to create an alternate space for mpnet embeddings of Wikipedia paragraphs. Using Bit embedding we are able to represent 36 million passages of Wikipedia in 2GB.(https://gpt3experiments.substack.com/p/building-a-vector-dat...)

SushiHippie 2 years ago | | |

Wow, this works better than I would've thought.

> Who moderates Hacker News?

First result:

> Hacker News

> At the end of March 2014, Graham stepped away from his leadership role at Y Combinator, leaving Hacker News administration in the hands of other staff members. The site is currently moderated by Daniel Gackle who posts under the username "dang".

cs702 2 years ago | | |

You're talking about mapping floating-point vector representations, i.e., embeddings, computed by a pretrained LLM to binary vector representations, right? And you're talking about doing this by first having someone else's pretrained LLM compute the embeddings, right? Sorry, but that seems only minimally, tangentially related to the topic of running LLMs in ternary space. I don't see how your comment is relevant to the discussion here.

fabmilo 2 years ago | | |

I find this extremely interesting. Do you share the source code of the process? any more references?

m3kw9 2 years ago | | |

How is this not lossy compression?

creshal 2 years ago | |

> * In existing LLMs, we can replace all parameter floating-point values representing real numbers with ternary values representing (-1, 0, 1).

Why is this so shocking? Quantization has been widely explored, driving that to its extreme (and blowing up parameter count to make up for it) just seems like a natural extension of that.

Easier said than done, of course, and very impressive that they pulled it off.

> In matrix multiplications (e.g., weights by vectors), we can replace elementwise products in each dot product (a₁b₁ + a₂b₂ ...) with elementwise additions (a₁+b₁ + a₂+b₂ ...), in which signs depend on each value

I feel like this follows naturally from having only ternary values, multiplication doesn't really bring much to the table here. It's a bit surprising that it's performing so well on existing hardware, usually multiplication hardware sees more optimization, especially for GPGPU hardware.

cs702 2 years ago | | |

> Why is this so shocking? Quantization has been widely explored, driving that to its extreme (and blowing up parameter count to make up for it) just seems like a natural extension of that.

I find it shocking that we don't even need lower floating-point precision. We don't need precision at all. We only need three symbols to represent every value.

> I feel like this follows naturally from having only ternary values, multiplication doesn't really bring much to the table here. It's a bit surprising that it's performing so well on existing hardware, usually multiplication hardware sees more optimization, especially for GPGPU hardware.

I find it shocking. Consider that associative addition over ternary digits, or trits, represented by three symbols (a,b,c) has only three possible input pairs, (a,b), (a,c), or (b,c) (within each pair, order doesn't matter), and only three possible outputs, a, b, or c. Matrix multiplications could be executed via crazy-cheap tritwise operations in hardware. Maybe ternary hardware[a] will become a thing in AI?

---

[a] https://en.wikipedia.org/wiki/Ternary_computer

satellite2 2 years ago | | |

Because it's no longer a linear optimization or curve fitting problem. It becomes a voting or combinatorial problem. Which at least in my mind are two completely different areas of research.

gemeral 2 years ago | | |

> and blowing up parameter count to make up for it

based on (an admittedly rapid and indulgent reading of the paper), it seems like they're not increasing the parameter size. Do you mind pointing out where the blowup is occurring?

SuchAnonMuchWow 2 years ago | | |

No, unless I'm mistaken it's a huge impact: it means the matrix product is separable: basically, it's an O(n²) algorithm, and not O(n³): add together all the c_j = sum(a_i_j), d_i = sum(b_i_j), and the final results are all the combinations of c_j+d_i. And even then, half of that is unnecessary because the d_i can all be pre-computed before inference since they are weights.

But I skimmed over the paper, and didn't find the part where it was explained how they replace the product by additions: from what I understand, they replace multiplications by b_i by selecting +a_i, 0, or -a_i. So the final matrix multiplication can be implemented by only additions, but only because the weights are 1, 0, -1 they avoid multiplications altogether. This is really different from what the GP said (replacing a0*b0+... by a0+b0+...).

ncruces 2 years ago | | |

Well I guess it's the “blowing up parameter count to make up for it” that confuses me, but maybe it's just ignorance.

Like what would be the expected factor of this blow up to make up the difference between ternary and whatever 16 bits encoding they were using?

I mean intuitively I'd expect to need ~10× the symbols to encode the same information? Are they using an order of magnitude more parameters, or is that not how it works?

Noe2097 2 years ago | |

There is another _shocking_ realization in this work: there are 11 types of people: those who know what binary means, those who don't, and those who say they do but actually don't.

"The era of 1-bit LLMs"

Representing { -1, 0, 1 } can't be done with 1-bit, I'm sorry -- and sad, please let's all get back to something vaguely sound and rigorous.

npunt 2 years ago | | |

Ternary supporters are always bitter about this

(I'll let myself out)

gpderetta 2 years ago | | |

There are 10 types of people, those who don't know binary, those who do and those who know ternary.

hk__2 2 years ago | | |

> please let's all get back to something vaguely sound and rigorous

Something rigorous would be to actually read the paper rather than stop at the first part of its title. The authors are not claiming their LLM is 1-bit.

esrauch 2 years ago | | |

One trit but that's not a word anyone knows.

jandrese 2 years ago | |

It seems like the AI space is slowly coming back around to the old Thinking Machines CM-1 architecture. It's not too often in computing where you see ideas a full 40 years ahead of their time make it into production.

giantrobot 2 years ago | | |

IIUC the main issue with the CM-1 architecture was feeding the processor cluster with data. That required a heftier front end system than was practical/affordable at the time. With modern CPUs and memory subsystems the GPUs can be saturated pretty easily. So going back to huge clusters of super narrow cores won't starve them for work.

theendisney 2 years ago | | |

Memristors any moment now

abeppu 2 years ago | |

> On existing hardware, the gains in compute and memory efficiency are significant, without performance degradation (as tested by the authors).

Did they actually show absence of performance degradation?

I think it's conspicuous that Table 1 and Table 2 in the paper, which show perplexity and accuracy results respectively, are only for small model sizes, whereas Figure 2, Figure 3 (latency, memory, energy consumption) and Table 3 (throughput) all show larger model sizes. So it seems like they had every opportunity to show the perplexity/accuracy comparisons at the larger model sizes, but did not include them.

cs702 2 years ago | | |

Others have already made the same point in this thread. See my response here: https://news.ycombinator.com/item?id=39539508

flockonus 2 years ago | |

Considering how much faster additions are processed, and how a particular silicon chip could be optimized for this very specific case; all parts added together perhaps could show >100x speed up vs current systems.

I must concur, "wow".

Nevermark 2 years ago | | |

For hardware, 2-argument ternary additions and multiplications should be very close in terms of the tiny circuit required for either.

If you are doing ternary calculations on 32/16-bit hardware, then the additions would be simpler.

p1esk 2 years ago | |

Ternary networks have been used since 2015. There are hundreds of papers. They all require full QAT (training from scratch). Not sure why you’re shocked.

cs702 2 years ago | | |

Because it's not just the use of ternary values. It's also that there are no dot-products; there are only additions. And when we apply both changes to existing LLMs, there's no performance degradation (as tested by the authors).

rhaps0dy 2 years ago | |

I think you need more evidence than this paper (which is very short and light on actual numbers) to be this shocked.

For example, most of the plots in the paper are actually of throughput, memory, etc. all performance characteristics that are better on the ternary version. Which, of course.

The only things that contain perplexities are Tables 1 and 2. There, they compare "BitNet b1.58 to our reproduced FP16 LLaMA LLM in various sizes" on the RedPajama data set. The first thing to note is the perplexities are very high: they're all at least ~9.9, compared for example with quantized Llama on wikitext-2, which is at 6.15 (https://www.xzh.me/2023/09/a-perplexity-benchmark-of-llamacp...). Maybe RedPajama is a lot harder than wikitext-2, but that's a big gap.

I think probably their benchmark (their "reproduced FP16 LLaMA LLM") is just not very good. They didn't invest much in training their baseline and so they handily beat it.

cs702 2 years ago | | |

Thank you. I think the paper as it is provides enough evidence to support the claims. If I understand the authors correctly, they trained the compared models on only 100B tokens, all drawn from RedPajama, to make the comparisons apples-to-apples. That's sensible. It allows for easier replication of the results. Otherwise, I agree with you that more extensive testing, after more extensive pretraining, is still necessary.

fzliu 2 years ago | |

This will be big for FPGAs - adders are extremely cheap compared to multipliers and other DSP blocks.

eru 2 years ago | | |

Multipliers for eg 8 bit or 4 bit floating point values should also be pretty cheap? (I assume multipliers have a cost that grows quadratically with the number of bits?)

phkahler 2 years ago | |

>> we can replace elementwise products in each dot product (a₁b₁ + a₂b₂ ...) with elementwise additions (a₁+b₁ + a₂+b₂ ...), in which signs depend on each value

Thinking out loud here. If you encode 64 weights in 2 64-bit words you can have the bits in one word indicating +1 if they're 1, and the bits in the other word indicating -1 if they are 1. You should be able to do the "products" with a few boolean operations on these 2 words to get a pair of 64 bit words for the result. Then summing becomes a matter of using a count-of-1's instruction on each word and subtracting the "negative" count from the positive. If AVX instructions can do this too, it seems like equivalent of 10-100 TOPS might be possible on a multi-core CPU.
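The two-mask idea above can be sketched as follows. Note the assumption this sketch makes: both operands are ternary, so each side is stored as a (plus, minus) pair of bit masks. With wider activations the popcount trick does not apply directly. The helper names `pack_ternary`, `popcount` and `ternary_dot` are made up for the illustration, and Python integers stand in for 64-bit machine words.

```python
def pack_ternary(values):
    """Pack values from {-1, 0, +1} into a (plus_mask, minus_mask) pair."""
    plus = minus = 0
    for i, v in enumerate(values):
        if v == 1:
            plus |= 1 << i       # bit i set in the "+1" word
        elif v == -1:
            minus |= 1 << i      # bit i set in the "-1" word
        elif v != 0:
            raise ValueError(f"not a ternary value: {v!r}")
    return plus, minus

def popcount(x):
    """Count set bits (the count-of-1's instruction in hardware)."""
    return bin(x).count("1")

def ternary_dot(a, b):
    """Dot product of two packed ternary vectors using only bitwise ops."""
    a_plus, a_minus = a
    b_plus, b_minus = b
    pos = (a_plus & b_plus) | (a_minus & b_minus)   # positions whose product is +1
    neg = (a_plus & b_minus) | (a_minus & b_plus)   # positions whose product is -1
    return popcount(pos) - popcount(neg)

weights = [1, -1, 0, 1, -1]
acts = [1, 1, -1, 0, -1]
print(ternary_dot(pack_ternary(weights), pack_ternary(acts)))
```

On real hardware the masks would be fixed-width words and `popcount` a single instruction; the Python version only demonstrates that the boolean identities give the same result as the ordinary dot product.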

cs702 2 years ago | | |

Yes. More generally, this will enable implementation via crazy-cheap bit-wise ops in binary hardware, and possibly, maybe, via crazy-cheap trit-wise ops in ternary hardware that manipulates ternary digits, or trits. Note that any binary op over trits has only nine possible (trit, trit) input pairs and only three possible trit outputs. Maybe ternary hardware for AI will become a thing?

beagle3 2 years ago | |

I haven’t been keeping tabs, but this seems very much like the RIP / Achlioptas version of the Johnson-Lindenstrauss lemma.

Perhaps the rest of the JL lemma promise applies as well - compressing the number of parameters by a few orders of magnitude as well.

lr1970 2 years ago | |

Authors reported perplexity only for small up to 3B weights models. On the other hand, they reported throughput for 70B model, but not its performance (perplexity, end-to-end tasks). Very unfortunate omission. Overall, the paper is rather poorly written.

cs702 2 years ago | | |

If I understand the authors correctly, they trained the compared models on only 100B tokens, all drawn from RedPajama, to make the comparisons apples-to-apples. That's sensible. It allows for easier replication of the results. Otherwise, I agree with you that more extensive testing, after more extensive pretraining, at larger model sizes, is still necessary.

bjornsing 2 years ago | |

> * In matrix multiplications (e.g., weights by vectors), we can replace elementwise products in each dot product (a₁b₁ + a₂b₂ ...) with elementwise additions (a₁+b₁ + a₂+b₂ ...), in which signs depend on each value. See the paper for exact details.

Aren’t you over complicating it a bit here? A dot product between a vector of activations (a₁, a₂, …) and a vector of ternary weights (b₁, b₂, …) can of course be computed as the sum of all activations for which the weight is 1, minus the sum of all activations for which the weight is -1.

It can’t however be computed as (a₁+b₁ + a₂+b₂ ...). You must have gotten that wrong.
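The "sum where the weight is 1, minus sum where the weight is -1" formulation can be checked against an ordinary matrix-vector product. This is a minimal NumPy sketch with a hypothetical helper name, `ternary_matvec`; it is meant to show the arithmetic, not how an optimized kernel would be written.

```python
import numpy as np

def ternary_matvec(w, a):
    """Multiply a ternary weight matrix by an activation vector without multiplications.

    w: (out, in) array with entries in {-1, 0, +1}
    a: (in,) array of real-valued activations
    """
    w = np.asarray(w)
    a = np.asarray(a, dtype=float)
    plus = np.where(w == 1, a, 0.0).sum(axis=1)     # add activations where weight is +1
    minus = np.where(w == -1, a, 0.0).sum(axis=1)   # add activations where weight is -1
    return plus - minus

w = np.array([[1, 0, -1],
              [-1, -1, 1]])
a = np.array([0.5, 2.0, 3.0])
print(ternary_matvec(w, a), w @ a)   # both expressions compute the same vector
```

Zero weights simply drop their activation from both sums, which is why the 0 state also acts as a form of sparsity.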

PaulHoule 2 years ago | |

I am not startled at all. Dense vector representations are pretty silly, they can’t really be the road to knowledge representation.

anon373839 2 years ago |

> BitNet b1.58 can match the performance of the full precision baseline starting from a 3B size. ... This demonstrates that BitNet b1.58 is a Pareto improvement over the state-of-the-art LLM models.

> BitNet b1.58 is enabling a new scaling law with respect to model performance and inference cost. As a reference, we can have the following equivalence between different model sizes in 1.58-bit and 16-bit based on the results in Figure 2 and 3.

> • 13B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 3B FP16 LLM.

> • 30B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 7B FP16 LLM.

> • 70B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 13B FP16 LLM.

This paper seems to represent a monumental breakthrough in LLM efficiency, as the efficiency gains come with zero (or negative) performance penalty.

Does it seem at all likely that existing models could be converted?

osigurdson 2 years ago |

I have often mused that, in some ways, it seems like the transistor is really being wasted in AI applications. We use binary states in normal computing to reduce entropy. In AI this is less of a concern, so why not use more of the available voltage range? Basically, re-think the role of the transistor and re-design from the ground up - maybe NAND gates are not the ideal fundamental building block here?

w-m 2 years ago |

I was reading Exposing Floating Point today (as Airfoil is on the HN front page and I was perusing the archive of the author). It's a blog explaining the inner workings of floating point representations. About zero values it says [0]:

> Yes, the floating point standard specifies both +0.0 and −0.0. This concept is actually useful because it tells us from which “direction” the 0 was approached as a result of storing value too small to be represented in a float. For instance -10e-30f / 10e30f won’t fit in a float, however, it will produce the value of -0.0.

The authors of the LLM paper use the values {-1, 0, 1}. Connecting the two ideas, I'm now wondering whether having a 2-bit {-1, -0, 0, 1} representation might have any benefit over the proposed 1.58 bits. Could the additional -0 carry some pseudo-gradient information, ("the 0 leaning towards the negative side")?

Also, I've seen 2-bit quantizations being proposed in other LLM quantization papers. What values are they using?

[0] https://ciechanow.ski/exposing-floating-point/#zero

lucubratory 2 years ago |

After reading the results I skipped back to the comment section to ask if this was real because it looks a little too good to be true, but figured I should check authors and it's Microsoft research and UCAS so yeah, real. This is going to change a lot of things, obviously the edge computing applications they point out, but also this is going to bottom out the cost of providing high-performance LLMs in the cloud. I don't know what that means for the economics long term, naively way less costs maybe means new entrants without an entire cloud available can compete easier? I do wonder if something like this has already been found and implemented by either OpenAI or Google.

gojomo 2 years ago |

That's not a 'bit' ("Binary digIT"). It's closer to a 'trit' ("TeRnary-digIT"). Specifically, ternary digits spanning {-1, 0, 1} (rather than the usual {0, 1, 2} in a base-3 numbering system) are 'balanced ternary'.

A great intro to the theoretical reasons ternary might have some promise in computing is this 2001 article from 'American Scientist', "Third Base", which quotes Knuth calling balanced-ternary "perhaps the prettiest numbering system of all" and also discusses an abortive Soviet effort in the direction of ternary computing:

http://web.archive.org/web/20011205185830/http://americansci...

In an aside, the article hints that e-nary digits (base 2.718…) if somehow made practical/meaningful, might actually be better than ternary (or perhaps even optimal?).

So maybe this paper's observation that ~"1.58 bits" (log2(3) binary-digits) is a sweet-spot could be further refined into some method for representing the state of an e-nary-modeled algorithm in log2(e) binary-digits (~"1.44 bits") per underlying e-it.

(As it may be of renewed interest, I've also put this 2001 "American Scientist" base-3 intro as a new HN submission for discussion: https://news.ycombinator.com/item?id=39541756)
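The numbers in the comment above are easy to reproduce. The sketch below computes the information content of one digit in a given base, plus the usual "radix economy" cost base / ln(base), which is the standard argument for why e is the optimum and 3 is the best integer base. That this is the measure the article has in mind is an assumption, and the function names are made up for the example.

```python
import math

def bits_per_digit(base):
    """Information carried by one digit of the given base, in bits."""
    return math.log2(base)

def radix_economy(base):
    """Relative cost of representing numbers in a base: base / ln(base).

    Lower is better; the continuous minimum is at base = e.
    """
    return base / math.log(base)

for base in (2, 3, math.e):
    print(f"base {base:.3f}: {bits_per_digit(base):.4f} bits/digit, "
          f"economy {radix_economy(base):.4f}")
```

By this measure base 3 beats base 2, while base 4 ties with base 2, which is one way to read the "sweet-spot" remark.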

ulnarkressty 2 years ago |

Take this with a grain of salt until someone reproduces it. Improvements such as these require extraordinary evidence. Not to mention extreme quantization has been tried before.

tuananh 2 years ago |

Major breakthrough in LLM scene. Achieve performance and perplexity equivalent to full FP16 models of same parameter size.

And you can fit 120B model with a single card 24GB VRAM. This is mind blowing.
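The 24GB figure follows from simple arithmetic on the weight storage alone. The sketch below assumes decimal gigabytes and ignores activations, KV cache and any packing overhead, so it is a lower bound rather than a deployment estimate; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Storage for the weights only, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

n = 120e9  # 120B parameters
for label, bits in [("fp16", 16), ("2-bit", 2), ("5 trits per byte", 1.6), ("log2(3)", 1.58)]:
    print(f"{label:>16}: {weight_memory_gb(n, bits):6.1f} GB")
```

At the ideal 1.58 bits per weight the weights come to roughly 23.7 GB, so the claim holds only with almost no headroom left for anything else.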

cyanydeez 2 years ago | |

I mean, it expands the hardware selection, but until there's models and leader boards etc, can't really say it's a break through.

fnordpiglet 2 years ago | | |

I would assume a GPU isn’t specifically optimized for ternary computation and specialized accelerators would whip the pants off a GPU

Klipper3 2 years ago |

The theoretical capacity of a binary network is 69% of the capacity of a full-weight network, so it makes sense that LLM would converge to 1-bit networks in the long term.

It's nice to finally see practical networks reach the theoretical limits found in the statistical mechanics of Ising models. A good pointer to efficient 1-bit training, from the statistical mechanics point of view, is here:

https://www.pnas.org/doi/full/10.1073/pnas.0700324104

arunk47 2 years ago | |

What is stopping us right now from doing these one bit networks?

tarruda 2 years ago | | |

I think no code was released yet

esha_manideep 2 years ago |

These models will be compatible with llama.cpp out of the box. We (GigaML - https://gigaml.com) are planning to train a small model (3-4B, 1-bit, opensource) with the latest stack-v2 dataset released today. Let me know if anyone is interested in collaborating with us.

a2code 2 years ago | |

I'm interested in collaborating. For example, from the comments it occurred to me that a 128-bit SIMD register can contain 64 2-bit values. It seems straightforward that SIMD bitwise logical operations could be used in training such models.

libertalia0 2 years ago | |

Highly interested in collaborating – got a bunch of proprietary legal data already pre-sorted and labeled for various scenarios. I've already benchmarked legal use-cases (i.e. legal speciality, a few logic-based questions, and specific document creation) with various LLMs – so would love to see what benchmarks this can produce compared to early Mistral or Llama.

Let me know what's the best way to reach out!

fgfm 2 years ago |

It's funny how discoveries in NLP & computer vision complement each other. The replacement of multiplication by additions made me think about the AdderNet paper (https://arxiv.org/abs/1912.13200), which concluded that you suffer almost no performance drop.

Perhaps the accumulators in current hardware cannot leverage this to its full potential, but combined with such a strict quantization, this would open LLM to the wider ML community much earlier than expected (when consumer hardware allows you to train near SOTA LLMs from scratch on your machine).

oxxoxoxooo 2 years ago |

Prior art:

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

https://arxiv.org/abs/1602.02830

Ternary Neural Networks for Resource-Efficient AI Applications

https://arxiv.org/abs/1609.00222

kandu 2 years ago | |

Also: training neural networks by turning connections on and off, or by just flipping the sign of the weights: https://arxiv.org/abs/2006.16627

alexey-salmin 2 years ago |

Also from Microsoft in 2021: Make Every feature Binary: A 135B parameter sparse neural network for massively improved search relevance [1]

[1] https://www.microsoft.com/en-us/research/blog/make-every-fea...

imjonse 2 years ago |

Too bad there seem to be no pretrained models to download. This is not a quantization method to apply on existing models, so having the pretrained weights is needed if one wants to test it.

bArray 2 years ago | |

+1 On this, the real proof would have been testing both models side-by-side.

It seems that it may be published on GitHub [1] according to HuggingFace [2].

[1] https://github.com/microsoft/unilm/tree/master/bitnet

[2] https://huggingface.co/papers/2402.17764

imjonse 2 years ago | | |

Nothing there yet, but it's good to know they want to publish and just have not gotten around to it yet.

SushiHippie 2 years ago | | |

From [2]:

> We would definitely be happy to open-source the models for future research. Please stay tuned!

UncleOxidant 2 years ago | | |

link #2 appears to be broken.

rapatel0 2 years ago |

The mathematics of the BNNs are sound. The shannon entropy of a word is really small (I vaguely remember ~2 bits). Also all neural networks are ridiculously over provisioned.

I worked on this 7 years ago, trying to efficiently binarize CNNs from existing models. The difficulty was getting training running without the losses going too high. I think that vision models will be much more difficult to binarize, but you might not need to with clip if the vision encoder stays in regular math {fp16,int8}

az226 2 years ago | |

What about text to speech models? Do you think ternary will work?

rapatel0 2 years ago | | |

Just to be clear, it's all theoretically possible. There are already BNN versions of YOLO and other CNNs. No reason why transformers wouldn't work for that or audio. It just might be harder to get them to train well enough.

Speech to text, however, is super interesting. You just gave me an idea! I'm gonna go run some experiments :D

londons_explore 2 years ago |

Powers of 3 don't pack well into binary memory...

A 1 bit multiplier in silicon is a single logic gate, but a ternary decoder to decode a packed tri-state 'weight' is bigger.

I therefore suspect that this method will be extended to make all weights simple 1 or 0 (ie. Binary). Perhaps that will be done by having half the weights have 1 or 0 values, while the other half are -1 or 0.

tromp 2 years ago | |

5 trits fit into 1 byte pretty well, since 3^5 = 243 is just under 2^8 = 256.

That should be called an 8/5 = 1.6 bit model though, while the paper names it 1.58 bit, closer to log_2(3) ~ 1.5849625
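The 5-trits-per-byte packing can be written as a base-3 encoding. The sketch below uses made-up helper names `pack5` and `unpack5`, and shifts each trit from {-1, 0, 1} to {0, 1, 2} before encoding, which is one possible convention rather than anything prescribed by the paper.

```python
def pack5(trits):
    """Pack five values from {-1, 0, +1} into one integer in 0..242."""
    if len(trits) != 5:
        raise ValueError("expected exactly 5 trits")
    byte = 0
    for t in reversed(trits):        # trits[0] ends up as the least significant digit
        if t not in (-1, 0, 1):
            raise ValueError(f"not a trit: {t!r}")
        byte = byte * 3 + (t + 1)    # shift {-1, 0, 1} to base-3 digits {0, 1, 2}
    return byte

def unpack5(byte):
    """Inverse of pack5: recover the five trits from a value in 0..242."""
    trits = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)
    return trits

packed = pack5([1, -1, 0, 0, 1])
print(packed, unpack5(packed))
```

Only 243 of the 256 byte values are used, which is where the gap between 1.6 bits per weight and log_2(3) ≈ 1.585 comes from. The `divmod` chain in `unpack5` is the decoding cost that the surrounding comments are debating.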

londons_explore 2 years ago | | |

But the decoder for that will be 25+ gates, which is huge compared to the handful of gates to use the resulting weights.

JKCalhoun 2 years ago | | |

Would be nice to have hardware instructions that work on 5 trits natively.

baq 2 years ago | |

You can build dedicated silicon with ternary gates: https://medium.com/@rxseger/exploring-ternary-logic-tnand-an...

Not sure if it's more efficient than just binary digital circuits in highly integrated chip, though.

samatman 2 years ago | | |

It's optimal if your program is naturally ternary, which this one is. Using three signals, rather than ternary gates, is less effective, because you need much more precision to detect two different voltage levels rather than just up and down.

fasa99 2 years ago | |

I think it's the right chain of thought. You could either have 0/1 and then have additional nodes with negative activation functions, or -1/1

-1/1 is appealing to me (0 = -1) because bit hackery could be used instead of the multiplication function, presumably on integral or fixed-point representations. The goal would be to eliminate any "if/then" like "if 0 do this if 1 do that" to avoid the need for branch prediction - there are bit-hackery ways to bypass this. That would lend itself well to all existing processors, ASICs, FPGAs, GPUs, etc.
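One well-known form of this bit hackery is the XOR/popcount dot product used in binarized networks. A rough sketch, assuming each vector of -1/+1 values is stored as a bit mask with 0 standing for -1 (the helper names are hypothetical):

```python
def encode_signs(values):
    """Pack a list of +1/-1 values into an int: bit i is 1 for +1 and 0 for -1."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +/-1 vectors stored as bit masks.

    Matching bits contribute +1 and differing bits contribute -1,
    so the result is n - 2 * popcount(a XOR b), with no per-element branch.
    """
    differing = bin(a_bits ^ b_bits).count("1")
    return n - 2 * differing

a = [1, -1, 1, 1]
b = [-1, -1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))
assert binary_dot(encode_signs(a), encode_signs(b), len(a)) == reference
```

On real hardware the popcount would be a single instruction over a machine word rather than a string count; the Python version only shows the arithmetic.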

fabmilo 2 years ago | |

Can't you use 2 bits? The first bit for the sign and the second bit for 1 or 0, so you can represent -1, +1, +0, -0.

jdthedisciple 2 years ago |

People were already doing this 6 years ago.

    https://github.com/yashkant/quantized-nets
    https://github.com/TropComplique/trained-ternary-quantization
    https://github.com/buaabai/Ternary-Weights-Network

I too find it very interesting.

But why this sudden, renewed fuss?

imtringued 2 years ago | |

Probably because despite the 1200 citations, they didn't have the ability to apply it to modern LLMs. Nobody cares about an image classifier using 50% fewer parameters since most of them were small enough to fit in memory anyway.

gerash 2 years ago | |

I haven't read the paper but I clearly remember 1-bit quantization from at least 5-6 years ago

dindobre 2 years ago |

Refreshing paper in terms of machine learning papers, simple explanation, easy to replicate, no alchemy-tier interpretations. Can't wait to see this paper replicated or disproved when it comes to real-life production tasks.

wongarsu 2 years ago | |

The most glaring omission is that they only compared to fp16 models, not to quantized models. And of course the benchmarks might be misleading compared to the real experience.

But if you wanted to make LLM-specific hardware (or x64 instructions tuned for LLMs) this model architecture makes that extremely cheap. Multiplication requires a lot of transistors, this architecture requires only two-bit adders. You could make SIMD instructions that do thousands of these in parallel, for fairly little silicon cost.
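To make the "no multiplication" point concrete, here is a small sketch of a matrix-vector product with ternary weights. It assumes integer activations (the thread mentions 8-bit activations) and uses a made-up function name; it illustrates the idea, not the paper's kernel:

```python
def ternary_matvec(weights, x):
    """Compute W @ x for a ternary matrix W using only additions and subtractions."""
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract the activation
            # 0 weight: the activation is skipped entirely
        out.append(acc)
    return out

W = [[1, 0, -1],
     [-1, -1, 1]]
x = [5, -3, 2]  # e.g. quantized integer activations
result = ternary_matvec(W, x)
assert result == [sum(w * xi for w, xi in zip(row, x)) for row in W]
print(result)
```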

imjonse 2 years ago | |

The presentation is simplified because it implies knowledge of its predecessor, BitNet https://arxiv.org/abs/2310.11453

dindobre 2 years ago | | |

Makes sense!

stormfather 2 years ago |

How does backprop work here? I can't imagine flipping bits of everything upstream of an error is effective.

spyder 2 years ago | |

From the BitNet paper:

"Straight-through estimator. To train our 1-bit model, we employ the straight-through estimator (STE)[BLC13] to approximate the gradient during backpropagation. This method bypasses the nondifferentiable functions, such as the Sign (Eq. 2) and Clip (Eq. 5) functions, during the backward pass. STE allows gradients to flow through the network without being affected by these non-differentiable functions, making it possible to train our quantized model."

also the author's (@shumingma) answer in the comments: https://huggingface.co/papers/2402.17764#65df17ed4d436404cdc...

joelthelion 2 years ago | |

(haven't read the paper). Maybe you can flip bits with a probability distribution that depends on the gradient?

stormfather 2 years ago | | |

That's an interesting idea! Would love to try that on MNIST one day.

bilsbie 2 years ago |

This really just sounds absurd. How can ternary possibly encode enough information?

Anyone willing to explain it like I’m a Django developer who watched half a Karpathy video?

barbarr 2 years ago | |

The activations are still 8-bit, so a lot of complexity and nonlinearity is still expressible. Only the weights are 1.58-bit.

HanClinto 2 years ago | |

On its own, each trit doesn't encode much information at all. But it's not about information at the individual level -- it's more about the shape of the network.

I appreciated this comment [0] from earlier in the thread by paul_mk1:

> My best guess is that it is encouraging the network to choose good underlying subnetworks to solve the problem, similar to Lottery Ticket Hypothesis. With ternary weights it is just about who connects to who (ie a graph), and not about the individual weight values anymore.

For myself, I've done a lot of work with image hashing (such as pHash and dHash) -- and in those, you throw away a LOT of information, but simply by keeping the value of each region and tracking whether or not it's above or below the average (essentially, the sign), then it's astounding how robust those algorithms are. Because you don't look at the individual pixels of an image, but it's very good at capturing the impression of the overall _shape_ of the image.

It's less about each individual datum, and more about the shape of the network.

If you're not familiar with Lottery Ticket Hypothesis, that would be worth reading up on.

[0]: https://news.ycombinator.com/item?id=39544500

Solvency 2 years ago | |

Because by making the model larger you don't need 64-bit precision floats; you only need 64 discrete bits.

gemeral 2 years ago | | |

Do you mind pointing out where they make the model larger? The paper seems to suggest they are maintaining the same model sizes.

> Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption

naasking 2 years ago |

Interesting return to ternary. Effectively, each weight says only whether it's correlated (+1), uncorrelated (0), or anti-correlated (-1) with the input, and the structure of the network is the actual computation over that information.

eigenvalue 2 years ago |

Is it really so surprising that something like this works given how human brain neurons work? My admittedly basic understanding is that these operate through an all-or-nothing principle for their action potentials (firing): they either fire or they don't, based on whether the input signals reach a certain threshold. So the output is already sort of binary in biological neurons. The inputs are more like continuous values, since they are the sum of many different neurons sending signals into each neuron, but in this paper the activations are 8-bit, not binary/ternary. Can any neuroscientists here comment?

fasa99 2 years ago | |

Well I think it's an interesting idea, and to add to that, the "-1" values would correspond to an inhibitory neuron!

What neurons can do though is integrate over time, so your output can be one spike, or 3 spikes very quick, same for your input, and maybe 10 quick spikes in a row is a more powerful signal than a lone spike. We know this intuitively, though, via vision, we don't see in mac-classic style black/white images, we see shades of brightness and color, indicating that at least our optic nerve is sending what amounts to an analog signal (even if encoded as binary spikes - is the spike timing not analog?)

This is not to mention all the biochemical signaling that happens, and the multitude of local neurotransmitters and global physiological/hormonal factors at play. And all that weird stuff like glial cells and astrocytes is there in the mix too.

m00x 2 years ago | |

This isn't really how neurons work.

First of all, they operate independent of a synchronized clock, and they can also accumulate signals instead of executing on an input. Neuromorphic chips are closer to how the brain works, but they're still super early. I believe Intel has the best one with the Loihi 2.

(Not a neuroscientist but my wife is and that's what I understand from our chats)

joelthelion 2 years ago |

Assuming this is confirmed, what's the impact on training?

Inference is definitely an issue for LLMs right now. But if training were suddenly possible for lone hackers (or maybe smaller companies), it would open up a lot of new possibilities as well.

lucubratory 2 years ago | |

In theory it should make training a lot easier too, particularly on CPUs. But I think you'll still need reasonably expensive compute to get a model something close to the current big models, and you really can't ignore data. Data quality and quantity are both huge ingredients in model quality, at least as big as architecture. It's still non-trivial to get a good quality, large dataset, certainly out of the reach of lone hackers and most small companies.

nutate 2 years ago |

Triggered by the use of 1-bit to describe a trit.

sp332 2 years ago |

1-bit LLMs remind me of a random forum post I read about SACD and limitations of the 1-bit DSD audio format. https://www.audiosciencereview.com/forum/index.php?threads/d... Accumulating approximate values in one bit leads to being "constantly overloaded", with any error correction overwriting all of your real signal from the next step. I think this trinary system might leave enough room to avoid this problem.

smaddox 2 years ago |

Damn. Well, I guess I better hurry up and write and publish a paper on the Ternary Neural Network research that I've been doing (part-time) for the last several months, before it all gets scooped.

riskable 2 years ago | |

Modify your schedule, sure but do not rush it (just to beat the other folks). The first paper on any given topic may garner some 15 minutes of fame but the well-researched, boring paper is one oft-cited. Even if it isn't the first on its topic.

Be thorough and by golly, include some useful visuals! Even bad pictures and low-effort charts and graphs can vastly improve the grokability of a research paper.

Also, request assistance! Are you terrible at making charts and graphs? Ask someone to help you! For the low, low price of adding their name to the paper I'm 100% certain you can borrow an expert's time to add some dapper displays of useful information along with drastic wording and layout improvements.

The amount of papers in the wild that are just walls of jargon with completely useless, nearly-impossible-to-read charts and graphs is seemingly limitless.

Refreshing is the paper that a non-expert can read and understand! You don't have to ELI5 but well-written text and explanations are loved by all. The individual using it to gain actual knowledge will grok it from skimming and looking at the data anyway so you might as well take the time to explain some of the more complicated aspects like it's going to be read by a freshman STEM major (no need to go further back in education than that).

If you need help with grammar just paste a portion of your text into some LLM (even the small, locally-run models) and they usually do a pretty good job at finding and fixing such mistakes.

raghavtoshniwal 2 years ago |

Sooo, short Nvidia?

MadDemon 2 years ago | |

Depends if this results in more efficient models or simply larger, more capable models.

wongarsu 2 years ago | | |

In both cases this is a prime opportunity for anyone to disrupt Nvidia. They are in this market position in large part because both video games and neural networks do a lot of highly parallel floating point math, especially matrix multiplication. This model architecture doesn't do any of that.

Of course it should be fairly simple for Nvidia to add special silicon and instructions for two-bit addition to a future generation of their cards. But it'll take a while because they already have a roadmap and preexisting commitments. And any competitor doesn't have to copy everything Nvidia does to make floating point numbers go fast, they can just focus on making two-bit data handling and addition go fast.

etiam 2 years ago | |

Hardly for this reason, but it does look suspiciously high doesn't it.

sebzim4500 2 years ago | |

These still run on GPUs

londons_explore 2 years ago | | |

GPUs aren't yet awfully efficient at 1-bit math.

I could imagine FPGA designs might be competitive.

And dedicated ASICs would almost certainly beat both by a decent margin.

leroman 2 years ago | | |

- we have llama.cpp (could be enough or at least as mentioned in the paper a co-processor to accelerate the calc can be added, less need for large RAM / high end hardware)

- as most work is inference, we might not need as many GPUs

- consumer cards (24G) could possibly run the big models

the8472 2 years ago |

What does it mean for future hardware if it's not using floating point matrix multiplication units?

kromem 2 years ago | |

This opens the door to very exciting hardware shifts, like to optical computing, where there's already been over a decade of research on ternary optical computing and other parallel research at using optical computing for more efficient neural networks.

If this really holds up, it likely means we'll be moving to new dedicated hardware for AI compute much faster than when it was FP.

cyanydeez 2 years ago | |

https://stackoverflow.com/questions/45373679/why-is-it-faste...

gpderetta 2 years ago | | |

As per the answer, the reason float is faster than int is that a) hardware companies provide more float ALUs than integer ALUs and b) float FMA is a thing, while integer FMA isn't. Both are because currently most HPC-like loads use floats instead of integers, not because of intrinsic hardware reasons.

KeplerBoy 2 years ago | |

Expect Nvidia to advertise with their TOPS numbers instead of their FLOPS.

rfoo 2 years ago | | |

Already happened years ago. They advertised TOPS for int8/int4 [0], and with 50% sparsity [1].

[0] low-bit CNNs worked pretty well actually.

[1] Totally useless marketing snake oil.

elromulous 2 years ago |

So for the uninitiated (me), does this mean the input is not a float (i.e. is quantized on input), such that all the math can be done with int operations?

This seems almost too good to be true.

Edit: Answering my own question, yes. The details are in the original bitnet paper: https://arxiv.org/abs/2310.11453

ein0p 2 years ago |

How is it a 1-bit LLM if 2 bits are required for each weight (and one of the 4 possible states is wasted to be able to represent 0)?

ricardobeat 2 years ago | |

As someone else pointed out here, you can store 5 ternary values in 1 byte, 3^5 == 243.

ein0p 2 years ago | | |

That’s still not 1 bit, and that would basically destroy whatever perf advantage you might hope to get if you want to keep the model in memory in that format rather than unpack it on load.

Animats 2 years ago |

Well, that's 2 bits, but still...

LLMs have gone from 32-bit floating point numbers down to 16 and 8 bit values. Now 2 bits. It's a hint as to how evolution did it. The basic component is simple and has very wide tolerances. There are just a lot of them. That's something biology can evolve.

rafaelero 2 years ago |

Looks like we have finally rediscovered a biological neuron.

bilsbie 2 years ago | |

How so?

rafaelero 2 years ago | | |

They propagate information in a binary way (either they activate or not).

fl0ki 2 years ago |

Would there be value in distinguishing -0 and +0? If a 0 was quantized from a small negative or a small positive, it seems like retaining the sign is better than forgetting it.

The question remains whether the benefit and the simpler design are worth the loss of density.

transfire 2 years ago |

Shouldn’t that be “1-trit”?

QuesnayJr 2 years ago | |

They call it 1.58-bit in the paper. (1.58 is roughly the base 2 logarithm of 3.)

jmmcd 2 years ago | | |

So by “1-bit” they mean “less than 2 bits”. AI is an insufferable field at times like this.

bmacho 2 years ago | |

Read the pdf https://arxiv.org/pdf/2402.17764.pdf they call it 1-bit everywhere.

I don't know why they do this, 1-bit seems to be a very wrong name for {-1, 0, 1}.

edflsafoiewq 2 years ago | | |

I think 0 "doesn't count", since you don't have to add or subtract anything for it, just mask it out.

FrustratedMonky 2 years ago | | |

Yes Technically, but it is catchy for the masses. 1-bit seems to get the idea across, even if not technically describing {-1,0,1}.

BenoitEssiambre 2 years ago |

Low-bit parameters are always talked about in terms of performance benefits, but I wonder if allowing the LLM to combine parameters to represent values means it can select the resolution of each value, that is, use a kind of internal scientific notation to track the uncertainty of values. More low-bit parameters combined together means more precision and resolution; fewer can mean more uncertainty. This might allow the LLM to better calibrate the uncertainty of its knowledge in a Bayesian way, to prevent hallucinations from the overconfidence you get from overfitting on too many bits.

bilsbie 2 years ago |

How would you use this in something like PyTorch? There’s no ternary data type.

edflsafoiewq 2 years ago | |

Widen it to a datatype it does have, like int8.
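A rough sketch of what that widening could look like, in plain Python rather than PyTorch. It follows the absmean scheme as the BitNet b1.58 paper describes it (scale by the mean absolute weight, then round and clip to [-1, 1]); the function name is hypothetical, and `array('b')` stands in for an int8 tensor:

```python
from array import array

def absmean_ternarize(weights, eps=1e-8):
    """Scale weights by their mean absolute value, then round and clip to {-1, 0, 1}.

    Returns the ternary values stored as signed 8-bit integers plus the scale.
    Note: Python's round() rounds exact halves to even, which may differ
    slightly from the rounding used in a real implementation.
    """
    gamma = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return array("b", quantized), gamma  # 'b' = signed char, one byte per weight

q, gamma = absmean_ternarize([0.9, -0.05, -1.2, 0.3])
print(list(q), q.itemsize, gamma)
```

Storing one trit per byte wastes most of the 8 bits, so this is a convenience format for computation, not a compact storage format.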

modeless 2 years ago |

Maybe a silly question but nonlinearity is important for neural nets. Wouldn't it make more sense for the three values to be e.g. (2, 0, -1) so they are not colinear?

Also, what are the prospects for FPGA implementations of this?

hoseja 2 years ago |

Balanced ternary, my beloved.

Avisite 2 years ago |

Does quantization need to be all-or-nothing? With the kind of low-bit models we have seen, my assumption would be that only certain weights would benefit from the extra precision. A mixture of precision with 2-bit, 3-bit, to 8-bit weights might perform well, but I am unsure if any training process could identify the weights that need the extra precision.

kromem 2 years ago | |

Given the weights are just mapping to a virtual network structure anyways, my guess would be that as parameter sizes increase any difference node precision might have will evaporate when trained from the ground up.

So moving to extremely high efficiency native ternary hardware like with optics is going to be a much better result than trying to mix precision in classical hardware.

We'll see, but this is one of those things that I wouldn't have expected to be true but as soon as I see that it is it kind of makes sense. If it holds up (and it probably will) it's going to kick off a hardware revolution in AI.

anon291 2 years ago |

This is something that's been tried many times before. 1-bit to 2-bit models and binary NNs have a long history.

ryeguy_24 2 years ago |

How does gradient descent work with these discrete ternary parameters? If you compute the partial differential for a parameter, how do you determine what to nudge the parameter when updating on back propagation? Do you only update if the "nudging amount" meets a threshold?

edflsafoiewq 2 years ago | |

> While the weights and the activations are quantized to low precision, the gradients and the optimizer states are stored in high precision to ensure training stability and accuracy. Following the previous work [ LSL+21 ], we maintain a latent weight in a high-precision format for the learnable parameters to accumulate the parameter updates. The latent weights are binarized on the fly during the forward pass and never used for the inference process.
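The latent-weight idea in that quote can be sketched in a few lines. This is a toy illustration, not the paper's training code: it assumes a single linear unit, a squared-error loss, and a fixed 0.5 threshold for the ternary projection (the paper uses a scaled rounding instead), and all names are invented for the example.

```python
def ternarize(w, threshold=0.5):
    """Project one latent float weight to {-1, 0, 1} (fixed threshold: a simplification)."""
    if w > threshold:
        return 1
    if w < -threshold:
        return -1
    return 0

def train_ste(latent, data, lr=0.1, epochs=10):
    """Fit y = q(w) . x with squared loss, updating the float latent weights."""
    latent = list(latent)
    for _ in range(epochs):
        for x, target in data:
            q = [ternarize(w) for w in latent]        # forward pass sees only ternary weights
            y = sum(qi * xi for qi, xi in zip(q, x))
            err = y - target
            # Straight-through estimator: pretend d q(w) / d w = 1,
            # so the gradient for latent weight i is 2 * err * x_i.
            for i, xi in enumerate(x):
                latent[i] -= lr * 2.0 * err * xi
    return latent

# Toy data generated by the ternary weights [1, 0, -1]
data = [([1, 0, 0], 1), ([0, 1, 0], 0), ([0, 0, 1], -1)]
latent = train_ste([0.0, 0.8, 0.2], data)
print(latent, [ternarize(w) for w in latent])
```

Small gradient steps accumulate in the float copy until a weight crosses the threshold, at which point its ternary value flips. No "nudging threshold" is applied to the update itself.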

jcarrano 2 years ago |

Strictly speaking it should say "1-trit LLM", or, as they later mention 1.58 bit.

karmasimida 2 years ago |

This is exciting news. If the 8B numbers are true, can we already use a model like Mixtral 8x7B, even with a single GPU?

But further into the development, we need comparison to large model sizes. 70B might be too much to ask, but 13B should be there at least.

cjbprime 2 years ago | |

You could already run Mixtral on the more expensive single consumer GPUs (with 24GB VRAM) before this paper, at e.g. 3-bits per weight.
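The back-of-the-envelope arithmetic behind claims like this, as a small sketch. It counts weight storage only (no KV cache, activations, or quantization metadata), and the figure of roughly 46.7 billion total parameters for Mixtral 8x7B is an assumption supplied here, not something stated in the thread:

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

MIXTRAL_PARAMS = 46.7e9  # assumed total parameter count, for illustration only

for bits in (16, 8, 3, 1.58):
    print(f"{bits:>5} bits/weight: {weight_memory_gb(MIXTRAL_PARAMS, bits):6.1f} GB")
```

Under these assumptions 3 bits per weight lands under 24 GB, consistent with the comment above, and 1.58 bits per weight would roughly halve that again.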

elijahbenizzy 2 years ago |

There's an interesting mental model I've been toying with. At what point do LLMs just become circuit-shaped NNs with stochastic gradient descent backing them?

E.G. are we just determining the best program by rearranging 1s and 0s?

nborwankar 2 years ago |

“Integer arithmetic is all you need” ? NVIDIA stock arrow up or down?

hatthew 2 years ago | |

if true, nvidia number go down

farhanhubble 2 years ago |

What's the benefit of using ternary encoding over just a binary representation? And if we have come so far is there potential for a more efficient algorithm than gradient descent?

TriangleEdge 2 years ago |

How do you train these? Or is it only for already trained models?

simonvc 2 years ago |

The paper talks about LLMs a lot, but would this result hold for all Transformers? Are Ternary Transformers going to make things like Whisper faster/better?

bilsbie 2 years ago |

Could there be some value in recognizing areas where the model needs finer grained weights and somehow using a different data type just in certain areas?

fabiospampinato 2 years ago | |

It seems tough to do, besides I'm not sure what the benefit would be, with that you can't do the optimized matrix multiplication anymore, and if you need more precision presumably you can just add more neurons and/or train for longer and/or with better data.

Blackthorn 2 years ago |

Is there any rigorous way to answer the question of how much information (be it entropy or some other measurement) is contained in a model's weights?

riskable 2 years ago | |

Yes, actually: That's the entire point of the paper! The concept is that the amount of information contained in a weight like 0.00006103515625 is equivalent to 0. -0.99951172 is equivalent to -1, 1.26406236 equivalent to 1, etc. That there's no practical difference when actually utilizing the model (if trained in ternary from the start).

The paper posits (and provides evidence) that if you train a model using ternary values instead of floating point values you get equivalent (useful/practical) information. You can't take an existing model and round all the values down to `{-1,0,+1}` values but you can (re)train a model using ternary values to get the same end result (equivalent information/output).

Technically a model trained using FP16 values contains vastly more information than a model trained using ternary values. Practically though it seems to make no difference.

My prediction: Floating point models will still be used extensively by scientists and academics in their AI research but nearly all real-world, publicly-distributed AI models will be ternary. It's just too practical and enticing! Even if the ternary representation of a model is only 90% effective it's going to be so much faster and cheaper to use it in reality. We're talking about the difference between requiring a $500 GPU or a $5 microcontroller.

Blackthorn 2 years ago | | |

I don't think you really answered my question. What's been done by the paper is show experimentally that networks don't have enough information to justify their weight precision, and that's really good and a very important result, but what I was asking was if there's a rigorous way to take an arbitrary network and determine its information content (either by itself, or compared to another network). Possibly that can be relative to its outputs.
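One crude starting point, offered only as a sketch and not as the rigorous measure being asked for: the empirical Shannon entropy of the weight values. It treats every weight as an independent draw, so it ignores all structure between weights and says nothing about information relative to the outputs.

```python
import math
from collections import Counter

def empirical_entropy_bits(weights):
    """Shannon entropy, in bits per weight, of the observed value frequencies."""
    counts = Counter(weights)
    total = len(weights)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

uniform = [-1, 0, 1] * 100            # all three values equally likely
sparse = [0] * 280 + [1] * 10 + [-1] * 10  # mostly zeros

print(empirical_entropy_bits(uniform))  # at most log2(3) for ternary weights
print(empirical_entropy_bits(sparse))   # lower when the distribution is skewed
```

For ternary weights this number can never exceed log2(3) ≈ 1.58 bits, which is where the paper's name comes from.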

kouru225 2 years ago |

Ok can someone catch me up to speed on LLM hardware requirements? Last I looked I needed a 20 gb vram card to run a good one. Is that not true anymore?

SushiHippie 2 years ago | |

Not true anymore, but it also highly depends on what your definition of "a good one" is.

Many people find Mistral 7B to be excellent, around gpt-3.5 level of good.

Mistral 7B normally requires like 20gb VRAM, but with llama.cpp and quantization, you could even run it on your phone (albeit bad quality).

Quantization >= q4_K_M seem to provide nearly as good responses as the unquantized model, and q4_K_M only needs ~7GB of VRAM.

See the table here:

https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGU...

Using ollama you can get up and running even a bit faster than with llama.cpp directly (ollama uses llama.cpp under the hood).

kouru225 2 years ago | | |

Oh Jesus so basically it’s very feasible for me to run my own local llm on a NAS or a server or something… well I guess it’s time for me to get on with the times…

Thanks!

llm_trw 2 years ago |

So are there any details on the algorithms they used for backprop? I'm not seeing any in the paper other than "we used a lot of tokens".

wongarsu 2 years ago | |

It's a fairly straightforward modification of BitNet, so I assume this quote from the BitNet paper applies:

To train our 1-bit model, we employ the straight-through estimator (STE)[BLC13 ] to approximate the gradient during backpropagation. This method bypasses the non-differentiable functions, such as the Sign (Eq. 2) and Clip (Eq. 5) functions, during the backward pass. STE allows gradients to flow through the network without being affected by these non-differentiable functions, making it possible to train our quantized model

IanCal 2 years ago | |

Does this help? https://arxiv.org/abs/2310.11453

It seems to have more details (it's the paper before the linked one) about the actual training, but I'm scanning it and this isn't my field so maybe it's too light also.

llm_trw 2 years ago | | |

Not really, that's for the binary version of the algorithm; the ternary version can propagate a lot more information in the backward pass using the fact that the outputs are either -1, 0, or 1.

But I imagine they are using the same thing since a bunch of the authors are the same.

superdisk 2 years ago |

Is there anything about this specific to LLMs, or could you use it for any transformer based model? It seems like they made a modified transformer.

kromem 2 years ago | |

It seems like it could be any transformer, which is exciting now that even in imaging gradient transformers are all the rage. But ideally we'd need to see this result in other transformers (but I have a hard time seeing why it wouldn't be the case).

riskable 2 years ago | | |

At the very least it could be used to reduce the requirements and speed up the prompt recognition step(s) of image-based generative AI.

"Stable Diffusion 3 XS" will use ternary? Here's to hoping :)

Mizza 2 years ago |

I hope somebody gives this team access to the good data and a lot of crunch, I'd love to see what happens when you train the big fella.

wenyuanyu 2 years ago |

If this turns out to be true. It could indeed be a game changer... Given the advanced AI chip shortage... Also, for the chip ban on China...

rossjudson 2 years ago |

I predict Daniel Lemire will build the most efficient training and inferencing systems, close to theoretical performance limits.

lavp 2 years ago |

What does “perform slightly better than Llama” mean exactly? A model like this needs to be trained from scratch right?

dr_dshiv 2 years ago |

Wondering if this might have any impact on the use of quantum computers in LLM training/distillation…

brunooliv 2 years ago |

Do the implications at a practical level mean that the size of gguf files will become smaller?

Havoc 2 years ago |

If true then I'm guessing this would make ASICs for this far more simple too, right?

K0IN 2 years ago |

when can we expect the first ~100+ million parameter models to run on raspberry pi Pico?

Alifatisk 2 years ago |

If this paper (especially the results on Table 4) is true, then this is a game changer!

checker659 2 years ago |

If all the weights are either 1, 0 or -1, isn't this what biological neurons do?

nathan_compton 2 years ago | |

Not even remotely. I suppose you could kind of say that activations are boolean in the sense that neurons emit spikes, but arguably significant information is encoded in spike timing.

yieldcrv 2 years ago |

This is great, my employer just gave me an M1 laptop with only 16GB RAM and I had to downgrade my 7B parameter local LLMs to 3-bit quantization; they’ve been surprisingly okay!

On my personal machine with 64GB RAM, I usually use 8x7B at Q5 or 70B at Q4.

It's Mistral all the way down! Imagining Q1.58 that’s doing well makes me happy.

woadwarrior01 2 years ago | |

You can run 4 bit quantized versions of SOLAR-10.7B and Llama 2 13B based models quite well on 16GB M1 laptops.

turnsout 2 years ago | |

Quantized 7B LLMs should work fine on your machine, though maybe you’re talking about speed?

yieldcrv 2 years ago | | |

7B works fine

FergusArgyll 2 years ago | |

You shouldn't have to quantize it that much, maybe you're running a lot of other programs while running inference?

Also, try using pure llama.cpp, AFAIK it's the least possible overhead

regularfry 2 years ago | | |

Getting more value out of phi-2-sized models is where you really want to be on lower-end M1's.

yousif_123123 2 years ago |

Any models published as well?

jonbaer 2 years ago | |

I really can't tell but it seems to be a continuation of this work if I read the To-Dos correctly, what do you think? Here it seems to be 1-bit on just the transformer, https://huggingface.co/shi3z/BitNetWikipedia110M

1ba9115454 2 years ago |

A ternary is all you need.

singularity2001 2 years ago |

So we almost go back full circle to human (animal) brain binary spikes?

concrete_head 2 years ago | |

It's not quite spikes but getting closer to the idea. I'm amazed it has taken this long for this type of thing to reach HN, which gives next to no attention to spiking neural networks.

Simon Thorpe, a CNRS researcher, has got some fascinating papers and lectures on YouTube on using binary weights on neuromorphic hardware, which has had practical applications for over 20 years already.

I made an account just to drop his name somewhere on this forum.

singularity2001 2 years ago | | |

why is his name so dangerous you can't drop it on your main account lel?

klysm 2 years ago |

Does this mean we can compile LLMs to run on FPGAs directly?

loa_in_ 2 years ago | |

I don't know if ternary gate arrays are a thing, but if so then yes.

klysm 2 years ago | | |

True, ternary gates probably don't exist, but two gates gets you there and hardware is very fast.

m3kw9 2 years ago |

How much of a waste is using NVidia hardware for this?

leroman 2 years ago |

Can someone versed in the ways of math explain how this is different from previous quantization methods?

And specifically, seeing how going from fp16 to 8-bit mostly gives the same perplexity while anything further seems to lose quality / dumb down the model, how is this even less precise method able to achieve this?

TheCoreh 2 years ago | |

If I understand it correctly, this seems to be more than just quantizing, the models are apparently trained in this format as well. So it's possible that the many layers adjust themselves in a way that "cancels out" the inaccuracies of the lower bit count

kromem 2 years ago | |

So modern NNs aren't really using the network nodes in the structure they physically are, but essentially build a virtual neural network using combinations of nodes (which is how you can model hundreds of parameters in only a dozen or so nodes).

So as the number of nodes scales up, the individual precision probably matters less and less. Which is what they found here - it reaches parity at 3B and then starts exceeding performance at larger sizes, up to the 2T tested.

Seemingly when trained from scratch the virtual network can find adequate precision from ternary physical nodes where needed. This is different from the information loss as an already trained floating point network has its weights quantized to smaller precision and sees a performance loss.

Not only is this approach more efficient, it seems to perform better too at larger network sizes, which is probably the most interesting part.

IanCal 2 years ago | |

It's not quantising existing models, they're training new ones.

leroman 2 years ago | | |

I understand this part but it seemed that the 16->8->4 etc is similar to compression of the "net" and seemed to lower quality below 8.

wenyuanyu 2 years ago |

I wonder how the training process works...

arunk47 2 years ago |

Okay wait, can I train my own llm yet?