Be thorough and by golly, include some useful visuals! Even bad pictures and low-effort charts and graphs can vastly improve the grokability of a research paper.
Also, request assistance! Are you terrible at making charts and graphs? Ask someone to help you! For the low, low price of adding their name to the paper I'm 100% certain you can borrow an expert's time to add some dapper displays of useful information along with drastic wording and layout improvements.
The number of papers in the wild that are just walls of jargon with completely useless, nearly-impossible-to-read charts and graphs is seemingly limitless.
Refreshing is the paper that a non-expert can read and understand! You don't have to ELI5 but well-written text and explanations are loved by all. The individual using it to gain actual knowledge will grok it from skimming and looking at the data anyway so you might as well take the time to explain some of the more complicated aspects like it's going to be read by a freshman STEM major (no need to go further back in education than that).
If you need help with grammar just paste a portion of your text into some LLM (even the small, locally-run models) and they usually do a pretty good job at finding and fixing such mistakes.
Of course it should be fairly simple for Nvidia to add special silicon and instructions for two-bit addition to a future generation of their cards. But it'll take a while because they already have a roadmap and preexisting commitments. And any competitor doesn't have to copy everything Nvidia does to make floating point numbers go fast, they can just focus on making two-bit data handling and addition go fast.
I could imagine FPGA designs might be competitive.
And dedicated ASICs would almost certainly beat both by a decent margin.
- as most work is inference, we might not need as many GPUs
- consumer cards (24G) could possibly run the big models
If this really holds up, it likely means we'll be moving to new dedicated hardware for AI compute much faster than when it was FP.
[0] low-bit CNNs worked pretty well actually.
[1] Totally useless marketing snake oil.
This seems almost too good to be true.
Edit: Answering my own question, yes. The details are in the original bitnet paper: https://arxiv.org/abs/2310.11453
LLMs have gone from 32-bit floating point numbers down to 16 and 8 bit values. Now 2 bits. It's a hint as to how evolution did it. The basic component is simple and has very wide tolerances. There are just a lot of them. That's something biology can evolve.
The question remains whether the benefit and the simpler design are worth the loss of density.
I don't know why they do this; 1-bit seems to be a very wrong name for {-1, 0, 1}.
Also, what are the prospects for FPGA implementations of this?
So moving to extremely high efficiency native ternary hardware like with optics is going to be a much better result than trying to mix precision in classical hardware.
We'll see, but this is one of those things that I wouldn't have expected to be true but as soon as I see that it is it kind of makes sense. If it holds up (and it probably will) it's going to kick off a hardware revolution in AI.
But further into the development, we need comparison to large model sizes. 70B might be too much to ask, but 13B should be there at least.
E.G. are we just determining the best program by rearranging 1s and 0s?
The paper posits (and provides evidence) that if you train a model using ternary values instead of floating point values you get equivalent (useful/practical) information. You can't take an existing model and round all the values down to `{-1,0,+1}` values but you can (re)train a model using ternary values to get the same end result (equivalent information/output).
Technically a model trained using FP16 values contains vastly more information than a model trained using ternary values. Practically though it seems to make no difference.
My prediction: Floating point models will still be used extensively by scientists and academics in their AI research but nearly all real-world, publicly-distributed AI models will be ternary. It's just too practical and enticing! Even if the ternary representation of a model is only 90% effective it's going to be so much faster and cheaper to use it in reality. We're talking about the difference between requiring a $500 GPU or a $5 microcontroller.
Many people find Mistral 7B to be excellent, around gpt-3.5 level of good.
Mistral 7B normally requires like 20gb VRAM, but with llama.cpp and quantization, you could even run it on your phone (albeit bad quality).
Quantization >= q4_K_M seem to provide nearly as good responses as the unquantized model, and q4_K_M only needs ~7GB of VRAM.
See the table here:
https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGU...
Using ollama you can get up and running even a bit faster than with llama.cpp directly (ollama uses llama.cpp under the hood).
Thanks!
To train our 1-bit model, we employ the straight-through estimator (STE)[BLC13] to approximate the gradient during backpropagation. This method bypasses the non-differentiable functions, such as the Sign (Eq. 2) and Clip (Eq. 5) functions, during the backward pass. STE allows gradients to flow through the network without being affected by these non-differentiable functions, making it possible to train our quantized model.
It seems to have more details (it's the paper before the linked one) about the actual training, but I'm scanning it and this isn't my field so maybe it's too light also.
But I imagine they are using the same thing since a bunch of the authors are the same.
"Stable Diffusion 3 XS" will use ternary? Here's to hoping :)
On my personal machine with 64 GB of RAM, I usually use 8x7B at Q5 or 70B at Q4
It's Mistral all the way down! Imagining a Q1.58 that's doing well makes me happy
Also, try using pure llama.cpp, AFAIK it's the least possible overhead
Simon Thorpe, a CNRS researcher has got some fascinating papers and lectures on YouTube on using binary weights on neuromorphic hardware which has had practical applications for over 20 years already.
I made an account just to drop his name somewhere on this forum.
And specifically, seeing how going from fp16 to 8-bit mostly gives the same perplexity while anything further seems to lose quality / dumb down the model, how is this even less precise method able to achieve this?
So as the number of nodes scales up, the individual precision probably matters less and less. Which is what they found here - it reaches parity at 3B and then starts exceeding performance at larger sizes, up to the 2T tested.
Seemingly when trained from scratch the virtual network can find adequate precision from ternary physical nodes where needed. This is different from the information loss as an already trained floating point network has its weights quantized to smaller precision and sees a performance loss.
Not only is this approach more efficient, it seems to perform better too at larger network sizes, which is probably the most interesting part.
If this paper holds, I'd expect that's where custom accelerators will be heading.
edit: also this might be implementable purely using bitwise vector operations. Would need to check the throughput of those.
The main reason why we run this stuff on GPUs is their memory bandwidth, anyway.
Total = 1085 gates. The reality is probably far more, because you're going to want to use carry-look-ahead and pipelining.
Whereas 1-bit multiplies and adds into, say, a 16-bit accumulator use... 16 gates! (and probably half that, since you can probably use scheduling tricks to skip past the zeros, at the expense of variable latency...)
So when 1 bit math uses only 1/100th of the silicon area of 16 bit math, and according to this paper gets the same results, the future is clearly silicon that can do 1 bit math.
In my time lurking I've noticed that the community here basically focuses solely on the von Neumann architecture. If anyone is interested in delving into the world of spikes he has some interesting ideas and good material available.
* In existing LLMs, we can replace all parameter floating-point values representing real numbers with ternary values representing (-1, 0, 1).
* In matrix multiplications (e.g., weights by vectors), we can replace elementwise products in each dot product (a₁b₁ + a₂b₂ ...) with elementwise additions (a₁+b₁ + a₂+b₂ ...), in which signs depend on each value. See the paper for exact details.
On existing hardware, the gains in compute and memory efficiency are significant, without performance degradation (as tested by the authors).
If the proposed methods are implemented in hardware, we will see even greater gains in compute and memory efficiency.
Wow.
Authors seemed to have missed the history. They should at least cite Binary Connect or Straight Through Estimators (not my work).
Helpful hint to authors: you can get down to 0.68 bits / weight using a similar technique, good chance this will work for LLMs too.
https://arxiv.org/abs/1606.01981
This was a passion project of mine in my last few months at IBM research :).
I am convinced there is a deep connection to understanding why backprop is unreasonably effective, and the result that you can train low precision DNNs; for those not familiar, the technique is to compute the loss with respect to the low precision parameters (e.g. projected to ternary) but apply the gradient to a high precision copy of the parameters (known as the straight through estimator). This is a biased estimator and there is no theoretical underpinning for why this should work, but in practice it works well.
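For anyone who wants to see the mechanics, here is a minimal sketch of that technique in plain Python, assuming a toy one-layer linear model with squared-error loss. The absmean rounding in `ternarize` follows my reading of the BitNet b1.58 paper; the function names, learning rate, and numbers are made up for illustration, and this is not the authors' training code.

```python
def ternarize(weights, eps=1e-8):
    """Project real-valued weights to {-1, 0, +1}: scale by mean |w|, round, clip."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    return [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]


def ste_step(latent, x, target, lr=0.05):
    """One SGD step: forward pass with ternary weights, update the high-precision copy."""
    q = ternarize(latent)
    pred = sum(qi * xi for qi, xi in zip(q, x))  # forward pass sees only ternary weights
    err = pred - target                          # derivative of 0.5 * err**2 w.r.t. pred
    # Straight-through estimator: pretend d(q)/d(latent) == 1, so the gradient
    # computed for the ternary weights is applied unchanged to the latent weights.
    grads = [err * xi for xi in x]
    return [w - lr * g for w, g in zip(latent, grads)]


latent = [0.9, -0.05, -1.2]          # hypothetical high-precision weights
print(ternarize(latent))             # prints [1, 0, -1]
latent = ste_step(latent, x=[1.0, 0.0, 0.0], target=0.0)
```

The latent weights drift in small steps while the ternary projection only changes when a latent value crosses a rounding threshold, which is where the bias in the estimator comes from.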
My best guess is that it is encouraging the network to choose good underlying subnetworks to solve the problem, similar to Lottery Ticket Hypothesis. With ternary weights it is just about who connects to who (ie a graph), and not about the individual weight values anymore.
> My best guess is that it is encouraging the network to choose good underlying subnetworks to solve the problem, similar to Lottery Ticket Hypothesis. With ternary weights it is just about who connects to who (ie a graph), and not about the individual weight values anymore.
Your guess sounds and feels right to me, even if currently there's no way to express it formally, with the rigor it deserves.
Thank you again for your comment!
So what are the most important insights in this paper compared to what was previously done?
I assume there’s more context to the story and it’s not just that no one thought to apply the concepts to LLM’s until now?
My priors are like this:
1. Initial training of a neural network moves all weights around a large amount at first.
2. Later training of the network adjusts them a small amount.
3. An undertrained network will therefore look a lot like figuring out "positive, negative, or 0?" for each node during early training.
If all these things are true, then
1. Early training of an fp16 network and a bitnet with 0 added will be roughly similar in results
2. Later training will yield different / worse results, as the network gets into the 'fine tuning' part of the training.
I think the paper's stats back these priors up -- they say "this works on (3B+) large networks, but not small ones." They then imply there's something about the structure of a large network that allows a bitnet to do well. It seems more likely to me it works on large networks because they have not put the compute into 3B+ networks to get past the 'gross tuning' phase.
The networks they have compute to put in to get them 'fully' trained -- those networks don't show the results.
Also, a quick reminder that Perplexity 12 is really terrible. You would not want to use such a network. Hopefully I'm wrong and we can get something for free here! But, I'm cautious - to - skeptical.
The 3B model had a perplexity of 9.91, less than LLaMa 1 in fp16.
There is a mathematical proof that binary representation is enough to capture the latent space. And in fact we don't even need to do "training" to get that representation.
The practical application we tried out for this algorithm was to create an alternate space for mpnet embeddings of Wikipedia paragraphs. Using Bit embedding we are able to represent 36 million passages of Wikipedia in 2GB. (https://gpt3experiments.substack.com/p/building-a-vector-dat...)
> Who moderates Hacker News?
First result:
> Hacker News
> At the end of March 2014, Graham stepped away from his leadership role at Y Combinator, leaving Hacker News administration in the hands of other staff members. The site is currently moderated by Daniel Gackle who posts under the username "dang".
Why is this so shocking? Quantization has been widely explored, driving that to its extreme (and blowing up parameter count to make up for it) just seems like a natural extension of that.
Easier said than done, of course, and very impressive that they pulled it off.
> In matrix multiplications (e.g., weights by vectors), we can replace elementwise products in each dot product (a₁b₁ + a₂b₂ ...) with elementwise additions (a₁+b₁ + a₂+b₂ ...), in which signs depend on each value
I feel like this follows naturally from having only ternary values, multiplication doesn't really bring much to the table here. It's a bit surprising that it's performing so well on existing hardware, usually multiplication hardware sees more optimization, especially for GPGPU hardware.
I find it shocking that we don't even need lower floating-point precision. We don't need precision at all. We only need three symbols to represent every value.
> I feel like this follows naturally from having only ternary values, multiplication doesn't really bring much to the table here. It's a bit surprising that it's performing so well on existing hardware, usually multiplication hardware sees more optimization, especially for GPGPU hardware.
I find it shocking. Consider that associative addition over ternary digits, or trits, represented by three symbols (a,b,c) has only three possible input pairs, (a,b), (a,c), or (b,c) (within each pair, order doesn't matter), and only three possible outputs, a, b, or c. Matrix multiplications could be executed via crazy-cheap tritwise operations in hardware. Maybe ternary hardware[a] will become a thing in AI?
---
based on (an admittedly rapid and indulgent reading of the paper), it seems like they're not increasing the parameter size. Do you mind pointing out where the blowup is occurring?
No, unless I'm mistaken it's a huge impact: it means the matrix product is separable: basically, it's an O(n²) algorithm, and not O(n³): add together all the c_j = sum(a_i_j), d_i = sum(b_i_j), and the final results are all the combinations of c_j + d_i. And even then, half of that is unnecessary because the d_i can all be pre-computed before inference since they are weights.
But I skimmed over the paper, and didn't find the part where it was explained how they replace the product by additions: from what I understand, they replace multiplications by b_i by selecting +a_i, 0, or -a_i. So the final matrix multiplication can be implemented with only additions, but only because the weights are 1, 0, -1; that is how they avoid multiplications altogether. This is really different from what the GP said (replacing a0*b0+... by a0+b0+...).
Like what would be the expected factor of this blow up to make up the difference between ternary and whatever 16 bits encoding they were using?
I mean intuitively I'd expect to need ~10× the symbols to encode the same information? Are they using an order of magnitude more parameters, or is that not how it works?
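A quick back-of-the-envelope check of that ~10× intuition, as a sketch: a ternary weight can carry at most log2(3) bits, so matching the raw bit count of one fp16 weight takes about ten of them. This counts storage capacity only, which is my assumption about what "same information" means here; it says nothing about how many of those 16 bits a trained network actually uses.

```python
import math

bits_per_trit = math.log2(3)          # information capacity of one ternary symbol
trits_per_fp16 = 16 / bits_per_trit   # ternary symbols needed to match 16 raw bits

print(f"{bits_per_trit:.3f} bits per ternary weight")
print(f"{trits_per_fp16:.2f} ternary weights per fp16 weight")
```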
"The era of 1-bit LLMs"
Representing { -1, 0, 1 } can't be done with 1-bit, I'm sorry -- and sad, please let's all get back to something vaguely sound and rigorous.
(I'll let myself out)
Something rigorous would be to actually read the paper rather than stop at the first part of its title. The authors are not claiming their LLM is 1-bit.
Did they actually show absence of performance degradation?
I think it's conspicuous that Table 1 and Table 2 in the paper, which show perplexity and accuracy results respectively, are only for small model sizes, whereas Figure 2, Figure 3 (latency, memory, energy consumption) and Table 3 (throughput) all show larger model sizes. So it seems like they had every opportunity to show the perplexity/accuracy comparisons at the larger model sizes, but did not include them.
I must concur, "wow".
If you are doing ternary calculations on 32/16-bit hardware, then the additions would be simpler.
For example, most of the plots in the paper are actually of throughput, memory, etc. all performance characteristics that are better on the ternary version. Which, of course.
The only things that contain perplexities are Tables 1 and 2. There, they compare "BitNet b1.58 to our reproduced FP16 LLaMA LLM in various sizes" on the RedPajama data set. The first thing to note is that the perplexities are very high: they're all at least ~9.9, compared for example with quantized Llama on wikitext-2, which is 6.15 (https://www.xzh.me/2023/09/a-perplexity-benchmark-of-llamacp...). Maybe RedPajama is a lot harder than wikitext-2, but that's a big gap.
I think probably their benchmark (their "reproduced FP16 LLaMA LLM") is just not very good. They didn't invest much in training their baseline and so they handily beat it.
Thinking out loud here. If you encode 64 weights in 2 64-bit words you can have the bits in one word indicating +1 if they're 1, and the bits in the other word indicating -1 if they are 1. You should be able to do the "products" with a few boolean operations on these 2 words to get a pair of 64 bit words for the result. Then summing becomes a matter of using a count-of-1's instruction on each word and subtracting the "negative" count from the positive. If AVX instructions can do this too, it seems like equivalent of 10-100 TOPS might be possible on a multi-core CPU.
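A rough sketch of that two-word encoding in Python, assuming both vectors are ternary (that is the setup in this comment, not necessarily what the paper does with activations). Python integers stand in for the 64-bit words and `bin(x).count("1")` stands in for a hardware count-of-1s instruction; the function names are mine.

```python
def encode(trits):
    """Pack values in {-1, 0, +1} into a (plus_mask, minus_mask) pair of integers."""
    plus = minus = 0
    for i, t in enumerate(trits):
        if t == 1:
            plus |= 1 << i
        elif t == -1:
            minus |= 1 << i
    return plus, minus


def popcount(x):
    """Count set bits; a stand-in for a POPCNT-style instruction."""
    return bin(x).count("1")


def ternary_dot_bitwise(a, b):
    """Dot product of two mask-encoded ternary vectors using AND, OR and popcount."""
    a_plus, a_minus = a
    b_plus, b_minus = b
    positive = (a_plus & b_plus) | (a_minus & b_minus)  # positions where the product is +1
    negative = (a_plus & b_minus) | (a_minus & b_plus)  # positions where the product is -1
    return popcount(positive) - popcount(negative)


a = [1, -1, 0, 1, -1, 0, 1, 1]
b = [1, 1, -1, 0, -1, 0, -1, 1]
print(ternary_dot_bitwise(encode(a), encode(b)))  # prints 1
```

Four ANDs, two ORs, two popcounts and a subtraction cover every lane of the word at once, whatever the word width is.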
Perhaps the rest of the JL lemma promise applies as well - compressing the number of parameters by a few orders of magnitude as well.
Aren’t you over complicating it a bit here? A dot product between a vector of activations (a₁, a₂, …) and a vector of ternary weights (b₁, b₂, …) can of course be computed as the sum of all activations for which the weight is 1, minus the sum of all activations for which the weight is -1.
It can’t however be computed as (a₁+b₁ + a₂+b₂ ...). You must have gotten that wrong.
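To make the distinction concrete, here is a small sketch of the add/subtract form of the dot product next to the mistaken "sum of a+b" reading. The function name and the numbers are just for illustration.

```python
def ternary_dot_addsub(activations, weights):
    """Dot product with weights in {-1, 0, +1}, using only additions and subtractions."""
    total = 0.0
    for a, w in zip(activations, weights):
        if w == 1:
            total += a
        elif w == -1:
            total -= a
        # w == 0 contributes nothing, so that activation is skipped entirely
    return total


activations = [0.5, -2.0, 3.0, 1.5]
weights = [1, 0, -1, 1]

print(ternary_dot_addsub(activations, weights))          # prints -1.0
print(sum(a * w for a, w in zip(activations, weights)))  # prints -1.0, the usual dot product
print(sum(a + w for a, w in zip(activations, weights)))  # prints 4.0, not a dot product
```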
> BitNet b1.58 is enabling a new scaling law with respect to model performance and inference cost. As a reference, we can have the following equivalence between different model sizes in 1.58-bit and 16-bit based on the results in Figure 2 and 3.
> • 13B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 3B FP16 LLM.
> • 30B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 7B FP16 LLM.
> • 70B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 13B FP16 LLM.
This paper seems to represent a monumental breakthrough in LLM efficiency, as the efficiency gains come with zero (or negative) performance penalty.
Does it seem at all likely that existing models could be converted?
> Yes, the floating point standard specifies both +0.0 and −0.0. This concept is actually useful because it tells us from which “direction” the 0 was approached as a result of storing value too small to be represented in a float. For instance -10e-30f / 10e30f won’t fit in a float, however, it will produce the value of -0.0.
The authors of the LLM paper use the values {-1, 0, 1}. Connecting the two ideas, I'm now wondering whether having a 2-bit {-1, -0, 0, 1} representation might have any benefit over the proposed 1.58 bits. Could the additional -0 carry some pseudo-gradient information ("the 0 leaning towards the negative side")?
Also, I've seen 2-bit quantizations being proposed in other LLM quantization papers. What values are they using?
A great intro to the theoretical reasons ternary might have some promise in computing is this 2001 article from 'American Scientist', "Third Base", which quotes Knuth calling balanced-ternary "perhaps the prettiest numbering system of all" and also discusses an abortive Soviet effort in the direction of ternary computing:
http://web.archive.org/web/20011205185830/http://americansci...
In an aside, the article hints that e-nary digits (base 2.718…) if somehow made practical/meaningful, might actually be better than ternary (or perhaps even optimal?).
So maybe this paper's observation that ~"1.58 bits" (log2(3) binary digits) is a sweet spot could be further refined into some method for representing the state of an e-nary-modeled algorithm in log2(e) binary digits (~"1.44 bits") per underlying e-it.
(As it may be of renewed interest, I've also put this 2001 "American Scientist" base-3 intro as a new HN submission for discussion: https://news.ycombinator.com/item?id=39541756)
And you can fit 120B model with a single card 24GB VRAM. This is mind blowing.
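Rough arithmetic behind that claim, as a sketch. It counts the weights only (no activations, KV cache, or embeddings kept in higher precision) and uses decimal gigabytes; both are my assumptions.

```python
import math


def weight_gigabytes(n_params, bits_per_weight):
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 120e9
for label, bits in [("fp16", 16), ("2-bit packed", 2),
                    ("5 trits per byte", 8 / 5), ("log2(3) ideal", math.log2(3))]:
    print(f"{label:>18}: {weight_gigabytes(n_params, bits):6.1f} GB")
```

Under these assumptions 120B weights come to roughly 23.8 GB at the ideal log2(3) density and 30 GB with a plain 2-bit packing, so the 24GB figure only just holds and leaves almost nothing for anything else.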
It's nice to finally see practical networks reach the theoretical limits found in the statistical mechanics of Ising models. A good pointer to efficient 1-bit training, from the statistical mechanics point of view, is here:
Let me know what's the best way to reach out!
Perhaps the accumulators in current hardware cannot leverage this to its full potential, but combined with such a strict quantization, this would open LLM to the wider ML community much earlier than expected (when consumer hardware allows you to train near SOTA LLMs from scratch on your machine).
Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
https://arxiv.org/abs/1602.02830
Ternary Neural Networks for Resource-Efficient AI Applications
[1] https://www.microsoft.com/en-us/research/blog/make-every-fea...
It seems that it may be published on GitHub [1] according to HuggingFace [2].
> We would definitely be happy to open-source the models for future research. Please stay tuned!
I worked on this 7 years ago, trying to efficiently binarize CNNs from existing models. The difficulty was getting training running without the losses going too high. I think that vision models will be much more difficult to binarize, but you might not need to with CLIP if the vision encoder stays in regular math {fp16,int8}
Speech to text, however, is super interesting. You just gave me an idea! I'm gonna go run some experiments :D
A 1 bit multiplier in silicon is a single logic gate, but a ternary decoder to decode a packed tri-state 'weight' is bigger.
I therefore suspect that this method will be extended to make all weights simple 1 or 0 (ie. Binary). Perhaps that will be done by having half the weights have 1 or 0 values, while the other half are -1 or 0.
That should be called an 8/5 = 1.6 bit model though, while the paper names it 1.58 bit, closer to log_2(3) ~ 1.5849625
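The 8/5 figure presumably refers to packing five ternary weights into one byte, which works because 3^5 = 243 fits in 256. A minimal sketch of such a packing, treating the five values as base-3 digits; the function names and digit order are my own choices, not anything from the paper.

```python
def pack5(trits):
    """Pack five values in {-1, 0, +1} into one byte as a base-3 number (0..242)."""
    assert len(trits) == 5
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)  # map -1, 0, +1 to base-3 digits 0, 1, 2
    return value


def unpack5(byte):
    """Recover the five ternary values from a byte produced by pack5."""
    trits = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)
    return trits


packed = pack5([1, -1, 0, 0, 1])
print(packed)           # prints 200
print(unpack5(packed))  # prints [1, -1, 0, 0, 1]
```

Unpacking needs divisions by 3 (or a 243-entry lookup table), which is the decoding cost the neighbouring comments about ternary decoders are pointing at.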
Not sure if it's more efficient than just binary digital circuits in highly integrated chip, though.
-1/1 is appealing to me (0 = -1) because bit hackery could be used instead of the multiplication function, presumably on integral or fixed-point representations. The goal would be to eliminate any "if/then" like "if 0 do this if 1 do that" to avoid the need for branch prediction - there are bit-hackery ways to bypass this. That would lend itself well to all existing processors, ASICs, FPGAs, GPUs, etc.
https://github.com/yashkant/quantized-nets
https://github.com/TropComplique/trained-ternary-quantization
https://github.com/buaabai/Ternary-Weights-Network
I too find it very interesting. But why this sudden, renewed fuss?
But if you wanted to make LLM-specific hardware (or x64 instructions tuned for LLMs) this model architecture makes that extremely cheap. Multiplication requires a lot of transistors, this architecture requires only two-bit adders. You could make SIMD instructions that do thousands of these in parallel, for fairly little silicon cost.
"Straight-through estimator. To train our 1-bit model, we employ the straight-through estimator (STE)[BLC13] to approximate the gradient during backpropagation. This method bypasses the nondifferentiable functions, such as the Sign (Eq. 2) and Clip (Eq. 5) functions, during the backward pass. STE allows gradients to flow through the network without being affected by these non-differentiable functions, making it possible to train our quantized model."
also the author's (@shumingma) answer in the comments: https://huggingface.co/papers/2402.17764#65df17ed4d436404cdc...
Anyone willing to explain it like I’m a Django developer who watched half a karpathy video?
I appreciated this comment [0] from earlier in the thread by paul_mk1:
> My best guess is that it is encouraging the network to choose good underlying subnetworks to solve the problem, similar to Lottery Ticket Hypothesis. With ternary weights it is just about who connects to who (ie a graph), and not about the individual weight values anymore.
For myself, I've done a lot of work with image hashing (such as pHash and dHash) -- and in those, you throw away a LOT of information, but simply by keeping the value of each region and tracking whether or not it's above or below the average (essentially, the sign), then it's astounding how robust those algorithms are. Because you don't look at the individual pixels of an image, but it's very good at capturing the impression of the overall _shape_ of the image.
It's less about each individual datum, and more about the shape of the network.
If you're not familiar with Lottery Ticket Hypothesis, that would be worth reading up on.
> Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption
What neurons can do though is integrate over time, so your output can be one spike, or 3 spikes very quick, same for your input, and maybe 10 quick spikes in a row is a more powerful signal than a lone spike. We know this intuitively, though, via vision, we don't see in mac-classic style black/white images, we see shades of brightness and color, indicating that at least our optic nerve is sending what amounts to an analog signal (even if encoded as binary spikes - is the spike timing not analog?)
This is not to mention all the biochemical signaling that happens, and the multitude of local neurotransmitters and global physiological/hormonal factors at play. And all that weird stuff like glial cells and astrocytes is there in the mix too.
First of all, they operate independent of a synchronized clock, and they can also accumulate signals instead of executing on an input. Neuromorphic chips are closer to how the brain works, but they're still super early. I believe Intel has the best one with the Loihi 2.
(Not a neuroscientist but my wife is and that's what I understand from our chats)
Inference is definitely an issue for LLMs right now. But if training were suddenly possible for lone hackers (or maybe smaller companies), it would open up a lot of new possibilities as well.
The most inspiring aspect to me here is just realizing how much potential low-hanging fruit there is in this space! What other seemingly naïve optimizations are there to try out?
Training CIFAR-10 speedily w/ ternary weights on an fp16 interface (using fp16 buffers, and norm params unchanged): https://gist.github.com/tysam-code/a43c0fab332e50163b74141bc...
does that mean we can do integer instead of floating point math for some parts of the training? that seems like a really big win
So, do we need the -1, and/or would a 2.32 bit (5 state, or 6 with +/-0) LLM perform better than a 1.58 bit LLM?
If we could train in this domain it would be an even bigger game changer.
.. And the paper is _true_ of course, indeed, this sort of compounding quantum leap in efficiency due to representational change starts to get towards the Black Mirror / SciFi foundational mythology level of acceleration. Wild (if true!)
Doesn't this mean that current big players can rapidly expand by huge multiples in size?
You can choose their model ("Experimental"), but it is not faster than the other models.
All of these proprietary models are fast on Perplexity. I guess they are using some insane cache system, better API infrastructure...
> We use binary states in normal computing to reduce entropy. In AI this is less of a concern, so why not use more of the available voltage range?
Transistors that are fully closed or fully open use basically no energy: they either have approximately zero current or approximately zero resistance.
Transistors that are partially open dissipate a lot of energy; because they have some current flowing at some resistance. They get hot.
In addition, modern transistors are so small and so fast that the number of electrons (or holes..) flowing through them in a clock cycle is perhaps in the range of a few dozen to a hundred. So that gives you at most 7 bits (~log_2(128)) of precision to work with in an analog setting. In practice, quite a bit less because there's a lot of thermal noise. Say perhaps 4 bits.
Going from 1 bit per transistor to 4 bits (of analog precision) is not worth the drastically higher energy consumption nor the deviation from the mainstream of semi-conductor technological advances.
this then requires a higher skill from the engineers/consumers.
if you want to avoid that you need to add op-amps with a gain of 1 at the boundary of each one; this also takes care of the power loss at each stage.
the other part is that there's a limit to the amount of useful information/computation you can do with analog processing too once you take into account voltage noise. when you do a comparison there are stages where analog wins but also places where digital wins.
I'll edit this later with a link to some papers that discuss these topics if I manage to find them in my mess.
They visit Texas company Mythic AI to discuss how they use flash memory for machine learning. There's a California company named Syntiant doing something similar.
I remember reading a review on the history in grad school (can't remember the paper) where the author stated that one of the initial interests in NNs by the military was their distributed nature. Even back then, people realized you could remove a neuron or break a connection and they would still work (and even today, dropout is a way of regularizing the network). The thinking was that being able to build a computer or automated device that could be damaged (radiation flipping bits, an impact destroying part of the circuit, etc) and still work would be an advantage given the perceived inevitability of nuclear war.
Compared to a normal von Neumann machine which is very fault intolerant - remove the CPU and no processing, no memory=no useful calculation, etc. One reason people may have avoided further attempts at physical neural networks is it's intrinsically more complex than von Neumann, since now your processing and memory is intertwined (the NN is the processor and the program and the memory at the same time).
von neumann? though it is funny to imagine von braun inventing computer architecture as a side hustle to inventing rocket science.
Also preceding the perceptron was the McCulloch & Pitts neuron, which is basically a digital gate. NNs and computing indeed have a long history together.
It's my long held opinion that LUTs (Look Up Tables) are the basis of computation for the future. I've been pondering this for a long time since George Gilder told us that wasting transistors was the winning strategy. What could be more wasteful than just making a huge grid of LUTs that all interconnect, with NO routing hardware?
As time goes by, the idea seems to have more and more merit. Imagine a grid of 4x4 bit look up tables, each connected to its neighbors, and clocked in 2 phases, to prevent race conditions. You eliminate the high speed long lines across chips that cause so much grief (except the clock signals, and bits to load the tables, which don't happen often).
What you lose in performance (in terms of latency), you make up for with a homogeneous architecture that is easy to think about, can route around bad cells, and can be compiled to almost instantly, thanks to the lack of special cases. You also don't ever have to worry about latency; it's constant.
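To make the idea concrete, here is a toy simulation. It is only a sketch under my own assumptions: I read "4x4 bit look up table" as 4 input bits (one per neighbour) mapped to 4 output bits (one per neighbour), and "2 phases" as a checkerboard where even and odd cells take turns. All names (`make_grid`, `step`, `WIRE_W_TO_E`) are made up for illustration.

```python
# Hypothetical BitGrid-style sketch: names and bit layout are assumptions.
N, E, S, W = 0, 1, 2, 3  # bit position used for each direction

def make_grid(rows, cols, lut):
    """Build a grid where every cell holds a copy of the same 16-entry LUT."""
    return {
        "rows": rows,
        "cols": cols,
        "lut": [[list(lut) for _ in range(cols)] for _ in range(rows)],
        "out": [[0] * cols for _ in range(rows)],  # 4 output bits per cell
    }

def cell_input(grid, r, c, edge):
    """Collect one bit from each neighbour; off-grid bits come from `edge`."""
    def bit(nr, nc, direction, key):
        if 0 <= nr < grid["rows"] and 0 <= nc < grid["cols"]:
            return (grid["out"][nr][nc] >> direction) & 1
        return edge.get(key, 0)
    return ((bit(r - 1, c, S, (r, c, N)) << N)
            | (bit(r, c + 1, W, (r, c, E)) << E)
            | (bit(r + 1, c, N, (r, c, S)) << S)
            | (bit(r, c - 1, E, (r, c, W)) << W))

def step(grid, phase, edge=None):
    """Update only cells whose (row + col) parity equals `phase`.

    Every neighbour of an updated cell has the other parity, so it holds
    still during this phase and in-place updates cannot race.
    """
    edge = edge or {}
    for r in range(grid["rows"]):
        for c in range(grid["cols"]):
            if (r + c) % 2 == phase:
                index = cell_input(grid, r, c, edge)
                grid["out"][r][c] = grid["lut"][r][c][index]

# A LUT that acts as a wire: copy the west input to the east output.
WIRE_W_TO_E = [((i >> W) & 1) << E for i in range(16)]

grid = make_grid(1, 3, WIRE_W_TO_E)
edge = {(0, 0, W): 1}          # drive a 1 into the west side of cell (0, 0)
for phase in (0, 1, 0):        # one cell of progress per phase
    step(grid, phase, edge)
arrived = (grid["out"][0][2] >> E) & 1
```

In this toy the bit crosses one cell per phase, so the latency of any path is just the number of cells on it, which is the "count the steps" property.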
(However, analog computing is still a bad fit for machine learning, because it requires a lot more power.)
[1] https://thechipletter.substack.com/p/john-c-dvorak-on-intels...
Probably, but is it worth the cost? One of the goals behind BitNet and this paper is to find a way to implement LLMs as efficiently in hardware as possible, and foregoing floating point semantics is a big part of it. I'm not sure if there's a way to encode -0 that doesn't throw out half the performance gains.
> Could the additional -0 carry some pseudo-gradient information
It looks like training was done in fp32 or bf16. Low-bit quantization is approximated with STE (straight-through estimator) during training. I'd expect training itself to cause each point to "polarize" towards 1 or -1.
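A minimal sketch of what STE training looks like, in plain Python with no autograd. Assumptions on my part: the quantizer is the absmean round-and-clip as I understand it from the paper, the scale gamma is treated as a constant in the backward pass, and the toy loss and the names `ternarize` / `ste_step` are purely illustrative.

```python
def ternarize(weights, eps=1e-8):
    """Absmean quantization: scale by mean |w|, round, clip to {-1, 0, 1}."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    q = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return q, gamma

def ste_step(latent, x, target, lr=0.1):
    """One SGD step on loss = (y - target)^2 with y = gamma * dot(q, x).

    Forward uses the ternary weights q. Backward pretends the quantizer is
    the identity (straight-through), so the gradient with respect to the
    effective weight gamma * q_i is applied directly to latent weight i.
    """
    q, gamma = ternarize(latent)
    y = gamma * sum(qi * xi for qi, xi in zip(q, x))
    err = y - target
    return [w - lr * 2.0 * err * xi for w, xi in zip(latent, x)]

latent = [0.9, -0.05, -1.2, 0.3]      # high-precision "latent" weights
q, gamma = ternarize(latent)          # ternary weights used in the forward pass
updated = ste_step(latent, x=[1.0, 0.0, 0.0, 0.0], target=0.0)
```

The point of the sketch is that the high-precision latent weights are what get updated, while only the ternary copy is used in the forward pass.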
> 2-bit quantizations being proposed
Symmetric (i.e. without 0) exponential values were pretty popular IIRC.
In my mind the two zero values would represent a tiny epsilon around 0, let's say -0.01 and +0.01. Looking at them like this, it would mean
+0 +0 -0 = +0
+0 -0 -0 = -0
+1 * +0 = +0
-1 * +0 = -0
Performing addition with the same sign count in each group would be problematic. How to decide on the sign of +0-0 or +1-1, other than flipping a coin?
Or you could use the regular positive-two base and encode {-2, -1, 0, 1} the normal way with two's complement.
Negative bases are fun. See https://en.wikipedia.org/wiki/Negative_base
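A quick sketch for playing with this (function names are mine, not from any library). It converts integers to and from base -2, and also decodes two-bit two's complement so the two encodings of {-2, -1, 0, 1} can be compared side by side.

```python
def to_negabinary(n):
    """Return the base -2 digits of integer n, most significant first."""
    if n == 0:
        return "0"
    digits = []
    while n != 0:
        n, r = divmod(n, -2)
        if r < 0:               # Python gives r in {0, -1}; force r into {0, 1}
            n, r = n + 1, r + 2
        digits.append(str(r))
    return "".join(reversed(digits))

def from_negabinary(s):
    """Evaluate a digit string in base -2."""
    return sum(int(d) * (-2) ** i for i, d in enumerate(reversed(s)))

def from_twos_complement(bits, width=2):
    """Decode an unsigned bit pattern as a `width`-bit two's complement value."""
    return bits - (1 << width) if bits >= 1 << (width - 1) else bits

two_bit_negabinary = {format(b, "02b"): from_negabinary(format(b, "02b")) for b in range(4)}
two_bit_twos_comp = {format(b, "02b"): from_twos_complement(b) for b in range(4)}
```

At a width of two bits both schemes assign the same values to the same patterns (00 = 0, 01 = 1, 10 = -2, 11 = -1); they only start to differ at larger widths.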
In other words, only inference cost is holding it back from completely changing everything.
So if we have a shortcut to getting something like GPT4 to run locally on a small device, watch out.
SDXL-Lightning/Cascade can generate images in 200ms, which is fast enough to fit in a web request, and paradoxically makes it even cheaper to generate.
And using groq at 500 t/s is wild compared to any of the other platforms.
Sure, Nvidia might eat their lunch in a couple of years, but bitcoin ASICs prove that you can have a niche producing specialized processors, and VCs would probably jump at the thought of disrupting Nvidia's high margin business.
There's rain.ai, d-matrix, etc.
https://en.wikipedia.org/wiki/Nat_(unit) (make sure to read the footnotes, too)
Edit: See also also, on the radix economy of balanced ternary (called "tristate") vs base 3: https://web.archive.org/web/20090312094241/http://abhijit.in... + a wild Marvin Minsky appears: https://archive.fo/gL2Bv
That page also brings up the whole "but division" problem with balanced ternary, however, I personally suspect that http://degiorgi.math.hr/aaa_sem/Div_Krishna/887-889.pdf ("A Division Algorithm for Signed-Digit Arithmetic" by Chin Tung, from 1968 !) might offer an overlooked path to a solution to that problem
And see also also², this quote from TAOCP:
"Cauchy pointed out that negative digits make it unnecessary for a person to memorize the multiplication table past 5x5."
The—INCREDIBLY ANNOYING TO LOCATE—source for which is "105. Calculs numériques. sur les moyens d'éviter les erreurs dans les calculs numériques." on Pdf page 445/document page 431 here:
https://www.e-rara.ch/download/pdf/5702285?name=Tome%2520V%4...
See also also³: https://pdfs.semanticscholar.org/5f77/b1cf105024b41b6824ba91... (Vince, Andrew - Radix Representation and Rep-Tiling)
( +a vaguely related paper here on quantum mechanics & radix economy, BUT it makes the mistake of using an overly specific formula applicable only to unsigned-digit representations thus drawing the wrong conclusions: https://www.researchgate.net/profile/Vladimir_Garcia-Morales... )
{ -1, 0, 1, 2 }
is most obvious, but it's not clear whether it's better or worse than { -1, 0, 1/2, 1 }
Maybe theoretically (if not architecturally) it would be best to "split the difference" between the two and choose { -1, 0, 1/phi, phi }
or perhaps the more implementable { -1, 0, 1, 3 }
EDIT: Of course you can also go the other way, with { -1, 1 }
Btw TrueNorth project evolved into "NorthPole" chip by the same group, and was recently in the press. From afar NorthPole looks like an interesting design point and leverages on-chip memory (SRAM)--so it's targeting speed and efficiency at the expense of memory density (so perhaps like Groq in some respects). Tbh I haven't followed the field closely after leaving the group.
But in fairness, getting these techniques to work at scale is no small feat. In my experience quantization aware training at these low bit depths was always finicky and required a very careful hand. I'd be interested to know if it has become easier to do, now that there are so many more parameters in LLMs.
In any case full kudos to the authors and I'm glad to see people continuing this work.
But since they are more memory efficient (up to 8x, or 10x if the ternary values are packed more tightly than 2 bits each; in practice it seems to be 3-5x once you count the other large structures needed in memory), the largest models can be that much larger.
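For a rough sense of where 8x and 10x could come from (my assumption: the baseline is 16-bit weights), note that 3^5 = 243 fits in one byte, so five ternary values can share 8 bits, i.e. 1.6 bits each instead of 2. A sketch, with made-up helper names:

```python
def pack5(trits):
    """Pack exactly five values from {-1, 0, 1} into one byte (0..242)."""
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    return sum((t + 1) * 3 ** i for i, t in enumerate(trits))

def unpack5(byte):
    """Inverse of pack5: recover the five ternary values from one byte."""
    trits = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)
    return trits

ratio_two_bit = 16 / 2          # 2 bits per weight vs 16-bit weights
ratio_packed = 16 * 5 / 8       # 5 weights per byte vs 16-bit weights
```

The trade-off is that unpacking base-3 digits needs a divide or a lookup table, whereas 2-bit fields can be extracted with shifts and masks.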
> The number of training tokens is a crucial factor for LLMs. To test the scalability of BitNet b1.58 in terms of tokens, we trained a BitNet b1.58 model with 2T tokens following the data recipe of StableLM-3B [ TBMR], which is the state-of-the-art open-source 3B model.
> [..]
> Our findings shows that BitNet b1.58 achieves a superior performance on all end tasks, indicating that 1.58-bit LLMs also have strong generalization capabilities.
The approach is used to solve other problems and papers have been published under https://www.researchgate.net/profile/K-Eswaran
We are currently trying to build a full-fledged LLM using just this approach (no LLM training etc.) and also an ASR. We should have something to share in a couple of months.
It says here ( https://www.researchgate.net/publication/370980395_A_NEURAL_... ) that each layer can be represented as a matrix multiplication (equation 3): Ax = s
So concatenating multiple layers could just be reduced to a single matrix multiplication?
If there is no non-linearity I don't see how this could replace neural networks, or am I missing something?
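The collapse is easy to check numerically. A tiny sketch with arbitrary integer matrices (A1, A2 and x are made-up values, not taken from the linked paper):

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, x):
    """Matrix-vector product."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def relu(v):
    return [max(0, t) for t in v]

A1 = [[1, 2], [0, -1], [3, 1]]       # layer 1: 2 inputs -> 3 units
A2 = [[2, 0, 1], [-1, 1, 0]]         # layer 2: 3 units -> 2 outputs
x = [4, -3]

layer_by_layer = matvec(A2, matvec(A1, x))       # two linear layers
collapsed = matvec(matmul(A2, A1), x)            # one precomputed matrix
with_relu = matvec(A2, relu(matvec(A1, x)))      # non-linearity in between
```

Here `layer_by_layer` and `collapsed` agree, while `with_relu` does not, which is why the non-linearity is what keeps a deep stack from being one big matrix.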
We believe the brain does not do nonlinear maps!
All that routing hardware, and the special function units featured in many FPGAs are something you have to optimize the usage of, and route to. You end up with using solvers, simulated annealing, etc... instead of a straight compile to binary expressions, and mapping to the grid.
Latency minimization is the key to getting a design to run fast in an FPGA. In a BitGrid, you know the clock speed, you know the latency by just counting the steps in the graph. BitGrid performance is determined by how many answers/second you can get from a given chip. If you had a 1 GHz rack of BitGrid chips that could run GPT-4, with a latency of 1 ms per token, you'd think that was horrible, but you could run a million such streams in parallel.
Sounds interesting, but this is the part I would need more explanation on.
Just started reading your linked blog, I see it goes into some details there.
I was talking about analogue computing.
https://github.com/ggerganov/llama.cpp/pull/1684#issue-17396...
TBH I think they won't get anywhere. Doing good game engine work... why would that translate to AGI?
> It is interesting that things still train even when various parts are pretty wrong — as long as the sign is right most of the time, progress is often made.
https://forums.fast.ai/t/how-to-do-reproducible-models-and-u...
I’m glad people are doing it though, and I’ll happily adapt to accessing inference at that speed.
Intel and AMD could also implement support in their "next generation" and that would be huge.
Trits are helpful for neural nets, though, since they really love signs and they need a 0.
So from the perspective that it's all just bits in the end the only thing that is interesting is how useful it is to arrange those bits into trits for this particular algorithm, and that the algorithm seems to be able to use things more effectively that way than with raw bits.
This may seem an absolutely bizarre zigzag, but I am reminded of Busy Beavers, because of the way they take the very small primitives of a Turing Machine, break them down to the smallest pieces, then combine them in ways that almost immediately cease to be humanly comprehensible. Completely different selection mechanism for what appears, but it turns out Turing Machine states can do a lot "more" than you might think simply by looking at human-designed TMs. We humans have very stereotypical design methodologies and they have their advantages, but sometimes just letting algorithms rip can result in much better things than we could ever hope to design with the same resources.
Thank you. I find many other things interesting here, including the potential implications for hardware, but otherwise, yes, I agree with you, that is interesting.
The bit about floating point numbers just being a collection of bits interpreted in a certain way helps make sense of why a bigger model doesn't need floating point at all.
Yes. Though here the interesting point is not so much that these structures exist, but that 'stupid' back-propagation is smart enough to find them.
You can't find busy beavers like that.
The vectors are not.
That being said, before-LLM-era deep learning already had low bit quantization down to 1w2f [0] working back in 2016 [1]. So it's certainly possible it would work for LLM too.
[0] 1-bit weights, 2-bit activations; though practically people deployed 2w4f instead.
[1] https://arxiv.org/abs/1606.06160
> only three possible input pairs, (a,b), (a,c), or (b,c) (within each pair, order doesn't matter)
The correct number, ignoring order, is six pairs, because we have to include (a,a), (b,b), and (c,c).
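Easy to double-check with the standard library (the symbols a, b, c are just placeholders here):

```python
from itertools import combinations, combinations_with_replacement

values = ["a", "b", "c"]
# Unordered pairs of two different values: (a,b), (a,c), (b,c)
distinct_pairs = list(combinations(values, 2))
# Unordered pairs where both inputs may be the same value
all_unordered_pairs = list(combinations_with_replacement(values, 2))
```

`distinct_pairs` has three entries and `all_unordered_pairs` has six, the extra three being (a,a), (b,b) and (c,c).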
I admit it did shock me when it came out.
Cp = (Ap & Bp) | (An & Bn)
Cn = (An & Bp) | (Ap & Bn)
So 64 products in 6 instructions, or 256 in 6 instructions with AVX2, or 512 in six instructions using AVX512. If you can execute 2 instructions at a time on different words, this becomes 1024 "products" in 6 cycles or between 0.5 and 1 TOP per core.
The summing still involves using popcount on the positive and negative bits - I doubt AVX supports that but it's still a fast way to "sum" individual bits. I don't see custom hardware for this as a short term thing - they need to prove out the quantization concept more first.
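Here's the same thing as a sketch, using Python ints as arbitrarily wide bit vectors in place of SIMD registers. The encoding (one mask for +1 lanes, one for -1 lanes) follows the Cp/Cn formulas above; the helper names are my own.

```python
def popcount(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count("1")

def encode_pn(trits):
    """Return (pos, neg) bitmasks: bit i of pos is set if trits[i] == +1,
    bit i of neg is set if trits[i] == -1."""
    pos = sum(1 << i for i, t in enumerate(trits) if t == 1)
    neg = sum(1 << i for i, t in enumerate(trits) if t == -1)
    return pos, neg

def ternary_dot(a, b):
    ap, an = encode_pn(a)
    bp, bn = encode_pn(b)
    cp = (ap & bp) | (an & bn)   # lanes where the product is +1
    cn = (an & bp) | (ap & bn)   # lanes where the product is -1
    return popcount(cp) - popcount(cn)

a = [1, -1, 0, 1, -1, 0, 1, 1]
b = [1, 1, -1, 0, -1, 0, -1, 1]
result = ternary_dot(a, b)
```

The four ANDs and two ORs are the six instructions counted above; the two popcounts and the subtraction are the extra summing cost.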
C_sgn = A_sgn ^ B_sgn
C_mag = A_mag & B_mag
The result can then be converted into bitmasks for positive and negative:
C_plus = C_mag & ~C_sgn
C_minus = C_mag & C_sgn
This solution should be more efficient if there is an "AND NOT" instruction, or when multiplying more than two factors.
sum = popcount(mag) - 2*popcount(mag & sgn)
Yes, I agree. This still needs to be more extensively tested.
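For reference, the sign/magnitude variant above (C_sgn, C_mag) can be sketched the same way, again with Python ints standing in for wide registers. The encoding choice (sign bit set for negative values, magnitude bit set for non-zero values) is my reading of those formulas, and the helper names are hypothetical.

```python
def popcount(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count("1")

def encode_sm(trits):
    """Return (sgn, mag) bitmasks: sgn bit set for -1, mag bit set for +1/-1."""
    sgn = sum(1 << i for i, t in enumerate(trits) if t < 0)
    mag = sum(1 << i for i, t in enumerate(trits) if t != 0)
    return sgn, mag

def ternary_dot_sm(a, b):
    a_sgn, a_mag = encode_sm(a)
    b_sgn, b_mag = encode_sm(b)
    c_sgn = a_sgn ^ b_sgn            # product is negative if the signs differ
    c_mag = a_mag & b_mag            # product is non-zero if both are non-zero
    c_plus = c_mag & ~c_sgn          # lanes equal to +1
    c_minus = c_mag & c_sgn          # lanes equal to -1
    two_popcounts = popcount(c_plus) - popcount(c_minus)
    shortcut = popcount(c_mag) - 2 * popcount(c_mag & c_sgn)
    return two_popcounts, shortcut

a = [1, -1, 0, 1, -1, 0, 1, 1]
b = [1, 1, -1, 0, -1, 0, -1, 1]
two_popcounts, shortcut = ternary_dot_sm(a, b)
```

Both ways of summing give the same number; the product itself is only one XOR and one AND per word.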
https://huggingface.co/papers/2402.17764
"We haven't finished the training of the models beyond 3B as it requires much much more resources. However, we're optimistic about the results because we have verified that BitNet follows a similar performance-parameter scaling law as the full-precision LLMs. We'll update the results on larger models once they're ready."
(1) Take your data as a stream. Use your machine learning gadget to give you the (predicted) probability for each of the possible next tokens. Then use those probabilities in arithmetic coding to specify which token actually came next.
(2) Take your data D. Apply lossy compression to it. Store the result L := lossy(D). Also compute the residue R := D - uncompress(L). If your lossy compression is good, R will be mostly zeroes (with only a few actual differences), so it will compress well with a lossless compression algorithm.
Approach (1) is a more sophisticated version of (2). None of this is anything I came up with, those approaches are well known.
See eg https://arxiv.org/abs/2306.04050 and https://en.wikipedia.org/wiki/Audio_Lossless_Coding or https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049... (Probably not the best links, but something I could find quickly.)
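A toy sketch of approach (2), just to show that the round trip is exact. The "lossy codec" here is a deliberately crude stand-in (keep every 8th byte) that I made up for illustration; a real one would be a learned or perceptual codec. zlib plays the role of the lossless compressor for the residue.

```python
import zlib

STEP = 8  # hypothetical downsampling factor for the toy lossy codec

def lossy(data):
    """Toy lossy codec: keep every STEP-th byte."""
    return data[::STEP]

def uncompress(small, length):
    """Rebuild an approximation by repeating each kept byte STEP times."""
    return bytes(small[i // STEP] for i in range(length))

def encode(data):
    small = lossy(data)
    approx = uncompress(small, len(data))
    # R := D - uncompress(L), taken modulo 256 so it stays a byte string
    residue = bytes((d - a) % 256 for d, a in zip(data, approx))
    return small, zlib.compress(residue), len(data)

def decode(small, packed_residue, length):
    approx = uncompress(small, length)
    residue = zlib.decompress(packed_residue)
    return bytes((a + r) % 256 for a, r in zip(approx, residue))

# Mostly flat signal with one small deviation at index 41
data = bytes([10] * 41 + [11] + [10] * 22 + [200] * 64)
small, packed, length = encode(data)
restored = decode(small, packed, length)
```

For this input the residue is zero everywhere except at the one deviating byte, which is the "mostly zeroes" case described in (2).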
It discusses whether it is possible to evolve AGI using... a computer game engine! And that is John's bread and butter.
I'm not in this field but that's a question that's been bugging me for a while. If you can't do this, wouldn't energy consumption balloon?
That's not to say that a 70B model is necessary, but surely something larger than 3B is doable, especially given that the results of the paper directly imply a significant reduction in memory requirements for training such a model.
Isn't memory use in training higher, since they maintain high precision latent weights in addition to the binarized weights used in the forward pass?
I thought mmap'ing models to only keep the currently needed pieces in RAM was something that was figured out ~6 months ago? Performance wasn't terribly great iirc, but with how much faster 1.58B is, it should still be okay-ish.
For LLMs, you are mostly dealing with b = W @ a, where a and b are vectors and only W is a matrix. If a is sparse (i.e. has a few 0s), you don't need all the columns of W to do the matrix-vector multiplication. A cleverly arranged W can make sure that during inference only the relevant columns are loaded from flash. Furthermore, if you can apply the "One Weird Trick" paper to this matrix-vector multiplication, you can shard W by rows, i.e. `b[i:i+n] = W[i:i+n, :] @ a for i in range(0, N, n)`, such that while the previous b[i:i+n] is still computing, you already have visibility into which columns of the next matrix need to be loaded.
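A small sketch of both ideas on nested lists (no flash, no real sharding across devices; `sparse_matvec` and `sharded_matvec` are illustrative names, and the matrix is made up):

```python
def dense_matvec(W, a):
    """Reference: b = W @ a using every column."""
    return [sum(w * x for w, x in zip(row, a)) for row in W]

def sparse_matvec(W, a):
    """Touch only the columns of W whose matching entry of a is non-zero."""
    active = [j for j, x in enumerate(a) if x != 0]
    return [sum(row[j] * a[j] for j in active) for row in W], active

def sharded_matvec(W, a, n):
    """Compute b in row blocks of size n: b[i:i+n] = W[i:i+n, :] @ a."""
    b = []
    for i in range(0, len(W), n):
        b.extend(dense_matvec(W[i:i + n], a))
    return b

W = [[1, 2, 3, 4],
     [0, 1, 0, 1],
     [5, 0, -2, 3],
     [1, 1, 1, 1]]
a = [0, 2, 0, -1]                     # half of the entries are zero

b_dense = dense_matvec(W, a)
b_sparse, active_columns = sparse_matvec(W, a)
b_sharded = sharded_matvec(W, a, n=2)
```

`active_columns` is the list a loader would need in order to fetch only the relevant columns; each row block needs the full vector a but only its own rows of W.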
I can propose an alternate view of things. Not that I'm going to argue that it is the only true statement in the world, but I think it is necessary for a thought to progress to have an alternative hypothesis.
So the proposition is: formal symbolisms can deal only with those problems that were already solved in imprecise human languages.
To invent calculus and orbital mechanics you first need to talk for several centuries (or thousands of years?) about what position and velocity are, you need to talk your way up to acceleration, and then you need to find a way to measure them and to define them in strict geometric terms. Ah, and infinity: it was a very counter-intuitive idea, and Zeno invented some of his paradoxes specifically to point at that counter-intuitiveness. When Newton came along, all these talks and debates had done most of the work for him.
> the ability to understand the human tongue is insignificant compared to the power of math.
But the fun is: you cannot know if someone understands math if they do not understand human language too. You cannot teach math to those who cannot speak human language.
Math is the cream on top, with limited applicability. What can math say about love? I do not like to sound like Dumbledore, but really, behind all we do there are emotions motivating us. Math cannot deal with emotions, because it was built that way and because non-math talk about emotions hasn't produced a good model of emotions that math could express in a formalized language.
> Dijkstra says
I wonder when he said it? Before expert systems based on logic were acknowledged to be a failure, or after that?
> To invent calculus and orbital mechanics you first need to talk for several centuries (or thousands of years?) about what position and velocity are, you need to talk your way up to acceleration, and then you need to find a way to measure them and to define them in strict geometric terms. Ah, and infinity: it was a very counter-intuitive idea, and Zeno invented some of his paradoxes specifically to point at that counter-intuitiveness. When Newton came along, all these talks and debates had done most of the work for him.
For the sake of argument, let's grant your story about what you need to invent calculus.
But once you have invented calculus, you can then use it to solve all kinds of problems that you would never in a thousand years be able to handle with mere talk.
Not "all kinds of problems" but very specific kinds of problems that can be formalized in the language of math. How would you go about inventing thermodynamics if you didn't know the words "temperature" and "pressure"? You'd need to start from your senses, which can tell you "this is a hot surface", or "this is a cold one", or "this one is colder than that". You need to decide that "coldness" is "negative heat" (it is not the most obvious idea for an animal, because animals have receptors for cold as well as receptors for heat, so you could feel hot and cold at the same time if you managed to stimulate both kinds of receptors at once). Then you need to notice that some materials change volume when heated, then you need to come up with the idea of using measurements of volume to measure temperature, and only then can you try to invent pV=nRT, which becomes almost tautological at that point, because your operational definition of temperature makes it equivalent to volume.
After that you really can use calculus and make all sorts of quantitative statements about thermodynamic systems. But before all that "mere talk" was finished thermodynamics was not a kind of a problem calculus can deal with.
However, this paper is evidence that the field is figuring out how to build what's actually needed, which is a good thing.
If you want to measure its ability to do mindlessly repetitive tasks without diverging from instructions, you should compare it to humans doing the same, not expect it to act like a calculator.
If you want to measure its ability to solve problems that involve many such steps that are simple to express but tedious to carry out, ask it to write and evaluate code to do it instead.
If you include a large amount of properly solved math in its training text, it gets MUCH better at that kind of math.
It has a very deep set of intelligences that are alien to us, that allow it to predict and ACT LIKE us, when it comes to generating the next word. You're only seeing the output of those intelligences through a very lossy channel.
As a side note, there are structures in human language that apparently encode much more information than you might think at first glance. The fact that Word2Vec had such mathematical properties, despite its relative simplicity, astounds me to this day. Throwing a bunch of sine/cosine values on top of that to represent position in a sentence to enable LLMs is also amazing in that it works.
- The result of 69*94 is 6466.
In fact that kind of 'finishing' is very important, because otherwise you can waste a lot of time talking without noticing that you are not going anywhere. See eg philosophy or theology or pre-scientific-revolution science (ie natural philosophy and natural history).
I think you could conceive of abstraction from other forms, maybe something like platonic forms as a base instead of language (again probably not in humans, but in others)
But whether or not it can "do maths" to your definition depends very much on what you want it to do, and how you define "do maths". To me it's irrelevant if it's doing the low-level calculations as long as it knows how to express them as code. If I wanted a calculator I'd use a calculator. And I don't consider a calculator able to "do math" just because it can precisely add numbers.
Meanwhile I've had lengthy discussions with GPT about subjects like orbital mechanics and calculating atmospheric effects where it correctly used maths that I had to double-check not because I didn't trust GPT (though I also wanted to verify for that reason) but because I didn't know the maths (not that it was anything particularly advanced, but I lost interest in maths during my CS degree and picked the minimum amount of maths I could get away with).
By my definition it can "do maths" just fine. I guess you don't consider my view of that "reasonable". I can live with that, as meanwhile, it will keep doing maths for me when I need it.
Of course this was also a case of moving the goalposts to set up a strawman - in the comment of yours I replied to, you claimed it couldn't reliably add two numbers.
I'm not moving goalposts, the original claim was that LLMs can "do math". Primary school arithmetic is math.
GPT-4 can't do math and that's okay, I don't understand why so many of you are so touchy and defensive about this. It's a limitation that exists, nothing more, nothing less.
If you train a model to do math (and optimize representation for that), it'll do math. GPT-4 just isn't, and, generally speaking, they aren't, because it's much more efficient to train them to "use a calculator". Same as with humans.
I find it really comical that this is what people complain about GPT over - there's zero benefit to getting LLMs to be good at this over other tasks. To the extent we get it "for free" as a benefit of other learning, sure. When we make kids practice this over and over again to drill doing it without getting sloppy, it has traditionally been out of some belief that it's important. But a computer will always have a "calculator" at its disposal that is far more efficient than the LLM, and it's idiocy to care about whether it does that part well the tedious and hard way or knows how to describe the problem to a more efficient tool.
I also find it comical that, as examples of how they're not good enough, people use stupidly repetitive tasks where LLMs' behaviour is if anything most human-like, in its tendency to lose focus and start taking shortcuts (before GPT4 started writing Python instead, it'd for a while try really hard to not give you a step by step breakdown and instead clearly take shortcuts even if you prompted it heavily to reason through it step by step).
All human knowledge is "symbolic". That is, knowledge is a set of abstractions (concepts) along with relations between concepts. As an example, "knowing" addition is to understand the "algorithm" or operations involved in adding two numbers. Reasoning is the act of traversing concept chains.
LLMs don't yet operate at the symbolic level, and hence, it could be argued that they don't know anything. An LLM is a modern sophist, excelling at language but not at reasoning.