Finding the average of two unsigned integers without overflow(devblogs.microsoft.com) |
Finding the average of two unsigned integers without overflow(devblogs.microsoft.com) |
(a>>1) + (b>>1) + (a&b&1)
No division needed. unsigned average(unsigned a, unsigned b) {
return (a + b) / 2;
}
At the end of the day it's all just text. There are plenty of steps before any of it does anything at all.Really depends on where you're coming from. Anyone who has dipped their toes in embedded programming will immediately know they are equivalent, and many will correct /2 to a bitshift, because that's what you want to happen.
I get that bit twiddling is obscure outside of low level programming, but bit shifts really is kindergarten stuff in this domain.
Turning (a+b)/2 into a/2 + b/2 is basic obvious math.
If you do it and to any basic testing you will realize you are getting of by one errors, locking at them can then make it obvious that when they appear and hence how to fix them.
Sure a proof is more complex, but then you can just trivially test it for all smaller-bit numbers over all possible inputs, hence making proofs unnecessary (for that numbers).
This is a solution a not yet graduated bachelor student can find in less then a day.
Having granted a patent for this should lead to disciplinary measurements against the person granting the patent tbh.
Props for including the assembler breakdown for every major CPU architecture.
Divide both operands by 2 was my first idea before loading the page. (I like to try that sometimes before reading the articles.)
I didn't think about the carry bit, but it seems like that would be a logical solution after 5 minutes of extra thinking.
I'm not sure how that's patentable.
That's insane to me.
But maybe there is more too it.
I didn't read the patent itself.
The patent is for a circuit design to perform that algorithm in a single cycle. The algorithm was never patented, nor could it be.
logic [31:0] a, b, average; assign average = (a >> 1) + (b >> 1) + (a & b & 32'd1);
If you have some novel arrangement of gears and levers to propose, or some recipe of doping material A with chemical B and then doing so-and-such - that's a classical hardware patent, but the whole point of these "an apparatus" patents is that they aren't trying to describe a novel device anybody invented, they're describing the algorithm the patent is for the algorithm and phrases like "an apparatus" are just a wafer thin excuse for this to be waved through by people whose salary is paid for by the endless flood of such software patents.
As a parallel comment is explaining there hasn't for many years been any substantial difference between what seems to be very much "hardware" and what common folk think of as "software".
The patent application even spells it out, explaining that "A general purpose computer [...] can execute the invention in a single instruction cycle [...]". What "invention" do you suppose is being "executed" here? Is the general purpose computer... making a special circuit? Maybe your laptop is now selling DVD players to Best Buy? No. Their "invention" is just the algorithm, they successfully patented the algorithm.
Nobody would "stop you writing that algorithm", but if they wanted to, and if you've got enough money for it to be worth taking your money, while that patent was valid they could sue you for infringing on their invention and unless you've got deep pockets and proof you invented it first, you are screwed.
I honestly doubt that they could sue you for infringing on this.
I mean... ignoring the bitwise arithmentic (which this only obvious to people used to doing binary operations) this is the kind of maths that an 11yo could do.
That said, the patented solution is a little more complex. But not by much.
Which makes me curious: what other patents have we violated in our day-to-day without even knowing it?
Patents are like the criminal code - always remember "Three Felonies a Day" [1]. The system is set up so that if you are one of the 99%, the 1% can come in and bust you at will if you become too much of an annoyance/threat. They will find something if they just keep digging deep enough (not to mention that they can have your entire company's activity combed through with a microscope if they find a sympathetic court), and blast you with enough charges and threaten sequential jail time so that you cannot reasonably do anything other than plead guilty and forfeit your right to a fair trial [2].
And for what it's worth, that "play by the rules as we want or we will destroy you" tactic can even hit multi-billion dollar companies like Epic Games. It's one thing if society decides to regulate business practices by the democratic process of lawmaking... but the fact that Apple can get away banning perfectly legal activities such as adult content, vaping [3] or using a non-Apple payment processor from hundreds of millions of people is just insane, not to mention incredibly damaging to the concept of democracy.
[1]: https://kottke.org/13/06/you-commit-three-felonies-a-day
[2]: https://innocenceproject.org/guilty-pleas-on-the-rise-crimin...
[3]: https://www.macrumors.com/2020/06/01/pax-vape-management-web...
I looked at the title while still waking up.
At first I thought of the low + (high - low) / 2 method. I then figured maybe it was better to simply predivide both numbers before adding and just correcting for the lowest bit (how was that ever patented?!).
However, I didn't like having to perform two divisions so I thought there was probably something clever one could do with bit operations to avoid it. But, still being tired, I decided I didn't want to actually spend time thinking on the problem and I'd already spent a minute on it.
Emphasis on supposed.
The granted patents include: laser used to exercise cat, and mobile wood based dog game (log used to play fetch).
https://abovethelaw.com/2017/10/8-of-my-favorite-stupid-pate...
https://patents.google.com/patent/US5443036A/en
https://patents.google.com/patent/US6360693
Apple steals the cake though. By patenting a geometric shape.
the patented solution immediately came to mind
And it's from 1996.
The actual patent system failure here is the patent is not useful -- it's not valuable. If you needed this solution, you could sit down and derive it in less than an hour. That's not because it's obvious, but because the scope is so small.
The only difference between this patent and say a media codec is how long it would take to reinvent it. It might take you 200 years to come up with something as good as h.265, but there's no magic to it. There's a problem, somebody came up with a solution, somebody else could do it again given enough time to work on it. This is true for everything that's ever been patented.
The point of patents is to compensate for value of the work needed to reinvent, and so the real problem here is that value is less than any sane minimum. The value is less than the patent examiner's time to evaluate it! But court rulings have said it doesn't matter how insignificant a patent is, as long as it does anything at all it's "useful", which leads to these kinds of worthless patents.
Any logic designer, who is not completely incompetent, when seeing the expression
(a / 2) + (b / 2) + (a & b & 1);
will notice that this is a 1-cycle operation, because it is just an ordinary single addition.
In hardware the divisions are made just by connecting the bits of the operands in the right places. Likewise, the "& 1" is done by connecting the LSB's of the operands to a single AND gate and the resulting bit is connected to the carry of the adder, so no extra hardware devices beyond a single adder are needed. This is really absolutely trivial for any logic designer.
The questions at any hiring interview, even for beginners, would be much more complex than how to implement this expression.
It is absolutely certain that such a patent should have never been granted, because both the formula and its implementation are obvious for any professional in the field.
Our experiences and training has changed dramatically over the past 26 years.
The patent issued in 1996 and wasn't revisited since then (because never asserted in litigation). The USPTO is a lot different now, a quarter-century later.
Please be more specific or link something that explains how they've improved.
Adding two bits produces a sum and a carry:
0 + 0 = 0, carry 0
1 + 0 = 1, carry 0
0 + 1 = 1, carry 0
1 + 1 = 0, carry 1
So the sum is XOR, and the carry is bitwise AND. We can rewrite x + y as (x ^ y) + (x & y)*2Distribute the divide, and you get (x ^ y)/2 + (x & y) which is the mystery expression.
(Note this distribution is safe only because (x & y)*2 is even.)
Software patents are absolutely disgusting.
To be precise, legally it is "not obvious back in 1996." There is a lot of stuff that is obvious today that wasn't 25 years ago. That said, this one in particular probably would have been invalidated as obvious if it was ever litigated (and it was not). Also, the USPTO has reined in software patents a lot in recent years (but always people advocating for more or less).
Given its simplicity this makes me wonder if a compiler has ever transformed legal original IP code into patented code.
https://ai.googleblog.com/2006/06/extra-extra-read-all-about...
https://news.ycombinator.com/item?id=3530104
https://news.ycombinator.com/item?id=1130463
https://news.ycombinator.com/item?id=14906429
https://news.ycombinator.com/item?id=6799336
https://news.ycombinator.com/item?id=9857392
https://news.ycombinator.com/item?id=12147703
https://news.ycombinator.com/item?id=621557
https://news.ycombinator.com/item?id=7594625
https://news.ycombinator.com/item?id=9113001
https://news.ycombinator.com/item?id=16890739
If doomscrolling all that isn't enough to make you fear for mankind's future I'm pretty sure there's an Ulrich Drepper glibc bug report rejection related to this topic (or several) that you can google...
On topic: Raymond's post has some other great stuff. SWAR!
After reading the article, learning that
unsigned average(unsigned a, unsigned b)
{
return (a / 2) + (b / 2) + (a & b & 1);
}
was once patented actually made me a bit sad for our entire system of patents.Though actually it shouldn't be patented because it's an obvious implementation of a math formula (and math is not patentable).
There’s another algorithm that doesn’t depend on knowing which value is larger, the U.S. patent for which expired in 2016:
unsigned average(unsigned a, unsigned b)
{
return (a / 2) + (b / 2) + (a & b & 1);
}
There's no way that should be patentable.That's completely retarded; it's literally the first solution I think of when I hear this problem.
In the case of the phone you already know the patented solution, which obviously makes it impossible for you to judge its obviousness. That presumable wasn't the case with GP and the presented problem.
I suspect the POWER team has a good sense of humor. There's also the EIEIO instruction https://www.ibm.com/docs/en/aix/7.2?topic=set-eieio-enforce-...
EDIT: it's included in the collection of methods in the article as expected.
unsigned midpoint(unsigned a, unsigned b) {
asm("add\t%1,%0\n\t"
"rcr\t%0"
: "+r"(a)
: "r"(b));
return a;
}
Although `(a & b) + (a ^ b) / 2` is probably the more conservative choice. unsigned average(unsigned a, unsigned b)
{
return (a & b) + (a ^ b) / 2;
}
A quick sanity check of this23 & 21 = 21
23 ^ 21 = 2
21 + 2 / 2 = 22 (order of operations)
I wonder why this is there. It seems the best solution but no one else is mentioning it. It also has no context near it. Nor is it stated correctly. It's just there on it's own.
They say this because they know the USPTO strategy is to just hand out patents after putting in some bare minimum effort to review, and postpone the real review process to the unlikely day that someone chooses to challenge it in court and can pay private firms to do their job for them.
The winners in this arrangement are the government, the big law firms, and the large corporations that can afford them.
* x86 SSE/AVX/AVX2 have (V)PAVGB and (V)PAVGW, for 8-bit and 16-bit unsigned integers. These are "rounding" instruction though: adding 1 to the sum before the shift.
* ARM "Neon" has signed and unsigned "Halving Addition". 8,16 or 32 bit integers. Rounding or truncating.
* RISC-V's new Vector Extension has instructions for both signed and unsigned "Averaging Addition". Rounding mode and integer size are modal.
* The on-the-way-out MIPS MSA set has instruction for signed, unsigned, rounded and truncated average, all integer widths.
Some ISAs also have "halving subtraction", but the purpose is not as obvious.
I was surprised that the article didn't mention the need for this in binary search, and the famous problems [1] that occured due to naive attempts.
[1]: https://en.m.wikipedia.org/wiki/Binary_search_algorithm
Gcc and Clang both recognize the pattern of shifts and OR that reproduce a rotation, and substitute the actual instruction, no intrinsic needed.
I bet MSVC does too.
It's 2022. stdint.h is old enough to drink, and is probably married with a kid on the way. Just include it already?
unsigned average(unsigned a, unsigned b)
{
return (a / 2) + (b / 2) + (a & b & 1);
}
That makes the patent system seem broken to me.One of the reasons I love Python is that integers never overflow, so this becomes a trivial problem.
https://www.askpython.com/python/built-in-methods/python-rou...
"Also, if the number is of the form x.5, then, the values will be rounded up if the roundup value is an even number. Otherwise, it will be rounded down.
For example, 2.5 will be rounded to 2, since 2 is the nearest even number, and 3.5 will be rounded to 4."
Consider a term project of an undergraduate CS course, where the goal is spelled out, but the method is left for discovery.
Methods developed within any such project immediately invalidate patents. They're apparently obvious to folks learning to become "skilled in the art".
Yes, in practice, reaching a legal threshold would be hard (are you sure the students didn't read the patent or any description directly resulting from it?). But I'd definitely run a "patent invalidation course" - if I had confidence that the results would actually affect patents.
a & b is the bits they have in common. If they both have a bit x, you should keep it, because the average of x and x is x.
a ^ b is the bits that only one of them have. You should halve these bits, because the average of x and 0 is x/2.
What sets this post apart from the rest is that it actually provides useful information for everybody, after all those other posts of type "but that's obvious, I looked at it for 1 minute, saw half the solution, was too lazy for the rest and in hindsight it's all obvious. What a ridiculous patent system."
Not to say that the patent system is problematic...
How is the above SWAR? It looks like a regular set of instructions.
For example, when x and y are 32-bit integers holding 4 8-bit integers, you can do
z = (x ^ y) + (x & y) & 0x7f7f7f7f;
Now z holds four 8-bit integers which hold the sum (modulo 256) of the integers of x and y. The bit mask is to stop the carry from propagating.https://en.wikipedia.org/wiki/SWAR
Maybe that's the way it's meant?
Compilers might be smart enough to pick up the idiom used in the example and compile them to something done in parallel?
(a >> 1) + (b >> 1) + (a & 1) & (b & 1)> I would argue that the bug is not in the algorithm -- the bug is in languages that don't detect integer overflow by default.
Concretely, this is true enough. But abstractly, not so much: the algorithm is actually "buggy" if you abstract the problem a little. Namely, finding a midpoint of two operands does not require that the operands be numbers, or even addable for that matter. The introduction of that requirement is therefore a bug (at least in my eyes). The easiest way to see this is to replace integers with pointers. Then adding two pointers isn't even a well-defined operation in the general case, let alone dividing them by two. Whereas subtracting them and moving half the distance is actually quite well-defined, and we can see it behaves better too.
I would probably go so far as to claim that this is not an isolated example of where thinking about problems more abstractly helps us come up with solutions that have non-obvious benefits.
Read https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63303 to see many problems around pointer differencing.
I haven't seen the mentioned proof, but if said proof is not formal and mechanized and/or does not consider all possibilities, including overflow, then should we really consider it to be a proof of correctness? It might prove some desirable properties, but I don't think we should leave it at that. I certainly don't think we should claim that "It is not sufficient merely to prove a program correct; you have to test it too."
When it comes to software programs, I believe proofs can and should be exhaustive. That does not necessarily mean you need to exhaustively test every input, but it does mean you need to prove correctness for every input, including inputs that might result in overflow or undefined behavior. Otherwise, we should not consider it a proof of correctness.
The trust in the significance of a proof is limited by the trust in the statement to be proven. The statement to be proven was "If XYZ assumptions about the statements in the underlying language hold, then the result of that function is the result of a binary search". The proof is correct. Just that XYZ contained something incorrect.
There is no free lunch. Even if we prove all our software formally correct, we still need to "program" the specifications, with all that comes with it. They can contain bugs, need to be debugged, revisited, etc. The above is an excellent example of this.
I have no idea why it was granted. I sent it to headquarters as a design pattern description.
That’s pretty much the definition of “prior art.”
I guess they were able to reshape it into a form that the patent office wouldn’t laugh out the door. I never really looked at it. I hate reading patents; even my own.
I am not a lawyer, but I suspect prior art is a defense.
this would require a lawsuit to find out, but this wording in the description suggests the authors’ intent with the claims was to patent this method of computing an average on any processor:
> A general purpose computer or processor with suitable circuitry can execute the invention in a single instruction cycle (as is preferred) or multiple instruction cycles.
LZW compression[1] was patented, and Unisys actually had success extracting license fees with it[2]. In that sense, clearly an algorithm was "patented enough" that the patent was granted and used, even though the patent is literally just math. The MP3 patent is also just a mathematical algorithm which had even more legal success.
So, while technically algorithms are not patentable, in reality, the USPTO will grant patents for algorithms if you write enough legal gunk around them, and the difference doesn't really matter when a patent troll is sending threatening emails and your lawyers are demanding 20x the licensing fee to take the case.
[0]: https://en.wikipedia.org/wiki/Alice_Corp._v._CLS_Bank_Intern...
However, this is also why business method algorithms like the Amazon "One-Click" are not patentable in most countries. There is no trivial theoretical equivalence between one-click shopping and an electronic circuit.
In the US, unfortunately,they can be (although not all algorithms are patentable). For example algorithms used in many media formats are patented.
I've never come across this problem before, I read the headline and that solution came into my head immediately before I'd even clicked. I don't think I'm clever, surely half of HN feels the same way.
Software patents are comical.
Wow! Just wow...
unsigned average(unsigned a, unsigned b)
{
return (a >> 1) + (b >> 1) + (a & b & 1);
}But likely, Intel/AMD would have had to pay fees if they were implementing a similar solution in their hardware.
Everything is. That's kinda hindsight's thing.
Not so say that a few people in this thread probably saw this solution right away, but the "this was all obvious" crowd in this thread is a little too large for my taste. Be real, guys.
If you're not aware that numbers can overflow (and you probably don't tend to think about that for every single + you type, I guess), then the proper solution is less obvious.
The system is super broken.
Edit: but back in the 70s the USPTO didn't have the search databases they had in the 90s or 2000s. It's more the XOR patent being wielded in litigation that was extra controversial
Back before software patents were a thing, the "math" was not patentable. Eliminating software patents will be a return to the previous status quo.
The only thing this patent covers is the physical circuit implementation, not the math.
The claims in that patent are not limited to a physical circuit implementation.
The description includes a circuit diagram as one possible embodiment of the claimed invention, but the patent covers any implementation (and the description indicates it was intended to cover instructions running on general purpose processors).
Not true. See https://en.wikipedia.org/wiki/Reexamination It's even easier today than a decade ago, though the Wikipedia article doesn't explain that aspect very well. (I wouldn't be able to explain it, either. I think it has to do with reduced ability for a patent owner to drag out review, including dragging it into court.) Probably not nearly easy enough, though.
That's unfair, as the commenters here are providing a software solution. The patent is about a hardware solution which involves two parallel adder circuits. It implements in hardware exactly what the software solution does, but you can't express it in software because there is no operand that expresses "implement this addition twice please". You'd have to express it as:
avg = [x>>1 + y>>1, x>>1 + y>>1 + 1][x&y&1]
Which isn't 1-cycle either without the specialized adder.The patented expression is computable in an obvious way by a single ordinary adder and a single AND gate connected to the carry input of the adder, without any other devices (the shifts and the "& 1" are done by appropriate connections).
Any ordinary N-bit adder computes the sum of 3 input operands, 2 which are N-bit, and a third which is an 1-bit carry.
If a significant fraction of people come up with it on the spot, it's obvious. And they did.
Keep in mind this solution was to support MPEG-1 video encoding in the olden days when state of the art processors were 100 MHz and 800 nm process. Doing this in 1 cycle while reusing already existing function units seems like a clever solution to me -- not patent-worthy, not difficult, but clever.
If so then the technique in the post isn't actually patented.
If that C code would get threatened, then the 1 cycle thing is a red herring.
Also "Doing this in 1 cycle while reusing already existing function units"? In hardware you can use a normal adder without any special technique...
The fact that the right shift for a negative integer gives the floor function of the result just makes the correction easier than if you had used division with truncation towards zero.
The shifted out bit is always positive, regardless whether the shift had been applied to negative or positive numbers.
Except for following a tradition generated by a random initial choice, programming would have been in many cases easier if the convention for the division of signed numbers would have been to always generate positive remainders, instead of generating remainders with the same sign as the quotient.
Actually I was curious to see if GCC would be smart enough to automatically choose what's the best optimization depending on the underlying architecture, but it doesn't appear to be the case.
For x86_64 (with -O3 or -Os):
avg_64bits:
.LFB0:
.cfi_startproc
movl %edi, %edi
movl %esi, %esi
leaq (%rdi,%rsi), %rax
shrq %rax
ret
.cfi_endproc
avg_patented_do_not_steal:
.LFB1:
.cfi_startproc
movl %edi, %eax
movl %esi, %edx
andl %esi, %edi
shrl %eax
shrl %edx
andl $1, %edi
addl %edx, %eax
addl %edi, %eax
ret
Clearly just casting to 64bits seems to denser codeFor ARM32 (-O3 and -Os):
avg_64bits:
push {fp, lr}
movs r3, #0
adds fp, r1, r0
adc ip, r3, #0
mov r0, fp
mov r1, ip
movs r1, r1, lsr #1
mov r0, r0, rrx
pop {fp, pc}
avg_patented_do_not_steal:
and r3, r1, #1
ands r3, r3, r0
add r0, r3, r0, lsr #1
add r0, r0, r1, lsr #1
bx lr
A lot more register spilling in the 64bit version since it decides to do a true 64bit add using two registers and an adc.My code, for reference:
uint32_t avg_64bits(uint32_t a, uint32_t b) {
uint64_t la = a;
uint64_t lb = b;
return (la + lb) / 2;
}
uint32_t avg_patented_do_not_steal(uint32_t a, uint32_t b) {
return (a / 2) + (b / 2) + (a & b & 1);
}The original explanation was the actual SIMD approach, which is really cool. You can extend it right away to other problems.
Let's do the average of 85 and 95. The leftmost digit of both is equal, so we leave it: X5 The "xor"/2 is the sum of the non equal digits divided by 2 (respecting their powers): (8+9)10 = (17)10 = 85 Now, we add them: 5 + 85 = 90 (which is the average of 85 and 95).
Let's take now 87 and 89: The rightmost digits are equal: 8X We do the 'xor'/2 for the left most: (7+9)/2 = 8 8X + 8 = 88
Let me know if there are some examples for which it would fail (I only spent a minute to test if the explanation extends to other bases).
If you would use a conditional jump, that would have a high cost.
However the maximum or minimum should always be computed without conditional jumps and many CPUs have special instructions for max and min, which are not more expensive than additions or subtractions.
On CPUs without max & min instructions, computing max or min requires 2 instructions (compare + conditional copy).
2 instructions vs. 1 instruction increases the program size but not necessarily the execution time, if the instructions can be overlapped with others.
Due to the complex architecture of modern CPUs, it is impossible to determine the cost of a simple sequence of instructions in the general case.
For each particular CPU, a different but equivalent sequence of instructions can be the best and longer sequences of instructions may happen to be executed in less time, if they can be better overlapped on a certain CPU.
Many of your statements are misleading in context. Implying that you can't know or deduce things about the cost of a simple sequence of instructions is very odd. All software and people that work on optimization do it all the time.
And remember that the context is a suggestion that pointer types and pointer-subtraction is the answer to a question about integers, so getting into detail about instruction sequences isn't really going to help, as the basic idea is flawed.
[1] https://gcc.godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(file...
It is known that gcc fails to do "if conversion" in many cases when it should.
The correct code using the CMOV instruction and only a single computation of the expression could easily be written with inline assembly.
However, if inline assembly is used, there is a much simpler solution using the carry flag, which was presented at the end of the article that started this thread.
In reality, all the high level language solutions that have been listed in the article are very bad in comparison with the right solution using inline assembly.
The solution using inline assembly and the carry flag is actually applicable to any CPU, with the minor inconvenient that the mnemonics for the instructions could vary (which can be handled with defined macros), because all have carry flags and on all CPUs this solution will be the shortest and the fastest.
For the high-level solutions, which is the best of them can vary from CPU to CPU, which was my point.
Companies pool their patents together, promising not to sue each other over patents, in order to collectively defend themselves against trolls.
but instead every company chooses to defend themselves by deluging the patent office with more crap, and collective abuse of the patent system to patent stuff even the “inventor” thinks is obvious becomes accepted totally normal behavior.
I would not be so certain about that one.
Many countries are thinking of outright "meat taxes", health scores (similar to smoking warnings in the desired nudging effect) or extending CO² taxes onto agriculture: meat production causes about 14% of global CO² emissions [1], and outlawing/disincentivizing meat consumption is a very easy, very fast and incredibly effective way of cutting down on CO², methane and dung emissions. Not to mention the indirect emissions from land burning (especially in Brazil) or the societal cost of overconsumption of meat (e.g. obesity and heart issues).
Personally, I'm in the "omnivore" camp but recognize that the way how we deal with meat products has to be massively reformed. We need to cut waste and curb consumption, the sooner the better.
[1]: https://www.theguardian.com/environment/2021/sep/13/meat-gre...
I do not see where this would be of any use.
On the other hand, if you want a quotient that has some meaningful relationship with the ratio between the dividend and the divisor, there are other more sensible definitions of the integer division than the one used in modern programming languages.
You can have either a result that is a floating point number even for integer dividend and divisor, like in Algol, or you can define the division to yield the quotient rounded to even (i.e. with a remainder that does not exceed half of the divisor).
In both cases 10/3 and -10/-3 would yield the same result and I can imagine cases when that would be useful.
For the current definition of the integer division, I do not care whether 10/3 and -10/-3 yield the same result. It does not simplify any algorithm that I am aware of, while having a remainder of a known sign simplifies some problems by eliminating some tests for sign.
lower = 0x7f7f7f7f;
highest = ~lower;
z = ((x & lower) + (y & lower)) ^ ((x ^ y) & highest);
Note this only improves performance for larger container integers.Presumably it would take 4 logical instructions to do the vectored math, vs. 4 logical instructions to do the scalar additions.
I suppose you're looking at a minimum of two registers for the vectored approach, vs. 8 for the scalar approach. Having the result available in separate registers makes them immediately available for use, though.
There's also the overhead of getting the numbers in and out of memory. Loading and storing one word is obviously going to be way better than loading 4 bytes individually.
It seems to me like the vectored approach would be better for algorithms that require iterating through a large dataset in memory. The scalar approach would be better for algorithms that have a bunch of dependent calculations. Perhaps that's an obvious conclusion!
That's pretty neat though. For the large dataset scenario, perhaps you could get a significant speedup on relatively simple architectures such as cortex-m microcontrollers. I suspect that sufficiently modern high end CPUs/compilers wouldn't benefit so much from it, though? All the pipelining, superscaling and caching and whatnot could sufficiently mask the latencies of the memory accesses to the point of being irrelevant. Also, a sufficiently clever compiler could implement the loop with actual SIMD instructions and achieve significantly higher performance than the manual in-register optimization.
This would be a fun way to compute a basic 8-bit checksum on a binary blob in a microcontroller... Not that it would be practically useful because any non-trivially sized blob would be better served with at least a Fletcher checksum if not a full CRC, both of which seemingly lack the necessary associativity.
The operation itself is not necessarily faster (it could be faster due to pipelining, I think, and it would probably ne faster when storing 8 8-bit ints in a 64-bit int), but it can save the hassle and runtime of packing and unpacking into four variables.
Four voice digi player using SWAR in second version https://www.youtube.com/watch?v=1GrdvcghDXE
We don't need to admit defeat in optimizing such a simple case. A comparison is unnecessary and will pessimize, compared to a little bit-manipulation.
however, i believe that individuals should respect the commons, and don’t believe in the tragedy of the commons as an excuse for individual actions that contribute to the problem.
(insert starfish on the beach story)
Reminds me of a weekend hobby project I did back in 2014 or so. I had an itch to play with analog video signal generation from an atmega328p. Rather than use one of the existing libraries, though, I started from scratch with the goal of achieving the highest possible resolution. I used the SPI peripheral to clock out 8 pixels at a time at 8Mhz without any gaps, giving me something like 12 instructions to prepare the next byte. There wasn't enough RAM for a frame buffer at the resolution, so I instead used character tiles; that ate up the whole budget. I forget what the resolution was, but it was significantly higher than the existing library was capable of. There was a jitter, which I tracked down the the variability in interrupt latency due to the AVR having variable cycle length instructions. I was using a timer interrupt to schedule the start of and complete transmission of each scanline, so that the main program could focus purely on application logic. I wrote an inline assembly routine at the start of the interrupt handler to insert a variable number of noop instructions depending on the relative phase of the hardware timer, and the output became rock solid.
That of course reminds me of a project in 2007 where I needed to go the other direction, and decode an analog video signal on an 8 bit PIC microcontroller. The signal was from a camera on an actuator, meant to detect the relative position of the sun for the purpose of aiming a parabolic solar concentrator. I was able to filter out all visible light with some overdeveloped film negative so that the video signal was simply a white dot on a black background, and then wire it up through some voltage dividers to the PIC's two voltage comparators. One comparator detected sync pulses, and the other one detected black to white transitions. The firmware would simply track the timing of sync pulses to know the current scanline and position within the current scanline. Good times!
Some more info on the digi player on the ST. It used a timer interrupt to service the PCM sample but if you used just the interrupt there was significant noise because there was significant variability of the timing on the interrupt. To get the timing tighter the interrupt timing was changed to hit the routine on every video line just prior to the actual hsync and then hsync was polled to get very precise timing.
The PCM was just a linearization by combining three logarithmic volumes of the three PSG voices.
During the title sequence, a special version of the code was running where several 68000 registers were reserved globally for the digi player. So those did not need to be saved / restored in the interrupt routine!
The SWAR was involved when advancing the four wrapping 8 bit indices into the each of the four voice's 256 entry sample tables. This was part of a monumental effort to get the interrupt routine to be as quick as possible.
Your video decoding reminds me of when I worked at a video card company in the nineties we had a competitive advantage by using a commodity part in an unusual way. This video decoder hardware was commonly used to take composite video and decode it. We supported that but we also did a bunch of advanced features by using a seldom used mode where it could be used in a raw mode where it took the composite signal and stored the analog to digital conversion in memory. We had high speed assembly code that could decode the video better than the hardware and supported some cool additional features. Anyway... memories a bit hazy. Been a while but I remember it being very cool.