{n} times faster than C(owen.cafe) |
{n} times faster than C(owen.cafe) |
People say that forth isn't very optimisable for our register machines, but I reckon that you can get pretty good results with some clever stack analysis. It's actually possible to determine arity statically if you don't have multiple-arity words, which are very rare. That allows you to pass arguments by register.
Anyway, I'm not even close to an expert so don't take what I said as facts.
int run_switches(const char* input) {
int res = 0;
// Align to word boundary.
while ((uintptr_t) input % sizeof(size_t)) {
char c = *input++;
res += c == 's';
res -= c == 'p';
if (c == 0) return res;
}
// Process word-by-word.
const size_t ONES = ((size_t) -1) / 255; // 0x...01010101
const size_t HIGH_BITS = ONES << 7; // 0x...80808080
const size_t SMASK = ONES * (size_t) 's'; // 0x...73737373
const size_t PMASK = ONES * (size_t) 'p'; // 0x...70707070
size_t s_accum = 0;
size_t p_accum = 0;
int iters = 0;
while (1) {
// Load word and check for zero byte.
// (w - ONES) & ~w has the top bit set in each byte where that byte is zero.
size_t w;
memcpy(&w, input, sizeof(size_t));
if ((w - ONES) & ~w & HIGH_BITS) break;
input += sizeof(size_t);
// We reuse the same trick as before, but XORing with SMASK/PMASK first to get
// exactly the high bits set where a byte is 's' or 'p'.
size_t s_high_bits = ((w ^ SMASK) - ONES) & ~(w ^ SMASK) & HIGH_BITS;
size_t p_high_bits = ((w ^ PMASK) - ONES) & ~(w ^ PMASK) & HIGH_BITS;
// Shift down and accumulate.
s_accum += s_high_bits >> 7;
p_accum += p_high_bits >> 7;
if (++iters >= 255 / sizeof(size_t)) {
// To prevent overflow in our byte-wise accumulators we must flush
// them every so often. We use a trick by noting that 2^8 = 1 (mod 255)
// and thus a + 2^8 b + 2^16 c + ... = a + b + c (mod 255).
res += s_accum % 255;
res -= p_accum % 255;
iters = s_accum = p_accum = 0;
}
}
res += s_accum % 255;
res -= p_accum % 255;
// Process tail.
while (1) {
char c = *input++;
res += c == 's';
res -= c == 'p';
if (c == 0) break;
}
return res;
}
Fun fact: the above is still 1.6x slower (on my machine) than the naive two-pass algorithm that gets autovectorized by clang: int run_switches(const char* input) {
size_t len = strlen(input);
int res = 0;
for (size_t i = 0; i < len; ++i) {
char c = input[i];
res += c == 's';
res -= c == 'p';
}
return res;
}You can speedup the code by unrolling your inner loop a few times (try 4x or 8x) - it does mean that your overflow prevention limit is lowered (to a multiple of the unrolled grouping number) and run a few more times. But the speedup offsets the increased bookkeeping.
A version I played with showed increased speed by saving the in-progress accumulation in an array and then doing the final accumulation after the main loop is done. But that may be due to the CPU arch/compiler I'm using.
But if your code will be cross-platform/run on different OSes/CPU arch's, then a SWAR version may be more consistently performant - no need to guess if the compiler's optimization heuristics decided to go with the general purpose CPU registers or faster SIMD registers.
Downside is that the devs are exposed to the gnarly optimized code.
But aren't you reading off the end of the buffer in your memcpy(&w...)? Say with an empty input string whose start address is aligned to sizeof(size_t) bytes?
I just passed in the string length since the caller had that info, otherwise you'd scan the whole string again looking for the zero terminator.
If we go by the absolute strictest interpretation of the C standard my above implementation is UB.
But in practice, if p is word-aligned and is at least valid for 1 byte, then you will not pagefault for reading a whole word. In fact, this is how GCC/musl implement strlen itself.
> Say with an empty input string whose start address is aligned to sizeof(size_t) bytes?
Then the start address is valid (it must contain the null byte), and aligned to a word boundary, in which case I assume it is ok to also read a whole word there.
But in practice no one has page boundaries that cross word boundaries, and I align to a word boundary before doing the word-by-word loop.
Only if AND with 00001100 yields zero the other 3 tests are needed.
Ofc I have no idea what opcodes the language provides.
Rewriting the C function as follows got a 5.5x speedup:
int run_switches(char *input) {
int r = 0;
char c;
while (1) {
c = *input++;
if (c == 's') r++;
if (c == 'p') r--;
if (c == '\0') break;
}
return r;
}
Results: [16:50:14 user@boxer ~/looptest] $ gcc -O3 bench.c loop1.c -o lone
[16:50:37 user@boxer ~/looptest] $ gcc -O3 bench.c loop2.c -o ltwo
[16:50:47 user@boxer ~/looptest] $ time ./lone 1000 1
449000
./lone 1000 1 3.58s user 0.00s system 99% cpu 3.589 total
[16:50:57 user@boxer ~/looptest] $ time ./ltwo 1000 1
449000
./ltwo 1000 1 0.65s user 0.00s system 99% cpu 0.658 totalhttps://owen.cafe/posts/the-same-speed-as-c/
And as others have pointed out, you can tweak the input, then vectorize the algo, if you want to go that route.
I considered this a pedagogical exercise and I sincerely hope nobody will start dropping down to assembly without a very good reason to.
One thought: If the code is rewritten using bit arithmetic, then potentially the result could be even faster as there need not be a pointer look-up.
A bit arithmetic solution would have a mask created for the characters ‘p’ and ‘s’, and then the result could be AND-ed, and then with more bit arithmetic this all 1s value can be translated to a 1 if and only if all the bits are 1. Following which, there would be a no conditional check and simply be both an add and a subtract operation but where the value to be added will only be 1 if the mask for ‘p’ matches and 1 to be subtracted if the mask for ‘s’ matches respectively. I’m not fully sure if this would necessarily be faster than the pointer look-up solution, but it would be interested to try this version of the code and see how fast it performs.
Update: The bit arithmetic could also be done with an XOR on the mask, and following which the ‘popcnt’ x86 instruction could be used to figure out if all are 0 bits.
This statement is true if the jumps are unpredictable. If the jumps are predictable, then jumps will be faster.
Linus had a whole rant about this back in the day, arguing that cmov is not useful if branches are predictable: https://yarchive.net/comp/linux/cmov.html
My naive, untested intuition is that there's only one meaningful difference: the former has to dump the entire pipeline on a miss, and the latter only has to nop a single instruction on a miss.
But maybe I'm missing something. I'll re-read his rant.
EDIT:
Linus rants a lot, but makes one concrete claim:
You can always replace it by
j<negated condition> forward
mov ..., %reg
forward:
and assuming the branch is AT ALL predictable (and 95+% of all branches
are), *the branch-over will actually be a LOT better for a CPU.*
So, I decided to test that. [18:50:14 user@boxer ~/src/looptest] $ diff -u loop2.s loop4.s
--- loop2.s 2023-07-06 18:40:11.000000000 -0400
+++ loop4.s 2023-07-06 18:46:58.000000000 -0400
@@ -17,11 +17,15 @@
incq %rdi
xorl %edx, %edx
cmpb $115, %cl
- sete %dl
+ jne _run_switches_jmptgt1
+ mov $1, %dl
+_run_switches_jmptgt1:
addl %edx, %eax
xorl %edx, %edx
cmpb $112, %cl
- sete %dl
+ jne _run_switches_jmptgt2
+ mov $1, %dl
+_run_switches_jmptgt2:
subl %edx, %eax
testb %cl, %cl
jne LBB0_1
[18:50:29 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop2.s -o l2
[18:50:57 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop4.s -o l4
[18:51:02 user@boxer ~/src/looptest] $ time ./l2 1000 1
449000
./l2 1000 1 0.69s user 0.00s system 99% cpu 0.697 total
[18:51:09 user@boxer ~/src/looptest] $ time ./l4 1000 1
449000
./l4 1000 1 4.53s user 0.01s system 99% cpu 4.542 total
I feel pretty confident that Linus has made a poor prediction about poor prediction here. Jumps are indeed slower.To be fair to Linus, since Clang and I are using sete here, not cmov, I also tested cmov, and the difference was insignificant:
[19:53:12 user@boxer ~/src/looptest] $ time ./l2 1000 1
449000
./l2 1000 1 0.69s user 0.00s system 99% cpu 0.700 total
[19:53:15 user@boxer ~/src/looptest] $ time ./l5 1000 1
449000
./l5 1000 1 0.68s user 0.00s system 99% cpu 0.683 total
Jumps are slower. $ time ./lone 1000 1
851000
real 0m3.578s
user 0m3.574s
sys 0m0.004s
$ time ./ltwo 1000 1
851000
real 0m3.583s
user 0m3.583s
sys 0m0.000s
$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. [17:23:00 user@boxer ~/looptest] $ uname -a
Darwin boxer.local 21.6.0 Darwin Kernel Version 21.6.0: Thu Jun 8 23:57:12 PDT 2023; root:xnu-8020.240.18.701.6~1/RELEASE_X86_64 x86_64
[17:23:47 user@boxer ~/looptest] $ cc -v
Apple clang version 14.0.0 (clang-1400.0.29.202)
Target: x86_64-apple-darwin21.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
Clang generates the sete instruction for me with the above code: [17:23:49 user@boxer ~/looptest] $ gcc -c -O3 loop2.c
[17:25:00 user@boxer ~/looptest] $ objdump -d --symbolize-operands --x86-asm-syntax=intel --no-show-raw-insn loop2.o
loop2.o: file format mach-o 64-bit x86-64
Disassembly of section __TEXT,__text:
0000000000000000 <_run_switches>:
0: push rbp
1: mov rbp, rsp
4: xor eax, eax
6: nop word ptr cs:[rax + rax]
<L0>:
10: movzx ecx, byte ptr [rdi]
13: add rdi, 1
17: xor edx, edx
19: cmp cl, 115
1c: sete dl
1f: add eax, edx
21: xor edx, edx
23: cmp cl, 112
26: sete dl
29: sub eax, edx
2b: test cl, cl
2d: jne <L0>
2f: pop rbp
30: retIf the fastest way to implement a particular `switch` in assembly is with the equivalent of a set of `if`s, a reasonably smart compiler "should" be able to output the assembly to do that. And I thought that gcc and clang at least have actually been smart enough to do that for a while now.
But if the number of `if`s is high and the distribution sufficiently dense, where a jump table is better than a bunch of `if`s, then a `switch` should output that.
OTOH, a sufficiently smart compiler could theoretically turn a bunch of `if`s into a `switch`-like jump table - but it's much harder to reason that case through correctly than it is the other way, so I'm not sure any current compilers are sufficiently smart to do that.
E.g. note how both the switch- and if-based functions generate the same code using a lookup table here:
In general branching code is faster than branchless code and there's many many places that will demonstrate this with a quick Google. You know how many cycles a correctly predicted branch takes? 0.
On the other hand branchless code has to wait for each calculation to reach a certain stage in the pipeline since the thing to be output is dependent on the result. The CPU will have a whole lot of halts.
So why is this faster? Because the input is literally random(). The branch predictor will be wrong. This isn't normal though. The compiler is creating code that will be faster in most normal use cases.
It's an artificial benchmark that works against the output the compiler produces.
Pretty shocking for such a simple program.
The whole point of using C is not to think about the underlying architecture. As soon as you start taking "jumps are a lot slower than conditional arithmetic on x86" into account, you're not writing in C, you're writing in assembly with extra steps :-)
If a compiler can convert jumpy code to the predicated instrs, it should be able to trivially convert conditional arith to such too (even easier & more consistently than branches I'd say).
while (c = *input++) {
if (c == 's') r++;
if (c == 'p') r--;
} int run_switches_branchless(const char* s) {
int result = 0;
for (; *s; ++s) {
result += *s == 's';
result -= *s == 'p';
}
return result;
}
...the compiler will do all the branchless sete/cmov stuff as it sees fit. It will be the same speed as the optimized assembly in the post, +/- something insignificant. However it won't unroll and vectorize the loop. If you write it like this: int run_switches_vectorized(const char* s, size_t size) {
int result = 0;
for (; size--; ++s) {
result += *s == 's';
result -= *s == 'p';
}
return result;
}
It will know the size of the loop, and will unroll it and use AVX-512 instructions if they're available. This will be substantially faster than the first loop for large inputs, although I'm too lazy to benchmark just how much faster it is.Now, this requires knowing the size of your string in advance, and maybe you're the sort of C programmer who doesn't keep track of how big your strings are. I'm not your coworker, I don't review your code. Do what you want. But you really really probably shouldn't.
On my computer, the initial C version runs at 389 MB / second. I haven’t tested the assembly versions, but if they deliver the same 6.2x speedup, would result in 2.4 GB/second here.
Here’s C++ version which for long buffers exceeds 24 GB/second on my computer: https://gist.github.com/Const-me/3ade77faad47f0fbb0538965ae7... That’s 61x speedup compared to the original version, without any assembly, based on AVX2 intrinsics.
size_t run(char *str) {
uint8_t *p = (uint8_t*)str;
long end = 0;
size_t res = 0, vl;
while (1) {
vl = __riscv_vsetvlmax_e8m8();
vuint8m8_t v = __riscv_vle8ff_v_u8m8(p, &vl, vl);
end = __riscv_vfirst_m_b1(__riscv_vmseq_vx_u8m8_b1(v, '\0', vl), vl);
if (end >= 0)
break;
res += __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 's', vl), vl);
res -= __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 'p', vl), vl);
p += vl;
}
vl = __riscv_vsetvl_e8m8(end);
vuint8m8_t v = __riscv_vle8_v_u8m8(p, vl);
res += __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 's', vl), vl);
res -= __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 'p', vl), vl);
return res;
}
Here are the results from the above, the switch and the table c implementation, ran on my mangopi mq pro (C906, in order rv64gc with rvv 0.7.1, and a 128 bit vector length): switch: 0.19 Bytes/Cycle
tbl: 0.17 Bytes/Cycle
rvv: 1.57 Bytes/Cycle (dips down to 1.35 after ~30 KiB)
Edit: you can go up to 2/1.7 Bytes/Cycle, if you make sure the pointer is page aligned (and vl isn't larger than the page size), see commentsI tried building one, my self, but my miserable web skills didn't allow me to lazily load the instructions, which made it too slow for actual use.
Can I share your project on lemmy?
How does the normal load deal with faults?
I'll update the parent comment, it slowed down the speed from 2/1.7 to 1.57/1.36 Bytes/Cycle.
However, on other processors, this might not be the case. https://wordsandbuttons.online/using_logical_operators_for_l...
The good question is what do we need C for in general? Of course, we can hand-tailor our code to run best on one particular piece of hardware. And we don't need C for that, it would be the wrong tool. We need assembly (and a decent macro system for some sugar)
But the original goal of C was to make translating system-level code from one platform to another easier. And we're expected to lose efficiency on this operation. It's like instead of writing a poem in Hindi and translating it in Urdu, we write one in Esperanto and then translate to whatever language we want automatically. You don't get two brilliant poems, you only get two poor translations, but you get them fast. That's what C is for.
However, whether or not using cmov is effective compared to a regular test/jump is highly dependent on how predictable the branch is, with cmov typically performing better when the branch is very unpredictable. Since they got a 6x speedup with cmov, I assume that their test input (which isn't described in the post, and is also not in their GitHub repo) consists of random strings consisting almost entirely of s and p characters. There's nothing wrong with this, but it does make the post seem a little misleading to me, as their clever speedup is mostly about exploiting an unmentioned property of the data that is highly specific to their benchmark.
The benchmark only tests strings made of 's' and 'p', so I think it is fair.
The idea is as follow. We want to increase `res` by one when the next character is `s`. Naively, we might try something like this:
res += (c - 'r'); // is `res += 1` when c == 's'
This doesn't work, as `'p' - 'r' == -2`, and we'd need it to be -1.But `'p' - 'r'`, when viewer as an unsigned integer, underflows, setting the carry flag. Turns out x64 has an instruction (adc) that adds two registers _plus_ the carry flag.
Therefore we can replace two `cmp, cmov` with one `sub, adc`:
run_switches:
xor eax, eax # res = 0
loop:
movsx ecx, byte ptr [rdi]
test ecx, ecx
je ret
inc rdi
sub ecx, 'r'
adc eax, ecx # Magic happens here
jmp loop
ret:
ret
Benchmarks are as follows (`bench-x64-8` is the asm above): Summary
'01-six-times-faster-than-c/bench-x64-8 1000 1' ran
1.08 ± 0.00 times faster than '02-the-same-speed-as-c/bench-c-4-clang 1000 1'
1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'
Of course, one could improve things further using SWAR/SIMD... int run_switches(const char *buf) {
size_t len = strlen(buf);
int res = 0;
for (size_t i = 0; i < len; ++i) {
res += (buf[i] == 's') - (buf[i] == 'p');
}
return res;
}
strlen() should be implemented in a pretty fast way, and after the buffer size is known, the compiler can autovectorize the inner loop, which does happen in practice: https://gcc.godbolt.org/z/qYfadPYoqI did all of the above, plus profiling (sb-sprof combined with disassemble will show assembly level profiling).
Second, consider this replacement function:
ssize_t test(const char \*input) {
ssize_t res = 0;
size_t l = strlen(input);
size_t i;
for (i=0; i<l; ++i) {
res += (input[i] == 's') - (input[i] == 'p');
}
return res;
}
The timings are (using gcc -O3 -march=native): your function 640 cycles, mine 128 cycles. How can that be? I'm reading the memory twice! I have one call to strlen in there, and memory is slow. Shouldn't this be much slower?No. strlen is a hack that uses vector instructions even though it may technically read beyond the string length. It makes sure not to cross page boundaries so it will not cause adverse reactions, but valgrind needs a suppression exception to not complain about it.
If you know the length beforehand, the compiler can vectorize and unroll the loop, which it happens to do here. To great effect, if I may say so.
The art of writing fast code is usually to get out of the way of the compiler, which will do a perfectly fine job if you let it.
If you really wanted to, you could get rid of the strlen by hacking your logic into what strlen does. That would make the C code much less readable and not actually help that much. My test string is "abcdefghijklmnopqrstuvxyz", so it's all in the l1 cache.
cmp ecx, 's' # if (c == 's')
jne loop # continue
add eax, 1 # res++
jmp loop # continue
should be cmp ecx, 's' # if (c != 's')
jne loop # continue
add eax, 1 # res++
jmp loop # continueThe short answer to this question is 'yes', but there are some extenuating factors:
- Although we could do interesting things with unlimited computational resources, the current crop of c compilers is simply not very good, compared with what's possible today.
- Performance is always workload-dependent; the compiler has been somewhat shafted here because it doesn't know what sorts of inputs the function usually receives. The compiler output is better than the 'improved' code for some inputs. (It's possible you could get a better result from the existing compilers and c code just by using profile-guided optimisation.)
- The difference is prone to be more pronounced in simple loops than large ones. This is a contrived use-case. There is not a factor of 6 of performance hiding in optimised c code which could be recovered by doing the sorts of optimisations done by the op. Probably something more like 10-20%.
For instance, suppose you're writing a program to find the nth Fibonacci number for whatever reason. In Python, the naive version might look like:
def fib(n):
if n <= 1:
return n
return fib(n - 1) + fib(n - 2)
On my machine, that takes about 12 seconds to find the 40th number. Altering that slightly like: from functools import cache
@cache
def fib(n): ...
makes the whole program take about 30 milliseconds total. The 400th takes about 32ms and emits an answer that won't fit in a 256-bit int.Of course you can do the exact same kind of caching in C! I mean, the main Python interpreter's written in C, so by extension any algorithm you can express in Python you can also express in C. It'd probably be a lot faster, too!
But in practice, if I'm writing that in Python, I can use the obvious algorithm, spent 10 seconds slapping a caching decorator on it, verify that the end result is ridiculous fast and efficient, then move on to other problems.
Any reasonable C compiler will emit assembler that's vastly better than anything I could come up with. Conversely, I personally can write far better algorithms in Python than I could in C, because it's easier for me to express cleverness in that language. Those algorithmic improvements tend to have a far better speed payoff than I'd personally get from a more efficient implementation of a crappy method.
int run_switches(char *input) {
int res = 0;
while (true) {
char c = *input++;
if (c == '\0') return res;
// Here's the trick:
res += (c == 's') - (c == 'p');
}
}
This gives a 3.7x speed compared to loop-1.c. The lower line count is also nice. res += (c == 's') ? 1 : (c == 'p') ? -1 : 0
I haven't done C in decades so I don't trust myself to performance test this but I'm curious how it compares. Pretty disappointed that TFA didn't go back and try that in C.Maybe you get different results though?
Kept me reading into the follow-up article.
Also, I love the UI of this blog.
Also, how is such optimization carried out in a large scale software? Like, do you tweak the generated assembly code manually? (Sorry I'm a very very very beginner to low-level code)
It's not a useful architecture, but it teaches the thought process really well, and you end up discovering a lot of optimization naturally.
For this article, I'm measuring every step to see what the performance implications of the changes are, which, along with some educated guesses and some googling/reading other articles, was enough for me to figure out what was going on.
In part two (https://owen.cafe/posts/the-same-speed-as-c/) especially, I didn't know what was going on with the benchmarks for a long time. Eventually I got lucky and made a change, which led to a hypothesis, which lead to more tests, which led to a conclusion.
[0] godbolt.org
https://ipthomas.com/blog/2023/07/n-times-faster-than-c-wher...
int table[256] = {0};
void init() {
table['s'] = 1;
table['p'] = -1;
}
int run_switches(char *input, int size) {
int res = 0;
while (size-- >= 0) res += table[input[size]];
return res;
}https://owen.cafe/posts/the-same-speed-as-c/
But taking the length of the string as a parameter is not, because that changes the problem statement (making the solution vectorizable)
Also note that you'll try to read element -1 of the input. You probably want to change the `>=` to a `>`
Such bloated ISA like x86 might actually handle predicate support, but who will try such a radical change?
Also the original ARM 32 bit instruction sent had extensive predication.
It achieves 3.88GiB/s
I intentionally didn't go down the route of vectorizing. I wanted to keep the scope of the problem small, and show off the assembly tips and tricks in the post, but maybe there's potential for a future post, where I pad the input string and vectorize the algorithm :)
#include <stddef.h>
int run_switches(const char *s, size_t n) {
int res = 0;
for (; n--; ++s)
res += (*s == 's') - (*s == 'p');
return res;
}
I got ~31GB/s in GCC and ~33GB/s in Clang. This is without any padding, or SIMD intrinsics, or any such nonsense. This is just untying the compiler's hands and giving it permission to do its job properly.Don't want to pass the string length? That's fine, we can figure that out for ourselves. This code:
#include <stddef.h>
#include <string.h>
int run_switches(const char *s) {
int res = 0;
for (size_t n = strlen(s); n--; ++s)
res += (*s == 's') - (*s == 'p');
return res;
}
Is 27GB/s. With a little bit of blocking: #include <stddef.h>
int run_switches(const char *s, size_t n) {
int res = 0;
char tmp = 0;
for (size_t i = n & 63; i--; ++s)
tmp += (*s == 's') - (*s == 'p');
res += tmp;
for (n >>= 6; n--;) {
tmp = 0;
for (size_t i = 64; i--; ++s)
tmp += (*s == 's') - (*s == 'p');
res += tmp;
}
return res;
}
That's ~55GB/s.Anyway, the point is, you're pretty far from the point where you ought to give up on C and dive into assembly.
/* DON’T REFACTOR THIS FOR READABILITY IT WILL SLOW DOWN */
{.overflowChecks:off.}
proc run_switches*(input: cstring): int {.exportc.} =
result = 0
for c in input:
result.inc int('s' == c)
result.dec int('p' == c)
That gives a ~5x speedup on an Apple M1. Keeping overflow checks on only gets it up to ~2x the default C version. Always nice to know good ways to trigger SIMD opts. const __m256i zero = _mm256_setzero_si256();
const __m256i s = _mm256_set1_epi8( 's' );
const __m256i p = _mm256_set1_epi8( 'p' );
const size_t a = (size_t)input;
const size_t rem = a % 32;
const char* aligned = input - rem;
const __m256i v = _mm256_load_si256(( const __m256i*) input);
const __m256i z = _mm256_cmpeq_epi8( v, zero );
size_t m_plus = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, s));
size_t m_minus = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, p));
size_t m_zero = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, z));
size_t offset_zero = _mm_tzcnt_64(m_zero >> rem);
m_plus = _bzhi_u64(m_plus >> rem, offset_zero);
m_minus = _bzhi_u64(m_minus >> rem, offset_zero);
// Skip loop we already found the end of the string...
while (m_zero == 0) {
// ...
}
// ...
return m_plus + res - m_minus;However, this is only relevant for very small inputs. For longer inputs the vectorized portion of the function gonna dominate the performance.
Still, based on the documentation you have linked, I’m not sure it could possibly generate some code similar to my version. I could be wrong but I don’t see APIs which aggregate or accumulate the `simd_mask` vectors they output for results of vector comparisons.
I don’t recommend assembly. Intrinsics are typically good enough performance wise, and writing correct assembly is hard. For instance, Chromium has non-trivial dependencies with code written in assembly, and they caused tons of fun debugging issues like that https://bugs.chromium.org/p/chromium/issues/detail?id=121838... Using intrinsics would have solved that because modern compilers follow ABI conventions of the target platforms very carefully.
About that highway, I don’t have any experience but based on the documentation I don’t like it too much. They say that’s a thin wrapper over intrinsics, but I believe it still breaks things. Specifically, Intel and ARM don’t document highway, but they have decent documentation on their intrinsics. Similarly, stackoverflow has thousands of questions and answers with tags like SSE and AVX, most of them are related to intrinsics, but nothing related to highway.
https://blogs.gnome.org/rbultje/2017/07/14/writing-x86-simd-...
test code is here: https://github.com/414owen/blog-code/blob/master/02-the-same... it randomly selects between 's' or 'p'. The characters can't be anything other than 's', 'p', or the terminating null. Knowing that particular fact about our input gives us this ...clever... optimization:
int run_switches(const char* s) {
int result = 0;
while (*s)
result += (1 | *s++) - 'r';
return result;
}
which compiles to: run_switches:
movzx eax, BYTE PTR [rdi]
xor edx, edx
test al, al
je .L1
.L3:
or eax, 1
inc rdi
movsx eax, al
lea edx, [rdx-114+rax]
movzx eax, BYTE PTR [rdi]
test al, al
jne .L3
.L1:
mov eax, edx
ret
This is too clever by half, of course, but it perfectly illustrates your point about exploiting properties of the data. int run_switches(const char* s) {
int s_count = 0;
const char *begin = s;
while(*s) {
s_count += (1 & *s++);
}
int count = s-begin;
return count - s_count;
}
which compiles to: .L49:
and edx, 1
add rax, 1
add ecx, edx
movzx edx, BYTE PTR [rax]
test dl, dl
jne .L49
edit: other variant int run_switches2(const char* s) {
const char *begin = s;
int sum = 0;
while(*s) {
sum += *s++;
}
int count = s-begin;
int s_count = sum - ('s'*count)/('p'-'s');
int p_count = count - s_count;
return p_count - s_count;
}
which compiles to: run_switches2(char const*):
movsx eax, BYTE PTR [rdi]
test al, al
je .L56
mov rdx, rdi
xor ecx, ecx
.L55:
add rdx, 1
add ecx, eax
movsx eax, BYTE PTR [rdx]
test al, al
jne .L55
sub rdx, rdi
imul esi, edx, 115
movsx rax, esi
sar esi, 31
imul rax, rax, 1431655766
shr rax, 32
sub eax, esi
add ecx, eax
sub edx, ecx
mov eax, edx
sub eax, ecx
ret
.L56:
xor eax, eax
ret
None of these will beat the clever blocked SIMD someone showed elsethread.So, the maximum amount of times you can hit '\0' is once in the string, because then the function returns, but you can hit the other characters many times, which seems to be information a compiler has access to without PGO.
PGO does help, of course, and on my machine gives me 2.80s, which is better than the code at the end of the `Rearranging blocks` section :)
> I assume that their test input (which isn't described in the post, and is also not in their GitHub repo)
It's described under `Benchmarking setup`, and is in the repository here: https://github.com/414owen/blog-code/blob/master/01-six-time...
Side note: There's a part two to this post (linked at the bottom) where I make the C code as fast as I possibly can, and it beats all the assembly in this post.
I never said writing assembly is (necessarily) a good idea, I just find optimizing it, and deciphering compiler output, an interesting challenge, and a good learning opportunity.
Does the compiler know that this isn't true? No, it doesn't. The author of the article is making an assumption about the contents of the data that might seem reasonable but isn't necessarily true.
On my machine I'm getting 0.244s for `loop-5.x64.s` and 0.422s for your implementation above.
I'm not sure why exactly we're seeing this discrepancy, and for what it's worth your implementation looks faster to me. I guess this is why you need to always benchmark on the hardware you're going to be running the code on...
I would have expected yours to be faster given that it needs to execute fewer instructions per loop iteration. Though maybe the CPU can run `adc` on more ports compared to a load from memory?
Summary
'01-six-times-faster-than-c/bench-x64-8 1000 1' ran
1.00 ± 0.00 times faster than '02-the-same-speed-as-c/bench-x64-7 1000 1'
1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'
Summary
'01-six-times-faster-than-c/bench-x64-8 1000 1' ran
1.01 ± 0.00 times faster than '02-the-same-speed-as-c/bench-x64-5 1000 1'
1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'The initial sum is easily vectorized as well.
If I've not made any mistakes it should work. Only issue is possible overflow on the running sum.
Can't be bothered to benchmark it though:).
edit: missed the decrement when you see 's'. So the final result is p_count - s_count.
With vectorization, I think the way to go is to have two nested loops, an outer advances by 32 * 255 elements at a time, and an inner one that loads 32 bytes, compares each character to 's', and accumulates on 8 bit lanes.
Then in the outer loop you do an horizontal sum of the 8 bit accumulators.
Using the original run_switches function, app took 3.554s (average of 10 runs).
With the SWAR-version with the string length passed in, app took 0.117s (average of 10 runs).
That's an overall 27.6x speedup.
Consider this statement: "However, we know some things about this loop. We know that the only time we break out of it is when we hit the null terminator (’\0’). The code clang generates checks for the null terminator first, but this makes no sense."
This statement contains huge assumptions about the lengths of the input strings and the frequency of the letters 's' and 'p' in the input. And then has the chutzpah to call the compiler's failure to read his mind about this as "making no sense."
Good first effort by the author, but has not sufficiently thought through the problem.
https://news.ycombinator.com/item?id=36622584
Optimal assembly (forgoing SIMD, at least) for this loop on modern x86 is highly dependent on the entropy of the runtime data.
Of course there are still some cases where non-zero strings are extremely rare and as such optimizing for those makes sense.
The normal load should just segfault if any loaded byte is outside of readable memory, same as with a scalar load which is similarly partly outside.
Oh, yeah, that was a big oversight, unfortunately, this didn't undo the performance regression.
> The normal load should just segfault if any loaded byte is outside of readable memory, same as with a scalar load which is similarly partly outside.
I don't quite understand how that plays out.
The reference memcpy implementation uses `vle8.v` and the reference strlen implementation uses `vle8ff.v`.
I think I understand how it works in strlen, but why does memcpy work without the ff? Does it just skip the instruction, or repeat it? Because in either case, shouldn't `vle8.v` work with strlen as well? There must be another option, but I can't think of any.
Also, does this mean I can get the original performance back, if I make sure to page align my pointers and use `vle8.v`?
And yeah, aligning the pointer manually would work (though then it wouldn't be portable code, as the spec does allow for rvv implementations with VLEN of up to 65536 (8KB per register; 64KB with LMUL=8), which'll be larger than the regular 4KB pages).
The cost of bound checking is second order effects like making vectorization harder, slightly higher instruction (and possibly data) cache pressure, or requiring higher decode bandwidth. For the vast majority of programs these bottlenecks do not really matter.
There is an interesting talk titled ‘the death of optimizing compilers’ that argues that for most code these optimizations are almost completely meaningless, and in the hot loops where it actually matters, they are not good compared to humans (and sometimes 100x or more improvements are possible and left on the table). While I don’t completely agree with its points, it is a good talk/slides to read through.
I wholly admit that this implies nothing about all optimizers. But it's a pretty reasonable one to expect.
and if so are the compile times worth it
There's nothing "magical" about paying attention before condescending to someone.
I know I'm the less knowledgeable here, and even then there's nothing to gain in criticizing someone like this online.
Sorry again :)
You'd unroll it to a loop on both C and Python. Fibonacci doesn't need a cache. It needs K previous values, where K=1.
Like, obviously you’re not going to be writing `fib(n)` for real. I still claim that other languages — not just Python, either — make it easier to express cleverer algorithms than C does. You can’t write anything in Rust you can’t write in C, but it’ll probably be easier to say it more efficiently, and more correctly, in Rust. And much of the time, using a better design is going to blow compiler improvements out of the water.
(The professor was right if you limit the scope of the statement to “programs written in C or assembler”, of course. Unless you’re a freaking genius, a compiler’s going to write better object code.)
That's an overall 57.5x speedup.
Need sleep now.
That's quite dismissive. What exactly "is possible today" and why aren't these top compilers using them?
E-graphs ameliorate phase ordering issues and allow for exploring the space of non-monotonic rewrites; recent research makes them computationally viable.
Put simply: it's legacy. Gcc and llvm are millions of lines of code, and they assume a particular architecture. Changing that is not easy.
Another issue, which I did not mention (but which is pertinent) is that c is a poor language for compilation. (Fran allen famously said 'c has destroyed our ability to advance the state of the art'.) In some respects, the optimisations performed automatically by modern high-performance cpus are more sophisticated than those done by c compilers, howbeit with less reach; the only reason they are able to do this is that they have direct control of the execution and hence have a greater ability to abstract over the side effects which are rampant in most c code.
Your example touches on the problems of inflexible ABI, namely caller saved registers and the unknowability of side effects of external functions. Very weird that it can't reorder `r = x+y` despite it having no "observable" side effect until `return r`, since that return dominates the assignment, and there's no real relation between (the return, assignment) and (eff()).
> true which expands to the integer constant 1,
And equality yields a 1 or 0. C11 §6.5.9 (3):
> The == (equal to) and != (not equal to) operators are analogous to the relational operators except for their lower precedence. Each of the operators yields 1 if the specified relation is true and 0 if it is false.
int num_empty_strings = !!(strlen(s1)) + !!(strlen(s2)) + !!(strlen(s3))
which is equivalent to: int num_empty_strings = (strlen(s1) != 0) + (strlen(s2) != 0) + (strlen(s3) != 0)
Which you use is really a matter of coding style. int num_empty_strings = !!*s1 + !!*s2 + !!*s3;Here’s a thing you could do (but I don’t know why):
+= !(c-’s’) - !(c-’p’)
If a branch is almost always taken or almost never taken, a compiler will want to emit a jump. The frontend will be able to predict the jump with high probability, and a successfully-predicted jump is "free." The cost of a misprediction is paid for by the near-zero cost of the many successful predictions.
If a branch is hard to predict (and taking versus not taking it would load a different value into a register/memory), the compiler wants to emit a conditional move (cmov). A conditional move is slightly "more expensive" in the backend because the CPU has to wait for the condition to resolve before it can execute instructions dependent on the output. However, it is much cheaper than many mispredicted branches (mispredicts around half of the time).
FDO (feedback-directed optimization) or PGO (profile-guided optimization) means "run the code on some sample input and profile how often branches are taken/not taken." It gives the compiler more information to generate better code.
The problem with the blog post is that the compiler has no idea what the function's input data will look like. It (arbitrarily) chose to generate branches instead of cmovs. However, if the benchmark input is better suited for cmovs, then the benchmark will (wrongly) show that the compiler generates "slow" assembly. But that's not a fair test, because with PGO/FDO the compiler would generate equivalent assembly to the "fast" assembly (actually, probably faster). Finally, the human (OP) is using their knowledge of the benchmark data "unfairly" to write better assembly than the compiler.
The takeaway is: most of the time, one can't optimize code/assembly in a vacuum. You also need to know what the input data and access patterns look like. FDO/PGO gives the compiler more data to understand what the input data/access patterns look like.
I might be missing a reason that this information of opaque to the compiler though, in which case, this section of the article is indeed lacking, but I'm happy to learn :)
String length tells you the frequency with which nul terminators will be found. Without knowing frequency of occurrence of the nul terminator, 's', and 'p' then you cannot know which one occurs most often.
Consider two benchmark cases: (1) every string tested contains exactly one character (2) every string tested is 1MB long and is composed entirely of 's' and 'p'.
The author's first "optimization" assumes nul is rare. It would make benchmark (1) worse, and (2) better.
The article is a good example of "specification is hard, code is easy." He insufficiently specified the problem to be solved, and his test cases contained information not in the code and not in the text of the article.
I would suggest the latter is what you want most of the time.
There's also the option of running a quick check for the null terminator before the loop, and then optimizing the loop for the other options.
But in any case, I think the demonstration of the technique of rearranging branches is interesting, and I needed a program to apply it to.
We know that E[# of '\0' in a string] == 1.
But what is E[# of 's' in a string]? Is it greater or less than E[# of '\0' in a string], and how should the compiler know this?
You haven't given the compiler any reason to assume that 's' or 'p' will appear more often than '\0'.
Think about this: a machine with infinite execution units and memory bandwidth, potentially could execute all iterations of a loop at the same time, in parallel.
Unless each loop iteration depends somehow on the result of the previous iteration. Then only independent instructions of that iteration can execute in parallel and the loop is latency-chain bound (especially when it involves memory accesses). This is often the case. Because branch prediction breaks dependencies, bound checking is never part of a dependency chain, so it is often free or nearly so. For more optimized code, the assumption of infinite resources is of course not warranted and execution bandwidth and possibly even memory bandwidth need to be taken into consideration.
(People within Google say "FDO", basically everyone else says "PGO".)
Keep at it! Just as every program is a chance to improve programming, every article written is a chance to improve writing. It was well written.
My example has nothing whatsoever to do with abi, and everything to do with ir. f and g are exactly semantically equivalent, and this equivalence is trivial to show; that the compilers generate different code for each demonstrates redundancies in their ir.
> it is a side effect to assign to a variable
But that variable is not aliased here.
int run_switches(const char *s) {
int res = 0;
uint8_t tmp = 0;
size_t n = strlen(s);
for (size_t i = n & 127; i--; ++s)
tmp += (*s == 's');
res += tmp;
for (size_t j = n >> 7; j--;) {
tmp = 0;
for (size_t i = 128; i--; ++s)
tmp += (*s == 's');
res += tmp;
}
return 2 * res - n;
}Edit: Also, there's an off by one error. should be:
#include <stddef.h>
#include <stdint.h>
int run_switches(const char *s, const size_t n) {
int res = 0;
uint8_t tmp = 0;
for (int i = n & 127; i--; ++s)
tmp += *s == 's';
res += tmp;
for (int size = n >> 7; size--;) {
tmp = 0;
for (int i = 128; i--; ++s)
tmp += *s == 's';
res += tmp;
}
return 2 * res - n + 1;
}
~90GB/s on my machine, compared to 4.5GB/s for his best effort on his blog. So 20x as fast.Which tricks in there are worth playing around with more widely?
Is the uint8_t just "no point in using something bigger" or does it likely help the compiler? Does/can the signedness matter as well as the size?
Ditto looping downwards -- is it often likely to improve things? Can it generalize to pointer/iterator ranges, or is it often worth trying to phrase them in terms of array/index accesses instead?
I guess the compiler's unrolling heuristics generally aren't as good as that blocking "mod then div" alternative to Duff's device? Obviously taking `s` out of the loop condition is part of the magic.
Not checking the 'p' character by comparison is an easy optimization to understand.
Any places to read about this sort of thing, or any tricks or guidelines that come to mind? I write a fair bit of performance-sensitive code but it's all probably 20x slower than it could be because I have no intuition for what transformations compilers will do beyond "this prob gets inlined" etc.
Also, someone else figured out that we can just use an and instruction instead of cmp. That gives us this version:
#include <stddef.h>
#include <stdint.h>
int run_switches(const char *s, const size_t n) {
int res = 0;
uint8_t tmp = 0;
for (int i = n & 127; i--; ++s)
tmp += 1 & *s;
res += tmp;
for (int i = n >> 7; i--;) {
tmp = 0;
for (int j = 128; j--; ++s)
tmp += 1 & *s;
res += tmp;
}
return 2 * res - n;
}
This is 111GB/s, up from 4.5GB/s in the blog. I'm going to try really hard to put this problem down now and work on something more productive.I've seen plenty of cases where replacing hand-written assembly with C (or similar) lead to a substantial performance increase because the assembly code was written for some old CPU and not the best way of doing things on current CPUs.
cmp dl, 's' ; Compare the character with 's'
sete dl ; If the character is 's', set dl to 1. Otherwise, set dl to 0.
sub al, dl ; Subtract the result from res
cmp dl, 'p' ; Compare the character with 'p'
sete dl ; If the character is 'p', set dl to 1. Otherwise, set dl to 0.
add al, dl ; Add the result to resThank you. I hope people who post random assembly listings on HN written in some extinct ISA will read your posts.
The magic in this case is the compiler autovectorizer. Making the length of the loop a loop invariant allows the autovectorizer to kick in.
The reason "blocking" by accumulating on uint8_t helps further is that it allows the compiler to accumulate on 8 bit SIMD lanes, instead 32 bit SIMD lanes. The same operation on 8 bit SIMD lanes will, to a first approximation, do x4 the work per cycle.
In a good world you could use just uint_fast8_t and compiler would optimize this question for you. In real world I don't think compilers are smart enough, or there are too many other constraints limiting them :(
The reason it helps performance is because it allows the compiler to accumulate in byte sized SIMD variables instead of int sized SIMD variables. My system has AVX-512 so 64 byte wide SIMD registers. With the non-blocking version, the compiler will load 16 chars into ints in a 64 byte ZMM register, then check if it's an 's', and then increment if so. With the blocked version, with the uint8_t tmp variable, the compiler will load 64 chars into uint8_ts in a 64 byte ZMM register instead. But there's a problem; we're gonna overflow the variables. So the compiler will stop every 128 iterations, and then move the 64 byte uint8_t accumulation variable into 4 64 byte int accumlations registers and sum them all up. Then do the next 128 iterations.
I'm pretty sure a similar thing will happen with SSE or AVX2 but I didn't check.
In general for x86, unaligned writes are worth doing work to avoid, but reads are in most situations not really an issue.
The difference is that branches have dedicated hardware (branch predictors) that will speculatively execute subsequent instructions based on their best guess about which way the branch will go. Whereas conditional moves cannot execute any subsequent instructions until the correct value is available.
Put another way, CPUs have control flow speculation, but not conditional move speculation. I don't know if conditional move speculation would be a feasible thing to implement or not, but I'm pretty sure that no mainstream CPUs have such a feature.
That is incorrect. Super-scalar processors have no problem executing subsequent instructions before the cmov writebacks. However, the register cmov writes to can of course not be read before cmov has has passed the execution unit. But that's not different from other arithmetic instructions.
That is, if there is an add instruction on rax and rbx, no matter what, the add instruction will not execute until both rbx and rbx are available. If the result went into rax, and there is an another instruction that uses that as a source, no matter what that instruction will not execute until the add has completed.
CMOV is implemented as an ALU instruction that always writes into it's output, and either writes the value that is already in there (which is why it depends on the value of it's output) or the value provided, depending on flags.
A conditional jump can put one of two values into the instruction pointer, they will either increment the instruction pointer (jump not taken) or put the immediate value into the instruction pointer. (jump taken)
cmov/sete are utterly deterministic; they always increment the instruction pointer. There's nothing to speculate on, there's nothing to predict. They just go to the next instruction.
A similar idea to what you're proposing (and a possible solution to the above issue) does come up in another part of the processor however! Specifically, high performance processors launch loads very aggressively and often times return data as soon as the address is known. This is because memory is often the bottleneck for performance. This, unfortunately, has some challenges. Namely, memory ordering violations. Take for example the following snippet (ARMv8):
mov x1, #1
udiv x3, x2, x1
str x2, [x3]
ldr x4, [x2]
add x5, x4, x4
This is a silly and somewhat contrived code sequence, but note here that both str x2 and ldr x4 access the same address and thus the value in x4 should be x2. Note, however, that since str x2's address (x3) is produced by a slow division operation but ldr x4's address (x2) is available much more quickly, ldr x4 likely will launch before the CPU even knows that str x2 conflicts with it. Thus, the data returned by the load will be whatever random old stale data is in the cache rather than the correct value that is currently sitting in x2. This means that the subsequent add which consumes this data will produce an incorrect value, leading the whole program to derail. Once the CPU detects this issue, it has to throw away all the state and restart execution of the program at ldr x4 in order to fix its mistake and fix up the memory ordering violation. In essence, the CPU is speculating that str x2 and ldr x4 are unrelated because doing so is very important for performance. Unfortunately, however, memory ordering violations are actually somewhat common and constantly having to restart execution has negative performance implication.Now, this is actually a very similar problem as we'd see with conditional instruction speculation! So how do we solve this issue for memory ordering violations? Well, we predict which pairs of stores and loads are dependent and block the load from launching until the address of its supposed dependent store resolves. If this predictor is functioning well, we are able to both aggressively launch loads while also avoiding many costly fixups!
So, how would we translate this to conditional instruction speculation? Well, one idea is that we could predict both whether a given instruction is predictable and, if so, which way we should predict it. If a conditional instruction is predicted as unpredictable, its result will not be speculated (thereby avoiding frequent costly restarts) but if it is predicted to be predictable, we can try to predict which one to take.
Would this work? Maybe. Will anyone actually do this? Likely not. As others have suggested, conditional instructions are almost exclusively used for hard to predict conditions specifically because CPUs don't speculate them. Thus, in most existing code the predictor would just say "yep can't predict it" and we'd just have ended up wasting a bunch of area and power on a predictor that never gets used.
If you're really dedicated to this cause though, feel free to write a paper on it. Spitballing performance numbers is easy but often wrong in quite surprising ways, so maybe this might just work for some weird reason I've missed :)
[19:13:34 user@boxer ~/src/looptest] $ diff -u bench.c bench-alls.c
--- bench.c 2023-07-06 16:04:16.000000000 -0400
+++ bench-alls.c 2023-07-06 19:13:34.000000000 -0400
@@ -17,7 +17,7 @@
int num_rand_calls = number / CHAR_BIT + 1;
unsigned char *buffer = malloc(num_rand_calls * CHAR_BIT);
for (int i = 0; i < num_rand_calls; i++) {
- buffer[i] = rand();
+ buffer[i] = 's'; //rand();
}
return buffer;
}
[19:13:37 user@boxer ~/src/looptest] $ gcc -O3 bench-alls.c loop2.s -o l2
[19:13:42 user@boxer ~/src/looptest] $ gcc -O3 bench-alls.c loop4.s -o l4
[19:13:47 user@boxer ~/src/looptest] $ time ./l2 1000 1
250001000
./l2 1000 1 0.69s user 0.00s system 99% cpu 0.699 total
[19:13:55 user@boxer ~/src/looptest] $ time ./l4 1000 1
250001000
./l4 1000 1 1.28s user 0.00s system 99% cpu 1.290 total
Jumps are slower.Similarly you might be busting the pipeline by chaining together the jumps so close together.
Not saying your point is wrong, just saying your proof isn't super solid.
It's murkier than that. Speculation also deals with the order in which instructions can be executed. Take for example memory ordering (discussed in a mini essay elsewhere here): we typically speculate that all loads are unrelated to any other older in-flight stores with unresolved addresses so that we can optimistically launch them. This is not a control flow issue but it is something we both speculate and predict (memory dependence predictors!) despite the next PC being essentially deterministic.
.. and all about what we can wheedle out of all the background speculation that will help us get root on this box.
n>>7 is equal to n/(2^7), and is a faster way to divide with a power-of-two.
But the compiler/CPU can process bytes one at a time or much faster in groups. The code is trying to process as much as possible in groups of 128.
But since the caller can pass in a string which is not a mulitple of 128 chars, the first for-loop (& 127) will figure out how much of the string to process such that the remaining string length is a multiple of 128.
The second for-loop (>> 7) calculates divides by 128 (>> 7) to find out how many multiples of 128 there are to process. The inner for-loop processes 128 chars looking for 's' chars.
Now the for-loop within a for-loop doesn't look any faster than the plain single for-loop, but I'd assume that the heuristics of certain compilers can intuit that it can generate code to operate on multiple chars at the same time (SIMD instructions), since the result of one operation are independent of others.
On a compiler that cannot generate SIMD code, the code won't be much faster, if at all, than the naive straightforward manner.
As best I can tell this case is rare enough that one shouldn't generally be afraid of cmov, and probably compiler authors should consider using it more frequently.
What one shouldn't do is to load values, that are likely in memory or L3, unnecessarily in order to be able to use cmov. It is the case that runs the greatest risk of degrading performance, and it puts extra load on resources that are shared between cores.
cmp x, y
je z
and cmp x, y
sete z
the actual speculative part is the same: speculating as to the result of cmp x, yIf that's true, why would it not simply pipeline sete and the following instructions and simply execute (or not execute) sete according to its prediction, and then double check itself and reverse (or apply) the operation if the prediction was wrong?
I probably just have a bad mental model of what's going on under the (under the) hood, so whatever patience you have to deal with my stupid questions would be greatly appreciated.
> If that's true, why would it not simply pipeline sete and the following instructions and simply execute (or not execute) sete according to its prediction, and then double check itself and reverse (or apply) the operation if the prediction was wrong?
You cannot just reverse or apply one operation. The way speculation works, when the frontend encounters a conditional jump, the entire architectural state of the current thread is stored, and all future memory writes are held in the store buffer and not written out. Then a long time, potentially dozens of cycles later, after the je is executed in the backend either the old state is restored and the pending writes are discarded, or the saved state is discarded and the pending writes are released.
In contrast, in ALUs, the inputs for instructions are always available before the instructions are scheduled to execute. It would be possible to implement sete like je, but this would imply significant changes to how and where it is executed. ALU ops cannot trigger speculation because there is no machinery for storing state at that part of the pipeline.
And no-one is ever going to implement cmov or sete like a jump, because moving the op from being an ALU op to being one that is speculatively executed in the frontend like jmp would make both positive and negative changes, and that would be a significant pessimization of existing software because for decades cmovs have been used for unpredictable values, where sequencing and waiting for the real value is a better idea than speculating and failing half the time. Using a cmov serializes execution when any following operations use the value, but if you can have independent work after it, you can always successfully execute that. Speculating at an unpredictable CMOV would cause that to be thrown away uselessly half the time.
cmpb $115, %cl
sete %dl
addl %edx, %eax
vs cmpb $115, %cl
jne _run_switches_jmptgt1
mov $1, %dl
_run_switches_jmptgt1:
addl %edx, %eax
The argument about why `jne` might be faster is that that in the former case, the CPU always executes a dependency chain of length 3: `cmpb` -> `sete` -> `addl`. Each of these instructions have to be computed one after the other, as `sete` depends on the result of `cmpb`, and `addl` depends on the result of `sete`.With `jne`, the CPU might predict the branch is not taken, in which case, the dependency chain is `mov` -> `addl` (the `mov` of an immediate might be handled by register renaming?).
Or that it is taken, in which case in which case the dependency chain is just `addl`.
I guess you're arguing that the CPU should handle `sete` the same way? That is, instead of treating `addl` as dependent on the result, predict what `sete` does and start executing `addl` before `sete` finishes, rewinding if that went wrong?
Microcode can set the EIP register based on its prediction of what the result of cmpb $115, %cl will be.
Why can't it set the EDX register based on its prediction of what the result of cmpb $115, %cl will be?
If each instruction was executed in one single clock cycle, the cost of executing a branch would be one cycle and that's it.
However since there is a maximum speed at which operations can happen in hardware, the period of such a clock cycle that can execute a whole instruction would be very long and so the amount of "instructions per second" the CPU could execute would be low.
Now, if you can break up each instruction in smaller steps and execute the smaller steps in an overlapping manner, such that while you're executing the second step of the first instruction you're executing the first step of the next instruction and so on (like on an assembly line in a factory) you can have a much shorter clock period for each of these steps, and at the end of each clock tick an instruction would complete execution. The CPU will be still running one instruction per clock cycle, but since each clock period is shorter the overall instruction per second rate will be higher.
But for this to work the next instruction you want to execute must be known in advance so that at each clock cycle the CPU can start step 1 of a new instruction.
That's easy when the program is executing sequentially but when there are branches involved it's more tricky.
And that's tricky also if the branch is not conditional! If the instruction execution is broken into many small steps, it may take one or more steps before figuring out that you have a branch in the first place, let alone decoding where you need to branch to. In the meantime the CPU will have happily started to execute the first "steps" of the next instruction.
This is called a "branch hazard"
Early CPU implementations handled branch hazards by just throwing away the intermediate states if the few instructions that we're half way through the pipeline and call it a day (stalling the pipeline).
Early RISC CPUs attempted to be clever and use a trick called "delay slots": the instruction(s) already in the pipeline will continue to execute as if they were logically before the branch. This puta the onus to the programmer (or the compiler) to make sure that only instructions that are safe to be executed regardless of whether the branch is taken or not, are actually put after the branch instruction (otherwise you can just write nops).
But branch delay slots are not a panacea. As pipelines got deeper it became I practical to have a large number of delay slots and even a small number of delay slots were often just filled with nops anyway.
Improving on UNconditional branches was done by "looking ahead" in the instruction stream for branch instructions. When the instructions are all of the same size it's easy to quickly look a few instructions ahead and tell when you found a branch. You also need an instruction encoding scheme that is relatively fast to decode, at the very least it should be fast to decode branches (the more complicated the logic to decode a branch is, the farther ahead you'd have to look in the instruction stream, which in turn would limit the size of the sequence of instructions you can fill your pipeline with between subsequent branches).
To further complicate the matter, even if you found the branch instruction and you decoded it, it doesn't mean you yet know where it will branch to!
Indirect jumps (where the address is in a register) are similar to conditional jumps in that you don't know the address you're jumping to by merely looking ahead in the instruction stream and noticing the branch instruction. You need to either wait until you execute the branch and stall the pipeline in the meantime, or keep them in the pipeline and flush the pipeline once you know the target of the branch.
The next trick that CPU designers came up way before speculative execution is "branch target prediction".
The CPU keeps a little associative memory that maps addresses of a branch instruction to branch targets. When the lookahead logic spots a branch instruction it looks in this map and gets a guess of the branch target and uses that immediately ad the next instruction so that the pipeline is kept fed with something.
If by the time the branch instruction is executed the guess turned out to be wrong, the pipeline is flushed in the same way it would have to be flushed anyway if we had no clever branch lookahead in the first place. But if the guess was right we paid only one cycle to execute the branch.
This works for indirect unconditional branches and also for conditional branches! The prediction logic can be more subtle and complicated, many many things gave been attempted but this the general idea.
> quite literally
You could have conveyed the close to the same thing by saying, "things like this are covered in Patterson and Hennessy"
> elementary text
Jesus, do you even lift? The rest of the discussion is amazing.
But Intel historically didn't do it as programs tend to use cmov when the condition is unpredictable , so there was little reason to optimize it.
After Spectre, I believe intel has given an architectural guarantee that cmov is never speculated so it can be used as part of speculation attack prevention.