Using the most unhinged AVX-512 instruction to make fastest phrase search algo(gab-menezes.github.io) |
Using the most unhinged AVX-512 instruction to make fastest phrase search algo(gab-menezes.github.io) |
AVX-512 was never part of the specification for those CPUs. It was never advertised as a feature or selling point. You had to disable the E cores to enable AVX-512, assuming your motherboard even supported it.
Alder Lake AVX-512 has reached mythical status, but I think the number of people angry about it is far higher than the number of people who ever could have taken advantage of it and benefitted from it. For general purpose workloads, having the E cores enabled (and therefore AVX-512 disabled) was faster. You had to have an extremely specific workload that didn't scale well with additional cores and also had hot loops that benefitted from AVX-512, which was not very common.
So you're right: They never wanted people to use it. It wasn't advertised and wasn't usable without sacrificing all of the E cores and doing a lot of manual configuration work. I suspect they didn't want people using it because they never validated it. AVX-512 mode increased the voltages, which would impact things like failure rate and warranty returns. They probably meant to turn it off but forgot in the first versions.
[1]:https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
Meaning... this entire effort was for something that's faster on only a single kind of CPU (Zen 5).
This article is honestly one of the best I've read in a long time. It's esoteric and the result is 99.5% pointless objectively, but in reality it's incredibly useful and a wonderful guide to low-level x86 optimization end to end. The sections on cache alignment and uiCA + analysis notes are a perfect illustration of "how it's done."
The blog post mentions it as one of the reasons for the inefficiency of the conventional algorithm.
A glance at the algorithm shows that the recursion in question is a tail call. This means that any overhead can be readily eliminated using a technique known for nearly fifty years already.
Steele, Guy Lewis (1977). "Debunking the "expensive procedure call" myth or, procedure call implementations considered harmful or, LAMBDA: The Ultimate GOTO". Proceedings of the 1977 annual conference on - ACM '77.
Regardless of how valid the excuse is, for such an obvious and old optimization, it’s very poorly supported.
But since R_1 C_2 C_3 is in the index as well, instead of searching for C_0 R_1, C_2 C_3 with a distance of 2, you can instead search for C_0 R_1, R_1 C_2 C_3 with a distance of 1 (overlapping), which hopefully means that the lists to intersect are smaller.
vs
"Galois Field 2^8 affine transform on quad binary words" (GF2P8AFFINEQB)
The compression factor isn't quite the same on character count, but it's still abbreviated. :)
https://www.youtube.com/watch?v=r_pPF5npnio&t=3300 (55:00)
"This is an instruction that doesn't exist in any computer right now, so why should I put it in a machine, if it's supposed to be realistic? Well, it's because it's ahead of time."
The only case where I've had use of GF(2^8) inverses is in FEC algorithms (Forney's algorithm) and then you need some kind of weird polynomial. But all of those needs are rarely in the hot-path, and the FEC algo's are way outdated
mary:
docs:
- 0:
posns: [0, 8]
- 1:
posns: [2]
- 3:
posns: [1]> The inverted index will look something like this:
He isn't wrong. It is indeed "something like".
> Imagine the scenario where "mary had" occurs in the following positions: [1, 6] and "a" appears in the position [2], so "mary had a" occurs in the positions [2]
Okay so basically this means "mary had a" ends at token position 2 (assuming "mary" is token position 0), and you're trying to create an efficient algorithm to do this backwards linking process.
It's not entirely clear from the article what's pre-computed ahead of time (when documents are added / indexed by the system) and what's done on-the-fly (in response to a user's specific search query).
Based on skimming the article it appears that you're doing this backwards linking process on the fly for a specific phrase the user enters (but I could be wrong about that).
> allows us to decompose this value
What value is being decomposed? It is not clear what "this value" refers to.
> one representing the group and the other the value
Ctrl+F is telling me that's the first occurrence of the word "group" in this post. What is the "group" being represented?
> pos = group * 16 + value
Okay so you're bit-packing two fields into pos. One of the things being packed is called "group," and it seems to be 16 bits. The other thing being packed is "value," and it seems to be 4 bits. So in total "pos" has 20 bits.
> the maximum document length is 1048576 tokens
It seems that "pos" simultaneously corresponds to a token position and our two-field bit-packed thing above. I can't figure out how these two things are in one-to-one correspondence.
I stopped reading there, given my confusion so far it seems unlikely I'll really be able to really understand much of what follows.
PS: I skimmed the article on roaring bitmaps. Seems they're basically bitmaps where each 1k (8192-bit) chunk has a storage format (dense, sparse, or RLE), and algorithms for quickly doing intersection, union, etc. with various optimized cases chosen roughly by the number of 1's. (Intersecting two dense bitmaps with say ~50% randomly distributed 1's you can't get faster than ANDing together the bit vectors. But if you're, say, intersecting a small sparse bitmap with ~5 1's against a big dense bitmap with ~4k 1's you can iterate over the 1's in the sparse bitmap checking whether each 1 is in the big bitmap.)
So in my mind I'm basically just blackboxing roaring bitmaps as "java.util.BitSet or vector<bool> with some extra optimizations that kick in if your data has sections where most of the bits are the same".
However, for the naive case where the recursion is a full-blown function call, without some kind of optimization, other security mitigations than ASLR will significantly affect the efficiency of recursion by adding function call overhead (and possible cache side effects) - for example, the stack cookie will still be verified and control-flow guard checks and the shadow/return stack will still be in play, if present.
It might be an esoteric feature today. But if it'll become an ubiquitous feature in a few years, its nice to learn about using it.
https://arxiv.org/abs/2112.06342
https://www.reddit.com/r/asm/comments/110pld0/fasterthannati...
I really think it's only AES since thats the only place I've seen that polynomial used. But of course maybe there's an obscure tape backup FEC algo used somewhere in datacenters?
The Intel-AMD x86-64 architecture is full of horrible things, starting with the System Management Mode added in 1990, which have been added by Intel only because every time Microsoft has refused to update Windows, expecting that the hardware vendors must do the work instead of Microsoft for enabling Windows to continue to work on newer hardware, even when that causes various disadvantages for the customers.
Moreover, even if Intel had not said that Alder Lake will support AVX-512, they also had not said that the P-cores of Alder Lake will not support AVX-512.
Therefore everybody had expected that Intel will continue to provide backward compatibility, as always before that, so the P-cores of Alder Lake will continue to support any instruction subset that had been supported by Rocket Lake and Tiger Lake and Ice Lake and Cannon Lake.
The failure to be compatible with their previous products has been a surprise for everybody.
Thus, SMM, because there was no other way to hook power management on a 386 laptop running " normal" DOS
In theory, there was: you could have a separate microcontroller, accessed through some of the I/O ports, doing the power management; it's mostly how it's done nowadays, with the EC (Embedded Controller) on laptops (and nowadays there's also the PSP or ME, which is a separate processor core doing startup and power management for the main CPU cores). But back then, it would also be more expensive (a whole other chip) than simply adding an extra mode to the single CPU core (multiple cores back then usually required multiple CPU chips).
Wait. I thought the article says only Tiger Lake supports the vp2intersect instruction. Is that not true then?
So it was expected that any future Intel CPUs will remain compatible. Removing an important instruction subset has never happened before in Intel's history.
Only AMD has removed some instructions when passing from a 32-bit ISA to a 64-bit ISA, most of which were obsolete (except that removing interrupt on overflow was bad and it does not simplify greatly a CPU core, since there are many other sources of precise exceptions that must still be supported; the only important effect of removing INTO is that many instructions can be retired earlier than otherwise, which reduces the risk of filling up the retirement queue).
As for the downsides of disabling the E-cores, there were Alder Lake SKUs that were P-core only and had no E-cores.
Not all workloads are widely parallelizable and AVX-512 has features that are also useful for highly serialized workloads such as decompression, even at narrower than 512-bit width. Part of the reason that AVX-512 has limited usage is that Intel has set back widespread adoption of AVX-512 by half a decade by dropping it again from their consumer SKUs, with AVX10/256 only to return starting in ~2026.
Also, it's been cheaper to implement various features through small piece of code instead of adding a separate MCU to handle them, including prosaic things like handling NVRAM storage for variables (instead of interacting with external MCU or having separate NVRAM, you end up with SMM code being "trusted" to update the homogenous flash chip that contains both NVRAM and boot code)