Introduction to AVX2 optimizations in x264 (scribd.com)
The first difference mentioned is that the first SSE2 implementations often used 64-bit ALUs internally, yielding roughly the same performance as doing two equivalent MMX ops manually, whereas this isn't the case with AVX2. It's worth noting, however, that it largely _is_ the case with the current AVX ("AVX1", i.e. pre-Haswell) implementations.
The second cited difference is that there's a 128-bit "boundary" in many of the operations. This is effectively what dashes any hope of getting 2x gains over SSE2 just by naïvely migrating to AVX2. For instance, you can no longer shuffle to/from arbitrary components; you have to respect the 128-bit lane boundaries instead.
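To make the lane boundary concrete, here is a minimal sketch in plain C that models a lane-local byte shuffle with scalar loops rather than real intrinsics. The function names are hypothetical and made up for illustration; the model assumes the usual pshufb-style rule (low 4 bits of each index select a byte within the same 128-bit lane, bit 7 zeroes the output). It shows that a "reverse all 32 bytes" index vector only reverses each lane in place, and that a separate cross-lane step is needed to finish the job.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a 128-bit byte shuffle (pshufb-style): each output byte is
 * selected by the low 4 bits of its index, or zeroed if index bit 7 is set. */
static void shuffle_bytes_128(uint8_t *dst, const uint8_t *src, const uint8_t *idx)
{
    for (int i = 0; i < 16; i++)
        dst[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0F];
}

/* Scalar model of the 256-bit form: two independent 128-bit shuffles.
 * An index can never reach into the other lane. */
static void shuffle_bytes_256(uint8_t *dst, const uint8_t *src, const uint8_t *idx)
{
    shuffle_bytes_128(dst, src, idx);
    shuffle_bytes_128(dst + 16, src + 16, idx + 16);
}

/* Scalar model of a cross-lane permute that swaps the two 128-bit halves. */
static void swap_lanes_256(uint8_t *dst, const uint8_t *src)
{
    for (int i = 0; i < 16; i++) {
        dst[i] = src[i + 16];
        dst[i + 16] = src[i];
    }
}

/* Naive port: ask the in-lane shuffle for byte (31 - i). Because only the low
 * 4 bits of the index are used, each lane is reversed in place instead. */
void naive_reverse_bytes_256(uint8_t *dst, const uint8_t *src)
{
    uint8_t idx[32];
    assert(dst != src); /* the scalar model does not support in-place use */
    for (int i = 0; i < 32; i++)
        idx[i] = (uint8_t)(31 - i);
    shuffle_bytes_256(dst, src, idx);
}

/* Lane-aware version: reverse within each lane, then swap the lanes. */
void reverse_bytes_256(uint8_t *dst, const uint8_t *src)
{
    uint8_t tmp[32];
    assert(dst != src);
    naive_reverse_bytes_256(tmp, src);
    swap_lanes_256(dst, tmp);
}
```

With src holding 0..31, the naive version puts 15 (not 31) in byte 0, while the lane-aware version yields the full reversal. The extra cross-lane step is the kind of cost that eats into the hoped-for 2x.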
The third issue, i.e. the data layouts of internal formats and the assumptions of various algorithms, is probably the most significant factor in determining how large a benefit you are going to get. Typically the internal data layouts (i.e. is my pixel block size 2x2, 4x4, 16x8 or something else?) are married to the ISA. Thus, when migrating from one instruction set to another, these may need to be reconsidered if speed is paramount. Interestingly enough, this means that when the ISA changes, you most likely want to do some higher-level algorithmic optimizations as well.
If you want to test without a physical Haswell, the Intel Software Development Emulator should work okay, albeit somewhat slowly. I'd post overall numbers for real Haswells, but Intel has apparently said we can't do that yet.
Regarding FMA, FMA3/4 are floating point only. Since x264 has just one floating point assembly function, only two FMA3/FMA4 instructions get used in all of x264 (not counting duplicates from different-architecture versions of the function). An FMA4 version has been included for a while; the new AVX2 version does include FMA3, but of course that won't run on AMD CPUs (yet).
XOP had some integer FMA instructions, but I generally didn't find them that useful (there's a few places I found they could be slotted in, though).
Note: I'm not trying to question your engineering chops, just trying to correct my own misconceptions.
sde -- ./myprogram myargs
instead of
./myprogram myargs
There's also probably a decent number of people at this point who have prerelease CPUs; they tend to breed quite explosively in the month or two before the official release.
I'm not an SIMD expert, but it seems like this implements similar primitives to those that are available to assembly (and not C). My question is basically whether the algorithms you're talking about could be implemented with these primitives. Although I guess no such library yet exists for AVX2.
In return, you are stuck with an extremely ugly syntax and a much less functional preprocessor, with the added bonus of a compiler that mangles your code.
Those are not C code but rather inline assembly or compiler intrinsics, neither of which has anything to do with C.
Do any production compilers schedule instructions to maximize superscalar performance?
The pain of not having a proper macro assembler in C intrinsics is orders of magnitude worse than having to do my own register allocation in yasm, so for now, yasm is the lesser of two evils.
In my (admittedly limited) experience [1], the compiler has actually done pretty decently at optimizing register allocation in intrinsic-heavy loops. I wrote out the assembly loop in [2] with manual allocation into all 16 XMMs and then noticed the compiler managed to optimize 1 of them out.
[1] https://github.com/simtk/IRMSD
[2] https://github.com/SimTk/IRMSD/blob/master/python/IRMSD/theo...
The problem has many parts:
1. Autovectorization in general is just extremely difficult and even trivial code segments often get compiled very badly. It feels like the compiler is trying to fit the code into a few autovectorization special cases -- for example, a 16x16->16 multiply gets compiled into a 16x16->32 multiply, and then it laboriously extracts the 16 bits, probably because nobody wrote code to explicitly handle the former variant. A good autovectorizer would have to have a vast array of these sorts of things to "know what to do" in a particular case, I'd imagine.
A lot of autovectorization resources seem to be tuned towards floating point math (which typically doesn't need the same sort of tricks), which probably exacerbates the problem in x264's case.
2. The compiler doesn't know enough. It can't guarantee alignment, it doesn't know about the possible ranges of the input values or the relationship among them, it doesn't know the things the programmer knows.
3. SIMD algorithms are often wildly different from the original C. Much of the process of writing assembly is figuring out how to restructure, reorder, and transform algorithms to be more suitable for SIMD. The compiler can't realistically do this; its job is to translate your C operations into machine operations, not rewrite your algorithm.
Part of this problem is that C is just not a great vector math language, but part of it is also that the optimal algorithm structure will depend on the capabilities of your SIMD instructions and their performance. For example, when the pmaddubsw instruction is available, it's faster to do x264's SATD using a horizontal transform first, but if not, it's faster to do it with a vertical transform first. The Atom CPU has pmaddubsw, but only has a 64-bit multiplier, making it too slow to utilize the horizontal version (so it gets the vertical version instead).
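The freedom to pick horizontal-first or vertical-first exists because the 2-D Hadamard transform behind SATD is separable, so either pass order gives the same result. The following is a minimal scalar sketch of a 4x4 SATD, not x264's actual code: the function name, the `horizontal_first` flag, and the omission of any final scaling are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* 4-point Hadamard butterfly on four values spaced `stride` ints apart. */
static void hadamard4(int *v, int stride)
{
    int a = v[0], b = v[stride], c = v[2 * stride], d = v[3 * stride];
    int s0 = a + b, s1 = c + d, d0 = a - b, d1 = c - d;
    v[0]          = s0 + s1;
    v[stride]     = d0 + d1;
    v[2 * stride] = s0 - s1;
    v[3 * stride] = d0 - d1;
}

/* Unnormalized 4x4 SATD of two pixel blocks (hypothetical helper).
 * horizontal_first != 0: transform rows, then columns.
 * horizontal_first == 0: transform columns, then rows. */
int satd4x4(const uint8_t *a, int a_stride,
            const uint8_t *b, int b_stride, int horizontal_first)
{
    int d[16];
    int sum = 0;

    assert(a_stride >= 4 && b_stride >= 4);

    /* Residual block. */
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            d[4 * y + x] = a[y * a_stride + x] - b[y * b_stride + x];

    /* Two 1-D passes; only their order differs between the variants. */
    for (int pass = 0; pass < 2; pass++) {
        int do_rows = (pass == 0) ? (horizontal_first != 0)
                                  : (horizontal_first == 0);
        for (int i = 0; i < 4; i++) {
            if (do_rows)
                hadamard4(d + 4 * i, 1); /* row i, elements 1 apart */
            else
                hadamard4(d + i, 4);     /* column i, elements 4 apart */
        }
    }

    for (int i = 0; i < 16; i++)
        sum += abs(d[i]);
    return sum;
}
```

Both orders return identical values in scalar code; which one is faster in SIMD depends on which instructions are available and how fast they are on the target CPU, and that is the decision a compiler cannot make from the C source alone.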
You can definitely finagle code into getting hit by the autovectorizer, especially with Intel's compiler, but it takes a lot of futzing to make it happy, and even when it is happy, it can be many times slower than proper assembly. Of course, it's not useless -- it can get you some relatively free performance improvements without writing actual SIMD code. But it's not a replacement.
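As a rough illustration of the "futzing" involved, here is a sketch of the 16x16->16 multiply written so that an autovectorizer has a fighting chance. It assumes GCC or Clang for `__builtin_assume_aligned`; the function name and the multiple-of-8 length requirement are hypothetical choices for this example, and whether a given compiler actually emits a single low-half multiply (pmullw on x86) for it is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* dst[i] = low 16 bits of a[i] * b[i].
 * Hints the programmer knows but the compiler cannot infer on its own:
 *   - the three arrays do not overlap (restrict),
 *   - all pointers are 16-byte aligned (caller's responsibility),
 *   - n is a multiple of 8, i.e. whole 128-bit vectors of 16-bit lanes. */
void mul16_lo(uint16_t *restrict dst, const uint16_t *restrict a,
              const uint16_t *restrict b, size_t n)
{
#if defined(__GNUC__)
    /* GCC/Clang builtin; passing a misaligned pointer is undefined behavior. */
    dst = __builtin_assume_aligned(dst, 16);
    a   = __builtin_assume_aligned(a, 16);
    b   = __builtin_assume_aligned(b, 16);
#endif
    assert(n % 8 == 0);

    for (size_t i = 0; i < n; i++) {
        /* Widen to unsigned first: uint16_t operands promote to signed int,
         * and 65535 * 65535 would overflow it. */
        dst[i] = (uint16_t)((uint32_t)a[i] * b[i]);
    }
}
```

Even with every hint in place, the result is only a request. Inspecting the generated assembly is the only way to tell whether the compiler took the one-instruction route or the widen-then-extract one.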