Introduction to AVX2 optimizations in x264

Introduction to AVX2 optimizations in x264 (scribd.com)

75 points by DarkShikari 13 years ago | 23 comments

jasin 13 years ago

A couple of comments/elaborations on the "core differences" mentioned in the article:

The first difference mentioned is that the first SSE2 implementations were often built on 64-bit ALUs internally, yielding roughly the same performance as doing two equivalent MMX ops manually, whereas this isn't the case with AVX2. However, it may be worth noting that it largely _is_ the case with the current AVX ("AVX1", i.e. pre-Haswell) implementations.

The second cited difference is that there's a 128-bit "boundary" in many of the operations. This is effectively what can dash the hopes of getting 2x gains over SSE2 just by naïvely migrating to AVX2. For instance, you can no longer shuffle to/from arbitrary components; you have to respect the 128-bit lane boundaries instead.
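To make the lane boundary concrete, here is a minimal scalar sketch in C. It models a 256-bit in-lane byte shuffle (the behaviour of AVX2's vpshufb) next to a hypothetical full-width shuffle of the kind a naive port might expect. The function names are made up for illustration, and this is not code from x264.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of a 256-bit in-lane byte shuffle (AVX2 vpshufb behaviour):
   each 16-byte lane is shuffled independently and only the low 4 bits of an
   index are used, so an index can never select a byte from the other lane. */
void shuffle_bytes_in_lane(uint8_t dst[32], const uint8_t src[32],
                           const uint8_t idx[32])
{
    uint8_t tmp[32];
    for (int lane = 0; lane < 2; lane++) {
        for (int i = 0; i < 16; i++) {
            uint8_t sel = idx[lane * 16 + i];
            /* A set high bit means "write zero", as in pshufb. */
            tmp[lane * 16 + i] = (sel & 0x80) ? 0
                                              : src[lane * 16 + (sel & 0x0F)];
        }
    }
    memcpy(dst, tmp, 32);
}

/* Hypothetical full-width shuffle (an assumption for comparison only):
   any of the 32 source bytes can be selected by a 5-bit index. */
void shuffle_bytes_full_width(uint8_t dst[32], const uint8_t src[32],
                              const uint8_t idx[32])
{
    uint8_t tmp[32];
    for (int i = 0; i < 32; i++)
        tmp[i] = src[idx[i] & 0x1F];
    memcpy(dst, tmp, 32);
}
```

With indices 31, 30, ..., 0 the full-width version reverses all 32 bytes, while the in-lane version only reverses each 16-byte half in place, so a true 32-byte reversal would need an extra cross-lane step.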

The third issue, i.e. the data layouts of internal formats and the assumptions of various algorithms, is probably the most significant factor in determining how large a benefit you are going to get. Typically the internal data layouts (i.e. is my pixel block size 2x2, 4x4, 16x8 or something else?) are married to the ISA. Thus, when migrating from one instruction set to another, these may need to be reconsidered if speed is paramount. Interestingly enough, this means that when the ISA changes, you most likely want to do some higher-level algorithmic optimizations as well.
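As a rough illustration of how layout and register width interact, the scalar C sketch below computes a sum of absolute differences over a 16-pixel-wide block in two ways. It assumes 8-bit pixels, so one row fills a 128-bit register exactly, while a 256-bit register has to be fed two rows at once. The function names and the two-row packing are assumptions for illustration, not x264's actual code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* 128-bit style: one 16-byte row per "register", one row per iteration. */
unsigned sad_16xh_one_row(const uint8_t *a, int stride_a,
                          const uint8_t *b, int stride_b, int h)
{
    unsigned sum = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < 16; x++)
            sum += (unsigned)abs((int)a[y * stride_a + x] -
                                 (int)b[y * stride_b + x]);
    return sum;
}

/* 256-bit style: two rows are packed into one 32-byte "register", so the
   loop advances two rows at a time. h is assumed to be even. */
unsigned sad_16xh_two_rows(const uint8_t *a, int stride_a,
                           const uint8_t *b, int stride_b, int h)
{
    unsigned sum = 0;
    for (int y = 0; y < h; y += 2) {
        uint8_t va[32], vb[32];
        /* Low half holds row y, high half holds row y + 1. */
        memcpy(va,      a + y * stride_a,       16);
        memcpy(va + 16, a + (y + 1) * stride_a, 16);
        memcpy(vb,      b + y * stride_b,       16);
        memcpy(vb + 16, b + (y + 1) * stride_b, 16);
        for (int i = 0; i < 32; i++)
            sum += (unsigned)abs((int)va[i] - (int)vb[i]);
    }
    return sum;
}
```

Both functions return the same result. The point is that the wider version changes the loop structure and the way rows are loaded, which is the kind of layout decision that tends to get revisited when the vector width changes.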

lmm 13 years ago

Anyone have a non-scribd copy?

DarkShikari 13 years ago

There's one attached to my newsletter that goes along with the latest changes; see http://mailman.videolan.org/pipermail/x264-devel/2013-April/....

mkenyon 13 years ago

Here you go.

https://dl.dropboxusercontent.com/u/574869/137419114-Introdu...

Osiris 13 years ago

Are there binary builds available with AVX2 support compiled in for testing? I'm curious whether the FMA (3/4) support available in AMD processors would increase performance. A quick Google search shows that there are some patches available for FMA support.

DarkShikari 13 years ago

I only pushed the code a few minutes ago, but binaries should probably be up at http://x264.nl/ relatively soonish (it's not my site though, so I wouldn't know exactly).

If you want to test without a physical Haswell, the Intel Software Development Emulator should work okay, albeit somewhat slowly. I'd post overall numbers for real Haswells, but Intel has apparently said we can't do that yet.

Regarding FMA, FMA3/4 are floating point only. Since x264 has just one floating point assembly function, only two FMA3/FMA4 instructions get used in all of x264 (not counting duplicates from different-architecture versions of the function). An FMA4 version has been included for a while; the new AVX2 version does include FMA3, but of course that won't run on AMD CPUs (yet).

XOP had some integer FMA instructions, but I generally didn't find them that useful (there's a few places I found they could be slotted in, though).

jamesaguilar 13 years ago

I've heard that there are C libraries for things like SSE2. I assume the same is true of AVX2. If this is so, why do you write so much of x264 in assembly? Do you find that there are significant gains versus C code that uses SIMD libraries? Have I been misled that C is nearly as fast as assembly 99% of the time?

Note: I'm not trying to question your engineering chops, just trying to correct my own misconceptions.

ajross 13 years ago

You want to test software on a device that doesn't exist in the market yet, but you don't want to build it yourself? The time you'll spend figuring out whatever emulator you're going to use is far longer than the time it takes to build x264 from a developer branch...

DarkShikari 13 years ago

In all fairness, Intel's emulator is incredibly easy to use; it literally works like this:

sde -- ./myprogram myargs

instead of

./myprogram myargs

There's also probably a decent number of people at this point who have prerelease CPUs; they tend to breed quite explosively in the month or two before the official release.

zobzu 13 years ago

Nice gains. Thanks for the writeup and explanations!