A Tale of Two Optimisations

A Tale of Two Optimisations(articles.foletta.org)

22 points by gjf 4 years ago | 17 comments

powersnail 4 years ago |

> What bothered me about the original implementation was the lookup table. Even though I knew they’d be cached, I still thought the memory accesses might have a detrimental affect on performance.

The encoding lookup table is an array of four chars. I'd be surprised if accessing such an array has a detrimental effect on any program.

I also wonder why the graphs say "Encode / Decode", as if you are combining the performance of the encoding function and the decoding. Have you considered separating them?

It would also help reproducibility if you include your compiler, versions, and flags. You mentioned that you've turned off all optimizations, but I wonder why.

"O0" would certainly produces a lot of jumps for the switch. But O2 is certainly going to eliminate those jumps. In fact, with O2, gcc11 seems to produce identical code for switch and lookup table.

https://godbolt.org/z/jdx645MsY

glandium 4 years ago | |

clang does something smart with the switch, equivalent to this C code:

  unsigned char lookup2_encode(const unsigned char dibit) {
    return ' \r\n\t' >> (dibit * 8);
  }

I'd expect that to be faster just from having the lookup table in the code itself, and not having to use a different cache line from the data segment (as well as avoiding pointer indirection).

nathell 4 years ago |

Interesting approach with the polynomials! I wonder whether the performance of that one can be improved – while the coefficients aren’t integer, they become integer when multiplied by 3. So I wonder whether evaluating a whole-coefficient polynomial (using Horner’s scheme) and dividing by 3 at the end (which the compiler might optimize to a multiplication) might provide a speed boost:

    unsigned char poly_encode_2(const unsigned char dibit) {
        return (27 + x * (14 + x * (-18 + x * 7))) / 3;
    }

Haven’t measured.

germanjoey 4 years ago |

"whitespacer" reminds me of Damian Conway's classic "Acme::Bleach" perl module...

https://metacpan.org/pod/Acme::Bleach

nanis 4 years ago |

When I first saw this, I thought it ought to be possible to express the encoding/decoding in simple computations, but I only just got a chance to try things out[1].

There are many things I'd fiddle with in the original source. I noticed that since I first saw this post, someone even contributed an AVX2 implementation[2]. I stayed strictly in the standard C land:

This function converts a given character to the corresponding half-nibble (or dibit, don't really know what to call a pair of bits):

    uint8_t
    ws_to_half_nibble(uint8_t ws)
    {
      return ((ws >> 1) & 3) | (ws >> 4) | (ws >> 5);
    }

Encoding a half-nibble to one of the four whitespace characters can be done using:

    uint8_t
    half_nibble_to_ws(uint8_t b)
    {
      const uint8_t b0 = b & 1;
      const uint8_t b1 = b & 2;
      return '\t' + b0 + 2 * b1 + 18 * (b0 & (b1 >> 1));
    }

I haven't actually had a chance to investigate any performance gains, but this answers the question of whether we can replace the lookup tables with simple functions with no branching.

[1]: https://www.nu42.com/2021/10/another-optimization-tale.html

[2]: https://news.ycombinator.com/item?id=28859061

kardos 4 years ago |

Horner's method is known to reduce the number of operations. Maybe the coefficients will turn out to be more favourable with this method?

[1] https://en.wikipedia.org/wiki/Horner%27s_method

lifthrasiir 4 years ago | |

Horner's method serializes the entire computation so it is penalized in modern hardwares. When the accuracy is less important it is frequent to split the computation into a few chunks where each chunk uses Horner's method but they can be computed in parallel.

ridiculous_fish 4 years ago |

I would expect the switch implementation to not emit any branches. Example:

https://godbolt.org/z/Gox7nYfxP

So it is surprising that OP saw branch mispredictions. What compiler and flags were used?

spaetzleesser 4 years ago |

This brings back fond memories of the time when I did video processing code. So much more fun doing this kind of work on self contained code versus server side stuff dealing with dozens of misbehaving external APIs.

Const-me 4 years ago |

I wonder how does it compare to BMI2 + AVX2 version:

https://godbolt.org/z/xcT3exenr

The code doesn’t look great because no inlining.

In reality, if you have many bytes in the buffer and calling these functions in a loop, compilers should inline the function, loading all these magic vectors outside the loop. This way it won’t be any memory loads, except to fetch the source data.

lifthrasiir 4 years ago |

While I applaud the efforts to document the optimization processes, they are fundamentally slow. The easiest and portable optimization would be converting 8 full bits at a time:

    for (int i = 0; i < 256; ++i) {
        const char c[4] = {
            encode_lookup_tbl[i >> 0 & 3],
            encode_lookup_tbl[i >> 2 & 3],
            encode_lookup_tbl[i >> 4 & 3],
            encode_lookup_tbl[i >> 6 & 3],
        };
        encode_lookup_tbl2[i] = *(uint32_t*)c;
    }

    for (x = 0; x < bytes; x++) {
        ((uint32_t*)ws_out)[x] = encode_lookup_tbl2[bytes_in[x]];
    }

Note that the resulting encode_lookup_tbl2 depends on the endianness, but as long as we make the table on demand it can remain portable. Also `ws_out` might have to be aligned for the potability (unaligned memory access can be slow at best and trap at worst) but it is easy to fulfill when those functions are only used internally. In my measurement this alone made it ~3.5x faster.

The next obvious approach is to use SIMD. Pretty much every modern CPU gives you 16-byte shuffle, which can be used to convert 4 bits at a time. (EDIT: This turned out to be not very correct, see below.) I tried to write a quick example but it takes quite a bit of time to write and debug [1] ;-) (I may update this post if I get a working version.)

Lastly, it might be possible to use mmap to eliminate the intermediate buffer at all. I think the actual performance gain with and without mmap can vary much and you need to profile it, but it's worth considering.

[1] I realized that there is no bytewise shift operator in x86 SIMD (_mm_srli_epi8). Oops!

---

Added later: So the following is my crude AVX2 implementation. It is 20% faster than the bytewise version and 10% faster than the original SSSE3 version. It is not that satisfactory though because it had to do a lot of unpacking to get bytes into correct positions (I'm not a great SIMD coder after all).

    const __m256i MASK = _mm256_set1_epi8(0xf);

    _Alignas(32) char c[32];
    for (int i = 0; i < 32; ++i) c[i] = encode_lookup_tbl[i & 3];
    const __m256i MAPLO = _mm256_load_si256((void*)c);
    for (int i = 0; i < 32; ++i) c[i] = encode_lookup_tbl[i >> 2 & 3];
    const __m256i MAPHI = _mm256_load_si256((void*)c);

    for (x = 0; x < bytes; x += 32) {
        // abcdefghijklmnop ABCDEFGHIJKLMNOP
        __m256i block = _mm256_loadu_si256((void*)(bytes_in + x));
        // ah bh ch dh ... mh nh oh ph
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(block, 4), MASK);
        // al bl cl dl ... ml nl ol pl
        __m256i lo = _mm256_and_si256(block, MASK);
        // MAPLO[al] ... MAPLO[pl]
        // MAPHI[al] ... MAPLO[pl]
        // MAPLO[ah] ... MAPHI[ph]
        // MAPHI[ah] ... MAPHI[ph]
        __m256i mapped1 = _mm256_shuffle_epi8(MAPLO, lo);
        __m256i mapped2 = _mm256_shuffle_epi8(MAPHI, lo);
        __m256i mapped3 = _mm256_shuffle_epi8(MAPLO, hi);
        __m256i mapped4 = _mm256_shuffle_epi8(MAPHI, hi);
        // MAPLO[al] MAPLO[ah] ... MAPLO[hl] MAPLO[hh]
        // MAPLO[il] MAPLO[ih] ... MAPLO[pl] MAPLO[ph]
        // MAPHI[al] MAPHI[ah] ... MAPHI[hl] MAPHI[hh]
        // MAPHI[il] MAPHI[ih] ... MAPHI[pl] MAPHI[ph]
        __m256i shuf1 = _mm256_unpacklo_epi8(mapped1, mapped3);
        __m256i shuf2 = _mm256_unpackhi_epi8(mapped1, mapped3);
        __m256i shuf3 = _mm256_unpacklo_epi8(mapped2, mapped4);
        __m256i shuf4 = _mm256_unpackhi_epi8(mapped2, mapped4);
        // MAPLO[al] MAPHI[al] MAPLO[ah] MAPHI[ah] ... MAPLO[dl] MAPHI[dl] MAPLO[dh] MAPHI[dh] etc.
        // bytewise:
        // aaaabbbbccccdddd AAAABBBBCCCCDDDD
        // eeeeffffgggghhhh EEEEFFFFGGGGHHHH
        // iiiijjjjkkkkllll IIIIJJJJKKKKLLLL
        // mmmmnnnnoooopppp MMMMNNNNOOOOPPPP
        __m256i out1 = _mm256_unpacklo_epi8(shuf1, shuf3);
        __m256i out2 = _mm256_unpackhi_epi8(shuf1, shuf3);
        __m256i out3 = _mm256_unpacklo_epi8(shuf2, shuf4);
        __m256i out4 = _mm256_unpackhi_epi8(shuf2, shuf4);
        _mm_store_si128((void*)(ws_out + x * 4),       _mm256_extracti128_si256(out1, 0));
        _mm_store_si128((void*)(ws_out + x * 4 + 16),  _mm256_extracti128_si256(out2, 0));
        _mm_store_si128((void*)(ws_out + x * 4 + 32),  _mm256_extracti128_si256(out3, 0));
        _mm_store_si128((void*)(ws_out + x * 4 + 48),  _mm256_extracti128_si256(out4, 0));
        _mm_store_si128((void*)(ws_out + x * 4 + 64),  _mm256_extracti128_si256(out1, 1));
        _mm_store_si128((void*)(ws_out + x * 4 + 80),  _mm256_extracti128_si256(out2, 1));
        _mm_store_si128((void*)(ws_out + x * 4 + 96),  _mm256_extracti128_si256(out3, 1));
        _mm_store_si128((void*)(ws_out + x * 4 + 112), _mm256_extracti128_si256(out4, 1));
    }

d23 4 years ago |

Cool post! Thanks for submitting. I would never in a million years have thought to try out a polynomial fit for something like this. Unsurprisingly, it was slower, but still cool to even see you thinking that way at all.

klyrs 4 years ago | |

While I'm big on lookup tables for jobs like this, I'd have reached for bitmasks long before that polynomial...

kardos 4 years ago | | |

Hard to compete with a 4-byte lookup table, but for decoding the 256-byte table can become: return ((dibit | (dibit -1))>>1) & 3;

for (int i = 0; i < 256; ++i) { const char c[4] = { encode_lookup_tbl[i >> 0 & 3], encode_lookup_tbl[i >> 2 & 3], encode_lookup_tbl[i >> 4 & 3], encode_lookup_tbl[i >> 6 & 3], }; encode_lookup_tbl2[i] = *(uint32_t*)c; } for (x = 0; x < bytes; x++) { ((uint32_t*)ws_out)[x] = encode_lookup_tbl2[bytes_in[x]]; }

const __m256i MASK = _mm256_set1_epi8(0xf); _Alignas(32) char c[32]; for (int i = 0; i < 32; ++i) c[i] = encode_lookup_tbl[i & 3]; const __m256i MAPLO = _mm256_load_si256((void*)c); for (int i = 0; i < 32; ++i) c[i] = encode_lookup_tbl[i >> 2 & 3]; const __m256i MAPHI = _mm256_load_si256((void*)c); for (x = 0; x < bytes; x += 32) { // abcdefghijklmnop ABCDEFGHIJKLMNOP __m256i block = _mm256_loadu_si256((void*)(bytes_in + x)); // ah bh ch dh ... mh nh oh ph __m256i hi = _mm256_and_si256(_mm256_srli_epi16(block, 4), MASK); // al bl cl dl ... ml nl ol pl __m256i lo = _mm256_and_si256(block, MASK); // MAPLO[al] ... MAPLO[pl] // MAPHI[al] ... MAPLO[pl] // MAPLO[ah] ... MAPHI[ph] // MAPHI[ah] ... MAPHI[ph] __m256i mapped1 = _mm256_shuffle_epi8(MAPLO, lo); __m256i mapped2 = _mm256_shuffle_epi8(MAPHI, lo); __m256i mapped3 = _mm256_shuffle_epi8(MAPLO, hi); __m256i mapped4 = _mm256_shuffle_epi8(MAPHI, hi); // MAPLO[al] MAPLO[ah] ... MAPLO[hl] MAPLO[hh] // MAPLO[il] MAPLO[ih] ... MAPLO[pl] MAPLO[ph] // MAPHI[al] MAPHI[ah] ... MAPHI[hl] MAPHI[hh] // MAPHI[il] MAPHI[ih] ... MAPHI[pl] MAPHI[ph] __m256i shuf1 = _mm256_unpacklo_epi8(mapped1, mapped3); __m256i shuf2 = _mm256_unpackhi_epi8(mapped1, mapped3); __m256i shuf3 = _mm256_unpacklo_epi8(mapped2, mapped4); __m256i shuf4 = _mm256_unpackhi_epi8(mapped2, mapped4); // MAPLO[al] MAPHI[al] MAPLO[ah] MAPHI[ah] ... MAPLO[dl] MAPHI[dl] MAPLO[dh] MAPHI[dh] etc. // bytewise: // aaaabbbbccccdddd AAAABBBBCCCCDDDD // eeeeffffgggghhhh EEEEFFFFGGGGHHHH // iiiijjjjkkkkllll IIIIJJJJKKKKLLLL // mmmmnnnnoooopppp MMMMNNNNOOOOPPPP __m256i out1 = _mm256_unpacklo_epi8(shuf1, shuf3); __m256i out2 = _mm256_unpackhi_epi8(shuf1, shuf3); __m256i out3 = _mm256_unpacklo_epi8(shuf2, shuf4); __m256i out4 = _mm256_unpackhi_epi8(shuf2, shuf4); _mm_store_si128((void*)(ws_out + x * 4), _mm256_extracti128_si256(out1, 0)); _mm_store_si128((void*)(ws_out + x * 4 + 16), _mm256_extracti128_si256(out2, 0)); _mm_store_si128((void*)(ws_out + x * 4 + 32), _mm256_extracti128_si256(out3, 0)); _mm_store_si128((void*)(ws_out + x * 4 + 48), _mm256_extracti128_si256(out4, 0)); _mm_store_si128((void*)(ws_out + x * 4 + 64), _mm256_extracti128_si256(out1, 1)); _mm_store_si128((void*)(ws_out + x * 4 + 80), _mm256_extracti128_si256(out2, 1)); _mm_store_si128((void*)(ws_out + x * 4 + 96), _mm256_extracti128_si256(out3, 1)); _mm_store_si128((void*)(ws_out + x * 4 + 112), _mm256_extracti128_si256(out4, 1)); }