Let's Make a Varint

Let's Make a Varint(carlmastrangelo.com)

26 points by saturncoleus 10 years ago | 20 comments

Strilanc 10 years ago |

This post starts off good, but...

- The motivation behind this post is storing generated IDs for a site the author is working on. Generated IDs don't have the "smaller values are exponentially more likely" property that make varints useful in the first place.

- The post rejects UTF-8's variable-length encoding because "it can only store numbers up to 1,112,064". That's the maximum number of unicode code points (...ish). The UTF-8 encoding's actual limit is >10^16. A much better reason to not use UTF-8's encoding is that it was created under pretty heavy backwards-compatibility restrictions that the author doesn't have. For example, there's no need for being self-synchronizing or to be compatible with Boyer-Moore substring searches.

- The post goes with a length-prefixed encoding, but then uses that length prefix as part of the folder structure (it's using the generated IDs as filepaths). Which is a great way to create an exponential distribution in the sizes of your directories, instead of the uniform one that was the goal.

It's not a bad post, there's facts in there, but I wouldn't recommend bothering with it.

saturncoleus 10 years ago | |

Those are all really good points actually. In response to them individually:

- The problem with generating id's is that it isn't known ahead of time how many there will be. This forces a solution that is suboptimal in all circumstances.

- The reason for rejecting UTF-8 is mostly backwards compatibility with existing software. Being able to use encoded UTF-8 strings that exceed the million is possible, but really burns a lot of bridges along the way. The point about boyer moore is really cool, I had no idea that was a goal!

- Having the length in folder structure is exponential, but only at the top most level. It will be uniform under each length dir. This is an acceptable price to pay when typing "ls ./dir", since removing the prefix would make it hard to read quickly:

    0/
    0.jpg
    1/
    1.jpg

JonathonW 10 years ago |

Something probably not particularly important to the author (given that he's just using these essentially as a serial number), but worth mentioning for general use:

> Varints can represent negative numbers too, by casting them to unsigned numbers first.

But, assuming you're working with twos-complement numbers, it's going to be really inefficient, because small negative numbers are going to end up represented as extremely large positive numbers (e.g. -1 will map to MAX_INT - 1, where MAX_INT is the largest unsigned value representable in your underlying integer type).

You also lose the unlimited length property if you need to handle twos-complement integers, and your encoding is dependent on the underlying integer type you're serializing from/deserializing to (especially problematic in languages where types are architecture-dependent, like C/C++).

Protobuf handles this by doing what they call "ZigZag encoding" if you declare a value as a signed integer type. This maps small negative numbers to small positive numbers, alternating back and forth like so:

    Original    Encoded
       0          0
      -1          1
       1          2
      -2          3
       2          4

and so on.

The catch is that you lose sortability if you do this... if you're going to have a bunch of negative numbers and need their encoded versions to sort correctly, you'll need to do something else. (Of course, you'll need to with twos-complement numbers, too, given that -1 will appear to be greater than 1).

Protobuf documentation on ZigZag encoding: https://developers.google.com/protocol-buffers/docs/encoding...

daurnimator 10 years ago | |

FWIW the protobuf varint encoding was seen as a mistake by the inventor. https://groups.google.com/d/msg/capnproto/Qxw7YSoPP18/HS06Kc...

jholman 10 years ago | | |

First of all, I believe that Kenton Varda is not the inventor of protobuffs, but rather a Google engineer who did a bunch of work on Protobuffs v2 (I believe on both of the v2 specs for protobuffs, because there are two different v2s). He is certainly quite knowledgeable, but not "the inventor".

Second, while he does characterize it as a "huge mistake" for encoding/decoding performance reasons, there are a few caveats that are obvious here.

1) Half of his remarks here explain why varints are inappropriate for CapnProto, a separate project of his, inspired by his experience with protobuffs. CP has different design goals than PB, and in that context varints are inappropriate. Fine.

2) His argument here against varints in protobuffs (as contrasted with in capnproto) is only that they cost CPU time (and thus latency). That's surely true, but it's not at all clear to me that he's right that this is, in general, the wrong performance tradeoff.

3) In general, if you want to represent a long-tailed distribution of numbers, some variable-length encoding has merits, because the alternative (that protobuffs promote, IIUC) are fixed-width ints, which means you waste a lot of space to accommodate your long tail, and even so you might not have allocated enough space for the largest integer you ever need to handle.

arielweisberg 10 years ago | | |

The mistake wasn't that particular varint encoding, but that it was variable length. That means you need to perform more complex decoding which is not the tradeoff chosen for Cap N Proto.

kentonv 10 years ago | | |

I'm not the inventor of protobuf, just a guy who rewrote and open sourced it. My rewrite did not change the underlying encoding.

groovy2shoes 10 years ago | |

Any good reason not to just use a single sign bit?

kentonv 10 years ago | | |

Honestly, no, there's no "good" reason.

In practice the reason it ended up designed this way is because varint was introduced first with no special support for negative numbers, and then zigzag was introduced years later. Zigzag worked well because it stacked on top of the existing varint parser without changing it.

But for a new format with no backwards-compatibility constraints, it would probably make much more sense to use sign extension (so, the first encoded bit becomes the sign bit).

(I am the person who rewrote and open sourced protobufs at Google, though I am not the inventor.)

ccleve 10 years ago |

The references in the article don't reflect the latest research. So far as I know, this is close to state of the art:

https://github.com/lemire/JavaFastPFOR

The links to the papers in the readme are useful if you want to learn more.

hyperpape 10 years ago |

"No “u”, (and no “o” or “i”) so accidental profanity cannot happen."

h0ly fvcking sh1t!

It's a good try, and making all the letters lowercase makes the numbers as vowels less legible. But if you're creating random strings you will create profanity and slurs.

rspeer 10 years ago | |

If you're generating alphanumeric IDs, removing I, O, and U from your alphabet really does decrease the likelihood of the ID being perceived as English profanity.

Sure, nothing involving letters will be perfect, but this is something to take into account. You can choose a level of risk in your IDs somewhere between purely numeric IDs and the Automated Curse Generator [1].

[1] http://thedailywtf.com/articles/The-Automated-Curse-Generato...

hyperpape 10 years ago | | |

That's why I said it was a good try. I don't think it's a bad idea, but if you ever wanted to push the needle close to zero, you would have to add some other layer of checking.

deckar01 10 years ago | |

I find the entire premise of the chosen alphabet rediculous. Long, random IDs are not for typing nor reading. It doesn't sound like they are even being used for user facing URLs.

btrask 10 years ago |

Varints are widely used in databases. https://www.sqlite.org/src4/doc/trunk/www/varint.wiki