Let's Make a Varint(carlmastrangelo.com) |
Let's Make a Varint(carlmastrangelo.com) |
- The motivation behind this post is storing generated IDs for a site the author is working on. Generated IDs don't have the "smaller values are exponentially more likely" property that make varints useful in the first place.
- The post rejects UTF-8's variable-length encoding because "it can only store numbers up to 1,112,064". That's the maximum number of unicode code points (...ish). The UTF-8 encoding's actual limit is >10^16. A much better reason to not use UTF-8's encoding is that it was created under pretty heavy backwards-compatibility restrictions that the author doesn't have. For example, there's no need for being self-synchronizing or to be compatible with Boyer-Moore substring searches.
- The post goes with a length-prefixed encoding, but then uses that length prefix as part of the folder structure (it's using the generated IDs as filepaths). Which is a great way to create an exponential distribution in the sizes of your directories, instead of the uniform one that was the goal.
It's not a bad post, there's facts in there, but I wouldn't recommend bothering with it.
- The problem with generating id's is that it isn't known ahead of time how many there will be. This forces a solution that is suboptimal in all circumstances.
- The reason for rejecting UTF-8 is mostly backwards compatibility with existing software. Being able to use encoded UTF-8 strings that exceed the million is possible, but really burns a lot of bridges along the way. The point about boyer moore is really cool, I had no idea that was a goal!
- Having the length in folder structure is exponential, but only at the top most level. It will be uniform under each length dir. This is an acceptable price to pay when typing "ls ./dir", since removing the prefix would make it hard to read quickly:
0/
0.jpg
1/
1.jpg> Varints can represent negative numbers too, by casting them to unsigned numbers first.
But, assuming you're working with twos-complement numbers, it's going to be really inefficient, because small negative numbers are going to end up represented as extremely large positive numbers (e.g. -1 will map to MAX_INT - 1, where MAX_INT is the largest unsigned value representable in your underlying integer type).
You also lose the unlimited length property if you need to handle twos-complement integers, and your encoding is dependent on the underlying integer type you're serializing from/deserializing to (especially problematic in languages where types are architecture-dependent, like C/C++).
Protobuf handles this by doing what they call "ZigZag encoding" if you declare a value as a signed integer type. This maps small negative numbers to small positive numbers, alternating back and forth like so:
Original Encoded
0 0
-1 1
1 2
-2 3
2 4
and so on.The catch is that you lose sortability if you do this... if you're going to have a bunch of negative numbers and need their encoded versions to sort correctly, you'll need to do something else. (Of course, you'll need to with twos-complement numbers, too, given that -1 will appear to be greater than 1).
Protobuf documentation on ZigZag encoding: https://developers.google.com/protocol-buffers/docs/encoding...
Second, while he does characterize it as a "huge mistake" for encoding/decoding performance reasons, there are a few caveats that are obvious here.
1) Half of his remarks here explain why varints are inappropriate for CapnProto, a separate project of his, inspired by his experience with protobuffs. CP has different design goals than PB, and in that context varints are inappropriate. Fine.
2) His argument here against varints in protobuffs (as contrasted with in capnproto) is only that they cost CPU time (and thus latency). That's surely true, but it's not at all clear to me that he's right that this is, in general, the wrong performance tradeoff.
3) In general, if you want to represent a long-tailed distribution of numbers, some variable-length encoding has merits, because the alternative (that protobuffs promote, IIUC) are fixed-width ints, which means you waste a lot of space to accommodate your long tail, and even so you might not have allocated enough space for the largest integer you ever need to handle.
In practice the reason it ended up designed this way is because varint was introduced first with no special support for negative numbers, and then zigzag was introduced years later. Zigzag worked well because it stacked on top of the existing varint parser without changing it.
But for a new format with no backwards-compatibility constraints, it would probably make much more sense to use sign extension (so, the first encoded bit becomes the sign bit).
(I am the person who rewrote and open sourced protobufs at Google, though I am not the inventor.)
https://github.com/lemire/JavaFastPFOR
The links to the papers in the readme are useful if you want to learn more.
h0ly fvcking sh1t!
It's a good try, and making all the letters lowercase makes the numbers as vowels less legible. But if you're creating random strings you will create profanity and slurs.
Sure, nothing involving letters will be perfect, but this is something to take into account. You can choose a level of risk in your IDs somewhere between purely numeric IDs and the Automated Curse Generator [1].
[1] http://thedailywtf.com/articles/The-Automated-Curse-Generato...
Protobuf -- like most successful systems -- was not so much "designed" as it was "evolved" as a series of decisions made out of expediency rather than because they were the best possible answer. Overall it has obviously worked extremely well, but I would not assume that just because protobuf does something, it is good. (But I'd say that about just about everything, not just protobuf.)
That's in addition to the encoding-independent overhead inherent to any variable-length integer scheme, which is what mandates the encoding/decoding step in protobuf. Combined, this makes protobuf pretty expensive to decode (protobuf varints are everywhere; they're used for field keys and length delimiters for length-delimited field types).
That's unrelated to ZigZag encoding of negative numbers, though, which is fairly straightforward and computationally inexpensive (it's a couple bitshifts and an XOR).
That said, the fact that you can't skip forward in a protobuf without parsing through all the previous fields is unfortunate in some use cases, hence Cap'n Proto.
Cap'n Proto would've been a pretty strong contender if it had existed when we started using protobuf, though. Serialization/deserialization was on the critical path for our server-side code, so we most likely would have been able to get at least a bit of a performance boost out of Cap'n Proto.