Cap’n Proto (capnproto.org)
Just wanted to note that although Cap'n Proto hasn't had a blog post or official release in a while, development is active as part of the Sandstorm project (https://sandstorm.io). Cap'n Proto -- including the RPC system -- is used extensively in Sandstorm. Sandboxed Sandstorm apps in fact do all their communications with the outside world through a single Cap'n Proto socket (but compatibility layers on top of this allow apps to expose an HTTP server).
Unfortunately I've fallen behind on doing official releases, in part because an official release means I need to test against Windows, Mac, and other "supported platforms", whereas Sandstorm only cares about Linux. Windows is especially problematic since MSVC's C++11 support is spotty (or was last I tried), so there's usually a lot of work to do to get it working.
As a result Sandstorm has been building against Cap'n Proto's master branch so that we can make changes as needed for Sandstorm.
I'm hoping to get some time in the next few months to go back and do a new release.
Right now, my project's rather small 100 kLOC codebase compiles (with very minimal #ifdef hackery) on Android, iOS, Mac, and Linux, but Windows is still very much a WIP; even some of my third-party libraries don't compile on VC. I'm actually considering trying the MinGW toolchain at this point, and I'd be curious to hear anyone's thoughts.
I'm sure part of it is that the compiler itself was probably developed with the tight feedback loop from developing and maintaining large codebases in house at Microsoft (like Windows and Office). It's pretty hard when part of the internal pressure to support large codebases like that conflicts with the need to conform to outside third-party standards. I've heard great things about the compiler people at Microsoft, and I'm sure they have a technically strong team, but they are most likely caught in the middle of this organizational deadlock.
I'm sure it doesn't help that I'm not mentioning any specifics. I'm going to be revisiting the Windows desktop port of my mobile app soon (which I gave up on and haven't touched for a couple of months), and everything will be fresh in my mind again. I do vaguely remember something about having to explicitly add more #include statements to pull in header files that were already getting pulled in by my other compilers.
- Do you guys have an RPC library written in anything other than C++? If not, could you point me to protocol specs so I can start writing my own?
- Since it uses a streaming model to support random access, what encryption method do you think would work best with Cap'n Proto that would keep it speedy and still retain all functionality?
Thanks!
People have written implementations in Rust, Go, and Erlang, and wrappers around the C++ library in Javascript and Python: https://capnproto.org/otherlang.html
Scroll down that page for some info on how to start writing an implementation in another language.
The RPC protocol spec is here:
https://github.com/sandstorm-io/capnproto/blob/master/c++/sr...
> - Since it uses a streaming model to support random access, what encryption method do you think would work best with Cap'n Proto that would keep it speedy and still retain all functionality?
Hmm, I'm not clear on what you mean by "streaming model" -- I think of "streaming" as the opposite of random access.
Regarding encryption, this is a very big question and there are a lot of different needs and use cases to consider. Mostly I don't think that use of Cap'n Proto affects encryption decisions much, but if you want to make sure you don't lose random access, you should of course use a cipher that supports random access, like chacha20 or AES-CTR.
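To illustrate what "a cipher that supports random access" buys you, here is a toy counter-mode sketch. It is an assumption-laden illustration, not real cryptography: the keystream comes from SHA-256 over key, nonce, and block counter (a stand-in for chacha20 or AES-CTR), and `keystream_block` and `xor_at` are hypothetical names. The point is only that block N of the keystream can be computed without processing blocks 0..N-1.

```python
import hashlib

BLOCK = 32  # SHA-256 digest size, standing in for a cipher block size


def keystream_block(key: bytes, nonce: bytes, index: int) -> bytes:
    # Toy keystream: hash(key || nonce || counter). Illustration only, NOT a vetted cipher.
    return hashlib.sha256(key + nonce + index.to_bytes(8, "little")).digest()


def xor_at(key: bytes, nonce: bytes, data: bytes, offset: int) -> bytes:
    """Encrypt or decrypt `data` located at byte `offset` of the stream."""
    out = bytearray(len(data))
    for i, byte in enumerate(data):
        pos = offset + i
        # The keystream block depends only on the position, not on earlier data.
        block = keystream_block(key, nonce, pos // BLOCK)
        out[i] = byte ^ block[pos % BLOCK]
    return bytes(out)


key, nonce = b"k" * 32, b"n" * 12
plaintext = bytes(range(200))
ciphertext = xor_at(key, nonce, plaintext, 0)

# Decrypt bytes 100..109 only; the first 100 bytes are never touched.
window = xor_at(key, nonce, ciphertext[100:110], 100)
assert window == plaintext[100:110]
```

In practice you would use a real library implementation of chacha20 or AES-CTR plus authentication; the seek-by-counter idea is the same.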
I've also built another message format recently (yes I know, they are never ending). It can't do everything Cap'n Proto can do, although it shares a lot of the same values. One thing I chose to do is to have order preservation on types, which can be very useful. It does mean that the wire format is largely BE though. Anyway, that's an aside.
I'm curious what you think of Amazon ION by the way?
Good stuff!
Cross-platform libraries that really only care about a single platform are an endless source of frustration. :(
There are a few major reasons I didn't go that route:
* The .proto language has a lot of weird quirks that I don't like. Some of the quirks are specific to the protobuf encoding (e.g. int32 vs. sint32 vs. fixed32 being different types), while other quirks have no particular rationale behind them. I didn't want that baggage.
* The .proto language does not treat interfaces (aka services) as a first-class type. That is, you cannot define a field whose type is an RPC interface type -- a reference to a remote object. The ability to do this is a critical part of Cap'n Proto's interface design.
* It's highly unlikely that the protobuf team would be interested in accepting changes to the language which were not actually supported by protobuf. This means that if I shared the language, I would have my hands tied when it comes to new features -- or I'd also have to implement Protobuf equivalents to make them happy.
(Note that Sandstorm is focused on making Cap'n Proto work well for Sandstorm. We welcome contributions, but we generally don't have resources available ourselves to work on third-party feature requests unrelated to Sandstorm. That is, unless you want to pay us a bunch of money, in which case, feel free to contact me. ;) )
The FlatBuffers encoding is based on vtables and is relatively straightforward (the runtime library is tiny). This also means it's inefficient for small messages, but in my testing its vtable deduplication worked great for my use case (~100k messages of the same type per memory-mapped file), in that the vtable overhead tends quickly to zero.
Cap'n Proto has a more complex encoding that is probably more efficient in terms of wire size, and particularly for small/standalone messages, but the runtime is larger as a result.
Cap'n Proto is both. I.e. you have the serialisation, which can be used on its own, but you can also define interfaces with methods that can pass those serialised structs as parameters, and which return asynchronous promises.
Right now I can only see a way to create lists, so access will be in linear time.
However, I am obviously very biased. :)
In general, because I felt there was too much resistance to me pursuing ideas that I wanted to pursue. Google has become increasingly top-down, whereas it was fairly bottom-up when I started in 2005. That's not necessarily a bad thing, but I felt that I personally would be happier running a startup where I could call the shots.
FWIW, if you just want to write code, be comfortable, and make a crapton of money, and you don't care if you're implementing someone else's ideas, I highly recommend working for Google. That's not meant to be sarcastic or disparaging -- I totally respect that approach and there are days when I wish that were me. But if you have ideas of your own and you won't be happy unless you see them implemented... it probably won't happen at Google.
(To be fair, some people at Google would surely argue that my problem is that my ideas are crazy and bad, and I don't have any firm evidence -- yet -- that they are wrong.)
Anyways, it seems like a cool project, so I'll be sure to follow its development closely.
[1] http://www.art.net/~hopkins/Don/unix-haters/x-windows/danger...
So, uh… I have a confession to make.
I may have rewritten Protocol Buffers.
Again.
What's an efficient binary representation? C Structs. What's an efficient text representation? Javascript objects. But casting arbitrary data to a struct is a horrible idea. And eval'ing anonymous javascript is a horrible idea. Back to the drawing board.
Then N years later someone has the brilliant idea of just... not parsing these formats that idiotic way. And wrote a code generator because the smart way is tedious.
Of course, the hardest part is convincing everyone that it's not your bespoke type-length-value struct, but that you have good reasons for what you're doing. I think the humorous, not-so-self-serious presentation has worked in its favor (but that's just a subjective opinion and I can't back it up with data).
https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-...
"Protocol Buffers" has been the go-to for a long time but there are more options now.
For uses where serialization/deserialization CPU time is a concern, it really seems to be a question of Cap'n Proto versus FlatBuffers ( https://google.github.io/flatbuffers/ ).
In 2008, Joel Spolsky wrote about 1990s-era Excel file formats and how they used this technique to deal with how slow computers were then [1]. Same technique, new problem set.
I guess it's a way to send data to your front-end JavaScript without using JSON, and it compresses the data so it's faster? How much better than using JSON is it?
Cap'n Proto generates you some code that contains some data structures. You put data into these structures, and they will automatically be in the right shape to put directly on the wire. And then that data can be pulled right off the wire and right into memory and be fully ready to access, with no intermediate step.
It's, in a sense, infinitely faster than JSON serialization or deserialization. Because it doesn't even perform any serialization. It's just data.
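A rough sketch of the idea, in Python for brevity. This is not the actual Cap'n Proto wire format (which also has segments and pointers); `PointReader` and its field layout are hypothetical. It only shows the shape of the trick: the "generated" class wraps the received bytes and each accessor reads a fixed offset, so there is no separate parse step.

```python
import struct


class PointReader:
    """Hypothetical generated reader for {x: int32, y: int32, id: uint64}."""

    def __init__(self, buf, base=0):
        self._buf = memoryview(buf)  # a view, not a copy, of the received bytes
        self._base = base

    @property
    def x(self):
        return struct.unpack_from("<i", self._buf, self._base)[0]

    @property
    def y(self):
        return struct.unpack_from("<i", self._buf, self._base + 4)[0]

    @property
    def id(self):
        return struct.unpack_from("<Q", self._buf, self._base + 8)[0]


wire = struct.pack("<iiQ", -3, 7, 42)  # 16 bytes, as they would sit on the wire
point = PointReader(wire)
print(point.x, point.y, point.id)  # -3 7 42
```

Constructing the reader does no work proportional to the message size; a field costs something only when you actually read it.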
There are some other tricks at play here, but I won't go into them. This is plenty cool.
But for me at least, the real advantage over JSON isn't the performance but the schema compatibility. You have a spec for your data and generate code from that, which means the spec is guaranteed to be correct, and there's clear documentation about what changes to the spec are or aren't forward or backward compatible. (You get the same thing from the original Protocol Buffers though).
Eh, I'd rather pay the cost of serialisation once and deserialisation once, and then access my data for as close to free as possible, rather than relying on a compiler to actually inline calls properly.
> Integers use little-endian byte order because most CPUs are little-endian, and even big-endian CPUs usually have instructions for reading little-endian data.
sob There are a lot of things Intel has to account for, and frankly little-endian byte order isn't the worst of them, but it's pretty rotten. Writing 'EFCDAB8967452301' for 0x0123456789ABCDEF is perverse in the extreme. Why? Why?
As pragmatic design choices go, Cap'n Proto's is a good one (although it violates the standard network byte order). Intel effectively won the CPU war, and we'll never be free of the little-endian plague.
It's all so depressing.
Is it captain? Is it cap+n+proto?
A lot of collaboration is verbal - people sit around and talk about stuff. I don't know if it is a fun take on an American word... But it is impossible to use in the rest of the world.
I really wish you would call it something else... Unless it is personal for you :(
But "Captain Proto" is acceptable if you have trouble with the contraction.
Or you can also think of it as "Cap-and-Proto". Which is an intentional pun ("capabilities and protocols", or something).
Googling either of these will get you to the right place, so I think it gets the job done.
I never realized that! I like the name much more now.
Btw, you rank in the 5th result for "protocap" on Google.
Now that's a name all of us can pronounce!
The pronunciation would thus be "cap [the sound the letter 'n' makes] crunch".
More importantly, how do you Google for it?
You can search for cap'n proto in any number of ways, including its literal name [ cap'n proto ] or [ capn proto ] or [ captain proto ].
The interfaces and inheritance relate to the RPC system. The interfaces are for remote objects.
This sounds like a cool idea, but so far I haven't seen any good explanation of how it works, and why it will save me from rolling my own ACL system. For bragging about it in the very first sentence, there is surprisingly little detail about how it works.
Here is some reading:
https://capnproto.org/rpc.html#security
With that said, here are some considerations:
- msgpack is usually used as a binary encoding of JSON, with no schemas. That means that textual field names are included in the encoded message. Formats like Protobuf and Cap'n Proto that have schemas known in advance can avoid this bloat, making them faster and smaller.
- msgpack is not a zero-copy encoding. It's necessary to parse the whole message upfront before you can use it, like with protobuf. Cap'n Proto is zero-copy, the advantages of which are described extensively on the page. For example, if you have a multi-gigabyte file containing a massive Cap'n Proto message, and you just want to read one field from one place in that message, you can do that by memory-mapping the file. No need to read it all in. That's not possible with Protobuf or Msgpack.
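As a hedged illustration of the memory-mapping point, the sketch below uses a made-up fixed-size record layout (`RECORD`, `write_records`, and `read_value` are hypothetical, and this is not Cap'n Proto's real encoding). Reading one record maps the file and touches only the bytes at that record's offset; the operating system pages in just what is needed.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<Qd")  # hypothetical record: uint64 id, float64 value


def write_records(path, count):
    with open(path, "wb") as f:
        for i in range(count):
            f.write(RECORD.pack(i, i * 0.5))


def read_value(path, index):
    # Map the file instead of reading it; only the touched pages get loaded.
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        _record_id, value = RECORD.unpack_from(m, index * RECORD.size)
        return value


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")
    write_records(path, 1000)
    last = read_value(path, 999)

print(last)  # 499.5
```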
I think it's best to focus on these kind of paradigm-shifts when trying to reason about performance. You can always micro-optimize the encoding path later on, but you can't suddenly switch to zero-copy later if your data format wasn't designed for it.
The thing that irks me about these methods is that if you're using a capnproto Int rather than a regular Int, doesn't that mean that you're basically forgoing a lot of functionality that was built around and works with the regular old data types?
For example, we also do that with numpy data types in Python, but there the performance benefit is super clear - numerical operations dominate. I guess it really depends on your use case. If most of your time is spent on serde, then perhaps it's worth it.
In terms of data layout, 32-bit vs. 64-bit architecture only really affects pointer size. But Cap'n Proto does not encode native pointers (that obviously wouldn't work), so this turns out not to matter.
> endianness,
It turns out almost everything is little-endian now. Also, big-endian architectures almost always have efficient instructions for loading little-endian data. So Cap'n Proto just has to make sure to use those instructions in the getters/setters for integer fields.
> not to mention differences between how languages store things, etc.
Cap'n Proto actually doesn't attempt to match how any language stores things. Instead, it defines its own layout that is appropriate for modern CPUs. It ends up being very similar to the way many languages store things (especially C), but isn't intended to exactly match.
The C++ implementation of Cap'n Proto generates inline getter/setter methods that do pointer arithmetic that is equivalent to what the compiler would generate when accessing a struct.
For Java, Cap'n Proto data is stored in a ByteBuffer, which effectively allows something like pointer arithmetic. Again, getters/setters are generated which use the right offsets.
Most other languages end up looking like either C++ or Java.
In C/C++ ya can! When making games in college that is exactly what we did. Take the struct, dump it into the socket. I was rather shocked when trying to recreate the same system in C#. "I can't? I CAN'T?"
Use C if that's not what you want. Don't hammer with a screwdriver.
Proto3 removes some features from proto2 which were deemed overcomplicated relative to their value (unknown field retention, non-zero default values, extensions, required fields) and adds some features that people have wanted for a long time (maps). But all of these features are things that are "on top" of the core, not really fundamental changes.
I think the only change which affects my comparison post (linked by GP) is removal of unknown field retention. This is actually noted in the comparison grid. I'm honestly very surprised that they chose to remove this feature since it is critical to many parts of Google's infrastructure.
Ultimately we went with FlatBuffers because we found the API much cleaner across languages (C++/Go), but which one you use is going to depend on your use case. Performance is pretty much identical. We populate our own custom structs from the Cap'n Proto/FlatBuffers structs anyway, so we're never really doing zero-copy. That said, since both of these formats transmit numerical data as little-endian memory representations of ints/floats/longs/doubles, their performance is fantastic.
Once you've spent as much time twiddling bits as I have (as the author of proto2 and Cap'n Proto), you start to realize that little endian is much easier to work with than big-endian.
For example:
- To reinterpret a 64-bit number as 32-bit in BE, you have to add 4 bytes to your pointer. In LE, the pointer doesn't change.
- Just about any arithmetic operation on integers (e.g. adding) starts from the least-significant bits and moves up. It's nice if that can mean iterating forward from the start instead of backwards from the end, e.g. when implementing a "bignum" library.
- Which of the following is simpler?
// Extract nth bit from byte array, assuming LE order.
(bytes[n/8] >> (n%8)) & 1
// Extract nth bit from byte array, assuming BE order.
(bytes[n/8] >> (7 - n%8)) & 1
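The first and third points can be checked with a few lines of Python (a sketch; `bit_le` and `bit_be` are just the two expressions above wrapped in functions, with Python's `//` in place of C integer division):

```python
import struct

value = 0x0123456789ABCDEF
le = struct.pack("<Q", value)  # little-endian bytes: EF CD AB 89 67 45 23 01
be = struct.pack(">Q", value)  # big-endian bytes:    01 23 45 67 89 AB CD EF

# Reinterpreting the 64-bit value as 32-bit: LE reads at offset 0, BE needs offset 4.
low_le = struct.unpack_from("<I", le, 0)[0]
low_be = struct.unpack_from(">I", be, 4)[0]
assert low_le == low_be == 0x89ABCDEF


def bit_le(data, n):
    # Bit n counts up from the least-significant bit of the whole number.
    return (data[n // 8] >> (n % 8)) & 1


def bit_be(data, n):
    # Bit n counts down from the most-significant bit of the whole number.
    return (data[n // 8] >> (7 - n % 8)) & 1


for n in range(64):
    assert bit_le(le, n) == (value >> n) & 1
    assert bit_be(be, n) == (value >> (63 - n)) & 1
```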
There's really no good argument for big-endian encoding except that it's the ordering that we humans use in writing.
I think the correct answer won here.
For some reason humans seem to want high powers on the left, even if it makes no sense in a left-to-right language.
Take polynomials, they are typically written big-endian
ax^2 + bx + c
But infinite series have to be little-endian. c_0 + c_1*x + c_2*x^2 ....
If you think for a moment about how you would write multiplication, you will see the latter form is much easier to reason about and program with.
https://www.ietf.org/rfc/ien/ien137.txt
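To make the multiplication point concrete, here is a small sketch with coefficients stored lowest power first (the "little-endian" form). `poly_mul` is an illustrative name. The coefficient of x^(i+j) simply lives at index i+j, so no index reversal is needed.

```python
def poly_mul(a, b):
    """Multiply polynomials given as coefficient lists, lowest power first."""
    if not a or not b:
        return []
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj  # x^i * x^j lands at index i + j
    return out


# (1 + 2x) * (3 + x) = 3 + 7x + 2x^2
print(poly_mul([1, 2], [3, 1]))  # [3, 7, 2]
```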
Also more reader-friendly here: https://www.computer.org/csdl/mags/co/1981/10/01667115.pdf
But one shouldn't do that very often: those are two different types. The slight cost of adding to a pointer is negligible.
> Just about any arithmetic operation on integers (e.g. adding) starts from the least-significant bits and moves up. It's nice if that can mean iterating forward from the start instead of backwards from the end, e.g. when implementing a "bignum" library.
-- is a thing, just as ++ is.
> There's really no good argument for big-endian encoding except that it's the ordering that we humans use in writing.
That's like saying, 'there's really no good argument for pumping nitrogen-oxygen mixes into space stations except that it's the mixture we humans use to breathe.'
It's simplicity itself for a computer to do big-endian arithmetic; it's horrible pain for a human being who has to read a little-endian listing. A computer can be made to do the right thing. Who is the master: the computer or the man?
The thing is, as close to free as possible is surprisingly expensive. Protobuf's varint encoding is extremely branchy, and hurts performance in a datacenter environment (where bandwidth is free, and CPU is expensive).
> As pragmatic design choices go, Cap'n Proto's is a good one (although it violates the standard network byte order). Intel effectively won the CPU war, and we'll never be free of the little-endian plague.
Did they though? Arguably there are far more ARM CPUs (like the one in your pocket) than there are server CPUs. Since cellphones and other low power devices are almost all big endian, it seems like network byte order would have been better to use. High powered servers can pay the cost of coding them, but battery powered devices cannot afford to do so.
> sob There are a lot of things Intel has to account for, and frankly little-endian byte order isn't the worst of them, but it's pretty rotten. Writing 'EFCDAB8967452301' for 0x0123456789ABCDEF is perverse in the extreme. Why? Why?
Little endian means that a CPU designer can make buses shorter, which makes the CPU more efficient and smaller. There are also several benefits from the programming side. So it is actually better than big endian in /some/ cases, at the cost of being less intuitive to humans.
So while Intel did choose little endian, they had very good reason to (and it's probably why everything except SystemZ and POWER uses it).
The fact that it was "built-in" to Javascript, of course, helped, by allowing it to compete on even footing (XML is also built in via XMLHttpRequest).
I may have rewritten Protocol Buffers, but infinitely faster.
Lack of serialization is probably useful for JavaScript, but "exact in-memory data formats" probably don't fit well with dynamic-typed languages ;)
Because all modern computer architectures assign addresses to bytes, not bits, it's up to us to decide which way to number the bits. But we should always number the bits the same way we number the bytes.
Unless you're programming in C or C++ (or using a library from them), where size of int, long, etc. may change depending on architecture and compiler.
That was a special kind of hell :/
Oh, that's rude. There's a huge difference between flipping a few bytes and committing a public indecency against God and man like JSON.
I don't have a preference for either one, but using little-endian when most/every processor you will be targeting supports it makes more sense than using big-endian + extra work on x86 just so you can read it with less effort in a memory dump.
Cap'n Proto generates classes which wrap a byte buffer and give you accessor methods that read the fields straight out of the buffer.
That actually works equally well in C++ and Javascript.
In general I then think the difference (for non-C++) between your method and others (protobuf, thrift, ...) is that yours would incur the cost of deserializing a field at the moment the field is accessed. In the others, all fields are deserialized at once. But in the end it should have the same cost if I need all fields, e.g. in order to convert the data into a plain Java/Javascript/C#/... object, or am I missing something there? For C++ I absolutely believe that you can have a byte-array backed proxy object with accessor methods that have the same properties as accessing native C++ structures.
- Making one pass instead of two is better for the cache. When dealing with messages larger than the CPU cache, memory bandwidth can easily be the program's main bottleneck, at which point using one pass instead of two can actually double your performance.
- Along similar lines, when you parse a protobuf upfront, you have to parse it into some intermediate structure. That intermediate structure takes memory, which adds cache pressure. Cap'n Proto has no intermediate structure.
- Protobuf and many formats like it are branch-heavy. For example, protobuf likes to encode integers as "varints" (variable-width integers), which require a branch on every byte to check if it's the last byte. Also, protobuf is a tag-value stream, which means the parser has to be a switch-in-a-loop, which is a notoriously CPU-unfriendly pattern. Cap'n Proto uses fixed widths and fixed offsets, which means there are very few branches. As a result, an upfront Cap'n Proto parser would be expected to outperform a Protobuf parser. The fact that parsing happens lazily at time of use is a bonus.
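The branchiness is easy to see side by side. Below is a sketch of a base-128 varint codec (the general scheme protobuf uses for integers) next to a fixed-width read; the function names are illustrative. The varint decoder must test the continuation bit on every byte, while the fixed-width decoder is a single load at a known offset.

```python
import struct


def encode_varint(n):
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos=0):
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):  # data-dependent branch on every byte
            return result, pos
        shift += 7


def decode_fixed32(buf, pos=0):
    # No loop, no data-dependent branch: the width is known in advance.
    return struct.unpack_from("<I", buf, pos)[0], pos + 4


assert encode_varint(300) == b"\xac\x02"
assert decode_varint(b"\xac\x02") == (300, 2)
assert decode_fixed32(struct.pack("<I", 300)) == (300, 4)
```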
All that said, it's true that if you are reading every field of your structure, then Cap'n Proto serialization is more of an incremental improvement, not a paradigm shift.
Is it still possible to realise benefits of the encoding when translating Python objects?
But because most architectures don't provide any way to address individual bits, only bytes, it's entirely up to the observer to decide in which order they want to imagine the bits. When using little-endian, you imagine that the bits are in little-endian order, to be consistent with the bytes, and then everything is nice and consistent.
But isn't that kind of at odds with how shifting works? (i.e. that a left shift moves towards the "bigger" bits and a right shift moves toward the "smaller" ones.) Perhaps for a Hebrew or Arabic speaker this all works out nicely, but for those of us accustomed to progressing from left to right it seems a bit backwards...
Right, sorry, I re-read my comment and confused myself too. Seems like it's bad for me to go on HN without a fresh cup of coffee (it's 10am now here in the Philippines).
Thanks for your swift response, I'll experiment with AES-CTR (since I'm more familiar with it than chacha20). And thanks for pointing out that there are wrappers for Python/Go already, the programming language I use daily and was thinking of building libs for! Again, great work, and I'll stay posted.
https://download.libsodium.org/libsodium/content/secret-key_...
The big problem was Wine, which can behave quite differently to real Windows in some places. It's certainly not bug-for-bug compatible. So I kept finding myself having to boot into Windows anyway for debugging, so eventually I just installed an msys development system on Windows with gvim and used that.
There was also something else wrong with mingw for Linux, but for the life of me I can't remember what. There was something about mingw being different to mingw32? Missing headers? It's been too long.
I've run into a difference between the two recently. It turns out that MinGW implemented POSIX glob() post-fork, which hasn't been ported to MinGW-w64.
[1] As a side note, I found that MSYS2 makes a huge difference in how you develop software for Windows with a POSIXy toolchain. Anyone who uses MinGW on Windows should try it.
The reasons you gave make sense, although I'm not super familiar with the RPC aspect of protocol buffers so that's new to me.
(Proto2 definitely didn't have any built-in notion of maps when I was working on it. I thought maps were added as a proto3 feature...)
I don't think any lookup table is provided (the wire order of entries is undefined). They are not lookup maps, they are syntactic sugar for repeated key/value pairs.
We rejected Cap'n Proto for the same reasons. I don't see a need to use bleeding-edge C++ features when the same (or better) results can be achieved with code generation.
Kudos on the effort but if you're looking for wide adoption you've missed the mark.
Well... We're not, actually. We're looking to support the needs of Sandstorm.
It's nice if my code is useful to others, and if people want to contribute better Windows support I'll be totally happy to review and merge those changes! But wide adoption of Cap'n Proto (outside of the Sandstorm context) is not part of our business model, unfortunately.
(Note: The poster you were replying to isn't associated with Cap'n Proto nor Sandstorm.)
For some reason, Sun never got around to making a bi-endian version of SPARC called CRAPS.
Speaking of Sun, I have a confession to make: I'm bi-stellar. I love both Star Wars and Star Trek.
[1] https://chortle.ccsu.edu/AssemblyTutorial/Chapter-15/ass15_4...
(Also, a lot of the issues we face affect headers, which need to be compiled into the client projects.)
I thought of the hashtable, but that means I have to keep the hash-table in RAM. By contrast, the entire capnproto structure itself could be memory mapped, thus not having any RAM constraints.
I'm thinking larger datasets here, large enough where overhead of e.g. JSON becomes a problem with RAM (on mobile devices), but not yet large enough to have to use SQLite.
I might be overthinking it, and could move to SQLite straight away. Just trying to keep my stuff as simple as possible. Conceptually simple, that is. Capnproto is conceptually simple: a tree dumped to disk, allowing on-demand memory mapped access to requested parts.
I'm saying you create a capnp list, and then you store elements into the list at position according to their hash -- i.e. how you'd build a hashtable, but the capnp list itself is the backing array.
You would have to do this at write time, of course, and make sure the hashing is consistent between runs.
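A minimal sketch of that approach, with a plain Python list standing in for the capnp list (no Cap'n Proto API is used here, and `build_table` and `lookup` are hypothetical names). It uses open addressing with linear probing, and `zlib.crc32` as the hash because Python's built-in `hash()` for strings is randomized per process, which would break the "consistent between runs" requirement.

```python
import zlib

EMPTY = None


def slot_for(key, capacity):
    # crc32 is deterministic across runs, unlike Python's built-in hash() on str.
    return zlib.crc32(key.encode("utf-8")) % capacity


def build_table(items, capacity):
    """Place each (key, value) at its hash position; done once, at write time."""
    if capacity <= len(items):
        raise ValueError("capacity must exceed the number of items")
    slots = [EMPTY] * capacity
    for key, value in items.items():
        i = slot_for(key, capacity)
        while slots[i] is not EMPTY:  # linear probing on collision
            i = (i + 1) % capacity
        slots[i] = (key, value)
    return slots


def lookup(slots, key):
    capacity = len(slots)
    i = slot_for(key, capacity)
    for _ in range(capacity):
        entry = slots[i]
        if entry is EMPTY:
            return None  # hit a hole: the key was never stored
        if entry[0] == key:
            return entry[1]
        i = (i + 1) % capacity
    return None


table = build_table({"apple": 1, "banana": 2, "cherry": 3}, capacity=8)
print(lookup(table, "banana"))  # 2
print(lookup(table, "durian"))  # None
```

With a memory-mapped list as the backing array, a lookup would touch only the probed slots rather than the whole structure.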
I'd still prefer the framework do it for me. It seems quite involved.
Thanks for taking time to answer my questions!
FTFY
It may not be that your ideas are crazy and bad, but rather that your sane good ideas simply aren't appropriate in the context of advertising.
Enough with the Stockholm syndrome. There are plenty of perfectly great ideas that Google would never pay their employees to work on, however explicable or inexplicable the reason might be.
>A blow for mobile advertising: The next version of Safari will let users block ads on iPhones and iPads [1] : [...] An Apple realist might argue that its great rival Google makes more than 90 percent of its revenue from online advertising — a growing share of that on mobile, and a large share of that on iPhone. Indeed, Google alone makes about half of all global mobile advertising revenue. So anything that cuts back on mobile advertising revenue is primarily hurting its rival. (Google has been less friendly to adblockers than its “open” positioning would suggest.)
[1] http://www.niemanlab.org/2015/06/a-blow-for-mobile-advertisi...
Your post was snide. Snide remarks are commonly downvoted on HN, regardless of their accuracy.
Regarding your follow-up, you're taking a very simplistic view of Google that doesn't really capture reality. Google obviously builds lots of technology that doesn't directly affect advertising revenue. Most decision-makers at Google are not asking "how does this affect advertising revenue" for every decision. You'd know that if you ever worked there.
Several reasons, but most critically:
The phrase "advertising company" generally refers to a firm in the business of creating advertising. Google's main business is that of an online media company selling advertising space (both in its own advertising-supported services and alongside online media provided by others).
To what extent are such disagreements purely technical (e.g. byte-oriented wire encodings vs. roughly-like-memory) and to what extent are they "political"?
By "political" I am talking about your claims about decentralisation and diversity, as if you are selling Sandstorm as the IBM-PC for the cloud millennium: a neutral platform for any old software developer. That is very different from Google's way of doing business.
That sounds like a pretty major advertising company to me.
Can you name a bigger advertising company than Google?
If they're actually a technology company driven by the demands of their users and striving for technical excellence, then do you expect Google to support built-in ad-blocking in Chrome and Android like Apple already supports it in Safari and iOS any time soon?
Consider javascript talking to common lisp. Of course JSON has a canonical mapping to javascript, but it does not for common lisp. Should a JS array be a lisp list or vector? Should lisp's NIL be false or null? Should a JS object decode to an alist, plist, or hash-table? &ct.
For many years, before JS came along, I was in the schemaless camp. Then for a number of years I was in the self-describing camp, because I figured that if we didn't accept that JavaScript and JSON are pretty fundamental on the web we'd be fools, and everyone seemed to be passing around JSON. So in that period I thought MsgPack was pretty damn good.
Recently I've switched back to the schemaless view, but with strict order preservation and richly typed fields. Very "tuple" based... so works well with Lisp and JavaScript but also C++. Highly inspired by Linda.
I don't do what Cap'n Proto does and lay out the fields and all that good stuff so that you can kind of memory map it onto structs.
That is nice and I understand the motivation for sure, and I have worked on systems that do that in the past with very good results, but currently my thinking is that compactness without additional compression is a good balance.
Also, since the protocol is order preserving (where it matters) you can do radix sort operations or hash maps on the server side extremely quickly. That was the ultimate motivating factor.
[SomeDateTime, "hello", 1.0f, [10,20], "Foo", false, [SomeMatrix]] etc etc...
Inner tuples or BLOBs are length-prefixed, of course, so you can skip them, but basic types and strings are not. Strings are zero-terminated, while Ints/Floats/Doubles are BE and complement-encoded to preserve sorts, and Integers are also packed to minimum size.
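For readers wondering how an integer encoding can "preserve sorts", here is one common way to do it, sketched under my own assumptions rather than taken from the format described above: fixed-width 64-bit values (no minimum-size packing), big-endian, with the sign bit flipped by adding 2^63. `encode_i64` and `decode_i64` are hypothetical names. Comparing the encoded bytes then gives the same order as comparing the numbers.

```python
import struct

BIAS = 1 << 63  # adding this flips the sign bit of a 64-bit two's-complement value


def encode_i64(n):
    # Big-endian so the most significant byte is compared first.
    return struct.pack(">Q", n + BIAS)


def decode_i64(data):
    return struct.unpack(">Q", data)[0] - BIAS


values = [-5, 3, 0, -(2**63), 2**63 - 1, 1]
by_bytes = [decode_i64(b) for b in sorted(encode_i64(v) for v in values)]
assert by_bytes == sorted(values)

# The same trick fails with little-endian bytes: 256 would sort before 1.
assert struct.pack("<Q", 256) < struct.pack("<Q", 1)
```

This is what makes bytewise operations such as radix sort work directly on the encoded form.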
Memory mapping is possible on this system too, but it's very "functional"... there's no attempt at pointer preservation. I don't go that far and Cap'n P seems to be preserving some of the semantics of ProtoBuf at least in that regard, which I'm sure is a good thing for many scenarios.
You could still do that with what I'm doing but it would be at the application level. Same with self-description actually, you could easily build something that looks like JSON if you wanted... either:
[["foo",1.0],["boo","cat"]...] etc
or
[[1.0,"cat"],["foo","boo"]]
the choice is really up to the application programmer.
That might be a bit "loose" for a lot of people to stomach, but it works very well for what I'm doing, has a lot of flexibility and packs really well
Like I was suggesting, protocols are something of an art and I don't think we're at a final solution yet, which is why many people are constantly inventing new ones :-)
Hopefully we will get to some consensus one day!
Really hope that it can be fixed at some point, but not holding my breath on that one.
I think it will take a major effort to reform that format, although I am hopeful now that we at least have Uint8Array and friends that are starting to expose a broader set of machine-friendly types.
I have this crazy idea that there's a middle ground: self-describing and schemas?
We can kind of glue this together (there's lots of json schemas floating around out there now, it seems), but it would be awfully interesting to see these well-supported as a pair (with explicitly language agnostic schema definitions -- which seems to be a sticking point for most of the json schema strapons).
</tangent>
https://github.com/sandstorm-io/capnproto/blob/master/c++/sr...
(There are also similar libraries for Protobuf.)
(I use CBOR a lot -- I'm otherwise quite happy with it!)
EDIT: I guess there's a "CDDL" listed on the tools page, but... It's still single implementation (ruby) and I don't see a clear link to a grammar for it.
I'll definitely give it a whirl. Cheers!
The thing is that it is least common denominator, and when you are dealing with high perf, cross language systems, it really isn't a good wire format or storage format.
It takes ages to parse, it's lossy, lacks commonly used types (or you have to annotate it with non standard attributes)... or worse guess the intention, and it's pretty verbose.
But again, that said it is a widely used standard and one that we have to live with. So there is that.