Prototool – A Swiss Army Knife for Protocol Buffers (github.com)

215 points by _mway 8 years ago | 62 comments

In another decade or so the world might replicate half of the very nice internal tools Google has.

Suggestion for a project: make a tool that, given a proto description and a file that contains concatenated proto messages stored as binary strings (sort of like RecordIO at Google) lets you run simple SQL queries on the data and extract a subset of the fields from messages matching a predicate, and maybe even do simple aggregations. That was pretty handy. I really wish Google would open source some or most of this stuff. It’s not like keeping it closed source creates any kind of insurmountable competitive advantage, especially compared to the advantages that would accrue from broader adoption of protobufs.
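A minimal sketch of what such a tool could look like, assuming each record is stored with a varint length prefix (the framing used by protobuf's own delimited-stream helpers; RecordIO's actual on-disk layout is not shown here). The names `select`, `decode`, `where` and `fields` are hypothetical. With generated Python classes, `decode` would probably wrap something like `MyMessage.FromString`; the demo uses JSON payloads purely as a stand-in so it runs with the standard library alone.

```python
import io
import json


def write_varint(stream, value):
    """Write a non-negative integer as a base-128 varint."""
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            stream.write(bytes([byte | 0x80]))  # more bytes follow
        else:
            stream.write(bytes([byte]))
            return


def read_varint(stream):
    """Read one varint; return None on a clean end of file."""
    result = 0
    shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            if shift == 0:
                return None
            raise EOFError("truncated varint")
        result |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:
            return result
        shift += 7


def write_record(stream, payload):
    """Append one serialized message, prefixed with its length."""
    write_varint(stream, len(payload))
    stream.write(payload)


def read_records(stream):
    """Yield the raw bytes of each record in the file, in order."""
    while True:
        size = read_varint(stream)
        if size is None:
            return
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("truncated record")
        yield payload


def select(stream, decode, where, fields):
    """Rough equivalent of: SELECT fields FROM stream WHERE where(row)."""
    for payload in read_records(stream):
        row = decode(payload)
        if where(row):
            yield {name: row.get(name) for name in fields}


# Demo with JSON stand-ins for serialized messages.
buf = io.BytesIO()
for row in [
    {"user": "a", "status": 200, "ms": 12},
    {"user": "b", "status": 500, "ms": 340},
    {"user": "c", "status": 500, "ms": 90},
]:
    write_record(buf, json.dumps(row).encode())
buf.seek(0)

errors = list(select(buf, json.loads, lambda r: r["status"] >= 500, ["user", "ms"]))
```

Every record is fully decoded before the predicate runs, which is the price of a row-oriented log; aggregations would be a fold over the same generator.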

puzzle 8 years ago

Other tools and features that don't exist outside:

- a tee loadbalancer for gRPC, forwarding the same requests to both A and B backend pools, but only returning results from A. I don't think Envoy has this, but it should.

- load balancing dashboards showing traffic between frontends and backends

- load balancer support for dynamic sharding

- gnubbyd under ChromeOS: https://groups.google.com/a/chromium.org/forum/m/#!msg/chrom... (I think most of this is doable these days, but the initial setup requires a Linux system)

- Kubernetes: server-specific custom hyperlinks on dashboards (e.g. links to POD_IP:PORT/stats, /debug, etc. for each individual pod you are looking at)

- Kubernetes: multiple Docker images in the same container or pod. E.g. the first container could be your code, while the second one might be data or the JVM runtime, etc., without having to bundle them together or do costly copies in init containers.

- Kubernetes: canaries and automatic rollbacks

necubi 8 years ago

> - a tee loadbalancer for gRPC, forwarding the same requests to both A and B backend pools, but only returning results from A. I don't think Envoy has this, but it should.

Envoy can do this, via its shadowing feature. See the docs here: https://www.envoyproxy.io/docs/envoy/v1.6.0/api-v2/api/v2/ro....

kodablah 8 years ago

> Kubernetes: canaries and automatic rollbacks

Hot off the presses: https://cloudplatform.googleblog.com/2018/04/introducing-Kay.... Though you have to use Spinnaker.

philsnow 8 years ago

> a tee loadbalancer for gRPC, forwarding the same requests to both A and B backend pools, but only returning results from A

I would call that a "(live) traffic replayer" rather than a load balancer. "load balance" implies to me that the upstream traffic is divvied up among the downstream sinks, not that the upstream traffic gets copied to multiple downstreams.
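The distinction can be sketched in a few lines. This is only an illustration, not how Envoy's shadowing or any Google-internal balancer is implemented: backends are modeled as plain Python callables, and `load_balance`, `tee` and the thread pool are hypothetical names.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Background workers for mirrored calls, so the caller never waits on them.
_shadow_pool = ThreadPoolExecutor(max_workers=4)


def load_balance(request, backends):
    """Load balancing: each request goes to exactly one backend."""
    return random.choice(backends)(request)


def tee(request, primary, shadow):
    """Mirroring: both backends see the request, only primary's reply is used."""

    def fire_and_forget():
        try:
            shadow(request)
        except Exception:
            pass  # a failing shadow backend must never affect the caller

    _shadow_pool.submit(fire_and_forget)
    return primary(request)


# Demo: B records what it saw, but its reply is discarded.
seen_by_b = []


def backend_a(req):
    return "A:" + req


def backend_b(req):
    seen_by_b.append(req)
    return "B:" + req


reply = tee("ping", backend_a, backend_b)
_shadow_pool.shutdown(wait=True)  # demo only: wait for the mirrored call to finish
```

In `load_balance` the traffic is divided; in `tee` it is copied, and the copy's result and errors are thrown away.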

smarterclayton 8 years ago

Image-based volumes (the second-to-last bullet) have long been blocked on the container runtime having a really clean way to enable and keep the container filesystems mounted. Definitely something I want to see fixed, since otherwise you just end up doing hacky copies via emptyDir.

akhilcacharya 8 years ago

Should you really be detailing the functionality of internal tools like this?

zellyn 8 years ago

When I was at Google, I kept an eye on the open sourcing of RecordIO. Apparently there was no desire not to open source it: it was simply that nobody had the time to disentangle and/or clean it up for release.

Looks like some parts of it have escaped… https://github.com/eclesh/recordio

haberman 8 years ago

I think the open-source equivalent of RecordIO is the leveldb log format:

https://github.com/google/leveldb/blob/master/doc/log_format...

https://github.com/google/leveldb/blob/master/db/log_reader....

https://github.com/google/leveldb/blob/master/db/log_writer....

I think the decision not to open-source RecordIO is likely related to legacy baggage that's baked into the format. The LevelDB format above avoids that.

It doesn't appear that the headers for this are public though.

vinkelhake 8 years ago

If you were interested in RecordIO, then this project might also be of interest to you: https://github.com/google/riegeli

dekhn 8 years ago

TFRecords are the closest thing to RecordIO that has Google support.

Willson50 8 years ago

You might be interested in KSQL, SQL queries that run on Kafka streams. https://www.confluent.io/product/ksql/

throwaway84742 8 years ago

Nah. I’m interested in quickly querying on-disk data specifically, i.e. proto-based application logs and the like (another thing the world needs to adopt more broadly, imo).

reacharavindh 8 years ago

Have limited protobuf knowledge.

Why not use SQLite[1] for storing this data? Storing structured data in binary format, and being able to run SQL queries on it, is already possible with SQLite, right?

[1] - https://www.sqlite.org/appfileformat.html

SOLAR_FIELDS 8 years ago

While not exactly what you are describing, I work for another company that uses protobufs extensively and we have some nice internal tools similar to what you describe. I really wish we could open source those too. I feel like the wheel is reinvented a lot with protobuf in several of the large companies who use it.

endymi0n 8 years ago

I suspect the SQL case could be thrown together fairly easily with PostgreSQL and a custom Foreign Data Wrapper based on protobuf-c (prior art: cstore_fdw by the Citus folks). Proto definitions should then compile rather cleanly to table definitions, at least one level down (PG isn’t so good with nested structures).

The main thing stopping this endeavour is probably that, to the best of my knowledge, there isn’t any standardization in the Protobuf community around file formats that serialize multiple messages together, like RecordIO. That, and my C skills are pretty rusty by now :)

grandinj 8 years ago

You could pretty easily add a TableEngine extension to H2 (h2database.com), which would give you full SQL query functionality over such a file.

throwaway84742 8 years ago

Nope. Protos have repeated fields and can be hierarchical (that is, can contain other protos) and even recursive (that is, contain themselves, possibly as repeated fields). H2 is not going to work.
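To make the objection concrete, here is a small sketch in which dicts and lists stand in for a decoded message. The recursive `Node` shape (a name plus repeated child nodes) and the `flatten` helper are hypothetical; nothing here uses H2 or the protobuf library.

```python
def flatten(message, prefix=""):
    """Yield (column_path, scalar_value) pairs for a nested dict/list structure."""
    if isinstance(message, dict):
        for name, value in message.items():
            yield from flatten(value, f"{prefix}.{name}" if prefix else name)
    elif isinstance(message, list):
        # A repeated field: every element needs its own column path.
        for index, value in enumerate(message):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, message


# A recursive "Node" message: each node has a name and repeated child nodes.
tree = {
    "name": "root",
    "children": [
        {"name": "a", "children": []},
        {"name": "b", "children": [{"name": "c", "children": []}]},
    ],
}

columns = [path for path, _ in flatten(tree)]
```

The set of column paths depends on the data rather than on the schema, so a deeper or wider tree produces columns that no fixed table definition anticipated.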

chrissnell 8 years ago

I would also love to see a protobuf/gRPC decoder for wireshark. Bonus: the ability to filter sniffed packets based on a field value.

lobster_johnson 8 years ago

How does RecordIO compare with Parquet and Arrow? Different use cases?

throwaway84742 8 years ago

Don’t know about Arrow, but Parquet is a columnar format. Such formats can’t write record-by-record; they need a large number of records to shred into columns in order to realize their columnar benefits. In contrast, appending to RecordIO is little more than writing a binary string. The downside of RecordIO is that you can’t just read some fields in a message and not others: you have to deserialize the whole message. RecordIO is cheap to write and well suited for cases where reading the entire message is not that big a deal. Columnar formats are better suited for cases where it’s OK to pay the relatively substantial up-front encoding cost for vastly greater performance in analytical workloads. Advanced ones contain additional metadata (such as range and hash constraints; the former can be both per file and per block), which the analytical runtime can use to avoid doing work that doesn’t need to be done.
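A toy illustration of that trade-off, using in-memory lists rather than the real RecordIO or Parquet formats. The function names and the block layout are hypothetical, and the sketch assumes every record has the same fields with comparable values.

```python
def append_record(log, record):
    """Row-oriented write: one record in, one entry appended immediately."""
    log.append(record)


def shred(records, block_size):
    """Column-oriented write: buffer a block, then store each field separately."""
    blocks = []
    for start in range(0, len(records), block_size):
        chunk = records[start:start + block_size]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        # Per-block range metadata: (min, max) for every column.
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        blocks.append({"columns": columns, "stats": stats})
    return blocks


def sum_where_at_least(blocks, filter_col, threshold, value_col):
    """SUM(value_col) WHERE filter_col >= threshold, skipping blocks via stats."""
    total = 0
    blocks_read = 0
    for block in blocks:
        _, highest = block["stats"][filter_col]
        if highest < threshold:
            continue  # the range metadata proves no row in this block matches
        blocks_read += 1
        keys = block["columns"][filter_col]
        values = block["columns"][value_col]  # only the two needed columns are touched
        total += sum(v for k, v in zip(keys, values) if k >= threshold)
    return total, blocks_read


log = []
for ts in range(8):
    append_record(log, {"ts": ts, "bytes": 10 * ts})

blocks = shred(log, block_size=4)
total, blocks_read = sum_where_at_least(blocks, "ts", 6, "bytes")
```

The append path does no work beyond storing the record, while the shredded form needs a full block before it can be written but lets the query skip the first block entirely.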

erik_seaberg 8 years ago

Sounds a lot like a Hive query over self-describing Avro files.

adam_gyroscope 8 years ago

We built https://github.com/GyroscopeHQ/grpcat at my company, which takes text-format protos as input and sends them to a gRPC endpoint. Looking at Prototool I think I should just merge the functionality into Prototool. This is cool!

kodablah 8 years ago

It says "Handle installation of protoc [...] behind the scenes in a platform-independent manner without any work on the part of the user", but it doesn't support Windows yet [0]. Granted, since it's pre-1.0, I should probably read the features as goals.

0 - https://github.com/uber/prototool/issues/9

flippmoke 8 years ago

Here are some other great tools that are quite useful with protobufs, one in C++ and one in pure JavaScript.

https://github.com/mapbox/protozero

https://github.com/mapbox/pbf

grizzles 8 years ago

danby (gRPC for the browser) is looking for testers: https://github.com/ericbets/danby

There are two upcoming features. The first is streaming support. The second is a callback API template that mirrors the gRPC Node API exactly; you will also have the choice to stick with the current promise API. It's not a priority for us at the moment, but adding a simple load balancer that distributes traffic randomly across a set of servers would be a ~5 line patch.

hurricaneSlider 8 years ago

Always wished there was a tool for protobuf that could test whether changes to any .proto files are backwards compatible and, if not, raise an error.
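The core of such a check could be sketched as below. This is not an existing tool's behaviour: it assumes the two versions of a message have already been parsed into hypothetical `{field_number: (name, type)}` maps, and it is deliberately strict. Some type changes (for example between `int32` and `int64`) are wire-compatible, and a rename does not change the wire format but does break generated code and JSON, so what counts as "breaking" is a policy choice.

```python
def breaking_changes(old, new):
    """Compare two {field_number: (name, type)} maps for one message."""
    problems = []
    for number, (old_name, old_type) in sorted(old.items()):
        if number not in new:
            problems.append(f"field {number} ({old_name}) was removed")
            continue
        new_name, new_type = new[number]
        if new_type != old_type:
            problems.append(
                f"field {number} ({old_name}) changed type {old_type} -> {new_type}"
            )
        if new_name != old_name:
            problems.append(f"field {number} renamed {old_name} -> {new_name}")
    # Field numbers that exist only in `new` are additions and are not flagged.
    return problems


old_schema = {1: ("id", "int64"), 2: ("name", "string"), 3: ("email", "string")}
new_schema = {1: ("id", "string"), 2: ("full_name", "string"), 4: ("phone", "string")}

problems = breaking_changes(old_schema, new_schema)
```

A CI wrapper could print `problems` and exit non-zero whenever the list is not empty.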

ris 8 years ago

Yet another tool trying to "manage" "packages" on my machine!

mabynogy 8 years ago

I can't use something with a CoC.

durkie 8 years ago

this seems like it's only relating to people wishing to contribute to prototool. also, it's uber, so it's nice/expected that they would have this sort of thing.

also the code is basically about not being a jerk to other people. seems like a low bar to meet.

n42 8 years ago

Why not?

recursive 8 years ago

I'm guessing the code disallows their conduct.

mabynogy 8 years ago

It's a political tool. It means they are into politics.