Suggestion for a project: make a tool that, given a proto description and a file that contains concatenated proto messages stored as binary strings (sort of like RecordIO at Google) lets you run simple SQL queries on the data and extract a subset of the fields from messages matching a predicate, and maybe even do simple aggregations. That was pretty handy. I really wish Google would open source some or most of this stuff. It’s not like keeping it closed source creates any kind of insurmountable competitive advantage, especially compared to the advantages that would accrue from broader adoption of protobufs.
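To make that suggestion a bit more concrete, here is a rough sketch of the query loop such a tool might run. Everything in it is an assumption for illustration: the framing (each record prefixed with a protobuf-style varint length) is just one plausible layout and is not the actual RecordIO format, and `read_records` and `select` are made-up names. Decoding is left as a pluggable function; the demo uses JSON as a stand-in, where a real tool would decode with the compiled proto description.

```python
import io
import json
from typing import Any, BinaryIO, Callable, Dict, Iterator, List, Optional


def write_varint(out: BinaryIO, value: int) -> None:
    """Write a non-negative integer as a base-128 varint (protobuf style)."""
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.write(bytes([byte | 0x80]))  # more bytes follow
        else:
            out.write(bytes([byte]))
            return


def read_varint(stream: BinaryIO) -> Optional[int]:
    """Read one varint; return None on a clean end of stream."""
    result, shift = 0, 0
    while True:
        raw = stream.read(1)
        if not raw:
            if shift == 0:
                return None
            raise EOFError("truncated varint")
        result |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:
            return result
        shift += 7


def read_records(stream: BinaryIO) -> Iterator[bytes]:
    """Yield each length-prefixed payload (assumed framing, not RecordIO)."""
    while True:
        length = read_varint(stream)
        if length is None:
            return
        payload = stream.read(length)
        if len(payload) != length:
            raise EOFError("truncated record")
        yield payload


def select(stream: BinaryIO,
           decode: Callable[[bytes], Dict[str, Any]],
           fields: List[str],
           where: Callable[[Dict[str, Any]], bool]) -> Iterator[Dict[str, Any]]:
    """Roughly 'SELECT fields FROM stream WHERE where(msg)'."""
    for payload in read_records(stream):
        msg = decode(payload)
        if where(msg):
            yield {name: msg.get(name) for name in fields}


# Demo with JSON payloads standing in for serialized proto messages.
buf = io.BytesIO()
for row in [{"term": "cats", "hits": 3}, {"term": "dogs", "hits": 12}]:
    payload = json.dumps(row).encode()
    write_varint(buf, len(payload))
    buf.write(payload)
buf.seek(0)
rows = list(select(buf, json.loads, ["term"], lambda m: m["hits"] > 5))
```

Aggregations would sit on top of `select` as ordinary folds over the yielded rows.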
- a tee loadbalancer for gRPC, forwarding the same requests to both A and B backend pools, but only returning results from A. I don't think Envoy has this, but it should.
- load balancing dashboards showing traffic between frontends and backends
- load balancer support for dynamic sharding
- gnubbyd under ChromeOS: https://groups.google.com/a/chromium.org/forum/m/#!msg/chrom... (I think most of this is doable these days, but the initial setup requires a Linux system)
- Kubernetes: server-specific custom hyperlinks on dashboards (e.g. links to POD_IP:PORT/stats, /debug, etc. for each individual pod you are looking at)
- Kubernetes: multiple Docker images in the same container or pod. E.g. the first container could be your code, while the second one might be data or the JVM runtime, etc., without having to bundle them together or doing costly copies in init containers.
- Kubernetes: canaries and automatic rollbacks
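For the tee load balancer in the first bullet, the core behaviour can be sketched in a few lines. This is only an illustration under assumptions: `TeeForwarder` is a hypothetical name, the backends are plain callables rather than real gRPC stubs, and picking a backend within each pool is left out entirely.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Generic, TypeVar

Req = TypeVar("Req")
Resp = TypeVar("Resp")


class TeeForwarder(Generic[Req, Resp]):
    """Send every request to both backends, return only the primary's reply."""

    def __init__(self,
                 primary: Callable[[Req], Resp],
                 shadow: Callable[[Req], object],
                 max_workers: int = 4) -> None:
        self._primary = primary  # pool A: its response goes to the caller
        self._shadow = shadow    # pool B: its response is discarded
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def _call_shadow(self, request: Req) -> None:
        try:
            self._shadow(request)
        except Exception:
            # A failing shadow backend must never affect the caller.
            pass

    def handle(self, request: Req) -> Resp:
        # Fire the shadow copy in the background, then serve from the primary.
        self._pool.submit(self._call_shadow, request)
        return self._primary(request)

    def close(self) -> None:
        """Wait for outstanding shadow calls and release the worker threads."""
        self._pool.shutdown(wait=True)
```

A real implementation would also want a bound on queued shadow requests, so that a slow B pool cannot build up unbounded work behind A.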
Envoy can do this, via its shadowing feature. See the docs here: https://www.envoyproxy.io/docs/envoy/v1.6.0/api-v2/api/v2/ro....
Hot off the presses: https://cloudplatform.googleblog.com/2018/04/introducing-Kay.... Though you have to use Spinnaker.
I would call that a "(live) traffic replayer" rather than a load balancer. "load balance" implies to me that the upstream traffic is divvied up among the downstream sinks, not that the upstream traffic gets copied to multiple downstreams.
Looks like some parts of it have escaped… https://github.com/eclesh/recordio
https://github.com/google/leveldb/blob/master/doc/log_format...
https://github.com/google/leveldb/blob/master/db/log_reader....
https://github.com/google/leveldb/blob/master/db/log_writer....
I think the decision not to open-source RecordIO is likely related to legacy baggage that's baked into the format. The LevelDB format above avoids that.
It doesn't appear that the headers for this are public though.
Why not use SQLite[1] for storing this data? Storing structured data in binary format, and being able to run SQL queries on it, is already possible with SQLite, right?
The main thing stopping this endeavour is probably that, to the best of my knowledge, there isn’t any standardization in the Protobuf community around file formats that serialize multiple messages together, like RecordIO - that, and my C skills are pretty rusty by now :)
Also, the code is basically about not being a jerk to other people. Seems like a low bar to meet.
Anything at Google that does not support load balancing is doomed to melt fairly quickly.
But H2 has ARRAY for repeated fields, and with some custom functions for decomposing other functions, you could get pretty far.
Just saying, not perfect, but could be useful without too much effort.
cat /path/to/log.file | prototool binary-to-json path/to/proto/files foo.bar.Baz - | jq .search.term
(Source: I'm a SWE at an Alphabet company)
As someone who has spent quite a bit of time at Google working on a high performance file format (not RecordIO):
1. I'd also add LZ4 and/or Snappy for the cases where they are more Pareto-optimal (i.e. fast, network attached, remote storage, such as SSD Colossus, or its external proxy: SSD Persistent Disk).
2. IMO HighwayHash is overkill here, and the author should have used CRC32C instead. You don't particularly care about collisions in this case, you're detecting data corruption. CRC32C is perfect for that, and it's hardware accelerated in almost all recent Intel and ARM CPUs, and it's half the size on disk.
3. It'd be pretty cool to introduce some kind of metadata which would tell the user what type of message is encoded in the file. This is not something RecordIO has, but internal tools can guess most of the time because they have all the proto definitions at their disposal. There's no need to store it in every header, just the first one. I would advise against storing the full schema (that can get very gnarly in the presence of proto dependencies and extensions), but just have something lightweight, i.e. message name and perhaps SCM revision number or hash in the file header, so that the user (or the external system consuming the files) could somewhat reliably establish what the format is later on, when the proto definition drifts. Otherwise, this being a binary serialized file format, it's very easy to end up in a situation where you have some files from years ago and you no longer know how to read them. And yes, I'm aware that SCM hash can change if history is edited.
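Points 2 and 3 can be sketched together. The layout below is entirely made up for illustration (the `PBRX` magic, the JSON header, and the field widths are assumptions, not RecordIO or Riegeli): the message name and revision are stored once in the file header, and each record carries a 4-byte CRC32C. The checksum is a plain table-driven software version so the example has no dependencies; production code would use a hardware-accelerated library.

```python
import io
import json
import struct
from typing import BinaryIO, Dict, Iterable, List, Tuple

_POLY = 0x82F63B78  # Castagnoli polynomial, bit-reflected form


def _make_table() -> List[int]:
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Software CRC32C; crc32c(b"123456789") is the standard 0xE3069283."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


MAGIC = b"PBRX"  # hypothetical magic bytes for this sketch


def write_file(out: BinaryIO, message_name: str, revision: str,
               payloads: Iterable[bytes]) -> None:
    # Metadata is written once, in the file header, not per record.
    header = json.dumps({"message": message_name, "revision": revision}).encode()
    out.write(MAGIC + struct.pack("<I", len(header)) + header)
    for payload in payloads:
        out.write(struct.pack("<II", len(payload), crc32c(payload)) + payload)


def read_file(src: BinaryIO) -> Tuple[Dict[str, str], List[bytes]]:
    if src.read(4) != MAGIC:
        raise ValueError("not a record file")
    (header_len,) = struct.unpack("<I", src.read(4))
    meta = json.loads(src.read(header_len))
    records = []
    while True:
        head = src.read(8)
        if not head:
            break  # clean end of file
        if len(head) != 8:
            raise ValueError("truncated record header")
        length, expected = struct.unpack("<II", head)
        payload = src.read(length)
        if len(payload) != length or crc32c(payload) != expected:
            raise ValueError("corrupt record")
        records.append(payload)
    return meta, records
```

With the message name in the header, a reader can at least tell which proto definition to go looking for before it tries to decode anything.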
[Edit, after looking a bit.]
Pretty different. If I remember correctly, RecordIO is re-synchronizing, whereas Riegeli seems to break things up into 64KB chunks, splitting messages across chunks if necessary.
[Edit, after finding more information.]
Interesting… looks like Riegeli is intended to compress well, rather than just store sequentially. https://encode.ru/threads/2895-Riegeli-%E2%80%94-a-new-compr...
Also, setting up Spinnaker is pretty much as complicated as Kubernetes itself. :-)
Configmaps don't really exist, although something similar is achieved with a job that has a second package holding just the data. This is why I think multi-image containers should be implemented, but also a reason why they haven't been yet: configmaps cover some use cases. When a job replica (task) gets updated or rolled back, both packages change in sync. On Kubernetes, you'd use version numbers in the configmap name (but you need to worry about garbage collecting unused ones).
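The versioned-name approach and its garbage collection worry could look roughly like this. It is a sketch under assumptions: the `-vN` suffix convention, the function names, and the "always keep the newest" policy are made up here, and a real version would list configmaps and their references through the Kubernetes API rather than take them as plain arguments.

```python
import re
from typing import Iterable, List, Set


def versioned_name(base: str, version: int) -> str:
    """Build a name like 'app-config-v3' (hypothetical naming convention)."""
    return f"{base}-v{version}"


def garbage(existing: Iterable[str], in_use: Set[str], base: str,
            keep_latest: int = 1) -> List[str]:
    """Return versioned configmap names that look safe to delete.

    A name is kept if something still references it (in_use) or if it is
    among the newest keep_latest versions, so a rollback target survives.
    Names that do not match the base-vN pattern are never touched.
    """
    pattern = re.compile(rf"^{re.escape(base)}-v(\d+)$")
    versions = sorted(
        (int(m.group(1)), name)
        for name in existing
        if (m := pattern.match(name))
    )
    newest = {name for _, name in versions[-keep_latest:]} if keep_latest > 0 else set()
    return [name for _, name in versions
            if name not in in_use and name not in newest]
```

For example, with v1 through v3 present and a pod still on v2, only v1 would be reported as deletable.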
Services live outside of Borg entirely. GSLB has its own push mechanisms and only consumes Borg's lists of containers that comprise a given Borg job.
From the paper:
> A user can change the properties of some or all of the tasks in a running job by pushing a new job configuration to Borg, and then instructing Borg to update the tasks to the new specification. This acts as a lightweight, non-atomic transaction that can easily be undone until it is closed (committed). Updates are generally done in a rolling fashion, and a limit can be imposed on the number of task disruptions (reschedules or preemptions) an update causes; any changes that would cause more disruptions are skipped. Some task updates (e.g., pushing a new binary) will always require the task to be restarted; some (e.g., increasing resource requirements or changing constraints) might make the task no longer fit on the machine, and cause it to be stopped and rescheduled; and some (e.g., changing priority) can always be done without restarting or moving the task.
Agreed 100%. Although... in my mind k8s is still pretty young, and this would definitely be a great feature to have by default in the future.
Cooking, if there is more than one person involved in preparing and eating the food, will involve politics.
IMHO people to avoid.
I am curious to understand how it contributes to your decision not to integrate the tool into your workflow. I don't typically choose products based on the politics of the company that creates them, which is the closest analogy I can come up with.
My point of view is Morgan Freeman's (something like "stop talking about it, you're making it worse").
Why is a document describing the standards for social interaction that contributors pledge to live up to bad? Open source has been and continues to be rife with social interactions that are bad for individual contributors and for the project as a whole. CoCs strive to head that off and document a process for when people seem to violate their pledge.
It's not about common sense. You don't need to tell people to be polite (and you shouldn't). If someone isn't, everybody will notice it.
Unless it's done in private. If someone is a jerk to me in private (over email or Slack or whatever), what do I do?
If I'm contributing to a project without a CoC and someone is a jerk to me, I'm much more likely to just contribute to a different project, or start my own fork, or just stop contributing to open source entirely.
Projects with CoCs have a stated procedure in place for dealing with abusive contributors. That's what the whole thing is about. Of course you shouldn't have to tell people to be polite, but that's not how the real world works. There are jerks everywhere, and more often than not they're going to express that in private.
You're looking for a way to be less implicated. It's because you're accustomed to living in systems based on laws and rules.
People aren't bad. Most are kind and they don't want to harm anybody. The bad guys are very few (like 1% or 0.1%) and as they are few they should be handled case by case. Building a complicated system of rules for them will just bother the "not-bad" majority.