They also seriously need to give CloudWatch a UI/UX overhaul.
1. https://opencensus.io/introduction/#partners-contributors
E.g., Datadog is basing their newer tracing libraries on OpenTracing, and Prometheus devs are behind OpenMetrics.
OpenTracing and OpenMetrics are more like API specs, with libraries left to others to implement, and they're rarely used standalone, so there is little reason for them to be separate projects. The best option for the industry would be to fold OT and OM into OC and make a single stack, and hopefully include structured logging as well.
E.g. Linkerd gives you service "golden metrics" (success rate, latency distribution, request volumes) without any app changes. It can draw the service topology too, since it's observing everything in realtime. https://linkerd.io/2/features/telemetry/
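For anyone unsure what the three "golden metrics" amount to, here is a minimal Python sketch that derives them from a window of requests as a proxy might observe them. It is not Linkerd's implementation: `ObservedRequest`, `golden_metrics`, the nearest-rank percentile, and treating any status below 500 as a success are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ObservedRequest:
    status: int        # HTTP status code seen by the proxy
    latency_ms: float  # time from request to response


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def golden_metrics(requests, window_seconds):
    """Summarise one observation window into the three golden metrics."""
    if not requests:
        return {"success_rate": None, "requests_per_second": 0.0, "latency_ms": {}}
    latencies = sorted(r.latency_ms for r in requests)
    # Assumption: anything below 500 counts as a success.
    successes = sum(1 for r in requests if r.status < 500)
    return {
        "success_rate": successes / len(requests),
        "requests_per_second": len(requests) / window_seconds,
        "latency_ms": {p: percentile(latencies, p) for p in (50, 95, 99)},
    }
```

The point is that all three numbers fall out of data a proxy already sees on the wire (status and timing), which is why no application change is needed to collect them.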
There is literally nothing else quite like it in the market, and it gives you distributed tracing, automatic metric collection, and pre-defined alerts for a reasonable price.
https://docs.instana.io/core_concepts/tracing/#supported-tec...
The last thing I would want in a production environment is to have some 3rd party software monkey-patching the code at runtime.
What happens when:
- a bug only occurs (due to timing or some other extremely subtle issue) when this monkey-patching is applied.
- there's a bug in the monkey-patching itself (sounds like a fun debugging session!)
- a library is accidentally monkey-patched with a slightly different version, or falsely detected as a known library (maybe it is a fork)
Give me statically compiled, reproducible, dependency-free musl binaries, bit-for-bit identical with what has been thoroughly tested in CI, any day. That's how you avoid getting woken up at 4am.
This kind of magic should happen at compile time, if at all.
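To make "monkey-patching the code at runtime" concrete, here is a minimal Python sketch of the general technique. It is not any vendor's agent: `HttpClient` is a stand-in for a third-party library class, and `instrument` / `uninstrument` are hypothetical helpers invented for this example.

```python
import functools
import time


class HttpClient:
    """Stand-in for a third-party library class (hypothetical)."""

    def get(self, url):
        return f"response from {url}"


def instrument(cls, method_name, recorded):
    """Replace cls.<method_name> at runtime with a timing wrapper."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        try:
            return original(self, *args, **kwargs)
        finally:
            # Record a (name, duration) pair even if the call raised.
            recorded.append((method_name, time.perf_counter() - start))

    setattr(cls, method_name, wrapper)
    return original  # kept so the patch can be undone later


def uninstrument(cls, method_name, original):
    """Put the original method back."""
    setattr(cls, method_name, original)
```

Note that the application source never mentions `instrument`: the code that actually runs in production differs from the code in the repository, which is exactly what the concerns above are about.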
Can we please stop the buzzword train?
I apologize if this is a naive question, but how come this wasn't included as part of the Kubernetes project, given that it has the same Google origins?
That being said, I have been looking for a while and I can't find anyone who uses it in production on a platform other than Kubernetes.
Meshes are a lot more than just sidecar proxying -- they are what make sidecar proxying manageable, and they add a lot of other features like authentication, network policies, various other traffic control policies, service discovery, etc. They are an attempt to do for service-to-service communication what Kubernetes has done for container deployment -- make it abstract and declarative, with configurations that are independent from the underlying implementation.
The underlying implementation that works right now is the Kubernetes API and etcd, and alternate implementations need to be provided for those features to work well outside of Kubernetes. I think it will happen sometime in the next few years.
In a monolith you need to implement some of this stuff only once and you don't need a lot of it at all because you are not making remote procedure calls.
(There are also obvious technical reasons for decoupling something like this from Kubernetes, mostly the opinionated nature of forcing a service mesh over other potential solutions).
Or more precisely, which industry?
There are 0 advantages offered by opentracing and openmetrics over opencensus to defend having separate projects.
OpenCensus began in 2018.
Don't get me wrong, it's better for Google and Microsoft to collaborate on OpenCensus instead of continuing to develop separate client libraries for Stackdriver and Application Insights. It would be lovely if AWS joined too.
I just want to point out that there was a lot of community effort and vendor adoption around OpenTracing by 2018 that Google chose to ignore. If you want to reduce fragmentation and reimplementation, criticize that decision, not the existence of OpenTracing.
It's OK that you're not, but I hope you can agree that engineering observability is neither cheap nor easy. If you're using standard libraries, frameworks, and tooling (and not going way off the rails), we have observed that, for the most part, our agent works as intended.
We always recommend our customers run the agent in their test and integration environments, but you are correct, there are always risks involved. Other than the automation, how is this any different than putting a New Relic jar into your Java app, or including a Datadog library? We simply figured out how to do it automatically at runtime.
Testing with the agent would certainly help, but then you lose some of the "ease of use" benefits as I expect you would have to run a mini cluster in CI in order to run your agent?
There are a few important differences between this and a "normal" dependency:
- Even if the application is fully tested with your agent, it could be something as simple as turning your agent off that could break things.
Hypothetical scenario: multiple instances of the application are running with your agent enabled. Someone decides to turn off monitoring for some reason - nothing bad happens and they go home at the end of the day. Later on, some instances are restarted, or the cluster is re-scaled. Now you have half your cluster on a different code-base and your serialisation breaks because you were doing something silly like using pickle or a java object stream.
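A rough Python simulation of that scenario, in a single process. Everything here is an assumption for illustration: `app_models`, `Order`, `TracedOrder`, and an agent that swaps in a subclass are all hypothetical, and no claim is made that any real agent behaves this way.

```python
import pickle
import sys
import types


def make_node_module(instrumented):
    """Build the hypothetical 'app_models' module as one node would load it."""
    mod = types.ModuleType("app_models")

    class Order:
        def __init__(self, order_id):
            self.order_id = order_id

    # Make the class look like a top-level class of app_models to pickle.
    Order.__module__, Order.__qualname__ = "app_models", "Order"
    mod.Order = Order
    mod.make_order = Order

    if instrumented:
        # Assumption: the agent injects a subclass and routes construction to it.
        class TracedOrder(Order):
            pass

        TracedOrder.__module__, TracedOrder.__qualname__ = "app_models", "TracedOrder"
        mod.TracedOrder = TracedOrder
        mod.make_order = TracedOrder
    return mod


def roundtrip(sender_instrumented, receiver_instrumented):
    """Pickle on one simulated node, unpickle on another."""
    sys.modules["app_models"] = make_node_module(sender_instrumented)
    payload = pickle.dumps(sys.modules["app_models"].make_order(42))
    sys.modules["app_models"] = make_node_module(receiver_instrumented)
    try:
        return pickle.loads(payload).order_id
    except AttributeError as exc:
        # pickle stores the class by name; the receiver has no such name.
        return f"unpickle failed: {exc}"
    finally:
        sys.modules.pop("app_models", None)
```

With both nodes in the same state, `roundtrip` returns `42`. With an instrumented sender and a plain receiver, it returns a string starting with `unpickle failed`, because the payload names a class that only exists on patched nodes.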
- The examples I mentioned in my previous comment would not happen with a normal dependency, because the version of that dependency would already be managed through standard means. If I were to go and look at the code, I would be able to see the actual code that is running, and the exact versions of all dependencies used.
Anyway, I feel like we’ve come to an impasse. There is no monitoring solution out there which is bug-free (even opentracing and its various implementations have caused performance/stability issues, re: https://github.com/opentracing-contrib/java-spring-web/pull/...).
Regarding CI, our agent has no requirements other than a supported OS - you could be running your integration tests as a bare JVM and our agent would detect, instrument, and monitor it the same way as if it were running inside a CRI-O container on K8S (though I’d question why you would run your integration tests in that manner).
IRT your examples, I’ll be brief in my responses because you’re not wrong, but the engineers on our team have taken great care to ensure we don’t break our customers’ environments (we run on systems which process tens of millions of requests an hour and where minutes of downtime cause losses in the hundreds of thousands).
We dynamically unload our sensors/instrumentation when the agent is unloaded, so the likelihood of the issue mentioned earlier is slim (though nothing is impossible).
We also don’t instrument serialization methods (unless you were to decide to use our SDK to do so) so that’d literally never happen. We hook onto methods which handle communication between systems — HTTP request handlers, DB handlers, Messaging System handlers, Schedulers, etc.
Our sensors are open source, so you can check out the code if you’d like (https://github.com/instana). As I said earlier, we live in a world of trade-offs. I’d argue that systems which require the use of a service mesh are complex enough to warrant this level of automation, to provide visibility that, quite frankly, 99.9% of organizations don’t have the time to build themselves.
At a certain level of scale you're running code you didn't write, anyway (some mix of open-source code and code from previous team members) and having exact source with exciting surprises you've never seen before isn't going to save you from getting woken up at 4 AM. Though it might make it easier to fix the problem.
Monkey-patching third party software will totally void the warranty on it. I've been involved in cases like this before and if there's any kind of weird bug that's conceivably related to the monkey-patching, it's hard to get help until you disable it.