The problem with OpenTelemetry

168 points by robgering 2 years ago | 174 comments

I understand what the author is saying, but vendor lock-in with closed-source observability platforms is a significant challenge, especially for large organizations. When you instrument hundreds or thousands of applications with a specific tool, like the Datadog Agent, disentangling from that tool becomes nearly impossible without a massive investment of engineering time. In the Platform Engineering professional services space, we see this problem frequently. Enterprises are growing tired of lock-in to big observability platforms, especially given how opaque Datadog makes your spend on their products, for example.

One of the promises of OTEL is that it allows organizations to replace vendor-specific agents with OTEL collectors, leaving them free to choose the end observability platform. When used with an observability pipeline (such as EdgeDelta or Cribl), you can re-process collected telemetry data and send it to another platform, like Splunk, if needed. Consequently, switching from one observability platform to another becomes a bit less of a headache. Ironically, even Splunk recognizes this and has put substantial support behind the OTEL standard.

OTEL is far from perfect, and maybe some of these goals are a bit lofty, but I can say that many large organizations are adopting OTEL for these reasons.

zeeg 2 years ago | |

I totally agree, I just wish we could do it in a way that doesn’t try to lump every problem into the same bucket. I don’t see what it achieves personally, and I think it’s limiting the ability for the original goals of the project to be as successful as they could be.

unscaled 2 years ago | | |

I'm not sure I get what the problem with OpenTelemetry as it is would be, then? I'm not familiar with the JavaScript implementation, but it seems to be modular. You can just import @opentelemetry/api and @opentelemetry/sdk-trace-web, and as far as I understand you'll get the API (annotations) and the tracing implementation, but without the exporter (OTLP). You can plug in your own exporter or even just use the API - am I missing something?

I think the only issue is that the OpenTelemetry API also includes Metrics and Logs. I just tend to ignore these parts when using OpenTelemetry.

codereflection 2 years ago | | |

Well, telemetry is defined as logs, metrics, traces... So it kinda makes sense that OTEL supports the major aspects of telemetry.

pdimitar 2 years ago | | |

I'm curious as to what do you mean by "lump every problem into the same bucket"?

As a backender and half platform engineer I appreciate OTel a lot, it allows me to install OTel ingesting code and it then gets sent to wherever our platform guys and girls think it's best. It allows me to only think about it once and leave the details to the people who have to maintain the infra.

I mean sure, parts (or maybe all?) of the problems in this area have other solutions i.e. we don't use OTel for logging because we already have Grafana + Loki and basically everything every app outputs in stdout / stderr gets captured and can be queried but I like the flexibility for us to fully migrate to all aspects of OTel one day if the scales tilt in another direction.

So what's your beef with all this?

(For the record, I used Sentry many times in the past and I loved it, it's a very no-BS product that I appreciated a lot -- and it adding OTel ingester / collector I viewed as something very positive.)

andrewmcwatters 2 years ago | |

Yeah, it's the primary reason we used it. If OpenTelemetry's raison d'être was simply to give Datadog a reason to not bullshit their customers on pricing, it would fulfill a major need in platform services.

doctorpangloss 2 years ago |

I don’t know what the Sentry guy is really saying - I mean you can write whatever code you want, go for it man.

But I do have to “pip uninstall sentry-sdk” in my Dockerfile because it clashes with something I didn’t author. And anyway, because it is completely open source, the flaws in OpenTelemetry for my particular use case took an hour to surmount, and vitally, I didn’t have to pay the brain damage cost most developers hate: relationships with yet another vendor.

That said I appreciate all the innovation in this space, from both Sentry and OpenTelemetry. The metrics will become the standard, and that’s great.

The problem with Not OpenTelemetry: eventually everyone is going to learn how to use Kubernetes, and the USP of many startup offerings will vanish. OpenTelemetry and its feature scope creep make perfect sense for people who know Kubernetes. Then it makes sense why you have a wire protocol, why abstraction for vendors is redundant or meaningless toil, and why PostHog and others stop supporting Kubernetes: it competes with their paid offering.

ankitnayan 2 years ago |

I think all of us agree that OpenTelemetry's end-goal of making Observability vendor neutral is futuristic and inevitable. We can complain about it being hard to get started, bloated, etc. but the value it provides is clear, esp. when you are paying $$$ to a vendor and stuck with it.

Open standards also open up a lot of use cases and startups too. SigNoz, TraceTest, TraceLoop, Signadot, all are very interesting projects which OpenTelemetry enabled.

The majority of the problem seems to be that Sentry is not able to provide its Sentry-like features by adopting otel. Getting involved at the design phase could have helped shape the project so that it considered your use cases. The maintainers have never been opposed to such contributions AFAIK.

Regarding scope, limiting otel to just tracing would not be sufficient today, as teams want a single platform for all observability rather than different tools for different signals.

I have seen hundreds of companies switch to opentelemetry and save costs by being able to choose the best vendor supporting their usecases.

Lack of docs, learning curve, etc. are just temporary things that can happen with any big project and should be fixed. Also, otel maintainers and teams have always been seeking help in improving docs, showcasing use cases, etc. If everyone cares enough for the bigger picture, the community and existing vendors should get more involved in improving things rather than just complaining.

phillipcarter 2 years ago | |

> If everyone cares enough for the bigger picture, the community and existing vendors should get more involved in improving things rather than just complaining.

Speaking as one of these maintainers, I would absolutely love it if even half of the vendors who depend heavily on OTel contributed back to the project that enables their business.

My own employer has done this for years now (including hiring people specifically so they can continue to contribute), and we're only at about 200 employees total. I like to imagine how complete the project would feel if Google or AWS contributed to the same degree relative to the size of their business units that depend on OTel.

no_circuit 2 years ago |

IMO this boils down to how one gets paid to understand or misunderstand something. A telemetry provider/founder is being commoditized by an open specification whose development they do not participate in -- implied by the post saying the author doesn't know anyone on the spec committee(s). No surprise here.

Of course implementing a spec from the provider point of view can be difficult. And also take a look at all the names of the OTEL community and notice that Sentry is not there: https://github.com/open-telemetry/community/blob/86941073816.... This really isn't news. I'd guess that a Sentry customer should just be able to use the OTEL API and could just configure a proprietary Sentry exporter, for all their compute nodes, if Sentry has some superior way of collecting and managing telemetry.

IMO most library authors do not have to worry about annotation naming or anything like that mentioned in the post. Just use the OTEL API for logs, or use a logging API where there is an OTEL exporter, and whoever is integrating your code will take care of annotating spans. Propagating span IDs is the job of "RPC" libraries, not general code authors. Your URL fetch library should know how to propagate the Span ID provided that it also uses the OTEL API.

It is the same as using something like Docker containers on a serverless platform. You really don't need to know that your code is actually being deployed in Kubernetes. Using the common Docker interface is what matters.

serverlessmom 2 years ago |

An argument that OpenTelemetry is somehow 'too big' is an example of motivated reasoning. I can understand that A Guy Who Makes Money If You Use Sentry dislikes that people are using OTel libraries to solve similar problems.

Context propagation and distributed tracing are cool OTel features! But they are not the only thing OTel should be doing. OpenTelemetry instrumentation libraries can do a lot on their own, a friend of mine made massive savings in compute efficiency with the NodeJS OTel library: https://www.checklyhq.com/blog/coralogix-and-opentelemetry-o...

zeeg 2 years ago | |

Author here.

OpenTelemetry is not competitive to us (it doesn’t do what we do in plurality), and we specifically want to see the open tracing goals succeed.

I was pretty clear about that in the post though.

serverlessmom 2 years ago | | |

I think that it's disingenuous to say OpenTelemetry and Sentry aren't in competition. I think it would be good news for Sentry if DT were split from the project, and instrumentation and performance monitoring weren't commoditized by broad adoption of those parts of the OpenTelemetry project.

I think you, the author, stand to benefit directly from a breakup of OpenTelemetry, and a refusal to acknowledge your own bias is problematic when your piece starts with a request to 'look objectively.'

wdb 2 years ago |

Personally, I like OpenTelemetry, nice standardised approach. I just wish the vendors had better support for the semantic conventions defined for a wide variety of traces.

I quite like the idea of only needing to change one small piece of the code to switch otel exporters instead of swapping out a vendor trace SDK.

My main gripe with OpenTelemetry is that I don't fully understand what the exact difference is between (trace) events and log records.

AndreasBackx 2 years ago |

I have been trying to find an equivalent for `tracing`, first in Python and this week in TypeScript/JavaScript. At my work I created an internal post called "Better Python Logging? Tracing for Python?" that basically asks this question. OpenTelemetry was also what I looked at, and since then I have looked at other tooling.

It is hard to explain how convenient `tracing` is in Rust and why I sorely miss it elsewhere. The simple part of adding context to logs can be solved in a myriad of ways, yet all boil down to a similar "span-like" approach. I'm very interested in helping bring what `tracing` offers to other programming communities.

It very likely is worth having some people from the space involved, possibly from the tracing crate itself.

zeeg 2 years ago | |

We’ll fund solving this as long as the committees agree with the goal. We just want standard tracing implementations.

(Speaking on behalf of Sentry)

wvh 2 years ago |

I have surveyed this landscape for a number of years, though I'm not involved enough to have strong opinions. We're running a lot of Prometheus ecosystem and even some OpenTelemetry stacks across customers. OpenTelemetry does seem like one of these projects with an ever expanding scope. It makes it hard to integrate parts you like and keep things both computing-wise and mentally lightweight without having to go all-in.

It's not anymore about hey, we'll include this little library or protocol instead of rolling our own, so we can hope to be compatible with a bunch of other industry-standard software. It's a large stack with an ever evolving spec. You have to develop your applications and infrastructure around it. It's very seductive to roll your own simpler solution.

I appreciate it's not easy to build industry-wide consensus across vendors, platforms and programming languages. But be careful with projects that fail to capture developer mindshare.

pdimitar 2 years ago | |

Could you clarify further on your reservations, please? As a programmer I appreciate only having to include a library in my project, give it a set of OTLP settings (host, port, URI) and move on.

What difficulties did opting into OTel give you?

fractalwrench 2 years ago |

The main interest I've seen in OTel from Android engineers has been driven by concerns around vendor lock-in. Backend/devops in their organisations are typically using OTel tooling already & want to see all telemetry in one place.

From this perspective it doesn't matter if the OTel SDK comes bundled with a bunch of unnecessary code or version conflicts as is suggested in the article. The whole point is to regain control over telemetry & avoid paying $$$ to an ambivalent vendor.

FWIW, I don't think the OTel implementation for mobile is perfect - a lot of the code was originally written with backend JVM apps in mind & that can cause friction. However, I'm fairly optimistic those pain points will get fixed as more folks converge on this standard.

Disclaimer: I work at a Sentry competitor

markl42 2 years ago |

At the risk of hijacking the comments, I've been trying to use OTel recently to debug performance of a complex webpage with lots of async sibling spans, and finding it very very difficult to identify the critical path / bottlenecks.

There are no causal relationships between sibling spans. I think in theory "span links" solve this, but afaict this is not a widely used feature in SDKs or UI viewers.

(I wrote about this here https://github.com/open-telemetry/opentelemetry-specificatio...)

diurnalist 2 years ago | |

I don't believe this is a solved problem, and it's been around since OpenTracing days[0]. I do not think that the Span links, as they are currently defined, would be the best place to do this, but maybe Span links are extended to support this in the future. Right now Span links are mostly used to correlate spans causally _across different traces_ whereas as you point out there are cases where you want correlation _within a trace_.

[0]: https://github.com/opentracing/specification/issues/142

hinkley 2 years ago | |

I was underwhelmed by the max size for spans before they get rejected. Our app was about an order of magnitude too complex for OTEL to handle.

Reworking our code to support spans made our stack traces harder to read and in the end we turned the whole thing off anyway. Worse than doing nothing.

phillipcarter 2 years ago | | |

As per the spec there are no formal limits on size, although in practice there can be limits at several levels:

- Your SDK's exporter

- Collector processors and general memory limitations based on deployment

- Telemetry backend (this is usually the one that hits people)

Do you know where the source of this rejection happened? My guess would be backend, since some will (surprisingly) have rather small limits on spans and span attributes.

pdimitar 2 years ago | | |

Sounds like a knob you can turn, from my practice at least.

tnolet 2 years ago |

A recent example of OTel confusion.

I could not for the life of me get the Python integration to send traces to a collector. Same URL, same setup, same API key as for Nodejs and Go.

Turns out the Python SDK expects a URL-encoded header, e.g. “Bearer%20somekey”, whereas all other SDKs just accept a string with a whitespace.

The whole split between HTTP, protobuf over HTTP and GRPC is also massively confusing.
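
A minimal sketch of the encoding difference described above, using only Python's standard library. The value `somekey` is the placeholder from the comment, and `encode_header_value` is a made-up helper name, not part of any OTel SDK. It assumes the header value is being supplied somewhere that expects percent-encoding, such as an environment variable.

```python
from urllib.parse import quote, unquote

def encode_header_value(value: str) -> str:
    """Percent-encode a header value so a space becomes %20 (hypothetical helper)."""
    return quote(value, safe="")

raw = "Bearer somekey"
encoded = encode_header_value(raw)

print(encoded)           # Bearer%20somekey
print(unquote(encoded))  # Bearer somekey
```

Decoding the encoded form gives back the original string, so the two spellings carry the same value; the confusion is only about which one a given SDK expects as input.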

hinkley 2 years ago | |

The silent failure policy of OTEL makes flames shoot out of the top of my head.

We had to use wireshark to identify a super nasty bug in the “JavaScript” (but actually typescript despite being called opentelemetryjs) implementation.

And OTEL is largely unsuitable for short lived processes like CLIs, CI/CD. And I would wager the same holds for FaaS (Lambda).

In the end I prefer the network topology of StatsD, which is what we were migrating from. Let the collector do ALL of the bookkeeping instead of faffing about. OTEL is actively hostile to process-per-thread programming languages. If I had it to do over again I’d look at the StatsD->Prometheus integrations, and the StatsD extensions that support tagging.

pdimitar 2 years ago | | |

> And OTEL is largely unsuitable for short lived processes like CLIs, CI/CD. And I would wager the same holds for FaaS (Lambda).

Not necessarily true. For example, in one of my hobby Golang projects I found out that you can cleanly shut down the OTel collector so it flushes its backlog of traces / metrics / logs, so I was able to get telemetry readings even for CLI tool invocations that lasted 5-10 secs (connect to servers, get data, operate on it, put it someplace else, quit).

But now that you mention it, it would be nasty if that's not the default behavior indeed.
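
The flush-on-exit idea can be sketched without any OTel dependency. Everything here is hypothetical: `BufferedExporter` stands in for whatever batching component an SDK provides, and `sink` stands in for the network call. It only illustrates why a short-lived process needs an explicit shutdown step, assuming the exporter batches in memory.

```python
import atexit

class BufferedExporter:
    """Hypothetical stand-in for an SDK exporter that batches spans in memory."""

    def __init__(self, sink):
        self._buffer = []
        self._sink = sink  # any callable that accepts a list of spans

    def export(self, span):
        self._buffer.append(span)

    def shutdown(self):
        # Flush whatever is still buffered; safe to call more than once.
        if self._buffer:
            self._sink(list(self._buffer))
            self._buffer.clear()

sent = []  # pretend this is the telemetry backend
exporter = BufferedExporter(sent.extend)
atexit.register(exporter.shutdown)  # last-resort flush at interpreter exit

def main():
    exporter.export({"name": "fetch", "duration_ms": 120})
    exporter.export({"name": "store", "duration_ms": 45})
    exporter.shutdown()  # explicit flush before the CLI returns

main()
print(len(sent))  # 2
```

Without the `shutdown()` call (or the `atexit` hook), both spans would still be sitting in the buffer when the process ends, which is the failure mode being discussed for CLIs and CI jobs.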

> OTEL is actively hostile to process-per-thread programming languages

Can you explain why, please?

tnolet 2 years ago | | |

Yeah. And Otel has actually pretty nice debugging. You just need to set the right environment variable. But on prod it will blow up your logs

hahn-kev 2 years ago | |

Sounds like a problem with the Python sdk

tnolet 2 years ago | | |

Well actually. They (python SDK maintainers) argue their implementation is the correct one according to the spec. See this issue thread for example.

https://github.com/open-telemetry/opentelemetry-specificatio...

There are more. This is a symptom of how hard it is to dive into Otel due to its surface area being so big.

NeutralForest 2 years ago |

It resonates. As an intern I had to add OTEL to a Python project and I had to spend a lot of time in the docs to understand the concepts and implementation. Also, the Python impl has a lot of global state that makes it hard to use properly imo.

chipdart 2 years ago | |

> As an intern I had to ${DO_SOME_PROJECT} and I had to spend a lot of time in the docs to understand the concepts and implementation

That sounds like every single run-of-the-mill internship.

NeutralForest 2 years ago | | |

That's fair, but I'll say that the time and number of concepts you have to deal with before going into the code, per the docs, is quite big, and I think the critique in the article is warranted.

zaphar 2 years ago | |

Tracing requires keeping mappings for tracing identifiers per request. I don't know how you do that without global state unless you want the tracing identifiers to pollute your own internal APIs everywhere.

bigblind 2 years ago | | |

Many frameworks have the idea of a "context" for this, which holds per-request state and follows your request through the system. Functions that don't care about the context just pass it on to whatever they call.

I think Go was smart to make this concept part of the standard library, as it encouraged frameworks to adopt it as well.
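
Go passes its context explicitly as a function argument. Python's closest standard-library equivalent is `contextvars`, which carries the value implicitly instead. A minimal sketch of per-request state without threading an ID through every signature; the function names and trace IDs are made up for illustration.

```python
import contextvars

# Holds the trace ID for whatever request is currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)

def query_database():
    # Deep in the call stack: reads the ID without receiving it as an argument.
    return f"query tagged with trace {current_trace_id.get()}"

def business_logic():
    # Does not know or care about tracing.
    return query_database()

def handle_request(trace_id):
    # Run in a copy of the context so the value does not leak to other requests.
    ctx = contextvars.copy_context()

    def run():
        current_trace_id.set(trace_id)
        return business_logic()

    return ctx.run(run)

print(handle_request("abc123"))
print(handle_request("def456"))
print(current_trace_id.get())  # None: nothing leaked outside the requests
```

The variable is declared at module level, but each request sees its own value, which is the distinction between a context and plain mutable global state.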

NeutralForest 2 years ago | | |

I understand that but if you look at the Python implementation (or at least as it was 1-2 years ago), you have a lot of god objects that hack __new__ which leads to hidden flows when you create new instances of tracers for example. I'm not saying I have a better idea but when you put that together with the docs and the (at the time) very bare examples, it's just annoying.

BiteCode_dev 2 years ago |

100% agree.

Every time I tried to use OT I was reading the doc and whispering "but, why? I only need...".

Karrot_Kream 2 years ago | |

Yeah I was going down this path for a side project I was getting going and spent a couple days of after-work time exploring how to get just some basic traces in OT and realized it was much more than I needed or cared about.

spullara 2 years ago |

There is a huge hole in using spans as they are specified. Without separating the start of a span from the end of a span you can never see things that never complete, fail hard enough to not close the span, or travel through queues. This is a compromise they made because typical storage systems for tracing aren't really good enough to stitch them all back together quickly. Everyone should be sending events and stitching it all together to create the view. But instead we get a least common denominator solution.
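
A rough sketch of the "send events, stitch later" idea, to make the difference concrete. The event shape (`kind`, `span_id`, `ts`) is an assumption for illustration and is not an OTel data model.

```python
def stitch(events):
    """Pair start/end events by span ID; return (completed, still_open)."""
    open_spans = {}
    completed = []
    for ev in events:
        if ev["kind"] == "start":
            open_spans[ev["span_id"]] = ev
        elif ev["kind"] == "end":
            start = open_spans.pop(ev["span_id"], None)
            if start is not None:
                completed.append({
                    "span_id": ev["span_id"],
                    "name": start["name"],
                    "duration": ev["ts"] - start["ts"],
                })
    return completed, list(open_spans.values())

events = [
    {"kind": "start", "span_id": "a", "name": "checkout", "ts": 0},
    {"kind": "start", "span_id": "b", "name": "enqueue-email", "ts": 5},
    {"kind": "end", "span_id": "a", "ts": 40},
    # No end event for "b": it crashed or is still stuck in a queue.
]

completed, still_open = stitch(events)
print(completed)
print([s["name"] for s in still_open])
```

With a span that is only emitted once it closes, `enqueue-email` would never show up at all; with separate events it appears as an open span that can be inspected.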

drewbug01 2 years ago |

As a contributor to (and consumer of) OpenTelemetry, I think critique and feedback is most welcome - and sorely needed.

But this ain’t it. In the opening paragraphs the author dismisses the hardest parts of the problem (presumably because they are human problems, which engineers tend to ignore), and betrays a complete lack of interest in understanding why things ended up this way. It also seems they’ve completely misunderstood the API/SDK split in its entirety - because they argue for having such a split. It’s there - that’s exactly what exists!

And it goes on and on. I think it’s fair to critique OpenTelemetry; it can be really confusing. The blog post is evidence of that, certainly. But really it just reads like someone who got frustrated that they didn’t understand how something worked - and so instead of figuring it out, they’ve decided that it’s just hot garbage. I wish I could say this was unusual amongst engineers, but it isn’t.

shaqbert 2 years ago |

Otel is indeed quite complex. And the docs are not meant for quick wins...

Otelbin [0] has helped me quite a bit in configuring and making sense of it, and getting stuff done.

[0]: https://www.otelbin.io/

wdb 2 years ago | |

That looks pretty cool! OpenTelemetry Collector configuration files are pretty confusing. Do like the collector, though. Makes it easy to send a subset of your telemetry to trusted partners.

epgui 2 years ago |

Anyone else finding this very difficult to read? I’d really recommend feeding this through a grammar checker, because poor grammar betrays unclear thinking.

zeeg 2 years ago | |

So you’re saying it makes my thinking more clear? :)

This is what happens when you use a tool designed for authoring code to also author content.

kaashif 2 years ago | | |

"betrays" means to expose, to be evidence of, particularly unintentionally.

i.e. "poor grammar unintentionally exposed unclear thinking"

grenbys 2 years ago |

I think there are two separate perspectives. For developers Open Telemetry is a clear win - high-quality vendor agnostic instrumentation backed by reputable orgs. I added OTEL tracing to many business-critical repos at my company (major customer support SaaS) in Ruby, Python, and JS. Not once was I confused/blocked/distracted by the presence of logs/metrics in the spec. However, I can’t say much from the perspective of an observability vendor trying to be fully compatible with the OTEL spec, including metrics/logs. The article mentions customers having issues with using tracing instrumentation - it would’ve been great to back this up with corresponding github issues explaining the problems. Based on the presented JS snippet (just my guess) maybe the issue is with async code where the “span.operation” span gets immediately closed w/o waiting for the doTheThing()? Yeah - that’s tricky in JS given its async primitives. We ended up just maintaining a global reference to the currently active span and patching some OTEL packages to respect that. FWIW Sentry JS instrumentation IS really good and practical. It would have been great if Sentry could donate to, contribute to, or influence the OTEL JS SIG with specific improvements - would be a win-win. As much as I hate DataCanine pricing, they did effectively donate their Ruby tracing instrumentation to OTEL, which I think is one of the best ones out there.

hobofan 2 years ago |

This seems to be more of a branding problem than anything.

OP (rightfully) complains that there is a mismatch between what they (can) advertise ("We support OTEL") and what they are actually providing to the user. I have the same pain point from the consumer side, where I have to trial multiple tools and services to figure out which of them actually support the OTEL feature set I care about.

I feel like this could be solved by introducing better branding that has a clearly defined scope of features inside the project (like e.g. "OTEL Tracing") which can serve as a direct signifier to customers about what feature set can be expected.

zeeg 2 years ago | |

Yes! It's a bit deeper than that but it's fundamentally a packaging issue.

antonyt 2 years ago |

OTel is flawed for sure, but I don't understand the stance against metrics and logs. Traces are inherently sampled unless you're lighting all your money on fire, or operating at so small a scale that these decisions have no real impact. There are kinds of metrics and logs which you always want to emit because they're mission-critical in some way. Is this a Sentry-specific thing? Does it just collapse these three kinds of information into a single thing called a "trace"?
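
To make the sampling point concrete, here is a minimal sketch of deterministic, ratio-based trace sampling. This is not the algorithm any OTel SDK uses; `keep_trace` and the hashing scheme are assumptions chosen only to show why a fixed fraction of traces is dropped.

```python
import hashlib

def keep_trace(trace_id: str, ratio: float) -> bool:
    """Deterministically keep roughly `ratio` of traces based on the trace ID."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < ratio

trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in trace_ids if keep_trace(t, 0.10)]

print(len(kept))  # roughly a tenth of the traces
```

Because the decision depends only on the trace ID, every service that sees the same trace makes the same keep/drop choice. An event that must never be dropped has to travel on a path that is not sampled this way, which is the argument for always-on metrics and logs.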

dboreham 2 years ago |

I've used Otel quite a bit (in JVM systems) and honestly didn't know it did more than tracing.

That said, I think this rot comes from the commercial side of the sector -- if you're a successful startup with one product (e.g. graphing counters), then your investors are going to start beating you up about why you don't expand into other adjacent product areas (e.g. tracing). Repeat previous sentence reversed. And so you get Grafana, New Relic, et al. OpenTelemetry is just mirroring that arrangement.

edenfed 2 years ago |

You can absolutely use just the OTel APIs and use something else besides the OTel SDK. Here is a blog post about how we did it with eBPF: https://odigos.io/blog/Integrating-manual-and-auto

prymitive 2 years ago |

I only learned about OT after Prometheus announced some deeper integration with it. Reading OT docs about metrics feels like every little problem has a dedicated solution in the OT world, even if a more generalised one already covers it. Which is quite striking coming from the Prometheus world.

PeterZaitsev 2 years ago |

OpenTelemetry is interesting. On one side it is designed as the "commodity feeder" to a number of proprietary backends such as DataDog; on the other hand we see good development of open source solutions such as SigNoz and Coroot with good Otel support.

ris 2 years ago |

1. The main reason I want to use otel is so I can have one sidecar for my observability, not three, each with subtly different quirks and expectations. (also the associated collection/aggregation infrastructure)

2. I honestly think the main reason otel appears so complex is the existing resources that attempt to explain the various concepts around it do a poor job and are very hand-wavey. You know the main thing that made otel "click" for me? Reading the protobuf specs. Literally nothing else explained succinctly the relationships between the different types of structure and what the possibilities with each were.

pdimitar 2 years ago | |

Your point 2 would make for a very interesting blog post worthy of HN submitting. :)

esafak 2 years ago |

This caught my eye:

> Logs are just events - which is exactly what a span is, btw - and metrics are just abstractions out of those event properties. That is, you want to know the response time of an API endpoint? You don't rewind 20 years and increment a counter, you instead aggregate the duration of the relevant span segment. Somehow though, Logs and Metrics are still front and center.

Is anyone replacing logs and metrics with traces?

zeeg 2 years ago | |

imo Honeycomb pioneered this, and it's the right baseline. There are limitations to it of course, and certainly it's been done before at BigCo's that can afford to build the tech, but it's extremely powerful.

The main argument for metrics beyond traces is simply a technology implementation - it's aggregation because you can't store the raw events. That doesn't mean though that you need a new abstraction on those metrics. They're still just questions you're asking of the events in the system, and most systems are debuggable by aggregating data points of spans or other telemetry.

As for logs, they're important for some kinds of workloads, but for the majority of companies I don't think they're the best solution to the problem. You might need them for auditability, but it's quite difficult to find a case where logs are the solution to debug a problem if you had span annotations.
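
As an illustration of "metrics are questions asked of the events", a response-time summary can be computed straight from span durations rather than from a separately maintained counter. The span records and field names below are invented for the example; a real backend would run the equivalent query over stored spans.

```python
from statistics import mean, quantiles

# Hypothetical finished spans; in practice these come from the tracing backend.
spans = [
    {"route": "/api/users", "duration_ms": 100},
    {"route": "/api/users", "duration_ms": 120},
    {"route": "/api/users", "duration_ms": 80},
    {"route": "/api/users", "duration_ms": 200},
    {"route": "/api/users", "duration_ms": 100},
    {"route": "/api/orders", "duration_ms": 900},
]

def summarize(spans, route):
    """Derive latency metrics for one route by aggregating its spans."""
    durations = [s["duration_ms"] for s in spans if s["route"] == route]
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95.
    cuts = quantiles(durations, n=100, method="inclusive")
    return {
        "count": len(durations),
        "mean_ms": mean(durations),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
    }

summary = summarize(spans, "/api/users")
print(summary)
```

The trade-off raised in the comment still applies: this only works while the raw spans are retained, and once they are sampled or expired, the pre-aggregated metric is all that is left.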

dalyons 2 years ago | |

Absolutely yes

dtjohnnymonkey 2 years ago |

> That means what we actually want is a way to say “hey OpenTelemetry SDK, give us all the current spans in the buffer”.

Isn’t this exactly what the SpanExporter API is for? This is in the Go SDK, I suppose it may not be available in other SDKs.

I have used this API to convert OTel spans into log messages as we currently don’t have a distributed tracing vendor.

dan-allen 2 years ago |

I keep checking in on OpenTelemetry every few months to see if the bits we need are stable yet. There’s been very little progress on the things we’re waiting for.

I don’t follow closely enough to comment on possible causes.

What I do know is that the surface area of code and infrastructure that telemetry touches means adopting something unfinished is a big leap of faith.

phillipcarter 2 years ago | |

What pieces are you looking to be stable (and what's your definition of stable)?

Asking because some pieces, like the Collector, aren't technically a stable 1.0 yet, but the bar for stability is extremely high, and in practice it's far more stable than most software out there.

But there are other pieces, such as a language's support for a specific concept, that are truly experimental or even still in-development.

pdimitar 2 years ago | |

IMO you might be looking at the wrong signals, OTel is quite successful today and I had zero breakages or compatibility problems for at least 2 years at this point.

cogman10 2 years ago |

Perhaps the real problem with OTel (IMO) is it's trying to be everything for everyone and every language. It's trying to have a common interface so that you can write OTel in Java or Javascript, python or rust, and you basically have the exact same API.

I suspect OP is seeing this directly when talking about the kludginess of the Javascript API.

mikeshi42 2 years ago | |

The Otel spec does give leeway for language-specific details, and the SDKs are not as uniform as you'd expect (ex. Java's agent configuration is very different from Node's auto instrumentation). I'm not denying that there's SDK specs to adhere to, but the abstraction complexity in Otel is really from the amount of flexibility they've tried to build into the SDK for better or for worse.

The flexibility benefits vendors (I work for HyperDX, based on otel) - as it allows for a lot of points of extensibility to build a better experience for end users by extending the vanilla SDK functionality. However, it creates a lot of overhead for end-users trying to adopt the "vanilla" SDKs out of the box as there's 5 layers of abstractions that need to be understood before getting things started (which is bad!)

I've only seen the DX of Otel improve over time across the ecosystems they support - so I suspect we'll get there soon enough.

zellyn 2 years ago |

Are they basically just saying that the OpenTelemetry client APIs should be split from the rest of the pieces of the project, and versioned super conservatively?

The simple API they describe is basically there in OTel. The API is larger, because it also does quite a few other things (personally, I think (W3C) Baggage is important too), but as a library author I should need only the client APIs to write to.

When implementing, you're free to plug in Providers that use OpenTelemetry-provided plumbing, but you can equally well plug in Providers from DataDog or Sentry or whatever.

Unless I'm missing something, any further complaints could be solved by making sure the Client APIs (almost) never have backward-incompatible changes, and are versioned separately.
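
The API/provider split described here can be sketched in a few lines. None of these names come from the OpenTelemetry packages; `NoopTracer`, `set_tracer`, `get_tracer` and `RecordingTracer` are hypothetical, and the sketch only shows the shape of the idea: libraries call a tiny API, and the application decides what sits behind it.

```python
from contextlib import contextmanager

# --- "API" layer: the only thing a library depends on ---
class NoopTracer:
    """Default used when the application has not installed any provider."""

    @contextmanager
    def start_span(self, name):
        yield

_tracer = NoopTracer()

def set_tracer(tracer):
    """Called once by the application, never by libraries."""
    global _tracer
    _tracer = tracer

def get_tracer():
    return _tracer

# --- library code: instrumented against the API only ---
def fetch_user(user_id):
    with get_tracer().start_span("fetch_user"):
        return {"id": user_id}

# --- application code: plugs in an implementation of its choice ---
class RecordingTracer:
    def __init__(self):
        self.finished = []

    @contextmanager
    def start_span(self, name):
        try:
            yield
        finally:
            self.finished.append(name)

fetch_user(1)  # works with the no-op default, records nothing

recorder = RecordingTracer()
set_tracer(recorder)
fetch_user(2)
print(recorder.finished)  # ['fetch_user']
```

The library function never changes when the provider is swapped, which is why keeping the API layer small and backward compatible matters more than anything behind it.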

zeeg 2 years ago | |

It’s a bit deeper than that. The SDKs that library authors implement need to be extremely minimal. The collection libraries that vendors build on top of them should, imo, also be minimal.

OTLP imo doesn’t even need to be part of the spec.

But minimal would also mean focusing on solving fewer problems as a whole. Eg OpenTracing plus OpenMetrics plus OpenLogs. I only need one of those things.

pdimitar 2 years ago | | |

Well, on OTLP they seem to agree with you: https://opentelemetry.io/blog/2023/otel-arrow/

arccy 2 years ago | | |

that just sounds like a branding problem though...

OTLP has been quite useful especially in metrics to get a format that doesn't really have any sacrifices/limitations compared to all the other protocols.

EdSchouten 2 years ago |

> Its not a hard problem, [...]. At its core its structured events that carry two GUIDs along with them: a trace ID and a parent event ID. It is just building a tree.

I've always wondered, what's the point of the trace ID? What even is a trace?

- It could be a single database query that's invoked on a distributed database, giving you information about everything that went on inside the cluster processing that query.

- Or it could be all database calls made by a single page request on a web server.

- Or it could be a collection of page requests made by a single user as part of a shopping checkout process. Each page request could make many outgoing database calls.

Which of these three you should choose merely depends on what you want to visualize at a given point in time. My hope is that at some point we get a standard for tracing that does away with the notion of trace IDs. Just treat everything going on in the universe as a graph of inter-connected events.

remram 2 years ago | |

I think they meant "an event ID and a parent event ID".

zeeg 2 years ago | | |

I actually meant trace ID and parent event ID (and ID was inferred). Parent comment is correct in that trace ID isn't technically needed, and is in fact quite controversial. It's an implementation-level protocol optimization though, and unfortunately not an objective one. It creates an arbitrary grouping of these annotations - which is entirely subjective, and the spec struggles to reconcile - but it's primarily because the technology to aggregate and/or query them would be far more difficult if you didn't keep that simple GUID.

It does have one positive benefit beyond that. If you lose data, or have disparate systems, it's pretty easy to keep the Trace ID intact and still have better instrumentation than otherwise.
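
The "it is just building a tree" claim from the quoted article can be sketched as follows. The event fields (`trace_id`, `id`, `parent_id`) and the sample data are assumptions for illustration; the trace ID serves only as the cheap grouping key discussed above.

```python
from collections import defaultdict

def build_tree(events, trace_id):
    """Select one trace's events and index them by parent ID."""
    children = defaultdict(list)
    roots = []
    for ev in events:
        if ev["trace_id"] != trace_id:
            continue  # the trace ID is only used as a grouping filter
        if ev["parent_id"] is None:
            roots.append(ev)
        else:
            children[ev["parent_id"]].append(ev)
    return roots, children

def render(node, children, depth=0):
    """Return the subtree under `node` as indented lines."""
    lines = ["  " * depth + node["name"]]
    for child in children.get(node["id"], []):
        lines.extend(render(child, children, depth + 1))
    return lines

events = [
    {"trace_id": "t1", "id": 1, "parent_id": None, "name": "GET /checkout"},
    {"trace_id": "t1", "id": 2, "parent_id": 1, "name": "SELECT cart"},
    {"trace_id": "t1", "id": 3, "parent_id": 1, "name": "POST /payments"},
    {"trace_id": "t1", "id": 4, "parent_id": 3, "name": "INSERT payment"},
    {"trace_id": "t2", "id": 5, "parent_id": None, "name": "GET /health"},
]

roots, children = build_tree(events, "t1")
tree_lines = render(roots[0], children)
print("\n".join(tree_lines))
```

Dropping the `trace_id` filter and following parent links alone would give the same tree, at the cost of having no cheap key for fetching all related events in one query.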

noname120 2 years ago |

tl;dr OpenTelemetry eats Sentry's cake by commoditizing what they do and the reaction of the founder of Sentry is to be very upset about it rather than innovating.

jiveturkey 2 years ago |

> Everyone and their mother is running a shoddy microservice-coupled stack,

buried the lede!

syngrog66 2 years ago |

Up my alley. I'm the author of a FOSS Golang span instrumentation library for latency (LatLearn in my GitHub.) And part of the team that back in 2006/2007 made an in-house distributed tracing solution for Orbitz.