this is exactly the problem we keep running into. the cost isn't just "how many tokens did this call use," it's "how many tokens did this entire user action consume across all the agent loops, retries, tool calls, and embeddings."
most observability tools show you the LLM call as one flat span. you can see it cost X tokens, but you can't correlate it with the API request that triggered it, or see that the agent looped 4 times because the first 3 outputs failed validation. so you end up building custom logging and hoping the numbers add up.
we've been building an APM (immersivefusion.com) where cost is a first-class dimension on every trace. so you can see one request flow from the UI through your backend through the agent workflow, and each span carries its token cost. the idea is you should be able to answer "what does a checkout cost when the recommendation agent is in the loop" without stitching together 3 different tools.
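to make "each span carries its token cost" concrete, here's a minimal sketch of the roll-up in plain python. this is not our product's API or any tracing library's; the `Span` shape, the field names, and the sample numbers are all made up for illustration. the only real assumption is that every span (http, llm, tool, embedding) shares a trace id with the request that triggered it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    # hypothetical span record; field names are assumptions, not a real schema
    trace_id: str           # shared by every span in one user action
    name: str
    kind: str               # e.g. "http", "llm", "tool", "embedding"
    input_tokens: int = 0
    output_tokens: int = 0


def tokens_per_trace(spans):
    """Sum input + output tokens over all spans that share a trace id."""
    totals = defaultdict(int)
    for span in spans:
        totals[span.trace_id] += span.input_tokens + span.output_tokens
    return dict(totals)


# made-up sample: one checkout where the agent needed a retry, plus one search
spans = [
    Span("req-1", "POST /checkout", "http"),
    Span("req-1", "recommend attempt 1", "llm", 1200, 300),
    Span("req-1", "recommend attempt 2", "llm", 1250, 310),
    Span("req-1", "embed query", "embedding", 80, 0),
    Span("req-2", "POST /search", "http"),
    Span("req-2", "rerank", "llm", 400, 50),
]

totals = tokens_per_trace(spans)
print(totals)
```

the point is that the retry shows up as a second llm span under the same trace, so the per-request total includes it without any extra bookkeeping.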
for the forecasting question specifically, i think the answer is you need a few weeks of production data with good instrumentation, and then you can build a distribution of cost per flow. the variance is real but it's not random, it's usually a few specific flows that blow up (retries on bad structured output like @hkonte mentioned, or RAG queries that hit the wrong chunk size). once you can see which flows are expensive, the guardrails become obvious.
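a rough sketch of what "build a distribution" can look like once you have per-request token totals grouped by flow. the flow names, the sample numbers, and the 2x threshold are all made up; the p50/p95 comparison is just one simple way to spot a heavy tail, not the only one.

```python
import statistics


def flow_stats(samples):
    """Return (p50, p95) of per-request token totals for one flow."""
    p50 = statistics.median(samples)
    # n=20 yields 19 cut points; index 18 is the 95th percentile.
    # "inclusive" keeps the estimate inside the observed min/max.
    p95 = statistics.quantiles(samples, n=20, method="inclusive")[18]
    return p50, p95


def heavy_tail_flows(tokens_by_flow, ratio=2.0):
    """Flag flows whose p95 is more than `ratio` times their median."""
    flagged = []
    for flow, samples in tokens_by_flow.items():
        p50, p95 = flow_stats(samples)
        if p95 > ratio * p50:
            flagged.append(flow)
    return flagged


# made-up data: checkout has one request that blew up, search is stable
tokens_by_flow = {
    "checkout": [3000, 3100, 3200, 2900, 3050, 12000, 3150, 3000, 2950, 3100],
    "search": [400, 420, 450, 430, 410, 440, 415, 425, 435, 445],
}

flagged = heavy_tail_flows(tokens_by_flow)
print(flagged)
```

flows that get flagged are the ones worth putting a retry cap or token budget on first; the stable ones you can forecast from the median.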
also wrote a longer piece on this if anyone's interested: immersivefusion.com/blog/end-to-end-observability-from-ui-to-ai-agent-to-invoice