I built a Rust-based agent framework because I wanted something modular and lightweight for production deployments. My goals were simple: reduce runtime overhead, make executors and memory replaceable, and measure the real cost of the framework (not the model).
I ran a reproducible benchmark across seven agent frameworks on the same machine, with the same LLM, the same prompts, and the same concurrency. The benchmark captures latency distributions (P50/P95/P99), CPU, and memory so you can inspect trade-offs rather than rely on anecdotes.
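For anyone unfamiliar with how P50/P95/P99 figures are typically derived, here is a minimal sketch using the nearest-rank method over per-request latencies. This is an illustration only, not the code from autoagents-bench; the names `LatencySummary`, `percentile`, and `summarize` are hypothetical, and the actual benchmark may use a different percentile definition (for example, interpolation).

```rust
/// Latency summary in milliseconds (hypothetical type, for illustration).
#[derive(Debug, PartialEq)]
pub struct LatencySummary {
    pub p50: f64,
    pub p95: f64,
    pub p99: f64,
}

/// Nearest-rank percentile over a slice sorted in ascending order.
/// Returns None for an empty slice or a `pct` outside 0..=100.
pub fn percentile(sorted: &[f64], pct: f64) -> Option<f64> {
    if sorted.is_empty() || !(0.0..=100.0).contains(&pct) {
        return None;
    }
    let n = sorted.len();
    // Nearest-rank: rank = ceil(pct * n / 100), clamped to 1..=n.
    let rank = (pct * n as f64 / 100.0).ceil() as usize;
    let idx = rank.clamp(1, n) - 1;
    Some(sorted[idx])
}

/// Sorts a copy of the samples and reports P50/P95/P99.
pub fn summarize(latencies_ms: &[f64]) -> Option<LatencySummary> {
    let mut sorted = latencies_ms.to_vec();
    // total_cmp is a total order on f64, so the sort is well defined even with NaN.
    sorted.sort_by(|a, b| a.total_cmp(b));
    Some(LatencySummary {
        p50: percentile(&sorted, 50.0)?,
        p95: percentile(&sorted, 95.0)?,
        p99: percentile(&sorted, 99.0)?,
    })
}
```

Passing the raw per-request latencies for one framework to `summarize` yields the three tail figures. Reporting P95/P99 alongside P50 matters because a framework can look fine at the median while stalling on a small share of requests.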
Key takeaway (short): at small scale the differences are already meaningful. In my tests the Rust frameworks used roughly 1 GB of RAM at 10 concurrent sessions, while the Python-based frameworks were in the 5–9 GB range at the same load. That gap grows with scale, so it becomes a real operational cost and reliability factor for production deployments.
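To make the scaling point concrete, here is a rough back-of-the-envelope sketch. It assumes memory grows linearly with session count and ignores any fixed baseline; that is my simplifying assumption for illustration, not something the benchmark establishes, and `project_memory_gb` is a hypothetical helper.

```rust
/// Projects memory use at `target_sessions` from one measurement.
/// ASSUMPTION: memory scales linearly with sessions and the fixed
/// baseline is negligible. Real workloads may not behave this way.
pub fn project_memory_gb(
    measured_gb: f64,
    measured_sessions: u32,
    target_sessions: u32,
) -> Option<f64> {
    if measured_sessions == 0 {
        return None; // cannot derive a per-session figure from zero sessions
    }
    Some(measured_gb * f64::from(target_sessions) / f64::from(measured_sessions))
}
```

Under that assumption, the measurements at 10 sessions would project to about 10 GB at 100 sessions for the 1 GB case, versus about 50–90 GB for the 5–9 GB range. Treat these as order-of-magnitude estimates; the benchmark data at higher concurrency is the thing to check.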
If you want the data and scripts, both the framework and the benchmark are public:
AutoAgents — framework repo (liquidos-ai/AutoAgents)
autoagents-bench — benchmark repo (liquidos-ai/autoagents-bench)
I’m not arguing that Python ecosystems are wrong — the DX is excellent and they drove this space — but at production scale the runtime cost of convenience matters.
I’d appreciate:
Suggestions on what people typically miss when evaluating frameworks for production (metrics, failure modes, observability, cost modeling).
Feedback on the framework design and API for composability and safety.
Ideas for additional benchmarking scenarios you’d like to see.
Thanks — I’m mostly interested in technical feedback and hard criticism.