| user: | mezark |
| created: | February 27, 2023 |
| karma: | 23 |
| 1. | MoE inference optimizations: 15% lower expert load by request reordering (blog.doubleword.ai) |
| 2. | |
| 3. | Tensor Network Attention (mainlymatmul.com) |
| 4. | Redundant Information in LLM Weights (fergusfinn.com) |
| 5. | Tans: Precomputing RANS (fergusfinn.com) |
| 6. | Also-RANS: Asymmetric Numeral Systems for Entropy Coding (fergusfinn.com) |
| 7. | 70x faster cold(ish) starts for SGLang (fergusfinn.com) |
| 8. | QueueSpec – drafting speculation tokens while a request queues (blog.doubleword.ai) |
| 9. | ZeroDP: Just-in-Time Weight Offloading over NVLink for Data Parallelism (mainlymatmul.com) |
| 10. | Parallel Primitives for Multi-Agent Workflows (fergusfinn.com) |
| 11. | |
| 12. | Should GPUs Make Free Trade Agreements? (doubleword.ai) |
| 13. | |
| 14. | |
| 15. | Takeoff Inference Server Is Now Open Source (github.com) |
| 16. | Falcon 7B running real time on CPU (youtube.com) |