Estimating required GPU memory for serving LLMs (substratus.ai)
I wrote a blog post to demystify the process of estimating GPU memory usage.
Working out the VRAM needed just to load the weights is easy enough. But when you have to allow for something like content generation with varying numbers of input and output tokens, how do you even begin to identify the GPUs you need?
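
For a rough feel of the two pieces involved, here is a back-of-the-envelope sketch. It is not taken from the blog post: the function names and the example model shape (32 layers, hidden size 4096, 7B parameters, fp16) are assumptions for illustration, and the KV-cache formula assumes standard multi-head attention with no grouped-query or multi-query sharing. It also ignores activations and framework overhead.

```python
GIB = 2**30  # bytes per GiB


def weights_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory to hold the weights alone (2 bytes per parameter for fp16/bf16)."""
    return n_params * bytes_per_param / GIB


def kv_cache_gib(
    n_layers: int,
    hidden_size: int,
    total_tokens: int,  # input + output tokens per sequence
    batch_size: int = 1,
    bytes_per_value: int = 2,
) -> float:
    """Approximate KV cache size, assuming plain multi-head attention.

    The factor 2 covers one key and one value vector per layer per token.
    """
    per_token_bytes = 2 * n_layers * hidden_size * bytes_per_value
    return per_token_bytes * total_tokens * batch_size / GIB


if __name__ == "__main__":
    # Hypothetical 7B-class model: 32 layers, hidden size 4096, fp16.
    w = weights_gib(7e9)
    kv = kv_cache_gib(n_layers=32, hidden_size=4096, total_tokens=4096, batch_size=8)
    print(f"weights:  {w:.2f} GiB")
    print(f"KV cache: {kv:.2f} GiB")
    print(f"total:    {w + kv:.2f} GiB (before activations and overhead)")
```

Under these assumptions the weights come to about 13 GiB, while the cache costs 0.5 MiB per token, so eight concurrent 4096-token sequences add another 16 GiB. That is why the token budget, not just the parameter count, drives the choice of GPU.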