A guide to LLM inference and performance
Using the full power of a GPU during LLM inference requires knowing whether the inference is compute-bound or memory-bound. Calculating the operations-per-byte (ops:byte) ratio a given GPU can sustain and comparing it to the arithmetic intensity of the model's attention layers reveals where the bottleneck is. This information can be used to pick an appropriate GPU for model inference and to apply techniques like batching to better utilize GPU resources.
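To make this concrete, here is a minimal sketch in Python of the comparison described above. The hardware numbers are the published FP16 tensor throughput and memory bandwidth of an NVIDIA A10; the arithmetic-intensity model (roughly 2 FLOPs per weight per batched token against 2 bytes per FP16 weight) is a common simplification for autoregressive decoding, not an exact profile of any particular model.

```python
# Sketch: compare a GPU's ops:byte ratio to the arithmetic intensity
# of a model's matrix multiplies during autoregressive decoding.

# GPU hardware limits (NVIDIA A10, from NVIDIA's datasheet)
flops_fp16 = 125e12       # 125 TFLOPS of FP16 tensor compute
memory_bandwidth = 600e9  # 600 GB/s of memory bandwidth

# ops:byte ratio: FLOPs the GPU can perform per byte moved from memory.
ops_per_byte = flops_fp16 / memory_bandwidth  # ~208 FLOPs per byte

def arithmetic_intensity(batch_size: int) -> float:
    """Simplified intensity of a weight matmul when decoding.

    With batch size B, each FP16 weight (2 bytes) is read once and
    used in 2*B FLOPs (one multiply and one add per batched token),
    so intensity grows roughly linearly with batch size.
    """
    bytes_per_param = 2
    flops_per_param = 2 * batch_size
    return flops_per_param / bytes_per_param

for batch_size in (1, 64, 256):
    intensity = arithmetic_intensity(batch_size)
    regime = "compute-bound" if intensity > ops_per_byte else "memory-bound"
    print(f"batch={batch_size:4d}  intensity={intensity:6.1f}  "
          f"ops:byte={ops_per_byte:.0f}  -> {regime}")
```

Under these assumptions, single-request decoding is deeply memory-bound (intensity of 1 versus an ops:byte ratio around 208), which is why batching requests together is such an effective way to recover GPU utilization.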