Understanding Floating Points in LLMs
Premium Insights: Introduction into Floating Points through DeepSeek-V3

just floating around…
Since the dawn of deep learning and neural networks, there has never been enough compute. More specifically, how much compute you effectively have depends on the physical hardware running the network, the size of the model, the availability of optimized software, and the numerical precision you operate at.
Today we will dig into one branch of this compute problem: the structure of floating point numbers, the role they play in determining an LLM's performance, and why understanding them thoroughly matters.
In late December 2024, the Chinese AI lab DeepSeek trained a 671 billion parameter MoE transformer language model for less than 6 million dollars, a result that gave US tech stocks a scare. Training a model of this scale (one that also performs exceptionally well) typically costs on the order of tens to hundreds of millions of dollars; DeepSeek not only did it an order of magnitude cheaper, but also open sourced the model.

passage taken from DeepSeek-V3 paper
We can calculate the cost of one training run from the price of a single H800 GPU-hour (~$2 USD):
2.788M hrs × ($2 / hr) = $5.576M USD
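For reference, here is the same back-of-the-envelope arithmetic as a tiny Python snippet (the ~$2/hr figure is the approximate rental rate used above, not an official number):

```python
# Back-of-the-envelope estimate of the DeepSeek-V3 training cost, using the
# GPU-hour total reported in the paper and an approximate H800 rental rate.
gpu_hours = 2.788e6        # total H800 GPU-hours for the full training run
usd_per_gpu_hour = 2.00    # approximate rental price, USD

total_cost_usd = gpu_hours * usd_per_gpu_hour
print(f"Estimated training cost: ${total_cost_usd / 1e6:.3f}M")  # ~$5.576M
```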
One of the core reasons they use so few GPU hours is that they integrate a lower precision training framework. By storing weights & biases, inputs, activations, etc. in extremely low precisions like FP8 and FP16 (BF16 inferred, but not explicitly mentioned), they are able to train with a fraction of the compute while still keeping the values precise enough for highly parallelized learning.

passage taken from DeepSeek-V3 paper
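To make the storage savings concrete, here is a minimal PyTorch sketch (not DeepSeek's actual training framework) that compares the memory footprint of the same weight matrix at different precisions and runs a matmul under autocast. Native FP8 dtypes (e.g. torch.float8_e4m3fn) only exist in newer PyTorch builds and have limited op coverage, so the sketch sticks to FP16/BF16:

```python
import torch

# A dummy weight matrix roughly the size of one transformer projection layer.
w_fp32 = torch.randn(4096, 4096, dtype=torch.float32)

# Casting to a 16-bit format halves the storage per element;
# an 8-bit format would quarter it.
w_bf16 = w_fp32.to(torch.bfloat16)
w_fp16 = w_fp32.to(torch.float16)

for name, t in [("fp32", w_fp32), ("bf16", w_bf16), ("fp16", w_fp16)]:
    mib = t.numel() * t.element_size() / 2**20
    print(f"{name}: {t.element_size()} bytes/elem, {mib:.1f} MiB total")

# Mixed-precision training typically keeps a higher-precision master copy of
# the weights and lets autocast run the matmuls in a lower precision.
x = torch.randn(8, 4096)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = x @ w_fp32
print("matmul dtype under autocast:", y.dtype)
```

On an actual GPU you would pass device_type="cuda", where the low-precision matmuls map onto tensor cores, which is where the throughput gains below come from.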
Let's look at some common NVIDIA consumer GPU specs to understand how much faster 8-bit and 16-bit floats are.
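Whichever card you have, the comparison itself is simple: take the peak TFLOPS figures from NVIDIA's datasheet and divide by the FP32 baseline. A tiny sketch (with placeholder numbers, not real specs for any card) looks like this:

```python
# Placeholder peak-throughput figures in TFLOPS -- replace these with the
# numbers from your GPU's datasheet; they are NOT real specs for any card.
peak_tflops = {
    "fp32":        80.0,
    "fp16_tensor": 160.0,
    "fp8_tensor":  320.0,
}

baseline = peak_tflops["fp32"]
for precision, tflops in peak_tflops.items():
    print(f"{precision:12s}: {tflops:6.1f} TFLOPS ({tflops / baseline:.1f}x vs fp32)")
```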

TL;DR
Sooo... that's a lot of information. Long story short: it turns out you can just use lower precision values and effectively double or quadruple training throughput for relatively little error.
This likely isn't new information for you, but it tells us that we typically only need to operate on lower-precision floating point numbers. So let's build up our intuition for the floating point formats DeepSeek has been using by first refreshing the easiest format (integers), then diving into floats afterwards.

typical integer types you'd see in deep learning
(int32 mainly in hyperparameters rather than neural net ops)
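As a quick warm-up, here is one way to print the bit widths and value ranges of these integer types with NumPy (this is just the standard set of fixed-width ints, mirroring the caption above):

```python
import numpy as np

# Common fixed-width integer types. int8 is the one you'll meet in quantized
# inference; int32/int64 mostly show up for indices, token ids, and
# hyperparameters rather than in the neural net ops themselves.
for dtype in (np.int8, np.uint8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(f"{np.dtype(dtype).name:>6}: {info.bits:2d} bits, "
          f"range [{info.min}, {info.max}]")
```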