Why Your LLM Training Eats Data 2MB at a Time, and What We're Doing About It
A perspective from the frontlines of foundation model training
A recent technical blog post from Thinking Machines Lab caught my attention: they demonstrated that RL post-training of LLMs works remarkably well with LoRA at ranks as low as 1. Their information-theoretic analysis suggests each RL trace provides only about one bit of information, which can reasonably be captured in the low-rank weight matrices. This insight resonated deeply with observations from my work at Liquid AI and our research into the fundamentals of LLM training.
The Storage Paradox
Having spent years in academia working extensively with image and video data, I have experienced firsthand the massive storage demands of computer vision research. ImageNet alone requires about 150GB of storage. The video datasets I worked with for autonomous driving and robotics routinely exceeded terabytes despite containing only a few dozen videos, a constant challenge given that I only had access to academic infrastructure, i.e., not only GPU-poor but also storage-poor.
This experience made the contrast even more striking when I started working with LLM pretraining corpora. The entire text of Wikipedia across English and the top 10 languages? Roughly 30GB, a mere fraction of ImageNet. Modern open-source web-scale datasets derived from CommonCrawl (which aims to crawl the entire internet), notably FineWeb-Edu and DCLM, tell a similar story. FineWeb-Edu’s 10TB fits comfortably on a $200 HDD.
Let me put this in perspective with some back-of-the-envelope calculations.
The Information Upper Bound
Our Liquid AI tokenizer uses a 64k vocabulary, meaning each token fits in a uint16_t variable (2 bytes). The LFM2 series was trained on 10T tokens, which is standard for modern pretraining runs. This translates to 20TB of tokenized data, still fitting on that single HDD.
Note that this is an upper bound, since it's uncompressed data in which every token is stored independently. We could compress the dataset further by exploiting the statistical dependencies between tokens. The Thinking Machines Lab analysis suggests each token carries less than one bit of actual information. Factor that in, and our 10T-token dataset effectively contains around 1.25TB of information.
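The arithmetic above can be sketched in a few lines; the figures are the post's back-of-envelope numbers, not exact measurements:

```python
# Back-of-envelope storage math for a 10T-token pretraining run.
TOKENS = 10e12            # 10T tokens
BYTES_PER_TOKEN = 2       # uint16 token id for a 64k vocabulary

# Upper bound: raw tokenized data, every token stored independently.
raw_bytes = TOKENS * BYTES_PER_TOKEN
print(f"raw tokenized data: {raw_bytes / 1e12:.0f} TB")    # 20 TB

# If each token carries roughly one bit of actual information:
info_bytes = TOKENS / 8
print(f"information content: {info_bytes / 1e12:.2f} TB")  # 1.25 TB
```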
Think about that from an infrastructure perspective: academic video datasets with mere dozens of encoded videos occupy the same storage as an entire web-scale LLM pretraining corpus.
The Rate of Learning Reality Check
Digging deeper into the training process reveals something even more striking. With a hypothetical training context length of 4k and a batch size of 256 (approximately 1M tokens per batch), we feed just 2MB of uncompressed data into the network at each weight-update step.
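A quick sanity check of that per-step figure, assuming a 4,096-token context and 2-byte token ids:

```python
CONTEXT_LEN = 4096        # "4k" context length
BATCH_SIZE = 256
BYTES_PER_TOKEN = 2       # uint16 token id

tokens_per_step = CONTEXT_LEN * BATCH_SIZE            # ~1M tokens
bytes_per_step = tokens_per_step * BYTES_PER_TOKEN
print(f"{tokens_per_step:,} tokens ≈ {bytes_per_step / 1e6:.0f} MB per weight update")
```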
This rate of information flowing into the training machinery feels slow.
Why We Bet Early on Knowledge Distillation
This realization was one of the many reasons we invested heavily in knowledge distillation (KD) for LLMs. In fact, we were running KD ablations for LLMs before we even had our first full-time employee at Liquid AI. The premise was simple: augment the sparse-information token sequences of each training sample with high-dimensional teacher output distributions.
I won't go into detail on estimating how much information the teacher signal actually carries, since that depends heavily on the teacher model, the underlying data, and the training objective. From an infrastructure perspective, however, KD training for LLMs presented us with the exact opposite problem: the data is massive.
Storing the full output distribution from a teacher model requires 64k fp16 values (one logit per possible next token). That's 128KB per token. Scale that to a 10T-token pretraining dataset and you're looking at roughly 1.3 million TB, or 1.3 exabytes, skipping the peta scale entirely.
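The same back-of-envelope style makes the blowup concrete (using 64,000 as the vocabulary size; the exact figure depends on the tokenizer):

```python
VOCAB_SIZE = 64_000       # ~64k-entry vocabulary
FP16_BYTES = 2
TOKENS = 10e12            # 10T-token pretraining run

dense_bytes_per_token = VOCAB_SIZE * FP16_BYTES       # one fp16 logit per vocab entry
total_bytes = dense_bytes_per_token * TOKENS
print(f"{dense_bytes_per_token / 1e3:.0f} KB/token -> {total_bytes / 1e18:.2f} EB total")
```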
The alternative of computing the teacher logits on the fly, without storing them, also comes at a cost: it consumes precious GPU memory and adds inference compute. This adds up quickly when running hundreds of ablation training runs during our ML experimentation cycle.
Our Solution: KD at top-k
These experimentation runs are very valuable to us as a company, since we believe the standard Transformer architecture powering Llama and other LLMs isn't the optimal design for efficient and capable language models. Developing LFM2 required us to run scalable and reproducible research ablations.
Our solution, which we will discuss in more detail in the upcoming LFM2 Technical Report, was simple: store only the top-k logits of the teacher. This cuts the storage requirement per token to k fp16 logit values plus k uint16_t indices (since the vector is sparse, we also have to store which indices are non-zero). With this, we were able to distill our entire LFM2 series from our 7B LFM1 backbone as the teacher model across the full 10T-token pretraining run, a feat that would have been computationally prohibitive with naive approaches.
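A minimal NumPy sketch of the idea. The function name, the choice of k = 32, and the data layout are illustrative assumptions, not Liquid AI's actual pipeline:

```python
import numpy as np

def topk_teacher_logits(logits: np.ndarray, k: int):
    """Keep only the k largest teacher logits per position.

    logits: (seq_len, vocab_size) array of teacher outputs.
    Returns (values, indices): (seq_len, k) fp16 values and uint16 indices.
    Illustrative sketch only; not Liquid AI's actual implementation.
    """
    # argpartition finds the top-k indices in O(vocab) without a full sort
    idx = np.argpartition(logits, -k, axis=-1)[:, -k:]
    vals = np.take_along_axis(logits, idx, axis=-1)
    return vals.astype(np.float16), idx.astype(np.uint16)

# Per-token storage: dense 64k fp16 vector vs sparse top-k (values + indices)
vocab, k = 64_000, 32                 # k = 32 is a hypothetical choice
dense_bytes = vocab * 2               # fp16 logit per vocab entry
sparse_bytes = k * 2 + k * 2          # k fp16 values + k uint16 indices
print(f"dense: {dense_bytes} B/token, top-{k}: {sparse_bytes} B/token "
      f"({dense_bytes / sparse_bytes:.0f}x smaller)")
```

With these assumed numbers, the sparse representation is three orders of magnitude smaller per token, which is what moves the exabyte-scale estimate back into practical territory.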
Call for Fellow Researchers and Engineers
We don’t believe this is the final paradigm for pretraining LLMs. The disconnect between data size, learning algorithm, and information content, the inefficiency of current learning rates, and the architectural redundancies all point to massive opportunities for improvement.
At Liquid AI, we are actively exploring these frontiers and rethinking how to train the next generation of foundation models with radical data and algorithmic efficiency at scale. If these challenges excite you as much as they excite us, consider joining our team. We’re looking for researchers and engineers excited about imagining and developing the next-generation training pipelines for foundation models.
Mathias is CTO at Liquid AI, where the team is building the next generation of efficient foundation models. For more technical details on our approach, watch for our upcoming LFM2 Technical Report.