Pulling Multiple Levers: What It Actually Takes to Build a Competitive LLM
There’s a common refrain you hear about large language models: “LLMs are just next-token prediction machines.” It’s an outdated summary, and it misses most of what goes into building one.
Training a competitive foundation model in 2026 requires progress on many fronts at once: architecture, training recipes, data strategies, alignment, and inference optimization. No single lever gets you there.
In this post, I’ll walk through how we built LFM2.5, our latest on-device model at Liquid AI, and use it as a case study. Each section covers a different lever we pulled, and the point is that all of them mattered.
Architecture: Let the Hardware Tell You What’s Fast
Most architecture comparisons rely on FLOPs or parameter counts as proxies for efficiency. We found these proxies to be insufficient.
For LFM2.5, we ran a neural architecture search (NAS) with a key ingredient: hardware-in-the-loop evaluation. Instead of estimating speed from theoretical compute budgets, we measured actual latency and memory footprint on the target device: a Samsung Galaxy S24 with a Qualcomm SoC.
Why does this matter? Because real-world performance depends on factors that FLOPs don’t capture: memory access patterns, which kernels are actually available on the hardware, cache hit rates, and operator fusion opportunities. A model that looks efficient on paper can be slow in practice if it hammers memory bandwidth or relies on operations without optimized kernel implementations.
Running NAS with hardware in the loop lets us explore a large design space while guaranteeing that the resulting architecture is fast on the actual target device.
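To make the idea concrete, here is a minimal sketch of a hardware-in-the-loop search loop. Everything here is illustrative: measure_on_device is a hypothetical stand-in for flashing a candidate to the phone and benchmarking it (replaced by a noisy toy cost model so the sketch runs), and "largest feasible model" stands in for a real quality predictor.

```python
import random

def measure_on_device(config):
    """Stand-in for an on-device benchmark (hypothetical interface).

    In a real pipeline this would run the candidate architecture on the
    target phone and return measured decode latency (ms/token) and peak
    memory (MB); here a toy cost model with noise keeps the sketch
    runnable.
    """
    d, layers = config["d_model"], config["n_layers"]
    latency_ms = 0.001 * d * layers * random.uniform(0.9, 1.1)
    memory_mb = 0.002 * d * d * layers / 1e3
    return latency_ms, memory_mb

def search(candidates, latency_budget_ms, memory_budget_mb):
    # Keep only candidates that meet the *measured* budgets, then pick
    # the largest one as a stand-in for "highest predicted quality".
    feasible = []
    for cfg in candidates:
        latency, memory = measure_on_device(cfg)
        if latency <= latency_budget_ms and memory <= memory_budget_mb:
            feasible.append((cfg["d_model"] * cfg["n_layers"], cfg))
    if not feasible:
        return None
    return max(feasible, key=lambda t: t[0])[1]
```

The key point the sketch captures is that the feasibility check uses measured numbers, not FLOP estimates, so candidates that look cheap on paper but stress memory bandwidth get rejected automatically.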
Pretraining: Beyond Next-Token Prediction
This is where the “LLMs are just next-token predictors” narrative breaks down.
For LFM2.5, we used knowledge distillation (KD) as the primary pretraining objective. We trained LFM2.5 (the student) to match the output distribution of our existing LFM1-7B model (the teacher), rather than training against next-token labels.
Why? The teacher’s output probabilities encode richer information than a single correct-token label. The soft distribution carries signals about which tokens are plausible alternatives, relative likelihoods, and the model’s uncertainty structure. The student receives a denser gradient signal per training step, which translates to better sample efficiency: the model reaches a given quality level with fewer tokens seen and fewer GPU hours spent.
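A minimal sketch of the contrast, for a single token position, with plain Python lists standing in for logit tensors (the real objective, batching, and any temperature scaling are omitted):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits):
    """Forward KL(teacher || student) for one token position.

    The teacher's full distribution supervises every vocabulary entry,
    which is the "denser gradient signal" described above.
    """
    p = softmax(teacher_logits)   # soft targets from the teacher
    q = softmax(student_logits)   # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ce_loss(student_logits, target_index):
    # Standard next-token cross-entropy: supervises only one entry.
    q = softmax(student_logits)
    return -math.log(q[target_index])
```

Cross-entropy only tells the student "raise this one token"; the KD loss additionally tells it how much probability every other token deserves.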
I have covered this in more detail in a recent substack post: https://substack.com/home/post/p-174936349.
Context Extension and Synthetic Data
We keep the initial pretraining cheap and scalable by training on a relatively short context window of 4k tokens using internet-scale data. We then pull two additional levers:
Context extension. We run a continued pretraining (CPT) phase to extend the native context length to 32k tokens. From there, we can push further to 128k using extrapolation techniques like YaRN. This staged approach is far more compute-efficient than pretraining on long contexts from the start due to the quadratic cost of the grouped-query attention (GQA) layers.
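For intuition, here is a heavily simplified sketch of YaRN-style RoPE frequency scaling. High-frequency dimensions (short wavelengths, which encode local position) are left alone, low-frequency ones are interpolated by the context scale factor, and a linear ramp blends between the two regimes. The constants and ramp here only loosely follow the published scheme; real implementations differ in detail and also apply an attention temperature correction not shown here.

```python
import math

def yarn_scaled_freqs(dim, base=10000.0, scale=4.0, orig_ctx=32768,
                      beta_fast=32, beta_slow=1):
    """Illustrative YaRN-style scaling of RoPE frequencies.

    Returns one frequency per rotary dimension pair. `scale` is the
    context extension factor (e.g. 32k -> 128k gives scale=4).
    """
    freqs = []
    for i in range(0, dim, 2):
        freq = base ** (-i / dim)
        wavelength = 2 * math.pi / freq
        # Ramp from "keep as-is" to "fully interpolate by scale".
        low = orig_ctx / beta_fast    # below this wavelength: keep
        high = orig_ctx / beta_slow   # above this wavelength: interpolate
        t = min(max((wavelength - low) / (high - low), 0.0), 1.0)
        freqs.append(freq * ((1 - t) + t / scale))
    return freqs
```

The staged recipe then reads: pretrain at 4k, CPT to 32k with unmodified rotary frequencies, and apply this kind of scaling to reach 128k without another long-context training run.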
Synthetic data. Web-scraped data is broad but shallow. It covers many topics at a surface level but lacks the depth needed for strong performance in domains like mathematics or structured reasoning. We address this by generating targeted synthetic data with our larger models, filling in the gaps that web data leaves behind. This lets us steer the model’s capabilities toward specific domains without needing to find (or pay for) large quantities of high-quality human-written data in those areas.
Supervised Fine-Tuning: Less Data, More Curation
At the SFT stage, data volume matters less than data quality and topic coverage.
Data filtering. We run a multi-stage filtering and selection pipeline to identify the training examples with the highest signal. At this stage, a few thousand well-chosen examples outperform a large corpus of mediocre ones — the model has already learned general capabilities during pretraining, so SFT is about shaping behavior, not teaching knowledge.
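The shape of such a pipeline can be sketched in a few lines. The scorers here are hypothetical placeholders for real quality signals (classifier scores, dedup checks, judge ratings); the structure, hard thresholds per stage followed by rank-and-keep, is the part that carries over.

```python
def filter_sft_data(examples, scorers, keep_fraction=0.1):
    """Multi-stage filtering and selection sketch.

    `scorers` is a list of (score_fn, threshold) stages. Each stage
    drops examples below its hard threshold; survivors are ranked by
    combined score and only the top fraction is kept.
    """
    survivors = examples
    for score, threshold in scorers:
        survivors = [ex for ex in survivors if score(ex) >= threshold]
    survivors.sort(key=lambda ex: sum(s(ex) for s, _ in scorers),
                   reverse=True)
    keep = max(1, int(len(survivors) * keep_fraction))
    return survivors[:keep]
```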
Model merging. We train multiple SFT models on different data distributions and merge their weights. This produces a model that combines the strengths of each specialist without the distribution gaps any single one would have. We apply this iteratively throughout post-training, re-merging as individual components improve.
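In its simplest form, weight-space merging is just a weighted average of matching parameters. Real merging methods (SLERP, TIES-style sign resolution, and similar) are more involved; this sketch shows only the plain linear case, with Python lists standing in for parameter tensors.

```python
def merge_models(state_dicts, weights=None):
    """Linear weight-space merge of several fine-tuned checkpoints.

    All checkpoints must share the same architecture and parameter
    names; each merged parameter is a weighted average of the
    corresponding parameters across checkpoints.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(state_dicts[0][name]))
        ]
    return merged
```

The iterative use described above amounts to calling this (or a fancier variant) again whenever one of the specialist checkpoints improves.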
Preference Optimization: Mixing Student and Teacher Signals
In our preference optimization stage, we combine two sources of rollout data:
On-student-policy rollouts. We generate candidate responses from the student model itself. This is standard since you need to optimize on the distribution the model actually produces.
On-teacher-policy rollouts. We also generate responses from the teacher model. This brings in responses the student wouldn’t have produced on its own, expanding the coverage of the preference dataset.
By mixing both sources, the student-policy data keeps training grounded in the model’s actual behavior, while the teacher-policy data provides targets the model can grow toward.
For scoring, we use an LLM-as-a-judge setup with larger models. This gives us a shapeable reward signal across multiple axes, such as correctness, helpfulness, safety, style, etc, rather than collapsing everything into a single scalar.
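Putting the two rollout sources and the judge together, a batch-construction sketch might look like this. The generator and judge callables are hypothetical stand-ins for real sampling and LLM-as-a-judge scoring; in practice the judge would return per-axis scores rather than a single comparable value.

```python
import random

def build_preference_batch(prompts, student_gen, teacher_gen, judge,
                           teacher_frac=0.5, rng=random):
    """Mix on-student and on-teacher rollouts into preference pairs.

    Each pair always contains at least one student rollout (to stay
    grounded in the model's own behavior); with probability
    `teacher_frac` the second rollout comes from the teacher instead,
    expanding coverage. Pairs are ordered (chosen, rejected) by judge
    score.
    """
    pairs = []
    for prompt in prompts:
        a = student_gen(prompt)
        if rng.random() < teacher_frac:
            b = teacher_gen(prompt)   # on-teacher-policy rollout
        else:
            b = student_gen(prompt)   # second on-student-policy rollout
        chosen, rejected = sorted([a, b], key=judge, reverse=True)
        pairs.append((prompt, chosen, rejected))
    return pairs
```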
Reinforcement Learning: Critic-Free and Curriculum-Driven
Our RL stack uses critic-free, group-relative policy-gradient optimization in the style of GRPO. The implementation is reference-free and incorporates several techniques: asymmetric ratio clipping, dynamic filtering of zero-variance prompt groups, overlong-sample masking, no advantage normalization, and truncated importance sampling.
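Two of those ingredients are easy to show in isolation: the group-relative advantage (reward minus group mean, with zero-variance groups dropped and, per the recipe above, no std normalization) and the asymmetric clipped ratio objective. This is a sketch of the mechanics only, not our actual implementation.

```python
def group_relative_advantages(rewards_per_prompt):
    """Critic-free, group-relative advantages in the style of GRPO.

    For each prompt, several rollouts are sampled; each rollout's
    advantage is its reward minus the group mean. Groups where every
    rollout got the same reward carry no learning signal and are
    filtered out.
    """
    advantages = {}
    for prompt, rewards in rewards_per_prompt.items():
        if max(rewards) == min(rewards):
            continue  # zero-variance group: drop it
        mean = sum(rewards) / len(rewards)
        advantages[prompt] = [r - mean for r in rewards]
    return advantages

def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    # Asymmetric ratio clipping: a wider upper bound (eps_high >
    # eps_low) lets low-probability tokens be reinforced more
    # aggressively than the symmetric PPO clip would allow.
    clipped = min(max(ratio, 1 - eps_low), 1 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```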
Beyond the optimization mechanics, we use curriculum RL: training starts on simpler problems and gradually moves to harder ones. For example, in math we begin with arithmetic and algebra before moving to competition-level problems. This ordering helps the model build reliable intermediate skills before facing tasks that depend on them, and we observe more stable training dynamics as a result.
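The scheduling itself can be as simple as cycling through difficulty-ordered task pools. This toy sketch advances after a fixed step count; a real curriculum would typically gate progression on measured success rate instead.

```python
def curriculum_batches(tasks_by_level, steps_per_level):
    """Yield training tasks in increasing difficulty (sketch).

    `tasks_by_level` is an ordered list of task pools, easiest first
    (e.g. arithmetic -> algebra -> competition problems). Each pool is
    cycled for a fixed number of steps before moving on.
    """
    for pool in tasks_by_level:
        for step in range(steps_per_level):
            yield pool[step % len(pool)]
```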
Bonus: Inference Optimizations
Training a good model is only half the battle; you also need to serve it efficiently.
Quantization-aware training (QAT). Standard post-training quantization often degrades accuracy. With QAT, we simulate the effects of low-precision arithmetic during training itself, so the model learns weight representations that are robust to quantization noise. We use stochastic rounding to quantize LFM2.5 down to 4-bit weights while keeping accuracy close to the full-precision baseline.
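Here is a sketch of the fake-quantization step that sits inside a QAT forward pass, including the stochastic rounding: instead of always rounding to nearest, we round up with probability equal to the fractional part, so the rounding is unbiased in expectation. Plain Python lists stand in for weight tensors, and the straight-through gradient handling that real QAT needs is not shown.

```python
import math
import random

def stochastic_round(x, rng=random):
    # Round up with probability equal to the fractional part, so that
    # E[stochastic_round(x)] == x (unbiased rounding).
    lower = math.floor(x)
    frac = x - lower
    return lower + (1 if rng.random() < frac else 0)

def fake_quantize(weights, n_bits=4, rng=random):
    """Simulate low-precision weights inside the training loop (sketch).

    Weights are scaled onto a symmetric signed integer grid,
    stochastically rounded, clamped, then de-quantized. During QAT the
    forward pass sees these noisy weights, so the model learns
    representations robust to quantization error.
    """
    qmax = 2 ** (n_bits - 1) - 1        # e.g. 7 for 4-bit signed
    scale = max(abs(w) for w in weights) / qmax
    quantized = [
        max(-qmax - 1, min(qmax, stochastic_round(w / scale, rng)))
        for w in weights
    ]
    return [q * scale for q in quantized]
```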
Prefix caching. For repeated or partially overlapping queries, we cache the computed key-value pairs for static portions of the input, avoiding redundant computation. This speeds up response times in common deployment patterns where system prompts or context prefixes are shared across requests. Since LFM2.5 is a hybrid architecture with both a KV cache and a convolutional cache, implementing prefix caching was a non-trivial engineering effort, but the latency reduction in production workloads justified it.
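The lookup-and-reuse logic can be sketched independently of the model. Here the cached state is an opaque value (for LFM2.5 it would bundle both the attention KV cache and the convolutional cache), and `prefill` is a hypothetical stand-in for running the model over the uncached suffix.

```python
def longest_cached_prefix(cache, tokens):
    """Find the longest cached prefix of `tokens` (sketch).

    `cache` maps token tuples to precomputed model state. A production
    implementation would use a radix tree rather than this linear scan.
    """
    for end in range(len(tokens), 0, -1):
        prefix = tuple(tokens[:end])
        if prefix in cache:
            return prefix, cache[prefix]
    return (), None

def generate_with_prefix_cache(cache, tokens, prefill):
    # Reuse cached state for the shared prefix and only run the model
    # over the remaining suffix tokens.
    prefix, state = longest_cached_prefix(cache, tokens)
    suffix = tokens[len(prefix):]
    state = prefill(suffix, state)   # hypothetical model prefill call
    cache[tuple(tokens)] = state     # cache the full prompt for reuse
    return state
```

With a shared system prompt, every request after the first skips the prefill over that prompt entirely; only the user-specific suffix is computed.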
The Takeaway
Building a competitive LLM requires pulling many levers at once, including architecture search guided by real hardware measurements, knowledge distillation for pretraining, staged context extension, synthetic data for domain depth, careful SFT curation, mixed-policy preference optimization, curriculum RL, and inference optimizations like QAT and prefix caching.
These levers are only a subset of what we have considered and will explore in the future. None is sufficient on its own. The skill lies in combining them under your specific constraints: in our case, building a capable model that fits on a phone and runs in under 1GB of memory.
If one of these levers excites you, or you’d like to explore others, we are hiring: liquid.ai/careers