One thing I didn't cover in the presentation two weeks ago is that there is a third way to make LLMs more powerful. The first two are:
- Data & compute: bigger models are more powerful. But the bigger size carries a speed and cost penalty at inference time (as opposed to training, which is likely only about 10% of the cost over a model's lifetime).
- Test-time scaling: make the model think longer at inference time to squeeze out more intelligence
The third is:
- Distillation: have a smarter model teach a smaller model by using the bigger model's outputs as training data for the smaller one. (A minimal sketch of that training loop follows this list.)
- This is equivalent to selectively making the smaller model smarter on the activities where it makes economic sense (coding, physics, math, etc.).
- This means you keep the model small, inference cost low, and speed high, while increasing intelligence on the things you care about.
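
To make the distillation idea concrete, here's a minimal sketch of the teacher --> student training loop using Hugging Face transformers. The student model name and the two toy examples are placeholders I picked for illustration; a real pipeline trains on huge volumes of teacher-generated reasoning traces, but the mechanism is the same:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# In practice, these (prompt, answer) pairs come from the big "teacher"
# model thinking at length; two hand-written stand-ins here.
teacher_outputs = [
    ("What is 17 * 24?",
     "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408."),
    ("Is 91 prime?",
     "91 = 7 * 13, so no, 91 is not prime."),
]

# A small "student" model (an arbitrary choice for illustration).
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)

student.train()
for prompt, answer in teacher_outputs:
    # Ordinary next-token prediction on the teacher's text: the student
    # is trained to imitate the teacher's behavior on these problems.
    batch = tok(prompt + "\n" + answer, return_tensors="pt")
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```

The key point: distillation is just ordinary fine-tuning where the training data comes from a bigger model instead of from humans.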
Why does this matter?
- In the past few weeks, there have been rumors that the AI frontier labs have stopped publishing their next-gen models (GPT-5, and Claude 3.5 Opus, which was reportedly scrapped)
- Word on the street (Twitter) is that, rather than releasing these models, which would be expensive to run, the labs are using them to teach (distill) their smaller models
- Last week, this approach appears to have been proven! A lab out of China released DeepSeek R1 --> it's equivalent to OpenAI's most advanced model, o1 (although o3 is coming out in the coming weeks).
- They made the model open source and released the weights for anyone to download! Unfortunately, you won't be able to run the full model locally: it takes ~400 GB of VRAM.
- But accompanying the release, they also released several smaller models that have been souped up by distillation!
- See the image for the list. According to this benchmark, the distilled Qwen-series model is 2x more powerful than Claude, my daily driver (a GPT-4o-class model).
- Oh, and these distilled models are small enough to run locally on your MacBook! (more in the next post, but a quick sketch is below)
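
As a preview, here's a minimal sketch of querying one of the distilled models locally through Ollama's HTTP API. It assumes you've installed Ollama and already pulled a distilled model; the `deepseek-r1:14b` tag is my assumption for one of the distilled sizes, so check the model library for the exact tags:

```python
import requests

# Ollama serves a local HTTP API on port 11434 by default.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:14b",  # assumed tag for a distilled size
        "prompt": "How many prime numbers are there between 1 and 50?",
        "stream": False,  # return one JSON object instead of a token stream
    },
)
print(resp.json()["response"])
```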

The biggest deal
- With data & compute + test-time scaling + distillation, you can theoretically build a self-recursive LLM system that is daisy-chained to become more and more powerful
- Here is the recipe (sketched in code after this list):
- (a) Have your biggest, smartest model think for a long time on useful problems --> (b) distill its knowledge into a smaller model, making it more powerful
- (c) The smaller model is now almost as powerful as the original, but because it's smaller and cheaper to run, you can give it even more test-time compute (and we know that running an LLM longer --> more intelligence) --> (d) distill those answers to train an even larger, more powerful model
- Rinse and repeat.
- I think this is why the pace of model improvement is accelerating. o1 came out 2-3 months ago, and now o3 is already coming out...
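
To make that loop concrete, here's a toy sketch in code. Everything in this block is a stand-in (no real training API is being called); it only shows the shape of the recursion, not an actual implementation:

```python
def think_long(model, problems, budget):
    """Test-time scaling: spend `budget` compute per problem on long
    chains of thought; returns (problem, reasoning trace) pairs."""
    return [(p, model(p, budget)) for p in problems]

def distill(model, traces):
    """Fine-tune `model` on (problem, reasoning) pairs -- the same
    supervised fine-tuning as in the earlier sketch."""
    return model  # stub: a real version would update the weights

def recursive_improvement(big, small, problems, rounds=3):
    for _ in range(rounds):
        # (a) the big model thinks long and hard -> high-quality traces
        traces = think_long(big, problems, budget="high")
        # (b) distill those traces into the small model
        small = distill(small, traces)
        # (c) the small model is cheap, so give it even more test time...
        traces = think_long(small, problems, budget="very high")
        # (d) ...and use its answers to train the next, bigger model
        big = distill(big, traces)
    return big, small
```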