Discussion about this post

Deep Manifold:

1. Big Data: Foundation-model training data is big data, and big-data systems are subject to the CAP theorem (consistency, availability, partition tolerance: pick at most two).

2. Batch Size: Yann LeCun said in 2018, before foundation models, "Friends don't let friends use minibatches larger than 32." There is some wisdom in it if you really understand derivatives. From a fixed-point-theory perspective, current backpropagation is far too primitive.
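A toy sketch of the statistical side of that debate (my construction, not the commenter's argument): the minibatch gradient is a noisy estimate of the full-batch gradient, and its noise shrinks roughly like 1/sqrt(batch size), so small batches keep more noise in the updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression: loss(w) = mean((x*w - y)^2), true w = 2.0.
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(scale=0.5, size=10_000)
w = 0.0  # evaluate gradients at a fixed point w = 0

def grad_std(batch_size, trials=500):
    """Std of the minibatch gradient estimate across sampled batches."""
    grads = []
    for _ in range(trials):
        idx = rng.choice(len(x), size=batch_size, replace=False)
        grads.append(np.mean(2 * (x[idx] * w - y[idx]) * x[idx]))
    return np.std(grads)

# Larger batches give a less noisy gradient estimate.
print(f"std at batch 32  : {grad_std(32):.4f}")
print(f"std at batch 1024: {grad_std(1024):.4f}")
```

Whether that extra noise is a regularizer (as the small-batch camp argues) or just inefficiency is exactly what the quote is about.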

3. Token Embedding size: DeepSeek uses 7,168 and OAI GPT-OSS-120B uses 8,192. Why is such a large size needed? Because the transformer (dense or MoE) is far too rigid; mathematicians recently found that for the most complex manifolds, 128 dimensions are enough. Something is very wrong with the transformer architecture.
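A back-of-the-envelope sketch of what is at stake in that gap (the vocabulary size below is my assumption for illustration, not a figure from the comment): the parameter cost of a token-embedding table grows linearly with the hidden dimension, so 7,168 vs 128 dimensions is a 56x difference per table.

```python
VOCAB_SIZE = 128_000  # assumed vocabulary size, for illustration only

def embedding_params(d_model: int) -> int:
    """Parameters in one token-embedding table of width d_model."""
    return VOCAB_SIZE * d_model

print(f"d=7168: {embedding_params(7168):,} params")
print(f"d= 128: {embedding_params(128):,} params")
print(f"ratio : {embedding_params(7168) / embedding_params(128):.0f}x")
```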

4. Knowledge Distillation: It is knowledge normalization, and it has research value. But can you normalize the entire world? And what about the bitter lesson?
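For readers unfamiliar with the term, a minimal sketch of the standard (Hinton-style) distillation loss, which is what the comment is reacting to: the student matches the teacher's temperature-softened output distribution via KL divergence. The logits below are made up for illustration.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax over a logit vector."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-T distributions, scaled by T^2."""
    p = softmax(teacher_logits, T)  # teacher soft targets
    q = softmax(student_logits, T)  # student predictions
    return float(np.sum(p * (np.log(p) - np.log(q)))) * T * T

teacher = [4.0, 1.0, 0.2]
student = [3.5, 1.2, 0.3]
print(distill_loss(teacher, student))  # near zero when the student agrees
```

The "normalization" framing in the comment fits this view: the student is pulled toward the teacher's smoothed distribution rather than toward raw data.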

