Discussion about this post

Chandu Tadanki:

Cameron,

My first takeaway is a refresher on the decoder-only Transformer, and the second is how an MoE is built on top of it. When we look at the final code, it looks simple, but you brought to light the need for experimentation (to stabilize training and push away the divergence point). That shows the iterative way training stability has evolved.
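For readers who want to see that second takeaway concretely, here is a minimal sketch of how an MoE layer can replace the dense feed-forward sublayer in a decoder block. This is illustrative only, not the article's exact code; the class name, sizes, and the top-k routing scheme shown are all assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Illustrative MoE layer that stands in for the dense FFN of a
    decoder block. All names and hyperparameters here are hypothetical."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router: a linear layer producing one logit per expert, per token.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is an ordinary two-layer feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: [batch, seq, d_model]
        logits = self.router(x)                                 # [B, S, E]
        # Router probabilities computed in float32 for numerical stability.
        probs = F.softmax(logits, dim=-1, dtype=torch.float32)
        weights, idx = probs.topk(self.top_k, dim=-1)           # [B, S, k]
        # Renormalize the selected top-k weights so they sum to 1 per token.
        weights = (weights / weights.sum(-1, keepdim=True)).to(x.dtype)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e   # tokens routed to expert e at rank k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: drop-in replacement for the FFN inside a decoder block.
moe = MoEFeedForward()
y = moe(torch.randn(2, 16, 512))  # [batch=2, seq=16, d_model=512]
```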

I am an ML/DL enthusiast. I was exposed to neural networks thirty years ago during my Master's, but never really worked hands-on with the technology (I am more oriented toward business processes). This article took me more than a week to read and digest, but it was worth the effort. I took my own notes (and realized that most are just copy/pastes) to fit my own thoughts.

Thanks a lot.

Paul:

Hi Cameron,

Again, a brilliant article.

You're my favorite writer: grounded in research, technically deep, and applicable, yet clearly understandable.

A question about the softmax in the router:

Why not use a safe softmax instead of computing it in float32?
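For context, here is a sketch of the two options the question contrasts. A "safe" softmax subtracts the row max before exponentiating, which prevents overflow; standard implementations such as torch.softmax already do this internally. Casting the router logits to float32 addresses a different problem: rounding error in low-precision arithmetic. The shapes and dtypes below are hypothetical:

```python
import torch

def safe_softmax(logits: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """'Safe' softmax: shift by the row max so the largest exponent is 0,
    preventing overflow. torch.softmax already does this internally."""
    shifted = logits - logits.max(dim=dim, keepdim=True).values
    exp = shifted.exp()
    return exp / exp.sum(dim=dim, keepdim=True)

# Hypothetical router logits in half precision, as in mixed-precision training.
logits = torch.randn(4, 8, dtype=torch.bfloat16)

# Max subtraction alone still computes exp and the normalizing sum in
# bfloat16 (~8 mantissa bits), so the probabilities keep large rounding error.
p_bf16 = safe_softmax(logits)

# Casting to float32 targets precision, not just overflow: exp and the sum
# are accumulated with a 24-bit mantissa before any rounding back down.
p_fp32 = torch.softmax(logits.float(), dim=-1)
```

In other words, the two techniques solve different problems. Because small differences in router probabilities can flip which experts win the top-k selection, the float32 cast guards against low-precision rounding, which max subtraction alone does not address.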
