12 Comments
Chandu Tadanki:

Cameron,

My first takeaway is the refresher on the decoder-only transformer, and the second is how an MoE is built on top of it. The final code looks simple, but you brought to light the need for experiments (to stabilize training and push the divergence point further away). That shows the iterative nature of how stability evolves.

I am an ML/DL enthusiast. I was exposed to neural networks thirty years ago during my Master's, but never really worked hands-on with the technology (I lean more toward business processes). This article took me more than a week to read and digest, but it was worth the effort. I took my own notes (and realized that most are just copy/pastes) to fit my own thoughts.

Thanks a lot.

Cameron R. Wolfe, Ph.D.:

That's wonderful to hear! I hope you enjoyed diving back into hands-on neural network work, and I'm happy that the article helped you do that! :)

Paul:

Hi Cameron,

Again, a brilliant article.

You're my favorite writer: grounded in research, covering technical and applicable AI, yet clearly understandable.

A question about the softmax in the router:

Why not use a safe softmax instead of casting to float32?

Cameron R. Wolfe, Ph.D.:

Thanks! I think safe softmax is trying to address a different problem:

"One of the issues that commonly comes up is the necessity for a safe softmax – that is, if there is an entire batch that is “masked out” or consists entirely of padding (which in the softmax case translates to being set to -inf, then this will result in NaNs, which can lead to training divergence." - PyTorch docs page

But, I'm not super familiar with this approach, so it's possible that this is another option (depending upon how it works).
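For concreteness, here is a quick sketch (my own example, not from the docs or the article) of the failure mode that quote describes, next to the float32 cast the question refers to:

```python
import torch

# A row that is entirely "masked out" (all -inf) makes the softmax
# compute 0 / 0, which yields NaNs -- the problem the PyTorch docs
# quote above says safe softmax is meant to address.
fully_masked = torch.full((4,), float("-inf"))
print(torch.softmax(fully_masked, dim=-1))  # tensor([nan, nan, nan, nan])

# The router trick discussed in the article is a different, precision-related
# issue: compute the softmax over router logits in float32 rather than in
# half precision. (Hypothetical logits below, just for illustration.)
router_logits = torch.randn(8, dtype=torch.bfloat16)
router_probs = torch.softmax(router_logits.float(), dim=-1)
```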

Paul:

Alright, alright, thank you for the response.

To be fair: it's just a detail, and the amount of VRAM you save with safe softmax may be limited, if applicable at all.

Just FYI: I learned about safe softmax in the context of FlashAttention, as a measure to prevent overflow, e.g. here under "2 (Safe) Softmax": https://courses.cs.washington.edu/courses/cse599m/23sp/notes/flashattn.pdf

"[...] Note that x_i might be very large and e^x_i can easily overflow. For instance, the maximum number that float16 can support is 65536, which means that for x>11, e^x would exceed the effective range of float16. To mitigate this issue, mathematical software often employs a trick known as the safesoftmax: [...]"

Cameron R. Wolfe, Ph.D.:

Thanks for sharing! I had not seen this before!

Pierre de Lacaze:

Excellent article. The refresher course on decoder-only architectures is always welcome.

Cameron R. Wolfe, Ph.D.:

Glad you liked it!

Dr. Ashish Bamania:

Brilliant!

Cameron R. Wolfe, Ph.D.:

Thank you!

Patryk:

You are genuinely the best, Cameron. Your posts rekindle the love for ML in me that the current hype and industry are constantly butchering.

Amitabha Chakraborty:

Excellent article
