12 Comments
Chandu Tadanki:

Cameron,

My first takeaway is the refresher on the decoder-only transformer, and the second is how an MoE is built on top of it. The final code looks simple, but you brought to light the need for experiments (to stabilize training and push the divergence point further away). That shows the iterative nature of how stability evolves.

I am an ML/DL enthusiast. I was exposed to neural networks thirty years ago during my Master's, but never really worked hands-on with the technology (I lean more toward business processes). This article took me more than a week to read and digest, but it was worth the effort. I took my own notes (and realized that most are just copy/pastes) to fit my own thoughts.

Thanks a lot.

Cameron R. Wolfe, Ph.D.:

That's wonderful to hear! I hope you enjoyed diving back into hands-on neural network work, and I'm happy that the article helped you do that! :)

Paul:

Hi Cameron,

Again, a brilliant article.

You're my favorite writer: grounded in research, covering technical and applicable AI, yet clearly understandable.

A question about the softmax in the router:

Why not use a safe softmax instead of casting to float32?

Cameron R. Wolfe, Ph.D.:

Thanks! I think safe softmax is trying to address a different problem:

"One of the issues that commonly comes up is the necessity for a safe softmax – that is, if there is an entire batch that is “masked out” or consists entirely of padding (which in the softmax case translates to being set to -inf, then this will result in NaNs, which can lead to training divergence." - PyTorch docs page

But, I'm not super familiar with this approach, so it's possible that this is another option (depending upon how it works).
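For concreteness, here is a quick sketch (my own example, not from the docs or the article) of the failure mode that quote describes, next to the float32 cast the question refers to:

```python
import torch

# A row that is entirely "masked out" (all -inf) makes the softmax
# compute 0 / 0, which yields NaNs -- the problem the PyTorch docs
# quote above says safe softmax is meant to address.
fully_masked = torch.full((4,), float("-inf"))
print(torch.softmax(fully_masked, dim=-1))  # tensor([nan, nan, nan, nan])

# The router trick discussed in the article is a different, precision-related
# issue: compute the softmax over router logits in float32 rather than in
# half precision. (Hypothetical logits below, just for illustration.)
router_logits = torch.randn(8, dtype=torch.bfloat16)
router_probs = torch.softmax(router_logits.float(), dim=-1)
```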

Paul:

Alright, alright, thank you for the response.

To be fair: it's just a detail, and the amount of VRAM you save with safe softmax may be limited, if applicable at all.

Just FYI: I learned about safe softmax in the context of FlashAttention, as a measure to prevent overflow, e.g. here under "2 (Safe) Softmax": https://courses.cs.washington.edu/courses/cse599m/23sp/notes/flashattn.pdf

"[...] Note that x_i might be very large and e^x_i can easily overflow. For instance, the maximum number that float16 can support is 65536, which means that for x>11, e^x would exceed the effective range of float16. To mitigate this issue, mathematical software often employs a trick known as the safesoftmax: [...]"

Cameron R. Wolfe, Ph.D.:

Thanks for sharing! I had not seen this before!

Pierre de Lacaze:

Excellent article. The refresher course on decoder-only architectures is always welcome.

Cameron R. Wolfe, Ph.D.:

Glad you liked it!

Dr. Ashish Bamania:

Brilliant!

Cameron R. Wolfe, Ph.D.:

Thank you!

Patryk:

You are genuinely the best, Cameron. Your posts rekindle the love for ML in me that the current hype and industry are constantly butchering.

Amitabha Chakraborty:

Excellent article
