Cameron,
My first takeaway is a refresher on the decoder-only Transformer, and the second is how an MoE is built on top of it. The final code looks simple, but you brought to light the need for experiments (for stabilizing training and pushing the divergence point further out). That shows the iterative nature of how stability evolves.
I'm an ML/DL enthusiast. I was exposed to neural networks thirty years ago during my Masters, but never really worked hands-on with the technology (I'm more on the business-process side). This article took me more than a week to read and digest, but it was worth the effort. I took my own notes (and realized that most are just copy/pastes) to fit my own thoughts.
Thanks a lot.
That's wonderful to hear! I hope you enjoyed diving back into hands-on work with neural nets, and I'm happy that the article helped you to do that! :)
Hi Cameron,
again, brilliant article.
You're my favorite writer: grounded in research, technical, and applicable AI, yet clearly understandable.
Question about the softmax in the router:
Why not use safe softmax instead of float32?
Thanks! I think safe softmax is trying to address a different problem:
"One of the issues that commonly comes up is the necessity for a safe softmax – that is, if there is an entire batch that is “masked out” or consists entirely of padding (which in the softmax case translates to being set to -inf, then this will result in NaNs, which can lead to training divergence." - PyTorch docs page
But, I'm not super familiar with this approach, so it's possible that this is another option (depending upon how it works).
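To make that concrete, here's a quick toy example I put together (not from the article) showing the NaN failure mode the docs describe:

import torch

# A row of attention scores that is entirely masked out (all -inf),
# e.g. a sequence made up of nothing but padding tokens.
scores = torch.full((1, 4), float("-inf"))

# exp(-inf) = 0 for every entry, so the softmax normalizer is 0 and
# every output becomes 0/0 = NaN, which can then poison training.
probs = torch.softmax(scores, dim=-1)
print(probs)  # tensor([[nan, nan, nan, nan]])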
alright, alright, thank you for the response.
to be fair: it's just a detail and the amount of VRAM you save with safe softmax may be limited if applicable at all.
just FYI: I learned about safe softmax in the context of flash attention as a measure to prevent overflow, e.g. here under "2 (Safe) Softmax": https://courses.cs.washington.edu/courses/cse599m/23sp/notes/flashattn.pdf
"[...] Note that x_i might be very large and e^x_i can easily overflow. For instance, the maximum number that float16 can support is 65536, which means that for x>11, e^x would exceed the effective range of float16. To mitigate this issue, mathematical software often employs a trick known as the safesoftmax: [...]"
Thanks for sharing! I had not seen this before!
Excellent article. The refresher course on decoder-only architectures is always welcome.
Glad you liked it!
Brilliant!
Thank you!
You are genuinely the best, Cameron. Your posts rekindle the love for ML in me that the current hype and industry are constantly butchering.
Excellent article