First of all, this is incredible, thanks for sharing this with us all.
Second, at risk of exposing my lack of technical knowledge -- in the Harmony prompt example you have "always respond in riddles" nested in the developer level of the hierarchy. Should the final output have therefore been a riddle or...? Would love to have this riddle about a riddle explained!
Yes it should be a riddle! Great point! I'll fix this LOL
Riddle is now included :)
I finally found the time to read the blog, quality content as always, thank you very much!
One little typo caught my eye, "For example, Gemma-3 adopts a 5:1 ratio, meaning that there is one dense attention layer for ever*missing_y* five sliding window attention layers."
And one question regarding "Specifically, GPT-oss uses group sizes of eight—meaning that key and query values are shared among groups of eight attention heads—for grouped-query attention in both model sizes."
Is this correct, or should it be "meaning that keys and values are shared among groups of eight attention heads" instead of "meaning that key and query values are shared among groups of eight attention heads"? I thought queries are not shared, or am I misunderstanding something?
Thank you and best regards
Thank you for the typo - just fixed it.
You're correct. It should be keys / values (not keys / queries). I just fixed this as well - thank you so much!
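For anyone following along, the corrected statement can be sketched as a simple index mapping: each query head keeps its own projection, and consecutive groups of eight query heads read from the same key/value head. This is only an illustration; the total of 64 query heads and the helper name `kv_head_for_query_head` are assumptions made for the example, not something taken from the article.

```python
def kv_head_for_query_head(query_head: int, group_size: int = 8) -> int:
    """Return the index of the shared key/value head that a query head uses."""
    return query_head // group_size


# Assumed for illustration only: 64 query heads, group size of eight.
num_query_heads = 64
group_size = 8

# Each group of eight query heads shares one key/value head.
num_kv_heads = num_query_heads // group_size

# mapping[h] is the key/value head used by query head h.
mapping = [kv_head_for_query_head(h, group_size) for h in range(num_query_heads)]

print(num_kv_heads)
print(mapping[:10])
```

Query heads 0 through 7 map to key/value head 0, heads 8 through 15 map to key/value head 1, and so on. The queries themselves are never shared; only the keys and values are.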
I think the image in the Routing section has a minor bug? In the formula of softmax, the denominator should be the sum of all N experts, not K?
You’re right! Just fixed it. Thank you for calling this out.
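A minimal sketch of the formulation described in this comment, where the softmax denominator sums over all N experts and the top K are picked afterwards. The logits, the value of K, and the function names are hypothetical, and this is not necessarily how any particular model implements its router.

```python
import math


def softmax(logits):
    """Softmax over every expert: the denominator sums all N terms."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, k):
    """Return (expert_index, probability) pairs for the k most probable experts."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]


# Hypothetical router logits for N = 4 experts, selecting K = 2.
router_logits = [2.0, 0.5, 1.0, -1.0]
selected = route(router_logits, k=2)
print(selected)
```

Because the normalization runs over all N experts, the probabilities of the K selected experts sum to less than one unless they are renormalized afterwards.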
Superb breakdown @Cameron.
The RoPE section is really intuitive.
Thanks so much!
Thanks a lot @Cameron! This is exactly what I wanted to read. This article is so rich in knowledge.
May I ask one question? May I use your articles as a reference for my educational writing and tutorials? Thank you.
Go for it as long as you provide a citation / reference!
Thanks Cameron, very impressive recap!
Of course! Thanks for reading