Discussion about this post

Benjamin Riley

First of all, this is incredible, thanks for sharing this with us all.

Second, at risk of exposing my lack of technical knowledge -- in the Harmony prompt example you have "always respond in riddles" nested in the developer level of the hierarchy. Should the final output have therefore been a riddle or...? Would love to have this riddle about a riddle explained!

Paul

I finally found the time to read the blog, quality content as always, thank you very much!

One little typo caught my eye, "For example, Gemma-3 adopts a 5:1 ratio, meaning that there is one dense attention layer for ever*missing_y* five sliding window attention layers."

And one question regarding "Specifically, GPT-oss uses group sizes of eight—meaning that key and query values are shared among groups of eight attention heads—for grouped-query attention in both model sizes."

Is this correct, or should it instead read "meaning that keys and values are shared among groups of eight attention heads"? I thought query values are not shared -- or do I misunderstand something?

Thank you and best regards
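For what it's worth, the sharing direction can be sketched in a few lines of NumPy. The dimensions below are made up for illustration (they are not the gpt-oss config); only the group size of eight comes from the quoted passage. In grouped-query attention, each key/value head is broadcast to a group of eight query heads, while the query projections remain distinct per head:

```python
import numpy as np

# Illustrative sizes (assumptions, not the gpt-oss configuration).
n_q_heads = 64            # hypothetical number of query heads
group_size = 8            # eight query heads share one K/V head
n_kv_heads = n_q_heads // group_size  # -> 8 key/value heads
seq, d_head = 6, 4

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, d_head))   # one Q per query head
k = rng.standard_normal((n_kv_heads, seq, d_head))  # fewer K/V heads
v = rng.standard_normal((n_kv_heads, seq, d_head))

# Broadcast each K/V head to its group of eight query heads;
# the queries themselves are never shared.
k_full = np.repeat(k, group_size, axis=0)  # (64, seq, d_head)
v_full = np.repeat(v, group_size, axis=0)

# Standard scaled dot-product attention per query head.
scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_head)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_full

print(out.shape)  # (64, 6, 4)
```

Query heads 0-7 all attend against the same `k[0]`/`v[0]`, heads 8-15 against `k[1]`/`v[1]`, and so on, which is exactly the "keys and values are shared" reading.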

