Discussion about this post

Sam

A question that I have at the end -- if ALiBi indeed "outperforms" both RoPE and vanilla positional embedding techniques (with respect to extrapolating to longer sequences), why was it "only" (*) used for MPT, rather than for recent models like LLaMA-2, which used RoPE?

I'd like to better understand (maybe a different post, maybe even one you've done) the dimensions across which I'd compare different attention variants.

Sahana

This is one of the best articles I have seen on the topic. It is exactly what I needed to understand the practical implementation. Thank you so much for taking the time.
