Discussion about this post

Sam

A question I have at the end: if ALiBi indeed "outperforms" both RoPE and vanilla positional embedding techniques (with respect to extrapolating to longer sequences), why was it only used for MPT, rather than for more recent models like LLaMA-2, which used RoPE?

I'd also like to better understand (maybe in a different post, maybe even one you've already done) the dimensions along which to compare different attention variants.
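For context on the ALiBi question: ALiBi drops positional embeddings entirely and instead adds a per-head linear penalty to the attention scores based on the distance between query and key positions, which is what lets it extrapolate to sequences longer than those seen during training. Below is a minimal sketch of that bias; it assumes the head count is a power of two (so the standard geometric slopes apply), and the name alibi_bias is illustrative rather than taken from any particular codebase.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear bias added to attention scores, shape (n_heads, seq_len, seq_len)."""
    # Standard geometric slopes; this simple form assumes n_heads is a power of two.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    positions = torch.arange(seq_len)
    # Distance from query position i back to key position j; future positions are
    # clamped to 0 here and removed by the causal mask anyway.
    distance = (positions[:, None] - positions[None, :]).clamp(min=0)
    return -slopes[:, None, None] * distance

# Usage: add the returned bias to the raw query-key attention scores before the softmax,
# in place of any positional embedding added to the token embeddings.
```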

Srikar

Is this a typo? The post says "...which has an input dimension of d and an output dimension of h = 3 * d," but in the code it is 4 * d.
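For reference on the 3 * d vs. 4 * d question: in the usual decoder-block layout, an output dimension of 3 * d corresponds to the fused query/key/value projection (three vectors of size d each), while 4 * d is the conventional hidden size of the feed-forward sublayer. Which of the two the post's code intends is for the author to confirm; the sketch below only illustrates the common convention, with illustrative names and values.

```python
import torch.nn as nn

d = 512  # model (embedding) dimension; illustrative value

# Fused attention projection: maps each token of size d to 3 * d,
# later split into a query, a key, and a value of size d each.
qkv_proj = nn.Linear(d, 3 * d)

# Feed-forward sublayer: the hidden dimension is conventionally 4 * d.
feed_forward = nn.Sequential(
    nn.Linear(d, 4 * d),
    nn.GELU(),
    nn.Linear(4 * d, d),
)
```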

