14 Comments

A question that I have at the end -- if ALiBi indeed "outperforms" both RoPE and vanilla positional embedding techniques (with respect to extrapolating to longer sequences), why was it "only" (*) used for MPT, rather than for recent models like LLaMA-2, which used RoPE?

I'd like to better understand (maybe a different post, maybe even one you've done) the dimensions across which I'd compare different attention variants.

(i.e., I'm guessing that saying "A outperforms B" probably leaves some nuance on the table that I'd love to know about.)

Great question. I asked myself the same thing: why aren't more LLMs using ALiBi if the paper clearly shows that it extrapolates to longer sequences much better than RoPE? I went digging for an answer and wasn't able to find anything definitive. So, I'll probably follow up eventually with another post that compares position embedding strategies. Hopefully, I will have answered this question by then :)

For a more nuanced comparison of RoPE and ALiBi, I'd highly recommend reading the ALiBi paper: https://arxiv.org/abs/2108.12409

It makes a very direct comparison.

Is it a typo?

which has an input dimension of d and an output dimension of h = 3 * d,

in the code it is 4*d

Yep, that's a typo. Just fixed it.

Hi! Loved the post. I was just wondering what software you use to build such beautiful figures?

Thanks and keep up the great work!

Google slides! :)

Thanks Cameron! Small typo:

this operation is someone inefficient

-->

this operation is SOMEWHAT inefficient

Thank you! Just fixed it

Cameron, the lessons you present month after month are so consistently excellent that I really wish you would bind them up and sell them as a book!

Definitely considering it. I don't think I've reached a point where I have enough content to make a comprehensive/cohesive book, but I might start trying to move in this direction soon (e.g., publishing a foundations article once every month or two)! It would be very cool to put a book together on some of these lessons/ideas that goes from foundations to modern research.

In the "causal_self_attention.py", shouldn't it be "self.mask" instead of "self.bias"?

Hello! Your post has been incredibly insightful. However, I still find myself a bit uncertain about how the Decoder-only model functions during inference after reading your explanation. I'd greatly appreciate it if you could review my understanding and kindly correct any inaccuracies (which I suspect there may be).

From your description, it's evident that during training, each element of the output sequence predicts the subsequent word corresponding to its input sequence counterpart. But what happens during inference? When presented with a context, let's say of 120 words, how does the model handle it? Here's my interpretation:

Given that predicting the next word for, say, the 10th word in the context seems redundant, I assume the model only computes the output for the final input token in the sequence. Then, once the first output token is generated, it is appended to the input sequence to predict the subsequent word, and so forth. Is that accurate?

Once again, thank you for your insightful post!
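
The generation loop described in this comment can be sketched roughly as follows. This is only a minimal illustration, not the code from the post: `toy_model` is a hypothetical stand-in for a decoder-only transformer, and `greedy_generate` is an assumed helper name. One nuance worth noting is that the model still produces an output for every input position (later tokens need to attend to earlier ones), but only the output at the final position is used to choose the next token.

```python
from typing import Callable, List

Logits = List[float]


def toy_model(tokens: List[int], vocab_size: int = 5) -> List[Logits]:
    """Hypothetical stand-in for a decoder-only transformer.

    Returns one logit vector per input position, as a real model would.
    The toy rule puts the highest score on (token + 1) % vocab_size.
    """
    return [
        [1.0 if v == (t + 1) % vocab_size else 0.0 for v in range(vocab_size)]
        for t in tokens
    ]


def greedy_generate(
    model: Callable[[List[int]], List[Logits]],
    prompt: List[int],
    max_new_tokens: int,
) -> List[int]:
    """Greedy autoregressive decoding: pick the top token, append, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = model(tokens)      # one logit vector per position
        last = logits[-1]           # only the final position is used
        next_token = max(range(len(last)), key=last.__getitem__)
        tokens.append(next_token)   # feed the new token back in as input
    return tokens


print(greedy_generate(toy_model, [0, 1, 2], 4))  # [0, 1, 2, 3, 4, 0, 1]
```

Real implementations typically cache the keys and values computed for earlier positions, so each step after the first only has to process the newest token rather than the whole sequence again. They also often sample from the output distribution instead of always taking the highest-scoring token.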
