A question that I have at the end -- if ALiBi indeed "outperforms" both RoPE and vanilla positional embedding techniques (with respect to extrapolating to longer sequences), why was it "only" (*) used for MPT, rather than for recent models like LLaMA-2, which used RoPE?
I'd like to better understand (maybe a different post, maybe even one you've done) the dimensions across which I'd compare different attention variants.
(IE I'm guessing that saying "A outperforms B" probably leaves some nuance on the table that I'd love to know about)
Great question. I asked myself the same thing: why aren't more LLMs using ALiBi if the paper shows clearly that it's a lot better than RoPE at extrapolating to longer sequences? I went digging for an answer and wasn't able to find anything definitive. So, I'm probably going to follow up with another post eventually that compares position embedding strategies. Hopefully, I will have answered this question by then :)
For a more nuanced comparison of RoPE and ALiBi, I'd highly recommend reading the ALiBi paper, which makes a very direct comparison: https://arxiv.org/abs/2108.12409
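For anyone skimming the linked paper, the core idea is compact enough to sketch in a few lines. This is a minimal illustration of the ALiBi bias only (the function name and tensor shapes are my own, and the slope formula assumes the number of heads is a power of two, as in the paper's simplest case):

    import torch

    def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
        # Head-specific slopes: the geometric sequence 2^(-8/n_heads), 2^(-16/n_heads), ...
        # from the paper (this simple form assumes n_heads is a power of two).
        slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
        pos = torch.arange(seq_len)
        # distance[i, j] = j - i = -(i - j): zero on the diagonal and increasingly
        # negative for keys further in the past; future positions (j > i) are
        # removed by the causal mask anyway.
        distance = pos.view(1, -1) - pos.view(-1, 1)
        return slopes.view(-1, 1, 1) * distance.view(1, seq_len, seq_len)

The bias is simply added to the raw attention scores (shape (n_heads, seq_len, seq_len)) before the causal mask and softmax; no position embeddings are added to the token inputs, and the resulting recency penalty is what the paper credits for ALiBi's ability to extrapolate to longer sequences.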
Is this a typo? The post says:
"which has an input dimension of d and an output dimension of h = 3 * d"
but in the code it is 4 * d.
Yep, that's a typo. Just fixed it.
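For context on where both numbers show up in a standard GPT-style block (a minimal sketch with illustrative names, not the post's code): the fused query/key/value projection maps dimension d to 3 * d, while the feed-forward hidden layer is conventionally expanded to 4 * d, so the two are easy to mix up.

    import torch
    import torch.nn as nn

    d = 768  # model (embedding) dimension, just an example value

    # Fused self-attention projection: input dimension d, output dimension 3 * d,
    # which is later split into queries, keys, and values of dimension d each.
    qkv_proj = nn.Linear(d, 3 * d)
    x = torch.randn(1, 10, d)                  # (batch, sequence length, d)
    q, k, v = qkv_proj(x).split(d, dim=-1)     # each of shape (1, 10, d)

    # Feed-forward (MLP) hidden layer in the same block: conventionally 4 * d.
    ffnn_hidden = nn.Linear(d, 4 * d)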
Hi! Loved the post. I was just wondering what software you use to build such beautiful figures?
Thanks and keep up the great work!
Google Slides! :)
Thanks Cameron! Small typo:
this operation is someone inefficient
-->
this operation is SOMEWHAT inefficient
Thank you! Just fixed it
Cameron, the lessons you present month after month are so consistently excellent that I really wish you would bind them up and sell them as a book!
Definitely considering it. I don't think I've reached a point where I have enough content to make a comprehensive/cohesive book, but I might start trying to move in this direction soon (e.g., publishing a foundations article once every month or two)! It would be very cool to put a book together on some of these lessons/ideas that goes from foundations to modern research.
In the "causal_self_attention.py", shouldn't it be "self.mask" instead of "self.bias"?
Hello! Your post has been incredibly insightful. However, I still find myself a bit uncertain about how the Decoder-only model functions during inference after reading your explanation. I'd greatly appreciate it if you could review my understanding and kindly correct any inaccuracies (which I suspect there may be).
From your description, it's evident that during training, each element of the output sequence predicts the subsequent word corresponding to its input sequence counterpart. But what happens during inference? When presented with a context, let's say of 120 words, how does the model handle it? Here's my interpretation:
Given that predicting the next word for, say, the 10th word in the context seems redundant, I assume the model only uses the output for the final input token in the sequence. Then, once the first output token is generated, it is appended to the input sequence to predict the subsequent word, and so forth. Is that accurate?
Once again, thank you for your insightful post!
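The decoding loop described in the question above is short enough to sketch directly. This assumes a PyTorch-style decoder-only model that maps a (batch, seq_len) tensor of token ids to logits of shape (batch, seq_len, vocab_size); the model and argument names are placeholders, and greedy decoding stands in for whatever sampling strategy is actually used:

    import torch

    @torch.no_grad()
    def generate(model, token_ids, num_new_tokens):
        # token_ids: (1, seq_len) tensor holding the tokenized context (e.g., 120 words)
        for _ in range(num_new_tokens):
            logits = model(token_ids)                                    # (1, seq_len, vocab_size)
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # only the last position is used
            token_ids = torch.cat([token_ids, next_token], dim=1)        # feed the new token back in
        return token_ids

In practice, implementations also cache the keys and values of earlier tokens (a "KV cache"), so each step only computes attention for the newly appended token instead of re-processing the entire sequence.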