13 Comments
Mar 22 · Liked by Cameron R. Wolfe, Ph.D.

Is this a typo? The post says:

"which has an input dimension of d and an output dimension of h = 3 * d,"

but in the code it is 4 * d.

Mar 15 · Liked by Cameron R. Wolfe, Ph.D.

Hi! Loved the post. I was just wondering what software you use to build such beautiful figures?

Thanks and keep up the great work!

Mar 5 · Liked by Cameron R. Wolfe, Ph.D.

A question that I have at the end -- if ALiBi indeed "outperforms" both RoPE and vanilla positional embedding techniques (with respect to extrapolating to longer sequences), why was it "only" (*) used for MPT, rather than for recent models like LLaMA-2, which used RoPE?

I'd like to better understand (maybe a different post, maybe even one you've done) the dimensions across which I'd compare different attention variants.

Mar 5 · Liked by Cameron R. Wolfe, Ph.D.

Thanks Cameron! Small typo:

this operation is someone inefficient

-->

this operation is SOMEWHAT inefficient

Mar 4 · Liked by Cameron R. Wolfe, Ph.D.

Cameron, the lessons you present, month after month, are so consistently excellent that I really wish you would bind them up and sell them as a book!


Hello! Your post has been incredibly insightful. However, I'm still a bit uncertain about how a decoder-only model functions during inference after reading your explanation. I'd greatly appreciate it if you could review my understanding and correct any inaccuracies (which I suspect there may be).

From your description, it's evident that during training, each element of the output sequence predicts the subsequent word corresponding to its input sequence counterpart. But what happens during inference? When presented with a context, let's say of 120 words, how does the model handle it? Here's my interpretation:

Given that predicting the next word for, say, the 10th word in the context seems redundant, I assume the model only uses the output for the final input token in the sequence. Then, once the first output token is generated, it is appended to the input sequence to predict the subsequent word, and so forth. Is that accurate?

Once again, thank you for your insightful post!
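
To make the interpretation above concrete, here is a minimal sketch of that loop in plain Python. Everything in it is an assumption for illustration: `toy_model` is a hypothetical stand-in for a decoder-only transformer (it returns one row of logits per input position, like the real thing, but its "prediction" is just the next token id modulo a toy vocabulary), and `greedy_generate` picks the highest-scoring token rather than sampling.

```python
from typing import Callable, List

VOCAB_SIZE = 5  # hypothetical toy vocabulary size


def toy_model(tokens: List[int]) -> List[List[float]]:
    """Hypothetical stand-in for a decoder-only transformer.

    Returns one row of logits per input position. Position i simply
    favours token (tokens[i] + 1) % VOCAB_SIZE.
    """
    logits = []
    for tok in tokens:
        row = [0.0] * VOCAB_SIZE
        row[(tok + 1) % VOCAB_SIZE] = 1.0
        logits.append(row)
    return logits


def greedy_generate(
    model: Callable[[List[int]], List[List[float]]],
    prompt: List[int],
    max_new_tokens: int,
) -> List[int]:
    """Autoregressive decoding: predict, append, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = model(tokens)   # one row of logits per position
        last = logits[-1]        # only the final position's row is used
        next_token = max(range(len(last)), key=last.__getitem__)
        tokens.append(next_token)  # feed the new token back in as input
    return tokens


print(greedy_generate(toy_model, [0, 1, 2], 4))  # [0, 1, 2, 3, 4, 0, 1]
```

Two caveats on this sketch. The earlier positions are not wasted work: the final token attends to them, so the model still has to process the whole context even though only the last row of logits picks the next token. And recomputing the full sequence at every step, as done here, is the naive version; implementations commonly cache the keys and values of earlier tokens so that each new step only processes the newly appended token.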
