A question that I have at the end -- if ALiBi indeed "outperforms" both RoPE and vanilla positional embedding techniques (with respect to extrapolating to longer sequences), why was it "only" (*) used for MPT, rather than for recent models like LLaMA-2, which used RoPE?
I'd like to better understand (maybe a different post, maybe even one you've done) the dimensions across which I'd compare different attention variants.
(IE I'm guessing that saying "A outperforms B" probably leaves some nuance on the table that I'd love to know about)
Great question. I asked myself the same thing: why aren't more LLMs using ALiBi if the paper shows clearly that it's a lot better than RoPE at extrapolating to longer sequences? I went digging for an answer and wasn't able to find anything definitive. So, I'm probably going to follow up with another post eventually that compares position embedding strategies. Hopefully, I will have answered this question by then :)
For a more nuanced comparison of RoPE and ALiBi, I'd highly recommend reading the ALiBi paper, which makes a very direct comparison: https://arxiv.org/abs/2108.12409
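For anyone skimming the linked paper, the core idea is compact enough to sketch in a few lines. This is a minimal illustration of the ALiBi bias only (the function name and tensor shapes are my own, and the slope formula assumes the number of heads is a power of two, as in the paper's simplest case):

    import torch

    def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
        # Head-specific slopes: the geometric sequence 2^(-8/n_heads), 2^(-16/n_heads), ...
        # from the paper (this simple form assumes n_heads is a power of two).
        slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
        pos = torch.arange(seq_len)
        # distance[i, j] = j - i = -(i - j): zero on the diagonal and increasingly
        # negative for keys further in the past; future positions (j > i) are
        # removed by the causal mask anyway.
        distance = pos.view(1, -1) - pos.view(-1, 1)
        return slopes.view(-1, 1, 1) * distance.view(1, seq_len, seq_len)

The bias is simply added to the raw attention scores (shape (n_heads, seq_len, seq_len)) before the causal mask and softmax; no position embeddings are added to the token inputs, and the resulting recency penalty is what the paper credits for ALiBi's ability to extrapolate to longer sequences.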
Is this a typo? The post says:
"which has an input dimension of d and an output dimension of h = 3 * d"
but in the code it is 4 * d.
Yep, that's a typo. Just fixed it.
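For context on where both numbers show up in a standard GPT-style block (a minimal sketch with illustrative names, not the post's code): the fused query/key/value projection maps dimension d to 3 * d, while the feed-forward hidden layer is conventionally expanded to 4 * d, so the two are easy to mix up.

    import torch
    import torch.nn as nn

    d = 768  # model (embedding) dimension, just an example value

    # Fused self-attention projection: input dimension d, output dimension 3 * d,
    # which is later split into queries, keys, and values of dimension d each.
    qkv_proj = nn.Linear(d, 3 * d)
    x = torch.randn(1, 10, d)                  # (batch, sequence length, d)
    q, k, v = qkv_proj(x).split(d, dim=-1)     # each of shape (1, 10, d)

    # Feed-forward (MLP) hidden layer in the same block: conventionally 4 * d.
    ffnn_hidden = nn.Linear(d, 4 * d)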
Hi! Loved the post. I was just wondering what software you use to build such beautiful figures?
Thanks and keep up the great work!
Google Slides! :)
Thanks Cameron! Small typo:
this operation is someone inefficient
-->
this operation is SOMEWHAT inefficient
Thank you! Just fixed it
Cameron, the lessons you present month after month are so consistently excellent that I really wish you would bind them up and sell them as a book!
Definitely considering it. I don't think I've reached a point where I have enough content to make a comprehensive/cohesive book, but I might start trying to move in this direction soon (e.g., publishing a foundations article once every month or two)! It would be very cool to put a book together on some of these lessons/ideas that goes from foundations to modern research.
In the "causal_self_attention.py", shouldn't it be "self.mask" instead of "self.bias"?
Hello! Your post has been incredibly insightful. However, I still find myself a bit uncertain about how the Decoder-only model functions during inference after reading your explanation. I'd greatly appreciate it if you could review my understanding and kindly correct any inaccuracies (which I suspect there may be).
From your description, it's evident that during training, each element of the output sequence predicts the subsequent word corresponding to its input sequence counterpart. But what happens during inference? When presented with a context, let's say of 120 words, how does the model handle it? Here's my interpretation:
Given that predicting the next word for, say, the 10th word in the context seems redundant, I assume the model only uses the output for the final input token in the sequence. Then, once the first output token is generated, it is appended to the input sequence to predict the subsequent word, and so forth. Is that accurate?
Once again, thank you for your insightful post!
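The decoding loop described in the question above is short enough to sketch directly. This assumes a PyTorch-style decoder-only model that maps a (batch, seq_len) tensor of token ids to logits of shape (batch, seq_len, vocab_size); the model and argument names are placeholders, and greedy decoding stands in for whatever sampling strategy is actually used:

    import torch

    @torch.no_grad()
    def generate(model, token_ids, num_new_tokens):
        # token_ids: (1, seq_len) tensor holding the tokenized context (e.g., 120 words)
        for _ in range(num_new_tokens):
            logits = model(token_ids)                                    # (1, seq_len, vocab_size)
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # only the last position is used
            token_ids = torch.cat([token_ids, next_token], dim=1)        # feed the new token back in
        return token_ids

In practice, implementations also cache the keys and values of earlier tokens (a "KV cache"), so each step only computes attention for the newly appended token instead of re-processing the entire sequence.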