Using LLMs for Evaluation

Cameron R. Wolfe, Ph.D.

Jul 22, 2024

LLM-as-a-Judge and other scalable additions to human quality ratings...

14 Comments

Logan Thorneloe

Excellent article, Cameron! I’m curious whether anyone has found smaller models to be good judges, especially since they’ve taken off recently. I know achieving GPT-4-level performance was vital for LLM-as-a-judge to work, but can we make it less expensive by using smaller models? Or are they not quite up to par yet?

Cameron R. Wolfe, Ph.D.

People are exploring finetuning custom judges a decent amount (a few papers on this are mentioned in the last section of the writeup). I think GPT-4o-mini will also probably cause this to take off. But overall, I haven't seen a ton of work specifically looking at small judges.

Really great summary! I've been reading LLM-as-a-judge papers this past week, and this feels like a very comprehensive overview. Thanks for posting.

Cameron R. Wolfe, Ph.D.

Of course! Thanks for reading :)

Awesome as always. 🔥

Cameron R. Wolfe, Ph.D.

Thank you so much! Glad you liked it :)

Catalina Villouta

Amazing post! Thank you for all the info!! 🙌

Cameron R. Wolfe, Ph.D.

Thanks for reading!

I'm always shocked at how well you put these together. Another banger broski

Cameron R. Wolfe, Ph.D.

Thank you! Means a lot coming from you :)

Dhruvajyoti Sarma

Excellent post, and a great read covering LLM-as-a-judge. Thanks a lot for writing this!

Cameron R. Wolfe, Ph.D.

Thanks for reading and for the kind words! I’m glad you liked it :)

Outstanding article, CRW Ph.D., the most in-depth I have read. And to think how much the general public gives you guys a hard time when one thing goes wrong.

If they only knew the sheer depth and complexity of the challenge in front of you, I don't think we would hear too much from them. Of course, that's the core of the problem: they don't understand.

Thanks, Cameron. Excellent, my friend.

Kev Borg

Kiwi <3

AI evaluation can be summarized in 10 key points:

1. Conversational Ability

2. Reasoning Ability

3. Programming Ability

4. Context Window

5. Cost per Input/Output

6. Output Speed

7. Output Speed Over Time

8. Latency

9. Latency Over Time

10. Total Response Time
