14 Comments
Logan Thorneloe

Excellent article, Cameron! I’m curious if anyone has found smaller models to be good judges, especially since they’ve taken off recently. I know achieving GPT-4 performance was vital for LLM-as-a-judge to work, but can we make it less expensive by using smaller models? Or are they not quite up to par yet?

Cameron R. Wolfe, Ph.D.

People are exploring finetuning custom judges a decent amount (a few papers on this are mentioned in the last section of the writeup). I think GPT-4o-mini will also probably cause this to take off. But overall, I haven't seen a ton of work specifically looking at small judges.

Adi Pradhan

Really great summary! I've been reading LLM-as-a-judge papers this past week and this feels like a very comprehensive overview. Thanks for posting!

Cameron R. Wolfe, Ph.D.

Of course! Thanks for reading :)

Vaibhav

Awesome as always. 🔥

Cameron R. Wolfe, Ph.D.

Thank you so much! Glad you liked it :)

Catalina Villouta

Amazing post! Thank you for all the info!! 🙌

Cameron R. Wolfe, Ph.D.

Thanks for reading!

Devansh

I'm always shocked at how well you put these together. Another banger broski

Cameron R. Wolfe, Ph.D.

Thank you! Means a lot coming from you :)

Dhruvajyoti Sarma

Excellent post, and a great read covering LLM as a judge. Thanks a lot for writing this!

Cameron R. Wolfe, Ph.D.

Thanks for reading and for the kind words! I’m glad you liked it :)

Kevin Borg

Outstanding article, CRW Ph.D., the most in-depth I have read. And to think how much the general public gives you guys a hard time when one thing goes wrong.

If they only knew the sheer depth and complexity of the challenge in front, I don't think we would hear too much from them. Of course, that's the core of the problem: they don't understand.

Thanks, Cameron. Excellent, my friend.

Kev Borg

Kiwi <3

Meng Li

AI evaluation can be summarized in 10 key points:

1. Conversational Ability

2. Reasoning Ability

3. Programming Ability

4. Context Window

5. Cost per Input/Output

6. Output Speed

7. Output Speed Over Time

8. Latency

9. Latency Over Time

10. Total Response Time
