14 Comments
Logan Thorneloe's avatar

Excellent article Cameron! I’m curious if anyone has found smaller models to be good judges especially since they’ve taken off recently. I know achieving GPT-4 performance was vital for LLM-as-a-judge to work, but can we make it less expensive by using smaller models? Or are they not quite up to par yet?

Cameron R. Wolfe, Ph.D.'s avatar

People are exploring finetuning custom judges a decent amount (a few papers on this are mentioned in the last section of the writeup). I think GPT-4o-mini will also probably cause this to take off. But overall, I haven't seen a ton of work specifically looking at small judges.

Adi Pradhan's avatar

Really great summary! I've been reading LLM-as-a-judge papers this past week and this feels like a very comprehensive overview. Thanks for posting.

Cameron R. Wolfe, Ph.D.'s avatar

Of course! Thanks for reading :)

Vaibhav's avatar

Awesome as always. 🔥

Cameron R. Wolfe, Ph.D.'s avatar

Thank you so much! Glad you liked it :)

Catalina Villouta's avatar

Amazing post! Thank you for all the info!! 🙌

Devansh's avatar

I'm always shocked at how well you put these together. Another banger broski

Cameron R. Wolfe, Ph.D.'s avatar

Thank you! Means a lot coming from you :)

Dhruvajyoti Sarma's avatar

Excellent post, and a great read covering LLM-as-a-judge. Thanks a lot for writing this!

Cameron R. Wolfe, Ph.D.'s avatar

Thanks for reading and for the kind words! I’m glad you liked it :)

Kevin Borg's avatar

Outstanding article, CRW Ph.D., the most in-depth I have read. And to think how much the general public gives you guys a hard time when one thing goes wrong.

If they only knew the sheer depth and complexity of the challenge in front of you, I don't think we would hear too much from them. Of course, that's the core of the problem: they don't understand.

Thanks, Cameron. Excellent, my friend.

Kev Borg

Kiwi <3

Meng Li's avatar

AI evaluation can be summarized in 10 key points:

1. Conversational Ability

2. Reasoning Ability

3. Programming Ability

4. Context Window

5. Cost per Input/Output

6. Output Speed

7. Output Speed Over Time

8. Latency

9. Latency Over Time

10. Total Response Time