12 Comments
Jul 23 · Liked by Cameron R. Wolfe, Ph.D.

Excellent article, Cameron! I’m curious whether anyone has found smaller models to be good judges, especially since they’ve taken off recently. I know achieving GPT-4 performance was vital for LLM-as-a-judge to work, but can we make it less expensive by using smaller models? Or are they not quite up to par yet?

author

People are exploring finetuning custom judges a decent amount (a few papers on this are mentioned in the last section of the writeup). I think GPT-4o-mini will also probably cause this to take off. But overall, I haven't seen a ton of work specifically looking at small judges.

Jul 22 · Liked by Cameron R. Wolfe, Ph.D.

Really great summary! I've been reading LLM-as-a-judge papers this past week and this feels like a very comprehensive overview. Thanks for posting!

author

Of course! Thanks for reading :)

Aug 13 · Liked by Cameron R. Wolfe, Ph.D.

Amazing post! Thank you for all the info!! 🙌

author

Thanks for reading!

Jul 24 · Liked by Cameron R. Wolfe, Ph.D.

I'm always shocked at how well you put these together. Another banger broski

author

Thank you! Means a lot coming from you :)

Jul 22 · Liked by Cameron R. Wolfe, Ph.D.

Excellent post, and a great read covering LLM-as-a-judge. Thanks a lot for writing this!

author

Thanks for reading and for the kind words! I’m glad you liked it :)


Outstanding article, CRW Ph.D., the most in-depth I have read. And to think how much the general public gives you guys a hard time when one thing goes wrong.

If they only knew the sheer depth and complexity of the challenge in front of you, I don't think we would hear too much from them. Of course, that's the core of the problem: they don't understand.

Thanks Cameron, excellent work my friend.

Kev Borg

Kiwi <3


AI evaluation can be summarized in 10 key points:

1. Conversational Ability

2. Reasoning Ability

3. Programming Ability

4. Context Window

5. Cost per Input/Output

6. Output Speed

7. Output Speed Over Time

8. Latency

9. Latency Over Time

10. Total Response Time
