Excellent article Cameron! I’m curious if anyone has found smaller models to be good judges especially since they’ve taken off recently. I know achieving GPT-4 performance was vital for LLM-as-a-judge to work, but can we make it less expensive by using smaller models? Or are they not quite up to par yet?
People are exploring finetuning custom judges a decent amount (a few papers on this are mentioned in the last section of the writeup). I think GPT-4o-mini will also probably cause this to take off. But overall, I haven't seen a ton of work specifically looking at small judges.
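For anyone curious what using a smaller model as a judge actually looks like in practice, here is a minimal sketch assuming the OpenAI Python client and GPT-4o-mini; the rubric, 1-5 scale, and prompt wording are illustrative choices, not something prescribed in the article:

```python
# Minimal sketch: using a smaller model (e.g., GPT-4o-mini) as an LLM judge.
# Assumes the official OpenAI Python client; the rubric and 1-5 scale
# are illustrative, not taken from the article.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Rate the assistant's answer to the
user's question on a 1-5 scale for helpfulness and correctness.
Reply with only the integer score.

Question: {question}
Answer: {answer}"""


def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Score a single (question, answer) pair with a small judge model."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,  # keep scoring as deterministic as possible
    )
    return int(response.choices[0].message.content.strip())


# Example usage:
# score = judge("What is 2 + 2?", "4")
```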
Outstanding article, CRW Ph.D., the most in-depth I have read. And to think how much the general public gives you guys a hard time when one thing goes wrong.
If they only knew the sheer depth and complexity of the challenge in front of you, I don't think we would hear too much from them. Of course, that's the core of the problem: they don't understand.
Really great summary! I've been reading LLM as a judge papers this past week and this feels like a very comprehensive overview. Thanks for posting
Of course! Thanks for reading :)
Awesome as always. 🔥
Thank you so much! Glad you liked it :)
Amazing post! Thank you for all the info!! 🙌
Thanks for reading!
I'm always shocked at how well you put these together. Another banger broski
Thank you! Means a lot coming from you :)
Excellent post, and a great read covering LLM-as-a-judge. Thanks a lot for writing this!
Thanks for reading and for the kind words! I’m glad you liked it :)
Thanks, Cameron. Excellent work, my friend.
Kev Borg
Kiwi <3
AI evaluation can be summarized into 10 key points (see the sketch after this list):
1. Conversational Ability
2. Reasoning Ability
3. Programming Ability
4. Context Window
5. Cost per Input/Output
6. Output Speed
7. Output Speed Over Time
8. Latency
9. Latency Over Time
10. Total Response Time
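One way to make these ten dimensions concrete is to record them per model in a simple structure. The following is a minimal sketch assuming Python; the field names, units, and scoring scales are illustrative assumptions, not part of the list above:

```python
from dataclasses import dataclass


@dataclass
class ModelEvaluation:
    """Illustrative record of the ten evaluation dimensions listed above."""
    conversational_ability: float   # rubric score, e.g. 1-10
    reasoning_ability: float        # rubric score, e.g. 1-10
    programming_ability: float      # rubric score, e.g. 1-10
    context_window: int             # maximum context length, in tokens
    cost_per_input_output: float    # e.g. USD per million input + output tokens
    output_speed: float             # tokens per second at a point in time
    output_speed_over_time: float   # average tokens per second across a period
    latency: float                  # seconds to first token for one request
    latency_over_time: float        # average latency across many requests, seconds
    total_response_time: float      # seconds for the complete response
```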