5 Comments
Aditya Sharan

Thank you for writing this. Much needed. I think we'll soon be in the era of some Substack posts getting more citations than papers. This might be one of those.

Cameron R. Wolfe, Ph.D.

Appreciate the kind words!

Brad K

For agent evals, should we treat prompts that share the same environment (a code repo, a synthetic database, etc.) as dependent and place them in the same cluster? And what would that mean for a set of agent evals that all share a single environment, like the agent company, for example?
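If prompts sharing an environment are treated as one cluster, the usual next step is a cluster-robust standard error, which sums residuals within each cluster before squaring. The sketch below is a minimal illustration of that idea (the function name and inputs are hypothetical, not from the post); note that when every prompt belongs to the same cluster, the estimate degenerates to zero, which hints at why single-environment suites are statistically awkward.

```python
import numpy as np

def clustered_se(scores, clusters):
    """Cluster-robust standard error of the mean score.

    scores   : per-prompt scores (e.g. 0/1 pass/fail)
    clusters : cluster label per prompt (e.g. the shared repo/environment)

    Residuals are summed within each cluster before squaring, so positive
    correlation inside a cluster widens the error bar relative to the
    naive i.i.d. formula.
    """
    scores = np.asarray(scores, dtype=float)
    clusters = np.asarray(clusters)
    n = len(scores)
    resid = scores - scores.mean()
    # Var(mean) = (1/n^2) * sum over clusters of (within-cluster residual sum)^2
    var = sum(resid[clusters == c].sum() ** 2 for c in np.unique(clusters)) / n**2
    return float(np.sqrt(var))
```

With singleton clusters this reduces to (roughly) the usual standard error of the mean; with one cluster per environment, correlated failures within an environment inflate the estimate.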

Eli

Thanks, Cameron! Very informative post.

There are a few places where I suspect there may be typos (all in figures):

1. Standard error and confidence interval for LLM evaluations (from [1]): In the standard error formula, a square appears to be missing, i.e., it should be (s_i - \bar{s})^2.

2. Standard error of the estimated difference in mean scores: In the last line, I think the SE terms are missing the bar notation above them.

3. Standard error of the mean score difference (from [1]): The differences also appear to be missing the squaring.
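For reference, the corrected formulas being pointed out above can be sketched directly: the standard error of the mean squares each deviation, and the paired-difference case applies the same formula to per-question score differences. A minimal sketch (function names are my own, not from the post or [1]):

```python
import math

def se_mean(scores):
    """Standard error of the mean: sqrt(sum((s_i - s_bar)^2) / (n * (n - 1))).

    The squared deviation is the term Eli notes is missing in the figure.
    """
    n = len(scores)
    mean = sum(scores) / n
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / (n * (n - 1)))

def se_paired_diff(scores_a, scores_b):
    """SE of the mean score difference between two models on the same questions:
    form per-question differences d_i = a_i - b_i, then take the SE of their mean."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return se_mean(diffs)
```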

Paul
Mar 14 (edited)

Very dense and intense read, but really enjoyable—thanks again! It took me a few hours.

These days, most—if not all—evaluation sets aren’t independent, right?

HumanEval, MMLU, GSM8K, and SWE-bench each focus on a single domain, don’t they?