Thank you for writing this. Much needed. I think we'll soon be in an era where some Substack posts get more citations than papers. This might be one of those.
Appreciate the kind words!
For agent evals, should we consider prompts that share the same environment (a code repo, a synthetic database, etc.) as dependent and in the same cluster? And what would that mean for a set of agent evals that all share the same environment, like TheAgentCompany, for example?
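To make the question concrete, here's a minimal sketch of what treating shared-environment prompts as one cluster would do to the standard error. The function and example data are illustrative, not from the post: it sums residuals within each cluster before squaring, so correlated scores inside an environment inflate the variance estimate instead of canceling out.

```python
import numpy as np

def clustered_se(scores, clusters):
    """Cluster-robust standard error of the mean score.

    scores: per-question scores; clusters: a cluster id per question
    (e.g., the shared environment/repo each agent task runs in).
    """
    scores = np.asarray(scores, dtype=float)
    clusters = np.asarray(clusters)
    n = len(scores)
    resid = scores - scores.mean()
    # Sum residuals within each cluster, then square the cluster sums:
    # within-cluster correlation no longer averages away.
    var = sum(resid[clusters == c].sum() ** 2
              for c in np.unique(clusters)) / n**2
    return np.sqrt(var)

# Two 4-question "environments" with correlated scores inside each.
scores = [1, 1, 1, 0, 0, 0, 1, 0]
clusters = ["env_a"] * 4 + ["env_b"] * 4
print(clustered_se(scores, clusters))
```

If the whole suite is one environment, this collapses to a single cluster, and the usual n-questions-worth of precision mostly disappears, which is exactly why the question matters.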
Thanks, Cameron! Very informative post.
There are a few places where I suspect there may be typos (all in figures):
1. Standard error and confidence interval for LLM evaluations (from [1]): In the standard error formula, a square appears to be missing, i.e., it should read (s_i - \bar{s})^2.
2. Standard error of the estimated difference in mean scores: In the last line, I think the SE terms are missing the bar notation above them.
3. Standard error of the mean score difference (from [1]): The differences also appear to be missing the squaring.
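To spell out the first point numerically, here is a quick sketch (mine, not from the post) of the standard error with the squaring included, SE = sqrt( sum((s_i - \bar{s})^2) / (n-1) / n ):

```python
import numpy as np

def mean_and_se(scores):
    """Mean score and its standard error of the mean."""
    s = np.asarray(scores, dtype=float)
    n = len(s)
    s_bar = s.mean()
    # The squared term the figure appears to be missing:
    se = np.sqrt(np.sum((s - s_bar) ** 2) / (n - 1) / n)
    return s_bar, se

mean, se = mean_and_se([1, 0, 1, 1, 0])
print(f"{mean:.2f} ± {1.96 * se:.2f}")  # approximate 95% CI
```

Without the square, the residuals would sum to roughly zero by construction and the SE would be meaningless, so I'm fairly confident it's just a rendering typo.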
Very dense and intense read, but really enjoyable—thanks again! It took me a few hours.
These days, most—if not all—evaluation sets aren’t independent, right?
HumanEval, MMLU, GSM8K, and SWE-bench each focus on a single domain, don’t they?