Thank you for writing this. Much needed. I think we'll soon be in an era where some Substack posts get more citations than papers. This might be one of those.
Appreciate the kind words!
For agent evals, should we consider prompts that share the same environment (a code repo, a synthetic database, etc.) as dependent and in the same cluster? And what would that mean for a set of agent evals that all share the same environment, like TheAgentCompany, for example?
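To make the question concrete, here's a minimal sketch of what treating shared-environment prompts as one cluster would do to the standard error. The function and example data are illustrative, not from the post: it sums residuals within each cluster before squaring, so correlated scores inside an environment inflate the variance estimate instead of canceling out.

```python
import numpy as np

def clustered_se(scores, clusters):
    """Cluster-robust standard error of the mean score.

    scores: per-question scores; clusters: a cluster id per question
    (e.g., the shared environment/repo each agent task runs in).
    """
    scores = np.asarray(scores, dtype=float)
    clusters = np.asarray(clusters)
    n = len(scores)
    resid = scores - scores.mean()
    # Sum residuals within each cluster, then square the cluster sums:
    # within-cluster correlation no longer averages away.
    var = sum(resid[clusters == c].sum() ** 2
              for c in np.unique(clusters)) / n**2
    return np.sqrt(var)

# Two 4-question "environments" with correlated scores inside each.
scores = [1, 1, 1, 0, 0, 0, 1, 0]
clusters = ["env_a"] * 4 + ["env_b"] * 4
print(clustered_se(scores, clusters))
```

If the whole suite is one environment, this collapses to a single cluster, and the usual n-questions-worth of precision mostly disappears, which is exactly why the question matters.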
Thanks, Cameron! Very informative post.
There are a few places where I suspect there may be typos (all in figures):
1. Standard error and confidence interval for LLM evaluations (from [1]): In the standard error formula, a square appears to be missing, i.e., it should read (s_i - \bar{s})^2.
2. Standard error of the estimated difference in mean scores: In the last line, I think the SE terms are missing the bar notation above them.
3. Standard error of the mean score difference (from [1]): The differences also appear to be missing the squaring.
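To spell out the first point numerically, here is a quick sketch (mine, not from the post) of the standard error with the squaring included, SE = sqrt( sum((s_i - \bar{s})^2) / (n-1) / n ):

```python
import numpy as np

def mean_and_se(scores):
    """Mean score and its standard error of the mean."""
    s = np.asarray(scores, dtype=float)
    n = len(s)
    s_bar = s.mean()
    # The squared term the figure appears to be missing:
    se = np.sqrt(np.sum((s - s_bar) ** 2) / (n - 1) / n)
    return s_bar, se

mean, se = mean_and_se([1, 0, 1, 1, 0])
print(f"{mean:.2f} ± {1.96 * se:.2f}")  # approximate 95% CI
```

Without the square, the residuals would sum to roughly zero by construction and the SE would be meaningless, so I'm fairly confident it's just a rendering typo.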
Very dense and intense read, but really enjoyable—thanks again! It took me a few hours.
These days, most—if not all—evaluation sets aren’t independent, right?
HumanEval, MMLU, GSM8K, and SWE-bench each focus on a single domain, don’t they?