Evaluation of LLMs
LLMs are becoming increasingly capable across a wide range of tasks (e.g., math problems, content generation). However, evaluating LLMs remains a major challenge.
A statistical approach to model evaluations: a view from Anthropic
Suppose one AI model outperforms another on a benchmark of interest, such as a test of general knowledge or of solving coding problems. Is this difference in capability real, or could one model simply have gotten lucky in the choice of questions on the benchmark?
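To make that question concrete, here is a minimal sketch (with hypothetical per-question scores, not data from the Anthropic post) of one standard way to quantify it: treat each benchmark question as a sample, take the paired per-question difference between the two models, and compute a confidence interval for the true accuracy gap.

```python
import numpy as np

# Hypothetical per-question scores (1 = correct, 0 = incorrect) for two models
# answering the same 500 benchmark questions. In practice these would come
# from an evaluation harness, not from random simulation.
rng = np.random.default_rng(0)
scores_a = rng.binomial(1, 0.72, size=500)  # model A, ~72% accuracy
scores_b = rng.binomial(1, 0.68, size=500)  # model B, ~68% accuracy

# Because both models saw the same questions, analyze the paired
# per-question differences rather than comparing the two means independently.
diff = scores_a - scores_b
mean_diff = diff.mean()

# Standard error of the mean difference (Central Limit Theorem approximation).
se_diff = diff.std(ddof=1) / np.sqrt(len(diff))

# Approximate 95% confidence interval for the true difference in accuracy.
ci_low, ci_high = mean_diff - 1.96 * se_diff, mean_diff + 1.96 * se_diff
print(f"observed difference: {mean_diff:.3f}")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
# If the interval excludes zero, the observed gap is unlikely to be
# explained by sampling luck in the choice of questions alone.
```

If the confidence interval comfortably excludes zero, the capability difference is probably real; if it straddles zero, the benchmark simply may not contain enough questions to distinguish the two models.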