Patterns are slowly emerging in generative AI evaluation, but we are still **very** far from having things figured out. And the risks around safety and frontier harms are tricky to understand, given these areas require highly specialized knowledge and a deep understanding of how to elicit the kinds of inputs that could trigger such harms. This is the first of many blogs we are writing as the newly formed AI Alliance. I hope you enjoy it, and please comment if you are working in this space. The tent for the alliance is HUGE and there is room for everyone!
> “In many ways, we are really talking about the evaluation of a model or agent as the new PRD (product requirements document). This flips product development on its head: defining an eval up front and working backwards requires those developing foundation models to define everything from safety mitigations to data mixtures for both pretraining and post-training.” Great to see evals appropriately represented as primitives in this framework. Excellent write-up!
The links to the various leaderboards are very noteworthy.
Nice to see the collaboration, Joe!
Thanks for sharing. The part about gaps in public benchmarks such as MMLU/GSM8K resonates a lot: what is required is both a widely accepted taxonomy of user needs and SME benchmarks to evaluate those needs. This is how product managers can influence model capabilities to align with use cases and applications. "The line of sight from something like the Massive Multitask Language Understanding (MMLU) or HellaSwag datasets, to what the downstream consumer (i.e., the developer) wants in terms of application performance is unclear and certainly non-linear"