Joseph Spisak’s Post

Product Director & Head of Generative AI Open Source @Meta | Ex: Google, Amazon

Slowly, patterns are emerging in generative AI evaluations, but we are still **very** far from having things figured out. The risks around safety and frontier harms are also tricky to understand, given that these areas require highly specialized knowledge and a deep understanding of how to elicit the kinds of inputs that could trigger such harms. This is the first of many blog posts we are writing as the newly formed AI Alliance. I hope you enjoy it, and please comment if you are working in this space. The tent for the alliance is HUGE and there is room for everyone!!

Evaluation of Generative AI - what’s ultimately our goal? | AI Alliance

thealliance.ai

Pulkit Kapur

Product Leader @ Amazon | Generative AI | Autonomous Systems

1mo

Thanks for sharing. The part about gaps in public benchmarks such as MMLU/GSM8K resonates a lot. What is required is both a widely accepted taxonomy of user needs and SME benchmarks to evaluate those needs. This is how product managers can influence model capabilities to align with use cases and applications. "The line of sight from something like the Massive Multitask Language Understanding (MMLU) or HellaSwag datasets, to what the downstream consumer (i.e., the developer) wants in terms of application performance is unclear and certainly non-linear"

Mark Simithraaratchy

ML Eng Management | Meta Alum

1mo

> “In many ways, we are really talking about the evaluation of a model or agent as the new PRD (product requirements document). This flips product development on its head given defining an eval up front and working backwards requires those developing foundation models to work backwards defining everything from safety mitigations to data mixtures for both pretraining and post training.” Great to see evals appropriately represented as primitives in this framework. Excellent write-up!

Kelvin Meeks

Consulting Architect/CTO - Leadership in Enterprise Architecture and Software Engineering Innovation (US Army Veteran)

1mo

The links to the various leaderboards are very noteworthy.

Nice to see the collaboration, Joe!
