Assaf Yablon’s Post


AI | Amazon | Ex-Microsoft | Harvard | MIT | Penn Engineering

One of the biggest challenges in language models today is making them more interpretable. We often treat AI models as black boxes: data goes in, a response comes out, and the reasoning behind that response remains unclear.

I remember an interview with Google's CEO in which he was asked to explain how Gemini works. He said he didn't know. That answer resonated with the scientific community, since deep learning systems, much like the human brain, are hard to explain mechanistically, but the interviewer was shocked: how can a model released to millions be so poorly understood?

Two weeks ago, Anthropic released an important paper on model interpretability. They used a technique called "dictionary learning," borrowed from classical ML, which isolates patterns of neuron activations that recur across many different contexts. The paper sheds light on this important challenge, which, if solved, will create more trust in these models and ease the integration of AI into our everyday lives.

Highly recommend reading: https://lnkd.in/gPzEePx8
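For anyone curious what dictionary learning looks like in practice, here is a minimal illustrative sketch in Python using scikit-learn's DictionaryLearning on synthetic "activation" data. This is not Anthropic's actual pipeline (they train sparse autoencoders on Claude 3 Sonnet's activations at a far larger scale); the data shapes and parameters below are arbitrary assumptions, chosen only to show the core idea: decompose dense activation vectors into sparse combinations of recurring directions ("features").

# Toy illustration of dictionary learning on synthetic "neuron activations".
# Assumption: random data stands in for model activations; sizes are arbitrary.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)

# Pretend these are activation vectors collected from a hidden layer:
# 1,000 samples (contexts), each with 64 "neurons".
activations = rng.normal(size=(1000, 64))

# Learn an overcomplete dictionary of 128 directions, with a sparsity penalty
# so each activation is explained by only a few dictionary elements ("features").
dl = DictionaryLearning(
    n_components=128,
    alpha=1.0,                        # sparsity strength
    transform_algorithm="lasso_lars",
    max_iter=50,
    random_state=0,
)
codes = dl.fit_transform(activations)   # (1000, 128) sparse codes per sample
features = dl.components_               # (128, 64) learned activation patterns

# Each row of `features` is a candidate interpretable direction; each column of
# `codes` shows how strongly that direction fires across contexts.
print("average nonzero features per sample:",
      np.count_nonzero(codes, axis=1).mean())

The appeal of this setup is that sparsity forces the model's dense, polysemantic activations to be re-expressed in terms of a small number of recurring patterns, which is what makes the resulting features easier to inspect and label.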

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

transformer-circuits.pub
