We’ve been testing LLMs to solve a data enrichment and classification problem, on the assumption that more verbose text data will make classifications more accurate.

TL;DR: there’s still no clear model/framework winner across LLMs when evaluated on accuracy, latency, and cost for this limited classification/prompting use case.

Some summary results on a small dataset:
- claude-3.5-sonnet: cost 65c, with good results
- gpt-4o: batched cost 10c, practically the same results, but very, very slow
- gpt-4-turbo: batched cost 22c, practically the same results
- gpt-3.5-turbo: batched cost 2c, faster and just as accurate as GPT-4
- amazon-titan-text-premier: cost 1c, results also just as acceptable as the above
- meta-llama2-70b-chat: cost 5c, and it made real mistakes compared to the others — llama2-70b is not up to the task

Takeaway so far: seeding prompts with semantic search results seems to level the playing field, so that less sophisticated models can make more informed classifications. Some models, like Titan and Llama, needed more tuning than the other out-of-the-box models for our purposes.

++ John Butler for the great eval work here

*Some model leaderboard info here from huggingface; the rankings seem to change weekly:
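For anyone curious what “seeding prompts with semantic search results” can look like in practice, here’s a minimal sketch. Everything in it is hypothetical — the toy bag-of-words similarity stands in for a real embedding model, and the labeled examples are made up — but the shape is the same: retrieve the most similar labeled records, then prepend them to the classification prompt as few-shot context.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words "embedding"; a real pipeline would use an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical labeled examples forming the semantic-search index.
labeled = [
    ("invoice overdue payment reminder", "billing"),
    ("reset my password login issue", "account"),
    ("shipment delayed tracking number", "logistics"),
]

def build_prompt(record, k=2):
    # Retrieve the k most similar labeled examples for the incoming record...
    q = embed(record)
    ranked = sorted(labeled, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    context = "\n".join(f"Text: {t}\nLabel: {l}" for t, l in ranked[:k])
    # ...and seed the classification prompt with them as few-shot context.
    return (f"Classify the text using these similar labeled examples:\n"
            f"{context}\n\nText: {record}\nLabel:")

prompt = build_prompt("where is my delayed package")
```

The resulting prompt is what gets sent to whichever model you're evaluating; because the retrieved examples carry most of the task-specific signal, cheaper models have less to figure out on their own.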
I am curious about real-world use cases (i.e. not the classification/clustering itself, but solving a real-world problem like world hunger, energy demands, climate issues, and so on). What would the cost be to run those vs. customer segmentation and the like? Or what exactly are you segmenting for, and why?
This is fascinating Andy Owens ... I don't know anyone else who has shared even a high-level summary of this. This may be a dumb question, but what outcome are you hoping for in future state?