Daniel Kornev’s Post

Daniel Kornev

CEO at Sentius | Techstars'24 | Microsoft Alumni'10 | Xoogler'11 | 2nd-time Founder

Awesome to see BABILong being helpful in deciding which model to use.

Kirill Shcherbakov

ML Engineer @ JetBrains AI

💡 I remember how helpful the BABILong evals of GPT-4-Turbo and GPT-4o were when deciding whether to replace GPT-4-Turbo with the new flagship GPT-4o after its release, which offered SOTA on almost all eval benchmarks at half the cost - https://lnkd.in/e4JTxrzs The answer wasn't straightforward, since the evals showed that performance varies with your use case - simple Q&A over short, mid-size, or long contexts - and the brand-new GPT-4o doesn't always win.

🔥 Today, they released LLM evals on BABILong, a long-context (up to 10M+ tokens) reasoning dataset - https://lnkd.in/eqBQcqxg

🗝 The findings are quite interesting:
- Popular open-source LLMs, as well as GPT-4 and RAG, rely heavily on the first 5-25% of the input => highlights the need for improved context-processing mechanisms
- Fine-tuning boosts performance for GPT-3.5-Turbo and Mistral-7B, but context lengths remain limited (16K and 32K, respectively)
- Mamba (130M) and RMT (137M) achieved the strongest results: RMT can process lengths up to 11 million tokens, while Mamba struggles beyond 128K tokens => shows that these challenges are indeed solvable

Credits for pics to Mikhail Burtsev
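If you want to run this kind of comparison yourself, here is a minimal sketch of a BABILong-style needle-in-a-haystack probe using the OpenAI Python SDK. It is a toy single-example check under stated assumptions, not the actual benchmark: the model names, the filler text, and the make_haystack helper are illustrative, and the real BABILong tasks and prompts live in the repo linked above.

# Toy BABILong-style probe: bury one fact in a long distractor context and
# ask the model to retrieve it. Model names and filler text are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def make_haystack(fact: str, filler_sentences: int) -> str:
    filler = "The sky was grey and the train was late again. "
    half = filler_sentences // 2
    # Place the relevant fact roughly in the middle of the distractor text.
    return filler * half + fact + " " + filler * (filler_sentences - half)

fact = "Mary travelled to the kitchen."
question = "Where is Mary? Answer with one word."
context = make_haystack(fact, filler_sentences=4000)  # roughly 40K tokens of filler

for model in ("gpt-4-turbo", "gpt-4o"):  # adjust to the models you have access to
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{context}\n\nQuestion: {question}"}],
    )
    print(model, "->", resp.choices[0].message.content)

The interesting part, as the post describes, comes from sweeping the fact position and the context length rather than a single example - that is where GPT-4-Turbo and GPT-4o start to diverge.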

Tanya Sushchenko

Senior Product Manager

1mo

I don't know the official score, but Claude 3.5 has won me over :)

