Waseem Alshikh’s Post

Waseem Alshikh

Co-founder and CTO of Writer

#LLMs are becoming increasingly powerful, and it is important to have a way to evaluate their performance. An #LLM evaluation benchmark is a set of tasks used to measure how well an LLM performs. The results can be used to compare different LLMs and to track their progress over time. There are many different LLM evaluation benchmarks available, each with its own strengths and weaknesses. Below, I'll discuss some of the most popular ones and explain them in simple terms.

#MMLU
MMLU (Massive Multitask Language Understanding) measures an LLM's knowledge and reasoning across 57 subjects, from elementary math and US history to law and medicine, using multiple-choice questions. MMLU is a good benchmark for evaluating the general knowledge and language understanding of an LLM.

#NarrativeQA
NarrativeQA measures an LLM's ability to answer questions about a story. The stories, drawn from books and movie scripts, are long and complex, and the questions require a genuine understanding of the narrative. NarrativeQA is a good benchmark for evaluating how well an LLM understands long, complex text.

#GSM8k
GSM8k (Grade School Math 8K) is a collection of roughly 8,500 grade-school math word problems. Each problem is short and simply worded, but solving it requires multi-step arithmetic reasoning. GSM8k is a good benchmark for evaluating an LLM's step-by-step reasoning ability.

#BLEU-4
WMT 2014 - BLEU-4 (Workshop on Machine Translation 2014, scored with the Bilingual Evaluation Understudy metric over n-grams up to length 4) measures machine translation quality. The BLEU-4 score reflects how closely the LLM's translation overlaps with human reference translations. It is a good benchmark for evaluating an LLM's ability to translate text from one language to another.

#TruthfulQA
TruthfulQA measures whether an LLM answers questions truthfully. The questions are deliberately built around common misconceptions that people often get wrong, so the benchmark tests whether the model avoids repeating popular falsehoods. TruthfulQA is a good benchmark for evaluating how factually reliable an LLM is.

#HellaSwag
HellaSwag measures commonsense reasoning. The LLM is shown the start of an everyday scenario and must pick the most plausible continuation from several options, and it is expected to choose correctly.

So, while benchmarks might sound technical, they're essentially thorough and methodical report cards for our AI. These guideposts help us nurture AI that's not just intelligent but also trustworthy and relatable!
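To make the "report card" idea concrete, here is a minimal Python sketch of how a multiple-choice benchmark in the style of MMLU or HellaSwag reduces to a single accuracy number. The query_model helper is a hypothetical stand-in for whatever API call returns the model's answer, and the two sample questions are illustrative, not taken from the real datasets.

```python
# Minimal sketch: scoring a multiple-choice benchmark (MMLU- or HellaSwag-style).
# `query_model` is a hypothetical stand-in for whatever call returns the model's
# chosen answer; the questions below are illustrative, not from the real datasets.

from typing import Callable

# Tiny illustrative eval set: (question, options, correct letter).
EVAL_SET = [
    ("What is 7 x 8?",
     {"A": "54", "B": "56", "C": "64", "D": "48"}, "B"),
    ("Which planet is closest to the Sun?",
     {"A": "Venus", "B": "Earth", "C": "Mercury", "D": "Mars"}, "C"),
]


def score_multiple_choice(query_model: Callable[[str], str]) -> float:
    """Return the model's accuracy over the eval set."""
    correct = 0
    for question, options, answer in EVAL_SET:
        # Most multiple-choice benchmarks format the prompt roughly like this.
        prompt = (
            question + "\n"
            + "\n".join(f"{letter}. {text}" for letter, text in options.items())
            + "\nAnswer:"
        )
        prediction = query_model(prompt).strip().upper()[:1]  # keep only the letter
        correct += int(prediction == answer)
    return correct / len(EVAL_SET)


if __name__ == "__main__":
    # A toy "model" that always answers "B", just to show the scoring loop runs.
    print(f"Accuracy: {score_multiple_choice(lambda prompt: 'B'):.2f}")
```

Generative benchmarks like GSM8k or WMT 2014 work the same way in spirit, but swap the exact-match check for answer extraction or an overlap metric such as BLEU-4.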

Simeon Simeonov

Health and generative AI CTO. Investor. Director.

5mo

Waseem, do you know of anyone working on a benchmark related to understanding and working with data in-prompt, as opposed to text-to-SQL/code or the like?

Nick Lashinsky

Unlocking data for Enterprise AI & ML projects

5mo

Great write-up, Waseem! I'd be curious to hear what you think of the products entering the market to evaluate LLM outputs and hallucinations.

Andrés Corrada-Emmanuel

Industrial scientist and developer focusing on robust AI systems and evaluation frameworks.

5mo

There is also a new benchmark for detecting logical reasoning errors in chain-of-thought answers. TL;DR: LLMs are not good at finding these errors, but they do much better at fixing them once prompted to correct a specific one. https://github.com/WHGTyen/BIG-Bench-Mistake

Marc Appel

Content marketing exec with digital DNA

5mo

Super helpful explainer. Thank you!

Angela Polania, CPA, CISM, CISA, CRISC, CAISS, CMMC RP

Cyber and AI Risk Mgmt. Advisor- Elevating internal controls, cyber security, AI governance and AI Risk Management.

5mo

Excellent summary and information, Waseem! As a risk management and audit professional, I am working on auditing the entire life cycle. This is very helpful, and I will research further to understand how these benchmarks work. Thank you!
