Manny Bernabe’s Post

AI Evangelist @ Ushur ◆ GenAI for Enterprise

1mo

“What We Learned from a Year of Building with LLMs”

A wonderful article on building with LLMs in production. Thank you Eugene Yan, Bryan Bischof, Charles Frye, Hamel H., Jason Liu and Shreya Shankar! My highlights below 👇

----
Part One focuses on tactical insights, including prompting tips, evaluation strategies, Retrieval-Augmented Generation (RAG), and fine-tuning considerations. Parts Two and Three cover operational aspects and strategy. [Links in comments]

----
Prompting:
1 - Start every LLM application with prompting.
2 - Build on various prompting techniques: in-context learning w/ n-shot prompts, chain of thought (CoT), relevant resources (RAG, etc.).
3 - For n, aim for 5 examples per prompt; don’t hesitate to go up to 12.
4 - Lean towards smaller, focused prompts over large, multi-purpose ones.

----
RAG (Retrieval-Augmented Generation):
1 - Don’t rely solely on vector embeddings for search.
2 - Combine keyword search with embeddings for a hybrid approach.
3 - Start with RAG before fine-tuning; it's cost-efficient and effective.
4 - RAG remains relevant even with longer context models.

----
Fine-Tuning:
1 - Consider fine-tuning only when prompting and RAG are insufficient.
2 - Fine-tuning adds complexity and cost: it requires annotated data, model evaluation, and hosting.

----
Evaluation:
1 - Use LLMs as judges; they correlate decently with human judges.
2 - Evaluation formats: Likert scales, binary classifications, pairwise comparisons. Binary puts the lightest cognitive load on the evaluator.
3 - Safety and PII defects are managed well, but hallucinations still hover around 5-10% and are tough to get under 2%.

----
Overall, a dense (but readable) article offering valuable tactical insights into LLM development. Highly recommended for a deep dive into current best practices and tips for effective LLM implementation.

#genai #enterprise #llms
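To make the hybrid-search highlight concrete, here is a minimal sketch of one common way to combine a keyword ranking with an embedding ranking: reciprocal rank fusion (RRF). The post does not say which fusion method the article's authors use, so RRF is an assumption here, and the document IDs and the two input rankings are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in;
    k=60 is a conventional default, not a value taken from the post.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword search and an embedding search.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # -> ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that both retrievers rank well (doc_a, doc_c) rise to the top, while a document found by only one retriever still makes the list.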

3 Comments

Manny Bernabe

AI Evangelist @ Ushur ◆ GenAI for Enterprise

1mo

Part One focuses on tactical insights, including prompting tips, evaluation strategies, Retrieval-Augmented Generation (RAG), and fine-tuning considerations. Parts two and three cover operational aspects and strategy. Part One: https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/

Christian Noel Jr A.

Mechanical Engineering Graduate at Pontificia Universidad Católica del Perú

Useful tips

1 Reaction


More Relevant Posts

Henry Peter

We are making groundbreaking changes to service creation and delivery using Large Language Models! Innovate in AI-driven eXperience Automation and be part of a pioneering team. #AI #XA #CXA #Innovation #JoinUs
1mo Edited
That's a good summary Manny Bernabe - Reflecting on an insightful article from O'Reilly about LLMs, I was struck by this point: "LLMs are also broadly accessible, allowing everyone, not just ML engineers and scientists, to build intelligence into their products. While the barrier to entry for building AI products has been lowered, creating those effective beyond a demo remains a deceptively difficult endeavor." This democratization reminds me of the printing press's impact on biblical texts, which expanded access but also introduced new interpretive challenges. As we navigate these advancements, it's crucial to balance accessibility with responsible use. At Ushur, we've been at the forefront of leveraging LLMs in our backend/knowledge work automation as well as carefully introducing them on the conversational side via our Chatbots and other modes. More will be coming out on that. Read more here: O'Reilly Article #AI #LLM #Technology #Innovation #History #Democratization #ResponsibleAI #Ushur


1 Comment
Mark P. A.

Enterprise Architecture | AI - Machine Learning | Software Development | IT Operations | Analytics
1mo
What We’ve Learned From A Year of Building with LLMs https://lnkd.in/gETctFFs #genai #LLMs

Applied LLMs - What We’ve Learned From A Year of Building with LLMs

applied-llms.org
Aishwarya Naresh Reganti

Gen AI Tech Lead @ AWS | Lecturer | ML Researcher | Speaker | CMU LTI Alumni
2mo
💡 Your RAG pipelines may NOT always require context retrieval; at times, they can simply answer from pre-trained knowledge.

An often overlooked aspect of RAG pipelines in the wild is that retrieved knowledge isn't always necessary to answer common questions, and that pre-trained knowledge sometimes outperforms what's retrieved.

🤔 How do you train the LLM in your RAG pipeline to decide?

⛳ A recent paper, "When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively", proposes a tailored training approach called ADAPT-LLM, in which LLMs are trained to generate a special token (⟨RET⟩) when they don't know the answer to a question.

📖 Some more insights
⛳ Two approaches are discussed: closed-book question answering, which relies solely on parametric memory, and open-book question answering, which couples LLMs with IR systems.
⛳ The authors then propose a hybrid approach, where LLMs use parametric memory for popular questions and employ IR systems for less popular ones.
⛳ An evaluation on an open-domain question answering dataset demonstrates that ADAPT-LLM effectively discerns when to retrieve information and when to rely on its parametric memory.
⛳ ADAPT-LLM consistently outperforms traditional LLMs under various configurations, showing improvements in question answering accuracy.
⛳ The primary bottleneck for ADAPT-LLM's performance lies in the IR system, suggesting potential areas for improvement in future research.

I really like the idea of optimizing LLMs for RAG applications, and this is a promising direction. I remember discussing ADAPT-RAG not long ago, which had a similar approach, but this one looks much neater as it doesn't require any additional components.

🚨 I post #genai content daily, follow along for the latest updates! #genai #llms #rag
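The retrieve-only-when-needed flow described in this post can be sketched roughly as follows. This is not the paper's code: `generate` and `retrieve` are hypothetical callables standing in for a fine-tuned model and an IR system, the prompt templates are invented, and `<RET>` is an ASCII stand-in for the paper's ⟨RET⟩ token.

```python
RET_TOKEN = "<RET>"  # ASCII stand-in for the special retrieval token


def answer_with_optional_retrieval(question, generate, retrieve):
    """Answer from parametric memory unless the model asks for retrieval.

    Returns (answer, used_retrieval). `generate` maps a prompt string to
    model output; `retrieve` maps a question to a context passage.
    Both are assumptions made for this sketch.
    """
    first_pass = generate(f"Question: {question}\nAnswer:")
    if RET_TOKEN not in first_pass:
        # The model answered directly from what it already knows.
        return first_pass.strip(), False
    # The model signalled it does not know: fetch context and ask again.
    context = retrieve(question)
    second_pass = generate(
        f"Context: {context}\nQuestion: {question}\nAnswer:"
    )
    return second_pass.strip(), True


# Toy stand-ins so the sketch runs without a real model or index.
def toy_generate(prompt):
    if prompt.startswith("Context:"):
        return "Answer drawn from the retrieved context"
    if "capital of France" in prompt:
        return "Paris"
    return RET_TOKEN


def toy_retrieve(question):
    return "stub passage for: " + question


popular = answer_with_optional_retrieval(
    "What is the capital of France?", toy_generate, toy_retrieve
)
obscure = answer_with_optional_retrieval(
    "Who won the 1931 village chess cup?", toy_generate, toy_retrieve
)
print(popular)  # -> ('Paris', False)
print(obscure)  # -> ('Answer drawn from the retrieved context', True)
```

The design point the post highlights is visible here: the decision to retrieve comes from the model's own output, so no separate classifier or router component is needed.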
5 Comments
Rory James Zauner

5+ years working at the intersection of technology and design - I love creating content around AI and design. My goal is to help you build products your customers will love.
4mo
LLM vs LLM: Are fine-tuned models better at evaluation?

Product teams often turn to LLMs to evaluate the generated output of their own LLM-powered solutions. Whether the solution relies on retrieval-augmented generation (RAG) or fine-tuning, LLMs are a valuable resource for evaluation.

You can see why teams would do this. Using LLMs to evaluate other LLMs is an efficient way for product teams to measure their progress. The alternative is relying on human evaluators, which can be costly and take a lot more time. Let’s be honest, few companies can afford to pull their subject matter experts away from their main work.

This does raise the question though: should we be fine-tuning evaluation models instead of using a larger, more general one? This new paper tries to shed light on how we can make a more informed decision on this matter.

**Paper highlights**
The research shows:
- LLM vs. LLM: Researchers pitted different LLM-based evaluation models against each other, including the reputable GPT-4.
- Fine-tuned models perform well in their domain: Models trained specifically for a task performed impressively within that domain, even surpassing GPT-4!
- There is a catch: These task-specific evaluation models struggled with anything outside their training.

**Results**
While these fine-tuned evaluators make for great in-domain evaluation, they struggle to generalize beyond their fine-tuning data. They also struggle with fairness when compared to GPT-4. These results indicate that businesses will need to be careful when employing a fine-tuned evaluation model in their pipeline.

**Thoughts**
This research highlights the need for a more robust and generalizable approach to LLM evaluation. One avenue worth exploring would be open-source evaluation models, allowing for a more democratic process with different contributors.

As the paper also states, its claims could be supplemented by human evaluators; the authors cited resource constraints as the reason for not pursuing this route. It would definitely be interesting to see which model’s evaluation is closer to the human evaluation scores.

As more businesses introduce LLMs into their processes, we will definitely see more evaluation research and techniques emerge.

Now it is your turn - what do you think of this? Share your thoughts below! ⬇️
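The open question raised above, which evaluator sits closer to human judgment, can be checked with a simple agreement measure once human labels exist. The sketch below computes raw agreement and Cohen's kappa for binary pass/fail verdicts. The labels are invented for illustration and do not come from the paper; "judge_a" and "judge_b" are hypothetical evaluator models.

```python
def agreement_rate(judge, human):
    """Fraction of items where the judge's label matches the human label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge, human):
    """Agreement corrected for the agreement expected by chance."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    labels = set(judge) | set(human)
    expected = sum(
        (judge.count(label) / n) * (human.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up verdicts: 1 = output judged acceptable, 0 = not acceptable.
human = [1, 1, 0, 0, 1, 0, 1, 0]
judge_a = [1, 1, 0, 0, 1, 0, 0, 0]
judge_b = [1, 0, 0, 1, 1, 0, 1, 1]

print(agreement_rate(judge_a, human), cohens_kappa(judge_a, human))  # -> 0.875 0.75
print(agreement_rate(judge_b, human), cohens_kappa(judge_b, human))  # -> 0.625 0.25
```

On this toy data judge_a tracks the human labels more closely than judge_b. With real data, the same comparison would be run separately on in-domain and out-of-domain samples, since the post notes that fine-tuned evaluators degrade outside their training data.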
To Data & Beyond

8,248 followers
2mo
Once such a system is built, how can you assess its performance? As you deploy it and users interact with it, how can you monitor its effectiveness, identify shortcomings, and continually enhance the quality of its responses? In this article, we will explore and share best practices for evaluating LLM outputs and provide insights into the experience of building these systems. One key distinction between this approach and traditional supervised machine learning applications is the speed at which you can develop LLM-based applications. As a result, evaluation methods typically do not begin with a predefined test set; instead, you gradually build a set of test examples as you refine the system.
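One way to picture "gradually build a set of test examples" is a small regression set that grows each time a shortcoming is found. The sketch below is an illustration under assumptions, not the linked article's method: `EvalCase`, `EvalSet`, and `toy_app` are hypothetical names, and the check is a plain substring match rather than an LLM-based grader.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    prompt: str
    must_contain: list  # substrings the response is expected to include


@dataclass
class EvalSet:
    cases: list = field(default_factory=list)

    def add(self, prompt, must_contain):
        """Record a new example, e.g. after spotting a bad response."""
        self.cases.append(EvalCase(prompt, list(must_contain)))

    def run(self, app):
        """Return (prompt, missing_substrings) for every failing case."""
        failures = []
        for case in self.cases:
            output = app(case.prompt).lower()
            missing = [s for s in case.must_contain if s.lower() not in output]
            if missing:
                failures.append((case.prompt, missing))
        return failures


# Toy application standing in for a prompt-based LLM system.
def toy_app(prompt):
    canned = {
        "What is your refund window?": "Refunds are accepted within 30 days.",
    }
    return canned.get(prompt, "Sorry, I am not sure.")


eval_set = EvalSet()
eval_set.add("What is your refund window?", ["30 days"])
# Added later, after a user interaction exposed a gap.
eval_set.add("Do you ship to Canada?", ["Canada"])

failures = eval_set.run(toy_app)
print(failures)  # -> [('Do you ship to Canada?', ['Canada'])]
```

Rerunning the whole set after every prompt change shows whether a fix for one example quietly broke another.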

Testing Prompt Engineering-Based LLM Applications

youssefh.substack.com
Ayushi Gupta

Technology Risk - Assurance | EY, Ex-Genpact ERC | PGDM Finance | MBA International Business
5mo
Before delving into the technical aspects, it's vital to understand the intricacies of business collaboration. The initial three steps form a critical "preproduction" phase: pitching, socializing, and collaborating. This phase collectively determines the deployment strategy and performance evaluation of machine learning. It extends beyond merely setting business objectives; it urges business professionals to explore how predictions will intricately affect operations. The collaboration between data scientists and business professionals in this cross-disciplinary team ensures a deployment plan that is not only technically feasible but also operationally sound. #BusinessStrategy #MachineLearning #CollaborationMagic

Harvard Business Review

14,431,097 followers
5mo

With machine learning projects, create a deployment plan that is both technically feasible and operationally viable.

Getting Machine Learning Projects from Idea to Execution

hbr.org
David Murargi

IT Director / SVP - Engineering and Ops
5mo
#execution, always the execution. More than in #ML, how often in your career have you been able to take your ideas and move them into scaled organizations or environments with millions of end users? It’s a collective #art, usually underpinned by inspiring leaders. Share your thoughts!

Harvard Business Review

14,431,097 followers
5mo

With machine learning projects, create a deployment plan that is both technically feasible and operationally viable.

Getting Machine Learning Projects from Idea to Execution

hbr.org

2 Comments
Comet

12,506 followers
1y
Today in Heartbeat: Training and deploying large-scale machine learning models can be complex and time-consuming. Explore how Comet can be useful for training, developing, and deploying LLMs in Dan Eberechi's new article: https://bit.ly/43IT4v6

How Comet Can Serve Your LLM Project from Pre-Training to Post-Deployment

heartbeat.comet.ml
Daniel Verten

Strategy @ Synthesia | Generative AI Strategy & Execution for the Enterprise | ex WPP
6mo Edited
LLM providers are facing lower gross margins than their software counterparts - but why?

The Information reports that Anthropic's gross margin is in the 50-55% range, well below the norm for software (70-90%) and even the S&P 500 average (~67%). One would expect LLMs to drive SaaS-like gross margins, but the data tells a different story.

Over the past decade, software enjoyed meaningfully higher GMs than asset-heavy companies. One explanation is the scarcity of top software engineering talent and the lower marginal cost of the software they produce. This talent arbitrage doesn't seem to be playing out for LLMs, even though top AI talent is even more scarce.

Sure, startups like OpenAI or Anthropic are early-stage and hence prioritising growth over cost control, but the lower gross margins point to a different trend: competitive pressures. The incredible amount of capital that poured into LLMs is driving competition and quick commoditisation in the space. The reality is that the outputs generated by GPT-4, Claude 2 or Gemini are not different enough to act as a moat. The consequence is a race to the bottom. Great for customers, less so for LLM margins...

It will be interesting to see how this plays out. One scenario is that a future model becomes meaningfully different, driving more pricing power for the company that builds it. Building AGI would clearly fall in this bracket (if you believe in those things). However, if no clear differentiation emerges, we can expect further margin crunch in the short term and consolidation in the long run.

#LLM #generativeai #business #AI #Competition
5 Comments
Elena Yunusov

Executive Director, Human Feedback Foundation | AI Strategy Leader | x-RBC / Borealis AI Head of Marketing
1mo Edited
Reason #12089 for why the work we do at Human Feedback Foundation matters: Nathan Lambert's excellent piece on the importance of human feedback in AI training.

Key observations:
- "“All of the open-source tools” for doing at-scale RLHF are largely broken. The engineering stack we, being the folks training open-aligned models, use is extremely fickle and out of touch with the techniques that industry is using."
- "If we are serious about making better open models, continuing to go deeper on fewer algorithms and datasets is needed. The dataset work will be how the best academics and open-source members differentiate themselves from the noise."

https://lnkd.in/gAkhYny6

RLHF roundup: Getting good at PPO, charting RLHF’s impact, RewardBench retrospective, and a reward model competition

interconnects.ai

1 Comment
