We’ve been testing LLMs to solve a data enrichment and classification problem, on the assumption that more verbose text data will make classifications more accurate.

TL;DR: there’s still no clear model/framework winner across LLMs when evaluated on accuracy, latency, and cost for this limited classification/prompting use case.

Some summary results on a small dataset:
- claude-3.5-sonnet: cost 65c, with good results
- gpt-4o: batched cost 10c, practically the same results, but very, very slow
- gpt-4-turbo: batched cost 22c, practically the same results
- gpt-3.5-turbo: batched cost 2c, faster and just as accurate as GPT-4
- amazon-titan-text-premier: cost 1c, results also just as acceptable as the above
- meta-llama2-70b-chat: cost 5c, and it made real mistakes compared to the others — llama2-70b is not up to the task

Takeaway so far: seeding prompts with semantic search results seems to level the playing field, so that less sophisticated models can make more informed classifications. Some models, like Titan and Llama, needed more tuning than the other out-of-the-box models for our purposes.

++ John Butler for the great eval work here

*Some model leaderboard info here from huggingface; the rankings seem to change weekly:
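For anyone curious what “seeding prompts with semantic search results” can look like in practice, here’s a minimal sketch. Everything in it is hypothetical — the toy bag-of-words similarity stands in for a real embedding model, and the labeled examples are made up — but the shape is the same: retrieve the most similar labeled records, then prepend them to the classification prompt as few-shot context.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words "embedding"; a real pipeline would use an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical labeled examples forming the semantic-search index.
labeled = [
    ("invoice overdue payment reminder", "billing"),
    ("reset my password login issue", "account"),
    ("shipment delayed tracking number", "logistics"),
]

def build_prompt(record, k=2):
    # Retrieve the k most similar labeled examples for the incoming record...
    q = embed(record)
    ranked = sorted(labeled, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    context = "\n".join(f"Text: {t}\nLabel: {l}" for t, l in ranked[:k])
    # ...and seed the classification prompt with them as few-shot context.
    return (f"Classify the text using these similar labeled examples:\n"
            f"{context}\n\nText: {record}\nLabel:")

prompt = build_prompt("where is my delayed package")
```

The resulting prompt is what gets sent to whichever model you're evaluating; because the retrieved examples carry most of the task-specific signal, cheaper models have less to figure out on their own.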
I am curious about real-world use cases (i.e. not the classification/clustering itself, but solving a real-world problem like world hunger, energy demands, climate issues, and so on). What would the cost be to run those vs. customer segmentation and the like? Or what exactly are you segmenting for, and why?
This is fascinating Andy Owens ... I don't know anyone else who has shared even a high-level summary of this. This may be a dumb question, but what outcome are you hoping for in future state?