Would be good to have an open source contender
Kyutai demoes first conversational model: they're going for GPT-4o 🔥

Non-profit French lab Kyutai was created last November with a huge endowment of 300M€. So I was eagerly waiting for their first results, but nothing was coming! ➡️ Until now. They really delivered today: in their first ever presentation, they revealed their first model, named "Moshi".

It's a model that listens and speaks: audio to audio. Why is that so important?

Generally, AI systems chain 3 steps to perform audio to audio:
1. Listen (audio ➡️ text)
2. Think (text ➡️ text)
3. Speak (text ➡️ audio)

This is because it's easier to train reasoning in a text-to-text model, since you have huge corpora of data to train on. But this position of text as a central bottleneck creates problems. On top of losing non-textual nuances like voice tone, it creates horrible latency: it's generally considered very hard for such a pipeline to go under 500ms of latency, which makes it too slow for a natural conversation ❌

So Kyutai just made a direct audio-to-audio model. 🔊 ➡️ 🔊

For this, they have an impressive novel approach: instead of training on text, they train on extremely compressed audio. The model also uses textual thoughts alongside audio during training, since written text is still a very efficient representation. It has two streams, input audio (listening) and output audio (speaking), going on at all times. ✅

All of this together gives a really impressive conversation model, achieving under 200ms latency. ✨ Moshi has a very natural flow; it can even whisper or imitate a French accent.
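To see why the cascaded 3-step approach struggles with latency, here's a minimal sketch. The stage timings are made-up illustrative numbers, not figures from Kyutai; only the ~500ms threshold and Moshi's sub-200ms figure come from the post:

```python
# Hypothetical latency budget for a cascaded voice pipeline.
# Stage timings below are illustrative assumptions, not measured values.
ASR_MS = 150   # 1. Listen: speech-to-text
LLM_MS = 250   # 2. Think: text-to-text reasoning
TTS_MS = 150   # 3. Speak: text-to-speech

def cascaded_latency_ms() -> int:
    """Stages run sequentially, so their latencies add up."""
    return ASR_MS + LLM_MS + TTS_MS

def direct_latency_ms(model_ms: int = 200) -> int:
    """A single audio-to-audio model has one latency budget (Moshi reports <200ms)."""
    return model_ms

print(cascaded_latency_ms())  # 550 -- over the ~500ms conversational threshold
print(direct_latency_ms())    # 200
```

Even with generous per-stage numbers, the sum of three sequential stages lands above the conversational threshold, which is the bottleneck a direct audio-to-audio model removes.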
Small limitation: it tends to continue speaking even when people want it to stop (but I do that too)

Insights:
➤ Tiny team: 8 people
➤ SOTA results: natural conversations, with record low latency under 200ms 🚀
➤ Two modalities: main is audio, plus a second text stream to support thought
➤ Two permanent streams: audio input and audio output
➤ Fine-tuned on 100k synthetic audio + text samples
➤ 7B model, probably quantized to 8 bits
➤ Model is going to be released as open source 🚀
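The "two permanent streams" point can be pictured as a full-duplex loop where, at every timestep, the model both consumes an incoming compressed-audio token and emits an outgoing one (with an optional text "thought" token alongside). This is a toy sketch of that idea under my own naming, not Kyutai's actual tokenizer or architecture:

```python
from dataclasses import dataclass

@dataclass
class DualStreamStep:
    """One timestep of a hypothetical full-duplex model (toy illustration)."""
    audio_in: int           # compressed audio token heard this step (toy codebook id)
    audio_out: int          # compressed audio token spoken this step
    text_thought: str = ""  # optional inner-monologue text token supporting reasoning

def run_duplex(audio_in_tokens):
    """Toy loop: the model listens and speaks simultaneously, with no turn-taking gap."""
    steps = []
    for t, tok in enumerate(audio_in_tokens):
        # Placeholder "model": emit the next codebook id; a real model would predict it.
        steps.append(DualStreamStep(audio_in=tok,
                                    audio_out=(tok + 1) % 1024,
                                    text_thought=f"t{t}"))
    return steps

trace = run_duplex([3, 7, 11])
print(len(trace))  # 3 -- one output token per input token, every step
```

The design point is that there is no "your turn / my turn" boundary: both streams advance every step, which is what lets the model interject, back-channel, or keep talking with minimal latency.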