Paul Bratcher's Post


AI Expert | Transformation | Strategy | Digital | Futurist | CAIO | Startup Co-founder | Keynote Speaker | Make better not just faster

Would be good to have an open source contender

Aymeric Roucher

Machine Learning Engineer @ Hugging Face 🤗 | Polytechnique - Cambridge

๐Š๐ฒ๐ฎ๐ญ๐€๐ข ๐๐ž๐ฆ๐จ๐ž๐ฌ ๐Ÿ๐ข๐ซ๐ฌ๐ญ ๐œ๐จ๐ง๐ฏ๐ž๐ซ๐ฌ๐š๐ญ๐ข๐จ๐ง๐š๐ฅ ๐ฆ๐จ๐๐ž๐ฅ: ๐ญ๐ก๐ž๐ฒ'๐ซ๐ž ๐ ๐จ๐ข๐ง๐  ๐Ÿ๐จ๐ซ ๐†๐๐“-๐Ÿ’๐จ ๐Ÿ’ฅ Non-profit French lab Kyutai was created last November with a huge endowment of 300Mโ‚ฌ. So I was eagerly waiting for their first results, but nothing was coming! โžก๏ธ Until now. They really delivered today: in their first ever presentation, they revealed their first model named "๐š–๐š˜๐šœ๐š‘๐š’". It's a model that listens and speaks: audio to audio. How is that so important? Generally, AI systems chain 3 steps to perform audio to audio: 1. Listen. ๐Ÿ”Šโžก๏ธ ๐Ÿ’ฌ 2. Think. ๐Ÿ’ฌ โžก๏ธ ๐Ÿ’ฌ 3. Speak ๐Ÿ’ฌ โžก๏ธ ๐Ÿ”Š This is because it's easier to train reasoning in a text to text model, since you have huge corpuses of data to train on. But this position of text as a central bottleneck creates problems. On top of losing non-textual nuances like the voice tone, it creates a horrible latency: it's generally considered very hard for such a pipeline to go under 500ms latency, which makes it too slow for a natural conversation โ›” So ๐—ž๐˜†๐˜‚๐˜๐—ฎ๐—ถ ๐—ท๐˜‚๐˜€๐˜ ๐—บ๐—ฎ๐—ฑ๐—ฒ ๐—ฎ ๐—ฑ๐—ถ๐—ฟ๐—ฒ๐—ฐ๐˜ ๐—ฎ๐˜‚๐—ฑ๐—ถ๐—ผ ๐˜๐—ผ ๐—ฎ๐˜‚๐—ฑ๐—ถ๐—ผ ๐—บ๐—ผ๐—ฑ๐—ฒ๐—น. ๐Ÿ”Šโžก๏ธ๐Ÿ”Š For this, they have an impressive novel approach: instead of using text for training, they use extremely compressed audio. The model also uses textual thoughts alongside audio in its training, since written text is still a very efficient representation. It has two streams: input audio (listening) and output audio (speaking) going on at all times. โœ… All of this together gives a ๐™ง๐™š๐™–๐™ก๐™ก๐™ฎ ๐™ž๐™ข๐™ฅ๐™ง๐™š๐™จ๐™จ๐™ž๐™ซ๐™š ๐™˜๐™ค๐™ฃ๐™ซ๐™š๐™ง๐™จ๐™–๐™ฉ๐™ž๐™ค๐™ฃ ๐™ข๐™ค๐™™๐™š๐™ก, ๐™–๐™˜๐™๐™ž๐™š๐™ซ๐™ž๐™ฃ๐™œ ๐™ช๐™ฃ๐™™๐™š๐™ง 200๐™ข๐™จ ๐™ก๐™–๐™ฉ๐™š๐™ฃ๐™˜๐™ฎ. โœจ Moshi has a very natural flow, it can even whisper or imitate a french accent. Small limitation: it tends to continue speaking even when people want it to stop (but I do it too) ๐ˆ๐ง๐ฌ๐ข๐ ๐ก๐ญ๐ฌ: โžค ๐—ง๐—ถ๐—ป๐˜† ๐˜๐—ฒ๐—ฎ๐—บ: 8 people โžค ๐—ฆ๐—ข๐—ง๐—” ๐—ฟ๐—ฒ๐˜€๐˜‚๐—น๐˜๐˜€: ๐—ป๐—ฎ๐˜๐˜‚๐—ฟ๐—ฎ๐—น ๐—ฐ๐—ผ๐—ป๐˜ƒ๐—ฒ๐—ฟ๐˜€๐—ฎ๐˜๐—ถ๐—ผ๐—ป๐˜€, ๐˜„๐—ถ๐˜๐—ต ๐—ฟ๐—ฒ๐—ฐ๐—ผ๐—ฟ๐—ฑ ๐—น๐—ผ๐˜„ ๐—น๐—ฎ๐˜๐—ฒ๐—ป๐—ฐ๐˜† ๐˜‚๐—ป๐—ฑ๐—ฒ๐—ฟ ๐Ÿฎ๐Ÿฌ๐Ÿฌ๐—บ๐˜€ ๐Ÿ… โžค ๐—ง๐˜„๐—ผ ๐—บ๐—ผ๐—ฑ๐—ฎ๐—น๐—ถ๐˜๐—ถ๐—ฒ๐˜€: main is audio, also uses a second text stream to support thought โžค Two permanent streams: audio input and audio output โžค Fine-tuned on 100k synthetic audio + text sampltes โžค 7B model, probably quantized on 8 bits โžค Model is going to be released in ๐—ข๐—ฝ๐—ฒ๐—ป-๐—ฆ๐—ผ๐˜‚๐—ฟ๐—ฐ๐—ฒ ๐ŸŽ‰
