SpeechBrain

Downloaded 2 million times


Contributions from 140 developers


Released under the Apache License, Version 2.0

Executive summary

SpeechBrain is an open-source toolkit that aims to make Conversational AI more accessible to everyone. Created by Dr. Mirco Ravanelli and Dr. Titouan Parcollet, SpeechBrain facilitates the research and development of neural speech processing technologies, including speech recognition, spoken language understanding, speech enhancement, text-to-speech, and many more. The objective of SpeechBrain is to develop a machine that, akin to our own brains, can naturally comprehend speech, understand its content and emotions, and participate in engaging conversations with humans.

Fig. 1. The conceptual idea of SpeechBrain. The goal is the creation of different technologies that can emulate the communication capabilities of the brain.

SpeechBrain is currently one of the most popular open-source speech processing toolkits, providing a flexible and comprehensive platform for an international community of researchers, developers, and sponsors.

The challenge

To release the latest version of SpeechBrain (SpeechBrain 1.0), the SpeechBrain team needed to implement and support the most advanced deep learning technologies, such as self-supervised learning, continual learning, large language modeling, diffusion models, advanced beam search, streamable networks, interpretable neural networks, and much more. Implementing these complex techniques is not only challenging but also extremely computationally demanding. The main challenge for the release of SpeechBrain 1.0 was securing sufficient computational resources to keep pace with state-of-the-art technology, which requires ever larger models and datasets.

For instance, the team worked on continual learning, the process by which a neural network learns and adapts over time, integrating new information without forgetting previous knowledge. SpeechBrain added interfaces to large language models, making it easy for users to fine-tune them and create chatbots. The toolkit also implemented sophisticated algorithms for beam search, a method used in speech recognition to find the most likely sequence of words by considering multiple candidate hypotheses at each step; this significantly improved the performance of its speech recognizers. Along the same lines, the team developed speech recognizers that work in real time, processing spoken words as they are being said, making them faster and more responsive. Neural networks often operate as black boxes, meaning their internal workings are not easily understood. To mitigate this problem, SpeechBrain implemented several methods to make neural networks more interpretable, making them more understandable and transparent in how they reach their decisions. Finally, the team implemented diffusion models, advanced techniques for generating high-quality audio by gradually refining it from noise.
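To make the beam search idea above concrete, here is a minimal, self-contained sketch of the algorithm on toy data. It is not SpeechBrain's implementation (which handles acoustic and language model scores, streaming, and much more); the `step_scores` input is a hypothetical stand-in for a model's per-step word probabilities.

```python
import math

def beam_search(step_scores, beam_width=3):
    """Return the highest-scoring word sequence, keeping only the
    `beam_width` best partial hypotheses ("beams") at each step."""
    # Each hypothesis is a (words, log-probability) pair.
    beams = [([], 0.0)]
    for scores in step_scores:
        candidates = []
        for words, logp in beams:
            # Extend every surviving hypothesis with every possible word.
            for word, p in scores.items():
                candidates.append((words + [word], logp + math.log(p)))
        # Prune: keep only the `beam_width` most likely hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]

# Toy per-step probability distributions over a tiny vocabulary.
steps = [
    {"the": 0.6, "a": 0.3, "an": 0.1},
    {"cat": 0.5, "hat": 0.4, "mat": 0.1},
]
print(beam_search(steps))  # ['the', 'cat']
```

Keeping several hypotheses alive at each step is what lets beam search recover sequences that a purely greedy decoder would miss.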

To achieve these demanding tasks, SpeechBrain required a scalable cloud platform that can support large AI models trained on increasing amounts of data. As its goal is to democratize Conversational AI, SpeechBrain also wanted to find a partner that aligned with its values of openness and transparency, as well as the open-source principles of portability, interoperability, and reversibility.

The solution

With its commitment to trust and openness and offering a range of cloud solutions built on open-source technologies, OVHcloud was the natural choice for SpeechBrain. SpeechBrain adopted NVIDIA® GPU instances and AI Training, both hosted on OVHcloud’s Public Cloud platform.

GPUs (Graphics Processing Units) are computer chips within servers that can process large datasets and perform mathematical calculations at high speeds. For this reason, they are used by AI developers and data scientists to create and run AI training models. NVIDIA GPUs are regarded as some of the fastest in existence, and SpeechBrain adopted NVIDIA Tesla® V100 GPUs, NVIDIA A100 Tensor Core GPUs, and NVIDIA H100 Tensor Core GPUs to support its specific AI training requirements. These GPUs are all virtual and accessible as cloud instances on OVHcloud’s Public Cloud, with no need to purchase physical hardware.

The Tesla V100 delivers the performance of up to 100 CPUs in a single GPU, making it one of the most powerful GPUs on the market today. It offers up to 30x higher inference performance and 47x higher throughput than a single CPU, which reduces AI training times from weeks to days. These high speeds enabled SpeechBrain to boost the efficiency of its training and accelerate time to market.

The A100 Tensor Core GPU provided further performance, with up to 3x higher AI training speeds on the largest models. It enables multiple networks to operate on a single GPU at the same time and can also be partitioned into several instances to cope with dynamic demands. The A100 also offers increased memory capacity and up to 249x higher AI inference performance than CPUs, making it ideal for running SpeechBrain’s large-scale speech recognition models.

To solve its most complex calculations, SpeechBrain also adopted the H100 Tensor Core GPU, which accelerates large language model training by up to 30x and includes a Transformer Engine designed for trillion-parameter models. These capabilities delivered the power and speed required to train SpeechBrain’s most complex models with ease.

Finally, to perform its training tasks, SpeechBrain leveraged OVHcloud’s AI Training solution. Hosted on the Public Cloud and built on the open-source Kubernetes platform, this tool enables a training task to be launched in just a few seconds and is compatible with open-source machine learning libraries such as PyTorch, TensorFlow, and scikit-learn. Developers can also kick-start their projects using pre-configured Jupyter notebooks and pre-installed Docker images. AI Training also optimizes GPU resource allocation and allows multiple tasks to run in parallel, enabling developers to focus on training their AI models without having to worry about complex engineering tasks.

The result

Partnering with OVHcloud equipped SpeechBrain with the speed, performance and tools required to deliver its large-scale Conversational AI training models.

Adopting NVIDIA GPUs and AI Training enabled SpeechBrain to accelerate its AI model training whilst accommodating increasing volumes of data. As these solutions were all hosted on the Public Cloud, SpeechBrain benefited from a scalable and reliable cloud infrastructure, backed by a 99.99% Service Level Agreement (SLA) and built across multiple data centers to ensure high availability. This ensured SpeechBrain’s GPUs were accessible whenever the team needed them. The Public Cloud also offers transparent pricing and cost tracking via the OVHcloud Control Panel, enabling SpeechBrain to control costs efficiently.

With solutions built on open-source technologies and as a longstanding member of the Open Invention Network (OIN), OVHcloud was also a partner aligned with SpeechBrain’s values of openness and transparency. The two plan to continue working together to make Conversational AI more accessible to a wider audience and to support AI innovation worldwide.

“Our most positive experience revolved around the availability of computational resources, especially GPUs. They were consistently accessible even when we required multiple simultaneously. Additionally, we greatly value the introduction of H100 GPUs, as they have significantly accelerated our progress.”
Dr. Mirco Ravanelli, Creator of SpeechBrain

Resources
Website: https://speechbrain.github.io/
Code Repository: https://github.com/speechbrain/speechbrain
What’s new in SpeechBrain: https://colab.research.google.com/drive/1IEPfKRuvJRSjoxu22GZhb3czfVHsAy0s?usp=sharing
SpeechBrain: A General-Purpose Speech Toolkit: https://arxiv.org/abs/2106.04624