Picovoice Leopard and Cheetah offline, on-device speech-to-text engines are said to achieve cloud-level accuracy, rely on tiny Speech-to-Text models, and slash the cost of automatic transcription by up to 10 times.
Leopard is an on-device speech-to-text engine, while Cheetah is an on-device streaming speech-to-text engine, and both are cross-platform with support for Linux x86_64, macOS (x86_64, arm64), Windows x86_64, Android, iOS, Raspberry Pi 3/4, and NVIDIA Jetson Nano.
Looking at the cost is always tricky since companies have different pricing structures, and the table above basically shows the best scenario, where Picovoice is 6 to 20 times more cost-effective than solutions from Microsoft Azure or Google STT. Picovoice Leopard/Cheetah is free for the first 100 hours, and customers can pay a monthly $999 fee for up to 10,000 hours hence the $0.1 per hour cost with PicoVoice. If you were to use only 1000 hours out of your plan that would be $1 per hour, still not too bad. Check out the pricing page for details.
But the price is not everything, and a cheap service that does not do the job would be worthless, so the company provided some speech-to-text benchmarks with instructions to reproduce their setup on Github comparing Picovoice Leopard/Cheetah against AWS Transcribe, Google STT/STT-Enhanced, IBM Watson STT, and Microsoft Azure.
The first metric looked into is the word error rate to estimate the accuracy of the services/solutions. Picovoice Leopard and Cheetah achieve a relatively low word error rate similar to cloud-based services such as Azure, Amazon, and Google Enhanced, and much better than Mozilla DeepSpeech offline, on-device speech-to-text engine.
Mozilla DeepSpeech would still be the most cost-effective solution (since it’s free) provided your application can do with the lower accuracy, but another aspect is that Picovoice speech-to-text engines make use of much fewer resources than the Mozilla STT solution with a lower Real-Time Factor (RTF), the ratio of CPU processing time to the length of the input speech file, and acoustic and language models that are 60 times smaller.
The closed-source libraries for all supported platforms, as well as documentation, can be found on Github in the respective Cheetah and Leopard repositories.
Jean-Luc started CNX Software in 2010 as a part-time endeavor, before quitting his job as a software engineering manager, and starting to write daily news, and reviews full time later in 2011.
Support CNX Software! Donate via cryptocurrencies, become a Patron on Patreon, or purchase goods on Amazon or Aliexpress
Interesting but does it manage many languages ? Like French for example 😅
It only works in English now, but support for German, French and Spanish languages is expected later in 2022, and other languages should/may be supported in 2023 depending on market demand.
Funny how far the benchmark numbers diverge depending on the dataset. The numbers shown on the diagram picture, being an average between the 5 speech datasets Picovoice chose for their benchmark, seem to be very highly driven by how badly each speech-to-text model/service f*cks up on the CommonVoice dataset. Ironically, that’s the dataset collected by mozilla themselves and where Mozilla DeepSpeech has the worst hangups. The sad thing with Mozilla DeepSpeech is that in 2020, Mozilla CEO Mitchell Baker fired first 70 then 250 more Mozilla employees (out of a staff ca. 1000), among which, iirc, the Rust team and…… Read more »
Some of the links in the Flashlight section were messed up (The same link was repeated three times), I made some changes, but not sure those are the right links.