Espressif ESP-SR is a speech recognition framework that enables on-device speech recognition on ESP32 and ESP32-S3 wireless microcontrollers, with the latter recommended due to its vector extension for AI acceleration and larger, high-speed octal SPI PSRAM.
The ESP-SR framework was first released on December 17, 2021 with version 1.0, and the v1.20 update followed in March of this year. But I only found out about the ESP-SR offline speech recognition solution through a tweet by John Lee showing an ESP-SR demo video by @ThatProject.
Comrades of the world, liberate your hands from the chains of typing and touching germy switches! Embrace the revolutionary power of speech recognition with ESP32-S3 + ESP-SR. Let your words flow freely, for the proletariat shall not be silenced by keyboards or bourgeois input… pic.twitter.com/bm3udteB3o
— John Lee (@EspressifSystem) July 15, 2023
I was initially confused since ESP32 boards have supported speech recognition for years using the ESP-ADF framework. But the key difference is that the latter relies on online voice assistants such as Baidu DuerOS, Amazon Alexa, and Google Assistant, while the relatively new ESP-SR runs recognition locally on the ESP32 CPU, so you don’t even need a network connection for this to work. We’ve written about various offline voice recognition modules in the last few years, but I didn’t know this was already implemented on the ESP32 chips.
The GitHub repository for ESP-SR lists four main components:
- Audio Front-end AFE
- WakeNet Wake Word Engine
- MultiNet Speech Command Word Recognition
- Speech Synthesis (only supports the Chinese language at this time)
If some of the components above ring a bell, that’s because they are existing solutions. We covered the ESP-AFE algorithms when they became Alexa certified, while WakeNet and MultiNet are part of the ESP-SKAINET assistant introduced in 2019. What appears to be new are test apps for speech recognition and text-to-speech conversion that were committed just 3 to 5 days ago.
So it looks like ESP-SR simply combines all those different projects as components to help with integration into customers’ projects. You’ll find documentation on the Espressif website, and the company recommends the ESP32-S3-Korvo-1 or ESP32-S3-Korvo-2 development boards to get started, although I’d assume it also works on other ESP32-S3 smart audio devkits with microphones such as the ESP32-S3-BOX.
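To make the way those components fit together a bit more concrete, here is a minimal, purely conceptual sketch in Python of a front-end, wake word, and command pipeline. It is not the ESP-SR API (which is a C library): the `VoicePipeline` class, its callables, the timeout value, and the use of text strings in place of audio frames are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List, Optional


class State(Enum):
    WAITING_FOR_WAKE_WORD = auto()
    LISTENING_FOR_COMMAND = auto()


@dataclass
class VoicePipeline:
    # All three callables are hypothetical stand-ins, not ESP-SR functions.
    front_end: Callable[[str], str]                    # AFE-like stage: cleans a raw frame
    detect_wake_word: Callable[[str], bool]            # WakeNet-like stage
    recognize_command: Callable[[str], Optional[str]]  # MultiNet-like stage
    command_timeout_frames: int = 3                    # assumed value, for illustration
    state: State = State.WAITING_FOR_WAKE_WORD
    frames_left: int = 0
    commands: List[str] = field(default_factory=list)

    def feed(self, raw_frame: str) -> None:
        """Process one frame; frames are plain strings here instead of PCM audio."""
        frame = self.front_end(raw_frame)
        if self.state is State.WAITING_FOR_WAKE_WORD:
            # Only the wake word detector runs until the device is woken up.
            if self.detect_wake_word(frame):
                self.state = State.LISTENING_FOR_COMMAND
                self.frames_left = self.command_timeout_frames
            return
        command = self.recognize_command(frame)
        if command is not None:
            self.commands.append(command)
            self.state = State.WAITING_FOR_WAKE_WORD
            return
        # No command in this frame: count down and go back to sleep on timeout.
        self.frames_left -= 1
        if self.frames_left == 0:
            self.state = State.WAITING_FOR_WAKE_WORD


def demo() -> List[str]:
    known_commands = {"turn on the light", "turn off the light"}
    pipeline = VoicePipeline(
        front_end=lambda f: f.strip().lower(),
        detect_wake_word=lambda f: f == "hi esp",
        recognize_command=lambda f: f if f in known_commands else None,
    )
    frames = ["noise", "turn on the light", " Hi ESP ", "noise",
              "Turn on the light", "turn off the light"]
    for frame in frames:
        pipeline.feed(frame)
    return pipeline.commands
```

In this toy run, `demo()` only records the command spoken after the wake word. The commands before it and after the pipeline has gone back to waiting are ignored, which is the behavior you would expect from a wake word engine gating a command recognizer.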
Jean-Luc started CNX Software in 2010 as a part-time endeavor, before quitting his job as a software engineering manager, and starting to write daily news, and reviews full time later in 2011.
Wake word on ESP32 would be a perfect project in ESPHome for “Home Assistant’s year of Voice”
https://www.home-assistant.io/blog/2022/12/20/year-of-voice/
Willow is already doing this.
https://github.com/toverainc/willow
The problem with Willow is that you need to leave an RTX 3xxx-class card running 24/7 to do the processing.
For some reason Willow has chosen to use Whisper for ASR, which, yeah, if you use the large model then you need some oomph when using off-device ASR. ESP-SR does offer on-device ASR through MultiNet, but it’s really pushing past the device’s capability, hence why it broadcasts via a KW trigger to a central ASR. WakeNet, the KW part, also isn’t the best, and the claims that Willow is competitive are extremely optimistic, but the ADF does give a 2/3-mic BSS (blind source separation algorithm) that can really help with noise and far field. Still though it uses…