Espressif ESP-SR enables on-device speech recognition framework on ESP32-S3 and ESP32 WiSoCs

Espressif ESP-SR is a framework enabling on-device speech recognition on ESP32 and ESP32-S3 wireless microcontrollers, with the latter recommended due to its vector extensions for AI acceleration and its larger, high-speed octal SPI PSRAM.

The ESP-SR framework was first released on December 17, 2021 as version 1.0, followed by the v1.20 update in March of this year, but I only found out about the ESP-SR offline speech recognition solution through a tweet by John Lee showing an ESP-SR demo video by @ThatProject.

Comrades of the world, liberate your hands from the chains of typing and touching germy switches! Embrace the revolutionary power of speech recognition with ESP32-S3 + ESP-SR. Let your words flow freely, for the proletariat shall not be silenced by keyboards or bourgeois input… pic.twitter.com/bm3udteB3o

— John Lee (@EspressifSystem) July 15, 2023

I was initially confused since ESP32 boards have supported speech recognition for years using the ESP-ADF framework. But the key difference is that the latter relies on online voice assistants such as Baidu DuerOS, Amazon Alexa, and Google Assistant, while the relatively new ESP-SR runs recognition locally on the ESP32 CPU, so you don’t even need a network connection for this to work. We’ve written about various offline voice recognition modules in the last few years, and I didn’t know this was already implemented on the ESP32 chips.

The GitHub repository for ESP-SR lists four main components:

Audio Front-end AFE
WakeNet Wake Word Engine
MultiNet Speech Command Word Recognition
Speech Synthesis (only supports the Chinese language at this time)

If some of the components above ring a bell, that’s because they are existing solutions: we covered the ESP-AFE algorithms when they became Alexa certified, while WakeNet and MultiNet are part of the ESP-SKAINET assistant introduced in 2019. What appears to be new are the test apps for speech recognition and text-to-speech conversion that were committed just 3 to 5 days ago.

[Figure: ESP-SR on-device speech recognition workflow]

So it looks like ESP-SR simply combines all those different projects as components to help with integration into customers’ projects. You’ll find documentation on the Espressif website, and the company recommends the ESP32-S3-Korvo-1 or ESP32-S3-Korvo-2 development boards to get started, although I’d assume it should also work on other ESP32-S3 smart audio devkits with microphones such as the ESP32-S3-BOX.
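
To make the way those components chain together a bit more concrete, here is a minimal Python sketch of the flow: a front-end stage cleans up each frame, a wake word stage gates everything else, and a command stage maps a phrase to a command ID. This is only a toy model and not the ESP-SR API. The names SpeechPipeline, audio_front_end, feed, and timeout_frames are all hypothetical, the wake word and command phrases are made up, and each "frame" is a text string standing in for audio that the real framework would process as PCM samples on the ESP32-S3.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Dict, Optional


class State(Enum):
    WAITING_FOR_WAKE_WORD = auto()
    LISTENING_FOR_COMMAND = auto()


def audio_front_end(frame: str) -> str:
    """Hypothetical stand-in for the AFE stage: normalizes the input frame."""
    return frame.strip().lower()


@dataclass
class SpeechPipeline:
    """Toy model of the AFE -> wake word -> command word chain."""

    wake_word: str                  # phrase the wake word stage waits for
    commands: Dict[str, int]        # command phrase -> command ID
    timeout_frames: int = 3         # frames to listen for after the wake word
    state: State = State.WAITING_FOR_WAKE_WORD
    _frames_left: int = 0

    def feed(self, frame: str) -> Optional[int]:
        """Process one frame; return a command ID if one was recognized."""
        clean = audio_front_end(frame)

        if self.state is State.WAITING_FOR_WAKE_WORD:
            # Commands are ignored until the wake word has been heard.
            if clean == self.wake_word:
                self.state = State.LISTENING_FOR_COMMAND
                self._frames_left = self.timeout_frames
            return None

        # Command stage: look the phrase up in the fixed command table.
        command_id = self.commands.get(clean)
        self._frames_left -= 1
        if command_id is not None or self._frames_left == 0:
            # Go back to sleep after a command or when the window expires.
            self.state = State.WAITING_FOR_WAKE_WORD
        return command_id


pipeline = SpeechPipeline(
    wake_word="hi esp",
    commands={"turn on the light": 0, "turn off the light": 1},
)
frames = ["turn on the light", " Hi ESP ", "noise", "Turn off the light"]
print([pipeline.feed(f) for f in frames])  # [None, None, None, 1]
```

The first "turn on the light" is dropped because no wake word has been heard yet, which is the main point of the gating: the command stage only runs for a short window after the wake word stage fires.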

Jean-Luc Aufranc (CNXSoft)

Jean-Luc started CNX Software in 2010 as a part-time endeavor, before quitting his job as a software engineering manager, and starting to write daily news, and reviews full time later in 2011.

3 Replies to “Espressif ESP-SR enables on-device speech recognition framework on ESP32-S3 and ESP32 WiSoCs”

Hedda says:

July 18, 2023 at 16:29

Wake word on ESP32 would be a perfect project in ESPHome for “Home Assistant’s year of Voice”
https://www.home-assistant.io/blog/2022/12/20/year-of-voice/

1. Jon Smirl says:
  
  July 18, 2023 at 18:24
  
  Willow is already doing this.
  https://github.com/toverainc/willow
  
  The problem with Willow is that you need to leave an RTX 3xxx-class card running 24/7 to do the processing.
  
  1. Stuart Naylor says:
    
    July 21, 2023 at 02:11
    
    For some reason Willow has chosen to use Whisper for ASR, which, yeah, if you use the large model, means you need some oomph for off-device ASR.
    ESP-SR does offer on-device ASR through MultiNet, but it’s really pushing past the device’s capability, hence why it broadcasts via a KW trigger to a central ASR.
    
    WakeNet, the KW part, also isn’t the best, and the claim that Willow is competitive is extremely optimistic, but the ADF does give a 2/3-mic BSS (blind source separation algorithm) that can really help with noise and far field.
    
    Still, it uses the $50 ESP32-S3-Box dev kit, which is loaded with bloat as a technology demonstrator, and depending on your opinion, open source in this arena can be considered extremely optimistic or snake oil with its comparisons.
    
    What WakeNet does prove is that the ESP32-S3 has the potential to make a really good, low-cost broadcast KW microphone, where maybe several could be used in a zone to increase coverage.
    If anybody is up for creating a dual-mic ADC shim for the ESP32-S3 or a low-cost, S3-specific mic dev kit, then please do.
    
    The current WakeNet suffers from poor datasets; if a community got together with an opt-in and did what Big Data does, collating an on-device dataset, that problem would be solved.
    There is no better dataset than one recorded on the device in use, and it’s a catch-22.
