OpenAI has just released github.com/openai/whisper, a new open-source model for speech recognition.

After a couple of tries I'm impressed by its accuracy (though you need the small model or a larger one to get enough precision), but I'm still unimpressed by its resource usage and performance.

The small model took ~30 seconds to process an audio file containing 2 seconds of speech on my 6-year-old laptop with an i7 CPU, and it used more than 4 GB of RAM in the process.
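For reference, this is roughly the kind of invocation I'm timing - a minimal sketch using the Python API from the project's README (the file name is just a placeholder):

    import time
    import whisper

    # Load the "small" checkpoint; "tiny" and "base" are faster,
    # but in my tests they weren't accurate enough.
    model = whisper.load_model("small")

    start = time.time()
    result = model.transcribe("speech-sample.wav")  # placeholder path
    print(f"Transcribed in {time.time() - start:.1f}s: {result['text']}")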

Mozilla's model (DeepSpeech) was also heavy when I last used it about a year ago, but not THIS slow (though it was also slightly less accurate).

For now I definitely see the use case for OpenAI's new model in offline transcription, but it's still very far from being usable in real-time applications such as voice assistants.

I'm still looking for a good open-source model that can run on a Raspberry Pi as a stable voice assistant. Ideally, it needs a small and simple model for hotword detection (I used to use Snowboy, but that project is now dead), plus a more complex model that kicks in once the hotword is detected and transcribes the speech. And the transcription needs to complete within 5 seconds at most to meet the real-time expectations of a voice assistant.
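To make the shape of that pipeline concrete, here's a rough sketch of what I'm after. The whisper calls match the project's README, but the hotword stage is a crude energy-threshold stand-in (a real deployment would use a small always-on model), and the sample rate, chunk lengths and threshold are all assumptions:

    import numpy as np
    import sounddevice as sd
    import whisper

    SAMPLE_RATE = 16000  # Whisper expects 16 kHz mono audio

    model = whisper.load_model("small")  # load the heavy model once, up front

    def record(seconds: float) -> np.ndarray:
        # Capture `seconds` of mono audio from the default microphone.
        audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                       channels=1, dtype="float32")
        sd.wait()
        return audio.ravel()

    def hotword_detected(chunk: np.ndarray) -> bool:
        # Stand-in for a real hotword model (Snowboy-style): a bare
        # energy threshold. The 0.02 value is an arbitrary assumption.
        return float(np.abs(chunk).mean()) > 0.02

    while True:
        if hotword_detected(record(1.0)):
            command = record(5.0)               # capture the spoken command
            result = model.transcribe(command)  # transcribe() accepts numpy arrays
            print(result["text"])

On an RPi the transcription step is exactly where this would blow past the 5-second budget, which is the gap I'm trying to close.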

Ideally, it should include only the model, not a lot of bloat around it that makes it harder to embed - anything heavier is excluded.

So far, I haven't found any such model. My RPi still runs the Google Assistant push-to-talk script that I adapted into Platypush years ago, along with a Snowboy hotword model that I managed to train before the project shut down. If anybody knows of better solutions that could cut this last dependency on Google, I'd be happy to try them out.
