After a couple of tries I'm impressed by its accuracy (though you need at least the small model if you want enough precision), but I'm still unimpressed by its resource usage and performance.
The small model took ~30 seconds to process an audio file containing 2 seconds of speech on my 6-year-old laptop with an i7 CPU, and used more than 4 GB of RAM in the process.
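For anyone who wants to reproduce this kind of measurement, here's a minimal sketch using only the Python standard library. The `transcribe` function is a stub standing in for the real speech-to-text call (e.g. a Whisper model invocation); the file name is just an example.

```python
# Sketch: timing a transcription call and checking peak memory, stdlib only.
# `transcribe` is a placeholder for the actual model call.
import resource
import time

def transcribe(path: str) -> str:
    # Placeholder for the real speech-to-text call.
    return "hello world"

start = time.perf_counter()
text = transcribe("speech.wav")
elapsed = time.perf_counter() - start

# ru_maxrss is reported in kilobytes on Linux.
peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
print(f"{elapsed:.1f}s, peak RSS ~ {peak_mb:.0f} MB, text: {text!r}")
```

Wrapping the real model call this way makes it easy to compare wall time and memory across model sizes on the same machine.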
Mozilla's #DeepSpeech model was also heavy when I last used it ~1 year ago, but not THIS slow (although it was also slightly less accurate).
For now I definitely see the use case for OpenAI's new model in offline transcriptions, but it's still very far from being usable in real-time applications such as voice assistants.
I'm still looking for a good open-source model that can run on a Raspberry Pi as a stable voice assistant. Ideally, it needs a small and simple model for hotword detection (I used to use Snowboy, but that project is now dead), plus a more complex model that kicks in once the hotword is detected in order to transcribe the speech. And the transcription needs to complete within 5 seconds at most to meet the real-time expectations of a voice assistant.
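The two-stage loop described above could be sketched roughly like this. Both detectors here are stubs: in practice `detect_hotword` would be a tiny always-on keyword-spotting model, and `transcribe` the heavyweight STT model that only runs after the hotword fires. All names and the hotword string are hypothetical.

```python
# Sketch of a two-stage voice-assistant loop: a cheap always-on hotword
# detector gating a heavier transcription model. Both stages are stubs.
from typing import Iterable, List, Optional

HOTWORD = "hey assistant"  # hypothetical wake phrase

def detect_hotword(frame: str) -> bool:
    # Stub: a real implementation runs a tiny model on each audio frame.
    return HOTWORD in frame

def transcribe(frames: Iterable[str]) -> str:
    # Stub for the expensive model, invoked only after the hotword fires.
    return " ".join(frames)

def assistant_loop(audio_frames: List[str]) -> Optional[str]:
    for i, frame in enumerate(audio_frames):
        if detect_hotword(frame):
            # Hand the remaining audio to the heavy model; for a usable
            # assistant this call should return within ~5 seconds.
            return transcribe(audio_frames[i + 1:])
    return None
```

With this split, the expensive model never touches audio until the cheap detector triggers, which is what keeps the idle CPU and RAM footprint small enough for a Raspberry Pi.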
Ideally, it should ship only the model, without a lot of bloat around it that makes it harder to embed - so #Mycroft is excluded.
So far, I haven't found any such model. My RPi still runs the Google Assistant push-to-talk script that I adapted into Platypush years ago, plus a Snowboy hotword detection model that I managed to train before the project was shut down. If anybody knows of better solutions that could cut this last dependency on Google, I'd be happy to try them out.