All these APIs are doing is converting audio to text, processing it through a language model, and then converting it back to audio. It might seem sophisticated on the surface, but underneath it's just text generation in a robot's voice, and it misses all the important details of audio interactions.
I used ->
* Llama 3 (on Groq)
* Web Speech API (SpeechRecognition, for speech-to-text)
* Deepgram (TTS)
Each individual system is comprehensive and reasonably mature, but glue them all together and you get the proverbial lipstick on a pig: the result has no real understanding of the nuances of audio interactions.
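
To make the complaint concrete, here is a minimal sketch of the audio -> text -> LLM -> text -> audio loop described above. It does not call Groq, Deepgram, or the browser speech API; `AudioClip`, `speech_to_text`, `text_to_speech`, `voice_turn`, and `echo_llm` are all hypothetical stand-ins I made up for illustration. The only point is that a plain string is all that crosses each boundary, so anything carried by tone or timing is gone before the model ever sees it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AudioClip:
    """Hypothetical stand-in for recorded speech: words plus the cues text drops."""
    words: str
    tone: str = "neutral"   # e.g. "sincere", "sarcastic"
    pause_ms: int = 0       # hesitation before speaking


def speech_to_text(clip: AudioClip) -> str:
    # Stand-in for the STT stage: only the words survive transcription.
    return clip.words


def text_to_speech(text: str) -> AudioClip:
    # Stand-in for the TTS stage: tone and timing come from the synthesizer
    # defaults, not from anything the user did.
    return AudioClip(words=text)


def voice_turn(clip: AudioClip, llm: Callable[[str], str]) -> AudioClip:
    """One conversational turn: audio in, audio out, plain text in between."""
    transcript = speech_to_text(clip)   # tone and pauses are discarded here
    reply = llm(transcript)             # the model only ever sees a string
    return text_to_speech(reply)


def echo_llm(prompt: str) -> str:
    # Placeholder for the language model call (an assumption, not a real API).
    return f"You said: {prompt}"


sincere = AudioClip("great job", tone="sincere")
sarcastic = AudioClip("great job", tone="sarcastic", pause_ms=1200)

# Both inputs collapse to the same transcript, so the replies are identical.
print(voice_turn(sincere, echo_llm) == voice_turn(sarcastic, echo_llm))  # True
```

Swapping the stubs for real STT, LLM, and TTS calls would change the quality of each stage, but not the shape of the pipeline: as long as the middle hop is text, the sarcastic and sincere "great job" look the same to the model.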