The Magical Interaction Between ChatGPT and TTS
A Brand New Voice Assistant
Back when ChatGPT first launched at the end of last year, I managed to get access to an account. The user experience at the time wasn't great, though: ChatGPT would cut off midway through long responses, and I frequently ran into "too many requests" errors that denied me service. So after just a few days of experimentation the novelty wore off and I set it aside.
Who could have expected that a couple of months later, the ChatGPT concept would suddenly blow up again in China, with more and more use cases being discovered. At the same time, the user experience itself improved considerably: no more cut-offs during long responses, far fewer denials of service, and new features like conversation history. At that point I started consciously using ChatGPT to handle some daily tasks - writing PPTs, general knowledge Q&A, and even restructuring code. Although the restructured code still contained some errors, it did give me some inspiration.
Back then I had an idea - ChatGPT is great at everything, except for one thing: it can only communicate via typing (Ahem, if it could communicate via voice, wouldn’t interviews and such all become easy?). Unfortunately ChatGPT didn’t offer a public API at first, so to use it this way I’d probably have had to drive it with a headless browser, which would mean cracking Cloudflare’s CAPTCHA - too much work…
Finally, in March, OpenAI released the ChatGPT API along with the Whisper API for its corresponding speech-to-text model. My earlier idea of conversing with ChatGPT by voice now seemed simple enough to implement, but I was feeling a bit lazy and lacked the motivation to work on the project. A few days ago, my girlfriend forwarded me a video showcasing voice interactions with ChatGPT. She thought it was very interesting and asked if I could make one for her. This aligned perfectly with what I had been thinking about, so I started researching the idea.
Module Breakdown
Clearly, to converse with ChatGPT via voice, the overall functionality can be split into 3 modules:
Speech recognition module: This part converts the user’s speech into corresponding text, which serves as ChatGPT’s input.
ChatGPT module: This part sends the text to ChatGPT and saves its responses.
Speech synthesis module: This part converts ChatGPT’s responses into speech and plays it back.
Below is a diagram of the design:
Next I’ll go through the design and implementation of each of these three modules.
Speech Recognition Module
This was the module that took me the longest time. At first I didn’t plan on using the Whisper API (A lifetime freeloader is always looking for something free), but instead wanted to try offline solutions, mainly because I was concerned about the network overhead of using an API - it would slow down speech recognition and degrade the user experience.
I tried two offline speech recognition options - PocketSphinx and DeepSpeech - but neither worked that well. PocketSphinx’s accuracy was terrible, while DeepSpeech’s was decent but far too slow, and either offline solution would require me to maintain the models myself. With my half-baked machine learning skills, I was too lazy to deal with model management.
So after going in circles I ended up back at speech recognition as a cloud service. After quite a bit of searching, I found an open source Python library called SpeechRecognition that integrates most major speech recognition services, although most of them are paid and require an API key. Luckily, Google’s offering can still be used for free.
So my speech recognition module is basically just a wrapper around SpeechRecognition with some custom operations. I can tweak it later if any pain points come up during usage.
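For reference, here is a minimal sketch of what such a wrapper can look like. The function name, language setting, and ambient-noise step are illustrative rather than the repository's actual code, and sr.Microphone additionally requires PyAudio to be installed:

```python
import speech_recognition as sr

recognizer = sr.Recognizer()

def listen_once(language: str = "zh-CN") -> str:
    """Record one utterance from the microphone and return the recognized text."""
    with sr.Microphone() as source:  # needs PyAudio installed
        # Briefly sample background noise so the energy threshold adapts to the room
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        audio = recognizer.listen(source)
    # Google's free web recognizer; raises sr.UnknownValueError if nothing is understood
    return recognizer.recognize_google(audio, language=language)
```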
ChatGPT Module
This module took the least time to complete and there’s not much to say. I simply called the OpenAI API. However, I did research the API parameters a bit and made two adjustments compared to the default:
temperature: According to OpenAI, lower values result in more consistent responses. For example, with the same input question, higher temperature leads to more diverse responses.
messages: I made two more granular tweaks here:
I prepend a role=system directive before all conversations. This makes ChatGPT’s responses follow the system directive. My current directive is “Answer in concise language”, which forces brief responses instead of rambling. The directive is configurable in the settings file.
I keep only the last 3 conversation turns by default. Unlike the web interface, the ChatGPT API does not maintain context between requests; to tie a conversation together, previous turns must be included in each request. Without a limit on history length, token consumption would quickly get out of hand. (A request sketch follows below.)
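Putting the two tweaks together, a request ends up looking roughly like the sketch below. It is written against the pre-1.0 openai package that was current at the time; the helper name, the exact temperature value, and the history bookkeeping are illustrative, not the repository's actual code:

```python
import openai  # pre-1.0 openai package; reads the OPENAI_API_KEY environment variable by default

SYSTEM_DIRECTIVE = "Answer in concise language"  # illustrative; configurable in the settings file
MAX_HISTORY_TURNS = 3                            # keep only the last few turns to limit token usage

history = []  # alternating {"role": "user"/"assistant", "content": ...} entries

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    # system directive first, then only the most recent turns (user + assistant messages)
    messages = [{"role": "system", "content": SYSTEM_DIRECTIVE}] + history[-MAX_HISTORY_TURNS * 2:]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0.5,  # example value; lower values give more consistent answers
    )
    answer = response["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": answer})
    return answer
```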
My current approach is to leave the settings at their defaults for casual chatting. For special conversations - say, using it as an encyclopedia for questions - I lower the history length and set the system directive to “Answer in as much detail as possible”. For English conversation practice, I set the directive to “Act as an English teacher, point out grammatical errors and ask questions based on context”. For mock interviews, I set it to “Act as an interviewer for the xx role and conduct an interview; respond and ask questions concisely.”
Speech Synthesis Module
This module also took quite some effort on my part. Again I wanted to find an offline solution to save on network overhead and make conversations more fluid. I did find a corresponding offline option - pyttsx3, which uses the OS’s built-in TTS engine: SAPI5 for Windows, NSSpeechSynthesizer for Mac, and eSpeak for other platforms.
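For the curious, using pyttsx3 takes only a few lines; the rate value here is just an example:

```python
import pyttsx3

engine = pyttsx3.init()          # picks SAPI5 / NSSpeechSynthesizer / eSpeak depending on the OS
engine.setProperty("rate", 180)  # speaking rate in words per minute
engine.say("Hello, I am your voice assistant.")
engine.runAndWait()              # blocks until playback finishes, but only reliably on the main thread
```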
Although this frees me from external dependencies, pyttsx3 has a severe bug - it does not support multithreading: speech calls made off the main thread don’t block, so utterances overlap. This bug was reported 5 years ago and remains unresolved.
With no choice, I abandoned the offline option and went with mooching off the Azure and Google APIs, via the edge-tts and gTTS libraries.
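Either library boils down to "synthesize to an audio file, then play it". A minimal sketch with gTTS might look like this; the playsound playback step is my illustrative choice here, not necessarily what the tool uses, and edge-tts works similarly with an async API:

```python
from gtts import gTTS
from playsound import playsound  # playback library chosen purely for illustration

def speak(text: str, lang: str = "zh-CN") -> None:
    """Synthesize `text` with Google's TTS service and play the resulting MP3."""
    gTTS(text, lang=lang).save("reply.mp3")
    playsound("reply.mp3")
```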
One optimization worth mentioning is decoupling the ChatGPT interaction and speech synthesis into separate threads communicating via a queue. With linear execution, waiting for the full ChatGPT response before speaking led to long delays for longer responses. Now, each ChatGPT response sentence gets queued immediately, while the speech thread continuously dequeues data to start speaking.
This improves the user experience tremendously.
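Here is a stripped-down sketch of this producer-consumer pattern, with stand-in callables where the real ChatGPT and speech-synthesis functions from the previous sections would go:

```python
import queue
import threading

sentences = queue.Queue()

def chatgpt_side(reply_sentences) -> None:
    """Producer: enqueue each sentence of the reply as soon as it is ready."""
    for sentence in reply_sentences:
        sentences.put(sentence)
    sentences.put(None)  # sentinel marking the end of the reply

def tts_side(speak) -> None:
    """Consumer: keep speaking whatever has been queued so far."""
    while (sentence := sentences.get()) is not None:
        speak(sentence)

# Example wiring with stand-in data and print() instead of a real TTS call.
demo_reply = ["Sure.", "The three crises were caused by irrational numbers,",
              "infinitesimals,", "and Russell's paradox."]
threading.Thread(target=chatgpt_side, args=(demo_reply,), daemon=True).start()
tts_side(print)
```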
Links
Demo video (Chinese): https://www.bilibili.com/video/BV1rY411z7tA/
Code repository: https://github.com/WincerChan/talkgpt
Some Thoughts
One day in a high school class, my math teacher told us about the three crises of mathematics throughout history (I forget how the topic came up): the first caused by irrational numbers, the second by infinitesimals, and the third by Russell’s paradox. After finishing the story-like overview, he asked us a question: Do you think there could be a fourth crisis of mathematics? We immediately started a lively discussion, saying things like “Shouldn’t happen, right?” or “What would it even be about if there were another one?”. He quietly watched us discuss from the podium before saying, “Seems like you all think there won’t be another mathematical crisis.”
Seeing no objections, he continued, “I see it differently. I believe there will definitely be a fourth crisis, a fifth, and so on. As mathematicians, we can’t always judge by common sense. Before each of the previous crises, contemporaries also didn’t believe a crisis would happen, yet the crises came anyway. So I assert firmly: as long as human civilization keeps developing, there will definitely be more mathematical crises.” (It’s been almost 10 years, so I can’t recall his exact phrasing, but this captures the gist.) At the time I didn’t really understand, but I still felt the immense impact on my undeveloped mind. Looking back now, I think I can appreciate his perspective.
As technology professionals, we are the first to experience waves like ChatGPT. Whether to ignore or utilize these waves, I feel I’ve made my choice while working on this little tool.
Conclusion
After spending a few days’ spare time, I’ve got a rough version of this tool working. There’s still lots of polish needed, but the foundation is there.
Lately I’ve been rather restless and haven’t seriously studied technology. Finishing this project feels quite good. I will slowly rekindle my passion for technology.