Using OpenCV and NLTK to Make Speech Synthesis Animation
The project of this article is hosted at http://github.com/kazenotenshi/NLPAnimation.
All the mouth images were found at http://www.garycmartin.com/mouth_shapes.html.
The Beginning
Hello, and if you’re back after a long time, I’m sorry for the long delay. After I came back from Japan I’ve been working a lot and had little time to keep updating this blog. Today, I’ll discuss and show some code from a prototype (nothing really fancy =) ) of an exercise in doing speech synthesis with animation from text files.
Let’s talk! The main idea is to transform text files into animations with mouth visualization and speech synthesis. To do so, the goal is to pre-process the file, translate it into phonemes, and then use those phonemes to generate a synced speech synthesis and image animation. Let’s take a look at how to do it.
Coding
The main files of the project are phonemes.py and imageprototype.py. The other ones were just small tests I ran during development to verify some assumptions.
The file phonemes.py is more like a “.h” file: it stores the image library and the mapping from phonemes to those images, which is really useful during the pre-processing phase. Each word will be translated into a list of phonemes (e.g., NATURAL = [N, AE1, CH, ER0, AH0, L]), and each phoneme in that global list has an image associated with it. Refer to the NLTK documentation (here) for how these phonemes work. Along with the phonemes, NLTK provides stress information for every word (the digits on the vowels), which will be useful later. The file phonemes.py has two main structures: mouths and PhonemesToMouth.
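A trimmed sketch of how these two structures might look; the mouth-shape names and image filenames here are illustrative, and the full mapping lives in the repository:

```python
# phonemes.py -- mouth image library and phoneme-to-mouth mapping
import cv2

# One image per mouth shape (filenames are illustrative)
mouths = {
    'closed': cv2.imread('images/closed.png'),
    'open':   cv2.imread('images/open.png'),
    'wide':   cv2.imread('images/wide.png'),
    'round':  cv2.imread('images/round.png'),
    'teeth':  cv2.imread('images/teeth.png'),
}

# Map each ARPAbet phoneme (stress digit stripped) to a mouth shape
PhonemesToMouth = {
    'AA': 'open',   'AE': 'wide',   'AH': 'open',  'AO': 'round',
    'B':  'closed', 'CH': 'teeth',  'ER': 'round', 'F':  'teeth',
    'L':  'teeth',  'M':  'closed', 'N':  'teeth', 'P':  'closed',
    'R':  'round',  'S':  'teeth',  'T':  'teeth', 'UW': 'round',
    # ... the remaining phonemes follow the same pattern
}
```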
Now, let’s move to the next part. A sample text lives in the sample.txt file. The first step is to translate this file into a [word, phonemes] list. To do that, the program has to open the file, split the string into words using a tokenizer, and look up each word in the phoneme dictionary provided by NLTK.
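A sketch of this step, assuming the flat list returned by cmudict.entries() (which matches the slow lookup described in the disclaimer below):

```python
import nltk
from nltk.corpus import cmudict

# Read the sample text and split it into lowercase words
with open('sample.txt') as f:
    words = nltk.word_tokenize(f.read().lower())

# cmudict.entries() is a flat list of (word, phonemes) pairs,
# so every lookup is a linear scan -- simple, but slow
entries = cmudict.entries()
TextToSpeech = []
for word in words:
    for entry, phonemes in entries:
        if entry == word:
            TextToSpeech.append([word, phonemes])
            break
```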
A disclaimer: this is far from the most optimized way to find words in a big list; right now it is really slow. For proof-of-concept purposes it just works. Still, I have some ideas to improve this part; I’ll try some things and probably post the results here later.
The TextToSpeech variable will hold something like:
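For example, for a file containing the phrase “natural language”, it would be:

```python
[['natural',  ['N', 'AE1', 'CH', 'ER0', 'AH0', 'L']],
 ['language', ['L', 'AE1', 'NG', 'G', 'W', 'AH0', 'JH']]]
```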
Now, for every phoneme of every word, the phoneme duration has to be inferred, and the proper image has to be shown during the speech. Here comes a harder part. I researched many different libraries for the speaking part, and none of them provided a truly reliable callback to know the timing of each word. The one I decided to use was OS X’s NSSpeechSynthesizer, through the pyttsx library. Pyttsx is a Python binding for speech synthesis across different operating systems.
The hard-coded part of this tutorial is inferring which phonemes are normal, long, or short. For each word, the program iterates over its list of phonemes and assigns a duration to each one before the speech and visual processing. If a phoneme carries the stress digit 1, it is a long phoneme; if it carries a 0, it is a short one.
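A sketch of this heuristic; the durations here are hand-tuned guesses, not values from the original code:

```python
# Guessed durations in seconds for each phoneme class
LONG_TIME   = 0.25   # vowels with primary stress (digit 1)
SHORT_TIME  = 0.08   # unstressed vowels (digit 0)
NORMAL_TIME = 0.15   # everything else (mostly consonants)

timed_words = []
for word, phonemes in TextToSpeech:
    phoneme_times = []
    for p in phonemes:
        if p.endswith('1'):
            phoneme_times.append((p, LONG_TIME))
        elif p.endswith('0'):
            phoneme_times.append((p, SHORT_TIME))
        else:
            phoneme_times.append((p, NORMAL_TIME))
    # The word time is just the sum of its phoneme times
    word_time = sum(t for _, t in phoneme_times)
    timed_words.append((word, word_time, phoneme_times))
```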
Why do I hold both the phoneme and the word times? Because the phoneme times drive the mouths independently, while the word time is used by the speech synthesis.
During speech, the main problem is that the call blocks the program until all the speaking is done. Since the idea is to speak and animate at the same time, a multi-threaded approach is needed. Each time I need to speak a word, I call a function on another thread to avoid interruptions and add a wait time, so the speech is not cut off in the middle.
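A minimal sketch of that, assuming the standard pyttsx API:

```python
import threading
import pyttsx

engine = pyttsx.init()

def speak(word):
    # runAndWait() blocks, which is exactly why this runs on its
    # own thread while the main thread keeps animating the mouth
    engine.say(word)
    engine.runAndWait()

def speak_word_async(word):
    t = threading.Thread(target=speak, args=(word,))
    t.start()
    return t
```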
Finally, the program uses OpenCV to show the mouth images, driven by the same timing information collected before and the dictionary that maps phonemes to mouths.
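A sketch of the display loop, reusing the structures from the snippets above (note that the stress digit has to be stripped before the dictionary lookup):

```python
import cv2
from phonemes import mouths, PhonemesToMouth

for word, word_time, phoneme_times in timed_words:
    speak_word_async(word)              # start speaking in the background
    for p, t in phoneme_times:
        shape = PhonemesToMouth[p.rstrip('012')]   # 'AE1' -> 'AE'
        cv2.imshow('mouth', mouths[shape])
        cv2.waitKey(int(t * 1000))      # hold the frame for the phoneme time

cv2.destroyAllWindows()
```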
The complete imageprototype.py, which ties all of these pieces together, is in the repository linked at the top of this post.
Discussion
Although it is pretty easy to do this kind of coding using existing Python bindings, it is somewhat inefficient at first. I fed it a small paragraph to pre-process and it took about 2 minutes. The main reason is the phoneme-matching part: since the library just gives you the whole dictionary, it is really painful to iterate over thousands of entries to find the word you want. Some improvements can be made to this part of the code.
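For instance, one improvement I have in mind (not in the prototype yet) is to use cmudict.dict() instead of the flat entries() list, turning each lookup into a hash access:

```python
from nltk.corpus import cmudict

# word -> list of pronunciations; each lookup is O(1) instead of
# a linear scan over the thousands of entries in cmudict.entries()
pron = cmudict.dict()
TextToSpeech = [[w, pron[w][0]] for w in words if w in pron]
```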
Also, none of the speech libraries I have looked at provides a real API for easy syncing. In the example I had to guess the phoneme times and use them to drive the mouth visualization. Sometimes I get awful animations, totally out of sync. I haven’t had time to investigate these libraries deeply, but it seems that if I want this level of detail, I’ll have to write my own. That said, NLTK is a great way to turn words into phonemes, and OpenCV is really useful for dealing with images.
Thanks for reading, and I appreciate any feedback. See ya.