Can you tell the difference between AI-generated computer speech and a real, live human being? Maybe you’ve always thought you could. Maybe you’re fond of Alexa and Siri but believe you would never confuse either of them with an actual woman.
Things are about to get a lot more interesting. Google engineers have been hard at work creating a text-to-speech system called Tacotron 2. According to a paper they published this month, the system first creates a spectrogram of the text, a visual representation of how the speech should sound. That image is then fed into Google’s existing WaveNet algorithm, which uses it to produce extremely natural-sounding human speech.
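The two-stage flow described above can be sketched in code. This is only an illustration of the data flow, not the real system: the actual stages are large neural networks, and the function names, the frames-per-character ratio, and the vocoder upsampling factor here are all invented stand-ins. The 80 mel channels figure matches the published paper; everything else is assumed for demonstration.

```python
import random

N_MELS = 80              # mel-spectrogram channels, as in the paper
FRAMES_PER_CHAR = 5      # made-up ratio of spectrogram frames to input characters
SAMPLES_PER_FRAME = 256  # made-up vocoder upsampling factor

def text_to_spectrogram(text):
    """Stand-in for the Tacotron 2 sequence-to-sequence network:
    predicts a (frames x N_MELS) mel spectrogram from input text.
    Here we just fill it with random values to show the shape."""
    frames = len(text) * FRAMES_PER_CHAR
    return [[random.random() for _ in range(N_MELS)] for _ in range(frames)]

def spectrogram_to_waveform(spectrogram):
    """Stand-in for the WaveNet vocoder: produces one stretch of
    audio samples conditioned on each spectrogram frame."""
    n_samples = len(spectrogram) * SAMPLES_PER_FRAME
    return [random.uniform(-1.0, 1.0) for _ in range(n_samples)]

text = "That girl did a video about Star Wars lipstick."
spectrogram = text_to_spectrogram(text)   # stage 1: text -> spectrogram
audio = spectrogram_to_waveform(spectrogram)  # stage 2: spectrogram -> audio
print(len(spectrogram), len(audio))
```

The point of the two-stage design is separation of concerns: the first network only has to decide *what* the speech should sound like (the spectrogram), while the vocoder handles the much harder job of producing raw audio samples.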
Using this method, the researchers report, “Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech.” (A mean opinion score is a standard telecommunications measure of perceived audio quality: human listeners rate samples on a scale, and their scores are averaged.)
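The metric itself is simple arithmetic: listeners typically rate each clip on a 1-to-5 naturalness scale (5 being indistinguishable from a real recording), and the MOS is the average of those ratings. A minimal sketch, with invented ratings:

```python
# Each listener rates a speech sample from 1 (bad) to 5 (completely natural).
# These ratings are invented for illustration only.
ratings = [5, 4, 5, 4, 5, 4, 5, 4]

# The mean opinion score is simply the arithmetic mean of the ratings.
mos = sum(ratings) / len(ratings)
print(f"MOS: {mos:.2f}")  # → MOS: 4.50
```

With this scale in mind, the reported gap between 4.53 (Tacotron 2) and 4.58 (professional recordings) is strikingly small.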
As Google’s audio samples demonstrate, Tacotron 2 can detect from context the difference between the noun “desert” and the verb “desert,” as well as the noun “present” and the verb “present,” and alter its pronunciation accordingly. It can place emphasis on capitalized words and apply the proper inflection when asking a question rather than making a statement.
And it can generate speech that sounds so similar to a human voice that it’s difficult or impossible to tell the difference. If you want to see just how hard it is, go to Google’s audio samples page, and scroll down to the last set of samples, titled “Tacotron 2 or Human?” There you’ll find Tacotron 2 and a real person each saying sentences such as, “That girl did a video about Star Wars lipstick.”
SPOILER ALERT: To test yourself, listen to the samples and guess which is which before reading the rest of this column.