• Neural Speech – Neural Nets Will Soon Be Able To Do All The Talking For You

    Imagine how annoying it would be if you suddenly found yourself unable to speak. You know exactly what you want to say, when and in which tone of voice, but you just can’t get it out. Your lips, jaw, tongue and voice box just won’t do as they’re told. They no longer obey the instinctively governed instructions that usually move the 100+ muscles that must be carefully coordinated by the brain to produce effortless speech at the drop of a hat.

    Imagine being speechless

    Sadly, what is to most of us just a worrying thought experiment actually happens to millions of people around the world. Those robbed of the ability to speak – such as people living with advanced Amyotrophic Lateral Sclerosis (as Stephen Hawking did for decades before his death in 2018), late-stage Parkinson’s, strokes affecting the brain stem, or throat cancer – may one day, in the not too distant future, be able to have a sheet of electrodes surgically attached to the surface of their brain and produce speech via a brilliant invention by University of California San Francisco scientists.

    People with ALS, like Stephen Hawking (RIP), may one day be able to speak via neural speech

    Publishing their innovative study in the prestigious journal Nature, the researchers described how they managed to pull off the incredible feat of developing a computer algorithm able to produce intelligible speech sounds on the basis of brainwaves recorded from the surface of the brains of five brave volunteers. All were patients who had been fitted with a sheet of electrodes so that their neurosurgeon could work out where their epileptic seizures were coming from. As they had to stay in hospital for a few days anyway (waiting for a seizure to happen so that its epicentre could be pinpointed and ultimately removed), such patients are often happy to help out when neuroscientists have cool experiments to participate in.

    Electrocorticography (ECoG) allows neural activity to be measured at the surface of the brain. This electrode grid sits over the somatosensory (S) and motor (M) cortices

    In this case the researchers were recording from three different brain areas while each volunteer read out hundreds of words and standardised sentences. One of the brain areas they were interested in getting data from was the inferior frontal gyrus, in which “Broca’s Area” is found. Thanks to an intrepid French physician, Paul Broca, we’ve known since 1861 that this area is extremely important when it comes to producing speech sounds. He had a patient who was only able to say one word – “Tan” – over and over again. Once the patient died, Broca took a look at his brain and noticed clear and obvious damage to the lower part of the left frontal cortex. Broca deduced that this area must be instrumental in speech production.

    Pierre-Paul Broca

    Another area, the superior temporal gyrus, houses the cortical real estate in which auditory information arriving through the ear is processed to generate what we hear. This information is important in speech production as it enables us to modulate how we speak on the fly, on the basis of what we hear ourselves saying as we say it. Last but not least, they also monitored what the ventral sensorimotor cortex was up to as the volunteers spoke the set phrases. The ventral (lower) part of the motor cortex contains all the neurons that connect with the muscles that control movements of the mouth, tongue, jaw and vocal cords. The ventral part of the sensory cortex contains neurons that detect movements in these same body parts. This is important for speech because it provides feedback on how all those articulatory movements (position of tongue, mouth, jaw etc.) are faring, so that the motor cortex signals can be tweaked accordingly from moment to moment to ensure the correct sounds are produced.

    The superior temporal gyrus (auditory cortex)

    “Articulatory kinematics” refers to the movements of the tongue, jaw, voice box and mouth that enable us to produce the word sounds (“speech acoustics”) we are aiming for. Anumanchipalli, Chartier and Chang figured that the best way to generate the sound of speech from measurements of brain waves was to go via these articulatory kinematics. Millions of man-hours had already gone into finessing a model that could take the various acoustic features that make up speech sounds (like pitch, voicing, glottal excitation and something called mel-frequency cepstral coefficients) and turn them into a sound file that can be played through a speaker. All they then had to do was work out from the brain data what the intended articulatory kinematics were (i.e. should the jaw be open or closed, should the tongue be at the top or bottom of the mouth, how pursed are the lips, etc.), decode the intended acoustic features from those, and then feed all of the resulting glottal, cepstral, pitch and voicing variables into the pre-existing human speech sound generator. Easy? Far from it!
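    For the technically curious, here is a rough Python sketch of that two-stage pipeline in outline – brain recordings decoded into articulatory kinematics, kinematics decoded into acoustic features, and only then a waveform synthesised. The function names, feature dimensions and stand-in linear “models” are purely illustrative assumptions, not the authors’ actual code.

```python
# A rough outline of the decoding pipeline described above.
# All names, shapes and numbers are placeholder assumptions for illustration.
import numpy as np

rng = np.random.default_rng(42)

def decode_kinematics(ecog: np.ndarray) -> np.ndarray:
    """Stage 1 (stand-in for a trained neural network): map brain activity
    (time x electrodes) to articulatory kinematics (jaw, tongue, lips etc.)."""
    n_articulators = 12                                  # illustrative guess
    weights = rng.normal(scale=0.01, size=(ecog.shape[1], n_articulators))
    return ecog @ weights

def decode_acoustics(kinematics: np.ndarray) -> np.ndarray:
    """Stage 2 (stand-in): map kinematics to acoustic features – pitch,
    voicing, glottal excitation and mel-frequency cepstral coefficients."""
    n_acoustic_features = 32                             # illustrative guess
    weights = rng.normal(scale=0.01, size=(kinematics.shape[1], n_acoustic_features))
    return kinematics @ weights

def synthesise(acoustics: np.ndarray) -> np.ndarray:
    """Final step: a pre-existing speech synthesiser (vocoder) would turn the
    acoustic features into an audio waveform. Here we just return silence."""
    samples_per_frame = 80                               # illustrative guess
    return np.zeros(len(acoustics) * samples_per_frame)

ecog = rng.normal(size=(500, 256))     # fake recording: 500 time steps, 256 electrodes
waveform = synthesise(decode_acoustics(decode_kinematics(ecog)))
```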

    The voice box

    The type of machine learning algorithm they used to make sense of all that brain data was a bidirectional long short-term memory (bLSTM) neural network (in case you want to amaze your friends). The first one (there were two!) was trained to work out which patterns in the avalanche of brain data were actually related to the various articulatory movements associated with each word. And here’s the bit that amazes me the most: they didn’t actually have the facilities to measure the actual movements of the jaw, tongue, lips etc – they just estimated these variables, using statistical methods, from a recording of what the person was actually saying at the time. Given the computer science motto “garbage in, garbage out”, it seems nothing short of a miracle that they could train up a machine learning algorithm on an estimate of the articulatory kinematics rather than the actual articulatory kinematics. The second network then decoded the intended pitch, mel-frequency cepstral coefficient, glottal excitation and voicing stuff from the output of the first one. And those data were used to create the final sounds. Phew!
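    If you are curious what a bidirectional LSTM decoder actually looks like in code, here is a minimal PyTorch sketch showing how two such networks can be chained back to back. The layer sizes and feature dimensions are illustrative guesses, not the values used in the study.

```python
# A minimal PyTorch sketch of a bidirectional LSTM (bLSTM) sequence decoder of
# the kind described above. All sizes are illustrative assumptions only.
import torch
import torch.nn as nn

class BiLSTMDecoder(nn.Module):
    """Maps one sequence to another, e.g. brain features -> articulatory
    kinematics (network 1) or kinematics -> acoustic features (network 2)."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)   # 2x because bidirectional

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)      # (batch, time, 2 * hidden)
        return self.proj(out)      # (batch, time, out_dim)

# The two networks chained back to back, mirroring the study's two-stage design:
stage1 = BiLSTMDecoder(in_dim=256, out_dim=12)   # brain data -> kinematics (dims assumed)
stage2 = BiLSTMDecoder(in_dim=12, out_dim=32)    # kinematics -> acoustic features (dims assumed)

fake_ecog = torch.randn(1, 500, 256)             # 1 sentence, 500 time steps, 256 channels
acoustic_features = stage2(stage1(fake_ecog))    # these would then drive the synthesiser
```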

    A long short-term memory neural network looks like this

    The results weren’t perfect, but they were good enough to cause a huge ripple of excitement across the globe. Listeners were either required to transcribe whole sentences of neural speech (audio files generated by the neural networks), or to identify individual words that had been snipped out of the full sentences. To help them, they did this in reference to a pool of either 10, 25 or 50 possible words, which helps to reduce the ambiguity in a manner similar to the strategy employed by carers of people with serious speech impairments. Their ability to identify what was being said improved both when choosing from a smaller pool of words and when any given word had more syllables rather than fewer.
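    To get a feel for why a smaller pool of candidate words makes the task easier, here is a toy Python illustration (nothing to do with the study’s actual evaluation): the listener effectively picks whichever candidate sounds closest to what they heard, and the fewer confusable neighbours in the pool, the fewer chances to pick the wrong one. The feature vectors below are random stand-ins for real acoustics.

```python
# Toy closed-set word identification: pick the candidate in the pool whose
# (made-up) acoustic feature vector is nearest to the degraded token heard.
import numpy as np

rng = np.random.default_rng(0)

def identify(heard: np.ndarray, pool: dict) -> str:
    """Return the candidate word whose feature vector is nearest to the heard token."""
    return min(pool, key=lambda word: np.linalg.norm(heard - pool[word]))

vocab = {word: rng.normal(size=16)
         for word in ["above", "below", "water", "doctor", "morning",
                      "something", "anything", "understand", "tomorrow", "beautiful"]}

noisy_token = vocab["above"] + rng.normal(scale=0.5, size=16)  # simulated degraded audio

print(identify(noisy_token, vocab))                          # identified against a pool of 10
print(identify(noisy_token, dict(list(vocab.items())[:3])))  # same token, pool of only 3
```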

    This sound wave is the word “above”

    To rule out the possibility that using recordings of each patient’s voice to train up the machine learning algorithm had contaminated their results, they also got the volunteers to mouth the words without actually saying them out loud. The neural speech generated from the brain data collected while they were miming the words rather than speaking them was also pretty good. It wasn’t as good, but that’s likely because when a person merely mouths the words, their brain is not sending messages to the diaphragm and intercostal muscles to drive air up through the vocal cords – clearly an important part of the overall speech signal – so without it the quality of the speech was understandably degraded. Degraded, but often intelligible nonetheless, which for most people robbed of their faculty of speech is a better option than the alternatives. While natural speech flows at around 150 words per minute, the currently available alternatives crawl along at a measly 10 or so words per minute.

    The inferior frontal gyrus (home of Broca’s Area)

    While this study provided evidence that neural speech algorithms can be trained up on data from one set of brains and then used to help generate speech from neural impulses gathered from a different brain altogether, the best results were achieved when the neural networks were trained on data from the same brain that later generated the neural speech. This means that the most likely use, in the early years at least, will be for people who retain the power of speech for many months post-diagnosis – providing the opportunity to implant the electrode sheet and train up the individual’s personalised machine learning algorithm – so that, once the disease progresses to the point where they can no longer speak, the neural speech system will be ready and waiting to lend a helping hand.

    The tongue is a wonderful thing

    While such a scenario may be many years in the future (as there is much work to be done before the regulatory bodies are likely to deem such an approach safe and suitably effective, given the unavoidable risks of brain surgery), it is nonetheless a very exciting step in the right direction. And to many speechless people, the prospect of being able to speak fluently at a more or less natural conversational pace could be absolutely life-changing.

    In addition to these monthly brain blogs I also tweet regularly (@drjacklewis) and am getting very very close to launching my new YouTube channel VIRTUAL VIVE SANITY, where I will be sharing my neuroscience-informed tips and tricks on how to get the most out of virtual reality equipment in the home, reviewing games and eventually creating tutorials to help anyone create their own VR experiences entirely for free!

