Visual Cues and Their Effect on Auditory Perception
A substantial amount of information about the environment and the world around us is encoded in sound. The human brain has the laborious task of decoding all this input and extracting information such as the direction of the sound source, the characteristics of the sound (volume and pitch), and the meaning embedded within the stimulus. Speech is one of the most complex auditory stimuli and a perfect example of just how much information can be encoded in sound. However, speech stimuli are often ambiguous, degraded or unclear, making hearing a complex task. When our attention is directed towards specific auditory stimuli, there is almost always a certain level of background noise that can interfere with our perception. In some cases, this interference may become so strong that we experience significant difficulty in perceiving the desired stimulus. However, this occurs much more rarely than would be predicted. Amazingly, the human brain manages to perceive and interpret auditory stimuli even when they are greatly degraded or masked by noise. This is reflected in our ability to hold a conversation in conditions with a lot of background noise such as a loud room or a noisy street.
How then do we manage to perceive and interpret speech in conditions where even sound detection should be difficult if not impossible? The answer lies in the fact that although we consider speech to be primarily an auditory experience, it is strongly influenced by the visual system (Ross, Saint-Amour, Leavitt, Javitt, & Foxe, 2007). This may seem like a novel idea to many people, but it has been around in the scientific literature for more than sixty years, with research dating back to the early 1950s. This is no surprise, as multisensory integration has always drawn much attention from researchers in many different disciplines.
Evidence supporting the relationship between visual stimuli and speech perception comes from two major classes of research: studies of the enhancement of speech perception due to lip movement (lip-reading or speech-reading) and studies of the enhancement of speech perception due to written text. Sohoglu, Peelle, Carlyon and Davis (2012) conducted a series of experiments using written text as the visual stimulus and vocoded speech as the auditory stimulus. Vocoded speech is commonly used in research. It is a form of degraded speech in which the spectral information (information based on frequency) is varied or removed from normal speech (Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995). Sohoglu et al. (2012) presented written text that either matched or did not match an accompanying vocoded speech stimulus. Participants were then asked to rate the clarity of the speech. The results showed that the presentation of a visual stimulus that matched the vocoded speech significantly increased the perceived clarity of the degraded auditory stimulus.
Many other studies have provided support for the enhancement of the clarity of degraded speech by the simultaneous presentation of written text (Frost, Repp, & Katz, 1988; Wild, Davis, & Johnsrude, 2012). Despite the apparent consensus in the literature, many different models have been proposed to explain why written text enhances speech perception. These models can be divided into two mutually exclusive groups: the top-down influence and the bottom-up influence models.
The bottom-up models are the earlier of the two. In these models, the written text is automatically and involuntarily converted into a phonetic representation by our brain (Frost, Repp, & Katz, 1988). This phonetic representation is then combined with the auditory signals received from our ears at a later, higher-level stage, when the listener has to make a decision regarding the clarity of the auditory signal. These models were based on the idea that written text stimuli amplified signal detection biases and not perceptual sensitivity (Sohoglu et al., 2014). In other words, written text modified participants’ decision processes but not their perceptual processes.
The second class of models, the top-down models, has been gaining support in modern research. The top-down models, as the name implies, are based on the notion that prior knowledge can affect lower-level processing and perception. In these models, the information provided by the written text modifies the way that we perceive the auditory stimuli (Sohoglu et al., 2014).
The two models are incompatible, as the bottom-up model leaves no room for the possibility of any top-down interactions (Frost, Repp, & Katz, 1988). We can therefore assume that only one of these models is accurate. Sohoglu et al. (2014) designed an experiment specifically to determine which of the two models was correct. The top-down model predicts that a visual stimulus would only enhance the perceptual clarity if it were presented before the onset of the auditory stimulus. Otherwise, it would be impossible for the enhancement to be due to a modification of perceptual processes, as the perceptual processes that are to be modified would have already occurred. The bottom-up model does not predict this constraint (Sohoglu et al., 2014). The results obtained by Sohoglu et al. (2014) showed that written text enhanced perceived clarity when the visual stimulus was presented before the auditory stimulus but not when the visual stimulus was presented a few milliseconds after the onset of the auditory stimulus. This study is relatively new and during my limited search of the scientific literature I found no attempts to replicate these results. However, if these results are shown to be experimentally replicable, we would have a strong reason to reject the bottom-up model. This is an area of continuing research and shows great potential for providing us with important insights into the human auditory system.
Experiments using written text as a visual stimulus have been very useful, but they are not the only form of experimental evidence supporting the link between vision and hearing. Much research has also been done regarding the link between speech perception and lip movements. Ross et al. (2007), for example, presented participants with monosyllabic speech stimuli at varying signal-to-noise ratios (SNRs). Some participants were also shown a visual display of a face mouthing the words (audiovisual group). This display was synchronized with the auditory stimulus. After each trial, participants had to report what word they heard, and their performance was measured in terms of percent correct (%C). As multisensory integration would predict, %C was significantly higher at each level of SNR in the audiovisual group. This study replicates the results of many previous studies that have obtained evidence for the enhancing capability of lip movement on auditory perception (Grant & Seitz, 2000; Sumby & Pollack, 1954).
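To make the idea of varying the SNR concrete, the following is a minimal Python sketch. It assumes the common convention of expressing SNR in decibels as ten times the base-10 logarithm of the ratio of signal power to noise power. The function names and the toy sample values are illustrative assumptions; they are not taken from Ross et al. (2007) or any other study cited here.

```python
import math

def mean_power(samples):
    """Average power of a signal: the mean of its squared samples."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(signal, noise):
    """SNR in decibels: 10 * log10(signal power / noise power)."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))

def scale_noise_to_snr(signal, noise, target_db):
    """Rescale the noise so that its SNR relative to the signal is target_db."""
    # Noise power required to reach the requested SNR.
    wanted_noise_power = mean_power(signal) / (10.0 ** (target_db / 10.0))
    gain = math.sqrt(wanted_noise_power / mean_power(noise))
    return [gain * n for n in noise]

# Toy waveforms (hypothetical values, for illustration only).
speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.1, -0.2, 0.2, -0.1]

# Make the noise stronger than the speech, then mix the two.
louder_noise = scale_noise_to_snr(speech, noise, target_db=-12.0)
mixture = [s + n for s, n in zip(speech, louder_noise)]
print(round(snr_db(speech, louder_noise), 2))
```

A negative SNR in decibels means that the noise carries more power than the speech, which is the kind of condition in which visual cues would be expected to matter most.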
The mechanism by which lip-reading enhances perception is not the same as the mechanism for written text; it is what scientists refer to as a multisensory integration model. It is important to note that the difference in %C (Δ%C) between the audiovisual and the audio-only conditions is not constant across SNR levels (Grant & Seitz, 2000; Ross et al., 2007; Sumby & Pollack, 1954). According to the classical literature, the variation of Δ%C across the different levels of SNR follows the rule of inverse effectiveness, which was first proposed by Meredith and Stein (1986). Inverse effectiveness is a rule that governs multisensory integration, and it states that as each of the single stimuli becomes less effective at eliciting a response, the magnitude of the multisensory response increases (Meredith & Stein, 1986). In our present example, this would imply that Δ%C should be inversely related to the SNR, growing as the SNR falls, and in fact many earlier studies have shown this to be true (Erber, 1969; Sumby & Pollack, 1954).
The claim that lip-reading and speech perception follow the law of inverse effectiveness has not gone uncontested. Ross et al. (2007) was one of the first studies to provide evidence against the theory of inverse effectiveness. In that study, Δ%C was found to be inversely related to SNR, but this relationship only held for the higher values of SNR. When the SNR was sufficiently low, Δ%C was also found to decrease. For intermediate values of SNR, Δ%C reached its maximum value. In summary, Δ%C increases as SNR decreases until Δ%C peaks in an intermediate range of SNR. After this peak, Δ%C begins to decrease (Ross et al., 2007).
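The pattern described above can be illustrated with a short Python sketch. The percent-correct values below are invented purely for illustration and are not data from Ross et al. (2007) or any other study; the function names are likewise hypothetical. The sketch computes Δ%C at each SNR as the audiovisual %C minus the audio-only %C and then reports the SNR at which that gain is largest.

```python
def visual_gain(audio_only, audiovisual):
    """Return the gain per SNR level: audiovisual %C minus audio-only %C."""
    return {snr: audiovisual[snr] - audio_only[snr] for snr in audio_only}

def peak_gain_snr(gain):
    """Return the SNR level at which the gain is largest."""
    return max(gain, key=gain.get)

# HYPOTHETICAL percent-correct scores keyed by SNR in dB (not real data).
audio_only = {0: 80.0, -4: 60.0, -8: 35.0, -12: 15.0, -16: 5.0, -20: 1.0}
audiovisual = {0: 92.0, -4: 82.0, -8: 68.0, -12: 58.0, -16: 30.0, -20: 12.0}

gain = visual_gain(audio_only, audiovisual)

# List the gain from the easiest (highest SNR) to the hardest condition.
for snr in sorted(gain, reverse=True):
    print(f"SNR {snr:>4} dB: gain = {gain[snr]:.1f}")
print("Largest gain at SNR", peak_gain_snr(gain), "dB")
```

Under strict inverse effectiveness the gain would keep growing as the SNR falls. In these made-up numbers it instead peaks at an intermediate level and then declines, which is the shape of the relationship that Ross et al. (2007) report.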
Ross et al. (2007) go on to explain why previous studies failed to produce the same results as their controversial study. Firstly, previously conducted studies did not explore SNR levels that were low enough to characterize the full extent of the relationship (Ross et al., 2007). In fact, the majority of those studies used the SNR at which Ross et al. (2007) recorded the greatest Δ%C as their lower boundary. Secondly, after reviewing the existing literature, Ross et al. (2007) claim that some of the studies draw unjustified conclusions from their data. Their arguments are summarized in the following quote:
"Taking these earlier behavioral studies together, we believe that a somewhat consistent pattern emerges. Some who have claimed that the greatest gains are to be found at the lowest SNR actually did not test sufficiently low SNRs to warrant such a contention, whereas others who made similar claims are often not supported by their own data." (Ross et al., 2007, p.1152)
In light of new controversial studies such as the one mentioned above, there is no consensus as to how exactly lip-reading aids in speech perception. However, the research in this field seems to agree that lip movements play a large role in our perception of speech (Erber, 1969; Ross et al., 2007; Sumby & Pollack, 1954). In some cases, it is even possible for auditory stimuli and lip movement cues to combine and form a third perceptual experience for which there is no stimulus present in the environment (McGurk & MacDonald, 1976). For example, if a short video clip of a face mouthing the syllables “ga-ga” is dubbed with the sound of a person saying the syllables “ba-ba”, we have the auditory experience of somebody saying “da-da” or perhaps even “tha-tha”. This is commonly referred to as the McGurk effect and it serves as a powerful illustration of the intimate relationship between vision and hearing.
With the amount of evidence gathered, it has been established that human speech perception is often modified by visual stimuli. Two well-studied examples of such visual stimuli are written text and the movement of the lips during speech. The enhancing effect that each of them has on speech perception has been the subject of scientific research for the past half-century. However, more recently, studies are challenging old ideas and forcing modern scientists to rethink and reformulate old models and theories. By studying the multisensory integration of vision and hearing, scientists have learned much about the human mind and how it works.
The implications of fully understanding the relationship between vision and hearing are manifold. Such an understanding could be used to help treat people with sensory impairments and to develop new technology for audiovisual industries. Perhaps most importantly, it can explain certain aspects of our daily lives, such as why we look at a person when they are talking, why wearing sunglasses sometimes makes speech harder to follow, and why we have no trouble conversing face-to-face with a person in a loud room but experience great difficulty holding a conversation over the phone in the presence of the slightest noise.
The field of multisensory integration in vision and hearing is a goldmine of scientific discovery. There have been great advancements since the subject first came to the attention of scientists. It is a field of ongoing research and is receiving a lot of scientific attention. Based on the amount and the quality of the research being done, many new and important discoveries are likely to arise in the near future. Our knowledge of our perceptual systems has increased substantially over the past few years, but there is still much more that we have yet to discover.
Erber, N. P. (1969). Interaction of audition and vision in the recognition of oral speech stimuli. Journal of Speech & Hearing Research, 12(2), 423-425. doi:10.1044/jshr.1202.423
Frost, R., Repp, B. H., & Katz, L. (1988). Can speech perception be influenced by simultaneous presentation of print? Journal of Memory and Language, 27, 741-755. doi:10.1016/0749-596X(88)90018-6
Grant, K. W., & Seitz, P. F. (2000). The use of visible speech cues for improving auditory detection of spoken sentences. The Journal of the Acoustical Society of America, 108, 1197-1208. doi:10.1121/1.1288668
McGurk, H., & MacDonald, J. W. (1976). Hearing lips and seeing voices. Nature, 264, 746-748. doi:10.1038/264746a0
Meredith, M. A., & Stein, B. E. (1986). Spatial factors determine the activity of multisensory neurons in cat superior colliculus. Brain Research, 365, 350-354. doi:10.1016/0006-8993(86)91648-3
Ross, L. A., Saint-Amour, D., Leavitt, V. M., Javitt, D. C., & Foxe, J. J. (2007). Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environments. Cerebral Cortex, 17, 1147-1153. doi:10.1093/cercor/bhl024
Shannon, R. V., Zeng, F-G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270, 303-304. doi:10.1126/science.270.5234.303
Sohoglu, E., Peelle, J. E., Carlyon, R. P., & Davis, M. H. (2012). Predictive top-down integration of prior knowledge during speech perception. The Journal of Neuroscience, 32, 8443–8453. doi:10.1523/JNEUROSCI.5069-11.2012
Sohoglu, E., Peelle, J. E., Carlyon, R. P., & Davis, M. H. (2014). Top-down influences of written text on perceived clarity of degraded speech. Journal of Experimental Psychology: Human Perception and Performance, 40(1), 186–199. doi:10.1037/a0033206
Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America, 26, 212-215. doi:10.1121/1.1907309
The McGurk Effect. (2001). Retrieved March 21, 2001, from http://auditoryneuroscience.com/McGurkEffect
Wild, C. J., Davis, M. H., & Johnsrude, I. S. (2012). Human auditory cortex is sensitive to the perceived clarity of speech. NeuroImage, 60, 1490-1502. doi:10.1016/j.neuroimage.2012.01.035