Real-time 3D Acoustics Rendering
Efficient Sound Auralization with Reflections and Reverberation
David Rosen, advised by Professor Carr Everbach
Abstract
It is common for real-time systems to render 3D sounds by simply panning the sound from left to right and adjusting the volume based on distance. This is a good start, but ignores many effects that contribute to our perception of sounds and the acoustic environment. First, we must take into account the time it takes for the sound to travel from the source to each receiver (e.g. the left and right ear). Second, ears are not omnidirectional microphones; sounds can be occluded by the head and by the ear itself, and different frequency bands can be affected in different ways by this occlusion. Finally, sound sources can be occluded by objects in the environment, and the sound can reflect off of surfaces, or reverberate within spaces. In this paper we describe an efficient way to simulate all of these phenomena in real time.
Distance Delay
Sound travels through the air quickly: about 340 m/s at sea level. However, unlike light, sound does not travel fast enough for us to treat it as instantaneous. If we do, we miss a number of important acoustic phenomena. First, and most importantly, our brain uses the difference in phase between the sound waves received by our two ears to determine the direction of the sound source. In my experiments, this was a much stronger cue than simple volume panning. If the left and right channels play at the same volume, but the left channel plays half a millisecond before the right, it gives a strong sense that the sound is coming from the left. Here is a video demonstrating sound directionality using nothing but the phase changes caused by the difference in distance from the sound source to the left and right ears. This effect is only perceptible using headphones. The camera is set up as shown below, with a virtual microphone on each side of it representing each ear, and the delay is calculated based on the distance from each microphone to the sound source.
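The per-ear delay calculation described above can be sketched in a few lines. This is only an illustration: the function names, the 0.2 m microphone spacing, and the source position are assumptions chosen for the example, not values from our system.

```python
import math

SPEED_OF_SOUND = 340.0  # m/s, the sea-level figure used in the text


def propagation_delay(source, mic, speed=SPEED_OF_SOUND):
    """Seconds the sound needs to travel from `source` to `mic` (3D points)."""
    return math.dist(source, mic) / speed


def ear_delays(source, left_mic, right_mic):
    """Delay for each virtual microphone, as a (left, right) pair in seconds."""
    return propagation_delay(source, left_mic), propagation_delay(source, right_mic)


# Hypothetical layout: microphones 0.2 m apart, source 2 m to the listener's left.
left_delay, right_delay = ear_delays(
    source=(-2.0, 0.0, 0.0),
    left_mic=(-0.1, 0.0, 0.0),
    right_mic=(0.1, 0.0, 0.0),
)
interaural_ms = (right_delay - left_delay) * 1000.0
print(f"left ear leads by {interaural_ms:.3f} ms")
```

With this spacing the left channel leads by a little over half a millisecond, which is the order of magnitude mentioned above.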
The Doppler effect is another important phenomenon caused by the finite speed of sound, and it automatically works given the system we described. Each sound is loaded into a buffer, and we read from it using a marker that moves from the beginning of the sound to the end at the rate of 1 second per second. However, the delay changes the position of the marker, and a changing delay will thus change the marker’s speed. For example, if the sound source is moving closer to the microphone, the delay will be decreasing, and the marker will move at a rate higher than 1 second per second. This results in the sound playing at a higher frequency. Conversely, if the sound is moving farther away, the delay will be increasing, and the marker will move more slowly, and play at a lower frequency. Here is another video demonstrating this effect in the simulation. The interpolation is not perfect yet, so there are occasional click artifacts as the marker moves from one part of the sound buffer to another.
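The sketch below illustrates how a changing delay changes the marker's speed. It assumes the delay is computed from the current source distance at each instant, and it uses simple linear interpolation between samples. The function names and the 34 m/s source speed are illustrative assumptions.

```python
SPEED_OF_SOUND = 340.0  # m/s


def read_position(t, distance, speed=SPEED_OF_SOUND):
    """Marker position in the sound buffer (seconds) at listener time t."""
    return t - distance / speed


def marker_rate(distance_fn, t, dt=1e-3):
    """Average marker speed over [t, t + dt], in buffer seconds per second."""
    p0 = read_position(t, distance_fn(t))
    p1 = read_position(t + dt, distance_fn(t + dt))
    return (p1 - p0) / dt


def sample_at(buffer, position_s, sample_rate):
    """Linearly interpolated sample at a fractional buffer position."""
    x = position_s * sample_rate
    if x < 0 or x >= len(buffer) - 1:
        return 0.0  # outside the recorded sound: silence
    i = int(x)
    frac = x - i
    return buffer[i] * (1.0 - frac) + buffer[i + 1] * frac


def approaching(t):
    return 100.0 - 34.0 * t  # hypothetical source closing at 34 m/s


def receding(t):
    return 100.0 + 34.0 * t  # hypothetical source moving away at 34 m/s


print(marker_rate(approaching, 1.0))  # greater than 1: pitch rises
print(marker_rate(receding, 1.0))     # less than 1: pitch falls
```

Under these assumptions the marker rate is 1 + v/c for an approaching source and 1 - v/c for a receding one, which agrees with the exact Doppler shift to first order in v/c.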
Yet another effect is the different speed of sound in different media. In video games, explosions may be the most salient example. The speed of sound through the ground can be as high as 5,000 m/s, almost 15 times as fast as in air. The result is that we hear and feel the low-frequency rumble of an explosion through the ground before the more broad-spectrum shock from the air. The shockwave in the air also dislodges and propels dust and other small debris, visually demonstrating the finite speed of sound. Here is real-world footage demonstrating these two effects of explosions.
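As a quick worked example of the gap between the two arrivals, the sketch below uses the speeds quoted above. The 1 km distance is an arbitrary assumption.

```python
def arrival_times(distance_m, air_speed=340.0, ground_speed=5000.0):
    """Return (ground_arrival_s, air_arrival_s) for an event distance_m away."""
    return distance_m / ground_speed, distance_m / air_speed


ground_t, air_t = arrival_times(1000.0)  # hypothetical explosion 1 km away
print(f"ground rumble after {ground_t:.2f} s, air shock after {air_t:.2f} s")
```

At this distance the rumble through the ground arrives more than two and a half seconds before the shock through the air.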
Directional Occlusion
For very accurate simulation of the
acoustics of the ear itself, we can make a cast of the user’s ears, put
microphones inside them, put them on a dummy head, and record impulse responses
from many directions. However, this is impractical for any program that will be
used by many people, and would be difficult to implement in real time. For this
reason we chose a much simpler approach. Each microphone is directional, and
pointed 45 degrees left or right of the camera. If the sound source is in the
direction of the microphone, it is unchanged, but if it is behind the
microphone, its volume is reduced, especially at high frequencies. Applying
this filter in real time was too slow to support many sounds at
the same time. We therefore decided to pre-calculate a muffled version of each sound, and
blend between the clear and muffled versions at playback. Here is a video of sound
directionality using both phase changes and this directional occlusion
technique.
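The blend between the clear and muffled versions can be sketched as follows. The mapping from angle to blend amount, (1 - cos θ) / 2, is an assumption made for this example, as are the function names and the 2D coordinate convention (x to the right, y forward at zero yaw). The text above does not specify the curve we used.

```python
import math


def mic_direction(camera_yaw_rad, side):
    """Unit vector for a microphone aimed 45 degrees off the camera.

    side is -1 for the left microphone and +1 for the right one.
    """
    angle = camera_yaw_rad + side * math.radians(45.0)
    return (math.sin(angle), math.cos(angle))


def muffle_amount(mic_dir, to_source):
    """0.0 when the source is straight ahead of the microphone, 1.0 directly behind."""
    dot = mic_dir[0] * to_source[0] + mic_dir[1] * to_source[1]
    norm = math.hypot(*mic_dir) * math.hypot(*to_source)
    if norm == 0.0:
        return 0.0  # source sits on the microphone: leave it clear
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return (1.0 - cos_angle) / 2.0  # assumed mapping, not a measured curve


def blend(clear, muffled, amount):
    """Crossfade sample by sample between the clear and pre-muffled buffers."""
    return [(1.0 - amount) * c + amount * m for c, m in zip(clear, muffled)]


left_mic = mic_direction(0.0, -1)
print(muffle_amount(left_mic, (-1.0, 1.0)))  # source ahead of the left mic
print(muffle_amount(left_mic, (1.0, -1.0)))  # source behind the left mic
```

Because the muffled buffer is computed ahead of time, the per-sample cost at playback is only the crossfade.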
Environmental Occlusion and Indirect Sound
The obvious way to handle sound
occlusion is to cast a ray from the sound source to a microphone, and then mute
the sound if there is something in the way. This is straightforward and easy to
implement, but not very realistic. If there is a lamp post between you and your
conversation partner, you will still be able to hear him. Even if there is a
door between you with only a small gap above and below it, you can still
probably hold a conversation. This is possible because of two acoustic
phenomena: diffraction and reflection.
Diffraction
Because sounds are pressure waves,
they can bend around corners. We can identify candidate corners by looking at
the connected faces. If one face is facing towards the sound source, and the
other is facing towards the microphone, then the corner is a good candidate for
diffraction. Once we have identified diffraction edges, we can find the shortest
path around each corner, and check for obstacles on that path. For each
unobstructed path we find, we can create a virtual sound source,
and render the sound using the distance delay and directional occlusion
discussed earlier. Here is an example of sound diffracting around a pentagonal
light post:
We must also attenuate the sound
based on the angle between the original path and the bent path, and the length
of the diffraction edge. This attenuation must be different for different
frequency bands; low frequencies lose much less energy than high frequencies
when diffracting around edges. The parameters for this attenuation can be
derived physically, but we picked parameters based on what sounded good to us.
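The candidate test and the path around an edge can be sketched as below. This is a minimal illustration under stated assumptions: each face is represented as a (normal, point on face) pair, the shortest path is found with a ternary search along the edge rather than a closed-form solution, and the obstacle check and frequency-dependent attenuation are left out. All names and the box-corner geometry are hypothetical.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def faces_point(normal, face_point, p):
    """True if p lies on the front side of the face's plane."""
    return dot(normal, sub(p, face_point)) > 0.0


def is_diffraction_candidate(face_a, face_b, source, mic):
    """One face must face the source and the other must face the microphone."""
    (na, pa), (nb, pb) = face_a, face_b
    return ((faces_point(na, pa, source) and faces_point(nb, pb, mic)) or
            (faces_point(nb, pb, source) and faces_point(na, pa, mic)))


def shortest_path_via_edge(edge_start, edge_end, source, mic, iterations=100):
    """Point on the edge minimizing source -> point -> mic, and that length."""
    def point_at(t):
        return tuple(a + t * (b - a) for a, b in zip(edge_start, edge_end))

    def length(t):
        p = point_at(t)
        return math.dist(source, p) + math.dist(p, mic)

    lo, hi = 0.0, 1.0
    for _ in range(iterations):  # the length is convex in t, so this converges
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if length(m1) < length(m2):
            hi = m2
        else:
            lo = m1
    t = (lo + hi) / 2.0
    return point_at(t), length(t)


# Hypothetical box corner: the shared edge runs along z at x = 0, y = 0.
face_a = ((-1.0, 0.0, 0.0), (0.0, 0.0, 0.0))  # faces -x, toward the source
face_b = ((0.0, 1.0, 0.0), (0.0, 0.0, 0.0))   # faces +y, toward the microphone
source, mic = (-1.0, -1.0, 1.0), (1.0, 0.5, 1.0)

if is_diffraction_candidate(face_a, face_b, source, mic):
    bend, path_length = shortest_path_via_edge(
        (0.0, 0.0, 0.0), (0.0, 0.0, 2.0), source, mic)
    print(bend, path_length)
```

The resulting path length would feed the distance delay of the virtual source placed for this edge.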
Reflection and Reverberation
The second kind of indirect sound is reflection. If a sound source is in a room with 4 walls, a floor and a ceiling, then there are 6 paths it could take to reach a microphone after one reflection. It can bounce off of any one of the walls, off of the floor, or off of the ceiling. Using two reflections, there are 30 (6*5) ways for it to reach the microphone. It could bounce off of one wall and then off of another, or off of the floor and then off of the ceiling, and so on. After three reflections, there are 150 (6*5*5) ways for it to reach the microphone. As you can see, the number of paths increases exponentially with the number of reflections. For this reason, we usually divide the total set of reflections into two parts: first reflections and reverberation.
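The growth in the number of paths can be checked with a short calculation. The sketch assumes, as in the counts above, that a path never hits the same surface twice in a row. It counts surface sequences, so it is an upper bound on the geometrically valid paths. The function names are illustrative.

```python
def paths_of_order(order, surfaces=6):
    """Number of surface sequences with exactly `order` bounces."""
    if order < 1:
        raise ValueError("order must be at least 1")
    return surfaces * (surfaces - 1) ** (order - 1)


def paths_up_to(order, surfaces=6):
    """Total number of sequences with between 1 and `order` bounces."""
    return sum(paths_of_order(k, surfaces) for k in range(1, order + 1))


for k in (1, 2, 3, 15):
    print(k, paths_of_order(k), paths_up_to(k))
```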
This division is based on
psychoacoustic research showing that we use the distinct early (first) reflections
to learn more about the size and shape of the environment, and we use the dense
reverberation to learn more about its texture. Because the number of paths
increases exponentially with the number of reflections, we only calculate the
first reflections in real time, and use pre-calculated late reverberation. This
is justified because while the early reflections can vary greatly depending on
the position of the source and microphone within the space, the reverberation does
not change very much. We can find the first reflections by looking at every
face in the environment that faces both the microphone and the sound source,
and calculating the reflection path using trigonometry. If there are no
obstacles along this path, we can create a virtual source like we do for
diffraction, and attenuate the sound based on the absorption properties of the
surface.
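One common way to compute a first-reflection path is the image-source construction: mirror the source across the plane of the face, then intersect the line from the mirrored source to the microphone with that plane. The sketch below uses this construction as a stand-in for the trigonometry mentioned above. It treats the face as an infinite plane and omits the obstacle check and the surface absorption. All names are illustrative.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mirror_point(p, plane_point, unit_normal):
    """Reflect p across the plane through plane_point with the given unit normal."""
    d = dot(sub(p, plane_point), unit_normal)
    return tuple(x - 2.0 * d * n for x, n in zip(p, unit_normal))


def first_reflection(source, mic, plane_point, unit_normal):
    """Return (virtual_source, hit_point, path_length), or None if the face
    does not face both the source and the microphone."""
    ds = dot(sub(source, plane_point), unit_normal)
    dm = dot(sub(mic, plane_point), unit_normal)
    if ds <= 0.0 or dm <= 0.0:
        return None
    image = mirror_point(source, plane_point, unit_normal)
    t = ds / (ds + dm)  # fraction of the image -> mic segment at the plane
    hit = tuple(i + t * (m - i) for i, m in zip(image, mic))
    return image, hit, math.dist(image, mic)


# Hypothetical floor at y = 0 with the source and microphone 1 m above it.
result = first_reflection((0.0, 1.0, 0.0), (4.0, 1.0, 0.0),
                          (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(result)
```

The mirrored source can serve directly as the virtual source, since its distance to the microphone equals the full length of the reflected path. A complete version would also confirm that the hit point lies inside the face's polygon.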
To calculate the late reflections,
we created an impulse response, and convolved it with the sound. Convolution
allows us to combine two sound files, for example, a balloon pop in a
reverberant room and someone talking into a microphone, such that the resulting
sound is of someone talking in a reverberant room. To perform this convolution
normally requires O(n*m) computational operations, where n and m are the number
of samples in each sound file. However, we can use the fast Fourier transform
(FFT) to move these sounds, zero-padded to a common length, into frequency space
in O((n+m)log(n+m)) time, multiply their spectra point by point in O(n+m) time
(which is equivalent to convolving the original sounds), and then convert the
product back in O((n+m)log(n+m)) time. In short, we can perform the convolution efficiently using the FFT.
Even so, the convolution is still intensive enough that we can only perform one
at a time on most CPUs, so it would not work for applications that require multiple
sounds at the same time. Instead, we precompute the reverberant tail for
each sound as needed, and then add the tail at the end of the sound when it is
played.
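The sketch below shows FFT-based convolution next to the direct method so that the two can be compared. It uses a small recursive radix-2 FFT written with the standard library to stay self-contained. A production renderer would use an optimized FFT library instead, and the function names are illustrative.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft_convolve(signal, impulse_response):
    """Linear convolution of two non-empty sequences via zero-padded FFTs."""
    out_len = len(signal) + len(impulse_response) - 1
    size = 1
    while size < out_len:  # pad to a power of two that fits the full result
        size *= 2
    a = fft(list(signal) + [0.0] * (size - len(signal)))
    b = fft(list(impulse_response) + [0.0] * (size - len(impulse_response)))
    product = [x * y for x, y in zip(a, b)]  # pointwise multiply the spectra
    result = fft(product, inverse=True)
    return [v.real / size for v in result[:out_len]]


def direct_convolve(signal, impulse_response):
    """Reference O(n*m) convolution."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


dry = [1.0, 2.0, 3.0]
ir = [0.0, 1.0, 0.5]  # toy impulse response: one echo and a quieter second one
print(direct_convolve(dry, ir))
print(fft_convolve(dry, ir))
```

Both functions give the same samples up to rounding error. The difference is only in how the cost grows with the length of the sound and the impulse response.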
One way to calculate the impulse
response for the reverberant tail would be to extend our reflection calculation
to the nth reflection. However, this requires O(m^n) operations, where m is the
number of reflective surfaces. For example, in our simple cube room the 15th-order
reflections alone number almost 37 billion (6*5^14), and calculating all the reflections up
to the 15th would require us to find almost 46 billion. Even then, our impulse response would only be as
accurate as the parameters we used for the absorption coefficients of the
walls, which are not easy to measure. For this reason, we decided to use
impulse responses that we recorded in real environments. We recorded balloon
pops in an anechoic chamber, a standard classroom, and two reverberant
stairwells. The results are shown below, and can be heard here.
We added this reverberant tail to
the end of our sound, and attenuated it by the distance from the source (as
opposed to the square of the distance like the direct sound). The results can
be heard in this video.
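The two attenuation laws can be compared with a short sketch. The reference distance of 1 m, the clamp that keeps the gain from exceeding 1 near the source, and the function names are assumptions made for this example.

```python
def direct_gain(distance, reference=1.0):
    """Direct sound: attenuated by the square of the distance."""
    d = max(distance, reference)  # assumed clamp so the gain never exceeds 1
    return (reference / d) ** 2


def reverb_gain(distance, reference=1.0):
    """Reverberant tail: attenuated by the distance itself."""
    d = max(distance, reference)
    return reference / d


def append_tail(dry, tail, distance):
    """Dry sound followed by its precomputed tail, each with its own gain."""
    return ([s * direct_gain(distance) for s in dry] +
            [s * reverb_gain(distance) for s in tail])


for d in (1.0, 2.0, 4.0, 8.0):
    print(d, direct_gain(d), reverb_gain(d))
```

Because the tail falls off more slowly than the direct sound, distant sources sound proportionally more reverberant.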
Conclusion and Future Work
This project showed that we can create an efficient sound renderer that simulates the most salient acoustic phenomena in real time without any specialized hardware. In the future we would like to extend this system to handle more complex environments, the transmission of sound through walls, and the interpolation of reverberation between different environments. We would also like to explore stereo impulse responses for reverberation to give a greater sense of space, the reverberation within the human body itself from loud impulses, and environmental reverberation within hollow objects.