Perceptual audio coding

Perceptual audio coding is a compression technology for audio signals that exploits the imperfections of the human ear. It is a lossy compression technique, i.e. the decoded bitstream is not an exact copy of the original digital audio bitstream. The goal is a decoded signal that sounds the same as the original audio (or at least as close to it as possible) while keeping the compressed file as small as possible.

Perceptual audio coding is based on two related characteristics of the human ear:

  1. the sensitivity of the human ear is not the same for all frequencies;
  2. a loud tone or noise can make a weaker tone inaudible.

The human ear can hear sounds in the frequency range from around 20 Hz to 20 kHz (the upper limit drops with age). The ear is most sensitive to frequencies between 500 Hz and 5 kHz, and its sensitivity decreases below and above this range. This means that a tone of 100 Hz must be louder than a tone of 1 kHz before it can be heard. The loudness a tone needs in order to become audible is called the threshold in quiet. Any tone with a loudness below this threshold is inaudible.
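The frequency dependence of the threshold in quiet can be illustrated with a short sketch. It assumes Terhardt's widely used closed-form approximation of the threshold (in dB SPL); neither that formula nor the function name `threshold_in_quiet_db` comes from this article, and real codecs may use tabulated values instead.

```python
import math

def threshold_in_quiet_db(freq_hz):
    """Approximate threshold in quiet, in dB SPL, for a tone at freq_hz.

    Assumption: Terhardt's approximation, which is a model fit and not
    a measurement of any particular listener.
    """
    f = freq_hz / 1000.0  # the formula works in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

if __name__ == "__main__":
    for freq in (100, 1000, 3300, 15000):
        print(f"{freq:>6} Hz: {threshold_in_quiet_db(freq):6.1f} dB SPL")
```

Under this approximation a 100 Hz tone needs roughly 23 dB SPL to become audible, against only about 3 dB SPL at 1 kHz, which matches the statement above that the low tone must be louder.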

A second characteristic of the human ear is that a tone that is (just) audible can be made inaudible by a louder tone or noise. Or, to put it the other way around, a loud tone can make other tones inaudible: it raises the threshold in quiet around its own frequency. This is called the masking effect, and it is easy to demonstrate. Get in a car and turn on the radio at a moderate volume. Then start the engine. Most of the sound from the radio (except the high tones) will be drowned out by the sound of the engine and will no longer be audible, so you will have to turn up the volume of the radio. For every (loud) tone in the audio signal it is possible to calculate its effect on the threshold. If the volume of another tone lies below this calculated masking threshold, it will be masked by the louder tone and will remain inaudible.
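As a rough illustration of how the effect of one loud tone on the threshold might be calculated, the sketch below uses two common textbook models that are not taken from this article: Zwicker's formula for converting frequency to the Bark scale, and Schroeder's spreading function. The 10 dB offset between masker level and masking threshold is a purely illustrative assumption, and all function names are hypothetical.

```python
import math

def hz_to_bark(freq_hz):
    """Zwicker's approximation of the Bark (critical band) scale."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

def spreading_db(dz):
    """Schroeder's spreading function; dz = probe Bark - masker Bark.

    About 0 dB at dz = 0, falling ~10 dB/Bark above the masker and
    ~25 dB/Bark below it.
    """
    x = dz + 0.474
    return 15.81 + 7.5 * x - 17.5 * math.sqrt(1.0 + x * x)

def masking_threshold_db(probe_hz, masker_hz, masker_db, offset_db=10.0):
    """Threshold raised by a single masker, evaluated at probe_hz.

    offset_db is an assumed, illustrative masker-to-threshold offset.
    """
    dz = hz_to_bark(probe_hz) - hz_to_bark(masker_hz)
    return masker_db + spreading_db(dz) - offset_db

def is_masked(probe_hz, probe_db, masker_hz, masker_db):
    """True if the probe tone lies below the masker's threshold."""
    return probe_db < masking_threshold_db(probe_hz, masker_hz, masker_db)

if __name__ == "__main__":
    # An 80 dB masker at 1 kHz against 40 dB probes near and far from it.
    print(is_masked(1100, 40, 1000, 80))
    print(is_masked(5000, 40, 1000, 80))
```

With these assumptions, a 40 dB tone at 1.1 kHz falls below the threshold raised by an 80 dB masker at 1 kHz, while the same tone at 5 kHz lies far outside the masker's influence. A full model would also combine several maskers and take the maximum with the threshold in quiet.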

These two characteristics led to perceptual audio coding. The input data is used to calculate the actual masking threshold for a short period of time. The coder analyses the frequencies contained in the audio signal with either a filter bank or a time-to-frequency transformation. The resulting frequency components are then quantized: more bits are allocated the further a component lies above the masking threshold, and none if it lies below the threshold. The coded frequency components are packed into a bitstream. The decoder performs the same steps in reverse, but it does not have to calculate the masking threshold.
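The bit allocation step can be sketched as follows. This is a minimal illustration, not the allocation loop of any particular codec: it assumes the levels and masking thresholds are already available in dB per frequency component, and it uses the rule of thumb that each quantizer bit buys roughly 6 dB of signal-to-noise ratio. The function name `allocate_bits` and its parameters are hypothetical.

```python
import math

def allocate_bits(levels_db, mask_db, db_per_bit=6.02, max_bits=16):
    """Assign a bit count to each frequency component.

    A component below its masking threshold gets no bits. Otherwise it
    gets enough bits to push the quantization noise down to the
    threshold, assuming about 6.02 dB of SNR per bit.
    """
    bits = []
    for level, mask in zip(levels_db, mask_db):
        smr = level - mask  # signal-to-mask ratio in dB
        if smr <= 0:
            bits.append(0)  # inaudible: not coded at all
        else:
            bits.append(min(max_bits, math.ceil(smr / db_per_bit)))
    return bits

if __name__ == "__main__":
    levels = [70.0, 40.0, 20.0]   # component levels in dB
    masks = [30.0, 35.0, 25.0]    # masking thresholds in dB
    print(allocate_bits(levels, masks))  # -> [7, 1, 0]
```

The first component lies 40 dB above its threshold and receives 7 bits, the second is only 5 dB above and receives 1 bit, and the third is masked and receives none. A real coder would additionally fit the total to a target bit rate.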

Nearly all well-known lossy audio codecs are based on perceptual audio coding, including MP3, Dolby AC-3 and MPEG AAC.

See also