Inside the Digital Den
September 2000
Codec capers
The inaudible mystery of the missing bits
Brian Dipert, Contributing Editor
One of the mysteries of digital audio is the myriad compression/decompression (codec) algorithms that pack the digitized bits of sound into reasonably sized data files. MP3 is far and away the best-known codec, but there are dozens of others, each surrounded by varying levels of hype. I'm here to replace the hype with real data through this codec evaluation project, introduced in this article and to be continued via an interactive Web presentation.
Ultimately, the ability of a codec to deliver pristine sound or to minimize file size should be more important than the hype surrounding the technologies. And while few need to understand the intricacies of the algorithms, knowing their capabilities can not only enhance your own personal digital-audio experience, but also lead convergence teams to optimal product specs.
In this introduction, we'll wade quickly into the subject at hand. Should you want some background, both CommVerge and sister publication EDN have covered the topic extensively (see sidebar, "Codec crib notes"). In addition, the sidebar "Compression concepts" explains the fundamentals of data compression.
While I've written quite a bit about digital audio in the last year, my research has been based mostly on second-hand anecdotes, paper specifications, and marketing collateral. An engineer to the core, I always prefer to come to my own conclusions. So I jumped at the chance to do this digital-audio codec project. Only partway through, I've already reached some surprising conclusions.
A codec that reduces the size of digitized music is a necessity for storage of such music on low-cost music players and PCs. An uncompressed, dual-channel, 16-bit, 44.1-kHz-sampled audio stream gobbles up nearly 11 Mbytes of storage space per minute of music.
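The storage figure follows directly from the stream parameters. As a quick check, here is a minimal Python sketch of the arithmetic; the function name `pcm_bytes_per_minute` is purely illustrative.

```python
def pcm_bytes_per_minute(sample_rate_hz=44_100, bits_per_sample=16, channels=2):
    """Bytes needed to store one minute of uncompressed PCM audio."""
    bytes_per_sample = bits_per_sample // 8
    return sample_rate_hz * bytes_per_sample * channels * 60

cd_minute = pcm_bytes_per_minute()
print(cd_minute)               # 10584000 bytes per minute
print(cd_minute / 1_000_000)   # 10.584 decimal Mbytes, i.e. "nearly 11"
```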
Codecs fall into two basic types—lossy and lossless. Lossless codecs rely strictly on data-compression techniques to reduce file size, ensuring that the decode process will yield an exact copy of the original file. Lossy algorithms discard parts of the audio information, presumably in a way the human ear can't detect, to reduce file size more dramatically.
When the only thing the world seems to want to talk about is MP3 and other lossy codecs, why should you care about lossless compression? Well, all high-end digital-audio products, including the emerging DVD-Audio and Super Audio CD formats, rely on lossless techniques. In addition, audiophiles might want to archive recordings of live performances or digitize dusty LPs. Lossless techniques provide the highest-possible quality and can cut file size by a third or half—nothing to sneeze at.
With the burgeoning possibilities of the convergence era, other lossless characteristics deserve mention. Remember that once you've lossy-converted a sound clip, there's no way to recover the discarded information. And just as with repeated dubbing of an analog audio cassette, each time you modify and re-save an MP3 or other lossy-compression audio file, or transcode it from one lossy format to another, the audio quality further degrades.
With the benefits of lossless techniques, what then is the role of lossy compression? Consider the ability to compress audio files not to half their size, but to a guaranteed 1/12th (MP3) or even 1/24th (Windows Media Audio), with little or no degradation in perceived playback quality.
You'll need to lossy-compress audio in order to listen to it on a portable player, stream it to Internet listeners, or digitally broadcast it over the air. And if your audio device can directly read and decode MP3 or other lossy-codec format files burned on a CD-R, you've now got the ability to store up to a few dozen hours of music on one disc.
Goals
Putting generalities behind us, what are some of my specific objectives? For the lossless compression algorithms (see table, "Lossless algorithms"), I wanted to measure and compare several key parameters. First, I wanted to determine how long each algorithm took to both encode (compress) WAV files, and to decode (decompress) back to a WAV. I also wanted to contrast the compressed file sizes to the originals to determine each algorithm's compression efficiency.
All of the tables referenced in this article are in PDF format.
I had hoped to differentiate the algorithms based on the percentage of CPU resources they gobbled up. But I found that they all tended to use whatever spare power was lying around. The rudimentary tools at my disposal also weren't robust enough to accurately measure the amount of system memory each algorithm consumed.
My goals are quite different for the lossy algorithms (see table, "Lossy algorithms"). Compressed file sizes aren't meaningful as a comparison point for lossy algorithms. A file compressed to a specific bit rate will be equal in size regardless of the codec used. Of course, some companies, such as Microsoft, claim that their codecs afford better audio quality at a given bit rate than do competitors at twice the bit rate. Thus my goals for lossy codecs will be testing for encoder characteristics that can affect sound quality, and determining the time it takes a codec to encode to a specific bit rate.
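To see why file size drops out as a comparison point, note that a constant-bit-rate file's size depends only on the bit rate and the playing time. The sketch below assumes constant-bit-rate encoding, 1 kbit = 1,000 bits, and no header or tag overhead; the function name is my own invention, not part of any codec.

```python
def cbr_file_bytes(bit_rate_kbps, seconds):
    """Approximate payload size of a constant-bit-rate stream (headers ignored)."""
    return bit_rate_kbps * 1000 * seconds // 8

# Bit rate of uncompressed CD audio: 44.1 kHz x 16 bits x 2 channels.
CD_BIT_RATE_KBPS = 44_100 * 16 * 2 / 1000

for rate in (128, 64):
    size = cbr_file_bytes(rate, 60)
    ratio = CD_BIT_RATE_KBPS / rate
    print(f"{rate} kbit/sec: {size} bytes per minute, about 1/{ratio:.0f} of CD size")
```

The resulting ratios land in the neighborhood of the 1/12th and 1/24th figures quoted earlier, whichever codec produced the file.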
Needless to say, audio quality is extremely subjective. Recently a neighbor of mine swore that he was unable to hear any difference between an original audio CD, a 128-kbit/sec MP3 stream, a 64-kbit/sec WMA (Windows Media Audio) stream, and a 16-kbit/sec RealAudio stream, all heard through his PC's cheap speakers. When I brought him over to my house and played the same files on my PC's higher-end speakers, he immediately understood the need for those files that "took so long to download." I'll generally resist the urge, therefore, to make quality comments on various codecs, though if one sounds particularly good or bad to my less-than-golden ears, I'll note it. And my tests will certainly allow you to draw some conclusions in terms of quality.
All my results are highly dependent on the characteristics of the source material. Critical factors include the total frequency range represented, the amount of sample-to-sample frequency variation, whether the material is monophonic or stereo, and (if stereo) how much the channels differ in frequency, amplitude, and phase. To balance the tests, I picked quite a variety of genres from my diverse music collection (OK, I admit it's light on country) to throw at the codecs (see table, "Music samples"). I specifically chose not only modern, digitally captured music but also older analog- and live-recorded tracks.
To convert most of the audio tracks to WAV files, I used an enhanced version of Windows 98's CDFX.VXD driver. I also chose a Beastie Boys hybrid CD that contains both audio tracks and PC software. That disc was unusable with CDFX.VXD, so I extracted the desired song using the WAV conversion utility built into MusicMatch Jukebox.
Some codec vendors have been less than fair in their own benchmarking. For example, lossy codec vendors have been known to feed their own codec low-fidelity mono sound clips while giving a competitive codec a full-fidelity stereo clip—a much more challenging task. Note that all the music tracks in my test suite are stereo.
To fully understand the codecs' capabilities and gain some insight into how they work, it's important to use both real music clips and synthetic test tones. I generated a series of test files using Syntrillium Software's Cool Edit Pro. The files include white noise, pink noise, and a 30-second tone pattern.
White noise is a random audio pattern spanning the entire frequency spectrum, with all frequencies represented in equal proportions. Pink noise's sample proportions, in contrast, follow a 1/frequency pattern. In other words, pink noise contains far more low-frequency components than high-frequency ones.
For both the pink- and white-noise files, I generated stereo tones with equal-intensity channels and with different-intensity channels. Equal-intensity noise channels, converted to a frequency-domain display via a spectrum analyzer, enable me to identify any low-, band- or high-pass filtering done by a lossy codec. Differing-intensity channels, among other things, identify whether or not the encoder is simplifying its task by converting stereo to mono within certain frequency ranges.
My 30-second pattern of frequencies has the two channels out of phase. The aim here is to detect the use of joint or intensity stereo compression techniques. In other words, the codec could again minimize its workload by ignoring phase differences between channels within certain frequency ranges, while accurately preserving differences in amplitude.
Finally, to look for a specific lossy-compression artifact called echo, I used the sharp-transition sound file created by striking a triangle, provided by digital audio guru Arnold Krueger (see www.pcabx.com). For more details on all these test files, see table, "Test tones."
Although I created the noise and test tones in Cool Edit Pro, I also used Sonic Foundry's Sound Forge to run some of the lossy decode tests, because it supports more codec formats. For the initial spectrum-analyzer images of MP3 test results (Figure 1a shows an example), I returned to Cool Edit Pro, which let me zoom more tightly into the waveform and also included a more robust and informative spectrum analyzer. To see the full set of spectrum-analyzer images, see sidebar, "Visual evidence."
Unfortunately, no single package can decode all lossy formats. In some cases I will have to use a decoder provided by the codec vendor and output the uncompressed bits to a WAV file, which I can in turn input to Cool Edit Pro or Sound Forge. I'm working to ensure that I accurately capture the decoded bits into lossless WAV files.
When you peruse the results, keep in mind that my PC has a fast and full-featured microprocessor. If a given format's encoders and decoders take advantage of the integer and floating-point SIMD (single instruction, multiple data) instructions of my Pentium III, my PC might run them much faster than their non-enhanced counterparts. The differences could be less glaring on a more conventional CPU (for a detailed look at my PC configuration, see table, "Test system").
Lossless results
Even though I've only just begun, I've already gotten some interesting results. I'll highlight some findings here; for the full gamut, download the table, "Lossless results," which is available as a PDF file and as a Microsoft Excel spreadsheet.
To kick off the lossless tests, I decided to run each of the WAV files through the ubiquitous WinZip program. I ran tests at the program's "fastest speed" setting and its "highest compression" configuration. I'd heard that the compression algorithms employed by generic file-compression utilities such as PKZip, WinRAR, and WinZip were less than optimal for low-entropy multimedia files, and my results certainly bear this out.
Frankly, I was surprised by just how poor WinZip's compression ratios turned out, as compared to the other programs. Generally, WinZip yielded files 80 to 90 percent as big as the originals, while the audio-specific codecs came in at 40 to 60 percent. I was also surprised that WinZip's "highest compression" option, though it took noticeably longer than the "fastest speed" setting to compress large files, didn't achieve proportionally better results.
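You can get a feel for this kind of measurement with Python's standard `zlib` module, which implements the Deflate method used in classic ZIP archives. This is only a stand-in: it is not WinZip, the synthetic tone-plus-noise signal is not real music, and the helper names are hypothetical. The sketch times a fast and a thorough setting and reports the compressed size as a fraction of the original.

```python
import math
import random
import struct
import time
import zlib

def synth_pcm(seconds=2, sample_rate=44_100, seed=1):
    """Mono 16-bit PCM: a 440-Hz tone plus low-level noise (a synthetic stand-in for music)."""
    rng = random.Random(seed)
    samples = []
    for n in range(seconds * sample_rate):
        tone = 12_000 * math.sin(2 * math.pi * 440 * n / sample_rate)
        samples.append(int(tone) + rng.randint(-200, 200))
    return struct.pack(f"<{len(samples)}h", *samples)

def deflate_trial(data, level):
    """Return (compressed/original size ratio, elapsed seconds) for one zlib level."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    assert zlib.decompress(packed) == data   # confirm the round trip is lossless
    return len(packed) / len(data), elapsed

pcm = synth_pcm()
for level in (1, 9):   # 1 = fastest, 9 = most thorough
    ratio, elapsed = deflate_trial(pcm, level)
    print(f"level {level}: {ratio:.0%} of original, {elapsed * 1000:.1f} ms")
```

The numbers will vary with the machine and the signal, so treat them as a way to explore the speed-versus-size tradeoff rather than as a reproduction of my results.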
WinRAR, a peer of WinZip, is unique among generic file-compression programs in offering a multimedia-optimized compression option. This multimedia mode delivered significantly better results than WinZip. However, on average, WinRAR took roughly twice as long to compress a WAV file as WinZip did at even its slowest setting.
As one might expect, the audio-optimized algorithms performed better. But surprisingly, MUSICompress, Shorten, and WavPack all turned in similar results. This seems odd because, according to their documentation, the algorithms vary widely. MUSICompress, for example, is touted as being a lean-and-mean, integer-only routine employing a relatively simple predictive algorithm. Shorten, by contrast, is a more elaborate and configurable floating-point scheme. In fairness, I must admit that I didn't experiment beyond Shorten's default settings.
Not surprisingly, the audio-targeted compression routines struggled the most with the white noise and pink noise sound files, achieving limited success comparable to that of the random-tuned WinZip. In fact, for the white-noise, equivalent-channel file, WavPack's "fast option" actually turned out a file larger than the original WAV clip. The fact that a compressed version of a highly random data pattern is larger than the pattern itself doesn't surprise me much; this phenomenon results from the control bits and other overhead of the compression scheme. But most compression algorithms, in such a case, will automatically use the original rather than the "compressed" (actually expanded) file.
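The fallback behavior is easy to sketch. The example below again uses Python's `zlib` as a stand-in compressor; the one-byte method flag and the `pack`/`unpack` names are hypothetical, not the container format of any program I tested.

```python
import random
import zlib

STORED, DEFLATED = b"\x00", b"\x01"   # hypothetical one-byte method flags

def pack(data):
    """Compress, but fall back to storing the raw bytes if compression expands them."""
    packed = zlib.compress(data, 9)
    if len(packed) < len(data):
        return DEFLATED + packed
    return STORED + data

def unpack(blob):
    """Undo pack() by reading the method flag first."""
    method, body = blob[:1], blob[1:]
    return zlib.decompress(body) if method == DEFLATED else body

rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(100_000))   # white-noise-like bytes
silence = bytes(100_000)                                     # highly redundant input

print(len(zlib.compress(noise, 9)) - len(noise))   # random data typically comes out larger
print(len(pack(noise)) - len(noise))               # with the fallback, at most the flag byte
print(len(pack(silence)))                          # redundant data shrinks sharply
```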
The compression ratios varied widely with different types of music. Classical music compressed down to less than half the original file size. Modern music, such as rap and techno, proved nearly as random and difficult to compress as pink and white noise (perhaps our parents were right, after all, when they claimed that rock-and-roll was nothing but noise).
I also found that MUSICompress couldn't handle the Beastie Boys' WAV file that I generated using MusicMatch Jukebox. I've heard that some packages produce less-than-100-percent-standard WAV files, which other programs sometimes find difficult to read and play. I believe this scenario occurred here.
Lossy limbo
I've just started running tests on the lossy codecs. The results will be published on this project's home page, with highlights printed in our December issue. In preliminary experiments, however, I've already made some surprising findings based on my experience with the Fraunhofer-developed MP3 codec built into Sound Forge.
For starters, I strongly suspected the Fraunhofer and other encoders would reduce the bit rate by chopping off the upper end of the audio frequency spectrum—which humans can't really hear anyway. And, as Figure 1 shows (see sidebar, "Visual evidence"), the Fraunhofer codec certainly employs this drastic yet effective measure. Both 64-kbit/sec MP3 files severely attenuated the signal beyond 10 kHz.
Next, I took a look at the MP3-converted versions of the pink-noise test tone with unequal left and right channel intensities. I was looking for stereo-to-mono conversion done by the encoder at low or high frequencies. A lossy codec might perform such a conversion to reduce the randomness of the data and simplify the encoding task. A comparison of original and MP3 spectrum-analyzer plots would reveal use of such a shortcut, but I found no evidence that the Fraunhofer encoder used this technique (see Figure 2 and explanation in the sidebar, "Visual evidence").
I also suspect that many lossy codecs might ignore phase differences between channels. But I didn't find any evidence of high-frequency, channel-to-channel phase-difference collapse (see Figure 3 and explanation in the sidebar, "Visual evidence"). I suspect that I simply gave the encoder too easy a task, with the two channels identical except for their phase and with long sequences of single-frequency samples.
Phase alteration, unfortunately, is very difficult to observe with the more complex multi-frequency patterns that more closely mimic real music. Similarly, construction of a test-tone sequence that will reveal the presence of encoder-generated frequency and temporal masking is also challenging, though I'll strive to achieve this goal.
With the spectrum analyzer, I did find visual evidence of echo, a common side effect of the time-to-frequency transform at the heart of all lossy algorithms. For example, a low-amplitude noise pattern preceded the actual start of the sound generated by the triangle crash (see Figure 4 and explanation in the sidebar, "Visual evidence"). The MP3-compressed version of the triangle sounds "duller" to my ears than the original. And at moderate playback volumes I can even hear the echo-created colored noise burst.
Future plans
Since my lossless compression results didn't uncover dramatic differences, I don't plan to evaluate any of the other lossless codecs. My next step is to run a series of lossy encoder/decoder combinations through the test suite. I'll be using my real sound clips (see table, "Music samples") to determine compression speed and, where possible, best-case decompression speed. I'll also use the test tones and the waveform and spectrum analysis displays of Cool Edit Pro to identify the presence of various lossy compression techniques and uncover audible compression artifacts.
As for MP3, I'd like to test drive multiple encoders to see if their results differ. Other high-priority codecs include AAC, Microsoft's Windows Media Audio, RealNetworks' RealAudio, TwinVQ, and Sony's ATRAC.
I'd appreciate your feedback and suggestions as I proceed. Keep an eye out for updates in our print edition and on this project's home page, where I'll post my test tones, tables and graphics, interim test results, and some surprises.
Codec crib notes
For a thorough education on the subject of digital audio, peruse the following recent articles.
"Listen up," CommVerge, January 2000, pg 46.
"Now hear this," EDN, February 3, 2000, pg 50.
"Digital audio breaks the sound barrier," EDN, July 20, 2000, pg 71.
"You say you want a revolution?," CommVerge, June 2000, pg 38.
"Bit players?," CommVerge, July 2000, pg 66.
"Hot & streamin'," CommVerge, April 2000, pg 28.
Visual evidence
For your convenience, here's an index to all of the spectrum-analyzer images from my tests so far. The images are in JPEG format.
Figure 1
64-kbit/sec "fast encode" (a), 64-kbit/sec "highest quality" (b), 128-kbit/sec "fast encode" (c) and 128-kbit/sec "highest quality" (d) MP3 clips all exhibit severe high-frequency attenuation, compared with the original pink noise pattern (e).
Figure 2
With 64-kbit/sec "fast encode" (a), 64-kbit/sec "highest quality" (b), 128-kbit/sec "fast encode" (c), and 128-kbit/sec "highest quality" (d) MP3 clips, I found no evidence of stereo-to-mono channel combination, compared with the original pink noise pattern (e).
Figure 3
Joint or intensity stereo coding eliminates phase differences between the stereo channels (a). If I'd seen evidence of this technique in use, the result might have looked like this (b).
Figure 4
Take a look at the before-the-line echo noise created by the 64-kbit/sec "fast encode" MP3 conversion process (a), not found in the original signal (b). Now zoom in on the leading edge of the "triangle" transition, and you'll also find waveform distortion and upfront inserted delay in the MP3 file (c), compared with the original (d).
Compression concepts
The term lossless compression signifies that although the information may look completely different at the output of the encoding stage, the corresponding decompression algorithm is guaranteed to return the data to its exact original state. In other words, if you compress a file, then decompress it, you'll have a file identical to the one with which you started.
Several different lossless algorithms exist, and choosing one depends not only on the available processing power and memory, but also on the characteristics of the data to be compressed.
Statistical-coding algorithms (also called entropy-coding algorithms), like the well-known Huffman code, take advantage of probability statistics to assign short bit strings to commonly-used data patterns, and longer bit strings to less-common alternatives. A statistical coding routine will analyze the entire input file, assign codes and embed a translation "key" table, known as a canonical tree, within the compressed file. The decoding algorithm will use that key to reconstruct the original file. Take a look at the following example:
String to be compressed: RUN SPOT RUN.
Original bitstream length: 104 bits (13 characters at 8 bits/character)
Canonical tree:
| Character | Value |
| R | 000 |
| U | 001 |
| N | 010 |
| [space] | 011 |
| S | 100 |
| P | 101 |
| O | 110 |
| T | 1110 |
| . | 1111 |
Resultant bitstream length: 41 bits
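To check the arithmetic, here is a minimal Python sketch that applies the code table above to the sample string and then decodes the result. It hard-codes the table rather than deriving it from symbol statistics, so it illustrates prefix-code encoding and decoding, not the Huffman tree-building step; the names are illustrative.

```python
CODE_TABLE = {
    "R": "000", "U": "001", "N": "010", " ": "011", "S": "100",
    "P": "101", "O": "110", "T": "1110", ".": "1111",
}

def encode(text, table=CODE_TABLE):
    """Concatenate the bit string assigned to each character."""
    return "".join(table[ch] for ch in text)

def decode(bits, table=CODE_TABLE):
    """Read bits until they match a code; no code is a prefix of another."""
    reverse = {code: ch for ch, code in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return "".join(out)

message = "RUN SPOT RUN."
bits = encode(message)
print(len(message) * 8, len(bits))   # 104 41
print(decode(bits) == message)       # True
```

The 41-bit total excludes the cost of transmitting the table itself, which matters a great deal for a message this short.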
If the system already knows the data and its probability characteristics, the encoder and decoder have much easier jobs, and the canonical tree doesn't need to be transmitted along with the compressed data. For example, as Wheel of Fortune fans and contestants already know, letters such as A, E, and S are more common in English than letters like Q, X, and Z. Fax machines employ a similar approach. The most common data sequence they see is an entire line of white pixels, which they encode with the shortest-possible bit sequence prior to transmission.
Run-length-encoded (RLE) algorithms replace a long string of same-value data with a shorter combination of the data value plus the number of sequential locations that contain the value. For example, if you have a bitmap image containing a 250-pixel horizontal blue line, an RLE-compression algorithm might replace the 250 identical data values with one value for blue followed by the number 250.
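A run-length coder takes only a few lines. This sketch stores each run as a (value, count) pair, which is one of several possible layouts; real formats pack the pairs into bytes and cap the run length.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse each run of identical values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]

row = ["blue"] * 250                # the 250-pixel horizontal line
packed = rle_encode(row)
print(packed)                       # [('blue', 250)]
print(rle_decode(packed) == row)    # True
```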
Another well-known lossless technique, known as substitutional compression, replaces repeated iterations of bit or byte strings with pointers to the first time the string appears in the source file. Variations of the LZ (Lempel-Ziv) substitutional algorithm are most common. In the following example i=index and l=length:
String to be compressed: See Spot Run. Run Spot Run.
Compressed string: See Spot Run.[i=9,l=4][i=4,l=10]
In this example, "i=9" tells the decoder to go to location 9 (the 9th character in the string, which is the space between "Spot" and "Run"). Then "l=4" instructs the decoder to write four characters (starting with the character in location 9). So, the decoder writes " Run". Next, "[i=4,l=10]" tells the decoder to write the 10 characters starting with location 4. So it spits out " Spot Run."
Another example:
String to be compressed: Blah blah blah blah blah!
Compressed string: Blah b[i=2,l=3][i=5,l=5][i=5,l=10]!
Sharp-eyed readers will notice that in these conceptual examples, the supposedly compressed strings are actually longer than the originals. But remember that the codec would actually be working with 8-bit representations of each character. What's more, in reality, the control strings can be represented much more efficiently.
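The decoder's side of these conceptual examples can be sketched in Python. The bracketed text notation is this sidebar's own teaching device, so the parser below is hypothetical and handles only that notation; it treats each index as a 1-based position in the output produced so far.

```python
import re

TOKEN = re.compile(r"\[i=(\d+),\s*l=(\d+)\]")

def expand(compressed):
    """Expand [i=...,l=...] pointers; i is a 1-based position in the output so far."""
    out = []
    pos = 0
    for match in TOKEN.finditer(compressed):
        out.extend(compressed[pos:match.start()])   # copy literal characters
        index, length = int(match.group(1)), int(match.group(2))
        for k in range(length):                     # copy one character at a time
            out.append(out[index - 1 + k])
        pos = match.end()
    out.extend(compressed[pos:])                    # trailing literals
    return "".join(out)

print(expand("See Spot Run.[i=9,l=4][i=4,l=10]"))      # See Spot Run. Run Spot Run.
print(expand("Blah b[i=2,l=3][i=5,l=5][i=5,l=10]!"))   # Blah blah blah blah blah!
```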
By varying the size of the sliding memory window used to search backwards in the file for string matches, substitutional algorithms can trade off compression efficiency for required memory and processing power.
Next, let's turn our attention to lossless compression techniques that are particularly appropriate for files with little sample-to-sample variation, such as audio information. Instead of storing each data sample's full code, you could instead store the difference between it and either the preceding sample or a common reference sample. Not surprisingly, these types of lossless algorithms are called differencing routines.
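Here is a small Python illustration of differencing against the preceding sample. The sample values are made up to resemble a slowly varying waveform, and the sketch assumes a non-empty input.

```python
def delta_encode(samples):
    """Keep the first sample, then store each sample minus its predecessor."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def delta_decode(deltas):
    """Rebuild the samples by accumulating the stored differences."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

samples = [1000, 1004, 1009, 1013, 1016, 1018]   # made-up, slowly varying values
deltas = delta_encode(samples)
print(deltas)                            # [1000, 4, 5, 4, 3, 2]
print(delta_decode(deltas) == samples)   # True
```

The small differences need far fewer bits than the full sample values, which is where the savings come from once an entropy coder takes over.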
Taking differencing to the next step, the lossless algorithm might choose to estimate the next sample value based on a sequence of past values, then store only the variation between this forecasted pattern and the actual data pattern. These predictive, or delta, coders differ from each other mainly in the predictive algorithm they use and the number of past samples the algorithm incorporates in its calculation. Some algorithms take prediction to multiple derivatives, predicting not only the sample but also its residue, its residue of residue, and so forth.
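As a rough illustration of prediction, the sketch below extrapolates each sample in a straight line from the two before it and stores only the prediction error. This particular predictor is my own choice for the example, not the one used by any codec in my tests, and the sample values are made up.

```python
def predict(history):
    """Second-order linear extrapolation from the two previous samples."""
    if len(history) >= 2:
        return 2 * history[-1] - history[-2]
    return history[-1] if history else 0

def residuals(samples):
    """Prediction error for each sample, given all samples before it."""
    return [s - predict(samples[:n]) for n, s in enumerate(samples)]

def reconstruct(res):
    """Rebuild the samples by re-running the same predictor in the decoder."""
    out = []
    for r in res:
        out.append(predict(out) + r)
    return out

samples = [1000, 1004, 1009, 1013, 1016, 1018]   # made-up, slowly varying values
res = residuals(samples)
print(res)                           # [1000, 4, 1, -1, -1, -1]
print(reconstruct(res) == samples)   # True
```

Because the decoder runs the identical predictor on the samples it has already rebuilt, the residuals alone are enough to recover the original exactly.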
Living with loss
Now for lossy compression. Figure A shows simplified block diagrams that represent the main functions found in lossy encoders and decoders. The encoders convert groups of consecutive audio samples from the time domain to the frequency domain, where much of the lossy compression "magic" takes place. Keep in mind that the 2- to 5-kHz range represents the portion of the audio frequency spectrum in which human hearing is most sensitive. Lossy compression focuses most of its attention, therefore, on eliminating superfluous data that falls outside of this sweet spot, particularly high-frequency information that has the greatest potential for sample-to-sample variation.
One brute-force technique involves digitally lowpass-filtering the audio, eliminating all content above, say, 10 or 16 kHz. Most music doesn't have much energy at these high frequencies anyway, and even if it does most folks wouldn't be able to hear it. For similar reasons, some algorithms stereo-to-mono convert low and/or high ends of the audio spectrum to reduce the overall number of channel-vs-channel differences.
Joint (also called intensity) coding takes advantage of the fact that, above the sweet spot, the human auditory system increasingly relies on signal strength, not phase, to determine the location of sound sources. You may have experienced channel-to-channel phase differences when listening to an old analog tape that's been stretched and warped. In both studio and live environments, phase differences also result from variations in distance between left- and right-channel speakers and the recorder, and between each speaker and the recorder's left and right microphones. Collapsing phase, while retaining unique channel intensity, will certainly create a different sound, and it may reduce the spaciousness that gives live recordings their realism. To hear it for yourself, listen to these two WAV files, inphase.wav (431 kbytes) and oophase.wav (431 kbytes). Although both contain 1-kHz tones, the two channels in oophase.wav are 180 degrees out of phase with each other, while both channels in inphase.wav are identical.
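If you'd rather generate such a pair of tones yourself, the following Python sketch builds a 1-kHz stereo tone whose right channel is either identical to the left or inverted, which is equivalent to a 180-degree phase shift. It does not reproduce the exact WAV files named above; the amplitude, duration, and function names are my own assumptions.

```python
import math
import struct

def stereo_tone(freq_hz=1000, seconds=1, sample_rate=44_100, out_of_phase=True):
    """List of (left, right) 16-bit samples; the right channel is optionally inverted."""
    frames = []
    for n in range(seconds * sample_rate):
        left = int(20_000 * math.sin(2 * math.pi * freq_hz * n / sample_rate))
        right = -left if out_of_phase else left
        frames.append((left, right))
    return frames

def to_pcm_bytes(frames):
    """Pack frames as interleaved little-endian 16-bit PCM, as in a WAV data chunk."""
    return b"".join(struct.pack("<hh", left, right) for left, right in frames)

frames = stereo_tone()
mono_mix = [left + right for left, right in frames]
print(max(abs(s) for s in mono_mix))   # 0: a mono downmix of inverted channels cancels
print(len(to_pcm_bytes(frames)))       # 176400 bytes for one second of stereo
```

The complete cancellation in the mono mix shows why out-of-phase material is a useful probe: any encoder shortcut that merges the channels will visibly damage it.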
Lossy schemes also commonly employ two techniques that center on the concept of masking. A tone of given frequency and given amplitude will tend to obscure other tones that occur at the same time, but at lower volume and in nearby frequencies. This phenomenon is called frequency masking, shown in Figure B. Similarly, a tone of given frequency and given amplitude obstructs your ability to hear similar-frequency tones that occur both before and, more significantly, after it. This is temporal masking, shown in Figure C. Advanced lossy codecs employ sophisticated techniques that alter the sound data to take advantage of these quirks of human hearing.
Finally, nearly all lossy-compression schemes also employ the lossless techniques described above to further squish those bit streams.
Author information
Contributing Editor Brian Dipert is a technical editor for EDN (www.ednmag.com). As a child, he drove his parents to despair by disassembling his toys—as well as the grown-ups' toys—to figure out how they worked. You can reach Brian at bdipert@pacbell.net or his personal Web page.