Inside the Digital Den
December 2000
Even more codec capers
Our investigation of compression continues. This time, lossy codecs go under the microscope.
Brian Dipert, Contributing Editor
For this, the third installment of my ongoing digital-audio codec project, I've analyzed three different lossy codecs. The results have been quite interesting, both in terms of performance and quality. As you'll see below, I found audible differences between the codecs.
(NOTE: If you've stumbled upon this page out of the blue, you'll probably want to catch up on the project so far, starting at its home page, "Continuing codec capers.")
First, allow me to provide some detail on the three lossy codecs I've analyzed so far:
MP3, which I analyzed at 64-, 96-, 128-, 160-, and 192-kbit/sec bit rates and using both quality- and performance-optimized encoder settings at each bit rate
RealAudio, which I analyzed at 64 and 96 kbits/sec (RealNetworks' encoder doesn't support higher settings)
Windows Media Audio, which I analyzed at 64, 96, 128, and 160 kbits/sec (Microsoft's encoder doesn't support the 192-kbit/sec setting).
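For a sense of scale, it helps to set these bit rates against the uncompressed source. The short sketch below works out the implied compression ratios. It assumes CD-style PCM (44.1 kHz, 16 bits, two channels) as the starting point, and the function names are mine, purely for illustration.

```python
# Assumption: the uncompressed source is CD-style PCM.
SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16
CHANNELS = 2


def pcm_kbits_per_sec():
    """Bit rate of the uncompressed stereo PCM stream, in kbits/sec."""
    return SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS / 1000


def compression_ratio(codec_kbits_per_sec):
    """How many times smaller the lossy stream is than the PCM source."""
    return pcm_kbits_per_sec() / codec_kbits_per_sec


for rate in (64, 96, 128, 160, 192):
    print(f"{rate:3d} kbits/sec -> about {compression_ratio(rate):.1f}:1")
```

Under that assumption, the lowest setting tested here asks each codec to discard roughly 95 percent of the original bits.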
Before diving into the results, let's cover the background info. First, I'll discuss the hardware and software used to create the test tones, to do the encoding and decoding, and to analyze the results. I'll also cover the music clips and synthetic test tones that I threw at the codecs, with the latter being quite a bit enhanced over the first attempts in my September article.
The hardware
My September lossless compression benchmarking study used a system based on Intel's i820 chipset, with a Katmai-generation Pentium III processor internally running at 533 MHz and employing a 133-MHz local-bus frequency (see "Test system"). Although at the time I indicated that in the future I'd be moving to a system using a Coppermine-generation PIII-800 and an i840 chipset, I've decided to stay with the original PC, at least for the time being. Both PCs use PC800 Direct Rambus memory and Ultra ATA/66 hard drives—which should remove main memory and mass storage as performance bottlenecks that could slow down encode and decode speeds.
Why'd I stay with the original PC? First off, a 533-MHz CPU represents a more mainstream configuration, particularly when you consider not just the PCs that are selling today but also those already in users' hands. In this era of 1.5-GHz Pentium 4s, it's sometimes easy to forget that not too long ago, 533 MHz represented the state of the art. That level of performance is also closer to that of the embedded processors that might appear in non-PC systems planning to do various audio encoding and decoding tasks. And a 533-MHz CPU will magnify the codec-vs-codec encode and decode speed differences, versus running the same algorithms on a much faster CPU.
At least so far, I've been able to both encode from WAV files, and decode back to WAV files, entirely within the PC using software supplied by the vendors. For this reason, I haven't needed to, for example, send a compressed audio format's player output to a sound card's digital connection (something my 533-MHz PC's analog interface-based sound system doesn't even support) for capture by my DAT recorder. High Criteria's Total Recorder is a controversial alternative for capturing player outputs. Some people believe that it reduces the audio volume and makes other sonic alterations.
Sony's ATRAC encoding runs exclusively in hardware external to the PC, so digital communication between the PC and MiniDisc player/recorder I picked up on eBay (see "On track with ATRAC") will be a necessity. Total Recorder isn't an option. However, I may not bother with the 292-kbit/sec ATRAC, but instead look at the 132-kbit/sec ATRAC3, Sony's MP3-targeted lower-bit-rate alternative, which Sony uses in the Music Clip, other solid-state player/recorders, and the latest MiniDisc Long Play units. If I still need to capture digital audio outputs in conjunction with ATRAC or any other codec, Ego-Sys' U2A uses a USB interface and thereby avoids the open-the-PC-and-add-a-board-and-plug-and-pray process I'm dreading. User feedback suggests that the U2A is a far more robust performer than the Opcode Systems SONICPort Optical, which I evaluated in a 1999 article for EDN Magazine (see "The high-end PC looks for a home").
One other note on my PC processor: As a Katmai-generation CPU (specifically the 533B) it's different in several key ways from the latest-and-greatest Coppermine-generation Pentium IIIs (such as the 533EB). Both processors employ a 32-kbyte (16-kbyte code, 16-kbyte data) internal L1 cache, and both use a 133-MHz system bus. My 533B, however, includes a 512-kbyte half-speed external L2 cache, while the 533EB and its Coppermine companions employ a smaller (256-kbyte) but faster (full speed) and more advanced L2 cache. Coppermine-era processors also beef up the system buffering to increase bus utilization—four writeback buffers, six fill buffers, and eight bus queue buffers. My 533B should still run rings around a 533 (Pentium II core) or 533A (Pentium III Katmai core) Celeron CPU, with their 66-MHz system buses and even smaller 128-kbyte L2 caches.
The software
Version 4.5h of Sonic Foundry's Sound Forge audio editing program is able to encode MP3, RealAudio G2, and Windows Media 7 files. It can also decode MP3 files back to WAV, but licensing restrictions preclude it from supporting RealAudio and WMA decoding, forcing me to rely on RealNetworks' RealJukebox and Microsoft's command-line decoder, respectively. By encoding from the same WAV file to each of the three formats within an otherwise-identical software environment, I hope to be most accurately measuring the speed of the encoding algorithm, with other system overhead canceled out.
However, from both an audio-creation and analysis standpoint, I found Syntrillium's Cool Edit Pro to be the superior product. As a result, I used both software packages. Cool Edit Pro, at roughly the same price as Sound Forge, incorporates a 64-track mixer that, in the following sections of this writeup, you'll see I used extensively. It enables me to create and combine precisely defined audio tones, as well as generate white, pink, and brown noise. I also found its time-based (oscilloscope) and frequency-based (spectrum analyzer) output displays to be more informative and robust compared to those in Sound Forge.
My PC runs Windows 98 Second Edition. Before encoding or decoding, I used the task list (accessed via the CTRL-ALT-DEL keystroke combination) to terminate all running applications aside from Explorer and Systray. Doing so both got available system resources above 95 percent and ensured that random interruptions from other programs wouldn't unfairly handicap one codec over another.
Tunes and tones
Even my "slow" 533-MHz CPU can rapidly encode and decode a 30-second sound clip. Therefore, for the performance analysis portions of my study, I employed the same 19 songs I used in the September article's study of lossless compression alternatives. In addition to measuring performance, though, I also hoped to reveal the presence of various lossy-compression techniques and their artifacts. As a review, the list of things I was looking for included:
Low-pass filtering (removal of all audio information above a certain frequency)
Stereo-to-mono conversion of the original two audio channels (completely or above a certain frequency)
Phase collapse (elimination of phase differences between the two channels, completely or above a certain frequency)
Frequency masking (loud tones will mask lower-volume information in nearby frequencies)
Temporal masking (loud tones will mask lower-volume information that both precede and follow in time)
Echo (insertion of unwanted audio information both prior to and after a sharp transient, such as a percussion instrument sound).
I found that the white and pink noise clips first used in the September study (see table "Test tones") were still useful for identifying low-pass filter frequency thresholds. The two noise patterns each came in two variants, one with both channels at identical volumes, and the other with the right channel attenuated by 20 dB. But the sequential-tone file used in the September article went by the wayside, for reasons already noted in an earlier update.
What did I come up with instead? Well, let's continue going down the test-tone list. (Click here to see a table listing all of my final test clips, or here if you wish to actually download the clips.)
You might already be aware that the human auditory system groups its detection of incoming audio information into a number of critical frequency bands, with most of the bands residing below 5 kHz (see table "Critical frequency bands and test tone frequencies"). Notice that the bands' widths increase as the corresponding center frequencies rise. A structure in the inner ear called the organ of Corti is responsible for translating incoming audio waves into nerve impulses. The width, thickness, and stiffness of the basilar membrane on which it sits, along with its hair-cell clustering, define the critical band frequency ranges and endpoints.
What better place to start my simultaneous-tone test clip development, then, than with the midpoints of each critical band? All of my critical-band-derived sound clips have one channel 180 degrees out of phase from the other, to give the encoder one more challenge to surmount and to enable me to look for phase collapse in the subsequent decode. As with the pink and white noise clips, I created two versions of each file, one with both channels at equivalent amplitude, and the other with the left channel 20 dB louder than the right.
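To make the recipe concrete, here is a minimal sketch of how one stereo sample of such a clip could be computed. It is not the Cool Edit Pro procedure I actually followed; the function names, the 0.5 peak level, and the equal weighting of the tones are assumptions for illustration. The three frequencies in the usage lines are the top three mid-point tones quoted in the sidebars.

```python
import math

SAMPLE_RATE_HZ = 44_100


def db_to_gain(db):
    """Convert a level change in dB to a linear amplitude factor."""
    return 10 ** (db / 20)


def stereo_frame(n, freqs_hz, peak=0.5, right_db=0.0):
    """Return (left, right) sample values for sample index n.

    Left is an equal-weight sum of sine tones. Right is the same sum
    inverted (180 degrees out of phase) and scaled by right_db.
    """
    per_tone = peak / len(freqs_hz)
    left = sum(
        per_tone * math.sin(2 * math.pi * f * n / SAMPLE_RATE_HZ)
        for f in freqs_hz
    )
    right = -left * db_to_gain(right_db)
    return left, right


tones = [10_750.0, 13_750.0, 18_775.0]
equal_level = stereo_frame(5, tones)                  # both channels equally loud
right_quieter = stereo_frame(5, tones, right_db=-20)  # right channel 20 dB down
```

A 20-dB drop is a factor of 10 in amplitude, so in the second variant each right-channel sample is one-tenth the size of (and opposite in sign to) its left-channel partner.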
To generate each file, I first created a number of 32-bit per-sample and per-channel, 44.1-kHz-sampled, single-tone sources, then 32-bit mixed them together in Cool Edit and attenuated the result to the desired maximum amplitude. I then needed to convert them to 16-bit equivalents. After discussions with both Syntrillium Software and audio consultant Arny Krueger, I chose the following sample type conversion settings: dither on, 0.5-bit-dither depth, triangular probability distribution function, and no noise shaping.
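The conversion settings above can be sketched in a few lines. This is only an illustration of triangular-PDF dither without noise shaping, not Cool Edit's actual code. In particular, I'm assuming that "dither depth" means the peak excursion of the added noise, measured in 16-bit LSBs; the names below are hypothetical.

```python
import random


def tpdf_noise(rng, depth_lsb=0.5):
    """Triangular-PDF noise: the difference of two uniform random values.

    Assumption: depth_lsb is the peak excursion in 16-bit LSBs.
    """
    return depth_lsb * (rng.random() - rng.random())


def to_int16(sample, rng, depth_lsb=0.5):
    """Quantize a float sample (nominally -1.0 to 1.0) to a dithered
    16-bit integer, clipping at the ends of the range. No noise shaping."""
    scaled = sample * 32768.0 + tpdf_noise(rng, depth_lsb)
    return max(-32768, min(32767, round(scaled)))


rng = random.Random(0)  # fixed seed so that runs are repeatable
print([to_int16(8192.5 / 32768.0, rng) for _ in range(8)])
```

A sample that falls exactly between two 16-bit codes, as in the usage line, comes out as one neighbor or the other at random, which is the point of dithering: the rounding error turns into benign noise rather than distortion correlated with the signal.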
Next, to test for frequency masking, I regenerated my critical-band midpoint mix, but this time added an additional 50 tones, half of them at the one-quarter point across each critical band and the other half at the three-quarter point. One- and three-quarter point test tones were 20 dB down from their mid-point neighbors. To look for temporal masking, I first confirmed, by research in Zwicker and Fastl's classic reference manual, Psychoacoustics: Facts and Models, that the pre-tone masking duration extends no further than 50 milliseconds ahead of the masking tone, while the post-tone duration extends no more than 200 milliseconds beyond the masking tone. Therefore, I again created my 30-second mid-point tone combination, but this time preceded it with 50 milliseconds of the same tonal combination at a level 20 dB quieter, and followed it with 200 milliseconds of the same 20-dB-down combination.
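The tone placement and segment lengths reduce to simple arithmetic, sketched below. The band edges listed are assumptions: I've inferred them from the tone frequencies quoted in the sidebars rather than copied them from the critical-band table, and only the top three bands are shown. The function names are illustrative.

```python
SAMPLE_RATE_HZ = 44_100

# Hypothetical band edges (Hz), inferred from the quoted tone frequencies.
TOP_BANDS_HZ = [(9_500, 12_000), (12_000, 15_500), (15_500, 22_050)]


def band_tones(low_hz, high_hz):
    """Return the quarter, mid, and three-quarter points of a band."""
    width = high_hz - low_hz
    return (low_hz + 0.25 * width, low_hz + 0.5 * width, low_hz + 0.75 * width)


def ms_to_samples(ms, rate_hz=SAMPLE_RATE_HZ):
    """Length of a time interval in samples at the given sample rate."""
    return round(ms * rate_hz / 1000)


for low, high in TOP_BANDS_HZ:
    print(low, high, band_tones(low, high))

# Temporal-masking clip: 50 ms quiet lead-in, 30 s main body, 200 ms quiet tail.
print(ms_to_samples(50), ms_to_samples(30_000), ms_to_samples(200))
```

With those edges, the 9.5-to-12-kHz band yields tones at 10,125, 10,750, and 11,375 Hz, and the quiet lead-in and tail work out to 2,205 and 8,820 samples, respectively.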
Finally, to look for pre- and post-echo noise around sharp audio transients, I turned to three tracks off the European Broadcasting Union's (EBU) Sound Quality Assessment Material (SQAM) disc: track 27 (castanets), track 32 (triangle), and track 35 (glockenspiel).
You can download my self-made test clips by clicking here. Copyright restrictions preclude me from providing you with the EBU's tracks 27 and 32, or their lossy-compressed versions. However, you can download track 35 from here.
The results
So lesseee...19 song clips and 13 test tones. Each runs through MP3 encoding 10 times (five different compressed bit rates at both quality- and performance-optimized settings for each bit rate). Then each runs through WMA encoding four times. And then each runs through RealAudio encoding two times. Then, each resultant MP3 file also runs through the decoder built into Sound Forge. That's...512 total encoder runs, and 320 decoder runs—potentially a whole lotta mouse clicks, and a lot of time spent staring at a computer monitor. Fortunately, Sound Forge includes a batch mode, with the option of creating a log file that captures time-to-encode and time-to-decode.
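For anyone who wants to check my math, the tally above works out as follows. The variable names are mine; the counts come straight from the preceding paragraph.

```python
songs, test_tones = 19, 13
clips = songs + test_tones                 # every clip goes through every codec

mp3_encodes_per_clip = 5 * 2               # five bit rates x two encoder settings
wma_encodes_per_clip = 4                   # four bit rates
real_encodes_per_clip = 2                  # two bit rates

encoder_runs = clips * (
    mp3_encodes_per_clip + wma_encodes_per_clip + real_encodes_per_clip
)
mp3_decoder_runs = clips * mp3_encodes_per_clip  # each MP3 file decoded once

print(clips, encoder_runs, mp3_decoder_runs)
```

Note that the decoder figure covers only the MP3 files handled by Sound Forge; the WMA and RealAudio decodes are counted separately.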
For Windows Media Audio 7, I used a DOS command line decoder, which Microsoft supplied. I wasn't able to figure out how to capture the screen-displayed time-to-decode into a file, so I hand-logged each displayed value as the batch file ran (my dedication knows no bounds, dear reader). For RealAudio G2 decoding, I used RealJukebox's convert-to-WAV capability, and I used the "created" and "modified" time/date stamps viewable through Windows Explorer to determine decode time.
I've presented the results in a rather massive table, which you can download/view as either a PDF or a Microsoft Excel file. (I'd highly recommend the Excel file, because the left-most column will stay put as you scroll to the right to view the results.)
First let's look at the encode and decode performance testing. Several trends in particular jump out at me. Look at the wide disparity in encode times between MP3's "fastest encode" and "highest quality" settings, at the same bit rate. 64 kbits/sec is the exception to the general trend, but there's a good reason for this anomaly. The MP3 encoder, when set to 64 kbits/sec, downsamples the original 44.1-kHz material to 22.05 kHz, as well as (you'll later see) severely low-pass filtering the upper portion of the frequency spectrum. These alterations ensure that the encoder has a whole lot less source data to work with, at least partially explaining why the "highest quality" and "fastest encode" results are more similar at this bit rate.
Also, notice that MP3 encoding to a 192-kbit/sec bit rate is actually faster than encoding to 160 kbits/sec. Although more compressed data is being generated by the encoder at the higher bit rate, I suspect that this is an overall plus; the encoder doesn't have to work as hard at the higher bit rate to squeeze the data down while maintaining quality. This result also indicates that, thanks to a fast hard drive and DRAM, the additional system overhead needed to store the larger compressed bitstream wasn't a significant factor in the results. I seemed to be truly measuring the encode speed.
If you go back to my September article, you'll find a table listing the songs used for each music genre, their durations, and their uncompressed file sizes in WAV format. Compare this information against the results table (PDF or Microsoft Excel) and you'll find that similar-duration songs of different genres sometimes had significantly different encoding delays. This result indicates that some types of music were "harder" to encode to a given bit rate and quality than others. This matches similar results I'd seen with my September article, and validates the hunch that caused me to do all this work in the first place.
It just makes sense. Compare a techno track to spoken word, for example, and you'll find that the former usually has a broader meaningful frequency spectrum, increased high frequency content, greater channel-to-channel variation in both amplitude and phase, and more numerous abrupt transients.
Compare Windows Media Audio to MP3, particularly in the context of the quality results that follow, and you'll probably end up as impressed as I was with the WMA encoder. With few exceptions, I found that the WMA encoder's performance approximated that of the MP3 encoder set to "fastest encode," while its quality at least matched (and at lower bit rates exceeded) that of MP3 files created using the "highest quality" setting. RealAudio's encoder speed was approximately the same as that of WMA and MP3's "fastest encode," but here the quality news wasn't so good. I consistently found, on both test tones and music tracks, that RealAudio files both sounded the worst and contained the largest number of lossy-compression artifacts.
And what about the decoders? Well, in all three cases their speed scaled with the bit rate of the file they were decoding, which makes sense (more bits to decode equals a slower decode speed, all other factors being equal). MP3 decoding at 64 kbits/sec is much faster than the other two codecs, but remember that the encoder had previously halved the sample rate, halving the size of the resulting decoded WAV file and giving the decoder a built-in "unfair" advantage. Above 64 kbits/sec, MP3 and WMA decoder speeds were comparable. And poor RealAudio was consistently slower than its peers—roughly two times slower on average.
Be careful when drawing conclusions here, though. Because I used three different decoding software packages (Sound Forge, Microsoft-supplied, and RealJukebox), some of these differences may be the result of factors other than the decoding algorithms themselves.
Hunting for artifacts
My next task was to see how well the test tones did in unveiling the secrets behind the magic the codecs perform to achieve their file-size reductions. I've broken this investigation down into four sidebars, each of which looks for a certain type of compression artifact. Each sidebar provides a complete wrap-up of my findings, along with lots of visual evidence.
There's more to this story
My digital audio analysis work is by no means done at this point. First, if you checked out the "Artifact hunting" sidebars mentioned above, you may have already noticed that I haven't exactly been fair to MP3. Due to deadline pressures, I've only shown and discussed the "fastest encode" artifact results, not the "highest quality" outputs, which theoretically might not exhibit the same types of phenomena. Also, for each codec there's a whole range of higher bit rates whose frequency- and time-based graphs I haven't provided.
Other versions of the Fraunhofer MP3 encoder might give me additional flexibility to enable, disable, and otherwise adjust the operation of various compression options. And although Fraunhofer is the most popular MP3 encoder, it's by no means the only game in town. RealNetworks uses the Xing encoder. QDesign also sells one. And a number of independently developed encoders exist—Blade, Gogo, LAME, and Radium, just to name a few. The choice of MP3 decoder might even affect the results, as David Robinson's recent study suggests.
I'd like to look more closely for evidence of phase collapse, as well as pre-echo in both MP3 and WMA. I'd also like to string a series of test tones together to see if I can replicate the behavior Arny Krueger found when he evaluated WMA, which I mentioned in my earlier update. Then there's a whole plethora of additional codecs I haven't yet looked at. First on the list is AAC; as I type this, Fraunhofer is creating a command-line version of its v3 encoder and decoder for me, and Dolby Labs has already supplied me with its AAC professional encoder/decoder software. Sony's got a pretty dominant position in consumer electronics, so I'd also like to give ATRAC3 a look. And for no other reason than pure intellectual curiosity, at least a few of the other algorithms in my updated table of lossy codecs are piquing my interest.
But you won't find any of these results here or in future issues of CommVerge. Instead, I'll ask you to reference my upcoming January 4, 2001 cover story in EDN magazine, and its accompanying online addendum. Click here to access that article (but remember that the link won't work until January 4, 2001 or later). As always, your feedback is welcome.
Artifact hunting I: Low-pass filtering and stereo-to-mono conversion
First, let's look at a spectrum analyzer (frequency sweep) plot of the original #2 sound clip (a), along with its 64-kbit/sec MP3 (b), RealAudio (c), and Windows Media Audio (d) counterparts. Note: this and all subsequent MP3 diagrams show the output of the encoder set to its "fastest encode" setting. (Click here for a refresher on the test clips.)
As expected, the original file shows content extending out to 22.05 kHz, with the summation of the left channel's frequency components 20 dB "louder" than the right. Also notice the negative slope of both channels' amplitude-vs-frequency plots. This arises because pink noise proportionally places a greater amount of content in low frequencies than it does in high frequencies. A white noise graph, in contrast, would show a flat amplitude slope versus frequency.
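The slope difference can be put in numbers. The sketch below assumes ideal pink noise, whose power density falls in proportion to 1/f; the noise my software actually generated may deviate somewhat from that ideal, and the function name is only illustrative.

```python
import math


def pink_level_change_db(f_from_hz, f_to_hz):
    """Change in ideal pink-noise power density between two frequencies.

    Assumes the textbook 1/f law; white noise would give 0 dB everywhere.
    """
    return -10 * math.log10(f_to_hz / f_from_hz)


print(f"per octave: {pink_level_change_db(1_000, 2_000):.2f} dB")
print(f"per decade: {pink_level_change_db(1_000, 10_000):.2f} dB")
```

In other words, an ideal pink-noise trace drops about 3 dB for every doubling of frequency, or 10 dB per decade, which is the downward tilt visible in plot (a).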
Now compare the original plot (a) to the MP3 graph (b). Two things are almost immediately evident. First, the upper end of the MP3 frequency range terminates at just above 10 kHz, meaning that the encoder has low-pass filtered and eliminated all information above this point. Why does the encoder do this? For one thing, it makes the highly dubious assumption that many of us wouldn't be able to hear such high-frequency content even if it existed. For another, any compression algorithm works best if it can minimize the sample-to-sample variation; such variation for audio is most significant at high frequencies.
Secondly, notice that the original amplitude deviation between the two channels is much less pronounced in MP3, particularly at high frequencies. This trend indicates that the encoder is doing a frequency-dependent stereo-to-mono partial conversion to reduce channel-to-channel differences and consequently simplify its job.
Compared with MP3, RealAudio (c) actually looks pretty good. The frequency response extends quite a bit higher, past 16 kHz, and the channel-to-channel amplitude difference is better preserved across the entire frequency range. Finally, take a look at Windows Media Audio (d). Of the three lossy codecs, WMA delivers the widest frequency response, and the left channel looks pretty good. But what about that right channel? Observe all the frequency detail that's been altered and discarded!
Noise files provide useful data on how the compression algorithm works, but they're not very applicable to determining how real-life compressed audio will sound (so don't reject WMA quite yet). Also since, after all, they're noise, they tend to obscure the more subtle alterations and additions that the codecs make.
So next, let's take a look at the spectrum analyzer (frequency-based) displays of test clip #6 in original WAV (a) and lossy-compressed MP3 (b), RealAudio (c), and Windows Media Audio (d) formats. (Click here for a refresher on the test clips.)
First, look at the original file (a). As expected, you'll see 25 distinct tones, with both the left and right channels at uniform amplitudes across frequency and the right channel 20 dB below the left. And absolutely no tone information between the 25 critical band mid-points.
Now for MP3 (b). Yuck! Notice again the low-pass filtering. The last three tones in the original file (10,750 Hz, 13,750 Hz and 18,775 Hz) are missing. Also notice the suppressed amplitude difference between the left and right channels even at low frequencies, and how this difference further diminishes as frequency increases. Finally, and perhaps most obviously, look at all that added noise clustered around each of the original tones. Pragmatically, it looks worse than it is. At -80 dB it's not going to be very audible, particularly outside our auditory system's 2-to-5-kHz "sweet spot." But it sure makes for an unattractive comparison, doesn't it?
Next up, RealAudio (c). As the prior pink-noise results predicted, in this case we have a better-preserved frequency response (even the 13,750-Hz tone survived compression, although 18,775 Hz did not) and better-preserved channel separation. But at what tradeoff? Here the noise floor extends at times above -60 dB, just a few dB below the "real" right channel information.
Finally, look at WMA (d). Ahhhh. Clean stereo separation. Wide frequency response; even the 18,775-Hz tone made it through. A noise floor which, at no greater than -80 dB, is comfortably below even the right channel. Windows Media Audio seems to like critical-band mid-points much more than pink noise.
Artifact hunting II: Frequency masking
Next, let's take a look at test tone #8 in an effort to uncover frequency masking. Here are the spectrum analyzer (frequency-based) displays of test clip #8 in original WAV (a) and lossy-compressed MP3 (b), RealAudio (c), and Windows Media Audio (d) formats. (Click here for a refresher on the test clips.)
First, the spectrum analyzer display of the original file (a) clearly shows the quarter-band, mid-band, and three-quarter-band tones, with the quarter-band and three-quarter-band info 20 dB down (in both channels) from the mid-point tones.
Now look at MP3 (b), and you'll see little evidence that the algorithm has done any frequency masking. All the quarter and three-quarter tones that survived the low-pass filter were not eliminated by other portions of the encoder algorithm. Note that the 10,125-Hz quarter tone made it through the low-pass filter but the corresponding 10,750-Hz mid-point tone and 11,375-Hz three-quarter tones did not.
RealAudio (c): What a mess. I can't tell from looking at the frequency plot what's a distorted quarter or three-quarter tone and what's unwanted noise. You may have already noticed, but I intentionally set the amplitude of the original file's left-channel quarter and three-quarter tones to be identical to that of the right channel's mid-tones. I suspect this fact didn't simplify the encoder's job any, though it appears that as before, the mid-tones themselves, in both channels, survived the encoding process pretty well. It's just all the extra and altered stuff in between that's causing the problems.
What about WMA (d)? Remember earlier with the pink noise file (see "Artifact hunting I"), where the left channel survived pretty much unscathed, but the right channel came out looking very different from the original? Well, the same thing happened here. The quarter and three-quarter tones of the left (louder) channel remain. But right-channel quarter and three-quarter tones below the 18th of 24 critical bands simply disappear. The phenomenon is most visually obvious with critical band 0 data. Remember that this phenomenon isn't necessarily bad. Frequency masking theory tells us that even if the quarter and three-quarter tone data existed, we might not be able to hear it.
Now let's look at oscilloscope (time-based) displays of test clip #7 in lossy-compressed MP3 (a) and original WAV (b) formats, as well as test clip #8 in lossy-compressed MP3 (c) and original WAV (d) formats. (Click here for a refresher on the test clips.)
I mentioned earlier that the original quarter- and three-quarter-tone data seemed to survive MP3 encoding. But the MP3 compression algorithm wasn't immune to sound-altering behavior with test clips #7 and #8. Look at the additional, slowly decaying amplitude (in both channels) in the first few hundred milliseconds of the MP3-compressed version of clip #7 (a), representing increased volume not present in the original WAV file (b). For even stranger behavior, look at the clip #8 results (c versus d), where the increased amplitude in the left channel matches corresponding decreased amplitude in the right channel. I saw similar MP3 behavior with some of my other test tones, though not to the same magnitude. Neither RealAudio nor WMA exhibited this phenomenon.
Artifact hunting III: Temporal masking, or the sounds of silence
My attempts to find temporal masking weren't very successful, though they did reveal other strange encoder/decoder behavior.
Let's look at oscilloscope (time-based) displays of the first 200 milliseconds of test clip #9 in original WAV (a), and lossy-compressed MP3 (b), RealAudio (c), and Windows Media Audio (d) formats. (Click here for a refresher on the test clips.)
Take a look at the first 200 milliseconds of the original file (a). If temporal masking had occurred, you would have seen, in the compressed versions, a reduced-amplitude or even completely silent interval prior to the onset of the "normal" audio material, at the 50-millisecond point in the original WAV file. None of the MP3 (b), RealAudio (c), or Windows Media Audio (d) versions of the test tone exhibits such masking. Also note that all three lossy codecs appear to have preserved at least some of the channel-to-channel phase differences present in the original; one channel is a mirror image of the other.
What you should notice, though, is a couple of odd things. First, see how much signal attenuation the MP3 algorithm applies to the original clip, whereas the RealAudio and WMA clips are as "loud" as the original? Also, notice the mysterious 55-millisecond initial silent gap that MP3 inserted in the test tone, and a 45-millisecond gap inserted by WMA, gaps neither present in the original nor created by RealAudio.
Now let's examine oscilloscope (time-based) displays of the last 2 seconds of test clip #9 in original WAV (a) and lossy-compressed MP3 (b), RealAudio (c), and Windows Media Audio (d) formats. (Click here for a refresher on the test clips.)
Here we see that RealAudio saves its gap addition for the tail end of the sound clip. Compare (c) to (a). RealAudio stuck 1.385 seconds of silence, created at either the encode or decode stage, or both, onto the end of the test tone. MP3's back-end-added gap (b), at 50 milliseconds, was smaller but still present, while WMA (d) stuck an even smaller 30 millisecond gap onto the end of the test tone.
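I read these gap lengths off the oscilloscope display, but the same measurement could be automated along the lines of the sketch below. It assumes the decoded clip is already in memory as a list of sample values for one channel; the helper names and the zero threshold are hypothetical choices.

```python
def leading_silence_ms(samples, rate_hz=44_100, threshold=0):
    """Duration, in ms, of the run of silent samples at the start of a clip.

    A sample counts as silent when its magnitude is at or below threshold.
    """
    for index, value in enumerate(samples):
        if abs(value) > threshold:
            return 1000 * index / rate_hz
    return 1000 * len(samples) / rate_hz  # the whole clip is silent


def trailing_silence_ms(samples, rate_hz=44_100, threshold=0):
    """Duration, in ms, of the run of silent samples at the end of a clip."""
    return leading_silence_ms(samples[::-1], rate_hz, threshold)


# Toy clip: 441 silent samples, two non-silent samples, 882 silent samples.
clip = [0] * 441 + [100, -100] + [0] * 882
print(leading_silence_ms(clip), trailing_silence_ms(clip))
```

Subtracting the original file's figures from the decoded file's figures would give the padding that a codec added at each end. With dithered material, a small nonzero threshold would be needed, because "silence" then contains low-level noise.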
Why didn't I find temporal masking? Who knows. Keep in mind that I'm choosing very specific frequencies, as well as masked and masking tone amplitudes. Changes in any of these source variables could provide the "trigger" to turn temporal masking on, as could compressing to a different bit rate or with undocumented encoder settings turned on. For example, I've noticed that versions of the Fraunhofer MP3 engine in some software packages enable you to select whether you'll let the encoder use channel-combining joint stereo techniques or not, while versions found in other packages don't give you this customization.
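One widely used channel-combining technique is mid/side coding, sketched below in its generic textbook form. I'm not claiming that this is what any particular Fraunhofer build does internally; the function names are illustrative only.

```python
def to_mid_side(left, right):
    """Generic mid/side transform: mid is the average, side is half the difference."""
    return (left + right) / 2, (left - right) / 2


def from_mid_side(mid, side):
    """Inverse transform, recovering the left and right samples."""
    return mid + side, mid - side


# Identical channels: everything lands in mid, and side is empty.
print(to_mid_side(0.4, 0.4))

# Channels 180 degrees out of phase, as in my critical-band clips:
# mid is empty, and the entire signal lands in side.
print(to_mid_side(0.4, -0.4))
```

The second case hints at why my out-of-phase test clips are an awkward input for an encoder that expects most stereo material to be dominated by the mid signal.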
Artifact hunting IV: Echo
Last but not least, let's look for echo artifacts. First off, what causes echo in the first place? One of the first steps taken by nearly all lossy-compression audio algorithms (as well as by those for still images, such as JPEG, and video, such as MPEG) involves converting groups of contiguous samples (called frames) from their time-domain representation to the frequency domain. This process, known by such names as the Fourier Transform, is similar to the one done by my computer to create the spectrum analyzer plots you've viewed in these sidebars.
Once in the frequency domain, the encoder decides which portions of the frame's data are inaudible and therefore appropriate to diminish in importance or even discard. Since we don't live in an ideal world, however, this culling process can inject quantization and other noise into the frame. The corresponding frequency-to-time retransform within the decoder spreads this noise throughout all the samples in the frame.
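The noise-spreading effect is easy to reproduce with a toy experiment. The sketch below is not any real codec's algorithm: it uses a plain discrete Fourier transform on a single 64-sample frame and a crude uniform quantizer, and the frame length, onset position, and step size are all arbitrary values chosen for illustration.

```python
import cmath
import math

N = 64        # frame length (hypothetical; real codecs use longer frames)
ONSET = 40    # the transient starts here; everything before it is silence
STEP = 0.25   # coarse quantizer step applied to each spectral coefficient


def dft(x):
    """Naive forward discrete Fourier transform of a real frame."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def quantize(c, step=STEP):
    """Round the real and imaginary parts to the nearest multiple of step."""
    return complex(round(c.real / step) * step, round(c.imag / step) * step)


# Silence, then a decaying burst that stands in for a castanet click.
frame = [0.0] * ONSET + [0.8 ** m * math.cos(1.1 * m) for m in range(N - ONSET)]
decoded = idft([quantize(c) for c in dft(frame)])

pre_echo_energy = sum(v * v for v in decoded[:ONSET])
print(f"energy before the transient: original 0.0, decoded {pre_echo_energy:.6f}")
```

The original frame is perfectly silent ahead of the burst, but the decoded frame is not: the rounding error made in the frequency domain comes back as low-level noise smeared across every sample in the frame, including those that precede the transient.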
Ordinarily, the noise isn't a big deal. The "real" audio data covers it up. Similarly, temporal masking can hide noise injected after a sharp audio transient, such as a tap on a cymbal, or a handclap. But prior to a transient, the "real" audio information is best-case quiet, or worst-case inaudible. Pre-echo not only muffles transients, it also injects annoying hiss into the supposedly silent gaps between transients.
Take a look at oscilloscope (time-based) displays of the 15th transient in the EBU's SQAM file #27, in original WAV (a) and lossy-compressed MP3 (b), RealAudio (c), and Windows Media Audio (d) formats. (Click here for a refresher on the test clips.)
I was listening to uncompressed and lossy-compressed versions of clip #27 when my wife walked into the room and sat down for an audition. Even with her "untrained" ears and my PC's cheap speakers, and without my prompting, she immediately identified the echo noise in the 64-kbit/sec RealAudio file (c), echo that's neither present in the original WAV (a) nor injected into the MP3 (b) or WMA (d) files. The above figures show a zoomed-in version of the 15th transient in clip #27.
The added noise prior to the onset of the transient in the RealAudio file should be pretty obvious to your eyes. And believe me, it'll be obvious to most folks' ears too. In fairness to RealAudio, I need to point out that different codecs use different frame sizes for their time-to-frequency transforms, so other types of transients might trip up MP3 and WMA, too. More advanced codecs minimize pre-echo by supporting multiple frame sizes, with less-efficient smaller frames used when the encoder detects a transient, and longer frames used for more conventional material.
Author information
Contributing Editor Brian Dipert (bdipert@pacbell.net) is a technical editor for EDN (www.ednmag.com). The audio system in his first car was worth more than the car itself. But the car was a '75 Vega, so that's not saying much.