Inside the Digital Den
December 2000
Continuing codec capers (continued)
Reader feedback and lossy-codec-analysis plans
Brian Dipert, Contributing Editor
The first installment in this project, September's "Codec capers," benchmarked various lossless audio-compression algorithms and provided a sneak peek at lossy compression. In that article, I promised that my subsequent December writeup (this one) would include a corresponding analysis for lossy codecs. I didn't quite accomplish that goal, although as I write, I'm in the midst of the analysis work. I'll eventually post the results and my conclusions at the Web home of this ongoing project, "Continuing codec capers."
In the meantime, my first article elicited some interesting feedback and sparked some rewarding explorations. So I've written this update, which also details my plans for the lossy-codec testing.
First, I'll share some feedback I received from my September article. You might also be interested to hear (pun intended) about my experience at September's Audio Engineering Society conference, where I auditioned Verance's perceptual-watermarking technology, which is scheduled to appear both on DVD Audio discs and in SDMI-compliant digital audio files (see the sidebar "Don't consume(r) the water(mark)").
Reader responses
David Bryant, developer of the WavPack algorithm, cleared up some of my vendor-created confusion (and tempted me to do some more analysis). "The compressors you tested actually are quite similar (despite what they may claim)," Bryant wrote. "However, two of the compressors that you didn't test (LPAC and RKAU) are different and produce much better compression, albeit at much slower speeds. However, 'much better' compression in this area is maybe only 5 percent or maybe a little more; hardly earthshaking."
Although I'd tested the latest version of WavPack available at the time, Bryant has been hard at work improving his product. Several other algorithm developers also follow an aggressive frequent-update schedule. In fact, this rapid-revision phenomenon was a big factor in my choice not to include the increasingly popular Monkey's Audio algorithm (for example) in my study. When software goes through multiple uprevs in a short time, analysis results are obsolete long before they could appear in print.
"If you had tested the latest version of WavPack (3.6b), you would have had more variety in the results," Bryant continued. "The new WavPack is a 32-bit program, so it executes much faster than the version you tested, and I have put in a new 'high' compression option (-h) that gives compression ratios much closer to the best programs—while still being reasonably fast. I do this by including more samples in the predictor, but I still retain an all-integer approach."
If you would like to test new versions of WavPack, any of the other algorithms I evaluated, or codecs that I didn't have time to benchmark, you can find test files from the "Continuing codec capers" home page. I've also listed the songs I compressed. As always, I welcome your feedback.
In my brief look at lossy compression in my September article, I wrote that I might have given the encoder a too-easy task with my first stab at a test-tone sequence. Reader feedback confirmed my suspicion, and in hindsight I'm a little embarrassed that I didn't realize at the time what was going on (can I blame deadline pressure?). Most, if not all, lossy encoders transform groups (technically, frames) of successive-time samples into their combined frequency-spectrum representation. The encoder then works its filter-and-discard magic on the frequency-domain data. The corresponding decoder then performs a frequency-to-time retransform as part of audio playback.
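The transform, discard, and retransform sequence described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the function names are hypothetical, the 64-sample frame is far shorter than real codecs use, and "keep the strongest bins" is a crude stand-in for a real psychoacoustic model, not any shipping encoder's method.

```python
import cmath
import math

def dft(frame):
    """Naive time-to-frequency transform of one frame (fine for tiny frames)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Frequency-to-time retransform; the input was real, so keep the real part."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def keep_strongest(spectrum, count):
    """Assumed discard rule: zero every bin except the `count` largest ones."""
    ranked = sorted(range(len(spectrum)), key=lambda k: abs(spectrum[k]), reverse=True)
    kept = set(ranked[:count])
    return [c if k in kept else 0j for k, c in enumerate(spectrum)]

def codec_round_trip(frame, count):
    """Encode (transform, discard) and decode (retransform) one frame."""
    return idft(keep_strongest(dft(frame), count))

def rms_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

FRAME = 64  # illustrative only; real frames run to hundreds or thousands of samples

# One pure tone per frame, like the original test sequence.
single = [math.sin(2 * math.pi * 4 * t / FRAME) for t in range(FRAME)]
# Six simultaneous tones, closer to real program material.
multi = [sum(math.sin(2 * math.pi * k * t / FRAME) for k in (3, 7, 11, 15, 19, 23)) / 6
         for t in range(FRAME)]

single_err = rms_error(single, codec_round_trip(single, 2))
multi_err = rms_error(multi, codec_round_trip(multi, 2))
```

With the same two-bin budget, `single_err` comes out essentially zero while `multi_err` remains a large fraction of the signal level, which is the "too-easy task" problem in miniature.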
Admittedly, my original increasing-frequency test tone was a bit challenging to the encoders; the two audio channels were 180 degrees out of phase with each other. But I strung together a series of 2.5-second individual-frequency tones, instead of mixing multiple frequencies together at each 44.1-kHz-sampled data point. This decision meant that for nearly all of the frames of time-to-frequency-transformed information (each several hundred or several thousand samples long), the encoder was able to devote all of the available compressed audio bits to just that frequency, dramatically simplifying its job as compared to real-life, multifrequency music. The pink and white noise test files proved more challenging to the encoders, but, because they sounded like broad-spectrum noise even before lossy encoding, subtle compression effects were easily obscured.
Arny Krueger, who maintains the PC ABX audio testing Web site (www.pcabx.com), wrote to recommend a better approach. "I find it is helpful to use complex multitones for testing, because they tend to load up the codec with multiple concurrent frequencies, just like real-world program material," he said. "Pink or white noise is generally no stress at all for most good coders, because a fully loaded coder will reduce the signal-to-noise ratio in the frequency bands, producing noise with about the same power-spectral density as the original, but not actually the original noise."
Ken Gundry, principal staff engineer at Dolby Labs, echoed and expanded on Krueger's comments. "Perceptual coders work by dividing the spectrum up into small bands, and then either discarding or sending with lesser precision those bands that contain little, or whose content is masked by sounds in other bands," Gundry wrote. "It is clearly easiest to save bits when many bands can be discarded. Conversely, when the spectrum of the signal contains significant power in all or most bands, none can be discarded, and the system may run into bit starvation. Thus impairments tend to show up on constant or slowly changing signals with what someone here [at Dolby Labs] refers to as a picket-fence spectrum—with (audibly) significant spectral lines over a wide range of frequencies, leaving no gaps to be discarded. That is why you will come upon references in the published comparisons of perceptual codecs to a pitch-pipe, which happens to have such a spectrum. From a colleague who heard this over digital radio (MPEG-2), bagpipes also can lead to serious audible impairment. There are those who would say this was as good a reason as any for not transmitting bagpipes!"
My plans
At minimum, I'll be testing the following algorithms:
MP3 (Fraunhofer encoder in MusicMatch Jukebox, and Fraunhofer decoder in WinAmp)
AAC (encoder and decoder software from Fraunhofer)
Windows Media Audio (WMA) (encoder built into Sonic Foundry's Sound Forge, and decoder in the Windows Media Player; see the sidebar "Music mysteries" for more on Windows Media Audio)
RealAudio (encoder built into Sound Forge, and decoder in RealNetworks' RealJukebox)
ATRAC (version 4.5, using Kenwood's MD-203 MiniDisc deck; see the sidebar, "On track with ATRAC")
ePAC (VedaLabs AudioVeda software).
When possible, I'll examine the output of both "high quality" and "high performance" settings, and I'll look at a range of bit rates: 64, 96, 128, 160, and 192 kbits/sec. I will not be evaluating the optional variable-bit-rate setting some codecs offer. Also, note that the ATRAC bit rate is fixed at just under 300 kbits/sec, giving this codec a built-in advantage. Keep this fact in mind as you compare it to the others.
As time allows, I'll add other MP3 encoders to the list, so keep checking "Continuing codec capers" for updates. Specifically, I'd like to look at the Xing algorithm used by RealNetworks and the open-source Blade and LAME encoders. Secondhand evidence suggests that Xing trades lower quality for higher encoding speed, while the open-source encoders are highly customizable, meaning they'll be quite a challenge to fine-tune and that they'll be open to plenty of alternative opinions based on different settings. I'd also like to evaluate the Ogg Vorbis codec, QDesign's QDX, and TwinVQ. Further down on the priority list, but still possible (either by myself or by you, dear readers), are lossy codecs such as AC-3 (also known as Dolby Digital), apt-X, ATELP, DTS, Indeo, and TAC.
For each algorithm, bit rate and quality-vs-speed setting combination, what will I be testing? Quantitatively, I'll datalog the encode, decode, and simultaneous encode-while-decode speed on my reference PC (a Pentium III-800). I'll also attempt to determine the amount of memory the algorithm consumes, as well as the percentage of total available CPU resources used. I'll also pull up the decoded output in my audio software's waveform and spectrum analyzer displays and look for common lossy compression artifacts. These include elimination of channel-vs-channel phase differences (a technique also known as joint stereo), full stereo-to-mono conversions at certain frequency ranges, pre- and post-echo noise around abrupt audio transitions, filtering, frequency and temporal masking, and others.
I'll generally resist the urge to get caught up in qualitative "this one sounds better than that one" comparisons. However, when I find that one codec sounds particularly better or worse at a given bit rate, I won't be shy about mentioning it (especially given my newfound confidence in my ears, which are apparently more gold and less tin than I previously believed; see the sidebar "Don't consume(r) the water(mark)"). I'll be listening to the compressed sound clips using home theater, PC, automobile, and portable equipment, and through both speakers and appropriate-quality headphones.
And what will I be listening to? Well, first the multiple-genre songs listed in the table, "Music samples." Pink and white noise clips remain useful to identify low-, high-, and band-pass filtering and stereo-to-mono conversion. The castanet and other percussion clips from the European Broadcasting Union (EBU) Sound Quality Assessment Material (SQAM) disc should pretty quickly uncover transient-induced echo artifacts. Another synthesized audio clip intended to identify joint stereo effects, which I'll create, will combine tones at the midpoints of each of the human auditory system's critical frequency bands (see table), with the two channels out of phase and at both identical- and different-amplitude settings. To test for frequency and temporal masking effects, I'll also create versions of these clips that contain lower-amplitude and both time-delayed and time-advanced tones at other percentage points within each critical frequency band.
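As a rough illustration of the synthesized joint-stereo clip described above, the following Python sketch mixes one tone at the midpoint of each critical band and inverts the second channel, which puts every tone 180 degrees out of phase. The band edges are the commonly cited Bark-scale values and are an assumption here, since the article's own table is not reproduced in this text; the function names and the `right_gain` parameter are likewise hypothetical. The lower-amplitude, time-shifted masking tones are not covered.

```python
import math

# Commonly cited Bark critical-band edges in Hz. This is an assumption:
# the article's own critical-band table is not reproduced in this text.
BARK_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
                 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
                 9500, 12000, 15500]

SAMPLE_RATE = 44100  # CD-rate sampling, as used for the test files

def band_midpoints(edges):
    """Arithmetic midpoint of each adjacent pair of band edges."""
    return [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]

def multitone_stereo(freqs, seconds, right_gain=1.0, sample_rate=SAMPLE_RATE):
    """Return (left, right) lists of float samples in [-1, 1].

    The right channel is the left channel inverted (every tone 180 degrees
    out of phase) and scaled by right_gain, so right_gain=1.0 gives the
    identical-amplitude case and other values the different-amplitude case.
    """
    count = round(seconds * sample_rate)
    scale = 1.0 / len(freqs)  # keeps the summed tones within full scale
    left = [scale * sum(math.sin(2 * math.pi * f * n / sample_rate) for f in freqs)
            for n in range(count)]
    right = [-right_gain * s for s in left]
    return left, right

MIDPOINTS_HZ = band_midpoints(BARK_EDGES_HZ)
# Short different-amplitude example: right channel at half the left's level.
left, right = multitone_stereo(MIDPOINTS_HZ, seconds=0.05, right_gain=0.5)
```

Because the channels are exact inverses at equal amplitude, any encoder that sums them to mono in some frequency range will cancel the tones in that range, which should make the effect easy to spot in a spectrum display.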
Music mysteries
Microsoft steadfastly refuses to reveal the implementation details behind its Windows Media Audio compression algorithm. Aficionados generally agree that this codec has good performance coupled, uncharacteristically, with fast encoding speeds.
Several independent listening tests—as well as my own so-far-unscientific experiences listening to WMA-encoded music and test tones—suggest that at bit rates one-third to one-half lower than the commonly employed 128- to 192-kbit/sec MP3 streams, its quality is at least as good as the industry-standard AAC algorithm, perhaps better. In the absence of official word from Redmond, audio enthusiasts such as myself are doing our best to figure out how WMA works its magic.
In a 1999 EDN article ("Digital audio breaks the sound barrier"), I suggested that Microsoft might be employing a variant of the vector-quantization technique used by a lossy codec called TwinVQ. The TwinVQ encoder and decoder both contain a prefabricated set of data coefficients called a codebook, representing what the algorithm's developers believe to be the most common sets of per-frame frequency combinations found in audio. The TwinVQ encoder, after performing time-to-frequency transformation and compressing each frame of audio data, finds the closest match in its codebook and, instead of sending the actual coefficient data, sends the codebook index. The decoder uses this index to pull the matching data approximation from its matching codebook, which it then outputs.
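The codebook lookup just described can be sketched generically in Python. This is plain nearest-neighbor vector quantization, not TwinVQ's actual algorithm or codebook; the four-entry codebook, the four-coefficient frames, and the function names are all made up for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two coefficient vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode_frame(coeffs, codebook):
    """Encoder side: return the index of the closest codebook entry."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(coeffs, codebook[i]))

def decode_index(index, codebook):
    """Decoder side: look up the approximation in the matching codebook."""
    return codebook[index]

# Hypothetical prefabricated codebook shared by encoder and decoder.
CODEBOOK = [
    (0.0, 0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0, 0.0),
    (0.0, 1.0, 0.5, 0.0),
    (0.2, 0.2, 0.8, 0.9),
]

frame = (0.1, 0.9, 0.6, 0.1)           # one frame's frequency coefficients
index = encode_frame(frame, CODEBOOK)  # only this small index is transmitted
approx = decode_index(index, CODEBOOK)
```

Here the encoder picks entry 2, so the decoder outputs `(0.0, 1.0, 0.5, 0.0)` rather than the original frame; the difference between the two is the quantization error that a poorly matched codebook would make audible.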
Since each codebook index is much smaller than the actual data, vector quantization can produce impressive file-size reductions. But there's a tradeoff. If the codebook is poorly constructed, or if the encoder doesn't do a good job of matching the actual data to a codebook entry, the compressed audio can sound terrible. Therefore, several variations on the basic vector-quantization technique are possible. Instead of using a preassembled codebook, the encoder might create it on the fly based on the specific characteristics of the audio it's compressing.
The advantage of this on-the-fly approach is that you're more likely to get good matches between data samples and codebook entries. However, now you have to transmit the codebook along with the compressed audio, because the decoder won't already have it. Therefore, the larger (and theoretically better quality) the codebook, the larger the compressed bit stream and the less efficient the compression results. Also, on-the-fly codebook creation is computing-intensive, leading to slow encoding speeds. As an interim step, therefore, the vector-quantization algorithm may rely mostly on a pre-fabricated codebook, supplementing it with a smaller codebook appendix created on the fly.
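A back-of-the-envelope sketch of that codebook-size tradeoff follows. The function name, the 16 bits per coefficient, and the example frame counts and codebook sizes are illustrative assumptions only, not figures from any real codec.

```python
def vq_stream_bits(frames, entries, dims, bits_per_coeff=16, send_codebook=True):
    """Total bits for a vector-quantized stream.

    Each frame costs one index wide enough to address every codebook entry;
    a codebook built on the fly must also be transmitted once.
    """
    index_width = (entries - 1).bit_length()  # bits needed per index
    index_bits = frames * index_width
    codebook_bits = entries * dims * bits_per_coeff if send_codebook else 0
    return index_bits + codebook_bits

# 1,000 frames of 32 coefficients each (made-up numbers).
prefab = vq_stream_bits(1000, 256, 32, send_codebook=False)  # decoder already has it
small_on_the_fly = vq_stream_bits(1000, 256, 32)
large_on_the_fly = vq_stream_bits(1000, 1024, 32)
```

With these numbers the indexes alone cost 8,000 bits, a transmitted 256-entry codebook raises the total to 139,072 bits, and a 1,024-entry codebook raises it to 534,288 bits, so the codebook quickly dominates a short clip.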
In response to that EDN article, Sean Alexander, Product Manager for Microsoft's digital media division, wrote that "we don't do vector quantization." However, some reported characteristics of Windows Media Audio lead me to suspect that although Microsoft might not be employing a strictly defined vector-quantization approach, its algorithms might be analogous to, or derived from, techniques similar to TwinVQ. I've heard from several sources, for example, that short snippets of WMA-encoded audio sound better at a given bit rate than entire encoded songs. This feedback wouldn't make sense if, like MP3 and many other perceptual coders, WMA simply transformed and compressed several-millisecond-long multisample frames in a one-at-a-time fashion.
Quality degradation with increasing audio duration could, however, occur if the encoder were creating a codebook on the fly. Longer audio sequences tend to contain more randomness than shorter clips, and a fixed-size codebook approximates that randomness less accurately. A back-and-forth discussion on Internet newsgroup rec.audio.pro between Arny Krueger, maintainer of the PC ABX audio testing Web site, and Amir Majidimehr, Microsoft's general manager for digital media, also hints at this phenomenon (to find the thread, search on topic "Sound & Vision's 'Download Showdown' MP3 versus AAC & Windows Media"). Majidimehr was unable to replicate Krueger's results, in which WMA-encoded versions of test clips exhibited pre-echo, quantization noise, missing information, and other artifacts. What Krueger and Majidimehr ended up figuring out was that whereas Majidimehr encoded each test clip separately, Krueger combined all of his test clips into one big file before running them through the WMA encoder.
Majidimehr wrote in one of the newsgroup postings that "Indeed, a combined input file of all the samples does generate different results than encoding each clip, which is what I did. If you want to gang encode a lot of samples, it is necessary to leave sufficient space between them [Editor's note: Later in the posting Majidimehr recommended 5 seconds] so that you truly get independent samples. Otherwise, your results only represent the composite, as the previous clip may change the outcome on some codecs such as ours." I hesitate, though, to absolutely conclude that WMA is using vector-quantization techniques, for two reasons. First is Alexander's "no vector quantization" comment, although something peculiar is definitely going on.
Second, WMA exhibits a curious discrepancy from TwinVQ (and from similar vector-quantization approaches to compressing other multimedia, such as still images and video). Vector quantization has a reputation for being extremely slow, particularly when the algorithm incorporates on-the-fly codebook creation. WMA encoding, however, is reputed to be very fast, compared with other lossy compression routines at comparable bit rates. Perhaps Microsoft's engineers have just written tight code that takes advantage of processor acceleration hardware such as MMX instructions. I'll dig into the algorithm myself and report back any interesting results I see.
Don't consume(r) the water(mark)
In September, the folks at Verance invited me to Sony Music Studios (Santa Monica, CA) for a listening test of their watermarking technology. Audio watermarking hides media identification bits within the audio bit stream, in a supposedly inaudible fashion.
A number of different watermarking alternatives exist, including spread-spectrum approaches that derive from the same mathematical theories behind CDMA digital cellular systems, and perceptual coding, which Verance employs. As the name implies, perceptual watermarking uses similar techniques to those in MP3, AAC and other perceptual audio-compression algorithms, with similar artifacts.
The half-hour audition Verance gave me was an abbreviated version of the testing they'd been conducting for a number of months with a large number of listeners as they fine-tuned their algorithms. I was able to choose from among several different music clips. Ten times I first listened, for as long as I wanted, to "clear" original audio (sample A), then watermarked audio (sample B), then to an unidentified sample X. I was then asked to determine whether sample X matched sample A or B. This testing method is known as ABX testing.
Qualifiers first. I knew that perceptual watermarking probably produced perceptual compression-like phenomena, for which my ears were already pre-tuned from my codec work of the past few months. I also intentionally picked an audio selection that would most accentuate these phenomena—a jazz passage with lots of high-frequency energy and sharp percussive transients.
Now the results. In all 10 cases, I was able to clearly identify differences between Samples A and B, such as less distinct, more muddled transients in the latter, as well as reduced high-frequency information and a narrower stereo image. And, more importantly, in 8 of 10 attempts I correctly matched Sample X to either Sample A or Sample B. This success rate translates to only about a 5 percent probability that pure guessing would have done at least as well. And I'd go so far as to say that, practically, the probability was even lower, because one of the two matches I missed was the 10th, when my accuracy was probably diminished due to listening fatigue.
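The arithmetic behind that roughly 5 percent figure can be checked with a short binomial calculation. The sketch assumes each ABX trial is an independent 50/50 coin flip for a listener who is purely guessing; the function name is hypothetical.

```python
from math import comb

def chance_of_at_least(correct, trials, p_guess=0.5):
    """Probability of getting `correct` or more right out of `trials` by guessing."""
    return sum(comb(trials, k) * p_guess ** k * (1 - p_guess) ** (trials - k)
               for k in range(correct, trials + 1))

# 8, 9, or 10 correct out of 10: (45 + 10 + 1) / 1024
p_value = chance_of_at_least(8, 10)
```

The result is 56/1,024, or about 0.055, which matches the roughly 5 percent quoted above.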
Here's how I predict Verance will dispute my results. First, the company will claim that if I ran the experiment again, my success percentage might be lower. I have no way to dispute this possibility. But my success percentage might just as well improve.
Secondly, the company will claim that I was predisposed to hearing perceptual artifacts because of my previous experience auditioning lossy compression algorithms. This claim I can more readily dispute. In fact, much of the criticism directed to date at Verance centers on the fact that the company pulled people off the street and ran the tests without pre-training them. Audio experts pretty much agree that the types of defects created by lossy compression of both audio and video become more noticeable as consumers spend more time with the material. For example, I certainly notice MPEG video-compression artifacts more now than I did when I watched my first DVD.
Third, Verance will claim that the audio material I chose was worst-case and not indicative of results across the broad spectrum of music genres. Again, without a more extensive listening test I can neither refute nor support this viewpoint. I will say that the watermarking effects were far subtler than those created by, say, 128-kbit/sec MP3 compression. What's more, they would be masked fairly effectively by low-cost portable audio players, by the reduced dynamic range and frequency range of an automobile listening environment, and even by home stereos of less-than-audiophile quality. To Verance's credit, I'll also point out that in no case did Sample B sound "bad," and that in the absence of the clear Sample A as a reference, I'd probably be perfectly happy listening to Sample B.
But that's not the point. The DVD Audio discs with which Verance watermarking will debut target discriminating audiophiles. Even putting debatable reality aside, why would these folks pay thousands of dollars for equipment, and dozens of dollars per disc, for sound that they perceive to be altered by a watermark? I don't dispute the music industry's valid right to protect content they own from copyright infringement. But watermarking appears to be a technology that only benefits the sellers, not the buyers. And in fact, for the latter, it's a detriment. As such, from where I'm sitting it seems destined to fail.
On track with ATRAC
Unlike other codecs, the ATRAC encode/decode algorithm used by MiniDisc equipment runs in dedicated hardware, not in software residing on a PC. Most MiniDisc units, such as my Sharp MT-MD15 portable player/recorder, provide digital inputs but not digital outputs. Without moving the data to my PC, I can't analyze it. But it would be unfair for me to subject the already-altered ATRAC audio to further degradation (digital-to-analog conversion coming out of the player's headphone or line output and analog-to-digital conversion coming into the PC's line input). So, for several weeks I regularly pestered the MiniDisc manufacturers (Kenwood, Sharp, Sony) to let me borrow a higher-end MiniDisc player/recorder with full digital I/O.
My efforts failed. As soon as I mentioned that I wanted to perform benchmarking against other codecs, responses to my emails and phone calls mysteriously dried up. Fine. Two can play that game. I went to eBay and bought a Kenwood MD-203 player/recorder. So there.
I'll be running the Kenwood's digital outputs directly into the digital inputs of my PC's sound card. I should also point out that I'll be testing version 4.5 of the original 292-kbit/sec ATRAC algorithm, not the newer 132-kbit/sec and 66-kbit/sec variants, variously known as ATRAC3 and MDLP (for MiniDisc Long Play). ATRAC3, developed in response to MP3 and other high-quality, low-bit-rate codecs, finds use in newer Sony MiniDisc players as well as the Memory Stick Walkman and Vaio Music Clip.
Author information
Contributing Editor Brian Dipert (bdipert@pacbell.net) is a technical editor for EDN (www.ednmag.com). The audio system in his first car was worth more than the car itself. But the car was a '75 Vega, so that's not saying much.