In my quest to beat PNG compression I’ve side-tracked a bit. I was working on my old Haartransform-like algorithm and tinkered a bit with it when I came up with a algorithm which could potentially also handles soft-waveforms (haar transform is usually only good with crisp data , like a square waveform). This can be well used for image compression but also for audio. So as an experiment I decided to create an lossy audio codec for it. At the end of this post you can download the codec, and listen to some samples of it. Note that it does NOT beat MP3…
The algorithm is a divide-and-conquer recursive procedure which stores the average and the difference. The idea behind the algorithm is that there are more lower frequencies than higher in a wave-form, and that you should find a efficient way of coding these. The algorith operates somewhat like this:
average = (sample1 + sample2) / 2
difference = sample1 - average
You can now reconstruct the actual samples using average and difference like this:
sample1 = average + difference
sample2 = average - difference
Now this is very similar to a haar transform. When using real values no information is lost in this process.
Now image in a set of 8 samples:
0, 2, 3, 2, 0, -2, -3, -2

Which is a rather crude waveform. When we take for each the average and the difference, we have 4 averages, and 4 differences. We then store the averages in the left-side and the differences in the right side. Effectivly we then seperated the lower frequenies (the averages) from the higher frequencies (the differences).
Applying the function one time the the sequence above will result in something like:
low: 1, 2.5, -1, -2.5, high: -1, +0.5, +1, -0.5
Now in the first 4 samples you might recognize the old wave-form, only its frequencies twice that of the original. The second part are the ‘high’ frquencies. Now we can apply the same transformation to the first part of the of the image, but also to the second part. Which will cause the first part, 4 samples to be split into two halves: 2 low frequences, and 2 high frequences. Applying then the transformation again to the blocks of 2 samples will cause it also to be split into a high-frequency part and a low frequency part. I will show this with only the first block:
1, 2.5, -1, -2.5 => low: 1.75, -1.75 and high: 0,75, 0.75
Then again on the low frequencies:
1.75, -1.75 => low: 0, high: 1.75
The lowest frequency is always the average. Now normally I would apply the algorithm on all blocks in the transformation, not only the lowest. This will cause an interesting decomposition. Reconstructing the wave-form is the exact opposite. Lets get back to the two 4 sample blocks with high and low frequencies:
low: 1, 2.5, -1, -2.5, high: -1, +0.5, +1, -0.5
Imagine I throw away all the high frequencies, make them 0:
low: 1, 2.5, -1, -2.5, high: 0, 0, 0, 0
And reverse the decomposition process, I would get the following wave-form:
1, 1, 2.5, 2.5, -1, -1, -2.5, -2.5
What happend! Your smooth waveform became a square wave! Shown together with the original samples it looks like this (where the bars are the new data)

Now this is bad news for any audio-wave-form. Not only did we remove the old high-frequency component, we also introduced a new high-frequency, that of a square wave. It will make your audio sound like a cheap musical greetings-card!
So the secret to the solutions lies in using interpolation when cacluating the differences. Not only do you take the average of s1 and s2, but also the average of your left neigbour to and your right neigbour. Using the interpolation you use a smooth-upscaling algorithm, removing the annoying square wave-forms.
To upscale smoothly we assume some relation to the average of its neigbour. This means that the left sample is one time its left neigbour average + 3 times its own average divided by 4. This is because the left sample we try to find is 0.5 removed from the mid-point of the nearest average and 1.5 removed from the mind-point of the left average. This means that the nearest average is 3 times as important as the left-average. Since we upscale only by 2 anything beyond 1 sample away is not important.
The following diagram tries to clarify the upscaling algorithm:

Probably not doing a good job, but study the source code okay!
So what do you get when we upscale like this using the previous decomposition:
low: 1, 2.5, -1, -2.5, high: 0, 0, 0, 0
We get the following results
0: 1 (we don't have a left neighbour - normal algorithm uses extrapolation here)
1: (1 * 3 + 2.5) / 4 = 1.38
2: (1 + 2.5 * 3) / 4 = 2.12
3: (2.5 * 3 - 1) / 4 = 1.625
4: (2.5 - 1 * 3) / 4 = -0.125
5: (-1 * 3 - 2.5) / 4 = -1.38
6: (-1 - 2.5 * 3) / 4 = -2.125
7: -2.5 (we don't have the right neigbour - normal algorithm uses extrapolation here)
This resuls in the following diagram:

Its not perfect yet, but its an improvement to the square resize. The actual algorithm also uses some average compensation and extrapolation around the averages. Here is a diagram showing the output after the correction and extrapolation have been added

Note that the recreated wave only uses the average data, meaning its constructed of only half of the information of the original wave.
This is because the average of the two interpolated samples should when averages still be the same as the original nearest neighbour average. Why is that? Well,we are trying to store the sample data in the same number or data elements as the original samples. When the interpolated average is the same as the nearest neigbour the difference between the guessed samples are the real samples is the same, e.g:
|guessedSample1FromAverageA - realSample1| = |guesedSample2FromAverageA - realSample2|
|guessedSample3FromAverageB - realSample3| = |guesedSample4FromAverageB - realSample4|
This is important because when we only need to store |guessedSample1FromAverageA - realSample1|. The table below shows this. Its the data shown in the graph above + the difference from the original

Now it clear to see that the wave can be reconstructed using the average and the difference, and that it would only take up a total of 8 samples, 4 for the lower (averages) and 4 for the higher (differences):
1, 2.5, -1, -2.5 and 0.38, -0.25, -0.63, -0.38.
Now applying the same transformation to both the high and the low components again, until each block to transform is only 2 samples.
The inverse is simply the oposite, reconstructing the weighted averges and adding or substracting the highfrequency components recusivly.
Now we have a reversable transform which also handles smooth data well because of weighted averages instead of nearest neighbour. Now for my image format I am creating a 2D version of this algorithm. For audio data you just need a 1D version.
So now we have some ‘Transform’ thingy. Whats up with that?
Well, as said before this algorithm basically decomposes a set of samples into frequency-amplitudes of wave-forms. The key lies in descarding low-amplitude frequency components, which are unhearable anyway. The important frequencies stay intact, while the unimportant are zeroed. Often its possible to zero the larger part of the frequencies. After that you can push the output of the transform through a statistical compression algorithm like the DEFLATE, and you get pretty good compression.
Here is an example of the audio compressors output (using ogg as media format
):
The original tune:
- original-bigyellow.ogg
The tune compressed at normal setting and decompressed:
- normalcompression-bigyellow.ogg
The tune compressed at high setting and decompressed (sounds aweful):
- highcompression_bigyellow.ogg
Now as you’ll notice the highly compressed audio has more plings and ploings. This is because the subtleties of the sound have gone, only the louder frequencies remain. You can download the original Java source codec here , named ‘Olaf’ (Open Lossly Audio Format).I guess the compression ratio lies around 1/4 for CD quality (note that MP3 is 1/10 for near CD quality).
Though its not very good it could be improved quite a bit. For example overlaying could remove the ~40Hz buzz which comes to play when compressing at higher ratios. Smarter encoding of the remaining frequencies could increase the compression ratio quite a bit.
As far as I know it does not make use of any patented algorithms, and no freaking descrete cosine transform (which everyone seems to love). Olaf is extremly quick to decode and requires no floating point operations
Maybe I’ll turn Olaf into a usable format. For now its just a proof of concept, and contains many bugs and inefficiencies.