[Translation] A Newbie's View: Compression
Posted: 2004-9-1 10:12
A Newbie's View:
Compression
-=-=-=-=-==-=-=-=-=-
Phase I : Introduction
Compression is the art of reducing the size of raw data.
I've seen compression called encryption in some places,
which can be partly true, since encryption is the art of
hiding data as other data, and compression does that too.
When you first think of compression, you may imagine some
number that divides every byte/word in the data, thereby
compressing it. The truth is far from that: it might work,
but you usually end up with the same size, since you would
need to keep the modulus as well, so in some cases it will
even increase the size of the data.
Compression divides into 2 major types:
lossless compression, which reduces the size but, when you
decompress, retrieves exactly the original data;
and lossy compression, which reduces the size of the data
by removing parts you don't need, so when you decompress
the packed data back, you will most likely get
different data. The lossy method is used mainly in multimedia,
since human senses have limits, and that can be used to reduce
sound and images by removing parts humans cannot see or hear.
In this article, we are going to overview some simple lossless
methods. After that, I'll introduce a simple challenge to the
coders among you.
Phase II : Run Length Encoding (RLE)
RLE is the simplest packing method; it's very useful in some places,
and might be completely useless in others.
Let's assume you want to compress data that happens to contain
graphical data (RLE is most useful with it):
raw data, that means, raw pixels.
Usually, bitmaps have areas filled with the same pixel,
which is basically the same data repeated, so instead of having:
00000000 00000000 00000000 00000000 (4 black pixels)
you can make it: 16 00, which says
"we have 16 bytes of 00 in the next part of the data".
True, many times the data is not like this, but it can
still be useful. It is less useful on text, since text has fewer
repeating characters, but for that we have Lempel Ziv (LZ77).
Remember to use a tag code to separate runs from the other raw pixels,
because when you have only 2 pixels of the same color, it's not always
worthwhile to use RLE.
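The scheme above can be sketched in a few lines (a minimal sketch: the choice of 0x00 as the escape tag and the tag/count/value token layout are illustrative assumptions, not a fixed format):

```python
def rle_encode(data: bytes, tag: int = 0x00) -> bytes:
    """Encode runs of 4+ equal bytes as: tag, count, value."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        # short runs stay literal, but the tag byte itself must always be escaped
        if run >= 4 or data[i] == tag:
            out += bytes([tag, run, data[i]])
        else:
            out += data[i:i + run]
        i += run
    return bytes(out)

def rle_decode(packed: bytes, tag: int = 0x00) -> bytes:
    out = bytearray()
    i = 0
    while i < len(packed):
        if packed[i] == tag:                      # tag, count, value
            out += bytes([packed[i + 2]]) * packed[i + 1]
            i += 3
        else:                                     # literal byte
            out.append(packed[i])
            i += 1
    return bytes(out)
```

So 16 bytes of 00 shrink to the 3-byte token 00 10 00, while data with no long runs passes through almost unchanged.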
Phase III : Lempel Ziv '77 (LZ77)
Lempel Ziv is a dictionary based algorithm,
yet it is not a simple word replacement where the most used
words are replaced with a short sign;
this algorithm creates its own dictionary.
By scanning the data, it builds a dictionary buffer,
then it scans the next block(s), and when it encounters
a sequence that already appeared, it replaces it with the
position and length of that sequence in the dictionary.
For instance:
WATCHMATCH is our data,
and we define the dictionary size as 5 bytes (WATCH).
Now we scan the next block for data that appears in the dictionary.
'M' will be kept plain, since it doesn't appear in the dictionary,
yet 'ATCH' will be encoded as position 1, length 4,
since it is a copy of the 4 bytes starting at position 1
(positions count from 0, so 'W' is position 0). Another example:
HORSEWHORE, with a dictionary size of 5 (HORSE).
We will have:
'W' (kept plain)
Pos: 0, Length: 3 (HOR)
Pos: 4, Length: 1 (E)
Of course, you can use various block sizes, for instance:
SUPER-MANBATMAN
with a 9-letter dictionary where each part is 3 letters,
so we have:
SUP|ER-|MAN
'BAT' (kept plain)
Pos: 2, Length: 1 (MAN)
Here position and length are counted in blocks: block 2 of the dictionary is 'MAN', one block long.
Like this, you can (sometimes) compress more data into less space.
Compressing 4-8 bytes at a time can be more efficient than compressing
byte after byte, though byte-level matching might be more useful for text.
Most of the widely used compression methods, such as RAR and Zip, use
one of the LZ77 variants; each has its own benefits and losses.
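The examples above can be sketched as a naive sliding-window LZ77 (a sketch only: the token layout, window size, and minimum match length are my own illustrative choices, and real formats pack tokens as bits, not tuples):

```python
def lz77_compress(data: bytes, window: int = 255, min_match: int = 3):
    """Produce tokens: ('lit', byte) or ('ref', distance_back, length)."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):    # search the window for the longest match
            length = 0
            while (i + length < len(data) and data[j + length] == data[i + length]
                   and length < 255):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:                 # worth a reference token
            tokens.append(('ref', best_dist, best_len))
            i += best_len
        else:                                     # keep the byte plain
            tokens.append(('lit', data[i]))
            i += 1
    return tokens

def lz77_decompress(tokens) -> bytes:
    out = bytearray()
    for t in tokens:
        if t[0] == 'lit':
            out.append(t[1])
        else:
            _, dist, length = t
            for _ in range(length):
                out.append(out[-dist])            # copy from already-decoded output
    return bytes(out)
```

Running this on WATCHMATCH reproduces the example: W, A, T, C, H and M come out as literals, and ATCH becomes a single back-reference of length 4.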
Phase IV : Huffman
Huffman is a frequency based compression method, and it can be very useful
to work with. Unlike LZ77 and RLE, similar data doesn't have
to sit next to each other or in similar blocks; it can be spread all
over the data.
Huffman works in the following way:
-Frequency tests
-Building a Huffman tree
-Encoding
The frequency tests determine which byte is used the most
in the data, just like in substitution cryptography, where you try to determine
the most frequent letters and replace them with the logical decrypted letters.
For example, 'E' is the most frequent letter in the English language,
yet you can write a sentence without any 'E':
"Dan is away from his daddy" - see? No 'E' in this sentence,
and if you think about it, you can create even more complicated lines with no 'E'.
So Huffman, instead of relying on a pre-made frequency table, creates
one every time. This can be very useful for code, since some opcodes repeat
more than others; FPU instructions are less frequent than plain x86 code
(we are talking about x86 platforms), so some codes will be more frequent than others.
So why not use this to decrease the size of the code?
For example, the opcode 85C0 (test eax,eax) appears more often than D9E1 (fabs)
in most code, so why not replace 85C0 with just 1 or 2 bits?
That is exactly what Huffman does: after the frequency test,
the most frequent bytes/characters/words are replaced with short bit
sequences, and the less frequent ones with longer sequences.
The Huffman tree is a binary tree used to determine the decoded value
of a bit sequence, but first let's see how it is built.
In the text we want to compress we have this line:
"MONEY IS ROOT OF ALL EVIL"
Let's remove the spaces: "MONEYISROOTOFALLEVIL".
By the ASCII table, this line is represented in hexadecimal as:
4D 4F 4E 45 59 49 53 52 4F 4F 54 4F 46 41 4C 4C 45 56 49 4C
In binary it looks like this:
0100 1101 0100 1111 0100 1110 0100 0101 0101 1001 0100 1001 0101 0011 0101 0010 0100
1111 0100 1111 0101 0100 0100 1111 0100 0110 0100 0001 0100 1100 0100 1100 0100 0101
0101 0110 0100 1001 0100 1100
Let's count how often each nibble appears in the hex string:
Nibble | 1  2  3  4   5  6  9  C  D  E  F
Count  | 1  1  1  16  7  2  3  3  1  1  4
As a branch list:
(1) (1) (1) (16) (7) (2) (3) (3) (1) (1) (4)
 1   2   3   4    5   6   9   C   D   E   F
This is our basic branch list: all the nibbles appearing in that line,
and the number of times each one appears in the text. To build it into a tree,
we repeatedly take the two branches with the smallest counts and unite them
under a bigger branch whose count is the sum of all its sub-branches:
(2) (1) (16) (7) (2) (3) (3) (1) (1) (4)
(1) (1) 3 4 5 6 9 C D E F
1 2
And again, until we have a complete tree:
_________________(40)______________
/ \
_______(17)________ (23)
/ \ / \
/ (5) \ /(12)\ (16) (7)
(3) (2) (6) (6) [4] [5]
(1) (2) (1) (1) (3) (3) (4) (2)
[1] (1) (1) [D] [E] [9] [C] [F] [6]
[2] [3]
This is the final result. Let's review:
40 at the top means there are 40 nibbles in the data.
Each node divides into 2, which is a 'choice' between 0 and 1:
on 0, go to the left branch; on 1, go to the right branch.
So:
11 = 5, because we went right on the first 1 (so it's either 4 or 5),
and then right again, and we get 5.
This is our new table:
Encoded|Real| Old
------------------
0000 | 1 | 0001
00010 | 2 | 0010
00011 | 3 | 0011
0010 | D | 1101
0011 | E | 1110
0100 | 9 | 1001
0101 | C | 1100
0110 | F | 1111
0111 | 6 | 0110
10 | 4 | 0100
11 | 5 | 0101
This tree is far from optimal; it can be done MUCH better.
Yet, it compresses "MONEYISROOTOFALLEVIL" from this:
01001101010011110100111001000101010110010100100101010011010100100100
11110100111101010100010011110100011001000001010011000100110001000101
010101100100100101001100 (160 bit)
To that:
10001010011110001110110101100110100111000111100010100111100111111010
0101111001111000001001011001011011110111101001100101 (120 bit)
Cut by a quarter. This is a nice ratio, and an optimized tree would
give an even better ratio, since it would shorten more than two
symbols. (Here, as you can see, only 4 and 5 were reduced to half their size.)
Huffman can be very useful, especially when combined with RLE and/or LZ77:
the LZ pass first replaces the repeated sequences with short position/length
pairs, and the Huffman pass then shrinks the result even further.
Decoding Huffman is done the reverse way: you read the bits one by one
and follow the tree; once you reach a decoded value (a leaf), you stop.
So if you read:
1 > Go right, it's either 4 or 5
0 > Go left, it's 4.
Next:
0 > Go left, it's not 4 or 5
And so on, until you finish decoding the entire data.
Remember that when you construct a tree, you can use whatever symbol size
you want, depending on your memory, time, and space.
You could build a tree over 64-bit values if you have the memory; yet I think
4-8 bit trees are the best considering speed and size.
They don't have many entries, and they are not overly complicated.
Huffman requires quite some time of sitting and coding. The theory,
like most things related to logic, is simple, but the implementation
can be hard sometimes. The important thing is not to give up; once you get
the hang of the code, you will be able to do many great things.
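The three steps above (frequency test, uniting the two smallest branches into a tree, then assigning 0 for left and 1 for right) can be sketched like this (a minimal sketch using Python's heapq, not the article's own code):

```python
import heapq
from collections import Counter

def huffman_codes(data: str) -> dict:
    """Build a prefix-free code table: symbol -> bit string."""
    freq = Counter(data)
    if len(freq) == 1:                      # degenerate case: one symbol, one bit
        return {next(iter(freq)): "0"}
    # heap items: (count, tiebreaker, tree); a tree is a symbol or a (left, right) pair
    heap = [(n, i, sym) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:                    # unite the two smallest branches
        n1, _, t1 = heapq.heappop(heap)
        n2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, i, (t1, t2)))
        i += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")     # left branch = 0
            walk(tree[1], prefix + "1")     # right branch = 1
        else:
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes

def huffman_encode(data: str, codes: dict) -> str:
    return "".join(codes[s] for s in data)
```

Feeding it "MONEYISROOTOFALLEVIL" gives the most frequent letter ('O') one of the shortest codes, exactly as the nibble example did for 4 and 5.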
Phase V : Errors FAQ And Tips
These are questions I thought some of you might ask yourselves while
coding compression routines; I tried to answer them as best I could.
Feel free to ask me any further questions.
Q: My RLE/LZ decoded values in a wrong way. What could have happened?
A: Prefix coding is important when using these methods. Remember to
use a tag that will NEVER appear as packed data; if you compress text,
use 00 (NULL) or one of the control characters, since they don't appear
in text data.
Q: My Huffman didn't decode. What is wrong with my code?
A: Huffman usually creates a different tree for every input.
Remember to attach the tree to the encoded data; or, if the decoder
is private to each use (.exe packers, for example), you may
attach it to the decompressor itself.
Q: The data comes out weird from memory. Why?
A: Little endian: the byte order is reversed, and you can use bswap to fix it,
since many formats use big endian byte order. If you are writing
your own format, remember to be consistent: either swap on both encoding
and decoding, or don't swap at all. The choice is yours.
Q: My LZ compressor is very slow, yet other compressors are very fast.
How come my code is slow?
A: Comparing each position with every previous position is slow; build
some kind of data structure to speed things up.
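One such structure (an illustrative sketch, not taken from any particular compressor) is a hash table that maps each 3-byte prefix to the positions where it occurs, so the match search only compares positions that can possibly match:

```python
from collections import defaultdict

def build_index(data: bytes) -> dict:
    """Map every 3-byte substring to the sorted list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(data) - 2):
        index[data[pos:pos + 3]].append(pos)
    return index

def find_longest_match(data: bytes, i: int, index: dict, window: int = 4096):
    """Return (distance, length) of the longest earlier match for data[i:], or (0, 0)."""
    best_len, best_dist = 0, 0
    for j in index.get(data[i:i + 3], ()):
        if j >= i:
            break                           # positions are sorted; ignore the future
        if i - j > window:
            continue                        # outside the sliding window
        length = 0
        while i + length < len(data) and data[j + length] == data[i + length]:
            length += 1
        if length > best_len:
            best_len, best_dist = length, i - j
    return best_dist, best_len
```

Instead of scanning the whole window byte by byte, only the handful of positions sharing the same 3-byte prefix are examined, which is the essence of how fast LZ compressors find matches.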
Q: Is there another way to decode Huffman besides a tree ?
A: An important property of Huffman codes is that they are prefix-free:
no code is a prefix of another.
If you have a code of n bits, say code A,
then you can be sure that there is no code of more than n bits
whose first n bits are the same as A.
So, if you place arbitrary bits behind A to fill it up from n bits
to m bits (say 16), there is still no way that the result can be
interpreted as any code other than A.
What this means is that you can create a table to decode quickly,
by using multiple bits at a time. You can grab m bits from the bitstream,
and use that m bit number as an index into the table.
You will have to bruteforce all codes up to m bits, and store the symbol and length
of each code into all bruteforced table-positions.
Now when you want to decode a code, you lookup in the table, find the actual symbol,
and actual length of the code. You output the symbol,
and advance the bitstream by the length of the code.
Another property of Huffman codes is that the shortest code belongs
to the most frequent symbol in the data.
If you have a large Huffman tree, then a bruteforce table can be rather large
and will not fit in L1 cache (for 16 bits, you already need 2^16 entries of
at least 2 bytes each: one byte for the symbol, one for the length, so you need 128 kb).
It is a good idea to split the table in two.
One table should be tuned so that it fits nicely in L1 cache. This table will store
the smallest codes (which are also the most frequent ones. If you take for example 12 bits,
you get (2^12)*2 = 8 kb, which should fit nicely in L1-cache).
You can use a special value of the length (eg. -1) to indicate that the code was not found.
You then grab the remaining bits from the stream, and look in the second table.
Alternatively you could walk down the tree in such a case,
but usually that won't be necessary,
since even the second table will be quite fast (it's in L2-cache, and it will only be
filled partially, there will be 'gaps' for the codes that are already in the first table,
so only small parts of the second table will have to be cached, and they are rarely used anyway).
(Answer was contributed by Scali.)
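Scali's table idea can be sketched as follows (assuming we already have a symbol-to-bit-string code table, e.g. produced from a Huffman tree; the table width m and the codes used in the test are illustrative):

```python
def build_decode_table(codes: dict, m: int) -> list:
    """For every m-bit index, store (symbol, code_length) of the code that prefixes it."""
    table = [None] * (1 << m)
    for sym, code in codes.items():
        pad = m - len(code)                 # all paddings of the code map to the same symbol
        base = int(code, 2) << pad
        for filler in range(1 << pad):
            table[base | filler] = (sym, len(code))
    return table

def table_decode(bits: str, codes: dict, m: int) -> str:
    """Decode a bit string by looking up m bits at a time instead of walking a tree."""
    table = build_decode_table(codes, m)
    out, pos = [], 0
    padded = bits + "0" * m                 # pad so the final lookup still has m bits
    while pos < len(bits):
        sym, length = table[int(padded[pos:pos + m], 2)]
        out.append(sym)
        pos += length                       # advance by the real code length, not m
    return "".join(out)
```

This assumes every code is at most m bits long; the two-level L1/L2 split described above is the refinement for tables too big for one lookup.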
Q: There are so many premade libs that are coded much better than what I'll be able to do,
so why should I learn how to program compression algorithms?
A: You shouldn't. But then again, you have a hand-held calculator, so why would you learn
how to add and multiply numbers?
Learning compression can help you approach programming and reversing tasks differently,
in a somewhat better way. You know assembly, but does that mean you use it all the time?
What about C and Delphi? They are useful in their own fields. It's not about programming
compression; it's about knowing how it works. Like the engine of a car: you know how it works,
so you can drive the car better, and you know when to shift gears.
Does that mean you will definitely build your own engine? No. Learning compression is good;
it might be troublesome in some topics, yet it can help you very much. The decision is
yours alone.
Q: I want to program a file protector that will pack files. What compression should I use?
A: It's completely up to you. You can combine some LZ variant and Huffman to get good
compression, and you can create your own algorithm if you know how. Just remember to follow
some ground rules:
Prefix coding is important to many methods.
Choose one or two methods, but don't overdo it:
"One cannot see the wood for the trees".
If decompression happens on the fly while executing the file, don't forget to place your
decompression code somewhere it can decompress the code back into its original place in memory.
Learn the executable format's header before you begin your attempts; that way you won't ruin
the file every time you try to pack it.
There are quite a few more of these ground rules, but since we are not dealing with binary
packing in this article, I won't add more.
Q: I wrote a compression code, but the compressed data won't decompress with my decompressor.
What did I do wrong?
A: I can't tell exactly what you did wrong. Try to write a new decompressor from scratch
that is the exact reverse of the compressor: if you used rol, use ror, and work backwards,
so if the rol was the last instruction, make the ror the first. After you have done that,
your decompressor should work flawlessly.
Phase VI : Conclusions
Compression is an art. You don't have to master it; you can always learn just the parts
you need. Yet compression can help you deal with programming issues, as well as help you
directly with compressing data. You will probably run into many difficulties while trying
to code compression related things. Never give up: each try will be more successful, from
a barely running Huffman to your own live'n'kicking compression methods.
Regarding lossy compression methods, I couldn't bring them into this paper: this is a paper
for newbies, meant to explain the simplest methods, so that a reader who has just learned
about compression can develop further and understand other compression related papers better.
I might write a lossy compression article someday, but now is not the time. There are plenty
of texts about it around the internet; look for them.
Here are some more resources for compression:
Introduction to Data Compression, Second Edition by Khalid Sayood
Compression Algorithms for Real Programmers (For Real Programmers) by Peter Wayner
Data Compression: The Complete Reference by David Salomon
Text Compression (Prentice Hall Advanced Reference Series)
by Timothy C. Bell, John G. Cleary, Ian H. Witten
Introduction to Information Theory and Data Compression by Darrel Hankerson
The Data Compression Book by Mark Nelson
#Compression at EFNet (IRC)
http://www.cs.pdx.edu/~idr/compression/
http://www.ics.uci.edu/~dan/pubs/DataCompression.html
http://www.faqs.org/faqs/compression-faq/
http://datacompression.info/
http://www.data-compression.com/
http://www.dogma.net/markn/
http://www.arturocampos.com/
http://www.ross.net/compression/
http://www.cbloom.com/
These links should supply you with the basic information, from which you can develop your
knowledge on your own.
This article was created for newbies. I hope I managed to make things clear here and explain
the basics of compression. As for myself, I am a newbie when it comes to most of the
compression related issues, if not all of them. So I would like to take the opportunity to
thank some people who helped me learn the basics of the art:
Jorgen Ibsen (Jibz), who taught me everything from the start; I used his methods in this
article since I remembered how well they appeal to newbies.
Scali, who taught me about brute-forced Huffman tables and contributed some questions and
answers to the FAQ.
X-Lock, who helped me with the RLE and LZ finishing touches, and who contributed compression
resources.
Added Files: Programming Challenge, Compression algorithm.
ParaBytes