[Translation] A Newbie's View: Compression
Posted: 2004-9-1 10:12

A Newbie's View:
                   Compression
               -=-=-=-=-==-=-=-=-=-

Phase I : Introduction

Compression is the art of reducing the size of raw data.
I've encountered some places that called compression
encryption, which can be argued, since encryption
is the art of hiding data as other data, and compression
does that too.
When you first think of compression you usually think
of some number that divides every byte/word in the data,
and that this compresses it.
The truth is far from that: it might work, but you
usually end up with the same size, since you need
to keep the remainders as well, so in some cases it will even
increase the size of the data.

Compression itself divides into 2 major types:
lossless compression, which reduces the size, but when you
decompress, you retrieve exactly the original data;
and lossy compression, which reduces the size of the data
by removing parts you don't need, so when you decompress
the packed data back to normal, you will most likely get
different data. The lossy method is used mainly in multimedia,
since human senses have limits, and that can be exploited to shrink
sound and images by removing parts humans cannot see or hear.

In this article, we are going to overview some simple lossless
methods. After that, I'll introduce a simple challenge to the
coders among you.

Phase II : Run Length Encoding (RLE)

RLE is the simplest form of packing; it's very useful in some places
and might be completely useless in others.
Let's assume you want to compress data that happens to contain
graphical data (RLE is most useful with it):
raw data, which means raw pixels.
Usually, bitmaps have areas filled with the same pixel,
which is basically the same data repeated, so instead of having:
00000000 00000000 00000000 00000000 (4 black pixels)
you can make it: 16,00, to say
"we have 16 bytes of 00 in the next part of the data".
True, many times it's not like this, but it can still be useful.
It is less useful on text, since text has fewer repeating
characters, but for that we have Lempel Ziv (LZ77).

Remember to use a tag code to set runs apart from the other raw pixels,
because when you have only 2 pixels with the same color, it's not always
useful to use RLE..
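
As a sketch of the idea (the tag byte 0x00, the minimum run of 4, and the tag,count,value token format are my own assumptions for illustration, not a standard):

```python
def rle_encode(data: bytes, tag: int = 0x00) -> bytes:
    """Run-length encode: runs of 4+ equal bytes become tag,count,value."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        if run >= 4 or data[i] == tag:
            # Escape the tag byte itself too, so it never appears raw.
            out += bytes([tag, run, data[i]])
        else:
            out += data[i:i + run]
        i += run
    return bytes(out)

def rle_decode(data: bytes, tag: int = 0x00) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == tag:
            count, value = data[i + 1], data[i + 2]
            out += bytes([value]) * count
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

With this sketch, 16 bytes of 00 (the 4 black pixels above) encode into just the 3 bytes tag,16,00, while bytes that only repeat once or twice are left plain, as the tag rule above suggests.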

Phase III : Lempel Ziv '77 (LZ77)

Lempel Ziv is a dictionary based algorithm, yet it is not simple
word replacing, where the most used words are replaced with a short
sign from a fixed table.
This algorithm creates its own dictionary.
While scanning the data, it maintains a dictionary buffer;
then it scans the next block(s), and when it encounters
a previously seen sequence, it replaces it with the index of
that sequence in the dictionary.
For instance:
WATCHMATCH is our data,
and we define the dictionary size as 5 bytes (WATCH).
Now we scan the next block for data from the dictionary:
'M' will be kept plain, since it doesn't appear in the dictionary,
while 'ATCH' will be encoded as position 1, length 4,
since it is a copy of the 4 bytes starting at position 1
(positions are zero-based, so 'W' is 0 and 'A' is 1).
We can do the same with:
HORSEWHORE, with a dictionary size of 5 (HORSE).
We will have:
'W'
Pos: 0, Length: 3 (HOR)
Pos: 4, Length: 1 (E)

Of course, you can use various sizes of data, for instance:
SUPER-MANBATMAN
Use a 9-letter dictionary where each part is 3 letters;
now we have:
SUP|ER-|MAN
'BAT'
Pos: 2, Length: 1 (MAN)

So, like this, you can compress more data into less space (sometimes).
For example, compressing 4-8 bytes at a time is more efficient
than compressing byte after byte, and that can be useful for text.

Most compression methods in use, such as RAR and Zip, rely on
one LZ77 variant or another; each has its own benefits and losses.
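
A naive sketch of the scheme described above (the token format and the greedy longest-match search are my assumptions; a real LZ77 uses a sliding window and a compact binary output, not Python tuples):

```python
def lz77_encode(data: str, dict_size: int):
    """Greedy LZ77 sketch: the first dict_size chars are the dictionary,
    the rest becomes literals or (pos, length) references into it."""
    dictionary = data[:dict_size]
    tokens = []
    i = dict_size
    while i < len(data):
        best_pos, best_len = -1, 0
        for pos in range(dict_size):          # try every dictionary position
            length = 0
            while (pos + length < dict_size and i + length < len(data)
                   and dictionary[pos + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len > 0:
            tokens.append((best_pos, best_len))   # reference into the dictionary
            i += best_len
        else:
            tokens.append(data[i])                # literal, kept plain
            i += 1
    return dictionary, tokens

def lz77_decode(dictionary: str, tokens) -> str:
    out = dictionary
    for t in tokens:
        out += dictionary[t[0]:t[0] + t[1]] if isinstance(t, tuple) else t
    return out
```

Running it on the examples above, `lz77_encode("WATCHMATCH", 5)` yields the literal 'M' followed by (1, 4), and `lz77_encode("HORSEWHORE", 5)` yields 'W', (0, 3), (4, 1).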

Phase IV : Huffman

Huffman is a frequency based compression method, and it can be very
useful to work with. Unlike LZ77 and RLE, similar data doesn't have
to be next to each other, or in similar blocks; it can be spread all
over the data.
Huffman works in the following way:
-Frequency tests
-Building a Huffman tree
-Encoding

The frequency tests determine which byte is used the most
in the data, just like in substitution cryptography, where you try to
determine the most frequent letters and replace them with logical
decrypted letters.
For example, 'E' is the most frequent letter in the English language,
yet you can write a sentence without any 'E':
"Dan is away from his daddy" - see? No 'E' in this sentence,
and if you think about it, you can create even more complicated lines with no 'E'.
So Huffman, instead of counting on a pre-made frequency table, creates
one every time. This can be very useful for code, since some opcodes repeat
more than others; FPU code is less frequent than plain x86 code (we are talking
about x86 platforms), so some byte sequences will be more frequent than others.
So why not use this to decrease the size of the code?
For instance, the opcode 85C0 (test eax,eax) appears more often than D9E1 (fabs)
in most code, so why not replace 85C0 with 2 bits or 1 bit?
That is exactly what Huffman does: after the frequency test,
the most frequent bytes/characters/words are replaced with short bit
sequences, and the less frequent ones are replaced with longer ones.

The Huffman tree is a binary tree used to determine the decoded value
of a bit sequence, but first let's see how it is built.
In the text we want to compress we have the following line:
"MONEY IS ROOT OF ALL EVIL"
Let's remove the spaces: "MONEYISROOTOFALLEVIL".

In hexadecimal, by the ASCII table, this line is represented as:
4D 4F 4E 45 59 49 53 52 4F 4F 54 4F 46 41 4C 4C 45 56 49 4C
Binary will represent it as:
0100 1101 0100 1111 0100 1110 0100 0101 0101 1001 0100 1001 0101 0011 0101 0010 0100
1111 0100 1111 0101 0100 0100 1111 0100 0110 0100 0001 0100 1100 0100 1100 0100 0101
0101 0110 0100 1001 0100 1100

Let's analyze the hex string we have, counting how many times each nibble appears:

1 1
2 1
3 1
4 16
5 7
6 2
9 3
C 3
D 1
E 1
F 4

(1) (1) (1) (16) (7) (2) (3) (3) (1) (1) (4)
1   2   3   4    5   6   9   C   D   E   F
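
The frequency table above can be reproduced in a few lines (a sketch; counting per nibble rather than per byte simply follows the example):

```python
from collections import Counter

# Reproduce the nibble-frequency table for the example line.
text = "MONEYISROOTOFALLEVIL"
hex_string = "".join(f"{ord(c):02X}" for c in text)   # "4D4F4E4559..."
freq = Counter(hex_string)                            # occurrences of each nibble
for nibble in sorted(freq):
    print(nibble, freq[nibble])
```

This prints the same counts as the table: '4' appears 16 times, '5' seven times, and so on, for 40 nibbles in total.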

Now, this is our basic branch list: all the nibbles we have in that line,
and the number of times each appears in the text. To build it into a tree,
we repeatedly take two branches and unite them under a bigger branch whose
count is the sum of all the sub-branches under it.

  (2)  (1) (16) (7) (2) (3) (3) (1) (1) (4)
(1) (1) 3   4    5   6   9   C   D   E   F
1   2

And again, until we have a complete tree:
                      _________________(40)______________
                     /                                   \
            _______(17)________                         (23)
           /                   \                        /  \
       / (5) \               /(12)\                   (16) (7)
    (3)        (2)        (6)      (6)                [4]  [5]
(1)   (2)   (1) (1)    (3) (3)  (4) (2)
[1] (1) (1) [D] [E]    [9] [C]  [F]  [6]
     [2] [3]

This is the final result. Let's review:
the 40 at the top means 40 nibbles in the data;
it divides into two, which is a 'case' of 0 and 1:
if 0, go to the left branch; if 1, go to the right branch.
So:
11 = 5, because we first went right (so it's either 4 or 5), and then right
again, and we get 5.
This is our new table:

Encoded|Real| Old
------------------
0000   | 1  | 0001
00010  | 2  | 0010  
00011  | 3  | 0011
0010   | D  | 1101
0011   | E  | 1110
0100   | 9  | 1001
0101   | C  | 1100
0110   | F  | 1111
0111   | 6  | 0110
10     | 4  | 0100
11     | 5  | 0101

This tree is highly unoptimized; it can be done MUCH better.
Yet it compresses "MONEYISROOTOFALLEVIL" from this:

01001101010011110100111001000101010110010100100101010011010100100100
11110100111101010100010011110100011001000001010011000100110001000101
010101100100100101001100 (160 bit)

To that:

10001010011110001110110101100110100111000111100010100111100111111010
0101111001111000001001011001011011110111101001100101 (120 bit)

Cut by a quarter. This is a nice ratio, and an optimized tree would
give an even better ratio, since it would shorten more than two
signs. (Here, as you can see, only 4 and 5 were reduced to half.)

Huffman can be very useful, especially when combined with RLE and LZ77,
or one of them: the LZ77 dictionary reduces all the repeated
4's and 5's, and the Huffman tree shrinks the output even more
once you get the position/length pairs down.

Decoding Huffman is done the reverse way: you read the first
bit and follow the tree; once you reach a decoded value, you stop.
So if you read:
1 > Go right, it's either 4 or 5
0 > Go left, it's 4.
Next:
0 > Go left, it's not 4 or 5
And so on, until you finish decoding the entire data.

Remember that when you construct a tree, you can use whatever symbol size
you want, depending on your amount of memory, time, and space.
You can build a tree for 64-bit values if you have the memory. Yet I think
that 4-8 bit trees are the best when considering speed and size:
they don't have many entries, and they are not too complicated.

Huffman requires quite some time of sitting and coding. The theory,
as with most things related to logic, is simple, but the implementation
can sometimes be hard. The important thing is not to give up; once you get
the hang of the code, you will be able to do many great things.
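
The three steps (frequency test, tree building, encoding) can be sketched with a heap-based tree builder. This is a minimal sketch: ties may be broken differently than in the tree drawn above, so the exact codes can differ, but any resulting code is a valid prefix code for the same frequencies:

```python
import heapq

def huffman_codes(freq):
    """Build a Huffman code from a {symbol: count} map.
    Heap entries are (count, tiebreak, symbols-in-subtree)."""
    heap = [(c, i, [s]) for i, (s, c) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    codes = {s: "" for s in freq}
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)    # the two rarest branches...
        c2, _, right = heapq.heappop(heap)
        for s in left:                       # ...gain one more leading bit
            codes[s] = "0" + codes[s]
        for s in right:
            codes[s] = "1" + codes[s]
        tiebreak += 1
        heapq.heappush(heap, (c1 + c2, tiebreak, left + right))
    return codes

def huffman_decode(bits, codes):
    inverse = {code: s for s, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:               # prefix-freedom: first hit is the symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)

# Frequencies of the nibbles in "MONEYISROOTOFALLEVIL", as counted above.
freq = {"4": 16, "5": 7, "F": 4, "9": 3, "C": 3,
        "6": 2, "1": 1, "2": 1, "3": 1, "D": 1, "E": 1}
codes = huffman_codes(freq)
```

Encoding the 40 nibbles with `codes` and decoding them back recovers the data, and the most frequent nibble ('4') always ends up with the shortest code.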

Phase V : Errors FAQ And Tips

These are questions I thought some of you might ask yourselves while
coding compression routines; I tried to answer them as best I could.
Feel free to ask me any further questions.

Q: My RLE/LZ decoded values in a wrong way. What could have happened ?
A: Prefix coding is important when using these methods. Remember to
   use a tag that will NEVER appear as packed data; if you work with text,
   use 00 (NULL) or one of the control characters, since they don't appear
   in text data.

Q: My Huffman didn't decode. What is wrong with my code ?
A: Huffman usually creates a different tree for every piece of data.
   Remember to attach the tree to the encoded data, or, if the decoder
   is private to each place (.exe packers, for example), you may
   attach the tree to the decompressor itself.

Q: The data comes out weird from memory. Why ?
A: Little endian: the byte order is reversed, and you can use bswap to fix it,
   since many formats use big-endian byte order. If you are writing
   your own format, remember to be consistent: either swap on both encoding
   and decoding, or don't swap at all. The choice is yours.
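
A quick illustration of the two byte orders (shown here with Python's int.from_bytes; the x86 bswap instruction performs the equivalent byte reversal on a register):

```python
raw = bytes([0x12, 0x34, 0x56, 0x78])

big = int.from_bytes(raw, "big")        # most significant byte first
little = int.from_bytes(raw, "little")  # least significant byte first

print(hex(big))     # 0x12345678
print(hex(little))  # 0x78563412

# "bswap" in software: reversing the bytes converts one order to the other.
assert int.from_bytes(raw[::-1], "little") == big
```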

Q: My LZ compressor is very slow, yet other compressors are very fast.
   How come my code is slow ?
A: Comparing each byte against all positions is slow; create some kind of
   data structure to speed things up.

Q: Is there another way to decode Huffman besides a tree ?
A: An important property of Huffman codes is that no code is a prefix of
   another (the code is prefix-free).
   If you have a code of n bits, say code A,
   then you can be sure that there is no code of more than n bits
   whose first n bits are the same as A.
   So, if you place random bits behind A to fill the n bits
   up to m (say 16), there is still no way that the result can be interpreted
   as any other code than A.

   What this means is that you can create a table to decode quickly,
   by using multiple bits at a time. You can grab m bits from the bitstream,
   and use that m bit number as an index into the table.
   You will have to bruteforce all codes up to m bits, and store the symbol and length
   of each code into all bruteforced table-positions.
   Now when you want to decode a code, you lookup in the table, find the actual symbol,
   and actual length of the code. You output the symbol,
   and advance the bitstream by the length of the code.

   Another property of Huffman codes is that the shortest code belongs
   to the most frequent symbol in the data.
   If you have a large Huffman tree, then a bruteforce table can be rather large,
   and will not fit in L1 cache (for 16 bits you will already need 2^16 entries of
   at least 2 bytes each: one byte for the symbol, one for the length, so you
   need 128 KB).
   It is a good idea to split the table in two.
   One table should be tuned so that it fits nicely in L1 cache. This table will
   store the smallest codes (which are also the most frequent ones; if you take
   for example 12 bits, you get (2^12)*2 = 8 KB, which should fit nicely in L1 cache).
   You can use a special value of the length (e.g. -1) to indicate that the code
   was not found.
   You then grab the remaining bits from the stream and look in the second table.
   Alternatively you could walk down the tree in such a case,
   but usually that won't be necessary,
   since even the second table will be quite fast (it's in L2 cache, and it will
   only be filled partially; there will be 'gaps' for the codes that are already
   in the first table, so only small parts of the second table have to be cached,
   and they are rarely used anyway).
   (Answer contributed by Scali.)
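
   Scali's idea can be sketched with a small lookup table, using the code table
   from Phase IV. The single-table variant is shown for brevity, with m = 5
   since the longest code there is 5 bits:

```python
def build_decode_table(codes, m):
    """Fill a 2^m entry table: every m-bit value that starts with a code
    maps to (symbol, code_length) -- the 'bruteforced' positions."""
    table = [None] * (1 << m)
    for symbol, code in codes.items():
        base = int(code, 2) << (m - len(code))   # code left-aligned in m bits
        for fill in range(1 << (m - len(code))): # every possible bit tail
            table[base + fill] = (symbol, len(code))
    return table

def decode_with_table(bits, table, m):
    out, pos = [], 0
    while pos < len(bits):
        window = bits[pos:pos + m].ljust(m, "0") # grab m bits (pad at the end)
        symbol, length = table[int(window, 2)]
        out.append(symbol)
        pos += length                            # advance by the real code length
    return "".join(out)

# The code table from Phase IV (encoded -> nibble).
codes = {"1": "0000", "2": "00010", "3": "00011", "D": "0010", "E": "0011",
         "9": "0100", "C": "0101", "F": "0110", "6": "0111", "4": "10", "5": "11"}
table = build_decode_table(codes, 5)
```

   For example, the bits "10 0010 10 0110" decode to the nibbles 4, D, 4, F in
   four table lookups, one per symbol, with no bit-by-bit tree walking.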

Q: There are so many premade libs that are coded much better than anything
   I'll be able to write, so why should I learn to program compression algorithms ?
A: You shouldn't. But then again, you have a hand-held calculator, so why would
   you learn how to add and multiply numbers ?
   Learning compression can help you deal with programming and reversing
   tasks differently, in a somewhat better way. You know assembly, but does
   that mean you use it all the time ? What about C and Delphi ? They are useful
   in their own fields. It's not about programming compression; it's about
   knowing how it works. Like the engine of a car: you know how it works, so you
   can drive the car better, and you know when to shift gears. Does that mean
   you will definitely build your own engine ? No. Learning compression is good;
   it might be troublesome with some topics, yet it can help you very much.
   The decision is yours alone.

Q: I want to program a file protector that will pack files. What compression
   should I use ?
A: It's completely up to you. You can combine some LZ variation and Huffman to
   get good compression, and you can create your own algorithm if you know how.
   Just remember to follow some ground rules:
   Prefix coding is important to many methods.
   Choose one or two methods, but don't overdo it:
   "One cannot see the wood for the trees".
   If decompression happens on the fly while executing the file, don't forget to
   place your decompression code somewhere it can decompress the code back into
   its original place in memory.
   Learn the executable type's header before you begin your attempts; that way
   you won't ruin the file every time you try to pack it.

   There are quite a few more of these ground rules, but since we are not going
   to deal with binary packing in this article, I won't add more.

Q: I wrote a compression code, but the compressed data won't decompress with my
   decompressor. What did I do wrong ?
A: I can't tell exactly what you did wrong. Try writing a new decompressor from
   scratch, as the exact reverse of the compressor: if you used rol, use ror, and
   do everything backwards, so if rol was the last instruction, make ror the
   first. After that, your decompressor should work flawlessly.


Phase VI : Conclusions

Compression is art. You don't have to master it; you can always learn just the
parts you need. Yet compression can help you deal with programming issues, as
well as help you directly with compressing data. You will probably run into many
difficulties while trying to code compression related things. Never give up:
each try will be more successful, from a barely running Huffman to your own
live'n'kicking compression methods.
Regarding lossy compression methods, I couldn't bring them into this paper
because this is a paper for newbies, meant to explain the simplest methods, so
that a reader who has just learned about compression can develop further and
understand other compression related papers better. I might write a lossy
compression article someday, but now is not the time. There are plenty of texts
about it around the internet; look for them.

Here are some more resources for compression:

Introduction to Data Compression, Second Edition by Khalid Sayood
Compression Algorithms for Real Programmers (For Real Programmers) by Peter Wayner
Data Compression: The Complete Reference by David Salomon
Text Compression (Prentice Hall Advanced Reference Series)
by Timothy C. Bell, John G. Cleary, and Ian H. Witten
Introduction to Information Theory and Data Compression by Darrel Hankerson
The Data Compression Book by Mark Nelson

#Compression at EFNet (IRC)

http://www.cs.pdx.edu/~idr/compression/
http://www.ics.uci.edu/~dan/pubs/DataCompression.html
http://www.faqs.org/faqs/compression-faq/
http://datacompression.info/
http://www.data-compression.com/
http://www.dogma.net/markn/
http://www.arturocampos.com/
http://www.ross.net/compression/
http://www.cbloom.com/

These links should supply you with the basic information, from which you can
develop your knowledge further on your own.

This article was created for newbies. I hope I managed to make things clear and
explain the basics of compression. As for myself, I am a newbie when it comes to
most of the compression related issues, if not all of them. So I would like to
take the opportunity to thank some people who helped me learn the basics of the
art:
Jorgen Ibsen (Jibz), who taught me everything from the start; I used his methods
in this article since I remember how well they appeal to newbies.

Scali, who taught me about brute-forcing Huffman tables and contributed some
questions and answers to the FAQ.

X-Lock, who helped me with the RLE and LZ finishing touches, and also
contributed compression resources.

Added Files: Programming Challenge, Compression Algorithm.

ParaBytes


                                                        springkang[DFCG][NSSG][TT]
                                                                 Translated on the night of 2004-08-31, 12:00
Brothers, please take a look and tell me whether anything is translated
inappropriately or incorrectly. I'm not very familiar with compression, or with
Huffman trees.
2004-9-1 10:16