00001 Compression Info, 10-11-95
00002 Jeff Wheeler
00003
00004 Source of Algorithm
00005 -------------------
00006
00007 The compression algorithms used here are based upon the algorithms developed and published by Haruhiko Okumura in a paper entitled "Data Compression Algorithms of LARC and LHarc." This paper discusses three compression algorithms: LZSS, LZARI, and LZHUF. LZSS is described as the "first" of these, providing moderate compression with good speed. LZARI is described as an improved LZSS that combines the LZSS algorithm with adaptive arithmetic compression; it is slower than LZSS but compresses better. LZHUF (the basis of the common LHA compression program) was also included in the paper, but without a free usage license.
00008
00009 The following are copies of the statements included at the beginning of each source code listing that was supplied in the working paper.
00010
00011 LZSS, dated 4/6/89, marked as "Use, distribute and
00012 modify this program freely."
00013
00014 LZARI, dated 4/7/89, marked as "Use, distribute and
00015 modify this program freely."
00016
00017 LZHUF, dated 11/20/88, written by Haruyasu Yoshizaki,
00018 translated by Haruhiko Okumura on 4/7/89. Not
00019 expressly marked as redistributable or modifiable.
00020
00021 Since both LZSS and LZARI are marked as "use, distribute and modify freely," we have felt at liberty to base our compression algorithm on either of these.
00022
00023 Selection of Algorithm
00024 ----------------------
00025
00026 Working samples of three possible compression algorithms are supplied in Okumura's paper. Which should be used?
00027
00028 LZSS is the fastest at decompression, but does not generate as small a compressed file as the other methods. The other two methods provide, perhaps, a 15% improvement in compression. Put another way, on a 100K file, LZSS might compress it to 50K while the others might approach 40-45K. For STEP purposes, it was decided that decoding speed was more important than tighter compression. For this reason, the first compression algorithm implemented is LZSS.
00029
00030 About LZSS Encoding
00031 -------------------
00032
00033 (adapted from Haruhiko Okumura's paper)
00034
00035 This scheme was proposed by Ziv and Lempel [1]. A slightly modified version is described by Storer and Szymanski [2]. An implementation using a binary tree has been proposed by Bell [3].
00036
00037 The algorithm is quite simple.
00038 1. Keep a ring buffer which initially contains all space characters.
00039 2. Read several letters from the file to the buffer.
00040 3. Search the buffer for the longest string that matches the letters just read, and send its length and position into the buffer.
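
Step 3 can be illustrated with a deliberately naive search. The sketch below is not taken from Okumura's paper or from the listing in this document; `Match` and `FindLongestMatch` are hypothetical names, and the window is a plain string rather than a ring buffer. The real implementation replaces this linear scan with binary search trees for speed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch (not part of the original source).
struct Match {
    std::size_t position;  // where the match starts in the window
    std::size_t length;    // number of matching characters (0 if none)
};

// Try every starting position in `window` and return the longest prefix of
// `lookahead` found there. On ties, the earliest position is kept.
Match FindLongestMatch(const std::string& window, const std::string& lookahead)
{
    Match best = {0, 0};
    for (std::size_t pos = 0; pos < window.size(); ++pos) {
        std::size_t len = 0;
        while (len < lookahead.size() &&
               pos + len < window.size() &&
               window[pos + len] == lookahead[len]) {
            ++len;
        }
        if (len > best.length) {
            best.position = pos;
            best.length = len;
        }
    }
    // The match never runs past the end of the window.
    assert(best.position + best.length <= window.size());
    return best;
}
```

For example, searching the window "abcabcd" for the lookahead "abcdx" skips the three-character match at position 0 in favor of the four-character match at position 3.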
00041
00042 If the ring buffer is 4096 bytes, the position can be stored in 12 bits. If the length is represented in 4 bits, the <position, length> pair is two bytes long. If the longest match is no more than two characters, then just one character is sent without encoding. The process starts again with the next character. An extra bit is sent each time to tell the decoder whether the next item is a character or a <position, length> pair.
00043
00044 [1] J. Ziv and A. Lempel, IEEE Transactions IT-23, 337-343 (1977).
00045 [2] J. A. Storer and T. G. Szymanski, J. ACM, 29, 928-951 (1982).
00046 [3] T. C. Bell, IEEE Transactions COM-34, 1176-1182 (1986).
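
The two-byte <position, length> pair can be sketched as a pair of small helpers. The byte layout assumed here (low eight position bits in the first byte, high four position bits in the upper nibble of the second byte, biased length in the lower nibble) is the one used by the Encode() and Decode() listing in this document; the names `PackPair`, `UnpackPair`, and the `k...` constants are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed constants for this sketch, mirroring the sizes described above.
const int kWindowSize = 4096;  // positions fit in 12 bits
const int kMinMatch   = 3;     // shortest match worth encoding as a pair
const int kMaxMatch   = 18;    // kMinMatch + 15, the largest 4-bit value

// Pack a <position, length> pair into two bytes:
//   out[0] = low 8 bits of the position
//   out[1] = high 4 bits of the position (upper nibble) | (length - kMinMatch)
void PackPair(int position, int length, std::uint8_t out[2])
{
    assert(position >= 0 && position < kWindowSize);
    assert(length >= kMinMatch && length <= kMaxMatch);
    out[0] = static_cast<std::uint8_t>(position & 0xff);
    out[1] = static_cast<std::uint8_t>(((position >> 4) & 0xf0) |
                                       (length - kMinMatch));
}

// Reverse of PackPair.
void UnpackPair(const std::uint8_t in[2], int& position, int& length)
{
    position = in[0] | ((in[1] & 0xf0) << 4);
    length   = (in[1] & 0x0f) + kMinMatch;
}
```

Because lengths below 3 are never stored as pairs, biasing the length by 3 lets the 4-bit field cover 3 through 18 rather than 0 through 15.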
00047
00048 void InitTree( // no return value
00049 void); // no parameters
00050
00051 void InsertNode( // no return value
00052 short int Pos); // position in the buffer
00053
00054 void DeleteNode( // no return value
00055 short int Node); // node to be removed
00056
00057 void Encode( // no return value
00058 void); // no parameters
00059
00060 void Decode( // no return value
00061 void); // no parameters
00062
00063 // The following are constant sizes used by the compression algorithm.
00064 //
00065 // N - This is the size of the ring buffer. It is set
00066 // to 4K. It is important to note that a position
00067 // within the ring buffer requires 12 bits.
00068 //
00069 // F - This is the maximum length of a character sequence
00070 // that can be taken from the ring buffer. It is set
00071 // to 18. Note that a length must be 3 before it is
00072 // worthwhile to store a position/length pair, so the
00073 // length can be encoded in only 4 bits. Or, put yet
00074 // another way, it is not necessary to encode a length
00075 // of 0-18, it is necessary to encode a length of
00076 // 3-18, which requires 4 bits.
00077 //
00078 // THRESHOLD - It takes 2 bytes to store an offset and
00079 // a length. If a character sequence only
00080 // requires 1 or 2 characters to store
00081 // uncompressed, then it is better to store
00082 // it uncompressed than as an offset into
00083 // the ring buffer.
00084 //
00085 // Note that the 12 bits used to store the position and the 4 bits
00086 // used to store the length equal a total of 16 bits, or 2 bytes.
00087
00088 #define N 4096
00089 #define F 18
00090 #define THRESHOLD 3
00091 #define NOT_USED N
00092
00093 // m_ring_buffer is a text buffer. It contains "nodes" of
00094 // uncompressed text that can be indexed by position. That is,
00095 // a substring of the ring buffer can be indexed by a position
00096 // and a length. When decoding, the compressed text may contain
00097 // a position in the ring buffer and a count of the number of
00098 // bytes from the ring buffer that are to be moved into the
00099 // uncompressed buffer.
00100 //
00101 // This ring buffer is not maintained as part of the compressed
00102 // text. Instead, it is reconstructed dynamically. That is,
00103 // it starts out empty and gets built as the text is decompressed.
00104 //
00105 // The ring buffer contains N bytes, with an additional F - 1 bytes
00106 // to facilitate string comparison.
00107
00108 unsigned char m_ring_buffer[N + F - 1];
00109
00110 // m_match_position and m_match_length are set by InsertNode().
00111 //
00112 // These variables indicate the position in the ring buffer
00113 // and the number of characters at that position that match
00114 // a given string.
00115
00116 short int m_match_position;
00117 short int m_match_length;
00118
00119 // m_lson, m_rson, and m_dad are the Japanese way of referring to
00120 // a tree structure. The dad is the parent and it has a right and
00121 // left son (child).
00122 //
00123 // For i = 0 to N-1, m_rson[i] and m_lson[i] will be the right
00124 // and left children of node i.
00125 //
00126 // For i = 0 to N-1, m_dad[i] is the parent of node i.
00127 //
00128 // For i = 0 to 255, m_rson[N + i + 1] is the root of the tree for
00129 // strings that begin with the character i. Note that this requires
00130 // one byte characters.
00131 //
00132 // These nodes store values of 0...(N-1). Memory requirements
00133 // can be reduced by using 2-byte integers instead of full 4-byte
00134 // integers (for 32-bit applications). Therefore, these are
00135 // defined as "short ints."
00136
00137 short int m_lson[N + 1];
00138 short int m_rson[N + 257];
00139 short int m_dad[N + 1];
00140
00141 /*
00142 -------------------------------------------------------------------------
00143 cLZSS::InitTree
00144
00145 This function initializes the tree nodes to "empty" states.
00146 -------------------------------------------------------------------------
00147 */
00148
00149 void cLZSS::InitTree( // no return value
00150 void) // no parameters
00151 throw() // exception list
00152
00153 {
00154 int i;
00155
00156 // For i = 0 to N - 1, m_rson[i] and m_lson[i] will be the right
00157 // and left children of node i. These nodes need not be
00158 // initialized. However, for debugging purposes, it is nice to
00159 // have them initialized. Since this is only used for compression
00160 // (not decompression), I don't mind spending the time to do it.
00161 //
00162 // For the same range of i, m_dad[i] is the parent of node i.
00163 // These are initialized to a known value that can represent
00164 // a "not used" state.
00165
00166 for (i = 0; i < N; i++)
00167 {
00168 m_lson[i] = NOT_USED;
00169 m_rson[i] = NOT_USED;
00170 m_dad[i] = NOT_USED;
00171 }
00172
00173 // For i = 0 to 255, m_rson[N + i + 1] is the root of the tree
00174 // for strings that begin with the character i. This is why
00175 // the right child array is larger than the left child array.
00176 // These are also initialized to a "not used" state.
00177 //
00178 // Note that there are 256 of these, one for each of the possible
00179 // 256 characters.
00180
00181 for (i = N + 1; i <= (N + 256); i++)
00182 {
00183 m_rson[i] = NOT_USED;
00184 }
00185
00186 // Done.
00187 }
00188
00189 /*
00190 -------------------------------------------------------------------------
00191 cLZSS::InsertNode
00192
00193 This function inserts a string from the ring buffer into one of
00194 the trees. It loads the match position and length member variables
00195 for the longest match.
00196
00197 The string to be inserted is identified by the parameter Pos.
00198 A full F bytes are inserted. So, m_ring_buffer[Pos ... Pos+F-1]
00199 are inserted.
00200
00201 If the matched length is exactly F, then an old node is removed
00202 in favor of the new one (because the old one will be deleted
00203 sooner).
00204
00205 Note that Pos plays a dual role. It is used as both a position
00206 in the ring buffer and also as a tree node. m_ring_buffer[Pos]
00207 defines a character that is used to identify a tree node.
00208 -------------------------------------------------------------------------
00209 */
00210
00211 void cLZSS::InsertNode( // no return value
00212 short int Pos) // position in the buffer
00213 throw() // exception list
00214
00215 {
00216 short int i;
00217 short int p;
00218 int cmp;
00219 unsigned char * key;
00220
00221 ASSERT(Pos >= 0);
00222 ASSERT(Pos < N);
00223
00224 cmp = 1;
00225 key = &(m_ring_buffer[Pos]);
00226
00227 // The last 256 entries in m_rson contain the root nodes for
00228 // strings that begin with a letter. Get an index for the
00229 // first letter in this string.
00230
00231 p = (short int) (N + 1 + key[0]);
00232
00233 // Set the left and right tree nodes for this position to "not
00234 // used."
00235
00236 m_lson[Pos] = NOT_USED;
00237 m_rson[Pos] = NOT_USED;
00238
00239 // Haven't matched anything yet.
00240
00241 m_match_length = 0;
00242
00243 for ( ; ; )
00244 {
00245 if (cmp >= 0)
00246 {
00247 if (m_rson[p] != NOT_USED)
00248 {
00249 p = m_rson[p];
00250 }
00251 else
00252 {
00253 m_rson[p] = Pos;
00254 m_dad[Pos] = p;
00255 return;
00256 }
00257 }
00258 else
00259 {
00260 if (m_lson[p] != NOT_USED)
00261 {
00262 p = m_lson[p];
00263 }
00264 else
00265 {
00266 m_lson[p] = Pos;
00267 m_dad[Pos] = p;
00268 return;
00269 }
00270 }
00271
00272 // Should we go to the right or the left to look for the
00273 // next match?
00274
00275 for (i = 1; i < F; i++)
00276 {
00277 cmp = key[i] - m_ring_buffer[p + i];
00278 if (cmp != 0)
00279 break;
00280 }
00281
00282 if (i > m_match_length)
00283 {
00284 m_match_position = p;
00285 m_match_length = i;
00286
00287 if (i >= F)
00288 break;
00289 }
00290 }
00291
00292 m_dad[Pos] = m_dad[p];
00293 m_lson[Pos] = m_lson[p];
00294 m_rson[Pos] = m_rson[p];
00295
00296 m_dad[ m_lson[p] ] = Pos;
00297 m_dad[ m_rson[p] ] = Pos;
00298
00299 if (m_rson[ m_dad[p] ] == p)
00300 {
00301 m_rson[ m_dad[p] ] = Pos;
00302 }
00303 else
00304 {
00305 m_lson[ m_dad[p] ] = Pos;
00306 }
00307
00308 // Remove "p"
00309
00310 m_dad[p] = NOT_USED;
00311 }
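
The comparison at the center of InsertNode() can be read in isolation. The sketch below is a hypothetical standalone helper (`CompareStrings` is not a function in the original source): it reports how many leading bytes agree and the sign of the first difference, which is what decides whether the search descends to the right or the left child. The listing starts its loop at index 1 because the tree root was already selected by the first byte; this sketch starts at 0 so that it works on arbitrary inputs.

```cpp
#include <cassert>
#include <cstddef>

// Compare up to `max_len` bytes of `key` against `candidate`.
// Returns the number of leading bytes that agree. `cmp` receives the
// difference at the first mismatch (negative: key sorts first, positive:
// candidate sorts first), or 0 when all `max_len` bytes agree.
std::size_t CompareStrings(const unsigned char* key,
                           const unsigned char* candidate,
                           std::size_t max_len,
                           int& cmp)
{
    assert(key != 0 && candidate != 0);
    cmp = 0;
    std::size_t i = 0;
    for (; i < max_len; ++i) {
        cmp = static_cast<int>(key[i]) - static_cast<int>(candidate[i]);
        if (cmp != 0) {
            break;
        }
    }
    return i;
}
```

A return value equal to `max_len` corresponds to the "matched length is exactly F" case, where the listing replaces the old node with the new one instead of descending further.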
00312
00313 /*
00314 -------------------------------------------------------------------------
00315 cLZSS::DeleteNode
00316
00317 This function removes the node "Node" from the tree.
00318 -------------------------------------------------------------------------
00319 */
00320
00321 void cLZSS::DeleteNode( // no return value
00322 short int Node) // node to be removed
00323 throw() // exception list
00324
00325 {
00326 short int q;
00327
00328 ASSERT(Node >= 0);
00329 ASSERT(Node < (N+1));
00330
00331 if (m_dad[Node] == NOT_USED)
00332 {
00333 // not in tree, nothing to do
00334 return;
00335 }
00336
00337 if (m_rson[Node] == NOT_USED)
00338 {
00339 q = m_lson[Node];
00340 }
00341 else if (m_lson[Node] == NOT_USED)
00342 {
00343 q = m_rson[Node];
00344 }
00345 else
00346 {
00347 q = m_lson[Node];
00348 if (m_rson[q] != NOT_USED)
00349 {
00350 do
00351 {
00352 q = m_rson[q];
00353 }
00354 while (m_rson[q] != NOT_USED);
00355
00356 m_rson[ m_dad[q] ] = m_lson[q];
00357 m_dad[ m_lson[q] ] = m_dad[q];
00358 m_lson[q] = m_lson[Node];
00359 m_dad[ m_lson[Node] ] = q;
00360 }
00361
00362 m_rson[q] = m_rson[Node];
00363 m_dad[ m_rson[Node] ] = q;
00364 }
00365
00366 m_dad[q] = m_dad[Node];
00367
00368 if (m_rson[ m_dad[Node] ] == Node)
00369 {
00370 m_rson[ m_dad[Node] ] = q;
00371 }
00372 else
00373 {
00374 m_lson[ m_dad[Node] ] = q;
00375 }
00376
00377 m_dad[Node] = NOT_USED;
00378 }
00379
00380 /*
00381 -------------------------------------------------------------------------
00382 cLZSS::Encode
00383
00384 This function "encodes" the input stream into the output stream.
00385 The GetChars() and SendChars() functions are used to separate
00386 this method from the actual i/o.
00387 -------------------------------------------------------------------------
00388 */
00389
00390 void cLZSS::Encode( // no return value
00391 void) // no parameters
00392
00393 {
00394 short int i; // an iterator
00395 short int r; // node number in the binary tree
00396 short int s; // position in the ring buffer
00397 unsigned short int len; // len of initial string
00398 short int last_match_length; // length of last match
00399 short int code_buf_pos; // position in the output buffer
00400 unsigned char code_buf[17]; // the output buffer
00401 unsigned char mask; // bit mask for byte 0 of out buf
00402 unsigned char c; // character read from string
00403
00404 // Start with a clean tree.
00405
00406 InitTree();
00407
00408 // code_buf[0] works as eight flags. A "1" represents that the
00409 // unit is an unencoded letter (1 byte), and a "0" represents
00410 // that the next unit is a <position,length> pair (2 bytes).
00411 //
00412 // code_buf[1..16] stores eight units of code. Since the best
00413 // we can do is store eight <position,length> pairs, at most 16
00414 // bytes are needed to store this.
00415 //
00416 // This is why the maximum size of the code buffer is 17 bytes.
00417
00418 code_buf[0] = 0;
00419 code_buf_pos = 1;
00420
00421 // Mask iterates over the 8 bits in the code buffer. The first
00422 // character ends up being stored in the low bit.
00423 //
00424 // bit 8 7 6 5 4 3 2 1
00425 // | |
00426 // | first sequence in code buffer
00427 // |
00428 // last sequence in code buffer
00429
00430 mask = 1;
00431
00432 s = 0;
00433 r = (short int) N - (short int) F;
00434
00435 // Initialize the ring buffer with spaces...
00436
00437 // Note that the last F bytes of the ring buffer are not filled.
00438 // This is because those F bytes will be filled in immediately
00439 // with bytes from the input stream.
00440
00441 memset(m_ring_buffer, ' ', N - F);
00442
00443 // Read F bytes into the last F bytes of the ring buffer.
00444 //
00445 // GetChars() loads the buffer with up to the requested number of
00446 // characters and returns the number actually loaded.
00447
00448 len = GetChars(&(m_ring_buffer[r]), F);
00449
00450 // Make sure there is something to be compressed.
00451
00452 if (len == 0)
00453 return;
00454
00455 // Insert the F strings, each of which begins with one or more
00456 // 'space' characters. Note the order in which these strings
00457 // are inserted. This way, degenerate trees will be less likely
00458 // to occur.
00459
00460 for (i = 1; i <= F; i++)
00461 {
00462 InsertNode((short int) (r - i));
00463 }
00464
00465 // Finally, insert the whole string just read. The member
00466 // variables m_match_length and m_match_position are set.
00467
00468 InsertNode(r);
00469
00470 // Now that we're preloaded, continue till done.
00471
00472 do
00473 {
00474
00475 // m_match_length may be spuriously long near the end of
00476 // text.
00477
00478 if (m_match_length > len)
00479 {
00480 m_match_length = len;
00481 }
00482
00483 // Is it cheaper to store this as a single character? If so,
00484 // make it so.
00485
00486 if (m_match_length < THRESHOLD)
00487 {
00488 // Send one character. Remember that code_buf[0] is the
00489 // set of flags for the next eight items.
00490
00491 m_match_length = 1;
00492 code_buf[0] |= mask;
00493 code_buf[code_buf_pos++] = m_ring_buffer[r];
00494 }
00495
00496 // Otherwise, we do indeed have a string that can be stored
00497 // compressed to save space.
00498
00499 else
00500 {
00501 // The next 16 bits need to contain the position (12 bits)
00502 // and the length (4 bits).
00503
00504 code_buf[code_buf_pos++] = (unsigned char) m_match_position;
00505 code_buf[code_buf_pos++] = (unsigned char) (
00506 ((m_match_position >> 4) & 0xf0) |
00507 (m_match_length - THRESHOLD) );
00508 }
00509
00510 // Shift the mask one bit to the left so that it will be ready
00511 // to store the new bit.
00512
00513 mask = (unsigned char) (mask << 1);
00514
00515 // If the mask is now 0, then we know that we have a full set
00516 // of flags and items in the code buffer. These need to be
00517 // output.
00518
00519 if (mask == 0)
00520 {
00521 // code_buf is the buffer of characters to be output.
00522 // code_buf_pos is the number of characters it contains.
00523
00524 SendChars(code_buf, code_buf_pos);
00525
00526 // Reset for next buffer...
00527
00528 code_buf[0] = 0;
00529 code_buf_pos = 1;
00530 mask = 1;
00531 }
00532
00533 last_match_length = m_match_length;
00534
00535 // Delete old strings and read new bytes...
00536
00537 for (i = 0; i < last_match_length; i++)
00538 {
00539
00540 // Get next character...
00541
00542 if (GetChars(&c, 1) != 1)
00543 break;
00544
00545 // Delete "old strings"
00546
00547 DeleteNode(s);
00548
00549 // Put this character into the ring buffer.
00550 //
00551 // The original comment here says "If the position is near
00552 // the end of the buffer, extend the buffer to make
00553 // string comparison easier."
00554 //
00555 // That's a little misleading, because the "end" of the
00556 // buffer is really what we consider to be the "beginning"
00557 // of the buffer, that is, positions 0 through F - 2.
00558 //
00559 // The idea is that the front end of the buffer is duplicated
00560 // into the back end so that when you're looking at characters
00561 // at the back end of the buffer, you can index ahead (beyond
00562 // the normal end of the buffer) and see the characters
00563 // that are at the front end of the buffer without having
00564 // to adjust the index.
00565 //
00566 // That is...
00567 //
00568 // 1234xxxxxxxxxxxxxxxxxxxxxxxxxxxxx1234
00569 // | | |
00570 // position 0 end of buffer |
00571 // |
00572 // duplicate of front of buffer
00573
00574 m_ring_buffer[s] = c;
00575
00576 if (s < F - 1)
00577 {
00578 m_ring_buffer[s + N] = c;
00579 }
00580
00581 // Increment the position, and wrap around when we're at
00582 // the end. Note that this relies on N being a power of 2.
00583
00584 s = (short int) ( (s + 1) & (N - 1) );
00585 r = (short int) ( (r + 1) & (N - 1) );
00586
00587 // Register the string that is found in
00588 // m_ring_buffer[r..r+F-1].
00589
00590 InsertNode(r);
00591 }
00592
00593 // If we didn't quit because we hit the last_match_length,
00594 // then we must have quit because we ran out of characters
00595 // to process.
00596
00597 while (i++ < last_match_length)
00598 {
00599 DeleteNode(s);
00600
00601 s = (short int) ( (s + 1) & (N - 1) );
00602 r = (short int) ( (r + 1) & (N - 1) );
00603
00604 // Note that len hitting 0 is the key that causes the
00605 // do...while() to terminate. This is the only place
00606 // within the loop that len is modified.
00607 //
00608 // Its original value is F (or a number less than F for
00609 // short strings).
00610
00611 if (--len)
00612 {
00613 InsertNode(r); /* buffer may not be empty. */
00614 }
00615 }
00616
00617 // End of do...while() loop. Continue processing until there
00618 // are no more characters to be compressed. The variable
00619 // "len" is used to signal this condition.
00620 }
00621 while (len > 0);
00622
00623 // There could still be something in the output buffer. Send it
00624 // now.
00625
00626 if (code_buf_pos > 1)
00627 {
00628 // code_buf is the encoded string to send.
00629 // code_buf_pos is the number of characters.
00630
00631 SendChars(code_buf, code_buf_pos);
00632 }
00633
00634 // Done!
00635 }
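
The mirrored tail of the ring buffer and the power-of-two wrap used in Encode() can be tried out at a smaller scale. This is a sketch under assumed, much smaller sizes (`kN` = 8 and `kF` = 4 instead of N = 4096 and F = 18); `PutByte` is a hypothetical helper, not a function from the original source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Small sizes chosen for illustration; the listing uses N = 4096 and F = 18.
const int kN = 8;  // ring size, must be a power of two
const int kF = 4;  // maximum match length

// Store `c` at ring position `s`. When `s` is one of the first kF - 1
// positions, also copy the byte past the end of the ring so that a
// comparison starting near the end can read ahead without wrapping.
// Returns the next ring position.
int PutByte(std::vector<unsigned char>& ring, int s, unsigned char c)
{
    assert(ring.size() == static_cast<std::size_t>(kN + kF - 1));
    assert(s >= 0 && s < kN);
    ring[s] = c;
    if (s < kF - 1) {
        ring[s + kN] = c;  // duplicate of the front of the buffer
    }
    return (s + 1) & (kN - 1);  // wrap; relies on kN being a power of two
}
```

With these sizes the vector holds 11 bytes: positions 0 through 7 form the ring, and positions 8 through 10 mirror positions 0 through 2.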
00636
00637 /*
00638 -------------------------------------------------------------------------
00639 cLZSS::Decode
00640
00641 This function "decodes" the input stream into the output stream.
00642 The GetChars() and SendChars() functions are used to separate
00643 this method from the actual i/o.
00644 -------------------------------------------------------------------------
00645 */
00646
00647 void cLZSS::Decode( // no return value
00648 void) // no parameters
00649
00650 {
00651 int k;
00652 int r; // current position in the ring buffer
00653 unsigned char c[F]; // an array of chars
00654 unsigned char flags; // 8 bits of flags
00655 int flag_count; // which flag we're on
00656 short int pos; // position in the ring buffer
00657 short int len; // number of chars in ring buffer
00658
00659 // Initialize the ring buffer with a common string.
00660 //
00661 // Note that the last F bytes of the ring buffer are not filled.
00662
00663 memset(m_ring_buffer, ' ', N - F);
00664
00665 r = N - F;
00666
00667 flags = (unsigned char) 0;
00668 flag_count = 0;
00669
00670 for ( ; ; )
00671 {
00672
00673 // If there are more bits of interest in this flag, then
00674 // shift that next interesting bit into the 1's position.
00675 //
00676 // If this flag has been exhausted, the next byte must
00677 // be a flag.
00678
00679 if (flag_count > 0)
00680 {
00681 flags = (unsigned char) (flags >> 1);
00682 flag_count--;
00683 }
00684 else
00685 {
00686 // Next byte must be a flag.
00687
00688 if (GetChars(&flags, 1) != 1)
00689 break;
00690
00691 // Set the flag counter. While at first it might appear
00692 // that this should be an 8 since there are 8 bits in the
00693 // flag, it should really be a 7 because the shift must
00694 // be performed 7 times in order to see all 8 bits.
00695
00696 flag_count = 7;
00697 }
00698
00699 // If the low order bit of the flag is now set, then we know
00700 // that the next byte is a single, unencoded character.
00701
00702 if (flags & 1)
00703 {
00704 if (GetChars(c, 1) != 1)
00705 break;
00706
00707 if (SendChars(c, 1) != 1)
00708 break;
00709
00710 // Add to buffer, and increment to next spot. Wrap at end.
00711
00712 m_ring_buffer[r] = c[0];
00713 r = (short int) ( (r + 1) & (N - 1) );
00714 }
00715
00716 // Otherwise, we know that the next two bytes are a
00717 // <position,length> pair. The position is in 12 bits and
00718 // the length is in 4 bits.
00719
00720 else
00721 {
00722 // Original code:
00723 // if ((i = getc(infile)) == EOF)
00724 // break;
00725 // if ((j = getc(infile)) == EOF)
00726 // break;
00727 // i |= ((j & 0xf0) << 4);
00728 // j = (j & 0x0f) + THRESHOLD;
00729 //
00730 // I've modified this to only make one input call, and
00731 // have changed the variable names to something more
00732 // obvious.
00733
00734 if (GetChars(c, 2) != 2)
00735 break;
00736
00737 // Convert these two characters into the position and
00738 // length. Note that the length is always at least
00739 // THRESHOLD, which is why we're able to get a length
00740 // of 18 out of only 4 bits.
00741
00742 pos = (short int) ( c[0] | ((c[1] & 0xf0) << 4) );
00743
00744 len = (short int) ( (c[1] & 0x0f) + THRESHOLD );
00745
00746 // There are now "len" characters at position "pos" in
00747 // the ring buffer that can be pulled out. Note that
00748 // len is never more than F.
00749
00750 for (k = 0; k < len; k++)
00751 {
00752 c[k] = m_ring_buffer[(pos + k) & (N - 1)];
00753
00754 // Add to buffer, and increment to next spot. Wrap at end.
00755
00756 m_ring_buffer[r] = c[k];
00757 r = (short int) ( (r + 1) & (N - 1) );
00758 }
00759
00760 // Add the "len" characters to the output stream.
00761
00762 if (SendChars(c, len) != len)
00763 break;
00764 }
00765 }
00766 }
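
To experiment with the stream format without the GetChars()/SendChars() plumbing, the decoder can be restated over in-memory buffers. The sketch below is an assumption-laden rewrite, not the original code: `DecodeBuffer` and the `k...` constants are hypothetical names, the input is a `std::vector`, the output is a `std::string`, and the whole ring is filled with spaces for simplicity (the listing fills only the first N - F bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decode an LZSS stream held in memory. Assumed layout, matching the
// Decode() listing: a flag byte followed by up to eight items, where a
// flag bit of 1 means a literal byte and 0 means a two-byte
// <position, length> pair. Decoding stops when the input runs out.
std::string DecodeBuffer(const std::vector<unsigned char>& in)
{
    const int kRingSize = 4096;  // N in the listing
    const int kMaxMatch = 18;    // F in the listing
    const int kThreshold = 3;    // THRESHOLD in the listing

    std::vector<unsigned char> ring(kRingSize, ' ');
    std::string out;
    int r = kRingSize - kMaxMatch;  // first write position
    std::size_t i = 0;              // read position in `in`
    unsigned flags = 0;
    int flag_count = 0;

    for (;;) {
        if (flag_count > 0) {
            flags >>= 1;
            --flag_count;
        } else {
            if (i >= in.size()) break;
            flags = in[i++];
            flag_count = 7;  // seven more shifts expose all eight bits
        }

        if (flags & 1) {
            // Literal byte.
            if (i >= in.size()) break;
            unsigned char c = in[i++];
            out.push_back(static_cast<char>(c));
            ring[r] = c;
            r = (r + 1) & (kRingSize - 1);
        } else {
            // <position, length> pair.
            if (i + 2 > in.size()) break;
            int pos = in[i] | ((in[i + 1] & 0xf0) << 4);
            int len = (in[i + 1] & 0x0f) + kThreshold;
            i += 2;
            for (int k = 0; k < len; ++k) {
                unsigned char c = ring[(pos + k) & (kRingSize - 1)];
                out.push_back(static_cast<char>(c));
                ring[r] = c;
                r = (r + 1) & (kRingSize - 1);
            }
        }
    }
    return out;
}
```

As a hand-built example, the six bytes 0x07, 'a', 'b', 'c', 0xEE, 0xF0 hold three literals followed by a pair that points at position 4078 (N - F, where the first literal was stored) with length 3, so they decode to "abcabc".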
00767