Lossless Data Compression Algorithm History

Introduction#

History#

  • Legal Issues
  • The Rise of Deflate

Current Archival Software#

Future Developments#

Compression Techniques#

  • Run-Length Encoding
  • Burrows-Wheeler Transform
  • Entropy Encoding
    • Shannon-Fano Coding
    • Huffman Coding
    • Arithmetic Coding

Compression Algorithms#

  • Sliding Window Algorithms
    • LZ77
    • LZR
    • DEFLATE
    • DEFLATE64
    • LZSS
    • LZH
    • LZB
    • ROLZ
    • LZP
    • LZRW1
    • LZJB
    • LZS
    • LZX
    • LZO
    • LZMA
    • LZMA2
    • Statistical Lempel-Ziv

Dictionary Algorithms#

  • LZ78
  • LZW
  • LZC
  • LZT
  • LZMW
  • LZAP
  • LZWL
  • LZJ

Non-dictionary Algorithms#

  • PPM
  • bzip2
  • PAQ

References#

1 Introduction#

Compression algorithms fall into two broad categories: lossy and lossless. Lossy data compression algorithms typically reduce file size by removing small details that would require a large amount of data to store at full fidelity. In lossy compression, it is impossible to restore the original file exactly, because some of its data has been removed. Lossy compression is commonly used for image and audio data; while it can achieve very high compression ratios by discarding data, this article does not cover it. Lossless data compression reduces the size of a file in such a way that a decompression function can restore the original file exactly, with no data loss. Lossless data compression is ubiquitous in computing, from saving space on personal computers to sending data over networks, communicating via secure shell, or viewing PNG or GIF images.

Lossless data compression algorithms work on the principle that any non-random file contains redundant information that can be condensed using statistical modeling techniques to determine the probability of characters or phrases occurring. These statistical models can then be used to generate codes for specific characters or phrases based on their occurrence probabilities, assigning the shortest codes to the most common data. These techniques include entropy encoding, run-length encoding, and dictionary-based compression. By using these techniques and others, an 8-bit character or a string of such characters can be represented with just a few bits, thus eliminating a significant amount of redundant data.

2 History#

Data compression has played a significant role in computing since the 1970s, when the Internet was becoming increasingly popular and the Lempel-Ziv algorithms were invented, but its history outside of computing is much older. Morse code, invented in 1838, is one of the earliest examples of data compression: the most common letters in English, such as 'e' and 't', were assigned the shortest Morse codes. Later, in 1949, as mainframe computers were starting to take hold, Claude Shannon and Robert Fano invented Shannon-Fano coding. Their algorithm assigns codes to the symbols in a given block of data according to each symbol's probability of occurrence: the more probable a symbol is, the shorter its code, so the data as a whole is represented in a shorter form.

Two years later, David Huffman was studying information theory at MIT and took a class taught by Robert Fano, who allowed students to choose between writing a term paper and taking a final exam. Huffman chose the term paper, whose topic was finding the most efficient method of binary coding. After working for several months with no success, Huffman was ready to abandon the work and start studying for the final instead. At that moment he had a breakthrough, discovering a method very similar to Shannon-Fano coding but more efficient. The key difference between the two is that Shannon-Fano builds its probability tree from the top down by splitting, which can produce suboptimal codes, while Huffman builds it from the bottom up by merging, which always produces an optimal code.

Early implementations of Shannon-Fano and Huffman coding were done in hardware with hard-coded codes. It wasn't until the 1970s, with the rise of the Internet and online storage, that software compression became practical, with Huffman codes generated dynamically from the input data. Later, in 1977, Abraham Lempel and Jacob Ziv published their groundbreaking LZ77 algorithm, the first algorithm to use a dictionary to compress data. More specifically, LZ77 uses a dynamic dictionary commonly known as a sliding window. In 1978, the same two authors published their LZ78 algorithm, which also uses a dictionary but differs from LZ77 in that it parses the input data and generates a static dictionary rather than a dynamic one.

[Image: LZ77 and LZ78 variants]

Both the LZ77 and LZ78 algorithms quickly gained popularity, leading to many variants, as shown in the image.

2.1 Legal Issues#

Since the invention of these algorithms, most have become obsolete, and only a few are now widely used, including DEFLATE, LZMA, and LZX. Most of the commonly used algorithms are derived from LZ77. This is not due to technical superiority, but because LZW, an LZ78 derivative, was patented by Sperry (later Unisys) in 1984, and the patent holder began suing software vendors, server administrators, and even end users for using the GIF format without a license, turning LZW into a patent-encumbered algorithm.

At that time, the UNIX compress utility made very minor modifications to the LZW algorithm, calling it LZC, which was later discontinued due to patent issues. Other UNIX developers also began to move away from using the LZW algorithm in favor of open-source algorithms. This led to the UNIX community adopting gzip based on deflate and bzip2 based on the Burrows-Wheeler transform. In the long run, this was beneficial for the UNIX community, as gzip and bzip2 almost always achieved higher compression ratios than the LZW format. With the expiration of the LZW algorithm's patent in 2003, the patent issues surrounding LZW have subsided. Nevertheless, the LZW algorithm has largely been replaced and is only commonly used in GIF compression. Since then, some LZW derivatives have emerged, but they have not gained widespread use, and the LZ77 algorithm remains dominant.

Another legal battle erupted in 1993 over the LZS algorithm, which Stac Electronics had developed for disk compression software such as Stacker. Microsoft used the LZS algorithm in the disk compression software shipped with MS-DOS 6.0, which reportedly doubled disk capacity. When Stac Electronics discovered that its intellectual property was being used, it sued Microsoft. Microsoft was found guilty and ordered to pay Stac Electronics $120 million in damages, minus $13.6 million awarded in a countersuit finding that Microsoft's infringement had not been willful. Although the Stac Electronics lawsuit produced a significant ruling, it did not hinder the development of the Lempel-Ziv algorithms the way the LZW patent disputes did. The only apparent consequence is that LZS has not found its way into any new algorithms.

2.2 The Rise of Deflate#

Since the publication of the Lempel-Ziv algorithms, businesses and other large entities had been using data compression to meet their growing storage needs. However, data compression did not see widespread use until the rise of the Internet: demand took off in the late 1980s, when bandwidth was either limited or expensive, and compression helped relieve those bottlenecks. Compression became especially important once the World Wide Web was developed and people began sharing images and other formats far larger than text. To meet this demand, several new file formats incorporating compression were developed, including ZIP, GIF, and PNG.

In 1985, Thom Henderson released the first commercially successful archive format, ARC, through his company System Enhancement Associates. ARC was one of the first programs capable of both bundling and compressing files, and it was also open source, which made it particularly popular in the BBS community. The ARC format compressed data using a modification of the LZW algorithm. Phil Katz noticed ARC's popularity and decided to improve on it by writing the compression and decompression routines in assembly language. He released his PKARC program as shareware in 1987 and was sued by Henderson for copyright infringement. Katz was found guilty, in part because PKARC was an obvious copy of ARC (in some places even the misspellings in the comments were identical), and he was forced to pay royalties and other penalties as part of a cross-licensing agreement.

Because of the restrictions of the cross-licensing agreement, Katz could no longer sell PKARC after 1988, so in 1989 he created a reworked version, now known as the ZIP format. Since the format initially relied on LZW, it was considered patent-encumbered, and Katz later switched to the new IMPLODE algorithm. In 1993, Katz released PKZIP 2.0, which implemented the DEFLATE algorithm along with other features such as split volumes. Despite its age, almost all .zip files today follow the PKZIP 2.0 format, making this version of ZIP ubiquitous.

GIF, the Graphics Interchange Format, was developed by CompuServe in 1987 to allow bitmap images to be exchanged without data loss (although the format is limited to 256 colors per frame) while significantly reducing file sizes for transmission over dial-up modems. However, like the ZIP format, GIF is based on the LZW algorithm. Despite being patent-encumbered, Unisys was unable to enforce its patent well enough to stop the format's spread. Even more than 20 years later, GIF remains widely used, particularly for its ability to display animations.

Although GIF could not be stopped, CompuServe sought a format unencumbered by patents and in 1994 introduced the Portable Network Graphics (PNG) format. Like ZIP, the PNG standard uses the DEFLATE algorithm for compression. Although DEFLATE was patented by Katz, the patent was never enforced, allowing PNG and other DEFLATE-based formats to avoid infringement. While LZW dominated the early days of compression, it has gradually faded into history, owing largely to Unisys's litigiousness, and has been replaced by the faster and more efficient DEFLATE. DEFLATE is currently the most commonly used data compression algorithm, something of a Swiss Army knife of compression.

In addition to being used in PNG and ZIP formats, DEFLATE is also very common in other areas of computing. For example, the gzip (.gz) file format uses DEFLATE, as it is essentially an open-source version of ZIP. Other uses of DEFLATE include HTTP, SSL, and other technologies aimed at achieving efficient data compression over networks.

Unfortunately, Phil Katz did not live to see his DEFLATE algorithm conquer the computing world. He struggled with alcoholism for years, and his life began to unravel in the late 1990s, with several arrests for drunk driving and other offenses. On April 14, 2000, Katz was found dead in a hotel room at the age of 37. The cause of death was acute pancreatitis brought on by alcoholism, as suggested by the many empty bottles found beside his body.

2.3 Current Archival Software#

By the mid-1990s, new and better formats began to emerge, and the ZIP format and other DEFLATE-based formats no longer dominated. In 1993, Eugene Roshal released his compression software WinRAR, using the proprietary RAR format. The latest version of RAR employs a combination of PPM and LZSS algorithms, but earlier implementations are not well known. RAR has become the standard format for sharing files on the internet, particularly for distributing pirated media. An open-source implementation of the Burrows-Wheeler transform called bzip2 was released in 1996, quickly becoming significant competition for the DEFLATE-based gzip format on UNIX platforms. Another open-source compression program, 7-Zip or .7z format, was released in 1999. Due to its generally higher compression ratios and the modular and open nature of the format, 7-Zip may be the first format to challenge the dominance of ZIP and RAR. This format is not limited to a single compression algorithm but allows for selection among algorithms like bzip2, LZMA, LZMA2, and PPMd. Finally, at the forefront of archival software is the PAQ* format. The first PAQ format, PAQ1, was released by Matt Mahoney in 2002. PAQ significantly improved the PPM algorithm by combining two or more statistical models using a technique called context mixing to generate better predictions for the next symbol than any single model could.

3 Future Developments#

The future is always uncertain, but based on current trends some predictions can be made about the future of data compression. Context-mixing algorithms such as PAQ and its variants are gaining popularity and tend to achieve the highest compression ratios, though they are slow. With hardware speeds increasing along the lines of Moore's law, context-mixing algorithms are likely to shine in scenarios where compression ratio matters most, since the speed penalty will be overcome by faster hardware. Improved variants of the prediction by partial matching (PPM) algorithm may also emerge. Finally, the Lempel-Ziv Markov chain algorithm (LZMA) has consistently shown an excellent balance between speed and compression ratio, so more variants of it are likely; it has already been adopted by many competing compression formats since its introduction in 7-Zip, and it may even come out the "winner." Another potential direction is compression via substring enumeration (CSE), an emerging technique that has yet to see many software implementations. In its naive form it performs comparably to bzip2 and PPM, and researchers have been working to improve its efficiency.

4 Compression Techniques#

Many different techniques are used to compress data. Most compression techniques cannot exist independently and must be combined to form a compression algorithm. Those that can exist independently often become more effective when combined with other compression techniques. Most of these techniques belong to entropy encoders, but there are also some other commonly used techniques, such as run-length encoding and the Burrows-Wheeler transform.

4.1 Run-Length Encoding#

Run-length encoding (RLE) is a very simple compression technique that replaces runs of the same character with a count followed by that character; a single character is encoded as a run of length one. RLE is very useful for highly redundant data, such as indexed images with many rows of identically colored pixels, and it also works well in combination with other compression techniques like the Burrows-Wheeler transform.

Here is a quick example of RLE:

Input: AAABBCCCCDEEEEEEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Output: 3A2B4C1D6E38A
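As a rough illustration of the idea (a minimal sketch, not any particular tool's implementation), a run-length encoder and decoder for this count-plus-character scheme can be written in a few lines of Python; note that the simple textual format assumes the input itself contains no digits:

```python
import re

def rle_encode(data: str) -> str:
    """Encode each run of identical characters as <count><char>, e.g. "AAAB" -> "3A1B"."""
    return "".join(f"{len(m.group(0))}{m.group(1)}" for m in re.finditer(r"(.)\1*", data))

def rle_decode(encoded: str) -> str:
    """Expand each <count><char> pair back into a run of characters."""
    return "".join(ch * int(count) for count, ch in re.findall(r"(\d+)(\D)", encoded))

print(rle_encode("AAABBCCCCD"))  # 3A2B4C1D
```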

4.2 Burrows-Wheeler Transform#

The Burrows-Wheeler transform is a compression technique invented in 1994, designed to reversibly transform input data blocks to maximize the number of repeated occurrences of the same character. BWT itself does not perform any compression operations; it merely transforms the input into a form that can be encoded more efficiently by a run-length encoder or other secondary compression techniques.

The algorithm for BWT is quite simple:

  1. Create an array of strings.
  2. Generate all possible rotations of the input string and store each rotation in the array.
  3. Sort the array in lexicographical order.
  4. Return the last column of the array.

BWT typically works best on long inputs with many alternating occurrences of the same characters. For example, take the input "HAHAHA&", where & is the end-of-file character.

Because of its alternating identical characters, performing BWT on this input yields an optimal result, "HHH&AAA", which another algorithm such as RLE can then compress further, producing "3H&3A." Although this example produces an optimal result, BWT does not yield optimal results on most real data.
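Following the example above, a naive Python sketch of the four steps might look like this; it treats the & end-of-file marker as the largest character so that it sorts last, and real implementations use suffix arrays rather than materializing every rotation:

```python
def bwt(text: str, eof: str = "&") -> str:
    """Naive Burrows-Wheeler transform following the four steps above."""
    assert eof not in text
    s = text + eof
    rotations = [s[i:] + s[:i] for i in range(len(s))]                  # step 2: all rotations
    rotations.sort(key=lambda r: [0x110000 if c == eof else ord(c)      # step 3: sort, with the
                                  for c in r])                          #   EOF marker largest
    return "".join(r[-1] for r in rotations)                            # step 4: last column

print(bwt("HAHAHA"))  # HHH&AAA, which RLE then compresses as noted above
```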

4.3 Entropy Encoding#

In data compression, entropy refers to the average number of bits needed to represent a symbol or character. Basic entropy encoders combine statistical models and encoders. The input file is parsed and used to generate a statistical model composed of the probabilities of given symbols occurring. The encoder then uses the statistical model to determine what bit or byte code to assign to each symbol, making the most common symbols have the shortest codes and the least common symbols have the longest codes.
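In Shannon's terms, if the i-th symbol occurs with probability $p_i$, the entropy of the source is

$$H = -\sum_{i} p_i \log_2 p_i$$

bits per symbol, which is the lower bound that these entropy encoders try to approach.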

4.3.1 Shannon-Fano Coding#

This is one of the earliest compression techniques, invented by Claude Shannon and Robert Fano in 1949. This technique involves generating a binary tree to represent the probabilities of each symbol's occurrence. These symbols are sorted so that the most common symbols appear at the top of the tree, while the least likely symbols appear at the bottom.

The encoding for a given symbol is obtained by searching the Shannon-Fano tree for that symbol and appending a 0 or 1 for each left or right branch taken, respectively. For example, if "A" is reached by two left branches and one right branch, its code is "001." Because the tree is built from the top down by splitting, Shannon-Fano coding does not always generate optimal codes. For this reason, Huffman coding, which generates optimal codes for any given input, is often used instead.

The algorithm for generating Shannon-Fano codes is quite simple:

  1. Parse the input and count the occurrences of each symbol.
  2. Use the symbol counts to determine the probability of each symbol.
  3. Sort the symbols by probability, with the most likely symbols at the front.
  4. Generate leaf nodes for each symbol.
  5. Split the list into two parts while ensuring that the probability of the left branch is roughly equal to that of the right branch.
  6. Add 0 and 1 as prefixes to the left and right nodes' codes, respectively.
  7. Recursively apply steps 5 and 6 to the left and right subtrees until each node is a leaf in the tree.
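As a rough illustration (a minimal sketch rather than a production encoder), the splitting procedure above might be written in Python as follows, taking symbol counts and returning a code table:

```python
from collections import Counter

def shannon_fano(freqs, prefix=""):
    """Recursively split a probability-sorted symbol list into two halves of roughly
    equal total frequency, prefixing 0 to the left half and 1 to the right half."""
    symbols = sorted(freqs, key=freqs.get, reverse=True)    # most likely symbols first
    if len(symbols) == 1:
        return {symbols[0]: prefix or "0"}
    total, running, split, best_diff = sum(freqs.values()), 0, 1, float("inf")
    for i in range(1, len(symbols)):                        # find the most balanced split point
        running += freqs[symbols[i - 1]]
        if abs(total - 2 * running) < best_diff:
            best_diff, split = abs(total - 2 * running), i
    codes = shannon_fano({s: freqs[s] for s in symbols[:split]}, prefix + "0")
    codes.update(shannon_fano({s: freqs[s] for s in symbols[split:]}, prefix + "1"))
    return codes

print(shannon_fano(Counter("ABRACADABRA")))  # e.g. {'A': '0', 'B': '10', 'R': '110', ...}
```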

4.3.2 Huffman Coding#

Huffman coding is another variant of entropy encoding that works very similarly to Shannon-Fano coding, but the binary tree is built from the bottom up, which always produces an optimal result.

The algorithm for generating Huffman codes shares its initial steps with Shannon-Fano:

  1. Parse the input and count the occurrences of each symbol.

  2. Use the symbol counts to determine the probability of each symbol. Sort the symbols by probability, with the most likely at the front.

  3. Generate a leaf node for each symbol, including its probability P, and add it to a queue.

  4. While the number of nodes in the queue > 1, perform the following:

    • Remove the two nodes with the lowest probabilities from the queue.
    • Add 0 and 1 as prefixes to the left and right nodes' codes, respectively.
    • Create a new node whose value equals the sum of the probabilities of the two nodes.
    • Assign the first node to the left branch and the second node to the right branch.
    • Add the node to the queue.
  5. The last remaining node in the queue is the root node of the Huffman tree.
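For comparison, here is a minimal Python sketch of the queue-based procedure above; as an implementation shortcut it tracks each symbol's partial code directly instead of building an explicit tree:

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Repeatedly merge the two least probable nodes (bottom-up), prepending
    0/1 to the codes of the first/second node at every merge."""
    freqs = Counter(data)
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)                           # unique tie-breaker so dicts are never compared
    while len(heap) > 1:
        f1, _, first = heapq.heappop(heap)    # the two lowest-probability nodes
        f2, _, second = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in first.items()}
        merged.update({s: "1" + c for s, c in second.items()})
        tie += 1
        heapq.heappush(heap, (f1 + f2, tie, merged))
    return heap[0][2]

print(huffman_codes("ABRACADABRA"))
```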

4.3.3 Arithmetic Coding#

This method was developed by IBM in 1979 while the company was researching data compression techniques for its mainframe computers. When the goal is the best possible compression ratio, arithmetic coding is arguably the best entropy encoding technique, as it typically achieves better results than Huffman coding. It is, however, significantly more complicated than the other coding techniques.

Arithmetic coding does not split the symbol probabilities into a tree structure. Instead, it transforms the input into a single rational number between 0 and 1 by changing the number base and assigning each unique symbol a value between 0 and the base. The result is then converted into a fixed-point binary number, which is the encoded output. The value can be decoded back into the original input by converting it from binary back to the original base and replacing the values with the symbols they correspond to.

The general algorithm for calculating arithmetic coding is as follows:

  1. Count the number of unique symbols in the input.
  2. This number represents the base b for arithmetic coding (e.g., base 2 is binary).
  3. Assign a value from 0 to b-1 to each unique symbol in the order it appears.
  4. Replace each symbol in the input with its value from step 3.
  5. Convert the result of step 4 from base b to a binary fraction long enough to retain precision.
  6. Record the length of the input string in the result, as this is necessary for the decoding process.

Here is an example encoding operation, given the input "ABCDAABD":

  1. Discover that there are 4 unique symbols in the input, so the base is 4.
    Length is 8.
  2. Assign values to the symbols: A=0, B=1, C=2, D=3.
  3. Replace the input with the corresponding digits: "0.01230013" in base 4, where the leading "0." is not a symbol.
  4. Convert "0.01230013" from base 4 to base 2: "0.0001101100000111".
  5. Record the result, noting that the input length of 8 must be stored with it.

Assuming 8-bit characters, the input is 64 bits long, while its arithmetic coding is just 16 bits, an excellent compression ratio of about 25%. This example demonstrates how well arithmetic coding compresses when given a limited character set.
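To make these steps concrete, here is a small Python sketch of the simplified base-conversion scheme described above; it is not a full adaptive arithmetic coder (which would narrow an interval and renormalize with integer arithmetic), and it uses exact fractions purely for illustration:

```python
from fractions import Fraction

def encode_base_conversion(data):
    """Map each unique symbol to a digit in base b (order of first appearance),
    treat the digit string as a base-b fraction, and rewrite it in binary."""
    symbols = sorted(set(data), key=data.index)          # e.g. A, B, C, D
    base = len(symbols)
    values = {s: i for i, s in enumerate(symbols)}       # A=0, B=1, C=2, D=3
    digits = [values[s] for s in data]
    frac = sum(Fraction(d, base ** (i + 1)) for i, d in enumerate(digits))
    bits = ""
    while frac and len(bits) < 8 * len(digits):          # precision cutoff for the sketch
        frac *= 2
        bits += str(frac // 1)                           # emit the next binary digit
        frac -= frac // 1
    return base, len(data), "0." + bits

print(encode_base_conversion("ABCDAABD"))  # (4, 8, '0.0001101100000111')
```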

5 Compression Algorithms#

5.1 Sliding Window Algorithms#

5.1.1 LZ77#

The LZ77 algorithm, published in 1977, is a groundbreaking algorithm. It introduced the concept of the "sliding window," which brought about greater improvements in compression ratios than more primitive algorithms. LZ77 uses triplets to maintain a dictionary, where these triplets represent an offset, run length, and a literal character. The offset indicates the distance from the beginning of the file where a given phrase starts, while the run length indicates the number of characters from the offset to offset + length. The literal character simply indicates that a new phrase has been found, and that phrase equals the characters from offset to offset + length plus the literal character. The dictionary used changes dynamically based on the sliding window as the file is parsed. For example, the sliding window might be 64MB, meaning the dictionary will contain entries for the past 64MB of input data.

Given the input "abbadabba," the output would be similar to "abb(0,1,'d')(0,3,'a')," as shown in the example below:

[Image: LZ77 encoding of "abbadabba"]

While this replacement is slightly larger than the input, it typically yields smaller results when the input data is longer.
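A naive Python sketch of this scheme follows; offsets are measured from the start of the input as in the example above, literals are carried as zero-length matches, and the greedy tokenization may differ slightly from the figure:

```python
def lz77_encode(data):
    """Greedy LZ77-style encoder emitting (offset, length, next_char) triples,
    with offsets measured from the start of the input."""
    i, out = 0, []
    while i < len(data):
        best_off = best_len = 0
        for j in range(i):                       # try every earlier start position
            k = 0
            while i + k < len(data) - 1 and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = j, k
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(tokens):
    out = []
    for off, length, ch in tokens:
        for k in range(length):
            out.append(out[off + k])             # copy from already-reconstructed output
        out.append(ch)
    return "".join(out)

tokens = lz77_encode("abbadabba")
print(tokens, lz77_decode(tokens) == "abbadabba")
```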

5.1.2 LZR#

LZR is a modification of the LZ77 algorithm developed by Michael Rodeh in 1981. The algorithm aims to be a linear-time alternative to LZ77. However, its encoded pointers can point to any offset in the file, which means LZR consumes a significant amount of memory. Combined with its poorer compression ratio (LZ77 is usually superior), this makes LZR an impractical variant.

5.1.3 DEFLATE#

DEFLATE, invented by Phil Katz in 1993, is the foundation for most compression tasks today. It simply combines an LZ77 or LZSS preprocessor with a backend Huffman coding to achieve moderate compression results in a short time.

5.1.4 DEFLATE64#

DEFLATE64 is a proprietary extension of the DEFLATE algorithm that increases the dictionary size to 64KB (hence the name) and allows for greater distances in the sliding window. Compared to DEFLATE, it improves performance and compression ratios. However, the proprietary nature of DEFLATE64 and its modest improvements over DEFLATE have limited its adoption. Instead, open-source algorithms like LZMA are typically used.

5.1.5 LZSS#

The LZSS (Lempel-Ziv-Storer-Szymanski) algorithm was first released by James Storer and Thomas Szymanski in 1982. LZSS improves upon LZ77 by detecting whether a replacement will reduce file size. If there is no size reduction, the input is retained as a literal value in the output. Otherwise, the portion of the input is replaced with an (offset, length) pair, where the offset indicates how many bytes from the beginning of the input, and the length indicates how many characters to read from that position. Another improvement of LZSS over LZ77 is the elimination of the "next character," using only the offset and length pair.

Here is a brief example: given the input " these theses" (note the leading space), the output is " these(0,6)s", which saves only one byte; the savings grow for larger inputs.


LZSS is still used by many popular archive formats, the most notable being RAR. It is sometimes also used for network data compression.

5.1.6 LZH#

LZH, short for "Lempel-Ziv Huffman," was developed in 1987. It is a variant of LZSS that utilizes Huffman coding to compress pointers, achieving slightly better compression. However, the improvements brought by using Huffman coding are negligible, and the compression results do not justify the performance cost of using Huffman coding.

5.1.7 LZB#

LZB is an LZSS variant developed by Timothy Bell and others in 1987. Similar to LZH, LZB aims to reduce the size of compressed files by more efficiently encoding LZSS pointers. It achieves this by gradually increasing the size of the pointers, making them larger as the sliding window increases. Compared to LZSS and LZH, it can achieve higher compression rates, but due to the additional encoding steps for the pointers, it is still slower than LZSS.

5.1.8 ROLZ#

ROLZ stands for "Reduced Offset Lempel-Ziv," which aims to improve the compression ratio of LZ77 by limiting the offset length to reduce the amount of data required for encoding the offset-length pairs. This derivative of LZ77 first appeared in Ross Williams's LZRW4 algorithm, with other implementations including BALZ, QUAD, and RZM. Highly optimized ROLZ can achieve compression ratios nearly identical to LZMA, but due to a lack of widespread application, ROLZ's popularity is lower.

5.1.9 LZP#

LZP stands for "Lempel-Ziv + Prediction." It is a special case of the ROLZ algorithm where the offset is reduced to 1. Several variants exist, using different techniques to achieve either faster operation or better compression ratios. LZP4 implements an arithmetic encoder to achieve the best compression ratio, at the cost of speed.

5.1.10 LZRW1#

The LZRW1 algorithm was created by Ross Williams in 1991, introducing the concept of "Reduced Offset Lempel-Ziv compression." LZRW1 can achieve high compression ratios while remaining fast and efficient. Ross Williams also created several variants that improve on LZRW1, such as LZRW1-A, 2, 3, 3-A, and 4.

5.1.11 LZJB#

Jeff Bonwick created the Lempel-Ziv Jeff Bonwick (LZJB) algorithm in 1998 for the Solaris ZFS file system. It is considered a variant of the LZRW algorithm, particularly LZRW1, and is aimed at maximum compression speed. Since it is used in a file system, speed is especially important so that the compression algorithm does not become a bottleneck for disk operations.

5.1.12 LZS#

The Lempel-Ziv-Stac (LZS) algorithm was developed by Stac Electronics in 1994 as an improved LZ77 algorithm for disk compression software. Its output distinguishes literal symbols from offset-length pairs and omits the trailing literal that LZ77 attaches to each pair; functionally, it is most similar to the LZSS algorithm.

5.1.13 LZX#

The LZX algorithm was developed by Jonathan Forbes and Tomi Poutanen in 1995 for the Amiga computer. The X in LZX has no special meaning. Forbes sold the algorithm to Microsoft in 1996 and worked for Microsoft to further improve it for use in Microsoft's cabinet (.CAB) format. The algorithm was also used by Microsoft to compress compiled HTML help (CHM) files, Windows Imaging Format (WIM) files, and Xbox Live Avatars.

5.1.14 LZO#

LZO was developed by Markus Oberhumer in 1996, with the goal of achieving fast compression and decompression. It allows for adjustable compression levels, and the highest compression level requires only an additional 64KB of memory, while decompression requires only input and output buffers. LZO's functionality is very similar to the LZSS algorithm, but it is focused on speed rather than compression ratio.

5.1.15 LZMA#

The Lempel-Ziv Markov chain algorithm was first made public in 1998 with the release of the 7-Zip compression software for the .7z file format. It typically achieves better compression than bzip2, DEFLATE, and other algorithms. LZMA employs a series of compression techniques to achieve its output. First, it uses a modified version of the LZ77 algorithm to parse data at the bit level rather than the traditional byte level. Then, the output of the LZ77 algorithm is subjected to arithmetic coding. Depending on the specific LZMA implementation, more techniques may be applied. The result is often better than the compression ratios of most other LZ variants, primarily due to the use of bit-level compression rather than byte-level compression.

5.1.16 LZMA2#

LZMA2 is a progressive improvement over the original LZMA algorithm, first introduced in 2009 through an update to the 7-Zip archiving software. LZMA2 enhances the multithreading capabilities and performance of the LZMA algorithm and better handles incompressible data, resulting in slightly better compression.

5.1.17 Statistical Lempel-Ziv#

Statistical Lempel-Ziv is a concept proposed by Dr. Sam Kwong and Yu Fan Ho in 2001. Its fundamental principle is to combine statistical analysis of data with LZ77 variant algorithms to further optimize the encoding stored in the dictionary.

5.2 Dictionary Algorithms#

5.2.1 LZ78#

The LZ78 algorithm was created by Lempel and Ziv in 1978, hence the "78" in the abbreviation. Rather than generating a dictionary from a sliding window, the input is either preprocessed to build a dictionary with unlimited scope over the input, or the dictionary is built up as the file is parsed; LZ78 takes the latter approach. The dictionary size is typically limited to a few megabytes, or all codes are capped at a certain number of bytes, such as 8, in order to reduce memory requirements. How a full dictionary is handled is the main way LZ78-type algorithms differ from one another.

While parsing the file, the LZ78 algorithm adds each newly encountered character or string to the dictionary. For each symbol in the input, a dictionary entry of the form (dictionary index, unknown symbol) is generated. If a symbol is already in the dictionary, the dictionary is searched for substrings consisting of the current symbol and the symbols that follow it, and the index of the longest matching substring is used as the dictionary index. The data pointed to by that index is extended with the last character of the unknown substring. If the current symbol is unknown, the dictionary index is set to 0 to indicate a single-character entry. These entries form a data structure similar to a linked list.

For example, the input "abbadabbaabaad" will generate the output "(0,a)(0,b)(2,a)(0,d)(1,b)(3,a)(6,d)," as demonstrated in the example below:

[Image: LZ78 dictionary entries for "abbadabbaabaad"]
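A minimal Python sketch of this parsing process, using index 0 for the empty phrase and reproducing the token stream above, might look like this:

```python
def lz78_encode(data):
    """LZ78 parsing: extend the longest phrase already in the dictionary, emit
    (dictionary index, next symbol), and add the extended phrase as a new entry."""
    dictionary, out, phrase = {}, [], ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch                          # keep extending the known phrase
        else:
            out.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:                                    # input ended inside a known phrase
        out.append((dictionary[phrase], ""))
    return out

print(lz78_encode("abbadabbaabaad"))
# [(0, 'a'), (0, 'b'), (2, 'a'), (0, 'd'), (1, 'b'), (3, 'a'), (6, 'd')]
```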

5.2.2 LZW#

LZW is the Lempel-Ziv-Welch algorithm, created by Terry Welch in 1984. Despite serious patent issues, LZW became the most widely used algorithm in the LZ78 family. It improves upon LZ78 by eliminating redundant characters from the output, so that the output consists entirely of pointers. It also includes every character in the dictionary before compression begins, and it employs other tricks to improve compression, such as encoding the last character of each new phrase as the first character of the next phrase. LZW is commonly found in the Graphics Interchange Format (GIF), in early specifications of the ZIP format, and in other specialized applications. It is very fast, but its compression is poor compared with most newer algorithms, and some newer algorithms are both faster and compress better.
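For illustration, a minimal Python sketch of this scheme is shown below; it seeds the dictionary from the input's alphabet for brevity, whereas practical LZW implementations seed it with all 256 byte values:

```python
def lzw_encode(data):
    """Classic LZW: the dictionary starts with every single character, and the
    output is made up entirely of dictionary indices."""
    dictionary = {ch: i for i, ch in enumerate(sorted(set(data)))}
    out, phrase = [], ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch
        else:
            out.append(dictionary[phrase])
            dictionary[phrase + ch] = len(dictionary)   # add the extended phrase
            phrase = ch             # the new entry's last character starts the next phrase
    if phrase:
        out.append(dictionary[phrase])
    return out

print(lzw_encode("abbadabbaabaad"))  # [0, 1, 1, 0, 2, 3, 5, 3, 0, 6]
```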

5.2.3 LZC#

LZC (Lempel-Ziv Compress) is a slight modification of the LZW algorithm used in the UNIX compress utility. The main difference between LZC and LZW is that LZC monitors the compression ratio of the output. Once the ratio exceeds a certain threshold, the dictionary is discarded and rebuilt.

5.2.4 LZT#

The Lempel-Ziv Tischer (LZT) algorithm is an improvement on LZC: when the dictionary is full, it deletes the least recently used phrases and replaces them with new entries. There are a few other incremental improvements, but neither LZC nor LZT is commonly used today.

5.2.5 LZMW#

The LZMW algorithm was invented by Victor Miller and Mark Wegman in 1984. Like LZT, it discards the least recently used phrases when the dictionary fills. However, rather than joining similar entries in the dictionary, LZMW joins together the last two phrases that were encoded and stores the result as a new entry. As a result, the dictionary can grow very quickly, and least recently used entries must be discarded more often. LZMW generally achieves better compression than LZT, but it is another algorithm that is rarely seen in practice.

5.2.6 LZAP#

LZAP was created by James Storer in 1988 as a modification of the LZMW algorithm. AP stands for "All Prefixes," meaning the dictionary stores not just single phrases but every permutation. For example, if the previous phrase is "last" and the current phrase is "next," the dictionary will store "lastn," "lastne," "lastnex," and "lastnext."

5.2.7 LZWL#

LZWL is a modified LZW algorithm created in 2006 that uses syllables instead of single characters for compression. The design of LZWL aims to better handle specific datasets with many common syllables, such as XML data. Typically, such algorithms are used with preprocessors that break down input data into syllables.

5.2.8 LZJ#

Matti Jakobsson released the LZJ algorithm in 1985, which is one of the few LZ78 algorithms that differs from LZW. This algorithm works by storing each unique string (up to an arbitrary maximum length) processed in the input in the dictionary and assigning a code to each string. When the dictionary is full, all entries that appear only once are deleted.

5.3 Non-dictionary Algorithms#

5.3.1 PPM#

Prediction by Partial Matching is a statistical modeling technique that uses a set of previously seen symbols in the input to predict the next symbol, thereby reducing the entropy of the output data. This differs from dictionaries, as PPM predicts the next symbol rather than attempting to find the next symbol in the dictionary for encoding. PPM is typically combined with encoders such as arithmetic coding or adaptive Huffman coding. PPM or its variant PPMd is implemented in many archive formats, including 7-Zip and RAR.

5.3.2 bzip2#

bzip2 is an open-source implementation of the Burrows-Wheeler transform. It works very simply, achieving a very good balance between speed and compression ratio, making the bzip2 format very popular in UNIX environments. First, run-length encoding is applied to the data. Next, the Burrows-Wheeler transform is applied. Then, a move-to-front transform is applied, aimed at creating large runs of the same symbol for another run-length encoder to use. Finally, the result is Huffman coded and wrapped in a header.
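As one concrete piece of that pipeline, here is a minimal Python sketch of the move-to-front step; bzip2 itself works on bytes and adds further machinery, so this is only illustrative:

```python
def move_to_front(data):
    """Replace each symbol with its current position in a recency list; runs of
    identical symbols (typical of BWT output) turn into runs of zeros."""
    recency = sorted(set(data))               # recency list seeded from the input's alphabet
    out = []
    for ch in data:
        idx = recency.index(ch)
        out.append(idx)
        recency.insert(0, recency.pop(idx))   # move the symbol to the front
    return out

print(move_to_front("HHH&AAA"))  # [2, 0, 0, 1, 2, 0, 0]
```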

5.3.3 PAQ#

PAQ was created by Matt Mahoney in 2002, aiming to improve the older PPM(d) algorithm. It uses a revolutionary technique called context mixing, intelligently combining multiple statistical models (with PPM being one of them) to better predict the next symbol than any single model could. PAQ is one of the most promising algorithms due to its extremely high compression ratios and very active development. Since its inception, over 20 variants have been created, some of which have achieved record-breaking compression ratios. The biggest drawback of PAQ is its slow speed due to the use of multiple statistical models to achieve optimal compression. However, as hardware continues to get faster, it may become the standard in the future. PAQ is slowly being adopted and can be found in the PeaZip program on Windows, with 64-bit support and major speed improvements in the PAQ8O variant. Most other PAQ formats are primarily command-line only.

6 References#

  1. Wolfram, Stephen. A New Kind of Science. Champaign, IL: Wolfram Media, 2002. 1069. Print.

  2. Ken Huffman. Profile: David A. Huffman, Scientific American, September 1991, pp. 54–58.

  3. Ziv J., Lempel A., “A Universal Algorithm for Sequential Data Compression”, IEEE Transactions on Information Theory, Vol. 23, No. 3 (1977), pp. 337-343.

  4. Ziv J., Lempel A., “Compression of Individual Sequences via Variable-Rate Coding”, IEEE Transactions on Information Theory, Vol. 24, No. 5, pp. 530-536.

  5. USPTO Patent #4814746. See http://www.theregister.co.uk/1999/09/01/unisys_demands_5k_licence_fee

  6. blog [Fedora & Linux Tips]

  7. http://www.msversus.org/archive/stac.html

  8. ARC Info

  9. comp.compression Frequently Asked Questions (part 1/3), Section [8]: What about patents on data compression algorithms?

  10. USPTO Patent #5051745

  11. Phil Katz' Death

  12. Iwata, K., Arimura, M., and Shima, Y., "An Improvement in Lossless Data Compression via Substring Enumeration", 2011 IEEE/ACIS 10th International Conference on Computer and Information Science (ICIS).

  13. Burrows M., and Wheeler, D. J. 1994. A Block-Sorting Lossless Data Compression Algorithm. SRC Research Report 124, Digital Systems Research Center.

  14. http://www.cs.tau.ac.il/~dcor/Graphics/adv-slides/entropy.pdf

  15. Shannon, C.E. (July 1948). "A Mathematical Theory of Communication". Bell System Technical Journal 27: 379–423.

  16. HUFFMAN, D. A. 1952. A method for the construction of minimum-redundancy codes. In Proceedings of the Institute of Electrical and Radio Engineers 40, 9 (Sept.), pp. 1098-1101.

  17. RISSANEN, J., AND LANGDON, G. G. 1979. Arithmetic coding. IBM J. Res. Dev. 23, 2 (Mar.), 149-162.

  18. RODEH, M., PRATT, V. R., AND EVEN, S. 1981. Linear algorithm for data compression via string matching. J. ACM 28, 1 (Jan.), 16-24.

  19. Bell, T., Witten, I., Cleary, J., "Modeling for Text Compression", ACM Computing Surveys, Vol. 21, No. 4 (1989).

  20. DEFLATE64 benchmarks

  21. STORER, J. A., AND SZYMANSKI, T. G. 1982. Data compression via textual substitution. J. ACM 29, 4 (Oct.), 928-951.

  22. Bloom, C., "LZP: a new data compression algorithm", Data Compression Conference, 1996. DCC '96. Proceedings, p. 425 10.1109/DCC.1996.488353.

  23. Dr Ross's Compression Crypt

  24. "Data Compression Method - Adaptive Coding with Sliding Window for Information Interchange", American National Standard for Information Systems, August 30, 1994.

  25. LZX Sold to Microsoft

  26. LZO Info

  27. LZMA Accessed on 12/10/2011.

  28. LZMA2 Release Date

  29. Kwong, S., Ho, Y.F., "A Statistical Lempel-Ziv Compression Algorithm for Personal Digital Assistant (PDA)", IEEE Transactions on Consumer Electronics, Vol. 47, No. 1, February 2001, pp 154-162.

  30. David Salomon, Data Compression – The complete reference, 4th ed., page 212

  31. Chernik, K., Lansky, J., Galambos, L., "Syllable-based Compression for XML Documents", Dateso 2006, pp 21-31, ISBN 80-248-1025-5.

  32. Jakobsson, M., "Compression of Character Strings by an Adaptive Dictionary", BIT Computer Science and Numerical Mathematics, Vol. 25 No. 4 (1985). doi>10.1007/BF01936138

  33. Cleary, J., Witten, I., "Data Compression Using Adaptive Coding and Partial String Matching", IEEE Transactions on Communications, Vol. COM-32, No. 4, April 1984, pp 396-402.

  34. Seward, J., "bzip2 and libbzip2", bzip2 Manual, March 2000.

  35. Mahoney, M., "Adaptive Weighting of Context Models for Lossless Data Compression", Unknown, 2002.
