Lossless compression

From Wikipedia, the free encyclopedia

Lossless compression is a class of data compression algorithms that allows the original data to be perfectly reconstructed from the compressed data. By contrast, lossy compression permits reconstruction only of an approximation of the original data, though usually with improved compression rates (and therefore reduced file sizes).

Lossless data compression is used in many applications. For example, it is used in the ZIP file format and in the GNU tool gzip. It is also often used as a component within lossy data compression technologies (e.g. lossless mid/side joint stereo preprocessing by MP3 encoders and other lossy audio encoders).

Lossless compression is used in cases where it is important that the original and the decompressed data be identical, or where deviations from the original data would be unfavourable. Typical examples are executable programs, text documents, and source code. Some image file formats, like PNG or GIF, use only lossless compression, while others like TIFF and MNG may use either lossless or lossy methods. Lossless audio formats are most often used for archiving or production purposes, while smaller lossy audio files are typically used on portable players and in other cases where storage space is limited or exact replication of the audio is unnecessary.

YouTube Encyclopedic

  • Compression: Crash Course Computer Science #21
  • Lossy and Lossless (RLE) Compression
  • Lossless and Lossy Data Compression
  • Huffman Coding (Lossless Compression Algorithm)
  • Lossy vs. Lossless Compression

Transcription

This episode is brought to you by CuriosityStream. Hi, I'm Carrie Anne, and welcome to Crash Course Computer Science! Last episode we talked about files: bundles of data stored on a computer that are formatted and arranged to encode information, like text, sound or images. We even discussed some basic file formats, like text, wave, and bitmap. While these formats are perfectly fine and still used today, their simplicity also means they're not very efficient. Ideally, we want files to be as small as possible, so we can store lots of them without filling up our hard drives, and also transmit them more quickly. Nothing is more frustrating than waiting for an email attachment to download. Ugh! The answer is compression, which literally squeezes data into a smaller size. To do this, we have to encode data using fewer bits than the original representation. That might sound like magic, but it's actually computer science! INTRO

Let's return to our old friend from last episode, Mr. Pac-Man! This image is 4 pixels by 4 pixels. As we discussed, image data is typically stored as a list of pixel values. To know where rows end, image files have metadata, which defines properties like dimensions. But, to keep it simple today, we're not going to worry about it. Each pixel's color is a combination of three additive primary colors: red, green and blue. We store each of those values in one byte, giving us a range of 0 to 255 for each color. If you mix full intensity red, green and blue - that's 255 for all three values - you get the color white. If you mix full intensity red and green, but no blue (it's 0), you get yellow. We have 16 pixels in our image, and each of those needs 3 bytes of color data. That means this image's data will consume 48 bytes of storage. But, we can compress the data and pack it into a smaller number of bytes than 48!

One way to compress data is to reduce repeated or redundant information. The most straightforward way to do this is called Run-Length Encoding. This takes advantage of the fact that there are often runs of identical values in files. For example, in our Pac-Man image, there are 7 yellow pixels in a row. Instead of encoding redundant data: yellow pixel, yellow pixel, yellow pixel, and so on, we can just say "there's 7 yellow pixels in a row" by inserting an extra byte that specifies the length of the run, like so: And then we can eliminate the redundant data behind it. To ensure that computers don't get confused with which bytes are run lengths and which bytes represent color, we have to be consistent in how we apply this scheme. So, we need to preface all pixels with their run-length. In some cases, this actually adds data, but on the whole, we've dramatically reduced the number of bytes we need to encode this image. We're now at 24 bytes, down from 48. That's 50% smaller! A huge saving! Also note that we haven't lost any data. We can easily expand this back to the original form without any degradation. A compression technique that has this characteristic is called lossless compression, because we don't lose anything. The decompressed data is identical to the original before compression, bit for bit.
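As a concrete illustration of the run-length scheme described above, here is a minimal Python sketch (not the encoding any real image format uses). The function names and the exact 4x4 pixel layout are invented for the example, and single letters stand in for the 3-byte RGB values.

```python
def rle_encode(pixels):
    """Run-length encode a flat list of pixel values: every value is
    prefixed with the length of its run, as in the scheme above."""
    runs = []
    i = 0
    while i < len(pixels):
        j = i
        while j < len(pixels) and pixels[j] == pixels[i]:
            j += 1                       # extend the current run
        runs.append((j - i, pixels[i]))  # (run length, value)
        i = j
    return runs

def rle_decode(runs):
    """Expand (run length, value) pairs back into the original list."""
    out = []
    for length, value in runs:
        out.extend([value] * length)
    return out

# A 4x4 image flattened row by row (W = white, Y = yellow, B = black);
# the layout is only an approximation of the one in the video.
image = ["W", "Y", "Y", "Y",
         "Y", "Y", "Y", "Y",
         "Y", "Y", "B", "Y",
         "W", "Y", "W", "W"]
encoded = rle_encode(image)
# [(1, 'W'), (9, 'Y'), (1, 'B'), (1, 'Y'), (1, 'W'), (1, 'Y'), (2, 'W')]
assert rle_decode(encoded) == image   # lossless: decoding restores the input exactly
```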
Let's take a look at another type of lossless compression, where blocks of data are replaced by more compact representations. This is sort of like "don't forget to be awesome" being replaced by DFTBA. To do this, we need a dictionary that stores the mapping from codes to data. Let's see how this works for our example. We can view our image as not just a string of individual pixels, but as little blocks of data. For simplicity, we're going to use pixel pairs, which are 6 bytes long, but blocks can be any size. In our example, there are only four pairings: white-yellow, black-yellow, yellow-yellow and white-white. Those are the data blocks in our dictionary we want to generate compact codes for. What's interesting is that these blocks occur at different frequencies. There are 4 yellow-yellow pairs, 2 white-yellow pairs, and 1 each of black-yellow and white-white. Because yellow-yellow is the most common block, we want that to be substituted for the most compact representation. On the other hand, black-yellow and white-white can be substituted for something longer because those blocks are infrequent.

One method for generating efficient codes is building a Huffman Tree, invented by David Huffman while he was a student at MIT in the 1950s. His algorithm goes like this. First, you lay out all the possible blocks and their frequencies. At every round, you select the two with the lowest frequencies. Here, that's black-yellow and white-white, each with a frequency of 1. You combine these into a little tree... ...which has a combined frequency of 2, so we record that. And now one step of the algorithm is done. Now we repeat the process. This time we have three things to choose from. Just like before, we select the two with the lowest frequency, put them into a little tree, and record the new total frequency of all the sub-items. OK, we're almost done. This time it's easy to select the two items with the lowest frequency because there are only two things left to pick. We combine these into a tree, and now we're done! Our tree looks like this, and it has a very cool property: it's arranged by frequency, with less common items lower down.

So, now we have a tree, but you may be wondering how this gets us to a dictionary. Well, we use our frequency-sorted tree to generate the codes we need by labeling each branch with a 0 or a 1, like so: With this, we can write out our code dictionary. Yellow-yellow is encoded as just a single 0. White-yellow is encoded as 1 0 ("one zero"). Black-yellow is 1 1 0, and finally white-white is 1 1 1. The really cool thing about these codewords is that there's no way to have conflicting codes, because each path down the tree is unique. This means our codes are prefix-free, that is, no code starts with another complete code.

Now, let's return to our image data and compress it! Our first pixel pair, white-yellow, is substituted for the bits "1 0". The next pair is black-yellow, which is substituted for "1 1 0". Next is yellow-yellow with the incredibly compact substitution of just "0". And this process repeats for the rest of the image: So instead of 48 bytes of image data, this process has encoded it into 14 bits -- NOT BYTES -- BITS!! That's less than 2 bytes of data! But, don't break out the champagne quite yet! This data is meaningless unless we also save our code dictionary. So, we'll need to append it to the front of the image data, like this. Now, including the dictionary, our image data is 30 bytes long. That's still a significant improvement over 48 bytes.

The two approaches we discussed, removing redundancies and using more compact representations, are often combined, and underlie almost all lossless compressed file formats, like GIF, PNG, PDF and ZIP files. Both run-length encoding and dictionary coders are lossless compression techniques.
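The tree-building procedure just described can be written down compactly with Python's heapq module. This is a hedged sketch rather than any production coder: tie-breaking between equal frequencies is arbitrary, so the exact 0/1 labels may differ from the video's, but the code lengths (1, 2, 3 and 3 bits) come out the same.

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code from a {symbol: frequency} mapping by repeatedly
    merging the two lowest-frequency nodes, as described above."""
    heap = [(freq, i, [sym]) for i, (sym, freq) in enumerate(freqs.items())]
    heapq.heapify(heap)
    codes = {sym: "" for sym in freqs}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)   # lowest remaining frequency
        f2, _, syms2 = heapq.heappop(heap)   # second lowest
        for s in syms1:
            codes[s] = "0" + codes[s]        # one branch of the new node gets a 0
        for s in syms2:
            codes[s] = "1" + codes[s]        # the other branch gets a 1
        heapq.heappush(heap, (f1 + f2, tiebreak, syms1 + syms2))
        tiebreak += 1
    return codes

# Pixel-pair frequencies from the walkthrough above
freqs = {"yellow-yellow": 4, "white-yellow": 2, "black-yellow": 1, "white-white": 1}
print(huffman_codes(freqs))
# {'yellow-yellow': '0', 'white-yellow': '10', 'black-yellow': '110', 'white-white': '111'}
```

Prefix-freeness falls out of the construction: each symbol's code is the path from the root to its leaf, and no leaf lies on the path to another leaf.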
No information is lost; when you decompress, you get the original file. That's really important for many types of files. Like, it'd be very odd if I zipped up a Word document to send to you, and when you decompressed it on your computer, the text was different. But, there are other types of files where we can get away with little changes, perhaps by removing unnecessary or less important information, especially information that human perception is not good at detecting. And this trick underlies most lossy compression techniques. These tend to be pretty complicated, so we're going to attack this at a conceptual level.

Let's take sound as an example. Your hearing is not perfect. We can hear some frequencies of sound better than others. And there are some we can't hear at all, like ultrasound. Unless you're a bat. Basically, if we make a recording of music, and there's data in the ultrasonic frequency range, we can discard it, because we know that humans can't hear it. On the other hand, humans are very sensitive to frequencies in the vocal range, like people singing, so it's best to preserve quality there as much as possible. Deep bass is somewhere in between. Humans can hear it, but we're less attuned to it. We mostly sense it. Lossy audio compressors take advantage of this and encode different frequency bands at different precisions. Even if the result is rougher, it's likely that users won't perceive the difference. Or at least it doesn't dramatically affect the experience. And here comes the hate mail from the audiophiles!

You encounter this type of audio compression all the time. It's one of the reasons you sound different on a cellphone versus in person. The audio data is being compressed, allowing more people to take calls at once. As the signal quality or bandwidth gets worse, compression algorithms remove more data, further reducing precision, which is why Skype calls sometimes sound like robots talking. Compared to an uncompressed audio format like WAV, or a losslessly compressed one like FLAC (there we go, got the audiophiles back), compressed audio files, like MP3s, are often 10 times smaller. That's a huge saving! And it's why I've got a killer music collection on my retro iPod. Don't judge. This idea of discarding or reducing precision in a manner that aligns with human perception is called perceptual coding, and it relies on models of human perception, which come from a field of study called psychophysics.

This same idea is the basis of lossy compressed image formats, most famously JPEGs. Like hearing, the human visual system is imperfect. We're really good at detecting sharp contrasts, like the edges of objects, but our perceptual system isn't so hot with subtle color variations. JPEG takes advantage of this by breaking images up into blocks of 8x8 pixels, then throwing away a lot of the high-frequency spatial data. For example, take this photo of our director's dog, Noodle. So cute! Let's look at a patch of 8x8 pixels. Pretty much every pixel is different from its neighbor, making it hard to compress with lossless techniques because there's just a lot going on. Lots of little details. But human perception doesn't register all those details. So, we can discard a lot of that detail, and replace it with a simplified patch like this. This maintains the visual essence, but might only use 10% of the data. We can do this for all the patches in the image and get this result. You can still see it's a dog, but the image is rougher.
So, that's an extreme example, going from a slightly compressed JPEG to a highly compressed one, one-eighth the original file size. Often, you can get away with a quality somewhere in between, and perceptually, it's basically the same as the original. The one on the left is one-third the file size of the one on the right. That's a big saving for essentially the same thing. Can you tell the difference between the two? Probably not, but I should mention that video compression plays a role in that too, since I'm literally being compressed in a video right now.

Videos are really just long sequences of images, so a lot of what I said about them applies here too. But videos can do some extra clever stuff, because between frames, a lot of pixels are going to be the same. Like this whole background behind me! This is called temporal redundancy. We don't need to re-transmit those pixels every frame of the video. We can just copy patches of data forward. When there are small pixel differences, like the readout on this frequency generator behind me, most video formats send data that encodes just the difference between patches, which is more efficient than re-transmitting all the pixels afresh, again taking advantage of inter-frame similarity.

The fanciest video compression formats go one step further. They find patches that are similar between frames, and not only copy them forward, with or without differences, but also can apply simple effects to them, like a shift or rotation. They can also lighten or darken a patch between frames. So, if I move my hand side to side like this, the video compressor will identify the similarity, capture my hand in one or more patches, then just move these patches around between frames. You're actually seeing my hand from the past... kinda freaky, but it uses a lot less data. MPEG-4 videos, a common standard, are often 20 to 200 times smaller than the original, uncompressed file. However, encoding frames as translations and rotations of patches from previous frames can go horribly wrong when you compress too heavily, and there isn't enough space to update pixel data inside of the patches. The video player will forge ahead, applying the right motions, even if the patch data is wrong. And this leads to some hilarious and trippy effects, which I'm sure you've seen.

Overall, it's extremely useful to have compression techniques for all the types of data I discussed today. (I guess our imperfect vision and hearing are "useful," too.) And it's important to know about compression because it allows users to store pictures, music, and videos in efficient ways. Without it, streaming your favorite Carpool Karaoke videos on YouTube would be nearly impossible, due to bandwidth and the economics of transmitting that volume of data for free. And now when your Skype calls sound like they're being taken over by demons, you'll know what's really going on. I'll see you next week.

Hey guys, this week's episode was brought to you by CuriosityStream, which is a streaming service full of documentaries and nonfiction titles from some really great filmmakers, including exclusive originals. Now I normally give computer science recommendations, since this is Crash Course Computer Science and all, and CuriosityStream has a ton of great ones. But you absolutely have to check out "Miniverse", starring everyone's favorite space-station-singing Canadian astronaut, Chris Hadfield, as he takes a road trip across the Solar System scaled down to the size of the United States.
It's basically 50 minutes of Chris and his passengers geeking out about our amazing planetary neighbors, and you don't want to miss it. So get unlimited access today, and your first two months are free if you sign up at curiositystream.com/crashcourse and use the promo code "crashcourse" during the sign-up process.


Lossless compression techniques

Most lossless compression programs do two things in sequence: the first step generates a statistical model for the input data, and the second step uses this model to map input data to bit sequences in such a way that "probable" (e.g. frequently encountered) data will produce shorter output than "improbable" data.

The primary encoding algorithms used to produce bit sequences are Huffman coding (also used by DEFLATE) and arithmetic coding. Arithmetic coding achieves compression rates close to the best possible for a particular statistical model, which is given by the information entropy, whereas Huffman compression is simpler and faster but produces poor results for models that deal with symbol probabilities close to 1.

There are two primary ways of constructing statistical models: in a static model, the data is analyzed and a model is constructed, then this model is stored with the compressed data. This approach is simple and modular, but has the disadvantage that the model itself can be expensive to store, and also that it forces using a single model for all data being compressed, and so performs poorly on files that contain heterogeneous data. Adaptive models dynamically update the model as the data is compressed. Both the encoder and decoder begin with a trivial model, yielding poor compression of initial data, but as they learn more about the data, performance improves. Most popular types of compression used in practice now use adaptive coders.
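To make the adaptive idea concrete, here is a minimal sketch invented for this article (it is not taken from any particular codec): an order-0 byte model that encoder and decoder both update identically after every symbol, so no model ever has to be stored with the data. The ideal code length -log2 p(symbol) stands in for what an actual entropy coder such as an arithmetic coder would emit.

```python
import math

class AdaptiveByteModel:
    """Order-0 adaptive model: start with every byte counted once (a trivial
    uniform model), then bump a symbol's count each time it is seen.  The
    encoder and decoder run this same code, so their models never diverge."""
    def __init__(self):
        self.counts = [1] * 256
        self.total = 256

    def probability(self, symbol):
        return self.counts[symbol] / self.total

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1

def ideal_compressed_bits(data: bytes) -> float:
    """Sum of -log2 p(symbol) under the adaptive model: the length an ideal
    entropy coder would approach.  Early symbols cost nearly 8 bits; as the
    model learns the data's statistics, frequent symbols become cheaper."""
    model = AdaptiveByteModel()
    bits = 0.0
    for b in data:
        bits += -math.log2(model.probability(b))
        model.update(b)          # learn from the symbol just coded
    return bits

text = b"abracadabra abracadabra abracadabra"
print(f"{len(text) * 8} raw bits -> about {ideal_compressed_bits(text):.0f} ideal bits")
```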

Lossless compression methods may be categorized according to the type of data they are designed to compress. While, in principle, any general-purpose lossless compression algorithm (general-purpose meaning that it can accept any bitstring) can be used on any type of data, many are unable to achieve significant compression on data that are not of the form they were designed to compress. Many of the lossless compression techniques used for text also work reasonably well for indexed images.

Multimedia

These techniques take advantage of the specific characteristics of images such as the common phenomenon of contiguous 2-D areas of similar tones. Every pixel but the first is replaced by the difference to its left neighbor. This leads to small values having a much higher probability than large values. This is often also applied to sound files, and can compress files that contain mostly low frequencies and low volumes. For images, this step can be repeated by taking the difference to the top pixel, and then in videos, the difference to the pixel in the next frame can be taken.
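A sketch of that left-neighbor prediction step in Python (illustrative only; real formats such as PNG choose among several predictors per row, and the function names here are made up):

```python
def left_differences(row):
    """Replace every pixel but the first by the difference to its left
    neighbour.  On smooth image rows most differences are near zero, which a
    subsequent entropy coder can exploit."""
    return [row[0]] + [row[i] - row[i - 1] for i in range(1, len(row))]

def undo_left_differences(diffs):
    """Invert the transform exactly (it is lossless on integers)."""
    row = [diffs[0]]
    for d in diffs[1:]:
        row.append(row[-1] + d)
    return row

row = [100, 101, 103, 103, 104, 106, 200, 201]   # a gentle gradient with one hard edge
diffs = left_differences(row)                     # [100, 1, 2, 0, 1, 2, 94, 1]
assert undo_left_differences(diffs) == row
```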

A hierarchical version of this technique takes neighboring pairs of data points, stores their difference and sum, and on a higher level with lower resolution continues with the sums. This is called the discrete wavelet transform. JPEG 2000 additionally uses data points from other pairs and multiplication factors to mix them into the difference. These factors must be integers, so that the result is an integer under all circumstances. So the values are increased, increasing file size, but hopefully the distribution of values is more peaked.[citation needed]
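For illustration, one level of the pairwise sum/difference idea might look like the following sketch. This is the simple Haar-style case, not the longer integer filters JPEG 2000 actually uses; the function names are invented.

```python
def haar_level(samples):
    """One level of a Haar-style transform on an even-length list: store the
    sum and difference of each neighbouring pair.  The sums form the
    lower-resolution signal that the next level would process.  Note the sums
    can exceed the original value range, matching the remark above that the
    values grow while their distribution becomes more peaked."""
    sums = [samples[i] + samples[i + 1] for i in range(0, len(samples), 2)]
    diffs = [samples[i] - samples[i + 1] for i in range(0, len(samples), 2)]
    return sums, diffs

def haar_level_inverse(sums, diffs):
    """Recover the original pairs exactly: a = (s + d) / 2, b = (s - d) / 2,
    and s + d is always even, so integer division is exact."""
    out = []
    for s, d in zip(sums, diffs):
        out.extend([(s + d) // 2, (s - d) // 2])
    return out

samples = [10, 12, 13, 13, 40, 41, 42, 40]
sums, diffs = haar_level(samples)     # sums: [22, 26, 81, 82], diffs: [-2, 0, -1, 2]
assert haar_level_inverse(sums, diffs) == samples
```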

The adaptive encoding uses the probabilities from the previous sample in sound encoding, from the left and upper pixel in image encoding, and additionally from the previous frame in video encoding. In the wavelet transformation, the probabilities are also passed through the hierarchy.

Historical legal issues

Many of these methods are implemented in open-source and proprietary tools, particularly LZW and its variants. Some algorithms are patented in the United States and other countries and their legal usage requires licensing by the patent holder. Because of patents on certain kinds of LZW compression, and in particular licensing practices by patent holder Unisys that many developers considered abusive, some open source proponents encouraged people to avoid using the Graphics Interchange Format (GIF) for compressing still image files in favor of Portable Network Graphics (PNG), which combines the LZ77-based deflate algorithm with a selection of domain-specific prediction filters. However, the patents on LZW expired on June 20, 2003.[1]

Many of the lossless compression techniques used for text also work reasonably well for indexed images, but there are other techniques that do not work for typical text that are useful for some images (particularly simple bitmaps), and other techniques that take advantage of the specific characteristics of images (such as the common phenomenon of contiguous 2-D areas of similar tones, and the fact that color images usually have a preponderance of a limited range of colors out of those representable in the color space).

As mentioned previously, lossless sound compression is a somewhat specialized area. Lossless sound compression algorithms can take advantage of the repeating patterns shown by the wave-like nature of the data – essentially using autoregressive models to predict the "next" value and encoding the (hopefully small) difference between the expected value and the actual data. If the difference between the predicted and the actual data (called the error) tends to be small, then certain difference values (like 0, +1, −1 etc. on sample values) become very frequent, which can be exploited by encoding them in few output bits.
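A minimal sketch of that predict-then-encode-the-error idea, using a fixed second-order linear predictor (real lossless audio coders such as FLAC choose among several predictors or fit coefficients adaptively; the function names here are made up):

```python
def prediction_residuals(samples):
    """Predict each sample by linear extrapolation of the previous two
    (x[n] ~ 2*x[n-1] - x[n-2]) and keep only the prediction error.  On a
    slowly varying waveform the residuals cluster around 0, +1 and -1,
    which an entropy coder can store in very few bits."""
    residuals = list(samples[:2])                 # first two samples stored verbatim
    for n in range(2, len(samples)):
        predicted = 2 * samples[n - 1] - samples[n - 2]
        residuals.append(samples[n] - predicted)
    return residuals

def reconstruct(residuals):
    """Rebuild the samples exactly by re-running the same predictor."""
    samples = list(residuals[:2])
    for n in range(2, len(residuals)):
        predicted = 2 * samples[n - 1] - samples[n - 2]
        samples.append(predicted + residuals[n])
    return samples

audio = [0, 10, 21, 31, 40, 50, 61, 70]          # a slowly varying waveform
res = prediction_residuals(audio)                 # [0, 10, 1, -1, -1, 1, 1, -2]
assert reconstruct(res) == audio
```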

It is sometimes beneficial to compress only the differences between two versions of a file (or, in video compression, of successive images within a sequence). This is called delta encoding (from the Greek letter Δ, which in mathematics, denotes a difference), but the term is typically only used if both versions are meaningful outside compression and decompression. For example, while the process of compressing the error in the above-mentioned lossless audio compression scheme could be described as delta encoding from the approximated sound wave to the original sound wave, the approximated version of the sound wave is not meaningful in any other context.
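The basic principle of delta encoding can be sketched as a byte-wise difference between two equal-length versions of a file; practical delta coders (for example, those based on VCDIFF) also handle insertions and deletions, which this toy example does not, and the function names are invented:

```python
def delta_encode(old: bytes, new: bytes) -> bytes:
    """Byte-wise difference between two versions of the same length."""
    return bytes((n - o) % 256 for o, n in zip(old, new))

def delta_decode(old: bytes, delta: bytes) -> bytes:
    """Apply the stored differences to the old version to recover the new one."""
    return bytes((o + d) % 256 for o, d in zip(old, delta))

v1 = b"lossless compression, version 1"
v2 = b"lossless compression, version 2"
delta = delta_encode(v1, v2)      # almost entirely zero bytes, so it compresses very well
assert delta_decode(v1, delta) == v2
```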

Lossless compression methods

By operation of the pigeonhole principle, no lossless compression algorithm can efficiently compress all possible data. For this reason, many different algorithms exist that are designed either with a specific type of input data in mind or with specific assumptions about what kinds of redundancy the uncompressed data are likely to contain.

Some of the most common lossless compression algorithms are listed below.

General purpose

Audio

Graphics

  • PNG – Portable Network Graphics
  • TIFF – Tagged Image File Format
  • WebP – (high-density lossless or lossy compression of RGB and RGBA images)
  • BPG – Better Portable Graphics (lossless/lossy compression based on HEVC)
  • FLIF – Free Lossless Image Format
  • JPEG-LS – (lossless/near-lossless compression standard)
  • TGA – Truevision TGA
  • PCX – PiCture eXchange
  • JPEG 2000 – (includes a lossless compression method)
  • JPEG XR – formerly WMPhoto and HD Photo, includes a lossless compression method
  • ILBM – (lossless RLE compression of Amiga IFF images)
  • JBIG2 – (lossless or lossy compression of B&W images)
  • PGF – Progressive Graphics File (lossless or lossy compression)

3D Graphics

  • OpenCTM – Lossless compression of 3D triangle meshes

Video

See this list of lossless video codecs.

Cryptography

Cryptosystems often compress data (the "plaintext") before encryption for added security. When properly implemented, compression greatly increases the unicity distance by removing patterns that might facilitate cryptanalysis. However, many ordinary lossless compression algorithms produce headers, wrappers, tables, or other predictable output that might instead make cryptanalysis easier. Thus, cryptosystems must utilize compression algorithms whose output does not contain these predictable patterns.

Genetics and Genomics

Genetics compression algorithms (not to be confused with genetic algorithms) are the latest generation of lossless algorithms that compress data (typically sequences of nucleotides) using both conventional compression algorithms and specific algorithms adapted to genetic data. In 2012, a team of scientists from Johns Hopkins University published the first genetic compression algorithm that does not rely on external genetic databases for compression. HAPZIPPER was tailored for HapMap data and achieves over 20-fold compression (95% reduction in file size), providing 2- to 4-fold better compression, much faster, than the leading general-purpose compression utilities.[2]

Genomic sequence compression algorithms, also known as DNA sequence compressors, exploit the fact that DNA sequences have characteristic properties, such as inverted repeats. The most successful compressors are XM and GeCo.[3] For eukaryotes, XM is slightly better in compression ratio, though for sequences larger than 100 MB its computational requirements are impractical.

Executables

Self-extracting executables contain a compressed application and a decompressor. When executed, the decompressor transparently decompresses and runs the original application. This is especially often used in demo coding, where competitions are held for demos with strict size limits, as small as 1k. This type of compression is not strictly limited to binary executables, but can also be applied to scripts, such as JavaScript.

Lossless compression benchmarks

Lossless compression algorithms and their implementations are routinely tested in head-to-head benchmarks. There are a number of better-known compression benchmarks. Some benchmarks cover only the data compression ratio, so winners in these benchmarks may be unsuitable for everyday use due to the slow speed of the top performers. Another drawback of some benchmarks is that their data files are known, so some program writers may optimize their programs for best performance on a particular data set. The winners on these benchmarks often come from the class of context-mixing compression software.

The benchmarks listed in the 5th edition of the Handbook of Data Compression (Springer, 2009) are:[4]

  • The Maximum Compression benchmark, started in 2003 and updated until November 2011, includes over 150 programs. Maintained by Werner Bergmans, it tests on a variety of data sets, including text, images, and executable code. Two types of results are reported: single file compression (SFC) and multiple file compression (MFC). Not surprisingly, context mixing programs often win here; programs from the PAQ series and WinRK are often at the top. The site also has a list of pointers to other benchmarks.[5]
  • UCLC (the ultimate command-line compressors) benchmark by Johan de Bock is another actively maintained benchmark including over 100 programs. The winners in most tests usually are PAQ programs and WinRK, with the exception of lossless audio encoding and grayscale image compression where some specialized algorithms shine.
  • Squeeze Chart by Stephan Busch is another frequently updated site.
  • The EmilCont benchmarks by Berto Destasio are somewhat outdated, having been most recently updated in 2004. A distinctive feature is that the data set is not public, to prevent optimizations targeting it specifically. Nevertheless, the best ratio winners are again the PAQ family, SLIM and WinRK.
  • The Archive Comparison Test (ACT) by Jeff Gilchrist included 162 DOS/Windows and 8 Macintosh lossless compression programs, but it was last updated in 2002.
  • The Art Of Lossless Data Compression by Alexander Ratushnyak provides a similar test performed in 2003.

Matt Mahoney, in his February 2010 edition of the free booklet Data Compression Explained, additionally lists the following:[6]

  • The Calgary Corpus dating back to 1987 is no longer widely used due to its small size. Matt Mahoney currently maintains the Calgary Compression Challenge, created and maintained from May 21, 1996 through May 21, 2016 by Leonid A. Broukhis.
  • The Large Text Compression Benchmark[7] and the similar Hutter Prize both use a trimmed Wikipedia XML UTF-8 data set.
  • The Generic Compression Benchmark[8], maintained by Mahoney himself, tests compression on random data.
  • Sami Runsas (author of NanoZip) maintains Compression Ratings, a benchmark similar to the Maximum Compression multiple file test, but with minimum speed requirements. It also offers a calculator that allows the user to weight the importance of speed and compression ratio. The top programs here are fairly different due to the speed requirement. In January 2010, the top programs were NanoZip followed by FreeArc, CCM, flashzip, and 7-Zip.
  • The Monster of Compression benchmark by N. F. Antonio tests compression on 1 GB of public data with a 40-minute time limit. As of December 20, 2009, the top ranked archiver is NanoZip 0.07a and the top ranked single file compressor is ccmx 1.30c, both context mixing.

The Compression Ratings website published a chart summary of the "frontier" in compression ratio and time.[9]

The Compression Analysis Tool[10] is a Windows application that enables end users to benchmark the performance characteristics of streaming implementations of LZF4, DEFLATE, ZLIB, GZIP, BZIP2 and LZMA using their own data. It produces measurements and charts with which users can compare the compression speed, decompression speed and compression ratio of the different compression methods and examine how the compression level, buffer size and flushing operations affect the results.

The Squash Compression Benchmark uses the Squash library to compare more than 25 compression libraries in many different configurations using numerous different datasets on several different machines, and provides a web interface to help explore the results. There are currently over 50,000 results to compare.

Limitations

Lossless data compression algorithms cannot guarantee compression for all input data sets. In other words, for any lossless data compression algorithm, there will be an input data set that does not get smaller when processed by the algorithm, and for any lossless data compression algorithm that makes at least one file smaller, there will be at least one file that it makes larger. This is easily proven with elementary mathematics using a counting argument, as follows (a short numeric check of the key counting step appears after the list):

  • Assume that each file is represented as a string of bits of some arbitrary length.
  • Suppose that there is a compression algorithm that transforms every file into an output file that is no longer than the original file, and that at least one file will be compressed into an output file that is shorter than the original file.
  • Let M be the least number such that there is a file F with length M bits that compresses to something shorter. Let N be the length (in bits) of the compressed version of F.
  • Because N < M, every file of length N keeps its size during compression. There are 2^N such files. Together with F, this makes 2^N + 1 files that all compress into one of the 2^N files of length N.
  • But 2^N is smaller than 2^N + 1, so by the pigeonhole principle there must be some file of length N that is simultaneously the output of the compression function on two different inputs. That file cannot be decompressed reliably (which of the two originals should that yield?), which contradicts the assumption that the algorithm was lossless.
  • We must therefore conclude that our original hypothesis (that the compression function makes no file longer) is necessarily untrue.
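As promised above, here is a tiny numeric check of the key counting step (it only tabulates the two sides of the pigeonhole comparison; it is not a proof script):

```python
# For every length N there are 2**N possible files of exactly N bits, but only
# 2**N - 1 possible outputs that are strictly shorter, so at least two
# length-N inputs would have to share an output.
for N in range(1, 9):
    inputs_of_length_n = 2 ** N
    strictly_shorter_outputs = sum(2 ** k for k in range(N))   # lengths 0 .. N-1
    assert strictly_shorter_outputs == inputs_of_length_n - 1
    print(f"N={N}: {inputs_of_length_n} inputs of length N, "
          f"only {strictly_shorter_outputs} shorter outputs")
```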

Any lossless compression algorithm that makes some files shorter must necessarily make some files longer, but it is not necessary that those files become very much longer. Most practical compression algorithms provide an "escape" facility that can turn off the normal coding for files that would become longer by being encoded. In theory, only a single additional bit is required to tell the decoder that the normal coding has been turned off for the entire input; however, most encoding algorithms use at least one full byte (and typically more than one) for this purpose. For example, DEFLATE compressed files never need to grow by more than 5 bytes per 65,535 bytes of input.
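That bounded growth is easy to observe with any DEFLATE implementation, for example Python's zlib module. The exact overhead below depends on the zlib version, and it includes the zlib container's own 2-byte header and 4-byte checksum on top of DEFLATE's stored-block framing:

```python
import os
import zlib

random_data = os.urandom(65536)        # high-entropy input: effectively incompressible
compressed = zlib.compress(random_data, 9)

overhead = len(compressed) - len(random_data)
print(f"input {len(random_data)} bytes -> output {len(compressed)} bytes "
      f"(overhead {overhead} bytes)")
# The output is slightly larger, not smaller: DEFLATE falls back to "stored"
# blocks, paying a few bytes of framing per 65,535-byte block, and the zlib
# container adds its own header and checksum.
```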

In fact, if we consider files of length N, if all files were equally probable, then for any lossless compression that reduces the size of some file, the expected length of a compressed file (averaged over all possible files of length N) must necessarily be greater than N.[citation needed] So if we know nothing about the properties of the data we are compressing, we might as well not compress it at all. A lossless compression algorithm is useful only when we are more likely to compress certain types of files than others; then the algorithm could be designed to compress those types of data better.

Thus, the main lesson from the argument is not that one risks big losses, but merely that one cannot always win. To choose an algorithm always means implicitly to select a subset of all files that will become usefully shorter. This is the theoretical reason why we need to have different compression algorithms for different kinds of files: there cannot be any algorithm that is good for all kinds of data.

The "trick" that allows lossless compression algorithms, used on the type of data they were designed for, to consistently compress such files to a shorter form is that the files the algorithms are designed to act on all have some form of easily modeled redundancy that the algorithm is designed to remove, and thus belong to the subset of files that that algorithm can make shorter, whereas other files would not get compressed or even get bigger. Algorithms are generally quite specifically tuned to a particular type of file: for example, lossless audio compression programs do not work well on text files, and vice versa.

In particular, files of random data cannot be consistently compressed by any conceivable lossless data compression algorithm: indeed, this result is used to define the concept of randomness in algorithmic complexity theory.

It's provably impossible to create an algorithm that can losslessly compress any data.[11] While there have been many claims through the years of companies achieving "perfect compression" where an arbitrary number N of random bits can always be compressed to N − 1 bits, these kinds of claims can be safely discarded without even looking at any further details regarding the purported compression scheme. Such an algorithm contradicts fundamental laws of mathematics because, if it existed, it could be applied repeatedly to losslessly reduce any file to length 0. Allegedly "perfect" compression algorithms are often derisively referred to as "magic" compression algorithms for this reason.

On the other hand, it has also been proven[citation needed] that there is no algorithm to determine whether a file is incompressible in the sense of Kolmogorov complexity. Hence it's possible that any particular file, even if it appears random, may be significantly compressed, even including the size of the decompressor. An example is the digits of the mathematical constant pi, which appear random but can be generated by a very small program. However, even though it cannot be determined whether a particular file is incompressible, a simple theorem about incompressible strings shows that over 99% of files of any given length cannot be compressed by more than one byte (including the size of the decompressor): there are fewer than 2^(n−8) bit strings that are more than one byte shorter than n bits, so at most that many of the 2^n files of n bits can have such a shorter description, a fraction below 0.4%.

Mathematical background

Abstractly, a compression algorithm can be viewed as a function on sequences (normally of octets). Compression is successful if the resulting sequence is shorter than the original sequence (and the instructions for the decompression map). For a compression algorithm to be lossless, the compression map must form an injection from "plain" to "compressed" bit sequences.

The pigeonhole principle prohibits a bijection between the collection of sequences of length N and any subset of the collection of sequences of length N−1. Therefore, it is not possible to produce a lossless algorithm that reduces the size of every possible input sequence.
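The same counting can be written out symbolically (this restates the argument from the Limitations section in the notation of this one, taking the alphabet to be {0,1}):

```latex
% A lossless compressor that shrinks every N-bit input would be an injection
% from the 2^N sequences of length N into the strictly shorter sequences,
% of which there are only 2^N - 1; the pigeonhole principle rules this out.
\[
  \bigl|\{0,1\}^{N}\bigr| \;=\; 2^{N}
  \;>\;
  2^{N} - 1
  \;=\; \sum_{k=0}^{N-1} 2^{k}
  \;=\; \Bigl|\, \bigcup_{k=0}^{N-1} \{0,1\}^{k} \Bigr|
\]
```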

Psychological background

Most everyday files are relatively 'sparse' in an information entropy sense, and thus, most lossless algorithms a layperson is likely to apply on regular files compress them relatively well. This may, through misapplication of intuition, lead some individuals to conclude that a well-designed compression algorithm can compress any input, thus, constituting a magic compression algorithm.[citation needed]

Points of application in real compression theory

Real compression algorithm designers accept that streams of high information entropy cannot be compressed, and accordingly, include facilities for detecting and handling this condition. An obvious way of detection is applying a raw compression algorithm and testing if its output is smaller than its input. Sometimes, detection is made by heuristics; for example, a compression application may consider files whose names end in ".zip", ".arj" or ".lha" uncompressible without any more sophisticated detection. A common way of handling this situation is quoting input, or uncompressible parts of the input in the output, minimizing the compression overhead. For example, the zip data format specifies the 'compression method' of 'Stored' for input files that have been copied into the archive verbatim.[12]
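As an illustration of that escape mechanism, here is a sketch in the spirit of zip's "Stored" method, not the actual zip container format; the one-byte flag and the function names are invented for the example:

```python
import os
import zlib

STORED, DEFLATED = b"\x00", b"\x01"

def pack(data: bytes) -> bytes:
    """Try DEFLATE; if that would not shrink the data, store it verbatim
    behind a one-byte flag, so the worst-case overhead is bounded."""
    compressed = zlib.compress(data, 9)
    if len(compressed) < len(data):
        return DEFLATED + compressed
    return STORED + data

def unpack(blob: bytes) -> bytes:
    method, payload = blob[:1], blob[1:]
    return zlib.decompress(payload) if method == DEFLATED else payload

for sample in (b"abc" * 2000, os.urandom(4000)):   # compressible vs. high-entropy input
    blob = pack(sample)
    assert unpack(blob) == sample
    kind = "stored" if blob[:1] == STORED else "deflated"
    print(f"{len(sample)} -> {len(blob)} bytes ({kind})")
```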

The Million Random Digit Challenge

Mark Nelson, in response to claims of magic compression algorithms appearing in comp.compression, has constructed a 415,241 byte binary file of highly entropic content, and issued a public challenge of $100 to anyone to write a program that, together with its input, would be smaller than his provided binary data yet be able to reconstitute it without error.[13]

The FAQ for the comp.compression newsgroup contains a challenge by Mike Goldman offering $5,000 for a program that can compress random data. Patrick Craig took up the challenge, but rather than compressing the data, he split it up into separate files all of which ended in the number 5, which was not stored as part of the file. Omitting this character allowed the resulting files (plus, in accordance with the rules, the size of the program that reassembled them) to be smaller than the original file. However, no actual compression took place, and the information stored in the names of the files was necessary to reassemble them in the correct order in the original file, and this information was not taken into account in the file size comparison. The files themselves are thus not sufficient to reconstitute the original file; the file names are also necessary. Patrick Craig agreed that no meaningful compression had taken place, but argued that the wording of the challenge did not actually require this. A full history of the event, including discussion on whether or not the challenge was technically met, is on Patrick Craig's web site.[14]


References

  1. ^ Unisys | LZW Patent and Software Information Archived 2009-06-02 at the Wayback Machine.
  2. ^ Chanda, Elhaik, and Bader (2012). "HapZipper: sharing HapMap populations just got easier". Nucleic Acids Res. 40 (20): 1–7. doi:10.1093/nar/gks709. PMC 3488212. PMID 22844100.
  3. ^ Pratas, D.; Pinho, A. J.; Ferreira, P. J. S. G. (2016). "Efficient compression of genomic sequences". Data Compression Conference. Snowbird, Utah.
  4. ^ David Salomon, Giovanni Motta (with contributions by David Bryant), Handbook of Data Compression, 5th edition, Springer, 2009, ISBN 1-84882-902-7, pp. 16–18.
  5. ^ "Compression Benchmarks (links and spreadsheets)"". www.maximumcompression.com.
  6. ^ Matt Mahoney (2010). "Data Compression Explained" (PDF). pp. 3–5.
  7. ^ "Large Text Compression Benchmark". mattmahoney.net.
  8. ^ "Generic Compression Benchmark". mattmahoney.net.
  9. ^ Visualization of compression ratio and time
  10. ^ Noemax Technologies Ltd. "Compression Analysis Tool – Noemax". www.noemax.com.
  11. ^ comp.compression FAQ list entry #9: Compression of random data (WEB, Gilbert and others)
  12. ^ ZIP file format specification by PKWARE, Inc., chapter V, section J
  13. ^ Nelson, Mark (2006-06-20). "The Million Random Digit Challenge Revisited".
  14. ^ Craig, Patrick. "The $5000 Compression Challenge". Retrieved 2009-06-08.

