Calgary corpus

The Calgary corpus is a collection of text and binary data files commonly used for comparing data compression algorithms. It was created in 1987 by Ian Witten, Tim Bell and John Cleary of the University of Calgary and was widely used in the 1990s. In 1997 it was replaced by the Canterbury corpus,^[1] owing to concerns about how representative the Calgary corpus was,^[2] but it remains available for comparison and is still useful for its original purpose.

In its most commonly used form, the corpus consists of 14 files totaling 3,141,622 bytes as follows.

Size (bytes)	File name	Description
111,261	BIB	ASCII text in UNIX "refer" format – 725 bibliographic references.
768,771	BOOK1	unformatted ASCII text – Thomas Hardy: Far from the Madding Crowd.
610,856	BOOK2	ASCII text in UNIX "troff" format – Witten: Principles of Computer Speech.
102,400	GEO	32-bit numbers in IBM floating point format – seismic data.
377,109	NEWS	ASCII text – USENET batch file on a variety of topics.
21,504	OBJ1	VAX executable program – compilation of PROGP.
246,814	OBJ2	Macintosh executable program – "Knowledge Support System" of B.R. Gaines.
53,161	PAPER1	UNIX "troff" format – Witten, Neal, Cleary: Arithmetic Coding for Data Compression.
82,199	PAPER2	UNIX "troff" format – Witten: Computer (in)security.
513,216	PIC	1728 x 2376 bitmap image (MSB first): text in French and line diagrams.
39,611	PROGC	Source code in C – UNIX compress v4.0.
71,646	PROGL	Source code in Lisp – system software.
49,379	PROGP	Source code in Pascal – program to evaluate PPM compression.
93,695	TRANS	ASCII and control characters – transcript of a terminal session.

There is also a less commonly used 18-file version, which includes 4 additional text files in UNIX "troff" format, PAPER3 through PAPER6. The maintainers of the Canterbury corpus website note that "they don't add to the evaluation".^[3]

Benchmarks

The Calgary corpus was a commonly used benchmark for data compression in the 1990s. Results were most commonly listed in bits per byte (bpb) for each file and then summarized by averaging. More recently, it has been common to just add the compressed sizes of all of the files. This is called a weighted average because it is equivalent to weighting the compression ratios by the original file sizes. The UCLC benchmark^[4] by Johan de Bock uses this method.
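The difference between the two summaries can be sketched in Python. This is a minimal illustration, not code from any of the benchmarks mentioned: the function names and the two-file example sizes are hypothetical, and only the corpus total (3,141,622 bytes) and the gzip total for separate files (1,017,624 bytes) are taken from this article.

```python
def bits_per_byte(original_size, compressed_size):
    """Compressed bits produced per byte of original input."""
    return 8.0 * compressed_size / original_size


def unweighted_average_bpb(sizes):
    """Mean of the per-file bpb values; every file counts equally."""
    return sum(bits_per_byte(o, c) for o, c in sizes) / len(sizes)


def weighted_average_bpb(sizes):
    """bpb of the summed sizes; large files count for more."""
    total_original = sum(o for o, _ in sizes)
    total_compressed = sum(c for _, c in sizes)
    return bits_per_byte(total_original, total_compressed)


# Hypothetical (original, compressed) byte counts for two files.
example = [(1000, 250), (100, 50)]
print(unweighted_average_bpb(example))  # 3.0
print(weighted_average_bpb(example))    # lower, since the large file compresses better

# Totals from this article: 14 separate files compressed with gzip -9.
gzip_bpb = bits_per_byte(3_141_622, 1_017_624)
print(round(gzip_bpb, 2))  # 2.59
```

In the hypothetical example the small, poorly compressed file pulls the unweighted average up, while the weighted figure is dominated by the large file.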

For some data compressors, the corpus compresses to a smaller size when the inputs are first combined into an uncompressed archive (such as a tar file), because of mutual information between the text files. For others, compression is worse because the compressor handles nonuniform statistics poorly. This method was used in a benchmark in the online book Data Compression Explained by Matt Mahoney.^[5]

The table below shows the compressed sizes of the 14-file Calgary corpus using both methods for some popular compression programs. Options, when used, select best compression. For a more complete list, see the above benchmarks.

Compressor	Options	As 14 separate files	As a tar file
Uncompressed		3,141,622	3,152,896
compress		1,272,772	1,319,521
Info-ZIP 2.32	-9	1,020,781	1,023,042
gzip 1.3.5	-9	1,017,624	1,022,810
bzip2 1.0.3	-9	828,347	860,097
7-zip 9.12b		848,687	824,573
bzip3 1.1.8		765,939	779,795
ppmd Jr1	-m256 -o16	740,737	754,243
ppmonstr J		675,485	669,497
ZPAQ v7.15	-method 5	659,709	659,853
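The two measurement methods in the table can be reproduced in outline with Python's standard library. The sketch below is an assumption-laden illustration rather than the procedure used for the table: it uses the bz2 module instead of the command-line programs listed, builds the tar archive in memory, and runs on small made-up inputs (the names sample_files, make_tar, separate_total and tar_total are hypothetical). Archives written by Python's tarfile module may also be padded differently from those written by a tar command.

```python
import bz2
import io
import tarfile


def make_tar(files):
    """Return the bytes of an uncompressed tar archive holding all files."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def separate_total(files, compress=bz2.compress):
    """Sum of compressed sizes when each file is compressed on its own."""
    return sum(len(compress(data)) for data in files.values())


def tar_total(files, compress=bz2.compress):
    """Compressed size of a single tar archive of all the files."""
    return len(compress(make_tar(files)))


# Made-up stand-ins for corpus files; real use would read the 14 files from disk.
sample_files = {
    "paper1": b"arithmetic coding for data compression\n" * 300,
    "paper2": b"computer (in)security and data compression\n" * 300,
    "geo": bytes(range(256)) * 40,
}

print("separate:", separate_total(sample_files))
print("as tar:  ", tar_total(sample_files))
```

Which figure is smaller depends on the compressor and the data, as the table shows: bzip2 does worse on the tar file, while 7-zip and ppmonstr do better.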

Compression challenge

The "Calgary corpus Compression and SHA-1 crack Challenge"^[6] is a contest started by Leonid A. Broukhis on May 21, 1996 to compress the 14-file version of the Calgary corpus. The contest offers a small cash prize, which has varied over time. Currently the prize is US $1 per 111-byte improvement over the previous result.

According to the rules of the contest, an entry must consist of both the compressed data and the decompression program, packed into one of several standard archive formats. The limits on time and memory and the permitted archive formats and decompression languages have been relaxed over time. Currently the program must run within 24 hours on a 2000 MIPS machine under Windows or Linux and use less than 800 MB of memory. An SHA-1 challenge was later added. It allows the decompression program to output files different from the Calgary corpus as long as they hash to the same values as the original files. So far, that part of the challenge has not been met.
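The acceptance condition of the SHA-1 challenge, that each output file hashes to the same value as the original, can be sketched with Python's hashlib. The contest's actual verification procedure is not described here, so this is only an assumed illustration; the function names and the one-entry digest table are hypothetical, and a real check would use the digests of the 14 corpus files.

```python
import hashlib


def sha1_hex(data):
    """Hexadecimal SHA-1 digest of a byte string."""
    return hashlib.sha1(data).hexdigest()


def outputs_match(outputs, expected_digests):
    """True if every expected file is present and has the expected SHA-1 digest."""
    return all(
        name in outputs and sha1_hex(outputs[name]) == digest
        for name, digest in expected_digests.items()
    )


# Hypothetical stand-in for the table of original file digests.
expected_digests = {"demo": sha1_hex(b"abc")}

print(outputs_match({"demo": b"abc"}, expected_digests))  # True
print(outputs_match({"demo": b"abd"}, expected_digests))  # False
```

Under this condition the output bytes are never compared with the originals directly, which is why a different file with the same digest would also be accepted.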

The first entry received was 759,881 bytes in September 1997 by Malcolm Taylor, author of RK and WinRK. The most recent entry was 580,170 bytes by Alexander Ratushnyak on July 2, 2010. The entry consists of a compressed file of 572,465 bytes and a decompression program written in C++ and compressed to 7,700 bytes as a PPMd var. I archive, plus 5 bytes for the compressed file name and size. The history is as follows.

Size (bytes)	Month/year	Author
759,881	09/1997	Malcolm Taylor
692,154	08/2001	Maxim Smirnov
680,558	09/2001	Maxim Smirnov
653,720	11/2002	Serge Voskoboynikov
645,667	01/2004	Matt Mahoney
637,116	04/2004	Alexander Ratushnyak
608,980	12/2004	Alexander Ratushnyak
603,416	04/2005	Przemysław Skibiński
596,314	10/2005	Alexander Ratushnyak
593,620	12/2005	Alexander Ratushnyak
589,863	05/2006	Alexander Ratushnyak
580,170	07/2010	Alexander Ratushnyak
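The figures above can be cross-checked with a few lines of Python. The entry sizes and the three components of the most recent entry are copied from this article; the variable names are illustrative only.

```python
# Entry sizes in bytes, in chronological order, from the history table.
history = [
    759_881, 692_154, 680_558, 653_720, 645_667, 637_116,
    608_980, 603_416, 596_314, 593_620, 589_863, 580_170,
]

# Bytes saved by each entry relative to the one before it.
improvements = [prev - cur for prev, cur in zip(history, history[1:])]

# Most recent entry: compressed data + compressed decompressor + name and size.
latest_total = 572_465 + 7_700 + 5

print(improvements)
print(latest_total == history[-1])  # True
print(history[0] - history[-1])     # 179711 bytes saved since the first entry
```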

References

^[1] Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann. p. 92. ISBN 9781558605701.
^[2] Salomon, David (2007). Data Compression: The Complete Reference (Fourth ed.). Springer. p. 12. ISBN 9781846286032.
^[3] "The Canterbury Corpus". corpus.canterbury.ac.nz.
^[4] "UC Learning Center". 6 January 2023.
^[5] "Data Compression Explained". mattmahoney.net.
^[6] "The Compression/SHA-1 Challenge". mailcom.com.


This page was last edited on 19 June 2023, at 13:48

From Wikipedia, the free encyclopedia
