gzip

Basic data

Developer: Jean-Loup Gailly and Mark Adler
Initial release: 1992
Current version: 1.10 (December 30, 2018)
Operating system: cross-platform
Programming language: C
Category: Data compression
License: GPL (free software)
Website: www.gnu.org/software/gzip

gzip is a free compression program that, like the gzip file format of the same name, is available for practically all operating systems. It is distributed under the terms of the GPL, including its source code.

gzip is generally understood as short for "GNU zip", where "zip" alludes to the English word for a zipper. OpenBSD provides a BSD-licensed reimplementation under the names gzip(1), gunzip(1), and gzcat(1), which is fully compatible with the GNU tools.

gzip offers a degree of compression that is satisfactory for text and does not contain any patented algorithms (Deflate is used). It was originally developed by Jean-Loup Gailly to replace the compress program used under Unix. Mark Adler wrote the decompression program gunzip.

Technology

gzip is based on the Deflate algorithm, which is a combination of LZ77 and Huffman coding. Deflate was developed in response to the patents covering LZW and other compression algorithms. The ZIP file format also mainly uses Deflate for compression, but it is otherwise a different format and must not be confused with gzip.

The window size of gzip is 32 KiB. If a sequence of bytes does not recur within the preceding 32 KiB, it cannot be replaced by a back-reference and is stored as literal bytes in the .gz file. This window size is small compared with more modern compression programs (e.g. bzip2 with a block size of 100 to 900 KiB, or rzip as an extreme case with a window size of 900 MiB). Nevertheless, gzip is still one of the fastest compression programs and can be used in many ways, for example in a pipeline: the output ("standard out") of one program can serve as the input ("standard in") of gzip, and vice versa.
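
The effect of the 32 KiB window can be observed with a small experiment. The following is a minimal sketch in Python, assuming the standard gzip module; the helper name repeated_block_ratio and the block sizes are chosen only for illustration. It compresses a pseudo-random block followed by an exact copy of itself, once with the copy inside the window and once outside it.

```python
import gzip
import random


def repeated_block_ratio(block_size: int, seed: int = 0) -> float:
    """Compress one pseudo-random block followed by an exact copy of itself.

    Returns the compressed size divided by the uncompressed size.
    """
    rng = random.Random(seed)
    block = bytes(rng.getrandbits(8) for _ in range(block_size))
    data = block + block  # the copy starts block_size bytes after the original
    return len(gzip.compress(data)) / len(data)


near = repeated_block_ratio(20 * 1024)  # copy lies within the 32 KiB window
far = repeated_block_ratio(40 * 1024)   # copy lies beyond the 32 KiB window

print(f"20 KiB block repeated: ratio {near:.2f}")
print(f"40 KiB block repeated: ratio {far:.2f}")
```

In the first case the second copy can be expressed as back-references, so roughly half of the data disappears. In the second case the compressor no longer "sees" the first copy and the random data stays essentially incompressible.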

To simplify the development of software that uses data compression, the zlib library was written. It supports the gzip file format and deflate compression. The library is widely used because it is small, efficient, and versatile.
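
As an illustration of how software can use the library, the following sketch relies on Python's zlib binding (an assumption of this example, not something prescribed by gzip itself). The wbits value selects the container: 31 requests a gzip wrapper, while -15 requests a raw Deflate stream without header or trailer.

```python
import gzip
import zlib

payload = b"gzip and zlib share the Deflate algorithm. " * 100

# wbits=31 (16 + 15) asks zlib for a gzip wrapper with a 32 KiB window.
gz_compressor = zlib.compressobj(9, zlib.DEFLATED, 31)
gz_stream = gz_compressor.compress(payload) + gz_compressor.flush()

# wbits=-15 asks for a raw Deflate stream without any header or trailer.
raw_compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
raw_stream = raw_compressor.compress(payload) + raw_compressor.flush()

print(gz_stream[:3].hex())                 # identification bytes and method byte
print(len(gz_stream) - len(raw_stream))    # bytes added by the gzip wrapper

# The gzip module can read what zlib wrote, and zlib can read the raw stream.
restored_from_gzip = gzip.decompress(gz_stream)
restored_from_raw = zlib.decompress(raw_stream, -15)
```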

Structure

Source: RFC 1952 GZIP File Format Specification version 4.3.

Identification [0-1]

The first two bytes form the identification code of the format and are always the same in gzip files: 0x1f and 0x8b (hexadecimal), or 31 and 139 (decimal). They are used to verify that the file is in gzip format and to detect obvious damage early. If these bytes are wrong or missing, a decompressor will either report an error or produce incorrect output files.
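
A check of the identification bytes can be sketched as follows. The function name looks_like_gzip is hypothetical; the check only tests the first two bytes and says nothing about whether the rest of the file is intact.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # 31 and 139 in decimal


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the data begins with the two gzip identification bytes."""
    return data[:2] == GZIP_MAGIC


sample = gzip.compress(b"hello")
print(looks_like_gzip(sample))         # data produced by a gzip writer
print(looks_like_gzip(b"PK\x03\x04"))  # start of a ZIP archive instead
```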

Compression method [2]

The byte at index 2 (counting from 0) indicates which compression method was used to store the file.

Byte value  Meaning
0           Copy of the file (no compression)
1           Compression
2           Packing
3           LZH format
4 to 7      Reserved
8           Deflate, the method customarily used by gzip

RFC 1952 itself defines only the value 8 and reserves the values 0 to 7.

Special information ("flags") [3]

The byte used for special information is located at index 3. Its values again have a fixed meaning.

The byte must be interpreted bit by bit: each bit is an independent flag, and several flags can be set at once. For example, the value 15 (binary 00001111, hexadecimal 0x0F) means that the file is ASCII text, has a CRC-16 value, has extra information, and has its original name stored.

Bit value (position)   Meaning
1 (bit position 1)     The file contains ASCII text
2 (bit position 2)     CRC-16 of the header present (check value to detect whether the header is damaged or was transferred incorrectly)
4 (bit position 3)     Additional information (extra field) present
8 (bit position 4)     Original file name present
16 (bit position 5)    Comment present
32 (bit position 6)    Reserved (must be 0)
64 (bit position 7)    Reserved (must be 0)
128 (bit position 8)   Reserved (must be 0)
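
Decoding the flag byte amounts to testing each bit with a mask. A minimal sketch, using the flag names from RFC 1952 (FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT); the function name decode_flags is an assumption made for this example.

```python
from typing import List

# (bit value, name in RFC 1952), in ascending bit order
FLAG_BITS = [
    (0x01, "FTEXT"),     # file is probably ASCII text
    (0x02, "FHCRC"),     # CRC-16 of the header present
    (0x04, "FEXTRA"),    # extra field present
    (0x08, "FNAME"),     # original file name present
    (0x10, "FCOMMENT"),  # comment present
]
RESERVED_MASK = 0xE0     # bit values 32, 64 and 128 must be 0


def decode_flags(flg: int) -> List[str]:
    """Return the names of all flags set in the byte at index 3."""
    if flg & RESERVED_MASK:
        raise ValueError(f"reserved flag bits set: {flg:#04x}")
    return [name for bit, name in FLAG_BITS if flg & bit]


print(decode_flags(0x18))  # original name and comment present
```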

Last modification (time) [4-7]

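These 4 bytes hold the time of the last modification of the original file as Unix time, stored with the least significant byte first. A value of 0 means that no time stamp is available.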
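
Reading the field is a matter of unpacking a little-endian 32-bit integer at offset 4. The sketch below assumes Python's gzip.GzipFile, whose mtime parameter sets this field; the time stamp 1546128000 is an arbitrary example value and read_mtime is a hypothetical helper.

```python
import datetime
import gzip
import io
import struct


def read_mtime(gz_bytes: bytes) -> int:
    """Return the time stamp stored in bytes 4 to 7 as an integer Unix time."""
    (mtime,) = struct.unpack_from("<I", gz_bytes, 4)  # little-endian uint32
    return mtime


# Write a gzip member with a known time stamp (example value).
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb", mtime=1546128000) as f:
    f.write(b"example")

stamp = read_mtime(buf.getvalue())
print(stamp, datetime.datetime.fromtimestamp(stamp, datetime.timezone.utc))
```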

Additional special information ("Extra Flags") [8]

This byte is interpreted in the same way as the "Special information" byte at index 3, but its meaning depends on the compression method used.

Example for the compression method "Deflate":

Byte value           Meaning
2 (bit position 2)   Compressor used maximum compression and the slowest algorithm
4 (bit position 3)   Compressor used the fastest algorithm

Operating system [9]

This byte indicates the operating system (or file system type) on which the file was compressed.

Byte value  Meaning
0           FAT (file system)
1           AmigaOS
2           VMS or OpenVMS
3           Unix
4           VM/CMS
5           Atari TOS
6           HPFS (file system)
7           Macintosh (platform), Mac OS (operating system)
8           Z-System
9           CP/M
10          TOPS-20
11          NTFS (file system)
12          QDOS
13          Acorn RISC OS
255         Unknown
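
Taken together, the fields described above form a fixed 10-byte header. The following sketch parses it with Python's struct module; the names GzipHeader, OS_NAMES, and parse_fixed_header are assumptions of this example, and the optional fields announced by the flag byte (extra field, name, comment, header CRC) are deliberately not handled.

```python
import struct
from typing import NamedTuple


class GzipHeader(NamedTuple):
    magic: bytes       # bytes 0-1, identification
    method: int        # byte 2, compression method
    flags: int         # byte 3, special information
    mtime: int         # bytes 4-7, Unix time, little-endian
    extra_flags: int   # byte 8
    os_code: int       # byte 9


# Excerpt of the operating system table above.
OS_NAMES = {0: "FAT", 3: "Unix", 7: "Macintosh", 11: "NTFS", 255: "Unknown"}


def parse_fixed_header(data: bytes) -> GzipHeader:
    """Parse the first 10 bytes of a gzip file."""
    if len(data) < 10:
        raise ValueError("a gzip header needs at least 10 bytes")
    header = GzipHeader(*struct.unpack_from("<2sBBIBB", data))
    if header.magic != b"\x1f\x8b":
        raise ValueError("identification bytes do not match gzip")
    return header


# Hand-built header: Deflate, no flags, no time stamp, maximum compression, Unix.
sample = bytes([0x1F, 0x8B, 8, 0, 0, 0, 0, 0, 2, 3])
hdr = parse_fixed_header(sample)
print(hdr.method, hdr.extra_flags, OS_NAMES.get(hdr.os_code, "not in excerpt"))
```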

Sample calls

Compress a file:

gzip <filename>

Decompress a compressed file:

gzip -d <filename>

or

gunzip <filename>

Recursively compress all files in a directory and report the compression ratio:

gzip -rv <directory>

Print the contents of a compressed text file:

zcat <filename>

Unpack a defective compressed file up to the point of failure:

zcat <gzip-file> > <target-file>
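
The last call can also be approximated programmatically. The following is a sketch under the assumption that Python's zlib binding is used; salvage is a hypothetical helper, not part of gzip. It feeds the data to a streaming decompressor in small chunks and keeps whatever was decoded before the stream ends or an error occurs.

```python
import gzip
import zlib


def salvage(gz_bytes: bytes, chunk_size: int = 4096) -> bytes:
    """Decompress as much as possible from a damaged or truncated gzip stream."""
    decomp = zlib.decompressobj(31)  # 31 = expect a gzip wrapper
    recovered = bytearray()
    try:
        # Small chunks ensure that output produced before an error is kept.
        for start in range(0, len(gz_bytes), chunk_size):
            recovered += decomp.decompress(gz_bytes[start:start + chunk_size])
    except zlib.error:
        pass  # corrupt data: keep what was decoded up to this point
    return bytes(recovered)


original = b"".join(b"line %d\n" % i for i in range(20000))
compressed = gzip.compress(original)
damaged = compressed[: len(compressed) // 2]  # simulate a file cut off halfway

recovered = salvage(damaged)
print(len(recovered), "of", len(original), "bytes recovered")
```

The recovered data is a prefix of the original; everything after the point of damage is lost.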

gzip-compressed files

gzip
File extension: .gz
MIME type: application/gzip
Magic number: \x1F\x8B\x08 (ASCII C notation)
Developed by: Jean-Loup Gailly and Mark Adler
Type: Data compression
Container for: any file
Extended from: compress
Standard(s): RFC 1952
Website: gzip.org

The usual file extension for gzip-compressed files today is .gz; in the past it was .z.

Since gzip can only compress individual files, several files or directory trees are usually first combined with tar into an archive file called a tarball, which is then compressed with gzip.

Figure: files (circles) are first packed with tar; the resulting archive is then compressed with gzip.

Such compressed archive files usually have the double extension .tar.gz or simply .tgz. This method enables better compression overall, since redundancies between the individual files can be exploited (solid compression), but it makes access to the individual components more difficult.
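
The two-step procedure (tar first, then gzip) can be sketched with Python's tarfile module, whose "w:gz" mode performs both steps in one pass. The file names and contents below are invented for illustration, and the archive is kept in memory instead of being written to disk.

```python
import io
import tarfile

# Hypothetical example files: name -> content
files = {
    "a.txt": b"first file\n" * 50,
    "b.txt": b"second file\n" * 50,
}

# Pack the files into a tar archive and compress it with gzip.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as archive:
    for name, content in files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(content)
        archive.addfile(info, io.BytesIO(content))

# Read the archive back: the whole stream is decompressed to reach a member.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as archive:
    names = archive.getnames()
    restored = archive.extractfile("b.txt").read()

print(names)
```

Because the tar archive is compressed as one stream, reaching b.txt requires decompressing everything that precedes it, which is the access penalty mentioned above.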

Adoption

Under Unix, compression with gzip is standard today because it offers a good compromise between high speed and good data reduction for many tasks. However, where speed is less important than minimal file size (e.g. when data is widely distributed over relatively slow networks), bzip2 and LZMA are increasingly used instead, usually in combination with tar, as is the case with gzip.

The zlib compressed data format, the Deflate algorithm, and the gzip file format were specified in 1996 as Requests for Comments RFC 1950, RFC 1951, and RFC 1952.

See also

  • List of data compression programs
  • zopfli, an encoder written by Google employees that creates smaller, gzip-compatible files at the expense of very long compression times
  • pigz, a version of gzip written by Mark Adler that uses all available processor cores and threads and thus noticeably speeds up compression

References

  1. gzip-1.10 released [stable]. December 30, 2018 (accessed December 30, 2018).
  2. gzip(1): compress and expand data (deflate mode). OpenBSD General Commands Manual.
  3. compress: compress data. Open Group Base Specification.
  4. Jean-loup Gailly, Mark Adler: Compression algorithm (deflate) (archived February 16, 2014, in the Internet Archive). gzip.org, September 1, 1997 (Last-Modified).
  5. P. Deutsch: RFC 1952, GZIP File Format Specification version 4.3. May 1996.
  6. tools.ietf.org