Lempel-Ziv-Markow algorithm

from Wikipedia, the free encyclopedia

The Lempel-Ziv-Markow algorithm ( LZMA ) is a free data compression algorithm that has been developed by Igor Wiktorowitsch Pawlow since 1998 and achieves comparatively good compression rates and a high speed when unpacking. It is named after Abraham Lempel and Jacob Ziv , who developed the LZ77 algorithm, and after Andrei Andrejewitsch Markow , after whom the Markow chains were named.

The algorithm works with a dictionary procedure similar to LZ77 and can therefore in principle be seen as a further development of Deflate .

features

  • Very good compression (on average better than bzip2 )
  • Fast decompression (about twice as fast as bzip2)
  • Very large dictionaries are supported (up to four gigabytes)
  • pronounced asymmetry
    • Only a fraction of the working memory used for compression (packing) is required for decompression (unpacking) (under Windows approx. 2 MB + dictionary size) while the working memory required for packing is a multiple of the dictionary size.
    • Decompression is typically about 10 to 20 times faster than compression.

technology

LZMA uses an improved variant of the LZ77 algorithm, Markov chains and a range coder (an implementation of arithmetic coding ) for entropy coding .

The code for unpacking LZMA usually only occupies about 5 kB in compiled form. The amount of memory required for unpacking depends on the size of the dictionary used for packing. Due to the small size of the unpacker and the very low memory requirement (especially with smaller dictionaries), the method is particularly suitable for embedded applications.

In the 7-Zip implementation , different variants of hash nodes , binary trees and Patricia tries are used for dictionary searches.

With LZMA2, the entire amount of data to be compressed can be divided into sections. The parts can be processed by in-house processes and put together later. A common dictionary is built up by all processes, but the remaining data is necessarily processed as separate data streams during entropy coding, which means that the suboptimality of the area coding (less than one byte per part) has a corresponding effect and each block still has its own minimalistic header data. Due to the possibility of parallelization, the packing and unpacking on modern multi-processor and multi-core systems can be significantly accelerated, since it enables the work to be split up and does not depend on linear steps like LZMA, which limits the use to one core.

commitment

In addition to using special file formats for compressed data, LZMA support has also been integrated into many other systems. For the transparent compression and decompression of executable files with UPX (from version 2.92, beta) or Upack and for compressing file systems such as SquashFS (or CramFS , with the appropriate patches), the LZMA is also available. With a large number of Linux distributions ( Arch Linux since March 2010 (since 2020, however, Zstandard ), source code packages of the Gentoo Linux distribution , Slackware Linux since May 8, 2009, openSUSE since March 27, 2008, Pardus' package management and the Debian Package management systems offer support, ...) LZMA-compressed installation packages can now be used. Software installation systems for Windows such as the Nullsoft Scriptable Install System and Inno Setup also create a type of extended self-extracting archive files that can be compressed with LZMA.

File formats

Originally it could only be used with the new 7z format from 7-Zip. Several other formats are now available. In the case of xz , a new format was created especially with regard to LZMA support, which is specifically intended for exclusive use with LZMA (the same applies to the lzip format). In the case of the new ALZip format ( .eggfiles), a new, more modern (more flexible) file format has been created that is now mainly used with LZMA-compressed content as part of the modernization of the file format capabilities, which is indicated by the availability of more modern methods, as part of a compatibility breach in the format becomes. In the case of Zip 2 (for example with WinZip from version 12.0 or 7-Zip from version 4.61, beta) LZMA support was added to an existing (expandable) format.

software

The reference implementation of LZMA was done in free software. It initially came in the form of the 7-Zip programs and is now also published in isolation in the form of the LZMA SDK. The free reference library for LZMA compression was written in C ++ and supports multithreading .

There are currently three working broadcasts on Unix-like platforms:

  • p7zip is a current port of the command line tool 7z , so it offers full support of the 7z archive format and often serves as a substructure for the 7z functions of graphical tools with 7z support such as Karchiver and WinRAR .
  • lzip was the first LZMA solution for Unix-like operating systems that completely copied the familiar concept of gzip .
  • XZ Utils are a port of the LZMA code from 7-Zip, which, under Linux, offer largely the same handling for the LZMA packing method as the established gzip and bzip2 , which do not support 7z archives.

Furthermore, GRUB 2 has been using LZMA as standard since July 2008 instead of the previously used LZO (initially only for i386-PC).

history

  • The reference implementation 7-Zip was published in 2000.
  • The source code of LZMA SDK took effect on 23 November 2008 (version 4.61 beta) public domain published (English "public domain").
  • With version 9.04 beta of 7-Zip the algorithm LZMA2 was introduced on May 30th, 2009 , which is a slightly modified variant of the original algorithm, which supports multithreading better and improves the handling of non-compressible content.

Unix platforms

Since extensive use of Windows-specific features is made in the source code of 7-Zip, it took some time after its first publication before a Unix- compatible version was released, even though it was free software. LZMA was first usable in 2004 with a port of the command line version of 7-Zip called p7zip on Unix platforms. In the same year the (much more portable) LZMA SDK became available, which contains the command line program “lzma_alone”. Similar to gzip or bzip2, lzma_alone was used together with tar in order to be able to record file metadata and rights information from Unix file and operating systems. Less than a year after the LZMA SDK was first published, Lasse Collin published the LZMA Utils, which (initially only consisting of a set of wrapper scripts) created a (gzip-like) user interface to lzma_alone that is familiar to Unix users.

In 2008 Antonio Diaz published lzip, which instead of the raw LZMA data stream offered a container format with checksums and magic numbers . This provided a complete solution for the use of LZMA in Unix style, which, however, was only able to prevail in part before the LZMA Utils were further developed and now offered something similar under the name "XZ Utils". The XZ Utils now seem to be gaining acceptance as an LZMA implementation for Unix-like platforms. Your xz file format is now also supported by the reference implementations.

Web links

swell

  1. a b http://sourceforge.net/projects/sevenzip/forums/forum/45797/topic/2965956
  2. ^ Pierre Schmitz: Switching to xz compression for new packages archlinux.org, March 23, 2010.
  3. http://sevenzip.sourceforge.jp/chm/cmdline/switches/method.htm#LZMA2
  4. ^ Brian Lindholm: New Options in the World of File Compression . In: Linux Gazette . No. 162 , May 2009 (English, linuxgazette.net [accessed January 7, 2011]).