Venti (storage technology)
Venti is a network storage system that permanently stores blocks of data. The 160-bit SHA-1 hash code of the block ( called score ) is used to address the data. This ensures that the same data blocks are only saved once, even if they are written several times, since the same blocks also receive the same address. However, blocks cannot be removed again. This makes Venti suitable for perpetual archives and backups. Venti is mostly used with the archive programs vac and vtstore , or with Fossil - a file system that offers permanent snapshots.
history
Venti was developed by Sean Quinlan and Sean Dorward at Bell Laboratories . It first appeared in Plan 9 in 2002 . The development was continued by Russ Cox , who implemented most of the new software and developed a library for Venti-optimized data structures (files, directories, metadata). Venti is available in both Plan 9 distributions and forks as well as a port for Unixoid systems (as part of Plan 9 from User Space ).
Details
Venti is a userspace - daemon . Clients connect to via TCP and communicate via a simple RPC protocol. The main RPC messages are as follows:
- read (score, type) , returns the data block for the given score and type
- write (data, type) , saves a data block with a given type . The score is made up of the calculated SHA-1 hash code and the type.
There is no provision for deleting data blocks. The data blocks to be saved must not be smaller than 512 bytes or larger than 56 kB. If a client wants to save larger files, it breaks them down into blocks of up to 56 kB and saves the resulting scores in a suitable data structure in the Venti. For example, Fossil uses a hash tree to store large files. Venti itself does not care about the content or structure of the stored data, it only saves the type of data block.
The Venti design offers some technical advantages over comparable alternatives:
- Since all write operations are permanent, i.e. only new data is added (which enables a very simple implementation, so that there is less risk of data loss through bugs), there is also no fragmentation of the file system.
- Clients can check the correctness of the server: the score of the returned data must always match the requested score. In this way, not only technical defects, but also attempts at counterfeiting are uncovered.
- Data cannot be overwritten. If a score is available, the associated data is also available.
- Little need for user authentication: the data cannot be manipulated or destroyed and one can only read if the appropriate score is known. At most, the threat is that an attacker could fill the storage space.
- Compression becomes possible without complicated disk structures.
- Incremental data backup of separate media (e.g. CD-ROM / DVD, tape drive, network) becomes very easy.
- Redundant, automatically synchronizing Venti implementations are easy and possible without adapting the clients.
The data blocks are stored in files of a fixed size - called arenas -, typically hard drives or partitions, for example in a RAID network. The individual arenas are usually dimensioned so that they can be easily written onto other media (e.g. CD / DVD ) or magnetic tape. Taking all of Arenas together form the data log ( data log ). Another set of files form the index, which shows scores on specific positions within the data log . The data structure of the index is a hash table with a fixed bucket size. Venti relies on the random distribution of the hash codes so that the buckets don't get overfilled. Since every index lookup costs one disk seek, it is advisable to distribute the index over several small disks with short access times. The strict division into data log and index has another advantage: the index can be reconstructed in the event of data loss.
Hash collisions
A fundamental principle of information theory , the pigeonhole principle, says that with a function that maps a large set A to a small set B, this mapping is not one-to-one or reversible. So there are several elements from A that are mapped to the same element from B. Theoretically, there are many different data blocks with the same hash code; this is also called a collision .
However, Venti does not address this question. It is argued that with a SHA-1 hash, the risk of a (random) collision is infinitesimal, even on the order of exabytes. However, in 2005 there were significant advances in reducing the computational effort required to create artificial SHA-1 collisions (but it remains enormously expensive). In 2017, a SHA-1 collision was presented by Google and CWI Amsterdam.
Web links
- Venti: a new approach to archival storage , paper describing Venti (Also available as a PDF )
- New Venti manual page (overview) , section 1 and 8 venti manual pages also exist
- Old Venti manual page , no longer in use on Plan 9.
- Old Venti utilities manual page , no longer in use on Plan 9.
Individual evidence
- ^ Sean Quinlan, Sean Dorward: Venti: a new approach to archival storage. Retrieved October 23, 2018 : “Consider an even larger system that contains an exabyte (10 ^ 18 bytes) stored as 8 Kbyte blocks (~ 10 ^ 14 blocks). Using the Sha1 hash function, the probability of a collision is less than 10 ^ -20. Such a scenario seems sufficiently unlikely that we ignore it and use the Sha1 hash as a unique identifier for a block. "
- ↑ https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html