Content-Addressed Storage

from Wikipedia, the free encyclopedia

Content-addressed storage (CAS) is a storage method on hard disks that enables direct access to individual objects while ensuring that the stored information cannot be changed. In a content-addressed storage system, stored information is accessed not via its location on the physical medium but via its content. It is typically used for high-speed storage and retrieval of static content. This "fixed content" (unchangeable content) is data that is written once and never changed afterwards, for example business documents, receipts and accounting data in electronic form. Typical application areas for CAS systems in electronic archiving are media, healthcare and finance. Immutable storage is often required by laws and regulations (e.g. GDPdU, GoBS, HGB) or by other rules (e.g. GxP, FDA). In this context, one speaks of audit-proof archiving.

Functionality

The first commercially available CAS system, the EMC Centera platform, is characteristic of a CAS solution. It was developed specifically to store immutable digital data and long-term information on fast hard drives. Until then, only digital optical storage media (WORM) had been used for this purpose. CAS technology supports online access with guaranteed content authenticity and scales into the petabyte range. The system consists of a number of network nodes, divided into storage nodes and access nodes. The access nodes hold a synchronized directory that maps each content address to the storage node on which it can be found. When a new data item or blob (binary large object) is added, the storage unit calculates a hash of its content and returns that hash as the content address of the data item. The hash value is also used to ensure that identical content is not stored a second time: if the same value already exists, the second file is discarded and the original is referenced instead. After this check, new data records are forwarded to a storage node and written to the physical medium.
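The write path described above can be sketched in a few lines. This is a minimal in-memory illustration, not Centera's implementation: the class name `ContentStore`, its `put` method, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressed store (hypothetical, for illustration)."""

    def __init__(self):
        # Maps a content address (hex digest) to the stored bytes.
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The content address is derived from the data itself.
        # SHA-256 is an assumption here, not what any vendor uses.
        address = hashlib.sha256(data).hexdigest()
        # Identical content yields the same address, so it is stored only once.
        if address not in self._blobs:
            self._blobs[address] = data
        return address

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
first = store.put(b"invoice 2019-001")
second = store.put(b"invoice 2019-001")  # duplicate content, not stored again
```

Both calls return the same address, and the store holds a single copy of the blob.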

When a content address is presented to the unit as part of an access request, the directory is first queried for the physical storage location of that content address. The data is then retrieved from the corresponding storage node, and its hash is recalculated and verified against the address. Only then does the unit transmit the requested data to the client. In a CAS system, each content address represents a specific set of data records or blobs, along with any associated metadata. Whenever a client adds a further data record or blob to an existing content block, the system recalculates the content address.
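The read path can be illustrated in the same spirit. The sketch below assumes a directory that maps content addresses to storage node names and models each node as a plain dictionary. The names `read_verified`, `address_of` and `IntegrityError` are hypothetical, and SHA-256 is again only a stand-in for whatever hash a real system uses.

```python
import hashlib


class IntegrityError(Exception):
    """Raised when stored bytes no longer match their content address."""


def address_of(data: bytes) -> str:
    # SHA-256 is an assumption made for this example.
    return hashlib.sha256(data).hexdigest()


def read_verified(directory: dict, nodes: dict, address: str) -> bytes:
    # Step 1: ask the directory which storage node holds this address.
    node_name = directory[address]
    # Step 2: fetch the bytes from that storage node.
    data = nodes[node_name][address]
    # Step 3: recompute the hash and compare it with the requested address.
    if address_of(data) != address:
        raise IntegrityError(address)
    return data


payload = b"receipt 42"
addr = address_of(payload)
nodes = {"node-a": {addr: payload}}   # one hypothetical storage node
directory = {addr: "node-a"}          # the access node's directory
```

If the bytes on the storage node are altered after they were written, the recomputed hash no longer matches and `read_verified` raises instead of returning corrupted data.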

Another typical implementation is iCAS from iTernity. The iTernity concept is based on containers (CSC, Content Storage Container). Each container is addressed by its hash value and holds several immutable documents, so neither the container nor its hash value can change once the container has been created.
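The idea of addressing a whole container by one hash can be sketched as follows. This is not iTernity's actual container format, which is not described here. The function `container_address` and the scheme of hashing the per-document digests are assumptions chosen only to show that changing any document changes the container's address.

```python
import hashlib


def container_address(documents: tuple) -> str:
    """Derive one address for a group of documents (hypothetical scheme)."""
    outer = hashlib.sha512()
    for doc in documents:
        # Hash each document first so document boundaries stay unambiguous.
        outer.update(hashlib.sha512(doc).digest())
    return outer.hexdigest()


# A tuple is used because it cannot be modified after creation.
container = (b"contract.pdf bytes", b"receipt.pdf bytes")
address = container_address(container)

# Changing a single byte in one document yields a different address.
altered = (b"contract.pdf bytes", b"receipt.pdf bytes!")
```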

In addition to the CAS approach from EMC, other vendors offer similar products that achieve the same effect, the immutability of the archived information, with different technological approaches. These include, for example, IBM, NetApp, FAST LTA, Hitachi, HP and Grau Data. An open-source CAS+ implementation was published under the name Twisted Storage. The open-source version of the Grau Archive Manager (GAM) is called OpenArchive.

Difference from conventional storage technologies

Content-addressed storage stands in contrast to location-addressed storage such as direct-attached storage (DAS) and the storage area network (SAN). With location-based management, the position of each data element on the physical medium is recorded for later use. A later request for a specific object contains only the address of the data (for example, path and file name). The storage unit uses this information to locate and retrieve the data on the physical medium. When new information is written to the data carrier, it is simply stored in free space, without regard to its content.

CAS solutions first came onto the market in 2004 and have since replaced WORM media and jukeboxes as archive systems.

Hash function

Hash functions are used to map content to a storage location. Centera relies on the 128-bit MD5 algorithm, while iTernity uses SHA in its 512-bit variant. Because MD5 has been considered broken since 2004, Caringo uses an algorithm for dynamic hash updating, although the manufacturer has not disclosed exactly how it works.
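The two digest sizes mentioned above can be checked with Python's standard `hashlib` module. The snippet only compares the output lengths of MD5 and SHA-512 for an arbitrary sample input. It does not reproduce how Centera or iTernity build their addresses.

```python
import hashlib

data = b"fixed content"  # arbitrary sample input

# Hex digests encode 4 bits per character.
md5_address = hashlib.md5(data).hexdigest()
sha512_address = hashlib.sha512(data).hexdigest()

md5_bits = len(md5_address) * 4        # 128-bit digest
sha512_bits = len(sha512_address) * 4  # 512-bit digest
```

A longer digest makes accidental or deliberate collisions between different contents far less likely, which matters because the address is the only handle on the data.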

Strengths and weaknesses

CAS works efficiently with a body of data that rarely changes. The aim is to speed up the search for specific document content and to ensure that the document found is identical to the stored original. In addition, a data record is guaranteed to be stored in a CAS system according to its content, so two identical data records can never both be stored on the medium: under the CAS addressing scheme, two identical documents have the same content address and therefore the same storage position.

Traditional disk storage systems are suitable for storing data volumes of ten to one hundred terabytes. However, they cannot efficiently manage and scale large amounts of fixed content, which can range from hundreds of terabytes to petabytes. An additional challenge for the storage system is balancing data backup and capacity planning on the one hand against long-term authenticity on the other.

For data that changes frequently, a CAS system is less efficient than conventional location-addressed storage. In such cases, the CAS system has to recalculate the address of every changed data record, and the management system for the stored objects is forced to keep updating its information about where each document is now located.
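The cost of changing data can be seen in a small sketch. The `catalog` dictionary and the document name `price-list` are hypothetical, and SHA-256 is an assumed hash. The point is only that every edit yields a new address that the management layer must record.

```python
import hashlib


def address_of(data: bytes) -> str:
    # SHA-256 is an assumption made for this example.
    return hashlib.sha256(data).hexdigest()


# Hypothetical catalog mapping a document name to its current content address.
catalog = {}

original = b"price list v1"
catalog["price-list"] = address_of(original)
old_address = catalog["price-list"]

# Editing the document produces different bytes and therefore a new address,
# so the catalog entry has to be rewritten on every change.
revised = b"price list v2"
catalog["price-list"] = address_of(revised)
new_address = catalog["price-list"]
```

With a path-based file system, the same edit would leave the document's address (its path) unchanged.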

CAS systems are economical wherever very large volumes of documents coincide with high access rates and a need for short response times. For small amounts of information, CAS systems are often not cost-effective. Unlike the content of databases and file servers, which changes constantly, fixed content derives its value from the combination of extended usability, authenticity and durability.

Fixed content

It is assumed that 80% of all stored data never changes, either because the documents are final or because a copy of the original document must be retained whenever a change is made. Such data is referred to as fixed content, and using CAS systems for it is therefore considered worthwhile.

Standard

With XAM (eXtensible Access Method), several CAS manufacturers aim to establish a common standard for controlling CAS systems.

References

  1. http://twistedstorage.sourceforge.net
  2. OpenArchive. In: GRAU DATA. Retrieved December 17, 2019 (American English).
  3. http://www.snia.org/forums/xam/
