DRBD

Basic data

Maintainer: Philipp Reisner, Lars Ellenberg
Developer: LINBIT HA-Solutions GmbH, Vienna, and LINBIT USA LLC, Oregon
Current version: 9.0.24-1 (June 30, 2020)
Operating system: GNU/Linux
Programming language: C
License: GPL (free software)
Website: drbd.linbit.com
[Figure: Overview of the DRBD concept]

DRBD (Distributed Replicated Block Device) is free network storage software for Linux. Consisting of a kernel module together with a management application and scripts in userspace, it is used to mirror a block device on a productive (primary) server in real time onto another (secondary) server. This mechanism is used to implement high availability (HA) in the Linux environment and thus to keep various services available. Alternatives include distributed file systems such as GlusterFS.

Functionality

The primary task of the DRBD software is to keep one and the same data set available on more than one block device, such as a hard drive or an SSD, in the sense of high availability. The block devices in question are usually located in different servers in order to provide redundancy in the event of a server failure. DRBD has several functions to avoid the problems commonly associated with data replication or to mitigate their effects: in addition to fully synchronous replication, it also supports semi-synchronous and asynchronous replication. It likewise supports multiple network transport media. The activity log ensures that the entire content of a block device does not have to be resynchronized between the nodes of a high-availability cluster after the network connection between the two systems has been interrupted.

Basic functionality

For communication with a block device, DRBD uses the block device layer of the Linux kernel, so that a DRBD drive ("resource") itself appears as a block device on a Linux system. At the application level, access to DRBD resources therefore works exactly like access to hard disks or SSDs; likewise, a DRBD resource requires a file system or a comparable layer on top of it to enable coordinated read and write access. DRBD is thus agnostic with respect to the software that uses it.

On the one hand, a DRBD resource forwards incoming read and write requests from the application side directly, block by block, to the block device attached to it (the "backing device"). On the other hand, the DRBD resource also hands the same data to the network stack of the Linux kernel, which sends it to the other node. This mechanism guarantees the synchronization of the data in DRBD.

DRBD supports various network transport media for data transmission; the standard configuration, however, uses an existing Ethernet connection with the TCP/IP protocol. Optionally, two DRBD systems can also be connected directly, without any network switch in between ("cross link"), if suitable network cards are installed in the systems.
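
For illustration, a minimal two-node resource definition could look like the following sketch; the resource name r0, the host names alpha and bravo, the device paths and the IP addresses are hypothetical and not taken from the article.

  # /etc/drbd.d/r0.res - hypothetical two-node example
  resource r0 {
    device    /dev/drbd0;
    disk      /dev/sdb1;       # backing device
    meta-disk internal;        # metadata kept on the backing device

    on alpha {
      address 192.168.10.1:7789;   # replication link over TCP/IP
    }
    on bravo {
      address 192.168.10.2:7789;
    }
  }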

DRBD handles the automatic resynchronization of the backing devices of a resource if the network connection of this resource has been interrupted in the meantime. As soon as the network connection is restored, DRBD automatically compares the status of the resource on both hosts and initiates a resynchronization if necessary. The resynchronization is considered successfully completed as soon as the data set on the backing devices of the DRBD resource is identical on all participating servers.

A special form of resynchronization is the initial creation of a DRBD resource: here the contents of the backing devices may differ at the block level; however, this is irrelevant because, when DRBD is used, the data on the backing devices is successively overwritten anyway. DRBD therefore offers special commands to skip the synchronization of the backing devices when the resource is brought up for the first time.
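
The initial full synchronization can be skipped by declaring one node's data the authoritative starting point with a new current UUID and a cleared bitmap. The following call is a sketch of the DRBD 9 drbdadm syntax for a hypothetical resource r0; the invocation differs slightly in DRBD 8.4.

  # On the node whose data is to be taken as the starting point (DRBD 9 syntax):
  drbdadm new-current-uuid --clear-bitmap r0/0    # resource r0, volume 0; skips the initial sync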

Functionality up to DRBD 8.4

Up to and including DRBD 8.4, DRBD only offered the option of mirroring a data set from one system to exactly one other. Accordingly, it was common to use DRBD in high-availability clusters with two nodes, with write access, for example by applications, usually taking place on only one system.

In a cluster consisting of two nodes, a DRBD resource generally has the Primary role on one system and the Secondary role on the other. With DRBD, read and write access is only ever possible on the node on which the respective DRBD resource holds the Primary role. The DRBD resource in Secondary mode only receives the updates to the data set of the respective drive that the resource in Primary mode sends to it.

A function frequently used up to and including DRBD 8.4 to work around the two-node limitation was so-called "device stacking". DRBD offered the possibility of combining several DRBD resources on a host within the block device layer of the Linux kernel in such a way that changes are passed on from one to the next in a defined order. The primary goal of this measure was to enable redundant local replication and, at the same time, replication to another location. However, stacked resources were inferior to their "normal" counterparts in terms of achievable performance.
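
Stacked resources were declared with the stacked-on-top-of keyword; the fragment below is a rough sketch modeled on the stacked-resource examples in the DRBD 8.4 documentation, with a hypothetical lower resource r0 and a stacked resource r0-U replicating to a third site.

  # Hypothetical DRBD 8.4 stacked resource: r0-U sits on top of the local resource r0
  resource r0-U {
    stacked-on-top-of r0 {
      device  /dev/drbd10;
      address 192.168.10.1:7790;   # local endpoint of the long-distance link
    }
    on charlie {
      device    /dev/drbd10;
      disk      /dev/sdb1;
      meta-disk internal;
      address   10.0.0.3:7790;     # remote third node
    }
  }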

The DRBD code that is currently an official part of the Linux kernel corresponds to the functionality of DRBD 8.4.

Functionality from DRBD 9.0

Starting with DRBD 9, its maintainers have significantly expanded the functionality of the software. In particular, the restriction that a DRBD resource in the Primary role can only replicate its data to a single DRBD resource in the Secondary role has been removed. DRBD 9 instead offers the option of replicating data from one Primary resource to up to 31 Secondary resources at the same time.
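
In a DRBD 9 configuration, additional replication targets are declared as further on sections, each with a node-id, and the connections between all hosts can be generated with a connection-mesh block. The fragment below is a sketch extending the hypothetical r0 example with a third host.

  resource r0 {
    # device, disk and meta-disk as in the two-node example above
    on alpha   { address 192.168.10.1:7789; node-id 0; }
    on bravo   { address 192.168.10.2:7789; node-id 1; }
    on charlie { address 192.168.10.3:7789; node-id 2; }

    connection-mesh {
      hosts alpha bravo charlie;   # full mesh of replication links between the three nodes
    }
  }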

Functionality as Software Defined Storage (SDS)

Since DRBD 9, DRBD can also be used indirectly as a solution for software-defined storage thanks to the introduction of the management software drbdmanage: DRBD resources can be created dynamically from environments such as OpenStack on whichever servers of a DRBD cluster currently offer free resources, that is, available space. The functions of DRBD 9 thus indirectly enable operation as storage that can be scaled out horizontally ("scale out").
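
A drbdmanage session might have looked roughly like the following sketch; the subcommand names varied between drbdmanage releases, and the node names, addresses and sizes are hypothetical.

  drbdmanage init 192.168.10.1                 # initialize the cluster on the first node
  drbdmanage add-node bravo 192.168.10.2       # add a second node to the drbdmanage cluster
  drbdmanage add-volume vol0 20GB --deploy 2   # create a volume and replicate it to two nodes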

With the introduction of replication to multiple targets, the stacking of resources known from DRBD 8.4 ("device stacking") has de facto lost its significance in DRBD 9.

Roles of a DRBD resource

DRBD distinguishes between the Primary and Secondary roles of a resource.

  • Primary: the resource allows read and write access
  • Secondary: the resource only receives updates from a Primary resource and cannot be used locally for either reading or writing

The role of a DRBD resource is fundamentally independent of the current status of its network connection. A DRBD resource knows three states for its own network connection:

  • Connected: an active connection to the other cluster node exists for the respective resource
  • StandAlone: no active connection to the other cluster node exists for the respective resource, and the resource does not actively attempt to establish one
  • Connecting (up to DRBD 8.4: WFConnection): no active connection to the other cluster node exists for the respective resource, but it actively tries to establish one

In total, the following possible states result for a DRBD resource on a host:

  Role        Active connection      Waiting for connection                    Connection lost
  Primary     Primary/Connected      Primary/Connecting (or WFConnection)      Primary/StandAlone
  Secondary   Secondary/Connected    Secondary/Connecting (or WFConnection)    Secondary/StandAlone
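
Role and connection state can be queried and changed with the DRBD command-line tools; the resource name r0 below is a placeholder.

  drbdadm role r0        # print the local and peer role, e.g. "Primary/Secondary" (DRBD 8.4)
  drbdadm cstate r0      # print the connection state, e.g. "Connected"
  drbdadm status r0      # DRBD 9: combined overview of roles, disks and connections

  drbdadm primary r0     # switch the local resource to the Primary role
  drbdadm secondary r0   # give up the Primary role again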

Replication modes

DRBD supports various modes of replication, referred to as "protocols" in DRBD. The available protocols differ significantly with regard to the point in time at which the DRBD resource in Primary mode signals to an application writing to it that the write operation has been completed successfully.

The standard configuration for a drive is fully synchronous replication (protocol C): in a setup with two nodes, the DRBD resource confirms the successful completion of a write operation to the writing application only once the DRBD resource on both cluster nodes has successfully written the change to its local block device. In DRBD 9, the number of nodes on which a write must be completed before it counts as successful in the sense of protocol C can be configured by the administrator. This replication mode is the only one of the modes supported by DRBD that offers transaction safety. It is therefore the mode found in most setups.

DRBD describes its implementation of semi-synchronous replication as protocol B: here, the packets only need to have reached the other cluster node for the Primary resource to signal the successful completion of the write operation to the application. The mode is not transaction-safe because, in the event of a power failure on the node with the resource in Secondary mode, the data there may not yet have been written to disk. In practice, protocol B is only of minor importance.

Protocol A describes asynchronous replication in DRBD: the DRBD resource in Primary mode signals the successful completion of a write operation to the local application as soon as the data has reached the underlying block device on the same node. Protocol A is usually not used for local replication between two nodes; instead, it is particularly suitable for replication between different geographical sites when the network latency between them would otherwise be unsatisfactory.
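
In the resource configuration, the replication mode is chosen with the protocol keyword; in DRBD 8.4 it is placed at the resource level, while in DRBD 9 it belongs to the net options of a connection. A sketch for a hypothetical resource r0:

  resource r0 {
    protocol C;       # fully synchronous replication (transaction-safe, default choice)
    # protocol B;     # semi-synchronous replication
    # protocol A;     # asynchronous replication, e.g. for long-distance links
    # ... on sections as in the earlier examples ...
  }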

Dual primary resources

Up to and including version 8.4, DRBD offers in principle the possibility of operating the related DRBD resources on different hosts in the Primary role in parallel. Reliable operation of a DRBD resource in the Primary role on several systems, however, almost always places considerably higher demands on the setup than classic operation according to the primary-secondary scheme; for example, it requires a locking mechanism to prevent competing write operations, for which the Distributed Lock Manager (DLM) is usually used under Linux. For DRBD 8.4, the DRBD developers advise against such setups in the majority of cases; the only legitimate use remaining is short-term parallel operation in dual-primary mode for the purpose of live migration of virtual machines.

DRBD 9 supports the operation of a resource in dual-primary mode in principle, but the developers of the software no longer consider this application scenario to be of practical relevance, apart from the live migration of virtual machines mentioned above. The vendor does not currently plan to develop the function further.
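
Dual-primary operation has to be enabled explicitly in the resource configuration; the fragment below is a sketch with a hypothetical resource name and, as described above, is only advisable for short-lived scenarios such as live migration.

  resource r0 {
    net {
      allow-two-primaries yes;   # permit the Primary role on both nodes at the same time
      # DRBD 8.4 accepts the option without an argument: allow-two-primaries;
    }
    # ... remaining configuration as usual ...
  }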

Layout of the backing device

For its proper functioning, DRBD needs so-called metadata on the backing devices of a resource. When creating a resource in DRBD 8.4 or earlier versions, the administrator creates the metadata with drbdadm on all participating systems; with DRBD 9, drbdmanage offers the alternative option of having the metadata created automatically.

The area with the metadata is usually located at the beginning or at the end of the backing device; alternatively, the metadata can also be stored on an external block device. Such a setup is primarily used to make an existing block device replicable with DRBD after the fact. Placing the metadata of newly created DRBD resources externally is, however, unusual and only used in special scenarios.

The metadata area has a variable size, which essentially depends on the size of the backing device as a whole. In day-to-day operation, this can lead to problems if a resource uses internal metadata and the administrator wants to enlarge it. In such cases, after the backing device has been enlarged, the metadata of the DRBD resource must first be moved to the new end of the backing device.
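
Metadata handling is done with drbdadm; the commands below are a sketch for a hypothetical resource r0. Enlarging a resource with internal metadata additionally requires that the backing devices on all nodes be grown first.

  drbdadm create-md r0   # write fresh metadata onto the backing device (initial setup)
  drbdadm dump-md r0     # inspect the metadata of a detached resource
  drbdadm resize r0      # after enlarging the backing devices on all nodes, grow the DRBD
                         # device; internal metadata ends up at the new end of the device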

The information stored in the metadata includes, in particular, the DRBD activity log and information about past role changes of the respective DRBD resource.

Activity log

In the day-to-day operation of a high-availability cluster, an interruption of the network connection between the cluster nodes is a common error scenario. The failure of a cluster node, which inevitably entails the loss of its network connection, is in fact the classic case a high-availability cluster is built for. From the point of view of DRBD, as of any network-based replication solution, the interruption of the communication link to the cluster partner is nevertheless a real problem.

This is particularly true if the failed cluster node previously operated DRBD resources in the Primary role. Even with fully synchronous replication using protocol C, it cannot be ruled out that, immediately before the failure of the node, changes were made to the data set on the backing devices of its DRBD resources that could no longer be synchronized to the cluster partner. In the worst case, the backing device of the resource on the failed node then contains more recent data than the backing device of the resource on the remaining system.

If a cluster manager such as Pacemaker is used, it normally activates all resources on the remaining node immediately after a node failure; it then switches the DRBD resources there to the Primary role. Because, at least with protocol C, the affected write operations on the previous primary node were never reported as successfully completed, this is expressly not a split-brain situation. As soon as the failed cluster node rejoins the cluster, however, the remaining cluster node must ensure that the now outdated data on the backing devices of the DRBD resources of the temporarily failed system is discarded.

There are different approaches to this: the node with the DRBD resources in the Primary role could, for example, synchronize the entire content of the backing device from the primary to the secondary node. Especially with large DRBD resources, however, this would take considerable time, during which the performance of the DRBD resources would be degraded. DRBD therefore uses a technique called the activity log instead: in the activity log, which resides in the metadata of a DRBD resource, DRBD records which extents of the backing device were changed most recently. The number of extents tracked in the activity log can be adapted to local conditions in the configuration file. As soon as the network connection is re-established, the node with the DRBD resources in the Primary role synchronizes only those extents that are recorded in the activity log to the resource on the other node. In this way, DRBD avoids a complete resynchronization of the backing devices.
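
The size of the activity log is set with the al-extents parameter in the disk section of the resource configuration; the value below is purely illustrative.

  resource r0 {
    disk {
      al-extents 1237;   # number of 4-MiB extents tracked in the activity log; larger values
                         # mean fewer metadata updates during normal operation but a longer
                         # resynchronization after a crash of the Primary node
    }
    # ... remaining configuration as usual ...
  }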

Different network transports

While in DRBD 8.4 both the replication logic and the logic for transporting the data to be replicated were part of the single DRBD kernel module drbd.ko, the developers significantly restructured the network logic of the solution in DRBD 9. Since then, the actual DRBD kernel module has been exclusively responsible for the replication of data and hands it, via an internal, generic interface, to a separate kernel module responsible for the network transport of the DRBD data. The new design enables the use of different network transports; in addition to Ethernet, DRBD 9 also supports replication via InfiniBand. Support for additional transport media is planned by the manufacturer and in preparation.
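
In DRBD 9 the transport is selected in the resource configuration; the fragment below is a sketch (the exact placement of the keyword depends on the DRBD 9 release, and the RDMA transport requires the separate transport module from LINBIT).

  resource r0 {
    net {
      transport rdma;   # replicate via InfiniBand/RDMA instead of the default TCP transport
    }
    # ... remaining configuration as usual ...
  }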

Performance

Unlike distributed storage solutions such as Ceph or GlusterFS, DRBD does not reorganize the data itself. As part of the block device layer of the Linux kernel, it merely forwards incoming write and read requests to the respective backing device on the one hand and to the network stack of the local system on the other. Because the movement of data within the block device layer of the Linux kernel happens practically in real time, the throughput achievable on a DRBD resource roughly corresponds to that of the physical block device below it. The throughput can be increased accordingly by backing DRBD with RAID arrays of many spindles whose bandwidth is aggregated by suitable RAID controllers.

With regard to the latency to be expected for write access to DRBD resources, the replication performed by DRBD has a negative effect. The DRBD resource in the Primary role can only report the successful completion of a write operation to the writing application after the cluster partner has confirmed that the data has been written to the block device there. The overall latency is therefore composed of the latencies of the two data carriers and twice the network latency between the systems. This parameter can, however, be optimized by using alternative network technologies such as InfiniBand, whose inherent latency is significantly lower than that of Ethernet.

Overall, the latency of write access to DRBD resources is nevertheless significantly lower than that of distributed storage solutions, even when Ethernet is used. In return, those solutions usually offer significantly higher bandwidths because they aggregate the bandwidth of many nodes by splitting the data to be written into binary objects.

Related solutions

DRBD itself offers only rudimentary functions for cross-node coordination in high-availability clusters. Since DRBD 9, there has been the option of having DRBD automatically switch a resource to the Primary role if it is to be accessed locally for writing.

However, this functionality is not sufficient for complex installations. On most systems, it is not enough to switch a resource to the Primary role; in addition, a file system located on the DRBD resource usually has to be mounted into the file system tree of the host. Besides a service that may have to be started, such as a database that then uses the mounted file system, classic high-availability clusters also need additional cluster-level resources, such as a service IP address that moves back and forth between the cluster nodes together with the services ("failover"). Only in this way can HA functionality be achieved without having to reconfigure the clients that are supposed to access a service offered by the high-availability cluster.

In the context of high-availability clusters, DRBD is therefore regularly used together with the cluster resource manager (CRM) of the Linux-HA project, Pacemaker, and its cluster communication manager (CCM), Corosync. For managing DRBD, LINBIT offers an OCF resource agent with which DRBD resources can be integrated into and managed by Pacemaker.
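
A typical Pacemaker integration, sketched here in crm shell syntax with a hypothetical resource name r0, defines the DRBD resource through the ocf:linbit:drbd agent and promotes it on exactly one node:

  # crm configure fragment (hypothetical names, classic master/slave notation)
  primitive p_drbd_r0 ocf:linbit:drbd \
      params drbd_resource=r0 \
      op monitor interval=29s role=Master \
      op monitor interval=31s role=Slave
  ms ms_drbd_r0 p_drbd_r0 \
      meta master-max=1 master-node-max=1 \
           clone-max=2 clone-node-max=1 notify=true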

Management tools

The command-line tools used to manage DRBD resources have changed several times over the course of DRBD's version history.

DRBD 8.4 and earlier

Up to and including DRBD 8.4, administrators had four tools available to manage DRBD resources:

  • /proc/drbd
  • drbd-overview
  • drbdsetup
  • drbdadm

The file drbd within procfs (/proc/drbd) contains basic details about the locally available DRBD resources. In it, the DRBD kernel driver records all active resources and their roles on the local system and, if known, on the cluster partner.

The tool drbd-overview reads /proc/drbd and correlates the data found there with the contents of the DRBD configuration files; it then displays a suitably prepared overview of all resources. drbd-overview is, however, obsolete and should no longer be used.

drbdsetup is the low-level tool for managing DRBD resources. Administrators rarely use it directly; instead, the tool drbdadm calls drbdsetup in the background with the correct parameters. drbdsetup is also the recommended way to obtain information about the DRBD resources instead of reading /proc/drbd.

drbdadm is, in DRBD 8.4, the main tool administrators use to create, manage and delete DRBD resources.
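
A minimal administration session with these tools might look as follows; the resource name r0 is a placeholder and its configuration is assumed to exist under /etc/drbd.d/.

  drbdadm up r0      # read the configuration, attach the backing device, connect to the peer
  cat /proc/drbd     # inspect the state of all local resources (DRBD 8.4)
  drbdadm adjust r0  # apply changes in the configuration file to the running resource
  drbdadm down r0    # disconnect and detach the resource again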

DRBD 9

Owing to the greatly increased complexity of the setups that are possible in DRBD 9 compared with DRBD 8, the manufacturer LINBIT introduced a new management tool for DRBD 9 called drbdmanage. drbdmanage is based on a server-client architecture and enables the cluster-wide creation, administration and deletion of DRBD resources. The first version of drbdmanage is written in the Python scripting language; LINBIT is currently working on a new version of drbdmanage that is to be based on Java.

As part of the introduction of DRBD 9, the DRBD developers also decoupled the development of DRBD from that of the associated administration tools, so that they now follow separate release cycles.

LINBIT caused controversy at the end of 2016 when it changed the license of drbdmanage from the free GPL v3 to a commercial license that could not be reconciled with the requirements of the GNU project's definition of free software. However, LINBIT reversed the decision a little later, so that the provisions of the GNU GPL v3 apply to drbdmanage again.

In July 2018, LINBIT replaced drbdmanage with Linstor in order to meet new requirements in storage management.

Differentiation from other solutions

Especially in version 9, DRBD is regularly mentioned in the same breath as other storage solutions such as Ceph or GlusterFS; OCFS2 and GFS2 are also terms that come up regularly in the DRBD context. However, DRBD differs markedly from all of these solutions.

Distributed storage solutions

Unlike DRBD, solutions such as Ceph or GlusterFS are massively distributed storage systems. They are characterized by the fact that they use an algorithm, for example a hash algorithm, to distribute data across an arbitrary number of physical storage devices according to certain criteria. Replication is usually an implicit part of such a solution.

Cluster file systems

Cluster file systems such as OCFS2 or GFS2 make it possible to access the same data set for reading and writing from multiple nodes of a cluster. DRBD is therefore not an alternative to cluster file systems; in dual-primary mode, however, it can form the basis for such setups.

Advantages over shared cluster storage

Conventional computer clusters usually use some form of shared storage for the cluster resources. This approach, however, has a number of disadvantages that DRBD avoids.

  • Shared storage typically represents a single point of failure, since both cluster systems depend on the same storage. With DRBD, this danger does not exist, because the required cluster resources are replicated locally rather than kept on shared storage that could be lost. Modern SAN solutions, however, offer mirroring functions that can eliminate this otherwise unavoidable point of failure.
  • Shared storage is usually accessed via a SAN or NAS, which adds a certain overhead to read access. With DRBD, this overhead is significantly reduced, since read access always takes place locally.

Applications

DRBD works within the Linux kernel at the block level and is therefore transparent to the layers above it. DRBD can thus serve as a basis for:

  • conventional file systems
  • shared cluster file systems such as GFS or OCFS2
  • a further logical block device layer such as LVM
  • any application that supports direct access to a block device.

DRBD-based clusters are used, for example, to add synchronous replication and high availability to file servers, relational databases (such as PostgreSQL or MySQL) and hypervisors/server virtualization (such as OpenVZ).
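
Because a DRBD device behaves like any other block device, the layers on top of it are set up with the usual tools; a minimal sketch (the device name /dev/drbd0 and the mount point are assumptions):

  # On the node holding the Primary role for the resource behind /dev/drbd0:
  mkfs.ext4 /dev/drbd0         # create a conventional file system on the DRBD device
  mount /dev/drbd0 /srv/data   # mount it like a local disk
  # Alternatively, use the DRBD device as an LVM physical volume:
  pvcreate /dev/drbd0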

History

In July 2007 the DRBD authors made the software available to the Linux developer community for possible future inclusion of DRBD in the official Linux kernel. After two and a half years, DRBD was included in kernel 2.6.33, which was released on February 24, 2010.

The commercially licensed version DRBD+ was merged with the open-source version in the first half of December 2008 and released under the GNU General Public License. As of the resulting version 8.3, it is possible to mirror the data set to a third node. The previous maximum of 4 TiB per device was increased to 16 TiB.

Since 2012 there has no longer been any size limit per DRBD device. The official usage statistics list around 30,000 regularly updated installations with device sizes of up to 220 TB.

In July 2018, LINBIT replaced drbdmanage with Linstor.

