Non-Uniform Memory Access

from Wikipedia, the free encyclopedia

Non-Uniform Memory Access, or NUMA for short, is a computer memory architecture for multiprocessor systems in which each processor has its own "local" memory but grants the other processors "direct" access to it via a common address space (distributed shared memory). The memory access time for a CPU in such a system therefore depends on whether a memory address lies in the CPU's own "local" memory or in the "remote" memory of another CPU.
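
The effect of non-uniform access times can be illustrated with a toy model. The latency numbers below are illustrative assumptions, not measurements of any real system:

```python
# Toy model of NUMA access latency (numbers are assumptions, not measurements).
T_LOCAL = 100   # ns: access to the CPU's own local memory (assumed)
T_REMOTE = 300  # ns: access to another CPU's memory over the interconnect (assumed)

def average_access_time(local_fraction):
    """Expected access time when local_fraction of accesses hit local memory."""
    return local_fraction * T_LOCAL + (1 - local_fraction) * T_REMOTE

# A workload with good locality pays far less on average:
print(average_access_time(0.9))  # mostly local accesses
print(average_access_time(0.5))  # half of all accesses go to remote memory
```

The model shows why placing data near the CPU that uses it matters: the average cost grows linearly with the fraction of remote accesses.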

This stands in contrast to:

  • Uniform Memory Access (UMA), in which there is a central memory with access times that are the same for every processor.
  • No Remote Memory Access (NoRMA), in which no direct access to remote memory is permitted and each processor uses its own address space.
  • Cache-Only Memory Access (COMA), in which, as with NUMA, there is a common address space and separate local memory for each processor; however, other processors do not write to "remote" memory directly. Instead, similar to caches, the data is copied into the local memory of the accessing processor and deleted (invalidated) at its previous position. That is, on a write access to a memory page, the writing CPU becomes the owner of the page, and the page is transferred into the local main memory of that CPU.
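
The COMA write-ownership migration described above can be sketched as a toy model. The class and its names are hypothetical, for illustration only:

```python
# Toy sketch of COMA-style page migration on write (hypothetical model,
# not a real coherence protocol implementation).
class ComaMemory:
    def __init__(self):
        # owner[page] = node whose local memory currently holds the page
        self.owner = {}

    def read(self, node, page):
        # Reads may be served from a copy; ownership is unchanged.
        return self.owner.get(page)

    def write(self, node, page):
        # On a write access, the writing CPU becomes the owner: the page
        # is transferred to this node's local memory and invalidated at
        # its previous position.
        previous = self.owner.get(page)
        self.owner[page] = node
        return previous  # node where the page was just invalidated, if any

mem = ComaMemory()
mem.write(node=0, page=42)        # node 0 becomes owner of page 42
old = mem.write(node=2, page=42)  # page migrates to node 2, invalidated at node 0
print(old)
```

The essential point the sketch captures is that the page has exactly one home at a time, and that home follows the most recent writer.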

NUMA architectures are the next step in increasing the scalability of SMP architectures.

Cache coherent NUMA (ccNUMA)

Almost all computer architectures use a small amount of very fast memory, referred to as a cache, in order to exploit locality properties of memory accesses. When using NUMA, maintaining cache coherency across the distributed memory creates additional overhead. As an example, imagine a processor pulling data from another processor's memory, doing calculations, and writing the results to its local cache. The cache of the processor from which the data originates (and perhaps also other caches in the system) must then be synchronized.

Non-cache-coherent NUMA systems are easier to develop and build, but difficult to program with the standard von Neumann programming model. All NUMA systems currently in use therefore have special hardware to ensure cache coherence and are accordingly referred to as cache-coherent NUMA (ccNUMA).

This is mostly achieved through inter-processor communication between the cache controllers, which ensures coherent memory contents if the same memory location is stored in more than one cache. ccNUMA suffers from poor performance when several processors want to access the same memory location in quick succession. An operating system with NUMA support therefore tries to minimize the frequency of such accesses by allocating processors and memory in a NUMA-friendly way: threads that belong together are always assigned to the CPU cores of the same processor; when they request memory, they preferably receive memory from that processor.
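
The NUMA-friendly placement policy described above can be sketched as a toy scheduler. The class and its names are hypothetical; real operating systems use mechanisms such as CPU affinity and local ("first-touch") memory allocation, which this only models:

```python
# Toy sketch of NUMA-friendly placement: threads of the same group are
# kept on one node, and their memory requests are served from that node,
# so later accesses are local. (Hypothetical model, not a real scheduler.)
class NumaScheduler:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.group_node = {}  # thread group -> assigned NUMA node
        self.next_node = 0    # round-robin counter for new groups

    def place_thread(self, group):
        # All threads of a group land on the same node; new groups are
        # spread round-robin across the available nodes.
        if group not in self.group_node:
            self.group_node[group] = self.next_node
            self.next_node = (self.next_node + 1) % self.num_nodes
        return self.group_node[group]

    def allocate(self, group):
        # Memory for a group is taken from the node its threads run on.
        return self.place_thread(group)

sched = NumaScheduler(num_nodes=2)
print(sched.place_thread("web"))  # first group goes to node 0
print(sched.place_thread("web"))  # same group stays on node 0
print(sched.place_thread("db"))   # new group goes to node 1
print(sched.allocate("web"))      # memory for "web" comes from node 0
```

Keeping cooperating threads and their memory on one node minimizes the remote accesses that ccNUMA handles slowly.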

Current implementations of ccNUMA systems are, for example, AMD multiprocessor systems based on the Opteron and SGI systems with NUMAlink. Earlier ccNUMA systems were based on the Alpha EV7 processor from Digital Equipment Corporation (DEC) or on the MIPS R1x000 processors, such as those in the SGI Origin series.

NUMA vs. cluster computing

In contrast to NUMA computers, the nodes of a cluster each have their own address space. Furthermore, the communication latencies between cluster nodes are significantly higher than those between NUMA-coupled processors.

By making appropriate adjustments to the operating system's virtual-memory paging, it is possible to implement cluster-wide address spaces and thus "NUMA in software". However, since the high latencies persist, this is rarely useful.

The original article in the English Wikipedia contains material from FOLDOC, which is used there and here under a public-domain license.