ARM architecture family: Difference between revisions

No edit summary
clarifications; conciseness; expand terseness; curtail hyphenation fest
Line 17: Line 17:
This work would eventually turn into the '''ARM6'''. The first models were released in 1991, and Apple used the ARM6-based ARM 610 as the basis for their [[Apple Newton]] [[Personal Digital Assistant|PDA]]. In 1994, Acorn used the ARM 610 as the main [[central processing unit|CPU]] in their [[RiscPC]] computers.
This work would eventually turn into the '''ARM6'''. The first models were released in 1991, and Apple used the ARM6-based ARM 610 as the basis for their [[Apple Newton]] [[Personal Digital Assistant|PDA]]. In 1994, Acorn used the ARM 610 as the main [[central processing unit|CPU]] in their [[RiscPC]] computers.


The core has remained largely the same size throughout these changes. ARM2 had 30,000 transistors, while the ARM6 grew to only 35,000. The idea is that the end-user combines the ARM core with a number of optional parts to produce a complete CPU, one that can be built on old [[Fab (semiconductors)|semiconductor fabs]] and still deliver lots of performance at a low cost.
The core has remained largely the same size throughout these changes. ARM2 had 30,000 transistors, while the ARM6 grew to only 35,000. The idea is that the [[Original Design Manufacturer]] combines the ARM core with a number of optional parts to produce a complete CPU, one that can be built on old [[Fab (semiconductors)|semiconductor fabs]] and still deliver lots of performance at a low cost.


The most successful implementation has been the [[ARM7TDMI]] with hundreds of millions sold in mobile phones, handheld video game systems, and Sega [[Dreamcast]]s. While ARM's business has always been to sell [[IP core]]s, some of the licensees generated [[microcontrollers]] based on this core.
The most successful implementation has been the [[ARM7TDMI]] with hundreds of millions sold in mobile phones, handheld video game systems, and [[Sega Dreamcast]]s. While ARM's business has always been to sell [[IP core]]s, some of the licensees generated [[microcontrollers]] based on this core.


The Dreamcast features a SH4 processor which only borrows concepts from ARM (low power consumption, optional compact instruction set etc.), but is otherwise different from an ARM. The Dreamcast also features a sound chip designed by Yamaha with an ARM7 core. Nintendo's Gameboy Advance, however, uses the ARM7TDMI at 16.78MHz.
The Dreamcast features a SH4 processor which only borrows concepts from ARM (low power consumption, optional compact instruction set etc.), but is otherwise different from an ARM. The Dreamcast also features a sound chip designed by Yamaha with an ARM7 core. Nintendo's Gameboy Advance, however, uses the ARM7TDMI at 16.78MHz.


[[Digital Equipment Corporation|DEC]] licensed the architecture (which caused some confusion because they also produced the [[DEC Alpha]]) and produced the '''[[StrongARM]]'''. At 233 MHz this CPU drew only 1 [[watt]] of power (more recent versions draw far less). This work was later passed to [[Intel]] as a part of a lawsuit settlement, and Intel took the opportunity to supplement their aging [[Intel i960|i960]] line with the StrongARM. Intel have since developed its own high performance implementation known by the name '''[[Intel XScale|XScale]]'''.
[[Digital Equipment Corporation|DEC]] licensed the architecture (which caused some confusion because they also produced the [[DEC Alpha]]) and produced the '''[[StrongARM]]'''. At 233 MHz this CPU drew only 1 [[watt]] of power (more recent versions draw far less). This work was later passed to [[Intel]] as a part of a lawsuit settlement, and Intel took the opportunity to supplement their aging [[Intel i960|i960]] line with the StrongARM. Intel later developed its own high performance implementation known as '''[[Intel XScale|XScale]]''' which it has since sold.


== The cores ==
== The cores ==
Line 346: Line 346:
* Simple, but fast, 2-priority-level [[interrupt]] subsystem with switched register banks
* Simple, but fast, 2-priority-level [[interrupt]] subsystem with switched register banks


An interesting addition to the ARM design is the use of a 4-bit ''condition code'' on the front of every instruction, meaning that execution of every instruction can be made a conditional.
An interesting addition to the ARM design is the use of a 4-bit ''condition code'' on the front of every instruction, meaning that execution of every instruction is optionally conditional.


This cuts down significantly on the space available for, for example, displacements in memory access instructions, but on the other hand it does make it possible to avoid branch instructions when generating code for small if statements. The standard example of this is [[Euclid]]'s [[greatest common divisor|GCD]] algorithm:
This cuts down significantly on the encoding bits available for displacements in memory access instructions, but on the other hand it avoids branch instructions when generating code for small <code>if</code> statements. The standard example of this is [[Euclid]]'s [[greatest common divisor|GCD]] algorithm:


In the [[C programming language]], the loop is:
In the [[C programming language]], the loop is:
Line 371: Line 371:
BNE loop ; if "NE", then loop
BNE loop ; if "NE", then loop


which avoids the branches around the then and else clause that one would typically have to emit.
which avoids the branches around the <code>then</code> and <code>else</code> clauses.


Another unique feature of the instruction set is the ability to fold shifts and rotates into the "data processing" (arithmetic, logical, and register-register move) instructions, so that, for example, the C statement
Another unique feature of the instruction set is the ability to fold shifts and rotates into the "data processing" (arithmetic, logical, and register-register move) instructions, so that, for example, the C statement


<pre>a += (j << 2); </pre>
:<code>a += (j << 2); </code>


could be rendered as a single word, single cycle instruction
could be rendered as a single word, single cycle instruction on the ARM.


<pre> ADD Ra, Ra, Rj, LSL #2 </pre>
:<code> ADD Ra, Ra, Rj, LSL #2 </code>


This results in the typical ARM program being denser than expected with less memory access so the pipeline is used more efficiently. Even though the ARM runs at what many would consider to be low speeds, it nevertheless competes quite well with much more complex CPU designs.
on the ARM, register allocation permitting.

This results in the typical ARM program being denser than what would normally be expected of a RISC processor. This implies that there is less need for memory access and that the pipeline is being used more efficiently. Even though the ARM runs at what many would consider to be low speeds, it nevertheless competes quite well with much more complex CPU designs.


The ARM processor also has some features rarely seen on other architectures that are considered RISC, such as PC-relative addressing (indeed, on the ARM the PC is one of its 16 registers) and pre- and post-increment addressing modes.
The ARM processor also has some features rarely seen on other architectures that are considered RISC, such as PC-relative addressing (indeed, on the ARM the PC is one of its 16 registers) and pre- and post-increment addressing modes.
Line 392: Line 390:


=== Thumb ===
=== Thumb ===
Perhaps in part because of the conditional execution facility using up four bits of every instruction, newer ARM processors have a 16-bit instruction mode, called '''Thumb'''. The smaller opcodes have less functionality; for example, only branches can be conditional, and many opcodes cannot access all of the CPU's registers. However, the shorter opcodes give improved code density overall, even though some operations will require more opcodes to be executed. Particularly in situations where the memory port or bus width is constrained to less than 32 bits, the shorter Thumb opcodes allows greater performance than with 32-bit code because of the more efficient use of the limited memory bandwidth. Typically in embedded applications a small range of addresses have a 32-bit datapath and the rest are 16 bits wide or narrower (e.g. the [[Game Boy Advance]]); in this situation, it usually makes sense to compile Thumb code and hand-optimise a few of the most CPU-intensive sections using the 32-bit instruction set, placing them in the limited 32-bit bus width memory.
Newer ARM processors have a 16-bit instruction mode, called '''Thumb''', perhaps related to the conditional execution facility using four bits of every instruction. In Thumb, the smaller opcodes have less functionality. For example, only branches can be conditional, and many opcodes cannot access all of the CPU's registers. However, the shorter opcodes give improved code density overall, even though some operations require more instructions. Particularly in situations where the memory port or bus width is constrained to less than 32 bits, the shorter Thumb opcodes allows greater performance than with 32-bit code because of the more efficient use of the limited memory bandwidth. Typically embedded hardware has a small range of addresses of 32-bit datapath and the rest are 16 bits or narrower (e.g. the [[Game Boy Advance]]). In this situation, it usually makes sense to compile Thumb code and hand-optimise a few of the most CPU-intensive sections using the (non-Thumb) 32-bit instruction set, placing them in the limited 32-bit bus width memory.


The first processor with Thumb technology was the ARM7TDMI. All ARM9 and later families, including [[Intel XScale|XScale]] have included Thumb technology.
The first processor with Thumb technology was the ARM7TDMI. All ARM9 and later families, including [[Intel XScale|XScale]] have included Thumb technology.


=== Jazelle ===
=== Jazelle ===
ARM has implemented a [http://www.arm.com/products/solutions/Jazelle.html technology] that allows certain of their architectures to execute [[Java bytecode]] natively in hardware, in another execution mode alongside the existing ARM and Thumb modes and accessed in a similar fashion to ARM/Thumb interworking.
ARM has implemented a [http://www.arm.com/products/solutions/Jazelle.html technology] that allows certain of their architectures to execute [[Java bytecode]] natively in hardware, as another execution mode. It interoperates alongside the existing ARM and Thumb modes.


The first processor with Jazelle technology was the '''ARM926EJ-S''': Jazelle being denoted by the 'J' in the CPU name. It has been used by mobile phone manufacturers to speed up execution of [[Java ME]] games and applications, which is probably what drove development of the technology.
The first processor with Jazelle technology was the '''ARM926EJ-S''': Jazelle being denoted by the 'J' in the CPU name. It is used by mobile phone manufacturers to speed up execution of [[Java ME]] games and applications, which is probably what drove development of the technology.


=== Thumb-2 ===
=== Thumb-2 ===
Line 412: Line 410:


=== NEON ===
=== NEON ===
'''NEON''' technology is a combined 64 and 128bit [[SIMD]] (Single Instruction Multiple Data) instruction set that provides standardized acceleration for media and signal processing applications. NEON can execute MP3 audio decoder on CPU running at 10 MHz and can run the GSM AMR (Adaptive Multi-Rate) speech codec using CPU running at no more than 13 MHz. It features a comprehensive instruction set, separate register files and independent execution hardware. NEON supports 8-, 16-, 32- and 64-bit integer and single precision floating-point data and operates in [[SIMD]] operations for handling audio/video processing as well as graphics and gaming processing. SIMD is a crucial element in [[vector processor|vector supercomputers]] which feature simultaneous multiple operations. In NEON, the SIMD supports up to 16 operations at the same time.
'''NEON''' technology is a combined 64 and 128 bit [[SIMD]] (Single Instruction Multiple Data) instruction set that provides standardized acceleration for media and signal processing applications. NEON can execute MP3 audio decoding on CPUs running at 10 MHz and can run the [[GSM]] AMR (Adaptive Multi-Rate) speech [[codec]] at no more than 13 MHz. It features a comprehensive instruction set, separate register files and independent execution hardware. NEON supports 8-, 16-, 32- and 64-bit integer and single precision floating-point data and operates in [[SIMD]] operations for handling audio/video processing as well as graphics and gaming processing. SIMD is a crucial element in [[vector processor|vector supercomputers]] which feature simultaneous multiple operations. In NEON, the SIMD supports up to 16 operations at the same time.


=== VFP ===
=== VFP ===
'''VFP''' technology is a coprocessor extension to the ARM architecture. It provides low-cost single-precision and double-precision floating-point computation that is fully compliant with the ''[[IEEE 754|ANSI/IEEE Std 754-1985 Standard for Binary Floating-Point Arithmetic]]''. VFP provides floating-point computation suitable for a wide spectrum of applications such as PDA, smartphones, voice compression and decompression, three-dimensional graphics and digital audio, printers, set-top boxes, and automotive applications. The VFP architecture also supports execution of short vector instructions allowing SIMD (Single Instruction Multiple Data) parallelism. This is useful in graphics and signal-processing applications by reducing code size and increasing throughput.
'''VFP''' technology is a coprocessor extension to the ARM architecture. It provides low-cost single-precision and double-precision floating-point computation fully compliant with the ''[[IEEE 754|ANSI/IEEE Std 754-1985 Standard for Binary Floating-Point Arithmetic]]''. VFP provides floating-point computation suitable for a wide spectrum of applications such as PDAs, smartphones, voice compression and decompression, three-dimensional graphics and digital audio, printers, set-top boxes, and automotive applications. The VFP architecture also supports execution of short vector instructions allowing SIMD (Single Instruction Multiple Data) parallelism. This is useful in graphics and signal-processing applications by reducing code size and increasing throughput.


== ARM licensees ==
== ARM licensees ==
ARM Ltd does not manufacture and sell CPU devices based on their own designs, but rather, licenses the processor architecture to interested parties. ARM offers a variety of licensing terms, broken down by cost and deliverables. To all licensees, ARM provides an integratable hardware-description of the ARM core, as well as complete set of software development toolset (compiler, debugger, SDK), and the right to sell manufactured-silicon containing the ARM CPU. Fabless licensees, who wish to integrate an ARM core into their own chip design, are usually only interested in acquiring a ready-to-manufacture, pre-verified [[IP-core]]. For these customers, ARM delivers a gate-netlist description of the chosen ARM core, along with an abstracted simulation-model and test programs to aid design integration and verification. More ambitious customers, including integrated device manufacturers (IDM) and foundry operators, chose to acquire the processor IP in synthesizable RTL ([[Verilog]]) form. With the synthesizable RTL, the customer has the ability to perform architectural level optimizations and extensions. These allow the designer to achieve exotic design goals not otherwise possible with an unmodified netlist (high clock speed, very low power-consumption, instruction-set extensions, etc.) While ARM does not grant the licensee the right to re-sell the ARM-architecture itself, licensees may freely sell manufactured product (chip devices, evaluation boards, complete systems, etc.) Merchant foundries can be a special case; not only are they allowed to sell finished silicon containing ARM-cores, they generally hold the right to re-manufacture ARM-cores for other customers.
ARM Ltd does not manufacture and sell CPU devices based on their own designs, but rather, licenses the processor architecture to interested parties. ARM offers a variety of licensing terms, varying in cost and deliverables. To all licensees, ARM provides an integratable hardware description of the ARM core, as well as complete software development toolset (compiler, debugger, SDK), and the right to sell manufactured silicon containing the ARM CPU. Fabless licensees, who wish to integrate an ARM core into their own chip design, are usually only interested in acquiring a ready-to-manufacture verified [[IP core]]. For these customers, ARM delivers a gate netlist description of the chosen ARM core, along with an abstracted simulation model and test programs to aid design integration and verification. More ambitious customers, including integrated device manufacturers (IDM) and foundry operators, choose to acquire the processor IP in synthesizable RTL ([[Verilog]]) form. With the synthesizable RTL, the customer has the ability to perform architectural level optimizations and extensions. This allows the designer to achieve exotic design goals not otherwise possible with an unmodified netlist (high clock speed, very low power consumption, instruction set extensions, etc.) While ARM does not grant the licensee the right to resell the ARM architecture itself, licensees may freely sell manufactured product (chip devices, evaluation boards, complete systems, etc.) Merchant foundries can be a special case; not only are they allowed to sell finished silicon containing ARM cores, they generally hold the right to remanufacture ARM cores for other customers.


Like most IP vendors, ARM prices its IP based on perceived value. In architectural terms, the lower-performance ARM cores command a lower license cost than the higher-performance cores. In terms of silicon implementation, a synthesizable core is more expensive than a hard-macro (black-box) core. Complicating price matters, merchant foundries who hold an ARM license (such as Samsung and Fujitsu) can offer reduced licensing costs to its fab customers. In exchange for acquiring the ARM core through the foundry's in-house design-services, the customer can reduce or eliminate payment of ARM's upfront license fee. Compared to dedicated semicounductor foundries (such as TSMC and UMC) without in-house design-services, Fujitsu/Samsung charge 2-3x more per manufactured wafer. For low-mid volume applications, a design-service foundry offers lower overall pricing (through subsidization of the license-fee.) For high volume mass-produced parts, the long-term cost-reduction achievable through lower wafer-pricing reduces the impact of ARM's NRE cost, making the dedicated foundry a better choice.
Like most IP vendors, ARM prices its IP based on perceived value. In architectural terms, the lower performance ARM cores command a lower license cost than the higher performance cores. In terms of silicon implementation, a synthesizable core is more expensive than a hard macro (blackbox) core. Complicating price matters, merchant foundries who hold an ARM license (such as Samsung and Fujitsu) can offer reduced licensing costs to their fab customers. In exchange for acquiring the ARM core through the foundry's in-house design services, the customer can reduce or eliminate payment of ARM's upfront license fee. Compared to dedicated semiconductor foundries (such as TSMC and UMC) without in-house design services, Fujitsu/Samsung charge 2 to 3 times more per manufactured wafer. For low to mid volume applications, a design service foundry offers lower overall pricing (through subsidization of the license fee). For high volume mass produced parts, the long term cost reduction achievable through lower wafer pricing reduces the impact of ARM's NRE cost, making the dedicated foundry a better choice.


Many hightech semiconductor firms hold ARM licenses: [[Atmel]], [[Broadcom]], [[Cirrus Logic]], [[Freescale]] (spun off from [[Motorola]] in 2004), [[Fujitsu]], [[Intel]] (through its settlement with [[DEC]]), [[International Business Machines|IBM]], [[Infineon Technologies]], [[Nintendo]], [[Oki Electric Industry|OKI]], [[Philips]], [[Samsung Electronics|Samsung]], [[Sharp Corporation|Sharp]], [[STMicroelectronics]], [[Texas Instruments]] and [[VLSI Technology|VLSI]] are some of the many companies who have licensed the ARM in one form or another. Although ARM's license terms are covered by [[Non-disclosure agreement|NDA]], within the IP industry, ARM is widely known to be among the most expensive CPU cores. A single customer product containing a basic ARM-core can incur a one-time license fee in excess of (USD) $200,000. Where significant quantity and architectural modification are involved, the license fee can exceed $10M.
Many high-tech semiconductor firms hold ARM licenses. [[Atmel]], [[Broadcom]], [[Cirrus Logic]], [[Freescale]] (spun off from [[Motorola]] in 2004), [[Fujitsu]], [[Intel]] (through its settlement with [[DEC]]), [[International Business Machines|IBM]], [[Infineon Technologies]], [[Nintendo]], [[Oki Electric Industry|OKI]], [[Philips]], [[Samsung Electronics|Samsung]], [[Sharp Corporation|Sharp]], [[STMicroelectronics]], [[Texas Instruments]] and [[VLSI Technology|VLSI]] are some of the many companies that have licensed the ARM in one form or another. Although ARM's license terms are covered by [[Non-disclosure agreement|NDA]], within the IP industry, ARM is widely known to be among the most expensive CPU cores. A single customer product containing a basic ARM core can incur a one-time license fee in excess of (USD) $200,000. Where significant quantity and architectural modification are involved, the license fee can exceed $10M.


==See also==
==See also==

Revision as of 16:37, 12 August 2006

The ARM architecture (originally the Acorn RISC Machine) is a 32-bit RISC processor architecture that is widely used in a number of embedded designs. Due to their power saving features, ARM CPUs are dominant in the mobile electronics market, where low power consumption is a critical design goal.

Today, the ARM family accounts for over 75% of all 32-bit embedded CPUs, making it one of the most prolific 32-bit architectures in the world. ARM CPUs are found in all corners of consumer electronics, from portable devices (PDAs, mobile phones, media players, handheld gaming units, and calculators) to computer peripherals (hard drives, desktop routers). The most noticeable branch in this family nowadays is Intel's XScale.

Some other ARM architecture (ARM's headquarters in Cambridge UK)

History

A Conexant ARM processor used mainly in routers

The ARM design was started in 1983 as a development project at Acorn Computers Ltd.

The team, led by Roger Wilson and Steve Furber, started development of what in some ways resembles an advanced MOS Technology 6502. Acorn had a long line of computers based on the 6502, so a chip that could be programmed in a similar way would be a significant advantage for the company.

The team completed development samples called ARM1 by 1985, and the first "real" production systems as ARM2 the following year. The ARM2 featured a 32-bit data bus, a 26-bit address space giving a 64 Mbyte address range, and sixteen 32-bit registers. One of these registers served as the (word-aligned) program counter, with its top 6 bits and lowest 2 bits holding the processor status flags. The ARM2 was possibly the simplest useful 32-bit microprocessor in the world, with only 30,000 transistors (compare with Motorola's 68000, introduced in 1979, with around 68,000). Much of this simplicity comes from not having microcode (which represents about one quarter to one third of the 68000) and, like most CPUs of the day, not including any cache. This simplicity led to its low power usage, while it still performed better than the Intel 80286. A successor, ARM3, was produced with a 4 KB cache, which further improved performance.
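
The sharing of one register between the program counter and the status bits can be sketched in C. The exact bit positions used below (six status bits in bits 31 to 26, the word address in bits 25 to 2, two further status bits in bits 1 to 0) and all of the names are assumptions added for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; layout assumed as described in the lead-in. */
#define R15_PC_MASK     0x03FFFFFCu  /* bits 25..2: word-aligned address */
#define R15_FLAGS_SHIFT 26           /* bits 31..26: upper status bits   */
#define R15_MODE_MASK   0x00000003u  /* bits 1..0: lower status bits     */

/* Combine an address and status bits into one 32-bit register value. */
uint32_t r15_pack(uint32_t pc, uint32_t flags, uint32_t mode)
{
    assert((pc & ~R15_PC_MASK) == 0);   /* word aligned and below 64 Mbyte */
    assert(flags < 64u && mode < 4u);   /* 6 bits and 2 bits respectively  */
    return (flags << R15_FLAGS_SHIFT) | pc | mode;
}

uint32_t r15_pc(uint32_t r15)    { return r15 & R15_PC_MASK; }
uint32_t r15_flags(uint32_t r15) { return r15 >> R15_FLAGS_SHIFT; }
uint32_t r15_mode(uint32_t r15)  { return r15 & R15_MODE_MASK; }
```

Because only 24 bits carry a word address, the reachable range is 2^26 bytes, which is the 64 Mbyte figure quoted above.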

In the late 1980s Apple Computer started working with Acorn on newer versions of the ARM core. The work was so important that Acorn spun off the design team in 1990 into a new company called Advanced RISC Machines. For this reason ARM is often expanded as Advanced RISC Machine instead of Acorn RISC Machine. Advanced RISC Machines became ARM Limited when the company floated on the London Stock Exchange and NASDAQ in 1998.

This work would eventually turn into the ARM6. The first models were released in 1991, and Apple used the ARM6-based ARM 610 as the basis for their Apple Newton PDA. In 1994, Acorn used the ARM 610 as the main CPU in their RiscPC computers.

The core has remained largely the same size throughout these changes. ARM2 had 30,000 transistors, while the ARM6 grew to only 35,000. The idea is that the Original Design Manufacturer combines the ARM core with a number of optional parts to produce a complete CPU, one that can be built on old semiconductor fabs and still deliver lots of performance at a low cost.

The most successful implementation has been the ARM7TDMI with hundreds of millions sold in mobile phones, handheld video game systems, and Sega Dreamcasts. While ARM's business has always been to sell IP cores, some of the licensees generated microcontrollers based on this core.

The Dreamcast's main processor is an SH4, which only borrows concepts from ARM (low power consumption, optional compact instruction set, etc.) and is otherwise different from an ARM. The Dreamcast also features a sound chip designed by Yamaha with an ARM7 core. Nintendo's Game Boy Advance, however, uses the ARM7TDMI at 16.78 MHz.

DEC licensed the architecture (which caused some confusion because they also produced the DEC Alpha) and produced the StrongARM. At 233 MHz this CPU drew only 1 watt of power (more recent versions draw far less). This work was later passed to Intel as a part of a lawsuit settlement, and Intel took the opportunity to supplement their aging i960 line with the StrongARM. Intel later developed its own high performance implementation known as XScale, which it has since sold.

The cores

Family | Core | Feature | Cache (I/D) / MMU | Typical MIPS @ MHz | Application
ARM7TDMI | ARM7TDMI(-S) | 3-stage pipeline | none | 15 MIPS @ 16.8 MHz | Game Boy Advance, Nintendo DS, iPod
ARM7TDMI | ARM710T | | MMU | 36 MIPS @ 40 MHz | Psion 5 series
ARM7TDMI | ARM720T | | 8KB unified, MMU | 60 MIPS @ 59.8 MHz |
ARM7TDMI | ARM740T | | MPU | |
ARM7TDMI | ARM7EJ-S | Jazelle DBX | none | |
ARM9TDMI | ARM9TDMI | 5-stage pipeline | none | |
ARM9TDMI | ARM920T | | 16KB/16KB, MMU | 200 MIPS @ 180 MHz | Armadillo, GP32, GP2X (first core), Tapwave Zodiac (Motorola i.MX1)
ARM9TDMI | ARM922T | | 8KB/8KB, MMU | |
ARM9TDMI | ARM940T | | 4KB/4KB, MPU | | GP2X (second core)
ARM9E | ARM946E-S | | variable, tightly coupled memories, MPU | | Nintendo DS, Nokia N-Gage, Conexant 802.11 chips
ARM9E | ARM966E-S | | no cache, TCMs | | ST Micro STR91xF, includes Ethernet [1]
ARM9E | ARM968E-S | | no cache, TCMs | |
ARM9E | ARM926EJ-S | Jazelle DBX | variable, TCMs, MMU | 220 MIPS @ 200 MHz | Mobile phones: Sony Ericsson (K, W series), Siemens and Benq (x65 series and newer)
ARM9E | ARM996HS | Clockless processor | no caches, TCMs, MPU | |
ARM10E | ARM1020E | (VFP) | 32KB/32KB, MMU | |
ARM10E | ARM1022E | (VFP) | 16KB/16KB, MMU | |
ARM10E | ARM1026EJ-S | Jazelle DBX | variable, MMU or MPU | |
ARM11 | ARM1136J(F)-S | SIMD, Jazelle DBX, (VFP) | variable, MMU | |
ARM11 | ARM1156T2(F)-S | SIMD, Thumb-2, (VFP) | variable, MPU | |
ARM11 | ARM1176JZ(F)-S | SIMD, Jazelle DBX, (VFP) | variable, MMU+TrustZone | |
ARM11 | ARM11 MPCore | 1-4 core SMP, SIMD, Jazelle DBX, (VFP) | variable, MMU | |
Cortex | Cortex-A8 | Application profile, NEON, Jazelle RCT, Thumb-2 | variable (L1+L2), MMU+TrustZone | up to 2000 (2.0 DMIPS/MHz in speed from 600 MHz to greater than 1 GHz) | Texas Instruments OMAP3
Cortex | Cortex-R4 | Embedded profile | variable cache, MMU optional | 600 DMIPS | Broadcom is a user
Cortex | Cortex-M3 | Microcontroller profile | no cache, (MPU) | 120 DMIPS @ 100 MHz | Luminary Micro[2] microcontroller family
XScale | 80200/IOP310/IOP315 | I/O Processor | | |
XScale | 80219 | | | |
XScale | IOP321 | | | | Iyonix
XScale | IOP33x | | | |
XScale | PXA210/PXA250 | Applications processor | | | Zaurus SL-5600
XScale | PXA255 | | 32KB/32KB, MMU | 400 BogoMips @ 400 MHz | Gumstix
XScale | PXA26x | | | |
XScale | PXA27x | | | 800 MIPS @ 624 MHz | HTC Universal, Zaurus SL-C1000
XScale | PXA800(E)F | | | |
XScale | Monahans | | | 1000 MIPS @ 1.25 GHz |
XScale | PXA900 | | | | Blackberry 8700
XScale | IXC1100 | Control Plane Processor | | |
XScale | IXP2400/IXP2800 | | | |
XScale | IXP2850 | | | |
XScale | IXP2325/IXP2350 | | | |
XScale | IXP42x | | | | NSLU2
XScale | IXP460/IXP465 | | | |

Design notes

To keep the design lean, simple and fast, it was hardwired without microcode, like the much simpler 8-bit 6502 processor used in prior Acorn microcomputers.

The ARM architecture includes the following RISC features:

  • Load/store architecture
  • No support for misaligned memory accesses (now supported in ARMv6 cores)
  • Orthogonal instruction set
  • Large 16 × 32-bit register file
  • Fixed opcode width of 32 bits to ease decoding and pipelining, at the cost of decreased code density
  • Mostly single-cycle execution

To compensate for the simpler design, compared to contemporary processors like the Intel 80286 and Motorola 68020, some unique design features were used:

  • Conditional execution of most instructions, reducing branch overhead and compensating for the lack of a branch predictor
  • Arithmetic instructions only alter condition codes when desired
  • 32-bit barrel shifter which can be used without performance penalty with most arithmetic instructions and address calculations
  • Powerful indexed addressing modes
  • Simple, but fast, 2-priority-level interrupt subsystem with switched register banks

An interesting addition to the ARM design is the use of a 4-bit condition code on the front of every instruction, meaning that execution of every instruction is optionally conditional.

This cuts down significantly on the encoding bits available for displacements in memory access instructions, but on the other hand it avoids branch instructions when generating code for small if statements. The standard example of this is Euclid's GCD algorithm:

In the C programming language, the loop is:

int gcd (int i, int j) 
{
   while (i != j) 
      if (i > j) 
          i -= j;
      else 
          j -= i;
   return i;
} 

In ARM assembly, the loop is:

loop   CMP    Ri, Rj       ; set condition "NE" if (i != j)
                           ;               "GT" if (i > j), 
                           ;           or  "LT" if (i < j)           
       SUBGT  Ri, Ri, Rj   ; if "GT", i = i-j;  
       SUBLT  Rj, Rj, Ri   ; if "LT", j = j-i; 
       BNE    loop         ; if "NE", then loop

which avoids the branches around the then and else clauses.
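
To make the role of the condition codes concrete, the following C sketch mimics the assembly loop above: one comparison records which conditions hold, and each following step takes effect only if its condition is set. The struct, the function names, and the restriction to positive inputs are assumptions for illustration; real hardware derives GT, LT and NE from its status flags rather than storing them as separate booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the result of CMP (not the real flag layout). */
struct cond { bool ne, gt, lt; };

struct cond cmp_flags(int a, int b)
{
    struct cond c = { a != b, a > b, a < b };
    return c;
}

int gcd_predicated(int ri, int rj)
{
    assert(ri > 0 && rj > 0);      /* subtraction-based GCD needs positive inputs */
    struct cond c;
    do {
        c = cmp_flags(ri, rj);     /* CMP   Ri, Rj      */
        if (c.gt) ri = ri - rj;    /* SUBGT Ri, Ri, Rj  */
        if (c.lt) rj = rj - ri;    /* SUBLT Rj, Rj, Ri  */
    } while (c.ne);                /* BNE   loop        */
    return ri;
}
```

The two subtractions consult the conditions recorded by the single comparison and do not update them, so at most one of them takes effect per pass. For example, `gcd_predicated(48, 18)` passes through the pairs (30, 18), (12, 18), (12, 6) and (6, 6) before returning 6.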

Another unique feature of the instruction set is the ability to fold shifts and rotates into the "data processing" (arithmetic, logical, and register-register move) instructions, so that, for example, the C statement

a += (j << 2);

could be rendered as a single-word, single-cycle instruction on the ARM:

ADD Ra, Ra, Rj, LSL #2

This makes the typical ARM program denser than expected, with fewer memory accesses, so the pipeline is used more efficiently. Even though the ARM runs at what many would consider to be low speeds, it nevertheless competes quite well with much more complex CPU designs.
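
A small C model of the shifted-operand form shows the work that the single ADD performs. The function names are hypothetical, and unsigned arithmetic is used here only to keep the shift well defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Models ADD Rd, Rn, Rm, LSL #shift: the second operand passes through
   the barrel shifter before it reaches the adder. */
uint32_t add_lsl(uint32_t rn, uint32_t rm, unsigned shift)
{
    assert(shift < 32);            /* shift amounts of 0..31 are modelled */
    return rn + (rm << shift);
}

/* a += (j << 2), for example advancing a byte offset by j 4-byte elements. */
uint32_t step_offset(uint32_t a, uint32_t j)
{
    return add_lsl(a, j, 2);
}
```

On a machine without this feature the same statement would typically need a separate shift instruction and a temporary register before the add.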

The ARM processor also has some features rarely seen on other architectures that are considered RISC, such as PC-relative addressing (indeed, on the ARM the PC is one of its 16 registers) and pre- and post-increment addressing modes.

Another item of note is that the ARM has been around for some time, and its instruction set has grown over the years. Some early ARM processors (prior to ARM7TDMI), for example, have no instruction to load a two-byte quantity, so that, strictly speaking, it is not possible for them to generate code that would behave the way one would expect for C objects of type "volatile short".
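
The consequence for two-byte objects can be sketched as follows: without a halfword load, one way to read a 16-bit value is to assemble it from two separate byte reads, so the object is touched twice rather than once. This is an illustration with an assumed function name and an assumed little-endian byte order, not actual compiler output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads a 16-bit little-endian value using byte loads only. The two reads
   are separate memory accesses, which conflicts with the usual expectation
   that reading a volatile short touches the object exactly once. */
uint16_t load_u16_bytewise(const volatile uint8_t *p)
{
    assert(p != NULL);
    uint16_t lo = p[0];            /* first access  */
    uint16_t hi = p[1];            /* second access */
    return (uint16_t)(lo | (hi << 8));
}
```

If the two bytes belong to a device register that changes between the accesses, the combined value may correspond to no state the device was ever in.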

The ARM7 and most earlier designs have a three stage pipeline; the stages being fetch, decode, and execute. Higher performance designs, such as the ARM9, have a five stage pipeline. Additional changes for higher performance include a faster adder, and more extensive branch prediction logic.

Thumb

Newer ARM processors have a 16-bit instruction mode called Thumb, perhaps related to the conditional execution facility, which uses four bits of every instruction. In Thumb, the smaller opcodes have less functionality. For example, only branches can be conditional, and many opcodes cannot access all of the CPU's registers. However, the shorter opcodes give improved code density overall, even though some operations require more instructions. Particularly where the memory port or bus width is constrained to less than 32 bits, the shorter Thumb opcodes allow greater performance than 32-bit code because they make more efficient use of the limited memory bandwidth. Embedded hardware typically has a small range of addresses with a 32-bit datapath, while the rest are 16 bits wide or narrower (e.g. the Game Boy Advance). In this situation, it usually makes sense to compile Thumb code and hand-optimise a few of the most CPU-intensive sections using the (non-Thumb) 32-bit instruction set, placing them in the limited 32-bit bus width memory.

The first processor with Thumb technology was the ARM7TDMI. All ARM9 and later families, including XScale, have included Thumb technology.

Jazelle

ARM has implemented a technology, Jazelle, that allows certain of its architectures to execute Java bytecode natively in hardware, as another execution mode alongside the existing ARM and Thumb modes.

The first processor with Jazelle technology was the ARM926EJ-S; Jazelle is denoted by the 'J' in the CPU name. Jazelle is used by mobile phone manufacturers to speed up execution of Java ME games and applications, which is probably what drove development of the technology.

Thumb-2

Thumb-2 technology made its debut in the ARM1156 core, announced in 2003. Thumb-2 extends the limited 16-bit instruction set of Thumb with additional 32-bit instructions to give the instruction set more breadth. The stated aim for Thumb-2 is to achieve code density similar to Thumb with performance similar to the ARM instruction set on 32-bit memory.

Thumb-2 also extends both the ARM and Thumb instruction sets with yet more instructions, including bit-field manipulation, table branches, and conditional execution.

Thumb-2EE

Thumb-2EE, marketed as Jazelle RCT, was announced in 2005, first appearing in the Cortex-A8 processor. Thumb-2EE provides a small extension to Thumb-2, making the instruction set particularly suited to code generated at runtime (e.g. by JIT compilation) in managed Execution Environments. Thumb-2EE is a target for languages such as Limbo, Java, C#, Perl and Python, and allows JIT compilers to output smaller compiled code without impacting performance.

New features provided by Thumb-2EE include automatic null pointer checks on every load and store instruction, an instruction to perform an array bounds check, and the ability to branch to handlers, which are small sections of frequently called code, commonly used to implement a feature of a high level language, such as allocating memory for a new object.
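The checks listed above are the ones a JIT compiler would otherwise have to emit as explicit instructions around each array access. The C sketch below spells them out in software so that the saving is visible. Everything in it is hypothetical: the `int_array` layout, the `access_status` codes, and the function name `checked_load` are invented for this example, and returning a status code stands in for the branch to a handler that the hardware would take.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome codes; in Thumb-2EE a failed check would instead
   transfer control to a handler. */
typedef enum { ACCESS_OK, NULL_POINTER, INDEX_OUT_OF_RANGE } access_status;

/* Hypothetical layout of a managed-language array object. */
typedef struct {
    size_t length;
    const int32_t *data;
} int_array;

access_status checked_load(const int_array *arr, size_t index, int32_t *out)
{
    /* Null pointer check: performed automatically on every load and
       store in Thumb-2EE, so no explicit instructions would be needed. */
    if (arr == NULL)
        return NULL_POINTER;

    /* Array bounds check: a single dedicated instruction in Thumb-2EE. */
    if (index >= arr->length)
        return INDEX_OUT_OF_RANGE;

    *out = arr->data[index];
    return ACCESS_OK;
}
```

With the checks moved into the instruction set, the generated code for each access shrinks to little more than the load itself, which is how smaller JIT output without a performance penalty becomes plausible.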

NEON

NEON technology is a combined 64- and 128-bit SIMD (Single Instruction, Multiple Data) instruction set that provides standardized acceleration for media and signal processing applications. NEON can execute MP3 audio decoding on CPUs running at 10 MHz and can run the GSM AMR (Adaptive Multi-Rate) speech codec at no more than 13 MHz. It features a comprehensive instruction set, separate register files and independent execution hardware. NEON supports 8-, 16-, 32- and 64-bit integer and single-precision floating-point data, and uses SIMD operations to handle audio and video processing as well as graphics and gaming workloads. SIMD, in which one instruction performs the same operation on multiple data elements simultaneously, is also a crucial element of vector supercomputers. In NEON, a single instruction can perform up to 16 operations at the same time.
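The figure of 16 simultaneous operations follows from the register width: a 128-bit register divided into 8-bit elements holds 128 / 8 = 16 lanes. The portable C sketch below models one such lane-wise operation with an ordinary loop. It is a model only, not NEON code: the type `vec128_u8` and the function `add_u8x16` are invented for this example, and on real hardware the sixteen additions would be carried out by one instruction rather than sixteen loop iterations.

```c
#include <stdint.h>

#define LANES 16 /* 128 bits / 8 bits per element */

/* Hypothetical model of a 128-bit register viewed as 16 unsigned bytes. */
typedef struct {
    uint8_t lane[LANES];
} vec128_u8;

/* Lane-wise addition: each lane is added independently of its neighbours,
   and each result wraps modulo 256 without carrying into the next lane. */
vec128_u8 add_u8x16(vec128_u8 a, vec128_u8 b)
{
    vec128_u8 r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = (uint8_t)(a.lane[i] + b.lane[i]);
    return r;
}
```

The same 128 bits could instead be treated as eight 16-bit, four 32-bit, or two 64-bit elements, giving fewer but wider operations per instruction.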

VFP

VFP technology is a coprocessor extension to the ARM architecture. It provides low-cost single-precision and double-precision floating-point computation fully compliant with the ANSI/IEEE Std 754-1985 Standard for Binary Floating-Point Arithmetic. VFP provides floating-point computation suitable for a wide spectrum of applications such as PDAs, smartphones, voice compression and decompression, three-dimensional graphics and digital audio, printers, set-top boxes, and automotive applications. The VFP architecture also supports execution of short vector instructions, allowing SIMD (Single Instruction, Multiple Data) parallelism. This benefits graphics and signal-processing applications by reducing code size and increasing throughput.

ARM licensees

ARM Ltd does not manufacture and sell CPU devices based on its own designs, but rather licenses the processor architecture to interested parties. ARM offers a variety of licensing terms, varying in cost and deliverables. To all licensees, ARM provides an integrable hardware description of the ARM core, a complete software development toolset (compiler, debugger, SDK), and the right to sell manufactured silicon containing the ARM CPU. Fabless licensees, who wish to integrate an ARM core into their own chip design, are usually only interested in acquiring a ready-to-manufacture verified IP core. For these customers, ARM delivers a gate netlist description of the chosen ARM core, along with an abstracted simulation model and test programs to aid design integration and verification. More ambitious customers, including integrated device manufacturers (IDMs) and foundry operators, choose to acquire the processor IP in synthesizable RTL (Verilog) form. With the synthesizable RTL, the customer can perform architectural-level optimizations and extensions. This allows the designer to achieve exotic design goals not otherwise possible with an unmodified netlist (high clock speed, very low power consumption, instruction set extensions, etc.).

While ARM does not grant the licensee the right to resell the ARM architecture itself, licensees may freely sell manufactured products (chip devices, evaluation boards, complete systems, etc.). Merchant foundries can be a special case: not only are they allowed to sell finished silicon containing ARM cores, they generally also hold the right to remanufacture ARM cores for other customers.

Like most IP vendors, ARM prices its IP based on perceived value. In architectural terms, the lower-performance ARM cores command a lower license cost than the higher-performance cores. In terms of silicon implementation, a synthesizable core is more expensive than a hard macro (blackbox) core. Complicating price matters, merchant foundries that hold an ARM license (such as Samsung and Fujitsu) can offer reduced licensing costs to their fab customers. In exchange for acquiring the ARM core through the foundry's in-house design services, the customer can reduce or eliminate payment of ARM's upfront license fee. Compared to dedicated semiconductor foundries without in-house design services (such as TSMC and UMC), Fujitsu and Samsung charge 2 to 3 times more per manufactured wafer. For low- to mid-volume applications, a design-service foundry offers lower overall pricing (through subsidization of the license fee). For high-volume, mass-produced parts, the long-term cost reduction achievable through lower wafer pricing reduces the impact of ARM's NRE cost, making the dedicated foundry a better choice.

Many high-tech semiconductor firms hold ARM licenses. Atmel, Broadcom, Cirrus Logic, Freescale (spun off from Motorola in 2004), Fujitsu, Intel (through its settlement with DEC), IBM, Infineon Technologies, Nintendo, OKI, Philips, Samsung, Sharp, STMicroelectronics, Texas Instruments and VLSI are some of the many companies that have licensed the ARM in one form or another. Although ARM's license terms are covered by NDA, ARM's cores are widely known within the IP industry to be among the most expensive CPU cores. A single customer product containing a basic ARM core can incur a one-time license fee in excess of US$200,000. Where significant quantity and architectural modification are involved, the license fee can exceed US$10 million.

See also

External links