Self-Monitoring, Analysis and Reporting Technology

from Wikipedia, the free encyclopedia

Self-Monitoring, Analysis and Reporting Technology (SMART, also written S.M.A.R.T.) is an industry standard for monitoring hard disk drives (HDDs) and solid-state drives (SSDs). It is used to predict a possible failure of the storage medium by evaluating the readings of various sensors against a set of parameters.

Overview

The monitored data is evaluated either at system start by the BIOS or other firmware, if configured accordingly, or by dedicated software that must be installed in addition to the operating system. Microsoft, for example, has provided a driver for this purpose since Windows 95B (OSR 2), which such software then uses.

The software relies on the limit values that the drive manufacturer sets for the individual parameters, such as temperature. After observing the drive over a longer period, it can then predict impending failures.

"Switching off" SMART, for example in the BIOS settings, does not stop data collection; it only suppresses the warnings issued when thresholds are exceeded. The collected data is stored in a reserved area of the drive that programs cannot modify.

Monitoring does not slow down the drive, because it only logs what happens and takes no corrective action. Corrective action, for example in response to vibration, is handled by mechanisms internal to the drive that predate SMART. Everything else, such as operating time and temperature, is recorded by dedicated sensors and chip functions. Parameters are divided into "online" parameters, which are recorded continuously, and those that are updated during idle periods when the drive is, so to speak, "offline".

Informative value

SMART covers only the mass storage devices it monitors, such as hard drives or SSDs, and says nothing about the overall reliability of the computer system. Data from several mass storage devices is not correlated. The system is also not fully standardized: manufacturers decide which parameters they monitor and within which limits. The accuracy of the monitoring is debated among users as well. Some temperature sensors, for example, are considered badly placed or calibrated too optimistically, so that the reported temperature can be well below room temperature.

An independent Google study, which ran for nine months and covered a total of 100,000 hard drives from all manufacturers, produced the following result in 2006: when all relevant parameters are taken into account, 64% of all failures can be predicted with SMART. Other warning signs, such as audible symptoms or data errors, were not considered. In the remaining third of failures, the drive itself wrongly reported that it had no problems.

The load on a hard drive had a far smaller effect on its lifetime than previously assumed. Once a drive has survived its first year, the share of idle time no longer matters until the drive is routinely replaced after four years. Only in the first year and after the fourth does continuous reading and writing double the failure rate.

History

In 1992, IBM recognized that as PCs spread through companies, so did the trust placed in them. Failures were increasingly becoming a financial problem, which IBM set out to address with PFA (Predictive Failure Analysis). IBM hard drives with this system informed the computer of parameter changes so that the user could replace the drive in time. A little later, Compaq introduced IntelliSafe, which filters out irrelevant changes and reports only threatening changes and thresholds to the running software. Seagate, Quantum and Conner were involved in the development and adapted it to their products; Compaq did not manufacture hard drives itself.
Recognizing the potential and with an industry standard in mind, Compaq and especially Seagate pushed for the system to be opened up. Together with Conner, Quantum, Western Digital and later IBM, the two approaches were merged under the name SMART.

Since 1996 and the introduction of the ATA-3 standard, or SCSI-3 four years earlier, it has been a standard feature of almost every hard drive.

The specification of the SMART parameters was removed before the ATA-3 standard was adopted (see Web links). As a result, neither the meaning of the stored values nor their scaling is specified (for the latter, see also Usual parameters). Only their location is officially standardized. Strictly speaking, even under the ATA-7 standard there is no defined way to read out the temperature of a drive, for example. Practically all available drives adhere to the data format from the ATA-3 draft. A read-out program attaches a name such as "Seek Error Rate" to each parameter ID to make it easier to understand. Over the years, a reliable de facto standard has emerged.

Because of how they work, solid-state drives (SSDs) no longer need many of the earlier test points but require different, new ones. However, there is currently no coordination between SSD controller manufacturers. As a result, some new parameter IDs were added, but in other cases existing IDs were simply given a new meaning. This leads to misinterpretation in any SMART program that does not yet know the meaning used by the new drives.

Most BIOS versions also include a brief evaluation of important SMART parameters, so warning messages about supposedly defective SSDs can appear when the computer is switched on. In this case, it is advisable to switch off the SMART self-test function in the BIOS and to run a manual test with an up-to-date program under the operating system (see SMART programs in comparison).

Variations by interface

The implementation of SMART differs depending on the interface through which the drive is connected. There are two standards: ATA and SCSI. Both provide a HEALTH STATUS, with which the drive firmware indicates whether it classifies the drive as "okay" or "problematic". Both standards also support reading out the temperature as well as several kinds of self-tests and logs.

In the case of ATA hard disks, numerous values and their limits can also be queried by running software. This lets the software or the user assess more precisely whether and why a failure is likely. However, these parameters are not precisely standardized and differ in scope and interpretation, even between models from one manufacturer.

The commands and data formats for all these functions are, however, implemented completely differently for ATA and SCSI.

USB ports fundamentally transport SCSI commands, yet hard disks connected via USB are almost without exception (S)ATA disks rather than SCSI disks. With the USB 3.0 interface, the USB Attached SCSI (UAS) protocol was introduced; it can also be used over USB 2.0 at reduced speed. In contrast to the technically simpler bulk transfer used by USB memory sticks, it allows ATA commands to be tunneled over the USB bus and thus enables SMART queries via USB. Chip manufacturers such as Cypress, JMicron or SunPlusIT use manufacturer-specific commands, which some programs support (see SMART programs in comparison). There are also USB-SATA bridges that support the manufacturer-independent SCSI/ATA Translation (SAT) standard.
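How a read-out program has to be told which translation a USB bridge uses can be illustrated with smartmontools, whose smartctl tool takes a device type via its -d option. The following is only a sketch under assumptions: the function name, the dictionary and the device path /dev/sdb are hypothetical, and the device type names (sat, usbcypress, usbjmicron, usbsunplus) are those documented for smartctl, which should be checked against the installed version. The code only builds the command line; actually running it requires smartmontools and usually administrator rights.

```python
from typing import List, Optional

# Assumed mapping from bridge family to smartctl device type (-d option).
BRIDGE_DEVICE_TYPES = {
    "sat": "sat",            # manufacturer-independent SCSI/ATA Translation
    "cypress": "usbcypress",
    "jmicron": "usbjmicron",
    "sunplus": "usbsunplus",
}


def smartctl_health_command(device: str, bridge: Optional[str] = None) -> List[str]:
    """Build a smartctl command line that queries the overall health status."""
    command = ["smartctl", "-H"]
    if bridge is not None:
        # Tell smartctl how ATA commands are passed through the USB bridge.
        command += ["-d", BRIDGE_DEVICE_TYPES[bridge]]
    command.append(device)
    return command


# Hypothetical device path; adjust to the system at hand.
print(" ".join(smartctl_health_command("/dev/sdb", "sat")))
```

The resulting list can be handed to subprocess.run if the health status should actually be queried.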

The FireWire connection, which is common on Apple computers in particular, supports the transmission natively, but Mac OS X does not make use of it.

Drives connected via eSATA, like their internal SATA counterparts, can be read without any problems.

Serial ATA disks connected via Serial Attached SCSI (SAS) can be checked if the corresponding SAT commands are available.

For tape drives, there are functions analogous to SMART called TapeAlert. They are used to warn of worn tapes.

Evaluation

Usual parameters

Each value is first stored as raw data. For easier interpretation, it is then mapped onto a scale from 0 to 100, 200 or 255. The different scales allow finer gradation where the manufacturer considers it useful. Starting at the scale maximum, the value approaches zero as errors occur or the drive ages. However, the critical limit (threshold) is often well above zero.
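The relationship between the normalized value and the threshold can be sketched as a small check. This is a minimal illustration rather than the logic of any particular tool: the function name and the status strings are made up for this example, and treating a threshold of 0 as "no limit set" is an assumption.

```python
def attribute_status(value: int, worst: int, threshold: int) -> str:
    """Classify one attribute from its normalized readings."""
    # Assumption: a threshold of 0 means the manufacturer set no limit.
    if threshold == 0:
        return "no limit"
    # The normalized value counts down towards the threshold.
    if value <= threshold:
        return "failing now"
    # The worst value remembers whether the limit was reached earlier.
    if worst <= threshold:
        return "failed in the past"
    return "ok"


print(attribute_status(100, 100, 5))
print(attribute_status(4, 4, 5))
```

The check deliberately ignores the raw value, since its meaning and scaling differ between manufacturers.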

The following table shows the individual parameters and the interpretation of their raw values (not to be confused with the values on the normalized scale):

Legend of the raw values
Critical to failure (A.): failure-relevant parameter. Where available, it allows possible failures to be forecast.
Informative (I.): parameter of little or no relevance for the failure forecast.
higher, better: the higher the raw value, the better.
lower, better: the lower the raw value, the better.

ID | Hex | Parameter name (English) | Parameter name (translated from German) | Category | Better

01 | 0x01 | (Raw) Read Error Rate | Read error rate (raw) | - | lower, better
  • Uncorrectable errors when reading from the hard disk, which lead to a re-read.
  • Indicates a problem with the platter surface.
  • Some drives report very high raw values here, which cannot be compared even between models from one manufacturer. On newer Seagate drives, the raw value is erroneously identical to that of Hardware ECC Recovered. Only the scale values are relevant to failure.

02 | 0x02 | Throughput Performance | Throughput | - | higher, better
  • General data throughput or efficiency of the hard disk.
  • A sharp drop indicates problems in the drive.

03 | 0x03 | Spin Up Time | Spin-up time | - | lower, better
  • Average spin-up time in (milli)seconds.
  • Indicates problems with the motor or the platter bearings.
  • Brand-new Maxtor and Quantum drives often raised false alarms in the first month.

04 | 0x04 | Start / Stop Count | Start/stop operations | Informative | lower, better
  • Number of start and stop operations of the drive (including standby).
  • Indicates wear, as starting and stopping stresses the drive the most.

05 | 0x05 | Reallocated Sectors Count | Reallocated sectors | Critical to failure | lower, better
  • Number of spare sectors used up.
  • Indicates surface problems, as only then is a sector in use automatically replaced by a spare sector.
  • If this raw counter is non-zero, the probability of failure is five times higher. Failure usually follows within six months of the first reallocation event.

07 | 0x07 | Seek Error Rate | Seek error rate | - | lower, better
  • Uncorrectable errors when reading from the hard disk, which lead to a re-read.
  • Indicates a positioning problem in the read/write unit.
  • For reasons the manufacturer has not explained, some brand-new Seagate drives report scale values well below 100 here.

09 | 0x09 | Power On Hours Count | Operating time | Informative | lower, better
  • Operating time in hours or seconds (including standby).
  • Indicates wear, but says nothing about the conditions of use during this time.
  • On some Maxtor models, e.g. the Maxtor DiamondMax 10 6L250S0, the unit is minutes.

10 | 0x0A | Spin Retry Count | Spin-up retries (only relevant for HDDs) | Critical to failure | lower, better
  • Number of attempts needed to spin the platters up to nominal speed. An increasing value indicates mechanical problems in the drive mechanism of the hard disk.

12 | 0x0C | Power Cycle Count | Number of power-ons | Informative | lower, better
  • Number of times the drive has been switched on and off.

184 | 0xB8 | End-to-End Error | End-to-end errors | Critical to failure | lower, better
  • Increasing values indicate parity errors between the storage medium and the drive controller.

187 | 0xBB | Reported Uncorrectable Error | Reported uncorrectable errors | Critical to failure | lower, better

188 | 0xBC | Command Timeout | Commands that could not be executed in time | Critical to failure | lower, better
  • Number of commands aborted due to timeout.

193 | 0xC1 | Load Cycle Count or Load / Unload Cycle Count | Parking operations | Informative | lower, better
  • The read/write unit is parked on the plastic ramp next to the platters.
  • Usually found only on notebook drives. Indicates wear; drives are designed for around 300,000 cycles, and the raw value shows the number so far.
  • The read/write unit is parked when the drive is switched off or after about 10 seconds of idling, which sometimes produces an irritating noise. If the notebook is dropped, the read/write unit then no longer strikes the magnetic disks, which triples the shock resistance to around 1000 g. Switching on and off is also gentler, as the unit is not lowered onto a special area of the platters (the "landing zone").

194 | 0xC2 | Drive Temperature | Hard drive temperature | - | lower, better
  • Temperature of the drive in °C.
  • Since some drives also store maximum and minimum values, earlier undercooling or overheating during operation can be detected. The raw value then contains all three numbers in sequence.
  • High temperatures (from 40 °C) only have an effect after three years; in that year they double the probability of failure, after which they lose their significance again. Averaged over all ages, temperatures below 25 °C are far more dangerous than those above 40 °C: the failure rate doubles at 20 °C and triples at 15 °C. Measurements went up to 52 °C. Some manufacturers use inaccurate or badly placed sensors.

195 | 0xC3 | Hardware ECC Recovered | Recovered bit errors | - | lower, better
  • Bit errors corrected while reading.
  • May indicate a problem with the platter surface.
  • The high data density of today's hard disks makes error correction during reading unavoidable [citation needed], so even very high values here are no cause for concern.
  • Samsung drives of the P80 series often erroneously report very low scale values here. Very high raw values are common in general; because of changes from one technology to a newer one ("technology change"), they cannot be compared even between models from the same manufacturer. They increase during read operations, since only then does error correction take place. Only the scale values are relevant to failure. The values are occasionally also referred to as "ECC on-the-fly".

196 | 0xC4 | Reallocation Event Count | - | Critical to failure | lower, better
  • Number of successful and unsuccessful sector reallocations performed so far as a result of read errors from defective sectors.

197 | 0xC5 | Current Pending Sector Count | - | Critical to failure | lower, better
  • Number of sectors waiting to be reallocated because of read errors.

198 | 0xC6 | Uncorrectable Sector Count | Uncorrectable sectors | Critical to failure | lower, better
  • Number of uncorrectable sector errors so far in write or read operations.

199 | 0xC7 | Ultra DMA CRC Error Count | DMA CRC errors | Informative | lower, better
  • Number of CRC errors that have occurred.
  • Possible causes are defective cables, dirty contacts, overclocking or faulty hard disk drivers. The transfer is retried at progressively lower speeds. If this fails, access to the hard disk is blocked.

201 | 0xC9 | Soft Read Error Rate | - | Critical to failure | lower, better
  • Number of read errors that cannot be corrected by software.

There are numerous other parameters, some of which are manufacturer-specific. Complete lists can be found in the literature section of the web links.
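Because the unit of the Power On Hours Count raw value (ID 09) depends on the drive model, a read-out program has to convert it before showing an operating time. A minimal sketch of that conversion follows; the function name and the unit labels are assumptions for this example, and which unit applies to a given drive has to be known or judged separately.

```python
# Seconds represented by one raw count, per assumed unit label.
SECONDS_PER_UNIT = {"hours": 3600, "minutes": 60, "seconds": 1}


def power_on_hours(raw_value: int, unit: str = "hours") -> float:
    """Convert a raw operating-time counter into hours."""
    return raw_value * SECONDS_PER_UNIT[unit] / 3600


# The same operating time expressed in the three possible units.
print(power_on_hours(1775))
print(power_on_hours(106500, "minutes"))
print(power_on_hours(6390000, "seconds"))
```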

Example

The evaluation of important SMART parameters is illustrated here using a Hitachi 250 GB hard drive, connected via Serial ATA and read out with smartmontools.

Parameter ID | Parameter name | Value | Worst | Threshold | Type | Updated | Raw value | Comment
2 | Throughput performance | 100 | 100 | 050 | Pre-fail | Offline | 0 |
3 | Spin Up Time | 118 | 118 | 024 | Pre-fail | Always | 294 | Hitachi uses its own counting method, not (milli)seconds.
4 | Start Stop Count | 100 | 100 | 000 | Old age | Always | 772 | The drive motor was switched on/off 772 times, including standby starts.
5 | Reallocated sector count | 100 | 100 | 005 | Pre-fail | Always | 55 | 55 sectors were replaced by spare sectors due to defects. The drive nevertheless still rates this as unproblematic (the value is still 100), perhaps wrongly.
7 | Seek error rate | 100 | 100 | 067 | Pre-fail | Always | 0 | So far there have been no read/write errors.
9 | Power On Hours | 100 | 100 | 000 | Old age | Always | 1775 | The drive has been powered for 1775 hours so far. This includes standby phases in which the platters were not spinning. If the evaluation program does not know the drive model, you have to judge for yourself whether the value represents hours, minutes or seconds.
10 | Spin retry count | 100 | 100 | 060 | Pre-fail | Always | 0 | So far there have been no failed starts; the drive always spun up without problems.
12 | Power cycle count | 100 | 100 | 000 | Old age | Always | 745 | So far, the PC containing this drive has been switched on and off 745 times.
194 | Temperature | 161 | 161 | 000 | Old age | Always | 34 + (10 · 2^16 + 49 · 2^32) | The current temperature here is 34 °C. The lifetime minimum and maximum of the drive so far were 10 °C and 49 °C. The value has therefore dropped from 200 to 161.
199 | UDMA CRC error count | 200 | 253 | 000 | Old age | Always | 730 | So far there have been 730 transfer errors to the mainboard. The cause is a faulty hard disk controller, a defective connection cable or a loose connection.
Value: a normalized reading that usually counts downwards (the lower, the worse).
Worst: the worst value recorded so far.
Threshold: the limit below which the value must not fall.
Type: the significance of the parameter. "Pre-fail" warns of an imminent failure, while "Old age" indicates general progressive aging (the current temperature does not necessarily fall into either category).
Updated: whether the value is updated continuously (Always) or only by a self-test of the type "Offline data collection".
Raw value: the actual measured value, e.g. the measured temperature or the number of errors.
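The temperature row of the example packs three readings into one raw value: the current temperature plus the minimum shifted by 16 bits and the maximum shifted by 32 bits. A short sketch of how such a value could be unpacked follows. The 16-bit field layout is inferred from this one example only, and the function name is made up; other drives may pack the value differently.

```python
from typing import Tuple


def decode_temperature_raw(raw_value: int) -> Tuple[int, int, int]:
    """Split a packed raw value into (current, minimum, maximum) in degrees C."""
    # Assumed layout: three consecutive 16-bit fields, lowest field first.
    current = raw_value & 0xFFFF
    minimum = (raw_value >> 16) & 0xFFFF
    maximum = (raw_value >> 32) & 0xFFFF
    return current, minimum, maximum


# Raw value from the example table: 34 + 10 * 2^16 + 49 * 2^32
raw = 34 + 10 * 2**16 + 49 * 2**32
print(decode_temperature_raw(raw))  # (34, 10, 49)
```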

Evaluation: By the drive's own assessment, this drive is completely fine; no threshold was even close to being reached. According to the Google study, only the 55 reallocated sectors are cause for concern, so this value should be watched. However, if the "UDMA CRC Error Count" stops increasing after the cable has been replaced, and cooling is improved so that approx. 45 °C is no longer exceeded, the drive can continue to be used without problems.
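The reasoning of this evaluation, in which the drive's own threshold check is complemented by a look at the raw counters of failure-critical parameters, can be sketched as follows. The set of critical IDs is taken from the parameter table above; the function name, the tuple layout and the message texts are assumptions for this illustration, and only a few rows of the example are included.

```python
# IDs marked "Critical to failure" in the parameter table above.
CRITICAL_IDS = {5, 10, 184, 187, 188, 196, 197, 198, 201}


def review(attributes):
    """Return findings for (id, name, value, threshold, raw) tuples."""
    findings = []
    for attr_id, name, value, threshold, raw in attributes:
        if threshold > 0 and value <= threshold:
            # The drive itself would report this attribute as failed.
            findings.append(f"{name}: value {value} at or below threshold {threshold}")
        elif attr_id in CRITICAL_IDS and raw > 0:
            # The drive still reports "ok", but the raw counter is suspicious.
            findings.append(f"{name}: raw value {raw} is non-zero")
    return findings


example = [
    (5, "Reallocated sector count", 100, 5, 55),
    (7, "Seek error rate", 100, 67, 0),
    (10, "Spin retry count", 100, 60, 0),
    (199, "UDMA CRC error count", 200, 0, 730),
]
for finding in review(example):
    print(finding)
```

For the rows shown, only the reallocated sectors are flagged, which matches the evaluation above; the CRC errors are not flagged because that parameter is classed as informative.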

Self-test and error log

In addition to the continuous logging of the parameters above, there are further tests. Some manufacturers start these periodically when the drive is idle; others leave it to the user, who can run them with some of the available programs. What exactly is tested is also determined by the manufacturer. The standard is a short test that checks all parameters, followed by spot checks of the readability of the individual platters. The long test replaces the spot checks with a complete check.

ATA-6 adds two more variants. One is recommended after a drive has been transported (called Conveyance, similar to the short test); the other lets you test areas of the drive that you select yourself (Selective, similar to the long test).
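With smartmontools, these four self-test variants correspond to arguments of the smartctl -t option (short, long, conveyance and select,N-M). The following sketch only assembles such a command line; the function name and the device path /dev/sda are hypothetical, and whether a given drive supports the conveyance or selective test depends on its firmware.

```python
from typing import List, Optional, Tuple


def smartctl_selftest_command(
    device: str, test: str = "short", span: Optional[Tuple[int, int]] = None
) -> List[str]:
    """Build a smartctl command line that starts a self-test."""
    if test == "select":
        if span is None:
            raise ValueError("a selective test needs a (first, last) LBA span")
        first, last = span
        test_argument = f"select,{first}-{last}"
    elif test in ("short", "long", "conveyance"):
        test_argument = test
    else:
        raise ValueError(f"unknown self-test type: {test}")
    return ["smartctl", "-t", test_argument, device]


# Hypothetical device path; adjust to the system at hand.
print(" ".join(smartctl_selftest_command("/dev/sda", "conveyance")))
print(" ".join(smartctl_selftest_command("/dev/sda", "select", (0, 99999))))
```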

Since 1999 and the ATA-5 standard, errors have not only been reflected in the parameter values (for example, "Error rate: high") but also recorded in detail. For each error, the log notes the error itself, the time since the drive was last switched on and the five preceding steps. There is also a separate table for the results of the self-tests described above. In general, only recent clusters of errors are considered a cause for concern here.

If the hard disk supports firmware updates, the error log is deleted when the firmware is rewritten (regardless of the version). The parameter values are mostly retained.

SMART programs in comparison

The following table lists well-known programs for reading out SMART data.

Program name | Operating system(s) | Price | Duration of the demo version | Target group | User interface | Connection | RAID controller support | Correct interpretation of SSDs | Display of the error log | Starting the self-tests | Failure prediction | Notification on | Notification via | Provider | Remarks
Argus Monitor | Windows | € 14.95 | 30 days | Beginners to advanced | graphical | (S)ATA, USB | yes (not for all) | Yes | No | No | Yes | Selectable parameter changes, limit value, temperature | Window, sound, e-mail, execute any command | ArgusMonitor | Additionally shows CPU and graphics card temperature as well as CPU core frequency and Intel 'Turbo Boost' status graphically; display and control of mainboard and GPU fans
smartmontools | Windows (native or Cygwin), Linux, Darwin (Mac OS X), Free/Open/NetBSD, Solaris, OS/2, QNX | Open source | - | Professional users | Command line, optional daemon or service, graphical front end | (S)ATA, SCSI, SAT, USB | 3ware (Linux, FreeBSD, Windows), Compaq/HP (Linux, FreeBSD), HighPoint (Linux), Intel Matrix RAID (Windows) | Yes | Yes | yes (also time-controlled) | No | Selectable parameter changes, limit value, temperature | Window (Windows only), e-mail, system log, execute any command | smartmontools, GSmartControl | manual
HDAT2 | DOS | Freeware | - | Professional users | Text menu | (S)ATA, SCSI, USB, FireWire (some) | yes (not for all) | - | Yes | Yes | No | - | - | Lubomir Cabla | Offers setting of AAM and other parameters, as well as surface tests.
DriveSitter | Windows | from $ 29.69 | 30 days | Advanced | graphical | (S)ATA | - | ? | Yes | Yes | Yes | Selectable parameter changes, limit value, temperature | Window, sound, e-mail, network message, system log, execute any command | Oliver Marr | Highly scalable; switches to idle mode at critical temperatures if required.
EASIS Drive Check | Windows | Freeware / Pro € 19.- | - | Advanced | graphical | (S)ATA, USB, surface test all | - | ? | Yes | No | No | Parameter changes | Window, e-mail | EASIS | Can perform surface tests to find defective sectors
HDD Health | Windows | Freeware | - | Beginners to advanced | graphical | (S)ATA | - | - | yes (in new version) | yes (in new version) | Yes | Every parameter change, temperature | Window, sound, e-mail, network message (e-mail and network in commercial version only) | PANTERASoft |
Active SMART | Windows | from € 18.46 | 21 days | Beginners to advanced | graphical | (S)ATA, SCSI, USB | announced | - | No | No | Yes | Limit value, temperature | Window, sound, e-mail, network message | Ariolic | ATA / SCSI / USB. Switches to idle mode if the temperature is critical.
SpeedFan | Windows | Freeware | - | Beginners to advanced | graphical | (S)ATA, SCSI | - | yes (not for all) | No | Yes | Yes | Limit value, temperature | System notification, sound, e-mail, execute any command | Alfredo Milani Comparetti | Provides online analysis of the drive [1], monitors PC temperatures
SMARTReporter | Mac OS X | Open Source / Pro € 4.49 | - | Beginners | graphical | (S)ATA | - | yes (based on smartmontools) | Yes | Yes | No | Limit value | Window, e-mail, execute any command | Julian Mayer |
HDTune | Windows | Freeware; HD Tune Pro 24.95 EUR | - | Beginners to advanced | graphical | (S)ATA, USB (most) | - | - | No | No | No | - | - | EFD Software | Performs benchmarks and surface tests; health status for external HDDs only in the Pro version
Norton System Doctor | Windows | proprietary | - | Beginners | graphical | (S)ATA, SCSI, USB | ? | ? | No | No | No | Limit value (for each data carrier individually) | Taskbar icon, sound, administrative message | Symantec (weblink) | Can be configured individually for each data carrier; interface to Disk Doctor / chkdsk: surface test, complete test on restart
CrystalDiskInfo | Windows | Open source | - | Beginners to advanced | graphical | (S)ATA, USB (some) | Intel Matrix RAID | Yes | Yes | No | Yes | Limit value, temperature (for each data carrier individually) | Taskbar icon, sound, e-mail, event log | Crystal Dew World | Offers setting of AAM and other parameters
Acronis Drive Monitor | Windows | Freeware / proprietary | - | Beginners to advanced | graphical | (S)ATA, USB (most), software RAID controllers (many) | Software RAID controllers yes, hardware controller support announced | ? | Yes | ? | Yes | Hard drive problems, temperature, "critical events", backup messages | Taskbar icon, alarm message, e-mail | Acronis | Manual
Samsung SSD Magician | Windows | proprietary | - | Beginners to advanced | graphical | (S)ATA | - | Yes | Yes | ? | ? | ? | - | |
DHE Drive Info | Windows | Freeware | - | Beginners to advanced | graphical | (S)ATA, SCSI, USB | experimental | Yes | Yes | Yes | ? | Limit value, temperature | Window | Dirk Hauschild | Portable, no installation required

Reading of hard disks on RAID controllers

  • Only the controller manufacturer has the information needed to read out the SMART status in a RAID system, so it has to make this available through an API function in its driver. Not all manufacturers do so, and when they do, the interface is often manufacturer-specific and limited to selected models. The table lists the manufacturers whose functions the program knows.
  • Addressing the controller directly, bypassing the driver functions, is more successful but also potentially unstable and therefore only acceptable under DOS.
  • If SMART support is mentioned in the controller's specifications, it is often only internal to the controller. The driver then does not pass the information on to programs, or passes on only that of a single drive.
  • Hard disks in so-called software RAIDs (arrays managed by the operating system), and those configured on RAID controllers as individual drives rather than as an array, can always be read out. This is therefore not counted as RAID support in the table.

Sources

  1. Heise announcement of February 16, 2007
  2. http://research.google.com/archive/disk_failures.pdf
  3. Example of a reallocation of an existing SMART attribute on Indilinx controllers (www.ocztechnologyforum.com; archived in the Internet Archive, March 21, 2014)
  4. Some USB devices with SMART support (smartmontools Wiki)
  5. Michael Schmelzle: These SMART data are important. IDG Tech Media GmbH, October 30, 2013, accessed April 5, 2017 .
  6. http://forums.storagereview.net/index.php?showtopic=20731
  7. Figure: Read / write head in park position
  8. Ticket # 20275: Add support for starting tests

Web links