Self-Monitoring, Analysis and Reporting Technology

Self-Monitoring, Analysis and Reporting Technology (SMART or S.M.A.R.T.) is an industry standard for monitoring hard disk drives (HDD) and solid-state drives (SSD) and is used to predict a possible failure of the storage medium. The readings of various sensors are evaluated against a set of parameters.

Overview

The monitored data is evaluated when the computer starts, either by a suitably configured BIOS or other firmware, or by special software that must be installed in addition to the operating system. Microsoft, for example, has provided a driver for this since Windows 95b (OSR 2), which such software then addresses.

Such a program relies on the limit values set by the hard disk manufacturer for the individual parameters, such as temperature. Once it has observed the drive over a longer period, the software can predict expected failures.

"Switching off" SMART, for example in the BIOS settings, does not switch off data acquisition, but only switches off the warnings when the threshold values are exceeded. The collected data is saved in a reserved area of the hard disk that cannot be changed by programs.

Monitoring as a whole does not slow down the hard drive, as it only logs what is happening without taking corrective action. Corrective action, for example in the event of vibrations, is taken by mechanisms internal to the hard drive that already existed before SMART. Everything else, such as operating time and temperature, is recorded by purpose-built sensors and chip functions. Parameters are divided into “online” parameters, which are recorded continuously, and those that are updated during pauses when the drive is, so to speak, “offline”.

Informative value

SMART is limited to the mass storage devices it monitors, such as hard drives or SSDs, and provides no information on the overall reliability of the computer system. Data obtained from several mass storage devices is not linked. The system is also not fully standardized; it is up to the manufacturers to decide which parameters they monitor and within which limits. The accuracy of the monitoring is also debated among users. For example, some temperature sensors are considered to be incorrectly placed or calibrated too optimistically; the reported values can, for example, be well below room temperature.

An independent Google study, which ran for nine months and covered a total of 100,000 hard drives from all manufacturers, produced the following result in 2006: if all relevant parameters are included, 64% of all failures can be predicted with SMART. All other warning signs, such as audible noises or noticeable data errors, were disregarded. In the remaining third of all failures, the hard drive itself incorrectly reported that it was free of problems.

The load on the hard disk had a far smaller impact on its durability than previously assumed. If a drive survives the first year, the proportion of idle time no longer plays a role until the drive is routinely replaced after four years. Only in the first year and after the fourth does continuous reading and writing double the failure rate.

History

In 1992, IBM recognized that as PCs became more widespread in companies, reliance on them grew as well. Failures were increasingly becoming a financial problem, which IBM set out to address with PFA (Predictive Failure Analysis). IBM hard drives with this system informed the computer of any parameter changes so that the user could replace the drive in time. A little later, Compaq introduced IntelliSafe, which filters out irrelevant changes and reports only the threatening changes and threshold values to the running software. Seagate, Quantum and Conner were involved in the development and adapted it to their products; Compaq did not manufacture hard drives itself.
Sensing the potential and with an industry standard in mind, Compaq and especially Seagate pushed for the system to be disclosed. Together with Conner, Quantum, Western Digital and later IBM, they merged the two approaches under the name SMART.

Since 1996 and the start of the ATA-3 standard, or SCSI-3 four years earlier, it has been part of the standard equipment of almost every hard drive.

The specification of the SMART parameters was removed before the ATA-3 standard was adopted (see Web links). Therefore, neither the meaning of the stored values nor their scaling is stipulated (for the latter, see also common parameters). Only their location is officially standardized. Strictly speaking, even under the ATA-7 standard there is no way of reading out the temperature of a disk, for example. In practice, virtually all available disks adhere to the data format from the ATA-3 draft. A read-out program adds a designation such as "Seek Error Rate" to each parameter ID for better understanding. Over the years, a reliable de facto standard has emerged.

By design, solid-state drives (SSDs) no longer need many of the previous test points, but do need different, new ones. However, there is currently no coordination between the SSD controller manufacturers. As a result, some new parameter IDs were added, but in other cases existing IDs were simply given a new meaning. This leads to misinterpretations in any SMART program that does not yet know the meaning used by the new drives.

A brief evaluation of important SMART parameters is also included in most BIOS versions, so that warning messages about defective SSDs can appear when the computer is switched on. In this case, it is advisable to switch off the SMART self-test function in the BIOS and to carry out a manual test with a current program in the operating system (see SMART programs in comparison).

Differences by interface

The implementation of SMART differs depending on the interface through which the hard disk is connected to the PC. There are two standards: ATA and SCSI. Both provide the HEALTH STATUS, with which the drive firmware indicates whether it classifies the drive as "okay" or "problematic". Both standards also support reading out the temperature as well as several variants of self-tests and logs.

In the case of ATA hard disks, numerous values and their limits can also be queried using running software. In this way, the software or the user can assess more precisely whether and why an error will occur. However, these parameters are not exactly standardized and differ in scope and interpretation, even between models from one manufacturer.

The commands and data formats for all these functions are, however, implemented completely differently for ATA and SCSI.

In principle, SCSI commands are transmitted over the USB port. However, hard disks connected via USB are almost without exception (S)ATA rather than SCSI disks. With the introduction of the USB 3.0 interface came the USB Attached SCSI (UAS) protocol, which can also be used on USB 2.0 at reduced speed. In contrast to the technically simpler bulk transfer used by USB memory sticks, it allows ATA commands to be tunneled over the USB bus and thus enables SMART queries via USB. Chip manufacturers such as Cypress, JMicron or SunPlusIT use manufacturer-specific commands. Some programs can use these commands (see section SMART programs in comparison). There are also USB-SATA bridges that support the manufacturer-independent SCSI / ATA Translation (SAT) standard.
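As an illustration of how a program might address the different USB bridge types, the following minimal Python sketch builds a smartmontools command line. The device types sat, usbcypress, usbjmicron and usbsunplus are options of smartctl's -d switch; the dictionary keys, the function name and the device path /dev/sdb are assumptions made for this example only, and which type a given enclosure needs has to be determined by trial or from the smartmontools documentation.

```python
import shlex

# Bridge families named in the text, mapped to smartctl "-d" device types.
# The dictionary keys are hypothetical labels chosen for this sketch.
BRIDGE_TO_DEVICE_TYPE = {
    "sat": "sat",              # manufacturer-independent SCSI / ATA Translation
    "cypress": "usbcypress",   # manufacturer-specific pass-through commands
    "jmicron": "usbjmicron",
    "sunplus": "usbsunplus",
}


def smartctl_usb_command(device, bridge="sat"):
    """Return the argument list that queries all SMART data through a USB bridge."""
    try:
        device_type = BRIDGE_TO_DEVICE_TYPE[bridge]
    except KeyError:
        raise ValueError(f"unknown bridge family: {bridge!r}") from None
    return ["smartctl", "-a", "-d", device_type, device]


if __name__ == "__main__":
    # Only prints the command; running it requires smartmontools and root rights.
    print(shlex.join(smartctl_usb_command("/dev/sdb", "jmicron")))
```

The sketch deliberately returns an argument list instead of executing it, so the caller decides whether and with which privileges the query is run.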

The FireWire connection - which is common on Apple computers in particular - enables transmission natively, but Mac OS X does not use this.

Drives connected via eSATA, like their internal SATA counterparts, can be read without any problems.

Serial ATA disks connected via Serial Attached SCSI (SAS) can be checked if the corresponding SAT commands are available.

For tape drives, there are functions analogous to SMART called TapeAlert. They are used to warn of worn tapes.

Evaluation

Common parameters

Each value is first saved as raw data. For better understanding, it is then mapped onto a scale from 0 to 100, 200 or 255. The different scales allow finer gradations where the manufacturer considers them useful. Starting at the scale maximum, the value approaches zero as errors occur or the drive ages. However, the critical limit (threshold) is often well above zero.
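The relationship between scale value and threshold can be sketched in a few lines of Python. This is a minimal illustration, not a manufacturer's algorithm: it assumes the convention used by smartmontools, under which an attribute counts as failed once its scale value is less than or equal to the threshold, and it treats a threshold of 0 as "no limit set". The function name and the returned labels are invented for this example.

```python
def attribute_status(value, worst, threshold):
    """Classify one attribute from its scale value, worst value and threshold.

    Assumption: an attribute has failed when value <= threshold
    (smartmontools convention); a threshold of 0 can never be reached.
    """
    if threshold == 0:
        return "no threshold"
    if value <= threshold:
        return "failing now"
    if worst <= threshold:
        return "failed in the past"
    return "ok"


if __name__ == "__main__":
    print(attribute_status(100, 100, 5))   # scale value far above the limit
    print(attribute_status(161, 161, 0))   # purely informative attribute
```

Because the raw value does not enter this check at all, a drive can report a non-zero raw error count and still be rated as problem-free.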

The following table shows the individual parameters and the evaluation of the respective raw values (not to be confused with the values of the value scale):

Legend of the raw values
A.	Failure-relevant parameters. If available, possible failures can be forecast.
I.	Informative, parameters of little or no relevance for the failure forecast
	The higher the raw value, the better
	The lower the raw value, the better

ID	Hex	Parameter name (English)	Parameter name (translated from German)	I.	Description
01	0x01	(Raw) Read Error Rate	Read error rate (raw)		Uncorrectable errors when reading from the hard disk, which lead to a re-read. Indicates a problem with the platter surface. Some drives show very high raw values here, which cannot be compared even between models from one manufacturer. On newer Seagate drives, the value is incorrectly identical to that of Hardware ECC Recovered. Only the scale values are relevant to failure.
02	0x02	Throughput performance	Throughput		General data throughput or efficiency of the hard disk. Strongly indicates problems in the drive that are slowing it down.
03	0x03	Spin Up Time	Acceleration time		Average start-up time in (milli)seconds. Indicates problems with the motor or the platter bearings. Brand-new Maxtor and Quantum drives often produced false alarms in the first month.
04	0x04	Start / Stop Count	Start / stop processes	Yes	Number of start and stop processes of a drive (including standby). Indicates wear and tear, as starting and stopping is what stresses the hard drive most.
05	0x05	Reallocated Sectors Count	reassigned sectors		Number of reserve sectors consumed. Indicates surface problems, as only then does a reserve sector automatically replace one previously used. If this raw counter is not equal to zero, the probability of failure is five times higher. Failure usually follows within six months of the first "Reallocation Event".
07	0x07	Seek error rate	Search error rate		Uncorrectable errors when reading from the hard disk, which lead to a re-read. Indicates a positioning problem in the read / write unit. Some brand-new Seagate drives report scale values well below 100 here, which the manufacturer has not explained.
09	0x09	Power On Hours Count	Time in operation	Yes	Operating time in hours or seconds (including standby). Indicates wear and tear, but says nothing about the conditions of use during this time. On some Maxtor models, e.g. the Maxtor DiamondMax 10 6L250S0, the unit is minutes.
10	0x0A	Spin retry count	Start-up repetitions, only relevant for HDDs		Number of repeated attempts to spin the platters up to nominal speed. An increasing value indicates mechanical problems in the drive mechanism.
12	0x0C	Power cycle count	Number of activations	Yes	The number of times the drive has been turned on and off.
184	0xB8	End-to-end error	End-to-end errors		Increasing values indicate parity errors between the storage medium and drive controller.
187	0xBB	Reported uncorrectable error	Reported uncorrectable errors		Errors which could not be corrected by the integrated forward error correction (ECC).
188	0xBC	Command timeout	Commands which could not be executed in time		Number of command aborts due to timeout
193	0xC1	Load Cycle Count or Load / Unload Cycle Count	Parking processes	Yes	Number of times the read / write unit has been parked on the plastic ramp next to the platters. Usually only found on notebook drives. Indicates wear and tear; drives are designed for around 300,000 cycles, and the raw value shows the count so far. The read / write unit is parked when the drive is switched off or after idling for around 10 seconds, which sometimes creates an irritating noise. If the notebook is dropped, the read / write unit can no longer hit the magnetic disks, which triples the shock resistance to around 1000 g. Switching on and off is also gentler, as the unit is not lowered onto a special area of the platters (“landing zone”).
194	0xC2	Drive temperature	Hard drive temperature		Temperature of the drive in °C. Since some drives also store maximum and minimum values, earlier undercooling or overheating during operation can be detected. The raw value then contains all three numbers packed together. High temperatures (from 40 °C) only have an effect after three years; in that year they double the probability of failure, after which they lose their significance again. Averaged over all ages, temperatures below 25 °C are far more dangerous than those above 40 °C: the failure rate doubles at 20 °C and triples at 15 °C. Measurements extended up to 52 °C. Some manufacturers use inaccurate or misplaced sensors.
195	0xC3	Hardware ECC Recovered	rescued bit errors		Corrected bit errors when reading. May indicate a problem with the platter surface. The high data density of today's hard disks makes error correction during reading unavoidable, so even very high values here are no cause for concern. Samsung drives of the P80 series often incorrectly report very low scale values here. In general, very high raw values are common, and because of changes from one technology to a newer one they cannot be compared between models from the same manufacturer. They increase during read operations, since only then does error correction take place. Only the scale values are relevant to failure. The values are occasionally also referred to as "ECC on-the-fly".
196	0xC4	Reallocation Event Count			Number of successful and unsuccessful reassignments of the sector position carried out to date as a result of reading errors from defective sectors.
197	0xC5	Current pending sector count			Number of sectors waiting for assignment of a new sector position due to read errors
198	0xC6	Uncorrectable Sector Count	Uncorrectable sectors		Number of previous uncorrectable sector errors in write or read operations.
199	0xC7	Ultra DMA CRC Error Count	DMA CRC error	Yes	Number of CRC errors that occurred during transmission. The cause can be defective cables, dirty contacts, overclocking or faulty hard disk drivers. The transmission is repeated at progressively lower speeds. If this also fails, access to the hard disk is blocked.
201	0xC9	Soft read error rate			Number of read errors that cannot be corrected by software.

There are numerous other parameters, some of which are manufacturer-exclusive. Complete lists can be found in the literature section of the web links.
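The packed temperature raw value mentioned for parameter 194 can be unpacked with simple bit operations. The following Python sketch assumes the layout suggested by the temperature row of the example table in this article: the current temperature in the lowest 16 bits, followed by the minimum and then the maximum, each in its own 16-bit field. Other manufacturers may pack the value differently, so this is an illustration rather than a general decoder; the function name is invented for the example.

```python
def decode_temperature_raw(raw):
    """Split a packed raw value into (current, minimum, maximum) in °C.

    Assumed layout: bits 0-15 current, bits 16-31 minimum, bits 32-47 maximum.
    """
    current = raw & 0xFFFF
    minimum = (raw >> 16) & 0xFFFF
    maximum = (raw >> 32) & 0xFFFF
    return current, minimum, maximum


if __name__ == "__main__":
    # 34 °C now, lifetime minimum 10 °C, lifetime maximum 49 °C
    raw = 34 + (10 << 16) + (49 << 32)
    print(decode_temperature_raw(raw))  # (34, 10, 49)
```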

Example

The evaluation of important SMART parameters using the example of a Hitachi 250 GB hard drive, connected via Serial ATA and read out with the smartmontools .

Parameter ID	Parameter name	Value (normalized current measured value)	Worst (worst value so far)	Threshold (limit value - value should be greater)	Type (pre-failure warning or aging indicator)	Updated (in real time or only after a self-test)	RAW Value (actual measured value)	Comment
2	Throughput performance	100	100	050	Pre-fail	Offline	0
3	Spin Up Time	118	118	024	Pre-fail	Always	294	Hitachi uses its own counting method, no (milli-) seconds.
4	Start Stop Count	100	100	000	Old age	Always	772	The hard disk motor was switched on / off 772 times, including standby starts.
5	Reallocated sector count	100	100	005	Pre-fail	Always	55	55 sectors were exchanged for reserve sectors due to defects. However, the drive still rates this as problem-free (the value is still 100) - perhaps wrongly.
7	Seek error rate	100	100	067	Pre-fail	Always	0	So far there have been no read / write errors.
9	Power On Hours	100	100	000	Old age	Always	1775	The drive has been powered for 1775 hours to date. This also includes standby phases in which the platters were not spinning. If the evaluation program does not know the hard disk model, you have to assess for yourself whether the value represents hours, minutes or seconds.
10	Spin retry count	100	100	060	Pre-fail	Always	0	So far there have been no false starts, the hard disk always started without any problems.
12	Power cycle count	100	100	000	Old age	Always	745	So far, the PC with this hard disk has been switched on and off 745 times.
194	Temperature	161	161	000	Old age	Always	34 + (10 · 2¹⁶ + 49 · 2³²)	The current temperature here would be 34 °C. The lifetime minimum and maximum of the drive so far were 10 °C and 49 °C. The value has therefore dropped from 200 to 161.
199	UDMA CRC error count	200	253	000	Old age	Always	730	So far there have been 730 transmission errors to the main board. The cause is either a faulty hard disk controller, a defective connection cable or a loose connection.

Value	is a normalized measured value, which mostly counts backwards (the lower, the worse).
Worst	worst value so far.
Threshold	the limit below which the value must not fall.
Type	stands for the meaning of the parameter: "Pre-fail" is a warning of an imminent failure, while "Old age" means that it is generally a question of progressive aging (the current temperature does not necessarily fall into one of the two categories).
Updated	indicates whether the value is updated permanently (always) or only through a self-test of the type "Offline data collection".
RAW value	is the actual measured value, e.g. the measured temperature or the number of errors.
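A table like the one above can be obtained programmatically by parsing the output of `smartctl -A`. The following Python sketch assumes the classic ten-column text layout of smartmontools (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The sample text is constructed from values in the table above, with placeholder FLAG entries, and is not a captured log; the class and function names are invented for this example.

```python
import subprocess
from typing import NamedTuple


class Attribute(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    threshold: int
    kind: str       # "Pre-fail" or "Old_age"
    updated: str    # "Always" or "Offline"
    raw: str        # kept as text, because raw values may carry annotations


def parse_attribute_table(text):
    """Extract attribute rows from `smartctl -A` style text output."""
    attributes = []
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Data rows have ten columns and start with the numeric attribute ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attributes.append(Attribute(
            attr_id=int(fields[0]), name=fields[1],
            value=int(fields[3]), worst=int(fields[4]), threshold=int(fields[5]),
            kind=fields[6], updated=fields[7], raw=fields[9].strip()))
    return attributes


def read_attributes(device):
    """Run smartctl (requires smartmontools and sufficient privileges)."""
    # smartctl uses its exit status as a bit mask, so a non-zero status
    # is not treated as an error here.
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return parse_attribute_table(result.stdout)


# Illustrative sample; the FLAG column contains placeholders.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0000   118   118   024    Pre-fail  Always       -       294
  5 Reallocated_Sector_Ct   0x0000   100   100   005    Pre-fail  Always       -       55
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Always       -       1775
"""

if __name__ == "__main__":
    for attribute in parse_attribute_table(SAMPLE):
        print(attribute.attr_id, attribute.name, attribute.value, attribute.raw)
```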

Evaluation: According to the hard drive's own assessment, this drive is completely okay. Nowhere was a limit even close to being reached. According to the Google study, only the 55 replaced sectors are of concern. This value should therefore be kept in mind. However, if the “UDMA CRC Error Count” does not increase any further after the cable has been replaced, and the cooling is improved so that approx. 45 °C (temperature) is no longer exceeded, the drive can continue to be used without any problems.
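The reasoning of this evaluation can be written down as a small rule set on raw values. The following Python sketch is only an illustration of the three checks named above (replaced sectors, a growing CRC error count, and the approx. 45 °C mark); the function name, the parameter names and the message texts are assumptions, and the rules are not a validated failure predictor.

```python
def review_raw_values(reallocated, crc_errors_before, crc_errors_now, temperature_c):
    """Return a list of findings based on the raw values discussed above."""
    findings = []
    if reallocated > 0:
        findings.append("reallocated sectors present: keep watching this counter")
    if crc_errors_now > crc_errors_before:
        findings.append("CRC errors still increasing: check cable and contacts")
    if temperature_c > 45:
        findings.append("temperature above 45 °C: improve cooling")
    return findings


if __name__ == "__main__":
    # Values from the example: 55 replaced sectors, CRC count stable at 730, 34 °C
    for finding in review_raw_values(55, 730, 730, 34):
        print(finding)
```

Comparing two readings of the CRC counter, rather than looking at its absolute value, reflects the point made above that only a further increase is a cause for concern.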

Self-test and error log

In addition to the ongoing logging of the above parameters, there are further tests. Some manufacturers start these periodically when the drive is idle; others leave it to the user, who can run them with some of the programs listed below. What exactly is tested is also determined by the manufacturer. The standard is a short test that checks all parameters, followed by spot checks of the readability of the individual platters. The long version replaces the spot checks with a complete check.

ATA-6 adds two more variants. One is recommended after a drive has been transported (called Conveyance - similar to the short test), the other allows you to test areas of the drive that you can select yourself (Selective - similar to the long test).

Since 1999 and the ATA-5 standard, errors that occur are not only reflected in the parameter values (result, for example: "Error rate: high") but also logged in detail. The log records the error, the time since the device was last switched on and the five commands that preceded it. There is also a separate table for the results of the self-tests described above. In general, only recent clusters of errors are considered a cause for concern here.

If the hard disk supports updating its firmware, the error log is deleted when the firmware is rewritten (regardless of the version). The parameter values are mostly retained.
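With smartmontools, the self-test variants and the two logs described in this section map onto the `-t` and `-l` options of smartctl. The following Python sketch only assembles the corresponding command lines; the function names and the device path are assumptions for illustration, and whether a drive actually supports the conveyance or selective test depends on the model.

```python
SIMPLE_TESTS = ("short", "long", "conveyance")
LOGS = ("error", "selftest")


def self_test_command(device, kind="short", lba_range=None):
    """Build the smartctl call that starts a self-test.

    For kind="selective", lba_range is a (first, last) tuple of sector numbers.
    """
    if kind == "selective":
        if lba_range is None:
            raise ValueError("selective test needs an LBA range")
        first, last = lba_range
        if first < 0 or last < first:
            raise ValueError("invalid LBA range")
        test = f"select,{first}-{last}"
    elif kind in SIMPLE_TESTS:
        test = kind
    else:
        raise ValueError(f"unknown self-test: {kind!r}")
    return ["smartctl", "-t", test, device]


def log_command(device, log="error"):
    """Build the smartctl call that prints the error log or the self-test log."""
    if log not in LOGS:
        raise ValueError(f"unknown log: {log!r}")
    return ["smartctl", "-l", log, device]


if __name__ == "__main__":
    print(self_test_command("/dev/sda", "selective", (0, 1000)))
    print(log_command("/dev/sda", "selftest"))
```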

SMART programs in comparison

The following table lists well-known programs for reading out SMART data.

Program name	Operating system(s)	Price	Duration of the demo version	Target group	User interface	Connection	RAID controller support	Correct interpretation of SSDs	Display of the error log	Starting the self-tests	Failure prediction	Notification at	Notification by	Provider	Remarks
Argus monitor	Windows	€ 14.95	30 days	Beginners to advanced	graphically	(S) ATA, USB	yes (not for all)	Yes	No	No	Yes	Selectable parameter changes, limit value, temperature	Windows, sound, e-mail, execute any command	ArgusMonitor	Additionally graphic display of CPU and graphics card temperature as well as CPU core frequency and Intel 'Turbo Boost' status; Display and control of mainboard and GPU fans
smartmontools	Windows (native or Cygwin ), Linux , Darwin ( Mac OS X ), Free / Open / Net BSD, Solaris , OS / 2 , QNX	Open source	-	Professional users	Command line , optional daemon or service , graphical front end	(S) ATA, SCSI, SAT , USB	3ware (Linux, FreeBSD, Windows), Compaq / HP (Linux, FreeBSD), HighPoint (Linux), Intel Matrix RAID (Windows)	Yes	Yes	yes (also time-controlled)	No	Selectable parameter changes, limit value, temperature	Window (Windows only), e-mail, system log, execute any command	smartmontools GSmartControl	manual
HDAT2	DOS	Freeware	-	Professional users	Text menu	(S) ATA, SCSI, USB, FireWire (some)	yes (not for all)	-	Yes	Yes	No	-	-	Lubomir Cabla	Offers setting of AAM and other parameters, as well as surface tests.
DriveSitter	Windows	from $ 29.69	30 days	Advanced	graphically	(S) ATA	-	?	Yes	Yes	Yes	Selectable parameter changes, limit value, temperature	Windows, sound, e-mail, network message, system log, execute any command	Oliver Marr	Highly scalable, switches to idle mode if required at critical temperatures.
EASIS Drive Check	Windows	Freeware / Pro € 19.-	-	Advanced	graphically	(S) ATA, USB, surface test all	-	?	Yes	No	No	Parameter changes	Window, email	EASIS	Can perform surface tests to find defective sectors
HDD Health	Windows	Freeware	-	Beginners to advanced	graphically	(S) ATA	-	-	yes (in new version)	yes (in new version)	Yes	every parameter change, temperature	Window, Sound, Email, Network Message (Email and Network Commercial Version Only)	PANTERASoft
Active SMART	Windows	from € 18.46	21 days	Beginners to advanced	graphically	(S) ATA, SCSI, USB	announced	-	No	No	Yes	Limit value, temperature	Window, sound, email, network message	Ariolic ATA / SCSI / USB	Switches to idle mode if the temperature is critical.
SpeedFan	Windows	Freeware	-	Beginners to advanced	graphically	(S) ATA, SCSI	-	yes (not for all)	No	Yes	Yes	Limit value, temperature	System notification, sound, e-mail, execute any command	Alfredo Milani Comparetti	Provides online analysis of the drive [1] , monitors PC temperatures
SMARTReporter	Mac OS X	Open Source / Pro € 4.49	-	Beginners	graphically	(S) ATA	-	yes (based on smartmontools)	Yes	Yes	No	Limit value	Window, email, execute any command	Julian Mayer
HDTune	Windows	Freeware HD Tune Pro 24.95 EUR	-	Beginners to advanced	graphically	(S) ATA, USB (most)	-	-	No	No	No	-	-	EFD software	Performs benchmarks and surface tests; Health for ext. HDD only in the Pro version
Norton System Doctor	Windows	proprietary	-	Beginners	graphically	(S) ATA, SCSI, USB	?	?	No	No	No	Limit value (for each data carrier individually)	Taskbar icon, sound, administrative message	Symantec weblink	Can be configured individually for each data carrier, interface for Disc Doktor / chkdsk : surface test, complete test on restart
CrystalDiskInfo	Windows	Open source	-	Beginners to advanced	graphically	(S) ATA, USB (some)	Intel Matrix RAID	Yes	Yes	No	Yes	Limit value, temperature (for each data carrier individually)	Taskbar Icon, Sound, Email, Event Log	Crystal Dew World	Offers setting of AAM and other parameters
Acronis® Drive Monitor ™	Windows	Freeware / proprietary	-	Beginners to advanced	graphically	(S) ATA, USB (most), software RAID controllers (many)	Software RAID controller YES, hardware controller support announced	?	Yes	?	Yes	Hard drive problems, temperature, "critical events", backup messages	Taskbar icon, alarm message, email	Acronis	Manual
Samsung SSD Magician	Windows	proprietary	-	Beginners to advanced	graphically	(S) ATA	-	Yes	Yes	?	?	?	-
DHE Drive Info	Windows	Freeware	-	Beginners to advanced	graphically	(S) ATA, SCSI, USB	experimental	Yes	Yes	Yes	?	Limit value, temperature	window	Dirk Hauschild	portable, no installation required

Reading of hard disks on RAID controllers

Only the controller manufacturer has the information required to read out the SMART status in a RAID system, so the manufacturer has to make it available through an API function of its driver. Not all manufacturers do this, and those that do often provide it in a manufacturer-specific form and only for selected models. The table lists the manufacturers whose functions each program knows.
Addressing the controller directly without using the driver functions is more successful, but also potentially unstable and therefore only acceptable under DOS.
If SMART support is mentioned in the controller's specifications, it is often only internal to the controller. The driver then does not pass the information on to programs, or passes on only that of a single drive.
Hard disks in so-called software RAIDs (i.e. arrays that are managed by the operating system) and those that are set up on RAID controllers as individual drives instead of as an array can always be read out. This case is therefore not counted in the table.

Sources

Heise announcement of February 16, 2007

http://research.google.com/archive/disk_failures.pdf

Example of a reallocation of an existing SMART attribute on Indilinx controllers (Memento of the original from March 21, 2014 in the Internet Archive)

Some USB devices with SMART support (smartmontools Wiki)

Michael Schmelzle: These SMART data are important. IDG Tech Media GmbH, October 30, 2013, accessed April 5, 2017.

http://forums.storagereview.net/index.php?showtopic=20731

Figure: Read / write head in park position

Ticket # 20275: Add support for starting tests

Web links

Manufacturer's own software
- Fujitsu
- Hitachi
- Maxtor ( Memento from April 15, 2007 in the Internet Archive )
- Samsung
- Seagate
- Western Digital
- Ultimate Boot CD - proprietary and other tools on a bootable CD.
- SSD tools: curse or blessing? an inventory… , pc-experience.de

Software based on availability for operating systems
- FreeBSD RAID Monitoring

Literature
- Linux community: "Prevention instead of crash"
- Introduction (English)
- Compendium (English, PDF; 679 kB)
- Background (English)
- Failure study (English, also as PDF )

Standards
- ATA-3 Standard, Draft 7b (English, PDF) - The SMART attributes mentioned here were removed before the standard was adopted.
- ATA-8 ACS Standard, Draft 6a ( Memento from December 11, 2009 in the Internet Archive ) (English, PDF; 2.8 MB) - Last draft of the currently valid standard, the SMART attributes are still missing.
- ATA-8 Appendix on SMART Attributes ( Memento of July 3, 2007 in the Internet Archive ) (English, PDF; 24 kB) - Unaccepted proposal for an informal appendix to the ATA-8 ACS standard.

[1] Heise announcement of February 16, 2007
