Thursday, 13 November 2014

Using SMART to accurately predict when a hard drive is about to die

A bunch of hard drives -- Google goes through quite a few, presumably

SMART, which first started to appear in consumer hard drives about 10 years ago, is a brilliant concept — it’s meant to tell you if a hard drive is about to die — but in practice, I think we can all agree that SMART has always been rather underwhelming. Personally, SMART has never helped me catch a failing drive — and I suspect it’s the same for most people who are reading this story.
The problem with SMART — which stands for Self-Monitoring, Analysis and Reporting Technology — is that it provides a lot of (mostly useless) data, and many of the statistics reported are inconsistent (different definitions, different measurements) between hard drive manufacturers. While you want SMART to simply tell you that a hard drive is about to fail, it does no such thing — instead, you get to analyse about 50 different variables (which vary from drive to drive), and then try to magically divine whether the drive is healthy or not. While most hard drive makers do provide some kind of SMART monitoring tool, none of them accurately warn you that a drive is about to fail.
The sad thing is, SMART does return some useful information — but because of the inconsistency between drive makers, and because drive makers don’t tell us which of the various attributes and variables we should pay attention to, any useful data is drowned out. Now, however, thanks to the infinite-online-backup guys at Backblaze, we may finally have a way of using SMART to predict drive death.
SMART 187 (uncorrectable errors) vs. drive failure rate

Which SMART attributes actually matter?

For the past year or so, Backblaze has been capturing the SMART data from some 40,000 hard drives. The hard drives are made by all the usual suspects — Seagate, Western Digital, Hitachi, and HGST. By working backwards from drives that failed, and then looking at their reported SMART data from the preceding weeks and months, Backblaze thinks it has identified five SMART attributes that actually predict drive death. Backblaze is apparently using this data to replace drives before they fail, so it must be fairly confident in its findings.
Here are the big five. (For more info on what each error means, the SMART Wikipedia page is pretty good.)
  1. SMART ID 5 (0x05): Reallocated Sectors Count
  2. SMART ID 187 (0xBB): Reported Uncorrectable Errors
  3. SMART ID 188 (0xBC): Command Timeout
  4. SMART ID 197 (0xC5): Current Pending Sector Count
  5. SMART ID 198 (0xC6): Uncorrectable Sector Count
In general, if a drive shows a count of zero (0) for all of these attributes, it means the drive is almost certainly healthy. Conversely, if any of these attributes has a value of 1 or more, there’s a significant chance that the drive will die soon — it’s time to back up your data ASAP and slot in a new drive. Backblaze says that SMART attribute 187 (0xBB), Reported Uncorrectable Errors, is particularly useful because all hard drive makers seem to agree on the same definition, and because the reported number is easy to interpret.
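The rule of thumb above (any of the five counts above zero is a warning sign) can be sketched in a few lines of Python. This is only an illustration: the function name and the input format, a dict mapping decimal SMART ID to raw count, are assumptions made for the example, not anything Backblaze has published, and the sample readings are made up.

```python
# The five attributes Backblaze singled out, keyed by decimal SMART ID.
PREDICTIVE_ATTRIBUTES = {
    5: "Reallocated Sectors Count",
    187: "Reported Uncorrectable Errors",
    188: "Command Timeout",
    197: "Current Pending Sector Count",
    198: "Uncorrectable Sector Count",
}


def flag_attributes(raw_values):
    """Return (id, name, raw) for each of the five attributes with a raw count above zero.

    raw_values is a hypothetical input format: a dict mapping decimal
    SMART ID to raw count. Attributes the drive does not report are skipped.
    """
    flagged = []
    for attr_id, name in PREDICTIVE_ATTRIBUTES.items():
        raw = raw_values.get(attr_id)
        if raw is not None and raw > 0:
            flagged.append((attr_id, name, raw))
    return flagged


# Made-up readings for illustration only.
healthy = {5: 0, 187: 0, 188: 0, 197: 0, 198: 0}
suspect = {5: 0, 187: 3, 197: 8}  # this drive does not report 188 or 198

print(flag_attributes(healthy))  # empty list: nothing to worry about
for attr_id, name, raw in flag_attributes(suspect):
    print(f"SMART {attr_id} ({name}) = {raw}: back up and plan a replacement")
```

An empty result only means none of the five counters has moved; it is not a guarantee that the drive will survive.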
If you want to look at Backblaze’s complete SMART data set, it’s all online for your perusal — or read on, for a little more analysis of some of its more interesting findings.

Is hard drive life affected by the number of power cycles?

One of the most popular myths/anecdotes/old wives’ tales in computer building circles is that turning a computer on severely reduces the lifespan of a hard drive. The idea is that keeping a drive spinning at a few thousand rpm is easy, but the initial strain on the components at spin-up is somehow damaging. The counterpoint to this, of course, is that a switched-off hard drive will obviously live longer than a spinning hard drive. So, should you keep your PC turned on 24/7 and flip the config option that prevents your hard drives from spinning down?
Power cycles vs. hard drive life
Backblaze’s SMART data on power cycles is… interesting. Drives clearly have less chance of dying when they have only been powered up a few times, but after 30 power cycles or so, it seems to even out. Backblaze very rarely powers down its drives (fewer than 100 times in total, over 4+ years), so we can’t really draw any firm conclusions.
Temperature (in Celsius, I think) vs. hard drive failure rate
You’ll be glad to hear that temperature doesn’t appear to affect a drive’s failure rate.
SMART hard drive data, with the DiskCheckup tool
If you want to use this SMART data to see if your own hard drives are failing, PassMark’s DiskCheckup is a free program for Windows that’s easy to use. If any of the above five SMART values are above zero, then you may want to back up your data soon. Bear in mind that not every drive reports every SMART attribute — and you may need to use the hex (0xC5) code rather than decimal (197) to find the SMART attribute you’re looking for.
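Since tools differ on whether they label attributes in hex or decimal, a small Python helper can convert between the two forms. This is just a convenience sketch; the function names are made up for the example.

```python
def smart_id_forms(attr_id):
    """Return a SMART attribute ID as (decimal, two-digit hex string)."""
    return attr_id, f"0x{attr_id:02X}"


def parse_smart_id(text):
    """Accept either '197' or '0xC5' and return the decimal ID."""
    text = text.strip()
    if text.lower().startswith("0x"):
        return int(text, 16)
    return int(text, 10)


# Print both forms of the five attributes discussed above.
for attr_id in (5, 187, 188, 197, 198):
    decimal, hex_form = smart_id_forms(attr_id)
    print(f"{decimal:>3} = {hex_form}")
```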
