Analysis
Is there a SATA silent drive failure problem?
posted on 08 July 2008 09:51
Yesterday RAID Inc. announced that it would OEM NEC of America's D-Series drive arrays because the array controller, among other things, carries out read integrity validation checks. This was necessary because RAID Inc. customers had reported 'silent drive failures' on SATA drives, in which not all the data on the drive was accessible by the RAID controller.
A simple statement, but one with potentially earth-shattering implications. RAID Inc. chose NEC drive arrays and rejected drive array subsystems from Infortrend and Xyratex because those subsystems did not carry out such read integrity validation checking.
NEC's release about the RAID Inc. OEM deal lists this among the functions of its D-Series controller: 'SATA read verification to detect silent read errors that other arrays do not.'
Another simple statement. A 'silent' read error means that the controller doesn't return all the information it was asked for, and no error is reported. We can see how detecting that would be vital in a high performance computing (HPC) numerical calculation where bits count. But it is just as vital everywhere else. You simply have to trust absolutely that you get what you request from a drive.
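To make the idea of read verification concrete, here is a minimal sketch, assuming a checksum is stored with each block at write time and recomputed on every read. This is not NEC's method, which its release does not describe; `VerifiedStore`, `SilentReadError` and `simulate_silent_corruption` are hypothetical names used purely for illustration.

```python
import zlib


class SilentReadError(Exception):
    """Raised when returned data does not match the checksum stored at write time."""


class VerifiedStore:
    """Toy in-memory block store (hypothetical) that verifies every read."""

    def __init__(self):
        self._blocks = {}     # block address -> data as the "drive" holds it
        self._checksums = {}  # block address -> CRC32 recorded at write time

    def write(self, address, data):
        self._blocks[address] = bytes(data)
        self._checksums[address] = zlib.crc32(data)

    def read(self, address):
        data = self._blocks[address]
        # Without this check, truncated or altered data would be returned
        # to the caller with no error at all: a silent read error.
        if zlib.crc32(data) != self._checksums[address]:
            raise SilentReadError(f"block {address} failed read verification")
        return data

    def simulate_silent_corruption(self, address, bad_data):
        # Change the stored data without updating the checksum, mimicking
        # a drive that hands back something other than what was written.
        self._blocks[address] = bytes(bad_data)


store = VerifiedStore()
store.write(7, b"payload")
store.simulate_silent_corruption(7, b"payl")
try:
    store.read(7)
except SilentReadError as err:
    print("detected:", err)
```

An array without such a check would pass the truncated data straight through, which is the scenario RAID Inc.'s customers reported.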
Jerome Wendt has written about unrecoverable bit errors on SATA drives and how RAID systems do not detect them.
Is there a general problem here, or one with a very low occurrence rate that is only revealed in HPC configurations with hundreds or thousands of drives? Certainly there has been no whisper of a similar problem from other SATA drive array suppliers.
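The scale argument can be illustrated with a little arithmetic. The sketch below assumes each drive independently has the same small chance of a silent error over some period. The 0.1% figure is purely hypothetical and is not a measured rate from RAID Inc., NEC or anyone else.

```python
def prob_at_least_one(per_drive_prob, drives):
    """Chance that at least one of `drives` independent drives is affected."""
    return 1.0 - (1.0 - per_drive_prob) ** drives


# Hypothetical per-drive probability, chosen only to show the trend.
ASSUMED_PER_DRIVE_PROB = 0.001

for drive_count in (1, 10, 100, 1000):
    chance = prob_at_least_one(ASSUMED_PER_DRIVE_PROB, drive_count)
    print(f"{drive_count:>5} drives: {chance:.1%} chance of at least one silent error")
```

Under these assumptions, an event that a single-drive user would almost never see becomes likely at a site with a thousand drives. That would fit the reports coming from HPC customers first, but it does not settle whether the problem is industry-wide.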
NEC of America has a Clipper Group evaluation of the D-Series. That evaluation does not mention the SATA read verification at all. NEC also has an ESG validation of the D-Series, which doesn't mention SATA read verification either. This double absence suggests that neither NEC of America nor the evaluators thought the feature was significant.
Certainly the D-Series has a lot of data protection features, with redundant caches and enhanced RAID schemes, and both evaluations go into these in detail. But the SATA read verification has only come to the fore in the OEM agreement with RAID Inc. So ... is there a real, hidden and industry-wide problem here, or a minor, minor blip affecting some HPC sites?
[Chris Mellor.]
tags: SATA


