[OmniOS-discuss] zpool degraded while SMART says disks are OK

Richard Elling richard.elling at richardelling.com
Sun Mar 23 23:32:15 UTC 2014


On Mar 21, 2014, at 10:13 PM, Tobias Oetiker <tobi at oetiker.ch> wrote:

> Yesterday Richard Elling wrote:
> 
>> 
>> On Mar 21, 2014, at 3:23 PM, Tobias Oetiker <tobi at oetiker.ch> wrote:
> 
> [...]
>>> 
>>> It happened over time, as you can see from the timestamps in the
>>> log. The errors from ZFS's point of view were 1 read and about 30 writes,
>>> 
>>> but according to SMART the disks are without flaw
>> 
>> Actually, SMART is pretty dumb. In most cases, it only looks for uncorrectable
>> errors that are related to media or heads. For a clue to more permanent errors,
>> you will want to look at the read/write error reports for errors that are
>> corrected with possible delays. You can also look at the grown defects list.
>> 
>> This behaviour is expected for drives whose errors are not being quickly
>> corrected, or that have firmware bugs (horrors!), and where the disk does not
>> do TLER (or its vendor's equivalent).
>> -- richard
> 
> the error counters look like this:
> 
> 
> Error counter log:
>           Errors Corrected by           Total   Correction     Gigabytes    Total
>               ECC          rereads/    errors   algorithm      processed    uncorrected
>           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:       3494        0         0      3494      44904        530.879           0
> write:         0        0         0         0      39111       1793.323           0
> verify:        0        0         0         0       8133          0.000           0

All corrections happened fast (none delayed) and there are no uncorrected
errors, so these counters look good. The problem lies elsewhere.
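For reference, a quick way to eyeball such a counter table is to flag only the
columns that matter for this diagnosis (delayed corrections, rereads/rewrites,
and uncorrected totals). This is an illustrative sketch that assumes the
smartctl/sg_logs column layout shown above; the function name is made up:

```shell
# Hypothetical helper: read an "Error counter log" table on stdin and print
# only the rows showing delayed corrections, rereads/rewrites, or
# uncorrected errors. Column order assumed (per the table above):
#   op: ecc_fast ecc_delayed rereads total_corrected invocations GB uncorrected
check_counters() {
    awk '/^(read|write|verify):/ {
        if ($3 > 0 || $4 > 0 || $8 > 0)
            printf "%s delayed=%s rereads=%s uncorrected=%s\n", $1, $3, $4, $8
    }'
}
```

With the table quoted above, this prints nothing, which matches the point
here: fast ECC corrections alone are normal drive behaviour.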

> 
> the disk vendor is HGST, in case anyone has further ideas ... the system has
> 20 of these disks and the problems occurred with three of them. The system had
> been running fine for two months previously.

...and yet there are aborted commands, likely due to a reset after a timeout.
Resets aren't issued without cause.

There are two different resets issued by the sd driver: LU (logical unit) and
bus. If the LU reset doesn't work, the driver escalates to a bus reset. This
is, of course, tunable, but is rarely tuned. A bus reset for SAS is a
questionable practice,
since SAS is a fabric, not a bus. But the effect of a device in the fabric
being reset could be seen as aborted commands by more than one target. To
troubleshoot these cases, you need to look at all of the devices in the data
path and map the common causes: HBAs, expanders, enclosures, etc. Traverse
the devices looking for errors, as you did with the disks. Useful tools:
sasinfo, lsiutil/sas2ircu, smp_utils, sg3_utils, mpathadm, fmtopo.
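That traversal can be scripted. A minimal sketch, using tool names from the
list above; the device path is purely illustrative, the `run` helper is made
up, and each step is simply skipped where the tool isn't installed:

```shell
# Hypothetical walk of the SAS data path for one disk. DEV is an example
# illumos device path, not a real one.
DEV=${1:-/dev/rdsk/c0t0d0}

# Run a command if present, otherwise note that it was skipped.
run() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $*"
        "$@"
    else
        echo "== $1 not installed, skipped"
    fi
}

run sasinfo hba-port -l          # HBA ports and link-level error counters
run mpathadm list lu             # multipath state per logical unit
run sg_logs --page=0x18 "$DEV"   # SCSI Protocol Specific Port log page (PHY errors)
run fmtopo                       # FMA topology: HBAs, expanders, enclosures
```

The idea is to collect error counters from every hop, then look for a common
component (HBA, expander, enclosure) shared by the three misbehaving disks.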
 -- richard


> 
> Vendor:               HGST
> Product:              HUS724030ALS640
> Revision:             A152
> User Capacity:        3,000,592,982,016 bytes [3.00 TB]
> Logical block size:   512 bytes
> Serial number:        P8J20SNV
> Device type:          disk
> Transport protocol:   SAS
> 
> cheers
> tobi
> -- 
> Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
> www.oetiker.ch tobi at oetiker.ch +41 62 775 9902
> *** We are hiring IT staff: www.oetiker.ch/jobs ***
