[OmniOS-discuss] zpool degraded while smart says disks are OK

Tobias Oetiker tobi at oetiker.ch
Mon Mar 31 14:16:08 UTC 2014


Hi Richard,

On Mar 23, Richard Elling wrote:

>
> On Mar 21, 2014, at 10:13 PM, Tobias Oetiker <tobi at oetiker.ch> wrote:
>
> > Yesterday Richard Elling wrote:
> >
> >>
> >> On Mar 21, 2014, at 3:23 PM, Tobias Oetiker <tobi at oetiker.ch> wrote:
> >
> > [...]
> >>>
> >>> it happened over time as you can see from the timestamps in the
> >>> log. The errors from zfs's point of view were 1 read and about 30 write
> >>>
> >>> but according to smart the disks are without flaw
> >>
> >> Actually, SMART is pretty dumb. In most cases, it only looks for uncorrectable
> >> errors that are related to media or heads. For a clue to more permanent errors,
> >> you will want to look at the read/write error reports for errors that are
> >> corrected with possible delays. You can also look at the grown defects list.
> >>
> >> This behaviour is expected for drives with errors that are not being quickly
> >> corrected or have firmware bugs (horrors!) and where the disk does not do TLER
> >> (or its vendor's equivalent)
> >> -- richard
> >
> > the error counters look like this:
> >
> >
> > Error counter log:
> >           Errors Corrected by           Total   Correction     Gigabytes    Total
> >               ECC          rereads/    errors   algorithm      processed    uncorrected
> >           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> > read:       3494        0         0      3494      44904        530.879           0
> > write:         0        0         0         0      39111       1793.323           0
> > verify:        0        0         0         0       8133          0.000           0
>
> Errors corrected without delay look good. The problem lies elsewhere.
>
> >
> > the disk vendor is HGST, in case anyone has further ideas ... the system has 20 of these disks and the problems occurred with
> > three of them. The system had been running fine for two months previously.
>
> ...and yet there are aborted commands, likely due to a reset after a timeout.
> Resets aren't issued without cause.
>
> There are two different resets issued by the sd driver: LU and bus. If the
> LU reset doesn't work, the resets are escalated to bus. This is, of course,
> tunable, but is rarely tuned. A bus reset for SAS is a questionable practice,
> since SAS is a fabric, not a bus. But the effect of a device in the fabric
> being reset could be seen as aborted commands by more than one target. To
> troubleshoot these cases, you need to look at all of the devices in the data
> path and map the common causes: HBAs, expanders, enclosures, etc. Traverse
> the devices looking for errors, as you did with the disks. Useful tools:
> sasinfo, lsiutil/sas2ircu, smp_utils, sg3_utils, mpathadm, fmtopo.

thanks for the hints ... after detaching/reattaching the 'failed'
disks, they were resilvered and a subsequent scrub did not detect
any errors ...

all a bit mysterious ... will keep an eye on the box to see how it
fares in the future ...

cheers
tobi


-- 
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
www.oetiker.ch tobi at oetiker.ch +41 62 775 9902
*** We are hiring IT staff: www.oetiker.ch/jobs ***
