[OmniOS-discuss] weird disk behavior

Wed Mar 9 23:53:52 UTC 2016

comment below...

> On Mar 9, 2016, at 11:05 AM, Michael Rasmussen <mir at miras.org> wrote:
> 
> Hi all,
> 
> I suddenly noticed one of the disk bays in my storage server going red
> with this logged in dmesg:
> Mar  9 19:19:47 nas genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1022,1708@3/pci1028,1f0e@0 (mpt0):
> Mar  9 19:19:47 nas     Disconnected command timeout for Target 1
> Mar  9 19:19:51 nas genunix: [ID 365881 kern.info] /pci@0,0/pci1022,1708@3/pci1028,1f0e@0 (mpt0):
> Mar  9 19:19:51 nas     Log info 0x31140000 received for target 1.
> Mar  9 19:19:51 nas     scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
> Mar  9 19:19:51 nas genunix: [ID 365881 kern.info] /pci@0,0/pci1022,1708@3/pci1028,1f0e@0 (mpt0):
> Mar  9 19:19:51 nas     Log info 0x31130000 received for target 1.
> Mar  9 19:19:51 nas     scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
> Mar  9 19:19:51 nas genunix: [ID 365881 kern.info] /pci@0,0/pci1022,1708@3/pci1028,1f0e@0 (mpt0):
> Mar  9 19:19:51 nas     Log info 0x31130000 received for target 1.
> Mar  9 19:19:51 nas     scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
> Mar  9 19:20:21 nas genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1022,1708@3/pci1028,1f0e@0/sd@1,0 (sd3):
> Mar  9 19:20:21 nas     Command failed to complete...Device is gone
> Mar  9 19:20:24 nas genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1022,1708@3/pci1028,1f0e@0/sd@1,0 (sd3):
> Mar  9 19:20:24 nas     Command failed to complete...Device is gone
> Mar  9 19:20:24 nas genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1022,1708@3/pci1028,1f0e@0/sd@1,0 (sd3):
> Mar  9 19:20:24 nas     SYNCHRONIZE CACHE command failed (5)
> Mar  9 19:20:24 nas genunix: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1022,1708@3/pci1028,1f0e@0/sd@1,0 (sd3):
> Mar  9 19:20:24 nas     drive offline
> 
> zpool online and smartctl could not talk to the disk.
> 
> After pulling the disk and reinserting it, the status showed green and
> both smartctl and zpool online could talk to the disk.
> 
> Resilvering is now taking place.
> 
> Any idea what has gone wrong, or should I worry about the disk
> failing imminently?

These are symptoms of a drive that is not responding: resets are being sent to try (often in vain) to bring the disk back online. Since this is mpt, the HBA is likely 3Gbps, and if the drive is SATA your tears will flow. Now that the drive is back AND the symptoms cleared after reinserting the drive, it is very likely that the drive is the source of the errors. smartctl might give more info. IMHO you should plan to replace that drive.

NB, for that SAS fabric generation, it is possible that the problem drive is not the only drive showing the same errors, but your drive pull test is a reasonable approach.
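
One way to check whether other targets are logging the same complaints is to tally the mpt messages per target. The following is only a rough sketch in Python, not a diagnostic tool: the function names are made up for illustration, and it assumes the messages look like the ones quoted above ("... timeout for Target 1", "... received for target 1.") and that the system log lives at /var/adm/messages.

```python
import re
from collections import Counter

# Assumption: mpt names the target as "for Target N" or "for target N",
# as in the log excerpt quoted in this thread.
TARGET_RE = re.compile(r"for target (\d+)", re.IGNORECASE)


def count_target_events(lines):
    """Return a Counter mapping target number -> count of log lines naming it."""
    counts = Counter()
    for line in lines:
        match = TARGET_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def count_in_file(path="/var/adm/messages"):
    """Tally target events in a log file (the default path is an assumption)."""
    with open(path, errors="replace") as log:
        return count_target_events(log)
```

Calling count_in_file() and printing the sorted items gives one count per target. If only target 1 shows up, that supports blaming the one drive; if several targets appear around the same timestamps, the fabric deserves a closer look.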

Do not be surprised if smartctl doesn't correctly identify the issue; SMART isn't very smart sometimes.

  -- richard

> 
> 
> -- 
> Hilsen/Regards
> Michael Rasmussen