[OmniOS-discuss] How bad are these controller / io errors??

Richard Elling richard.elling at richardelling.com
Sat Aug 17 23:25:49 UTC 2013


On Aug 13, 2013, at 8:20 AM, steve at linuxsuite.org wrote:

> 
>   Howdy!
> 
>         This is a SuperMicro JBOD with SATA disks. I am aware of the
> issues of having
> SATA on SAS, but was wondering just how serious these kinds of errors
> are. A scrub of the pool
> completes without noticeable problems, and in a lot of earlier stress
> testing I could
> not get a failure. Disabling NCQ on the controller was necessary.
> What is the practical risk to data?
> 
>        See below info for iostat / syslog
> 
> thanx - steve
> 
>           syslog info
> 
> kern.warning<4>: Aug 13 10:39:10 dfs1 scsi: [ID 243001 kern.warning]
> WARNING: /pci at 0,0/pci8086,340d at 6/pci1000,3080 at 0 (mpt_sas0):
> kern.warning<4>: Aug 13 10:39:10 dfs1 #011mptsas_handle_event_sync:
> IOCStatus=0x8000, IOCLogInfo=0x31120303
> kern.warning<4>: Aug 13 10:39:10 dfs1 scsi: [ID 243001 kern.warning]
> WARNING: /pci at 0,0/pci8086,340d at 6/pci1000,3080 at 0 (mpt_sas0):
> kern.warning<4>: Aug 13 10:39:10 dfs1 #011mptsas_handle_event_sync:
> IOCStatus=0x8000, IOCLogInfo=0x31120436
> kern.warning<4>: Aug 13 10:39:10 dfs1 scsi: [ID 243001 kern.warning]
> WARNING: /pci at 0,0/pci8086,340d at 6/pci1000,3080 at 0 (mpt_sas0):
> kern.warning<4>: Aug 13 10:39:10 dfs1 #011mptsas_handle_event:
> IOCStatus=0x8000, IOCLogInfo=0x31120303
> kern.warning<4>: Aug 13 10:39:10 dfs1 scsi: [ID 243001 kern.warning]
> WARNING: /pci at 0,0/pci8086,340d at 6/pci1000,3080 at 0 (mpt_sas0):

These messages are generated by the device and reported by the mpt_sas
driver. They can be decoded, but unfortunately the illumos driver leaves it as
an exercise for the developer :-(
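For the curious, the 32-bit IOCLogInfo value packs several bit fields. The layout below is an assumption taken from the conventions in LSI's public MPI headers (mpi_log_sas.h); the field names are mine, not the driver's:

```python
# Rough decoder for LSI MPT IOCLogInfo values. Bit layout assumed from
# LSI's public MPI headers (mpi_log_sas.h); field names are assumptions.
LOGINFO_TYPES = {0x0: "NONE", 0x1: "SCSI", 0x2: "FC", 0x3: "SAS", 0x4: "ISCSI"}

def decode_loginfo(loginfo):
    """Split a 32-bit IOCLogInfo into its (assumed) bit fields."""
    return {
        "type":    LOGINFO_TYPES.get(loginfo >> 28, "UNKNOWN"),
        "origin":  (loginfo >> 24) & 0xF,   # originating firmware sub-module
        "code":    (loginfo >> 16) & 0xFF,  # event code
        "subcode": loginfo & 0xFFFF,        # code-specific detail
    }

print(decode_loginfo(0x31120303))
# type "SAS", origin 1, code 0x12, subcode 0x0303
```

Mapping the code/subcode back to a human-readable cause still requires the vendor's header tables.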

> 
> Blah Blah...
> 
> kern.warning<4>: Aug 13 10:39:10 dfs1 #011mptsas_handle_event:
> IOCStatus=0x8000, IOCLogInfo=0x31120436
> kern.info<6>: Aug 13 10:39:11 dfs1 scsi: [ID 365881 kern.info]
> /pci at 0,0/pci8086,340d at 6/pci1000,3080 at 0 (mpt_sas0):
> kern.info<6>: Aug 13 10:39:11 dfs1 #011Log info 0x31120303 received for
> target 13.
> kern.info<6>: Aug 13 10:39:11 dfs1 #011scsi_status=0x0, ioc_status=0x804b,
> scsi_state=0xc
> kern.info<6>: Aug 13 10:39:11 dfs1 scsi: [ID 365881 kern.info]
> /pci at 0,0/pci8086,340d at 6/pci1000,3080 at 0 (mpt_sas0):
> kern.info<6>: Aug 13 10:39:11 dfs1 #011Log info 0x31120303 received for
> target 13.
> kern.info<6>: Aug 13 10:39:11 dfs1 #011scsi_status=0x0, ioc_status=0x804b,
> scsi_state=0xc
> kern.info<6>: Aug 13 10:39:11 dfs1 scsi: [ID 365881 kern.info]
> /pci at 0,0/pci8086,340d at 6/pci1000,3080 at 0 (mpt_sas0):

These (status 0x804b == aborted command) are the result of a device reset.
The fact that you are seeing a reset means an I/O timed out somewhere.
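To unpack that: the high bit of ioc_status is a flag saying extra log info is available; masking it off leaves the base status code, 0x004b, which the MPI2 headers call SCSI_IOC_TERMINATED. A sketch (constant names taken from the MPI2 headers as best I recall them):

```python
# ioc_status as logged by mpt_sas: bit 15 flags that IOCLogInfo is valid;
# the low 15 bits are the status code proper (names per the MPI2 headers).
MPI2_IOCSTATUS_FLAG_LOG_INFO_AVAILABLE = 0x8000
MPI2_IOCSTATUS_MASK = 0x7FFF

IOCSTATUS_NAMES = {
    0x0000: "SUCCESS",
    0x0048: "SCSI_TASK_TERMINATED",
    0x004B: "SCSI_IOC_TERMINATED",   # command aborted by the controller
    0x004C: "SCSI_EXT_TERMINATED",
}

def decode_ioc_status(ioc_status):
    """Return (status name, log-info-available flag) for a raw ioc_status."""
    base = ioc_status & MPI2_IOCSTATUS_MASK
    return (IOCSTATUS_NAMES.get(base, hex(base)),
            bool(ioc_status & MPI2_IOCSTATUS_FLAG_LOG_INFO_AVAILABLE))

print(decode_ioc_status(0x804B))  # ('SCSI_IOC_TERMINATED', True)
```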

> 
>          Output of iostat -En
> 
>         Looks like "Hard Errors" and "No Device" correspond. What
> do "Transport Error" and "Recoverable" mean? I see no evidence
> of data corruption/loss; does ZFS deal with and recover from these
> errors in a
> good/safe way?

Transport errors are things like non-response to command. Recoverable means
that the sd driver can retry.

> 
> c5t5000C500489947A8d0 Soft Errors: 0 Hard Errors: 2 Transport Errors: 11
> Vendor: ATA      Product: ST3000DM001-9YN1 Revision: CC4H Serial No: W1F0AAMA
> Size: 3000.59GB <3000592982016 bytes>
> Media Error: 0 Device Not Ready: 0 No Device: 2 Recoverable: 0
> Illegal Request: 2 Predictive Failure Analysis: 0
> 
> c5t5000C500525EB2B9d0 Soft Errors: 0 Hard Errors: 5 Transport Errors: 46
> Vendor: ATA      Product: ST3000DM001-9YN1 Revision: CC4H Serial No: W1F0QM5H
> Size: 3000.59GB <3000592982016 bytes>
> Media Error: 0 Device Not Ready: 0 No Device: 5 Recoverable: 0
> Illegal Request: 5 Predictive Failure Analysis: 0
> 
> c5t5000C50045561CEAd0 Soft Errors: 0 Hard Errors: 1 Transport Errors: 7
> Vendor: ATA      Product: ST3000DM001-9YN1 Revision: CC4H Serial No: W1F09G4Q
> Size: 3000.59GB <3000592982016 bytes>
> Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 0
> Illegal Request: 1 Predictive Failure Analysis: 0

Something is sick. Unfortunately, it could be any one of the many SATA devices 
causing disruptions for everyone else -- a good reason not to attach
SATA devices directly to SAS expanders. Tracking these down, and ruling out a bug
in the expander itself, is a tedious task. Is there anything in the JBOD other than
the Seagate 3TBs?
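As a first pass at tracking it down, one could rank devices by their transport-error counts from the `iostat -En` output. A toy parser (the line format is assumed from the paste above; real output wraps each device over several lines, but the device name and "Transport Errors:" share the first line):

```python
import re

# Toy parser: rank devices by transport errors from `iostat -En` output.
# Line format assumed from the output pasted above; it may differ elsewhere.
SAMPLE = """\
c5t5000C500489947A8d0 Soft Errors: 0 Hard Errors: 2 Transport Errors: 11
c5t5000C500525EB2B9d0 Soft Errors: 0 Hard Errors: 5 Transport Errors: 46
c5t5000C50045561CEAd0 Soft Errors: 0 Hard Errors: 1 Transport Errors: 7
"""

def worst_offenders(text):
    """Return (transport_errors, device) pairs, worst first."""
    pat = re.compile(r"^(\S+) .*Transport Errors: (\d+)", re.M)
    return sorted(((int(n), dev) for dev, n in pat.findall(text)), reverse=True)

for count, dev in worst_offenders(SAMPLE):
    print(count, dev)
```

A device whose count climbs much faster than its neighbors' is the first suspect, though with a shared expander the noisy counter and the guilty disk are not always the same device.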
 -- richard

-- 

ZFS storage and performance consulting at http://www.RichardElling.com