[OmniOS-discuss] slow drive response times

Kevin Swab Kevin.Swab at ColoState.EDU
Tue Jan 6 23:25:45 UTC 2015


Thanks!  This has been very educational.  Let me see if I have this
straight:  The zero error counts for the HBA and the expander ports
rule both of those out as the source of the errors seen in the
sg_logs output - is that right?

So back to my original question:  If I see long service times on a
drive, and it shows errors in the drive counters you mentioned, but not
on the expander ports or HBAs, then is it safe to conclude the fault
lies with the drive?

Kevin

On 01/06/2015 02:23 PM, Richard Elling wrote:
> 
>> On Jan 6, 2015, at 12:18 PM, Kevin Swab <Kevin.Swab at ColoState.EDU> wrote:
>>
>> SAS expanders are involved in my systems, so I installed 'sasinfo' and
>> 'smp_utils'.  After a bit of poking around in the dark, I came up with
>> the following commands which I think get at the error counters you
>> mentioned.
> 
> Yes, this data looks fine
> 
>>
>> Unfortunately, I had to remove the "wounded soldier" from this system
>> since it was causing problems.  This output is from the same slot, but
>> with a healthy replacement drive:
>>
>> # sasinfo hba-port -a SUNW-mpt_sas-1 -l
>> HBA Name: SUNW-mpt_sas-1
>>  HBA Port Name: /dev/cfg/c7
>>    Phy Information:
>>      Identifier: 0
>>        Link Error Statistics:
>>            Invalid Dword: 0
>>            Running Disparity Error: 0
>>            Loss of Dword Sync: 0
>>            Reset Problem: 0
>>      Identifier: 1
>>        Link Error Statistics:
>>            Invalid Dword: 0
>>            Running Disparity Error: 0
>>            Loss of Dword Sync: 0
>>            Reset Problem: 0
>>      Identifier: 2
>>        Link Error Statistics:
>>            Invalid Dword: 0
>>            Running Disparity Error: 0
>>            Loss of Dword Sync: 0
>>            Reset Problem: 0
>>      Identifier: 3
>>        Link Error Statistics:
>>            Invalid Dword: 0
>>            Running Disparity Error: 0
>>            Loss of Dword Sync: 0
>>            Reset Problem: 0
> 
> perfect!
> 
>>  HBA Port Name: /dev/cfg/c8
>>    Phy Information:
>>      Identifier: 4
>>        Link Error Statistics:
>>            Invalid Dword: 0
>>            Running Disparity Error: 0
>>            Loss of Dword Sync: 0
>>            Reset Problem: 0
>>      Identifier: 5
>>        Link Error Statistics:
>>            Invalid Dword: 0
>>            Running Disparity Error: 0
>>            Loss of Dword Sync: 0
>>            Reset Problem: 0
>>      Identifier: 6
>>        Link Error Statistics:
>>            Invalid Dword: 0
>>            Running Disparity Error: 0
>>            Loss of Dword Sync: 0
>>            Reset Problem: 0
>>      Identifier: 7
>>        Link Error Statistics:
>>            Invalid Dword: 0
>>            Running Disparity Error: 0
>>            Loss of Dword Sync: 0
> 
> perfect!
> 
>>
>>
>>
>> # ./smp_discover /dev/smp/expd9 | egrep '(c982|c983)'
>>  phy  26:U:attached:[50000394a8cbc982:00  t(SSP)]  6 Gbps
>> # ./smp_discover /dev/smp/expd11 | egrep '(c982|c983)'
>>  phy  26:U:attached:[50000394a8cbc983:01  t(SSP)]  6 Gbps
>> # ./smp_rep_phy_err_log --phy=26 /dev/smp/expd9
>> Report phy error log response:
>>  Expander change count: 228
>>  phy identifier: 26
>>  invalid dword count: 0
>>  running disparity error count: 0
>>  loss of dword synchronization count: 0
>>  phy reset problem count: 0
>> # ./smp_rep_phy_err_log --phy=26 /dev/smp/expd11
>> Report phy error log response:
>>  Expander change count: 228
>>  phy identifier: 26
>>  invalid dword count: 0
>>  running disparity error count: 0
>>  loss of dword synchronization count: 0
>>  phy reset problem count: 0
>> #
>>
>> "disparity error count" and "loss of dword sync count" are 0 in all of
>> this output, in contrast with the non-zero values seen in the sg_logs
>> output for the "wounded soldier".
> 
> perfect!
> 
>>
>> Am I looking at the right output?
> 
> Yes, this is not showing any errors, which is a good thing.
> 
>>  Does "phy" in the above commands
>> refer to the HDD itself or the port on the expander it's connected to?
> 
> Expander port. The HDD's view is in the sg_logs --page=0x18 /dev/rdsk/...
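>
> For example (the device path here is just a placeholder for the
> suspect disk's /dev/rdsk entry):
>
> # sg_logs --page=0x18 /dev/rdsk/cXtWWNd0s0
>
> and compare the invalid dword / running disparity / loss of dword sync
> counts there against the expander phy counters above.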
> 
>> Had I been able to run the above commands with the "wounded soldier"
>> still installed, what should I have been looking for?
> 
> The process is to rule out errors. You have succeeded.
>  -- richard
> 
>>
>> Thanks again for your help,
>> Kevin
>>
>>
>> On 01/02/2015 03:45 PM, Richard Elling wrote:
>>>
>>>> On Jan 2, 2015, at 1:50 PM, Kevin Swab <Kevin.Swab at colostate.edu
>>>> <mailto:Kevin.Swab at colostate.edu>> wrote:
>>>>
>>>> I've run 'sg_logs' on the drive I pulled last week.  There were a lot of
>>>> errors in the background scan section of the output, which made it very
>>>> large, so I put it here:
>>>>
>>>> http://pastebin.com/jx5BvSep
>>>>
>>>> When I pulled this drive, the SMART health status was OK.  
>>>
>>> SMART isn’t smart :-P
>>>
>>>> However, when
>>>> I put it in a test system to run 'sg_logs', the status changed to
>>>> "impending failure...".  Had the SMART status changed before pulling the
>>>> drive, I'm sure 'fmd' would have alerted me to the problem…
>>>
>>> By default, fmd looks for the predictive failure (PFA) and self-test
>>> every hour using the disk_transport 
>>> agent. fmstat should show activity there. When a PFA is seen, then there
>>> will be an ereport generated
>>> and, for most cases, a syslog message. However, this will not cause a
>>> zfs-retire event.
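>>>
>>> For example, a quick check (the module name here is how it appears on
>>> a stock install, so verify against a plain fmstat listing):
>>>
>>> # fmstat -m disk-transport
>>>
>>> should show the agent and its activity.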
>>>
>>> Vendors have significant leeway in how they implement SMART. In my
>>> experience the only thing
>>> you can say for sure is if the vendor thinks the drive’s death is
>>> imminent, then you should replace
>>> it. I suspect these policies are financially motivated rather than
>>> scientific… some amount of truthiness
>>> is to be expected.
>>>
>>> In the logs, clearly the one disk has lots of errors that have been
>>> corrected and the rate is increasing.
>>> The rate of change for “Errors corrected with possible delays” may
>>> correlate to your performance issues,
>>> but the interpretation is left up to the vendors.
>>>
>>> In the case of this naughty drive, yep it needs replacing.
>>>
>>>>
>>>> Since that drive had other indications of trouble, I ran 'sg_logs' on
>>>> another drive I pulled recently that has a SMART health status of OK,
>>>> but exhibits similar slow service time behavior:
>>>>
>>>> http://pastebin.com/Q0t8Jnug
>>>
>>> This one looks mostly healthy.
>>>
>>> Another place to look for latency issues is the phy logs. In the sg_logs
>>> output, this is the
>>> Protocol Specific port log page for SAS SSP. Key values are running
>>> disparity error 
>>> count and loss of dword sync count. The trick here is that you need to
>>> look at both ends
>>> of each link. For a simple case, this means looking at both
>>> the HBA’s phy error
>>> counts and the drive’s. If you have expanders in the mix, it is more
>>> work. You’ll want to look at
>>> all of the HBA, expander, and drive phy health counters for all phys.
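>>>
>>> A quick way to sweep all of an expander's phys (a sketch: it assumes
>>> seq is available, at most 36 phys, and the right /dev/smp device name
>>> substituted in):
>>>
>>> # for p in $(seq 0 35); do smp_rep_phy_err_log --phy=$p /dev/smp/expd0; done
>>>
>>> then gather the same counters from sasinfo for the HBA side and from
>>> sg_logs --page=0x18 for each drive.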
>>>
>>> This can get tricky because wide ports are mostly dumb. For example, if
>>> an HBA has a 4-link
>>> wide port (common) and one of the links is acting up (all too common)
>>> the latency impacts 
>>> will be random.
>>>
>>> To see HBA and expander link health, you can use sg3_utils, its
>>> companion smp_utils, or
>>> sasinfo (installed as a separate package from OmniOS, IIRC). For example, 
>>> sasinfo hba-port -l
>>>
>>> HTH
>>> — richard
>>>
>>>
>>>>
>>>> Thanks for taking the time to look at these, please let me know what you
>>>> find...
>>>>
>>>> Kevin
>>>>
>>>>
>>>>
>>>>
>>>> On 12/31/2014 06:13 PM, Richard Elling wrote:
>>>>>
>>>>>> On Dec 31, 2014, at 4:30 PM, Kevin Swab <Kevin.Swab at ColoState.EDU
>>>>>> <mailto:Kevin.Swab at ColoState.EDU>> wrote:
>>>>>>
>>>>>> Hello Richard and group, thanks for your reply!
>>>>>>
>>>>>> I'll look into sg_logs for one of these devices once I have a chance to
>>>>>> track that program down...
>>>>>>
>>>>>> Thanks for the tip on the 500 ms latency, I wasn't aware that could
>>>>>> happen in normal cases.  However, I don't believe what I'm seeing
>>>>>> constitutes normal behavior.
>>>>>>
>>>>>> First, some anecdotal evidence:  If I pull and replace the suspect
>>>>>> drive, my downstream systems stop complaining, and the high service time
>>>>>> numbers go away.
>>>>>
>>>>> We call these "wounded soldiers"  -- it takes more resources to manage a
>>>>> wounded soldier than a dead soldier, so one strategy of war is to
>>>>> wound your
>>>>> enemy causing them to consume resources tending the wounded. The sg_logs
>>>>> should be enlightening.
>>>>>
>>>>> NB, consider a 4TB disk with 5 platters: if a head or surface starts
>>>>> to go, then
>>>>> you have a 1/10 chance that the data you request is under the
>>>>> damaged head
>>>>> and will need to be recovered by the drive. So it is not uncommon to see
>>>>> 90+% of the I/Os to the drive completing quickly. It is also not
>>>>> unusual to see
>>>>> only a small number of sectors or tracks affected.
>>>>>
>>>>> Detecting these becomes tricky, especially as you reduce the
>>>>> timeout/retry
>>>>> interval, since the problem is rarely seen in the average latency --
>>>>> that which
>>>>> iostat and sar record. This is an area where we can and are improving.
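>>>>>
>>>>> One way to see the per-I/O latency distribution rather than the
>>>>> average is a DTrace sketch along these lines (illustrative only,
>>>>> not a polished script):
>>>>>
>>>>> # dtrace -n 'io:::start { ts[arg0] = timestamp; }
>>>>>   io:::done /ts[arg0]/ {
>>>>>     @[args[1]->dev_statname] = quantize(timestamp - ts[arg0]);
>>>>>     ts[arg0] = 0; }'
>>>>>
>>>>> A wounded soldier tends to show a long tail in its histogram even
>>>>> while its average still looks tolerable.
>>>>>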
>>>>> -- richard
>>>>>
>>>>>>
>>>>>> I threw out 500 ms as a rough guess at the point where I start seeing
>>>>>> problems.  However, I see service times far in excess of that, sometimes
>>>>>> over 30,000 ms!  Below is 20 minutes of sar output from a drive I pulled
>>>>>> a few days ago, during a time when downstream VMWare servers were
>>>>>> complaining.  (since the sar output is so verbose, I grepped out the
>>>>>> info just for the suspect drive):
>>>>>>
>>>>>> # sar -d -f /var/adm/sa/sa28 -s 14:50 -e 15:10 | egrep '(device|sd91,a)'
>>>>>> 14:50:00   device        %busy   avque   r+w/s  blks/s  avwait  avserv
>>>>>>         sd91,a           99     5.3       1      42     0.0  7811.7
>>>>>>         sd91,a          100    11.3       1      53     0.0 11016.0
>>>>>>         sd91,a          100     3.8       1      75     0.0  3615.8
>>>>>>         sd91,a          100     4.9       1      25     0.0  8633.5
>>>>>>         sd91,a           93     3.9       1      55     0.0  4385.3
>>>>>>         sd91,a           86     3.5       2      75     0.0  2060.5
>>>>>>         sd91,a           91     3.1       4      80     0.0   823.8
>>>>>>         sd91,a           97     3.5       1      50     0.0  3984.5
>>>>>>         sd91,a          100     4.4       1      56     0.0  6068.6
>>>>>>         sd91,a          100     5.0       1      55     0.0  8836.0
>>>>>>         sd91,a          100     5.7       1      51     0.0  7939.6
>>>>>>         sd91,a           98     9.9       1      42     0.0 12526.8
>>>>>>         sd91,a          100     7.4       0      10     0.0 36813.6
>>>>>>         sd91,a           51     3.8       8      90     0.0   500.2
>>>>>>         sd91,a           88     3.4       1      60     0.0  2338.8
>>>>>>         sd91,a          100     4.5       1      28     0.0  6969.2
>>>>>>         sd91,a           93     3.8       1      59     0.0  5138.9
>>>>>>         sd91,a           79     3.1       1      59     0.0  3143.9
>>>>>>         sd91,a           99     4.7       1      52     0.0  5598.4
>>>>>>         sd91,a          100     4.8       1      62     0.0  6638.4
>>>>>>         sd91,a           94     5.0       1      54     0.0  3752.7
>>>>>>
>>>>>> For comparison, here's the sar output from another drive in the same
>>>>>> pool for the same period of time:
>>>>>>
>>>>>> # sar -d -f /var/adm/sa/sa28 -s 14:50 -e 15:10 | egrep '(device|sd82,a)'
>>>>>> 14:50:00   device        %busy   avque   r+w/s  blks/s  avwait  avserv
>>>>>>         sd82,a            0     0.0       2      28     0.0     5.6
>>>>>>         sd82,a            1     0.0       3      51     0.0     5.4
>>>>>>         sd82,a            1     0.0       4      66     0.0     6.3
>>>>>>         sd82,a            1     0.0       3      48     0.0     4.3
>>>>>>         sd82,a            1     0.0       3      45     0.0     6.1
>>>>>>         sd82,a            1     0.0       6      82     0.0     2.7
>>>>>>         sd82,a            1     0.0       8     112     0.0     2.8
>>>>>>         sd82,a            0     0.0       3      27     0.0     1.8
>>>>>>         sd82,a            1     0.0       5      80     0.0     3.1
>>>>>>         sd82,a            0     0.0       3      35     0.0     3.1
>>>>>>         sd82,a            1     0.0       3      35     0.0     3.8
>>>>>>         sd82,a            1     0.0       4      49     0.0     3.2
>>>>>>         sd82,a            0     0.0       0       0     0.0     4.1
>>>>>>         sd82,a            3     0.0       9      84     0.0     4.1
>>>>>>         sd82,a            1     0.0       6      55     0.0     3.7
>>>>>>         sd82,a            0     0.0       1      23     0.0     7.0
>>>>>>         sd82,a            0     0.0       6      57     0.0     1.8
>>>>>>         sd82,a            1     0.0       5      70     0.0     2.3
>>>>>>         sd82,a            1     0.0       4      55     0.0     3.7
>>>>>>         sd82,a            1     0.0       5      72     0.0     4.1
>>>>>>         sd82,a            1     0.0       4      54     0.0     3.6
>>>>>>
>>>>>> The other drives in this pool all show data similar to that of sd82.
>>>>>>
>>>>>> Your point about tuning blindly is well taken, and I'm certainly no
>>>>>> expert on the IO stack.  What's a humble sysadmin to do?
>>>>>>
>>>>>> For further reference, this system is running r151010.  The drive in
>>>>>> question is a Toshiba MG03SCA300 (7200rpm SAS), and the pool the drive
>>>>>> was in is using lz4 compression and looks like this:
>>>>>>
>>>>>> # zpool status data1
>>>>>> pool: data1
>>>>>> state: ONLINE
>>>>>> scan: resilvered 1.67T in 70h56m with 0 errors on Wed Dec 31
>>>>>> 14:40:20 2014
>>>>>> config:
>>>>>>
>>>>>>      NAME                       STATE     READ WRITE CKSUM
>>>>>>      data1                      ONLINE       0     0     0
>>>>>>        raidz2-0                 ONLINE       0     0     0
>>>>>>          c6t5000039468CB54F0d0  ONLINE       0     0     0
>>>>>>          c6t5000039478CB5138d0  ONLINE       0     0     0
>>>>>>          c6t5000039468D000DCd0  ONLINE       0     0     0
>>>>>>          c6t5000039468D000E8d0  ONLINE       0     0     0
>>>>>>          c6t5000039468D00F5Cd0  ONLINE       0     0     0
>>>>>>          c6t5000039478C816CCd0  ONLINE       0     0     0
>>>>>>          c6t5000039478C8546Cd0  ONLINE       0     0     0
>>>>>>        raidz2-1                 ONLINE       0     0     0
>>>>>>          c6t5000039478C855F0d0  ONLINE       0     0     0
>>>>>>          c6t5000039478C856E8d0  ONLINE       0     0     0
>>>>>>          c6t5000039478C856ECd0  ONLINE       0     0     0
>>>>>>          c6t5000039478C856F4d0  ONLINE       0     0     0
>>>>>>          c6t5000039478C86374d0  ONLINE       0     0     0
>>>>>>          c6t5000039478C8C2A8d0  ONLINE       0     0     0
>>>>>>          c6t5000039478C8C364d0  ONLINE       0     0     0
>>>>>>        raidz2-2                 ONLINE       0     0     0
>>>>>>          c6t5000039478C9958Cd0  ONLINE       0     0     0
>>>>>>          c6t5000039478C995C4d0  ONLINE       0     0     0
>>>>>>          c6t5000039478C9DACCd0  ONLINE       0     0     0
>>>>>>          c6t5000039478C9DB30d0  ONLINE       0     0     0
>>>>>>          c6t5000039478C9DB6Cd0  ONLINE       0     0     0
>>>>>>          c6t5000039478CA73B4d0  ONLINE       0     0     0
>>>>>>          c6t5000039478CB3A20d0  ONLINE       0     0     0
>>>>>>        raidz2-3                 ONLINE       0     0     0
>>>>>>          c6t5000039478CB3A64d0  ONLINE       0     0     0
>>>>>>          c6t5000039478CB3A70d0  ONLINE       0     0     0
>>>>>>          c6t5000039478CB3E7Cd0  ONLINE       0     0     0
>>>>>>          c6t5000039478CB3EB0d0  ONLINE       0     0     0
>>>>>>          c6t5000039478CB3FBCd0  ONLINE       0     0     0
>>>>>>          c6t5000039478CB4048d0  ONLINE       0     0     0
>>>>>>          c6t5000039478CB4054d0  ONLINE       0     0     0
>>>>>>        raidz2-4                 ONLINE       0     0     0
>>>>>>          c6t5000039478CB424Cd0  ONLINE       0     0     0
>>>>>>          c6t5000039478CB4250d0  ONLINE       0     0     0
>>>>>>          c6t5000039478CB470Cd0  ONLINE       0     0     0
>>>>>>          c6t5000039478CB471Cd0  ONLINE       0     0     0
>>>>>>          c6t5000039478CB4E50d0  ONLINE       0     0     0
>>>>>>          c6t5000039478CB50A8d0  ONLINE       0     0     0
>>>>>>          c6t5000039478CB50BCd0  ONLINE       0     0     0
>>>>>>      spares
>>>>>>        c6t50000394A8CBC93Cd0    AVAIL
>>>>>>
>>>>>> errors: No known data errors
>>>>>>
>>>>>>
>>>>>> Thanks for your help,
>>>>>> Kevin
>>>>>>
>>>>>> On 12/31/2014 3:22 PM, Richard Elling wrote:
>>>>>>>
>>>>>>>> On Dec 31, 2014, at 11:25 AM, Kevin Swab <Kevin.Swab at colostate.edu
>>>>>>>> <mailto:Kevin.Swab at colostate.edu>> wrote:
>>>>>>>>
>>>>>>>> Hello Everyone,
>>>>>>>>
>>>>>>>> We've been running OmniOS on a number of SuperMicro 36bay chassis,
>>>>>>>> with
>>>>>>>> Supermicro motherboards, LSI SAS controllers (9211-8i & 9207-8i) and
>>>>>>>> various SAS HDD's.  These systems are serving block storage via
>>>>>>>> Comstar
>>>>>>>> and Qlogic FC HBA's, and have been running well for several years.
>>>>>>>>
>>>>>>>> The problem we've got is that as the drives age, some of them start to
>>>>>>>> perform slowly (intermittently) without failing - no zpool or iostat
>>>>>>>> errors, and nothing logged in /var/adm/messages.  The slow performance
>>>>>>>> can be seen as high average service times in iostat or sar.
>>>>>>>
>>>>>>> Look at the drive's error logs using sg_logs (-a for all)
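>>>>>>>
>>>>>>> e.g., with the suspect disk's /dev/rdsk path filled in for the
>>>>>>> placeholder below:
>>>>>>>
>>>>>>> # sg_logs -a /dev/rdsk/cXtWWNd0s0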
>>>>>>>
>>>>>>>>
>>>>>>>> When these service times get above 500ms, they start to cause IO
>>>>>>>> timeouts on the downstream storage consumers, which is bad...
>>>>>>>
>>>>>>> 500 milliseconds is not unusual for a busy HDD with SCSI TCQ or
>>>>>>> SATA NCQ
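>>>>>>>
>>>>>>> Watching asvc_t per device over a few intervals, for example with
>>>>>>>
>>>>>>> # iostat -xnz 5
>>>>>>>
>>>>>>> is an easy way to see whether the high service times are sustained
>>>>>>> on one drive or just occasional queueing across the pool.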
>>>>>>>
>>>>>>>>
>>>>>>>> I'm wondering - is there a way to tune OmniOS' behavior so that it
>>>>>>>> doesn't try so hard to complete IOs to these slow disks, and instead
>>>>>>>> just gives up and fails them?
>>>>>>>
>>>>>>> Yes, the tuning in Alasdair's blog should work as he describes.
>>>>>>> More below...
>>>>>>>
>>>>>>>>
>>>>>>>> I found an old post from 2011 which states that some tunables exist,
>>>>>>>> but are ignored by the mpt_sas driver:
>>>>>>>>
>>>>>>>> http://everycity.co.uk/alasdair/2011/05/adjusting-drive-timeouts-with-mdb-on-solaris-or-openindiana/
>>>>>>>>
>>>>>>>> Does anyone know the current status of these tunables, or have any
>>>>>>>> other
>>>>>>>> suggestions that might help?
>>>>>>>
>>>>>>> These tunables are on the order of seconds. The default, 60, is
>>>>>>> obviously too big
>>>>>>> unless you have old, slow, SCSI CD-ROMs. But setting it below the
>>>>>>> manufacturer's
>>>>>>> internal limit (default or tuned) can lead to an unstable system.
>>>>>>> Some vendors are
>>>>>>> better than others at documenting these, but in any case you'll
>>>>>>> need to see their spec.
>>>>>>> Expect values on the order of 6 to 15 seconds for modern HDDs and SSDs.
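>>>>>>>
>>>>>>> (For reference, a sketch of how that kind of timeout is usually
>>>>>>> adjusted, assuming the tunable in play is sd's sd_io_time, whose
>>>>>>> 60-second default matches the description above:
>>>>>>>
>>>>>>> # echo "sd_io_time/W 0t30" | mdb -kw
>>>>>>>
>>>>>>> for a live change, or set sd:sd_io_time = 30 in /etc/system to make
>>>>>>> it persistent. The value 30 is only an example; keep it above the
>>>>>>> drive's own internal limit per the caveats above.)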
>>>>>>>
>>>>>>> There are a lot of tunables in this area at all levels of the
>>>>>>> architecture. OOB, the OmniOS
>>>>>>> settings ensure stable behaviour. Tuning any layer without
>>>>>>> understanding the others can
>>>>>>> lead to unstable systems, as demonstrated by your current
>>>>>>> downstream consumers.
>>>>>>> -- richard
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kevin
>>>>>>>>
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> -------------------------------------------------------------------
>>>>>>>> Kevin Swab                          UNIX Systems Administrator
>>>>>>>> ACNS                                Colorado State University
>>>>>>>> Phone: (970)491-6572                Email:
>>>>>>>> Kevin.Swab at ColoState.EDU <mailto:Kevin.Swab at ColoState.EDU>
>>>>>>>> GPG Fingerprint: 7026 3F66 A970 67BD 6F17  8EB8 8A7D 142F 2392 791C
>>>>>>>> _______________________________________________
>>>>>>>> OmniOS-discuss mailing list
>>>>>>>> OmniOS-discuss at lists.omniti.com
>>>>>>>> <mailto:OmniOS-discuss at lists.omniti.com>
>>>>>>>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>>>>>>
>>>>
>>>> -- 
>>>> -------------------------------------------------------------------
>>>> Kevin Swab                          UNIX Systems Administrator
>>>> ACNS                                Colorado State University
>>>> Phone: (970)491-6572                Email: Kevin.Swab at ColoState.EDU
>>>> <mailto:Kevin.Swab at ColoState.EDU>
>>>> GPG Fingerprint: 7026 3F66 A970 67BD 6F17  8EB8 8A7D 142F 2392 791C
>>>
>>> --
>>>
>>> Richard.Elling at RichardElling.com <mailto:Richard.Elling at RichardElling.com>
>>> +1-760-896-4422
>>>
>>>
>>>
>>
>> -- 
>> -------------------------------------------------------------------
>> Kevin Swab                          UNIX Systems Administrator
>> ACNS                                Colorado State University
>> Phone: (970)491-6572                Email: Kevin.Swab at ColoState.EDU
>> GPG Fingerprint: 7026 3F66 A970 67BD 6F17  8EB8 8A7D 142F 2392 791C

-- 
-------------------------------------------------------------------
Kevin Swab                          UNIX Systems Administrator
ACNS                                Colorado State University
Phone: (970)491-6572                Email: Kevin.Swab at ColoState.EDU
GPG Fingerprint: 7026 3F66 A970 67BD 6F17  8EB8 8A7D 142F 2392 791C

