[OmniOS-discuss] slow drive response times

Kevin Swab Kevin.Swab at ColoState.EDU
Tue Jan 6 20:18:22 UTC 2015


SAS expanders are involved in my systems, so I installed 'sasinfo' and
'smp_utils'.  After a bit of poking around in the dark, I came up with
the following commands which I think get at the error counters you
mentioned.

Unfortunately, I had to remove the "wounded soldier" from this system
since it was causing problems.  This output is from the same slot, but
with a healthy replacement drive:

# sasinfo hba-port -a SUNW-mpt_sas-1 -l
HBA Name: SUNW-mpt_sas-1
  HBA Port Name: /dev/cfg/c7
    Phy Information:
      Identifier: 0
        Link Error Statistics:
            Invalid Dword: 0
            Running Disparity Error: 0
            Loss of Dword Sync: 0
            Reset Problem: 0
      Identifier: 1
        Link Error Statistics:
            Invalid Dword: 0
            Running Disparity Error: 0
            Loss of Dword Sync: 0
            Reset Problem: 0
      Identifier: 2
        Link Error Statistics:
            Invalid Dword: 0
            Running Disparity Error: 0
            Loss of Dword Sync: 0
            Reset Problem: 0
      Identifier: 3
        Link Error Statistics:
            Invalid Dword: 0
            Running Disparity Error: 0
            Loss of Dword Sync: 0
            Reset Problem: 0
  HBA Port Name: /dev/cfg/c8
    Phy Information:
      Identifier: 4
        Link Error Statistics:
            Invalid Dword: 0
            Running Disparity Error: 0
            Loss of Dword Sync: 0
            Reset Problem: 0
      Identifier: 5
        Link Error Statistics:
            Invalid Dword: 0
            Running Disparity Error: 0
            Loss of Dword Sync: 0
            Reset Problem: 0
      Identifier: 6
        Link Error Statistics:
            Invalid Dword: 0
            Running Disparity Error: 0
            Loss of Dword Sync: 0
            Reset Problem: 0
      Identifier: 7
        Link Error Statistics:
            Invalid Dword: 0
            Running Disparity Error: 0
            Loss of Dword Sync: 0



# ./smp_discover /dev/smp/expd9 | egrep '(c982|c983)'
  phy  26:U:attached:[50000394a8cbc982:00  t(SSP)]  6 Gbps
# ./smp_discover /dev/smp/expd11 | egrep '(c982|c983)'
  phy  26:U:attached:[50000394a8cbc983:01  t(SSP)]  6 Gbps
# ./smp_rep_phy_err_log --phy=26 /dev/smp/expd9
Report phy error log response:
  Expander change count: 228
  phy identifier: 26
  invalid dword count: 0
  running disparity error count: 0
  loss of dword synchronization count: 0
  phy reset problem count: 0
# ./smp_rep_phy_err_log --phy=26 /dev/smp/expd11
Report phy error log response:
  Expander change count: 228
  phy identifier: 26
  invalid dword count: 0
  running disparity error count: 0
  loss of dword synchronization count: 0
  phy reset problem count: 0
#

"disparity error count" and "loss of dword sync count" are 0 in all of
this output, in contrast with the non-zero values seen in the sg_logs
output for the "wounded soldier".
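For future reference, one way to watch those counters for increments while a
suspect drive is still installed might be a simple polling loop (a rough,
untested sketch; the expander device and phy number are from the output above):

# while :; do date; ./smp_rep_phy_err_log --phy=26 /dev/smp/expd9 | \
    grep -i count; sleep 60; done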

Am I looking at the right output?  Does "phy" in the above commands
refer to the HDD itself or the port on the expander it's connected to?
Had I been able to run the above commands with the "wounded soldier"
still installed, what should I have been looking for?

Thanks again for your help,
Kevin


On 01/02/2015 03:45 PM, Richard Elling wrote:
> 
>> On Jan 2, 2015, at 1:50 PM, Kevin Swab <Kevin.Swab at colostate.edu> wrote:
>>
>> I've run 'sg_logs' on the drive I pulled last week.  There were a lot of
>> errors in the background scan section of the output, which made it very
>> large, so I put it here:
>>
>> http://pastebin.com/jx5BvSep
>>
>> When I pulled this drive, the SMART health status was OK.  
> 
> SMART isn’t smart :-P
> 
>> However, when
>> I put it in a test system to run 'sg_logs', the status changed to
>> "impending failure...".  Had the SMART status changed before pulling the
>> drive, I'm sure 'fmd' would have alerted me to the problem…
> 
> By default, fmd looks for the predictive failure (PFA) and self-test
> every hour using the disk_transport 
> agent. fmstat should show activity there. When a PFA is seen, then there
> will be an ereport generated
> and, for most cases, a syslog message. However, this will not cause a
> zfs-retire event.
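> For example, to confirm the agent is polling and to look for any disk
> ereports, something like this should work on a stock install (a sketch,
> assuming the default fmd configuration):
> 
>   # fmstat | grep disk-transport
>   # fmdump -eV -c 'ereport.io.scsi.cmd.disk.*'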
> 
> Vendors have significant leeway in how they implement SMART. In my
> experience the only thing
> you can say for sure is if the vendor thinks the drive’s death is
> imminent, then you should replace
> it. I suspect these policies are financially motivated rather than
> scientific… some amount of truthiness
> is to be expected.
> 
> In the logs, clearly the one disk has lots of errors that have been
> corrected and the rate is increasing.
> The rate of change for “Errors corrected with possible delays” may
> correlate to your performance issues,
> but the interpretation is left up to the vendors.
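> (That counter lives in the read/write error counter log pages, so a crude
> way to watch it over time is something like
>   # sg_logs -a /dev/rdsk/<suspect-disk>s0 | grep -i 'possible delays'
> with <suspect-disk> filled in for your device.)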
> 
> In the case of this naughty drive, yep it needs replacing.
> 
>>
>> Since that drive had other indications of trouble, I ran 'sg_logs' on
>> another drive I pulled recently that has a SMART health status of OK,
>> but exhibits similar slow service time behavior:
>>
>> http://pastebin.com/Q0t8Jnug
> 
> This one looks mostly healthy.
> 
> Another place to look for latency issues is the phy logs. In the sg_logs
> output, this is the
> Protocol Specific port log page for SAS SSP. Key values are running
> disparity error 
> count and loss of dword sync count. The trick here is that you need to
> look at both ends
> of the wire for each wire. For a simple case, this means looking at both
> the HBA’s phy error
> counts and the drive’s. If you have expanders in the mix, it is more
> work. You’ll want to look at 
> all of the HBA, expander, and drive phys health counters for all phys.
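> For example, to pull just that page from one drive (0x18 is the Protocol
> Specific port log page; the device name here is illustrative):
> 
>   # sg_logs --page=0x18 /dev/rdsk/c6t5000039468CB54F0d0s0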
> 
> This can get tricky because wide ports are mostly dumb. For example, if
> an HBA has a 4-link
> wide port (common) and one of the links is acting up (all too common)
> the latency impacts 
> will be random.
> 
> To see HBA and expander link health, you can use sg3_utils, its
> companion smp_utils, or
> sasinfo (installed as a separate package from OmniOS, IIRC). For example, 
> sasinfo hba-port -l
> 
> HTH
>  — richard
> 
> 
>>
>> Thanks for taking the time to look at these, please let me know what you
>> find...
>>
>> Kevin
>>
>>
>>
>>
>> On 12/31/2014 06:13 PM, Richard Elling wrote:
>>>
>>>> On Dec 31, 2014, at 4:30 PM, Kevin Swab <Kevin.Swab at ColoState.EDU> wrote:
>>>>
>>>> Hello Richard and group, thanks for your reply!
>>>>
>>>> I'll look into sg_logs for one of these devices once I have a chance to
>>>> track that program down...
>>>>
>>>> Thanks for the tip on the 500 ms latency, I wasn't aware that could
>>>> happen in normal cases.  However, I don't believe what I'm seeing
>>>> constitutes normal behavior.
>>>>
>>>> First, some anecdotal evidence:  If I pull and replace the suspect
>>>> drive, my downstream systems stop complaining, and the high service time
>>>> numbers go away.
>>>
>>> We call these "wounded soldiers"  -- it takes more resources to manage a
>>> wounded soldier than a dead soldier, so one strategy of war is to
>>> wound your
>>> enemy causing them to consume resources tending the wounded. The sg_logs
>>> should be enlightening.
>>>
>>> NB, consider a 4TB disk with 5 platters (hence ten heads/surfaces): if
>>> a head or surface starts to go, then
>>> you have a 1/10 chance that the data you request is under the
>>> damaged head
>>> and will need to be recovered by the drive. So it is not uncommon to see
>>> 90+% of the I/Os to the drive completing quickly. It is also not
>>> unusual to see
>>> only a small number of sectors or tracks affected.
>>>
>>> Detecting these becomes tricky, especially as you reduce the
>>> timeout/retry
>>> interval, since the problem is rarely seen in the average latency --
>>> that which
>>> iostat and sar record. This is an area where we can and are improving.
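>>> For example, a rough DTrace sketch to see the full per-device latency
>>> distribution rather than the average:
>>>
>>>  # dtrace -n 'io:::start { ts[arg0] = timestamp; }
>>>      io:::done /ts[arg0]/ {
>>>          @[args[1]->dev_statname] =
>>>              quantize((timestamp - ts[arg0]) / 1000000);
>>>          ts[arg0] = 0; }'
>>>
>>> Outliers show up as counts in the high-millisecond buckets even when
>>> the average looks fine.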
>>> -- richard
>>>
>>>>
>>>> I threw out 500 ms as a rough guess at the point where I start seeing
>>>> problems.  However, I see service times far in excess of that, sometimes
>>>> over 30,000 ms!  Below is 20 minutes of sar output from a drive I pulled
>>>> a few days ago, during a time when downstream VMWare servers were
>>>> complaining.  (since the sar output is so verbose, I grepped out the
>>>> info just for the suspect drive):
>>>>
>>>> # sar -d -f /var/adm/sa/sa28 -s 14:50 -e 15:10 | egrep '(device|sd91,a)'
>>>> 14:50:00   device        %busy   avque   r+w/s  blks/s  avwait  avserv
>>>>          sd91,a           99     5.3       1      42     0.0  7811.7
>>>>          sd91,a          100    11.3       1      53     0.0 11016.0
>>>>          sd91,a          100     3.8       1      75     0.0  3615.8
>>>>          sd91,a          100     4.9       1      25     0.0  8633.5
>>>>          sd91,a           93     3.9       1      55     0.0  4385.3
>>>>          sd91,a           86     3.5       2      75     0.0  2060.5
>>>>          sd91,a           91     3.1       4      80     0.0   823.8
>>>>          sd91,a           97     3.5       1      50     0.0  3984.5
>>>>          sd91,a          100     4.4       1      56     0.0  6068.6
>>>>          sd91,a          100     5.0       1      55     0.0  8836.0
>>>>          sd91,a          100     5.7       1      51     0.0  7939.6
>>>>          sd91,a           98     9.9       1      42     0.0 12526.8
>>>>          sd91,a          100     7.4       0      10     0.0 36813.6
>>>>          sd91,a           51     3.8       8      90     0.0   500.2
>>>>          sd91,a           88     3.4       1      60     0.0  2338.8
>>>>          sd91,a          100     4.5       1      28     0.0  6969.2
>>>>          sd91,a           93     3.8       1      59     0.0  5138.9
>>>>          sd91,a           79     3.1       1      59     0.0  3143.9
>>>>          sd91,a           99     4.7       1      52     0.0  5598.4
>>>>          sd91,a          100     4.8       1      62     0.0  6638.4
>>>>          sd91,a           94     5.0       1      54     0.0  3752.7
>>>>
>>>> For comparison, here's the sar output from another drive in the same
>>>> pool for the same period of time:
>>>>
>>>> # sar -d -f /var/adm/sa/sa28 -s 14:50 -e 15:10 | egrep '(device|sd82,a)'
>>>> 14:50:00   device        %busy   avque   r+w/s  blks/s  avwait  avserv
>>>>          sd82,a            0     0.0       2      28     0.0     5.6
>>>>          sd82,a            1     0.0       3      51     0.0     5.4
>>>>          sd82,a            1     0.0       4      66     0.0     6.3
>>>>          sd82,a            1     0.0       3      48     0.0     4.3
>>>>          sd82,a            1     0.0       3      45     0.0     6.1
>>>>          sd82,a            1     0.0       6      82     0.0     2.7
>>>>          sd82,a            1     0.0       8     112     0.0     2.8
>>>>          sd82,a            0     0.0       3      27     0.0     1.8
>>>>          sd82,a            1     0.0       5      80     0.0     3.1
>>>>          sd82,a            0     0.0       3      35     0.0     3.1
>>>>          sd82,a            1     0.0       3      35     0.0     3.8
>>>>          sd82,a            1     0.0       4      49     0.0     3.2
>>>>          sd82,a            0     0.0       0       0     0.0     4.1
>>>>          sd82,a            3     0.0       9      84     0.0     4.1
>>>>          sd82,a            1     0.0       6      55     0.0     3.7
>>>>          sd82,a            0     0.0       1      23     0.0     7.0
>>>>          sd82,a            0     0.0       6      57     0.0     1.8
>>>>          sd82,a            1     0.0       5      70     0.0     2.3
>>>>          sd82,a            1     0.0       4      55     0.0     3.7
>>>>          sd82,a            1     0.0       5      72     0.0     4.1
>>>>          sd82,a            1     0.0       4      54     0.0     3.6
>>>>
>>>> The other drives in this pool all show data similar to that of sd82.
>>>>
>>>> Your point about tuning blindly is well taken, and I'm certainly no
>>>> expert on the IO stack.  What's a humble sysadmin to do?
>>>>
>>>> For further reference, this system is running r151010.  The drive in
>>>> question is a Toshiba MG03SCA300 (7200rpm SAS), and the pool the drive
>>>> was in is using lz4 compression and looks like this:
>>>>
>>>> # zpool status data1
>>>> pool: data1
>>>> state: ONLINE
>>>> scan: resilvered 1.67T in 70h56m with 0 errors on Wed Dec 31
>>>> 14:40:20 2014
>>>> config:
>>>>
>>>>       NAME                       STATE     READ WRITE CKSUM
>>>>       data1                      ONLINE       0     0     0
>>>>         raidz2-0                 ONLINE       0     0     0
>>>>           c6t5000039468CB54F0d0  ONLINE       0     0     0
>>>>           c6t5000039478CB5138d0  ONLINE       0     0     0
>>>>           c6t5000039468D000DCd0  ONLINE       0     0     0
>>>>           c6t5000039468D000E8d0  ONLINE       0     0     0
>>>>           c6t5000039468D00F5Cd0  ONLINE       0     0     0
>>>>           c6t5000039478C816CCd0  ONLINE       0     0     0
>>>>           c6t5000039478C8546Cd0  ONLINE       0     0     0
>>>>         raidz2-1                 ONLINE       0     0     0
>>>>           c6t5000039478C855F0d0  ONLINE       0     0     0
>>>>           c6t5000039478C856E8d0  ONLINE       0     0     0
>>>>           c6t5000039478C856ECd0  ONLINE       0     0     0
>>>>           c6t5000039478C856F4d0  ONLINE       0     0     0
>>>>           c6t5000039478C86374d0  ONLINE       0     0     0
>>>>           c6t5000039478C8C2A8d0  ONLINE       0     0     0
>>>>           c6t5000039478C8C364d0  ONLINE       0     0     0
>>>>         raidz2-2                 ONLINE       0     0     0
>>>>           c6t5000039478C9958Cd0  ONLINE       0     0     0
>>>>           c6t5000039478C995C4d0  ONLINE       0     0     0
>>>>           c6t5000039478C9DACCd0  ONLINE       0     0     0
>>>>           c6t5000039478C9DB30d0  ONLINE       0     0     0
>>>>           c6t5000039478C9DB6Cd0  ONLINE       0     0     0
>>>>           c6t5000039478CA73B4d0  ONLINE       0     0     0
>>>>           c6t5000039478CB3A20d0  ONLINE       0     0     0
>>>>         raidz2-3                 ONLINE       0     0     0
>>>>           c6t5000039478CB3A64d0  ONLINE       0     0     0
>>>>           c6t5000039478CB3A70d0  ONLINE       0     0     0
>>>>           c6t5000039478CB3E7Cd0  ONLINE       0     0     0
>>>>           c6t5000039478CB3EB0d0  ONLINE       0     0     0
>>>>           c6t5000039478CB3FBCd0  ONLINE       0     0     0
>>>>           c6t5000039478CB4048d0  ONLINE       0     0     0
>>>>           c6t5000039478CB4054d0  ONLINE       0     0     0
>>>>         raidz2-4                 ONLINE       0     0     0
>>>>           c6t5000039478CB424Cd0  ONLINE       0     0     0
>>>>           c6t5000039478CB4250d0  ONLINE       0     0     0
>>>>           c6t5000039478CB470Cd0  ONLINE       0     0     0
>>>>           c6t5000039478CB471Cd0  ONLINE       0     0     0
>>>>           c6t5000039478CB4E50d0  ONLINE       0     0     0
>>>>           c6t5000039478CB50A8d0  ONLINE       0     0     0
>>>>           c6t5000039478CB50BCd0  ONLINE       0     0     0
>>>>       spares
>>>>         c6t50000394A8CBC93Cd0    AVAIL
>>>>
>>>> errors: No known data errors
>>>>
>>>>
>>>> Thanks for your help,
>>>> Kevin
>>>>
>>>> On 12/31/2014 3:22 PM, Richard Elling wrote:
>>>>>
>>>>>> On Dec 31, 2014, at 11:25 AM, Kevin Swab <Kevin.Swab at colostate.edu> wrote:
>>>>>>
>>>>>> Hello Everyone,
>>>>>>
>>>>>> We've been running OmniOS on a number of SuperMicro 36-bay chassis,
>>>>>> with
>>>>>> Supermicro motherboards, LSI SAS controllers (9211-8i & 9207-8i) and
>>>>>> various SAS HDD's.  These systems are serving block storage via
>>>>>> Comstar
>>>>>> and Qlogic FC HBA's, and have been running well for several years.
>>>>>>
>>>>>> The problem we've got is that as the drives age, some of them start to
>>>>>> perform slowly (intermittently) without failing - no zpool or iostat
>>>>>> errors, and nothing logged in /var/adm/messages.  The slow performance
>>>>>> can be seen as high average service times in iostat or sar.
>>>>>
>>>>> Look at the drive's error logs using sg_logs (-a for all)
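>>>>> For example (device name illustrative):
>>>>>   # sg_logs -a /dev/rdsk/c6t5000039468CB54F0d0s0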
>>>>>
>>>>>>
>>>>>> When these service times get above 500ms, they start to cause IO
>>>>>> timeouts on the downstream storage consumers, which is bad...
>>>>>
>>>>> 500 milliseconds is not unusual for a busy HDD with SCSI TCQ or
>>>>> SATA NCQ
>>>>>
>>>>>>
>>>>>> I'm wondering - is there a way to tune OmniOS' behavior so that it
>>>>>> doesn't try so hard to complete IOs to these slow disks, and instead
>>>>>> just gives up and fails them?
>>>>>
>>>>> Yes, the tuning in Alasdair's blog should work as he describes.
>>>>> More below...
>>>>>
>>>>>>
>>>>>> I found an old post from 2011 which states that some tunables exist,
>>>>>> but are ignored by the mpt_sas driver:
>>>>>>
>>>>>> http://everycity.co.uk/alasdair/2011/05/adjusting-drive-timeouts-with-mdb-on-solaris-or-openindiana/
>>>>>>
>>>>>> Does anyone know the current status of these tunables, or have any
>>>>>> other
>>>>>> suggestions that might help?
>>>>>
>>>>> These tunables are on the order of seconds. The default, 60, is
>>>>> obviously too big
>>>>> unless you have old, slow, SCSI CD-ROMs. But setting it below the
>>>>> manufacturer's
>>>>> internal limit (default or tuned) can lead to an unstable system.
>>>>> Some vendors are
>>>>> better than others at documenting these, but in any case you'll
>>>>> need to see their spec.
>>>>> Expect values on the order of 6 to 15 seconds for modern HDDs and SSDs.
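>>>>> For example, the sd tunable from that blog post can be set at boot or
>>>>> live (the value is illustrative; check your drives' specs first):
>>>>>
>>>>>   set sd:sd_io_time = 15      (in /etc/system, takes effect on reboot)
>>>>>   # echo "sd_io_time/W 0t15" | mdb -kw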
>>>>>
>>>>> There are a lot of tunables in this area at all levels of the
>>>>> architecture. Out of the box, the OmniOS
>>>>> settings ensure stable behaviour. Tuning any layer without
>>>>> understanding the others can
>>>>> lead to unstable systems, as demonstrated by your current
>>>>> downstream consumers.
>>>>> -- richard
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Kevin
>>>>>>
>>>>>>
>>>>
>>
> 
> --
> 
> Richard.Elling at RichardElling.com
> +1-760-896-4422
> 
> 
> 

-- 
-------------------------------------------------------------------
Kevin Swab                          UNIX Systems Administrator
ACNS                                Colorado State University
Phone: (970)491-6572                Email: Kevin.Swab at ColoState.EDU
GPG Fingerprint: 7026 3F66 A970 67BD 6F17  8EB8 8A7D 142F 2392 791C

