[OmniOS-discuss] Array returning queue full (SCSI 0x28) running Veeam jobs

wuffers moo at wuffers.net
Wed Jun 24 14:43:22 UTC 2015


After introducing Veeam into our environment (where this SAN has been the
backend for both Hyper-V and VMware hosts for 2+ years with minimal issues),
I've been dealing with a new intermittent issue. SAN traffic is presented by
COMSTAR over IB/SRP, and I am currently running r151012.

What happens is that when my Veeam backup jobs run (at this point I had
one big job of ~60 VMs), latency on the datastore holding the repository
deteriorates. When I first saw this, it seemed to affect only the VM
running the Veeam proxy/repository (esxtop was only showing high QAVG on
that VM). I would vMotion this VM to another host and everything would work
normally again (and I would put the original host into maintenance mode and
reboot it to get it out of this "state"). This suggested that it was not a
SAN issue but something isolated to the VM and/or host.

After updating to the latest Veeam B&R version (update 2), the issue
persisted. Then I updated to vSphere 5.5 update 2e, and the issue went away
for about 2 weeks (and my Veeam incident was closed). At that point, my
backup job tried to merge the incremental backup back into the full, and
that's when the issue reappeared. I was able to vMotion the VM to another
host and get the job completed. I also noticed that after update 2e the
latency appeared on any datastore the VM had access to (had hot-added), and
CrystalDiskMark (CDM) benchmarks subsequently showed that every VM on the
affected host was hit.

I reopened the case at Veeam and took their recommendation to split the
job up into several smaller jobs. This seemed to work for a few days, until
I caught it happening again. This time I opened a case with VMware - and
while I was on hold, the server got out of the condition by itself (no more
latency - I was actually in the middle of running CrystalDiskMark benchmarks
on a few VMs, on and off the affected host).

The VMware engineer took a look at the vmkernel logs and highlighted the
following entries:

2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1
:: 1] Abort cmd[2a] SN 0x9276395 req=7a done=0  tsk_status=255
req_lim=73 tx_head=106811312 tx_tail=106811184

2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1
:: 1] Abort cmd[2a] SN 0x927639a req=65 on lun 0 state 0

2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1
:: 1] Abort cmd[2a] SN 0x927639a req=65 done=0  tsk_status=255
req_lim=72 tx_head=106811313 tx_tail=106811185

2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1
:: 1] Abort cmd[2a] SN 0x9276396 req=4f on lun 11 state 0

2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1
:: 1] Abort cmd[2a] SN 0x9276396 req=4f done=0  tsk_status=255
req_lim=71 tx_head=106811314 tx_tail=106811186

2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1
:: 1] Abort cmd[2a] SN 0x9276397 req=36 on lun 10 state 0

2015-06-23T16:11:08.016Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1
:: 1] Abort cmd[2a] SN 0x9276399 req=31 on lun 6 state 0





2015-06-23T16:05:16.110Z cpu26:32839)ScsiDeviceIO: 2325:
Cmd(0x4136c28516c0) 0x88, CmdSN 0x8000001a from world 35997 to dev
"naa.600144f09084e251000052919b460003" failed H:0x0 D:0x8 P:0x0
Possible sense data: 0x0 0x0 0x0.

2015-06-23T16:05:18.253Z cpu20:36138)ScsiDeviceIO: 2325:
Cmd(0x4136c0428d40) 0x2a, CmdSN 0xa389000 from world 35970 to dev
"naa.600144f09084e251000052919b560004" failed H:0x0 D:0x8 P:0x0
Possible sense data: 0x0 0x0 0x0.

2015-06-23T16:05:19.382Z cpu8:32821)ScsiDeviceIO: 2325:
Cmd(0x412ece968480) 0x28, CmdSN 0x80000038 from world 36101 to dev
"naa.600144f09084e251000052919b460003" failed H:0x0 D:0x8 P:0x0
Possible sense data: 0x0 0x0 0x0.





2015-06-23T16:15:06.963Z cpu24:33485)<7>ib_srp: TGT[vmhba_mlx4_0.1.1
:: 1] New target id_ext=0x0002c903000bc0bc ioc_guid=0x0002c903000bc0bc
pkey=0xffff service_id=0x0002c903000bc0bc
dgid=fe80:0000:0000:0000:0002:c903:000b:c0bd - req_lim 128

2015-06-23T16:15:06.963Z cpu23:33475)<7>ib_srp: TGT[vmhba_mlx4_0.1.1
:: 1] ALIVE!!!

2015-06-23T16:15:06.972Z cpu15:33730)NMP:
nmp_ThrottleLogForDevice:2322: Cmd 0x16 (0x412ecb3737c0, 0) to dev
"naa.600144f09084e25100005286cf240003" on path
"vmhba_mlx4_0.1.1:C0:T1:L2" Failed: H:0x0 D:0x2 P:0x0 Valid sense
data: 0x6 0x29 0x0. Act:NONE

2015-06-23T16:15:06.972Z cpu15:33730)ScsiDeviceIO: 2325:
Cmd(0x412ecb3737c0) 0x16, CmdSN 0x9f1f from world 0 to dev
"naa.600144f09084e25100005286cf240003" failed H:0x0 D:0x2 P:0x0
Valid sense data: 0x6 0x29 0x0.

2015-06-23T16:15:06.972Z cpu15:33730)ScsiCore: 1486: Power-on Reset
occurred on naa.600144f09084e25100005286cf240003





2015-06-23T16:15:05.029Z cpu6:34170)World: 14302: VC opID hostd-8fa8
maps to vmkernel opID ba341ef0

2015-06-23T16:15:06.941Z cpu13:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1
:: 1] Abort cmd[12] SN 0x92766c9 req=63 on lun 6 state 0

2015-06-23T16:15:06.941Z cpu13:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1
:: 1] active_abort 1 lim 1 used_req 126 tag=0000000000000163
tsk_tag=0000000000000063

2015-06-23T16:15:06.941Z cpu13:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1
:: 1] Abort cmd[12] SN 0x92766c9 req=63 done=0  tsk_status=255
req_lim=0 tx_head=106811505 tx_tail=106811377

2015-06-23T16:15:06.941Z cpu13:33135)WARNING: LinScsi:
SCSILinuxAbortCommands:1843: Failed, Driver ib_srp, for
vmhba_mlx4_0.1.1

2015-06-23T16:15:06.959Z cpu4:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 ::
1] Abort cmd[1a] SN 0x92766f2 req=5a on lun 10 state 0

2015-06-23T16:15:06.959Z cpu4:33135)<3>ib_srp: TGT[vmhba_mlx4_0.1.1 ::
1] tsk_mgmt no TX_IU head 106811505 tail 106811378 lim 0

2015-06-23T16:15:06.959Z cpu4:33135)WARNING: LinScsi:
SCSILinuxAbortCommands:1843: Failed, Driver ib_srp, for
vmhba_mlx4_0.1.1

2015-06-23T16:15:06.959Z cpu21:33475)<7>ib_srp: TGT[vmhba_mlx4_0.1.1
:: 1] Sending DREQ cm_id 0x410b54a06d00 qp_num 0x55

2015-06-23T16:15:06.960Z cpu25:36234)NMP:
nmp_ThrottleLogForDevice:2322: Cmd 0x12 (0x4136c4245a40, 0) to dev
"naa.600144f09084e251000052919b340002" on path
"vmhba_mlx4_0.1.1:C0:T1:L6" Failed: H:0x8 D:0x0 P:0x0 Possible sense
data: 0x0 0x0 0x0. Act:EVAL


Of particular interest is the 0x28 SCSI device status returned by the SAN:

VMK_SCSI_DEVICE_QUEUE_FULL (TASK SET FULL) = 0x28

This status is returned when the LUN stops accepting SCSI commands
from initiators due to a lack of resources, namely the queue depth on
the array.

(outlined in this KB:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1030381)
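
Something like this rough sketch is probably enough to show how often each
device status actually comes back per LUN (it's my own hack, nothing from
the KB, and it assumes the normal one-line-per-entry vmkernel.log format
rather than the mail-wrapped excerpts above):

#!/usr/bin/env python3
# Rough sketch: tally the D: (device) status byte from the ScsiDeviceIO /
# NMP "failed H:0x.. D:0x.. P:0x.." lines so BUSY (0x8) and
# TASK SET FULL (0x28) responses from the array stand out per LUN.
import re
import sys
from collections import Counter

# Standard SCSI status codes (per SPC / the VMware KB above).
STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY",
          0x18: "RESERVATION CONFLICT", 0x28: "TASK SET FULL"}

LINE = re.compile(r'to dev\s+"(naa\.[0-9a-f]+)"\s+failed\s+'
                  r'H:(0x[0-9a-f]+)\s+D:(0x[0-9a-f]+)\s+P:(0x[0-9a-f]+)', re.I)

counts = Counter()
for line in sys.stdin:            # e.g. ./scsi_status.py < vmkernel.log
    m = LINE.search(line)
    if m:
        dev, _host, dstat, _plugin = m.groups()
        counts[(dev, STATUS.get(int(dstat, 16), dstat))] += 1

for (dev, status), n in counts.most_common():
    print(f"{n:6d}  {status:<20}  {dev}")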


The engineer said that under load the SAN cannot service the requests
and gets into this state. He asked me to check with the storage vendor
(hah!) and compare the corresponding timestamps to see what is
happening on the array side. What doesn't make sense to me is that it
got into this state at around 3am (that's when performance dropped,
with no load other than the backup running, which had already been
going for 6+ hours at that point), and then magically recovered at
around noon (9+ hours later). And it's isolated to the host rather than
a single VM, because other VMs on that host experience the same thing
(although you can't really see it in esxtop until you do sequential
reads/writes).
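
To line those timestamps up with whatever the array side logged, a
per-minute count of the failed commands should be enough to show the ~3am
onset and the ~noon recovery. Again, just a sketch under the same
one-line-per-entry assumption:

#!/usr/bin/env python3
# Sketch: bucket non-GOOD command completions per minute so the window
# where the array was refusing work can be handed to the storage side
# with concrete timestamps.
import re
import sys
from collections import Counter

TS = re.compile(r'^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})')        # YYYY-MM-DDTHH:MM
FAIL = re.compile(r'failed\s+H:0x[0-9a-f]+\s+D:(0x[0-9a-f]+)', re.I)

per_minute = Counter()
for line in sys.stdin:
    ts = TS.match(line)
    fail = FAIL.search(line)
    if ts and fail and int(fail.group(1), 16) != 0:          # skip D:0x0 (GOOD)
        per_minute[ts.group(1)] += 1

for minute in sorted(per_minute):
    print(f"{minute}  {per_minute[minute]:5d} failed commands")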

I've lowered the max concurrent tasks for the Veeam proxy (from 4 to
3) for now, and will try to change the transport mode as suggested by
the Veeam tech (to see if it's a hot-add thing).

Is there a tunable I can check/change, or some other troubleshooting I
can do? Or is this one of those "Oh, it's COMSTAR.. YMMV" answers? :)
What is strange is that after updating vSphere my initial "big" job ran
fine, whereas previously it would consistently fail after a number of
hours. Is it a driver thing (I'm running Mellanox 1.8.2, the latest
that supports SRP)?
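
And in case it helps whoever ends up looking at the SRP/COMSTAR side: the
ib_srp messages above carry a req_lim value, which I understand to be the
SRP flow-control credits granted by the target. A trivial script can pull
those out to show how quickly they drain; just a sketch, and it only relies
on the req_lim=NN text shown in the excerpts above:

#!/usr/bin/env python3
# Sketch: print every req_lim value ib_srp logs, tagged with the most
# recently seen timestamp (so it still works on mail-wrapped excerpts
# where req_lim= lands on a continuation line).
import re
import sys

TS = re.compile(r'^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z)')
REQ = re.compile(r'req_lim=(\d+)')

last_ts = "?"
for line in sys.stdin:
    ts = TS.match(line)
    if ts:
        last_ts = ts.group(1)
    m = REQ.search(line)
    if m:
        print(f"{last_ts}  req_lim={m.group(1)}")

If I'm reading my own excerpts right, req_lim goes from 73 at 16:11 down to
0 at 16:15, which is exactly when the aborts start failing with "tsk_mgmt no
TX_IU" and the DREQ/reconnect happens, so the target stops granting credits
well before the path drops.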