<div dir="ltr">After introducing Veeam into our environment (which has been the backend for both Hyper-V and VMware hosts for 2+ years with minimal issues), I've been trying to deal with a new intermittent issue. SAN traffic is via IB/SRP presented by COMSTAR and I am currently running r151012.<div><br></div><div>What happens is that when my Veeam backup jobs run (at this point, I had one big job of ~60 VMs), latency on the datastore that the repository is on deteriorates. When I first saw this, it seemed it was only affecting the VM running the Veeam proxy/repository (esxtop was only showing high QAVG on that VM). I would vMotion this VM to another host and everything would work normally again (I would put the host into maintenance mode and reboot it to get it out of this "state"). This suggested that it was not a SAN issue and isolated to the VM and/or host.</div><div><br></div><div>After updating to the latest Veeam B&R version (update 2), the issue persisted. Then I updated to vSphere 5.5 update 2e, and the issue went away for about 2 weeks (and my Veeam incident was closed). At this point, my backup job tried to merge the incremental backup back into the full, and that's when the issue reappeared. I was able to vMotion the VM to another host and get the job completed. I also noticed after update 2e that the latency appeared on any datastore that the VM had access to (had hot-added), and subsequently noticed via CDM benchmarks that it was all VMs on the affected host.</div><div><br></div><div>I reopened the case at Veeam, and took their recommendation to split the job up into several smaller jobs. This seemed to work for a few days until I caught it doing it again. This time I opened a case with VMware - while I was on hold the server got out of the condition by itself (no more latency - I was actually in the middle of doing CrystalDiskMark benchmarks on a few VMs, on and off the affected host). </div><div><br></div><div>The VMware engineer took a look at the vmkernel logs and highlighted the following entries:</div><div><br></div><div><pre><span style="color:black">2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[2a] SN 0x9276395 req=7a done=0 tsk_status=255 req_lim=73 tx_head=106811312 tx_tail=106811184</span></pre><pre><span style="color:black">2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[2a] SN 0x927639a req=65 on lun 0 state 0</span></pre><pre><span style="color:black">2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[2a] SN 0x927639a req=65 done=0 tsk_status=255 req_lim=72 tx_head=106811313 tx_tail=106811185</span></pre><pre><span style="color:black">2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[2a] SN 0x9276396 req=4f on lun 11 state 0</span></pre><pre><span style="color:black">2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[2a] SN 0x9276396 req=4f done=0 tsk_status=255 req_lim=71 tx_head=106811314 tx_tail=106811186</span></pre><pre><span style="color:black">2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[2a] SN 0x9276397 req=36 on lun 10 state 0</span></pre><pre><span style="color:black">2015-06-23T16:11:08.016Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[2a] SN 0x9276399 req=31 on lun 6 state 0</span></pre><pre><span style="color:black"> </span></pre><pre><span style="color:black"> </span></pre><pre><span style="color:black">2015-06-23T16:05:16.110Z cpu26:32839)ScsiDeviceIO: 2325: Cmd(0x4136c28516c0) 0x88, CmdSN 0x8000001a from world 35997 to dev "naa.600144f09084e251000052919b460003" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.</span></pre><pre><span style="color:black">2015-06-23T16:05:18.253Z cpu20:36138)ScsiDeviceIO: 2325: Cmd(0x4136c0428d40) 0x2a, CmdSN 0xa389000 from world 35970 to dev "naa.600144f09084e251000052919b560004" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.</span></pre><pre><span style="color:black">2015-06-23T16:05:19.382Z cpu8:32821)ScsiDeviceIO: 2325: Cmd(0x412ece968480) 0x28, CmdSN 0x80000038 from world 36101 to dev "naa.600144f09084e251000052919b460003" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.</span></pre><pre><span style="color:black"> </span></pre><pre><span style="color:black"> </span></pre><pre><span style="color:black">2015-06-23T16:15:06.963Z cpu24:33485)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] New target id_ext=0x0002c903000bc0bc ioc_guid=0x0002c903000bc0bc pkey=0xffff service_id=0x0002c903000bc0bc</span></pre><pre><span style="color:black">dgid=fe80:0000:0000:0000:0002:c903:000b:c0bd - req_lim 128</span></pre><pre><span style="color:black">2015-06-23T16:15:06.963Z cpu23:33475)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] ALIVE!!!</span></pre><pre><span style="color:black">2015-06-23T16:15:06.972Z cpu15:33730)NMP: nmp_ThrottleLogForDevice:2322: Cmd 0x16 (0x412ecb3737c0, 0) to dev "naa.600144f09084e25100005286cf240003" on path "vmhba_mlx4_0.1.1:C0:T1:L2</span></pre><pre><span style="color:black">" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0. Act:NONE</span></pre><pre><span style="color:black">2015-06-23T16:15:06.972Z cpu15:33730)ScsiDeviceIO: 2325: Cmd(0x412ecb3737c0) 0x16, CmdSN 0x9f1f from world 0 to dev "naa.600144f09084e25100005286cf240003" failed H:0x0 D:0x2 P:0x0 Va</span></pre><pre><span style="color:black">lid sense data: 0x6 0x29 0x0.</span></pre><pre><span style="color:black">2015-06-23T16:15:06.972Z cpu15:33730)ScsiCore: 1486: Power-on Reset occurred on naa.600144f09084e25100005286cf240003</span></pre><pre><span style="color:black"> </span></pre><pre><span style="color:black"> </span></pre><pre><span style="color:black">2015-06-23T16:15:05.029Z cpu6:34170)World: 14302: VC opID hostd-8fa8 maps to vmkernel opID ba341ef0</span></pre><pre><span style="color:black">2015-06-23T16:15:06.941Z cpu13:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[12] SN 0x92766c9 req=63 on lun 6 state 0</span></pre><pre><span style="color:black">2015-06-23T16:15:06.941Z cpu13:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] active_abort 1 lim 1 used_req 126 tag=0000000000000163 tsk_tag=0000000000000063</span></pre><pre><span style="color:black">2015-06-23T16:15:06.941Z cpu13:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[12] SN 0x92766c9 req=63 done=0 tsk_status=255 req_lim=0 tx_head=106811505 tx_tail=106811377</span></pre><pre><span style="color:black">2015-06-23T16:15:06.941Z cpu13:33135)WARNING: LinScsi: SCSILinuxAbortCommands:1843: Failed, Driver ib_srp, for vmhba_mlx4_0.1.1</span></pre><pre><span style="color:black">2015-06-23T16:15:06.959Z cpu4:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[1a] SN 0x92766f2 req=5a on lun 10 state 0</span></pre><pre><span style="color:black">2015-06-23T16:15:06.959Z cpu4:33135)<3>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] tsk_mgmt no TX_IU head 106811505 tail 106811378 lim 0</span></pre><pre><span style="color:black">2015-06-23T16:15:06.959Z cpu4:33135)WARNING: LinScsi: SCSILinuxAbortCommands:1843: Failed, Driver ib_srp, for vmhba_mlx4_0.1.1</span></pre><pre><span style="color:black">2015-06-23T16:15:06.959Z cpu21:33475)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Sending DREQ cm_id 0x410b54a06d00 qp_num 0x55</span></pre><pre><span style="color:black">2015-06-23T16:15:06.960Z cpu25:36234)NMP: nmp_ThrottleLogForDevice:2322: Cmd 0x12 (0x4136c4245a40, 0) to dev "naa.600144f09084e251000052919b340002" on path "vmhba_mlx4_0.1.1:C0:T1:L6</span></pre><pre><span style="color:black">" Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL</span></pre><pre><span style="color:black"><br></span></pre><pre><span style="font-family:arial,sans-serif;white-space:normal">Of particular interest is the 0x28 SCSI device message returned by the SAN:</span><br></pre><pre><h3 style="margin:0px 0px 14px;font-size:14px;border:0px none;font-family:Arial,Helvetica,sans-serif,'Bitstream Vera Sans';font-stretch:normal;padding:10px 0px 5px;vertical-align:baseline;color:rgb(0,0,0);white-space:normal"><span style="border:0px none;font-family:'Courier New';font-size:10pt;font-style:inherit;font-variant:inherit;font-weight:inherit;font-stretch:inherit;line-height:inherit;margin:0px;padding:0px;vertical-align:baseline">VMK_SCSI_DEVICE_QUEUE_FULL (TASK SET FULL) = 0x28</span></h3></pre><pre><span style="color:rgb(102,102,102);font-family:Arial,Helvetica,sans-serif,'Bitstream Vera Sans';font-size:12px;white-space:normal">This status is returned when the LUN prevents accepting SCSI commands from initiators due to lack of resources, namely the queue depth on the array.</span><font face="arial, sans-serif"><span style="white-space:normal"><br></span></font></pre><pre><font face="arial, sans-serif"><span style="white-space:normal">(outlined in this KB: <a href="http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1030381">http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1030381</a>)<br></span></font></pre><pre><font face="arial, sans-serif"><span style="white-space:normal"><br></span></font></pre><pre><font face="arial, sans-serif"><span style="white-space:normal">The engineer said that when under load the SAN cannot service the requests, and gets into this state. He asked me to check with the storage vendor (hah!) and check with corresponding timestamps to see what is happening. What doesn't make sense to me is that it got into this state at around 3am (that's when the performance dropped, when no other load other than backup was running, and it had been running for 6+ hours at that point), and then magically recovered at around noon (9+ hours). It's isolated to the host because other VMs experience the same thing (although you can't really see it on esxtop until you do sequential reads/writes). </span></font></pre><pre><font face="arial, sans-serif"><span style="white-space:normal">I've lowered the max concurrent tasks for the Veeam proxy (from 4 to 3) for now, and will try to change the transport mode as suggested by the Veeam tech (to see if it's a hot-add thing).</span></font></pre><pre><span style="white-space:normal;font-family:arial,sans-serif">Is there a tunable I can check/change, or some other troubleshooting I can do? Or is this one of those, "Oh, it's COMSTAR.. YMMV" answers? :) What is strange is that after updating vSphere my initial "big" job ran, whereas previously it would fail after so many hours consistently. Is it a driver thing (running Mellanox 1.8.2, latest that supports SRP)?</span><br></pre><pre><br></pre></div></div>