<div dir="ltr"><div><div><div>Hello all,<br></div>I am not using layer 2 flow control. The switch carries line-rate 10G traffic without error.<br><br></div>I think I have found the issue via lockstat. The first lockstat is taken during a multipath read:<br><br><br></div>lockstat -kWP sleep 30<br><div><div><div><br>Adaptive mutex spin: 21331 events in 30.020 seconds (711 events/sec)<br><br>Count indv cuml rcnt nsec Hottest Lock Caller<br>-------------------------------------------------------------------------------<br> 9306 44% 44% 0.00 1557 htable_mutex+0x370 htable_release<br> 6307 23% 68% 0.00 1207 htable_mutex+0x108 htable_lookup<br> 596 7% 75% 0.00 4100 0xffffff0931705188 cv_wait<br> 349 5% 80% 0.00 4437 0xffffff0931705188 taskq_thread<br> 704 2% 82% 0.00 995 0xffffff0935de3c50 dbuf_create<br><br></div><div>The hash table being read here I would guess is the tcp connection hash table.<br></div><div><br></div><div>When lockstat is run during a multipath write operation, I get:<br><br>Adaptive mutex spin: 1097341 events in 30.016 seconds (36558 events/sec)<br><br>Count indv cuml rcnt nsec Hottest Lock Caller<br>-------------------------------------------------------------------------------<br>210752 28% 28% 0.00 4781 0xffffff0931705188 taskq_thread<br>174471 22% 50% 0.00 4476 0xffffff0931705188 cv_wait<br>127183 10% 61% 0.00 2871 0xffffff096f29b510 zio_notify_parent<br>176066 10% 70% 0.00 1922 0xffffff0931705188 taskq_dispatch_ent<br>105134 9% 80% 0.00 3110 0xffffff096ffdbf10 zio_remove_child<br>67512 4% 83% 0.00 1938 0xffffff096f3db4b0 zio_add_child<br>45736 3% 86% 0.00 2239 0xffffff0935de3c50 dbuf_destroy<br>27781 3% 89% 0.00 3416 0xffffff0935de3c50 dbuf_create<br>38536 2% 91% 0.00 2122 0xffffff0935de3b70 dnode_rele<br>27841 2% 93% 0.00 2423 0xffffff0935de3b70 dnode_diduse_space<br>19020 2% 95% 0.00 3046 0xffffff09d9e305e0 dbuf_rele<br>14627 1% 96% 0.00 3632 dbuf_hash_table+0x4f8 dbuf_find<br><br><br><br></div><div>Writes are not performing htable lookups, 
while reads are.<br><br></div><div>-Warren V<br></div><div><br><br><br><br><br></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Mar 2, 2015 at 3:14 AM, Joerg Goltermann <span dir="ltr"><<a href="mailto:jg@osn.de" target="_blank">jg@osn.de</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>
<br>
I would try *one* TPG which includes both interface addresses<br>
and I would double check for packet drops on the Catalyst.<br>
<br>
The 3560 supports only receive flow control, which means that<br>
a sending 10Gbit port can easily overload a 1Gbit port.<br>
Do you have flow control enabled?<br>
<br>
- Joerg<div><div class="h5"><br>
<br>
On 02.03.2015 09:22, W Verb via illumos-developer wrote:<br>
</div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div class="h5">
Hello Garrett,<br>
<br>
No, no 802.3ad going on in this config.<br>
<br>
Here is a basic schematic:<br>
<br>
<a href="https://drive.google.com/file/d/0BwyUMjibonYQVkVqcE5OQUJyUUU/view?usp=sharing" target="_blank">https://drive.google.com/file/<u></u>d/<u></u>0BwyUMjibonYQVkVqcE5OQUJyUUU/<u></u>view?usp=sharing</a><br>
<br>
Here is the Nexenta MPIO iSCSI Setup Document that I used as a guide:<br>
<br>
<a href="https://drive.google.com/file/d/0BwyUMjibonYQbjEyUTBjN2tTNWM/view?usp=sharing" target="_blank">https://drive.google.com/file/<u></u>d/<u></u>0BwyUMjibonYQbjEyUTBjN2tTNWM/<u></u>view?usp=sharing</a><br>
<br>
Note that I am using an MTU of 3000 on both the 10G and 1G NICs. The<br>
switch is set to allow 9148-byte frames, and I'm not seeing any<br>
errors/buffer overruns on the switch.<br>
<br>
Here is a screenshot of a packet capture from a read operation on the<br>
guest OS (from its local drive, which is actually a VMDK file on the<br>
storage server). In this example, only a single 1G ESXi kernel interface<br>
(vmk1) is bound to the software iSCSI initiator.<br>
<br>
<a href="https://drive.google.com/file/d/0BwyUMjibonYQa2NYdXhpZkpkbU0/view?usp=sharing" target="_blank">https://drive.google.com/file/<u></u>d/<u></u>0BwyUMjibonYQa2NYdXhpZkpkbU0/<u></u>view?usp=sharing</a><br>
<br>
Note that there's a nice, well-behaved window sizing process taking<br>
place. The ESXi decreases the scaled window by 11 or 12 for each ACK,<br>
then bumps it back up to 512.<br>
<br>
Here is a similar screenshot of a single-interface write operation:<br>
<br>
<a href="https://drive.google.com/file/d/0BwyUMjibonYQbU1RZHRnakxDSFU/view?usp=sharing" target="_blank">https://drive.google.com/file/<u></u>d/<u></u>0BwyUMjibonYQbU1RZHRnakxDSFU/<u></u>view?usp=sharing</a><br>
<br>
There are no pauses or gaps in the transmission rate in the<br>
single-interface transfers.<br>
<br>
<br>
In the next screenshots, I have enabled an additional 1G interface on<br>
the ESXi host, and bound it to the iSCSI initiator. The new interface is<br>
bound to a separate physical port, uses a different VLAN on the switch,<br>
and talks to a different 10G port on the storage server.<br>
<br>
First, let's look at a write operation on the guest OS, which happily<br>
pumps data at near-line-rate to the storage server.<br>
<br>
Here is a sequence number trace diagram. Note how the transfer has a<br>
nice, smooth increment rate over the entire transfer.<br>
<br>
<a href="https://drive.google.com/file/d/0BwyUMjibonYQWHNIa0drWnNxMmM/view?usp=sharing" target="_blank">https://drive.google.com/file/<u></u>d/<u></u>0BwyUMjibonYQWHNIa0drWnNxMmM/<u></u>view?usp=sharing</a><br>
<br>
Here are screenshots from packet captures on both 1G interfaces:<br>
<br>
<a href="https://drive.google.com/file/d/0BwyUMjibonYQRWhyVVQ4djNaU3c/view?usp=sharing" target="_blank">https://drive.google.com/file/<u></u>d/<u></u>0BwyUMjibonYQRWhyVVQ4djNaU3c/<u></u>view?usp=sharing</a><br>
<a href="https://drive.google.com/file/d/0BwyUMjibonYQaTVjTEtTRloyR2c/view?usp=sharing" target="_blank">https://drive.google.com/file/<u></u>d/<u></u>0BwyUMjibonYQaTVjTEtTRloyR2c/<u></u>view?usp=sharing</a><br>
<br>
Note how we again see nice, smooth window adjustment, and no gaps in<br>
transmission.<br>
<br>
<br>
But now, let's look at the problematic two-interface Read operation.<br>
First, the sequence graph:<br>
<br>
<a href="https://drive.google.com/file/d/0BwyUMjibonYQTzdFVWdQMWZ6LUU/view?usp=sharing" target="_blank">https://drive.google.com/file/<u></u>d/<u></u>0BwyUMjibonYQTzdFVWdQMWZ6LUU/<u></u>view?usp=sharing</a><br>
<br>
As you can see, there are gaps and jumps in the transmission throughout<br>
the transfer.<br>
It is very illustrative to look at captures of the gaps, which are<br>
occurring on both interfaces:<br>
<br>
<a href="https://drive.google.com/file/d/0BwyUMjibonYQc0VISXN6eVFwQzg/view?usp=sharing" target="_blank">https://drive.google.com/file/<u></u>d/<u></u>0BwyUMjibonYQc0VISXN6eVFwQzg/<u></u>view?usp=sharing</a><br>
<a href="https://drive.google.com/file/d/0BwyUMjibonYQVFREUHp3TGFiUU0/view?usp=sharing" target="_blank">https://drive.google.com/file/<u></u>d/<u></u>0BwyUMjibonYQVFREUHp3TGFiUU0/<u></u>view?usp=sharing</a><br>
<br>
As you can see, there are ~0.4-second pauses in transmission from the<br>
storage server, which kill the transfer rate.<br>
It's clear that the ESXi box ACKs the prior iSCSI operation to<br>
completion, then makes a new LUN request, which the storage server<br>
immediately replies to. The ESXi ACKs the response packet from the<br>
storage server, then waits...and waits....and waits... until eventually<br>
the storage server starts transmitting again.<br>
<br>
Because the pause happens while the ESXi client is waiting for a packet<br>
from the storage server, that tells me that the gaps are not an artifact<br>
of traffic being switched between both active interfaces, but are<br>
actually indicative of short hangs occurring on the server.<br>
<br>
Having a pause or two in transmission is no big deal, but in my case, it<br>
is happening constantly, and dropping my overall read transfer rate down<br>
to 20-60MB/s, which is slower than the single interface transfer rate<br>
(~90-100MB/s).<br>
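<br>
To see how sub-second stalls can drag a healthy link below single-path speed, here is a back-of-the-envelope model in Python.<br>
The burst sizes and the 100 MB/s streaming rate are illustrative assumptions, not measurements from the captures; only the ~0.4 s pause comes from the traces above.<br>
<br>
```python
def effective_rate_mb_s(burst_mb, stream_rate_mb_s, pause_s):
    """Average throughput when each burst of data is followed by a stall."""
    transfer_s = burst_mb / stream_rate_mb_s  # time spent actually sending
    return burst_mb / (transfer_s + pause_s)


# Assumed: 100 MB/s while streaming, 0.4 s stall after every burst.
for burst_mb in (8, 16, 64):
    rate = effective_rate_mb_s(burst_mb, 100.0, 0.4)
    print(f"{burst_mb:3d} MB per burst: {rate:5.1f} MB/s")
```
<br>
With these assumed numbers the average lands between roughly 17 and 62 MB/s, which is the same ballpark as the observed 20-60MB/s.<br>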
<br>
Decreasing the MTU makes the pauses shorter; increasing it makes the<br>
pauses longer.<br>
<br>
Another interesting thing is that if I set the multipath io interval to<br>
3 operations instead of 1, I get better throughput. In other words, the<br>
less frequently I swap IP addresses on my iSCSI requests from the ESXi<br>
unit, the fewer pauses I see.<br>
<br>
Basically, COMSTAR seems to choke each time an iSCSI request from a new<br>
IP arrives.<br>
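<br>
The effect of the IO interval can be made concrete with a toy model of round-robin path selection.<br>
This is only a sketch of the general policy (move to the next path after a fixed number of requests); the function names and the second port name are made up for illustration, and this is not ESXi's implementation.<br>
<br>
```python
def path_for_op(op_index, paths, io_interval):
    """Path used for the op_index-th request under round-robin with a fixed interval."""
    return paths[(op_index // io_interval) % len(paths)]


def count_path_switches(num_ops, paths, io_interval):
    """How many times consecutive requests go out on different paths."""
    used = [path_for_op(i, paths, io_interval) for i in range(num_ops)]
    return sum(1 for prev, cur in zip(used, used[1:]) if prev != cur)


paths = ["vmk1", "vmk2"]  # "vmk2" is a hypothetical name for the second 1G port
for interval in (1, 3):
    switches = count_path_switches(12, paths, interval)
    print(f"interval={interval}: {switches} path switches in 12 requests")
```
<br>
If every switch to a different source IP costs one stall on the target, raising the interval from 1 to 3 cuts the switches for the same 12 requests from 11 to 3, which is the direction of the improvement described above.<br>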
<br>
Because the single interface transfer is near line rate, that tells me<br>
that the storage system (mpt_sas, zfs, etc) is working fine. It's only<br>
when multiple paths are attempted that iSCSI falls on its face during reads.<br>
<br>
All of these captures were taken without a cache device being attached<br>
to the storage zpool, so this isn't looking like some kind of ZFS ARC<br>
problem. As mentioned previously, local transfers to/from the zpool are<br>
showing ~300-500 MB/s rates over long transfers (10G+).<br>
<br>
-Warren V<br>
<br>
On Sun, Mar 1, 2015 at 9:11 PM, Garrett D'Amore <<a href="mailto:garrett@damore.org" target="_blank">garrett@damore.org</a>> wrote:<br></div></div><span class="">
<br>
I’m not sure I’ve followed properly. You have *two* interfaces.<br>
You are not trying to provision these in an aggr are you? As far as<br>
I’m aware, VMware does not support 802.3ad link aggregations. (It’s<br>
possible that you can make it work with ESXi if you give the entire<br>
NIC to the guest — but I’m skeptical.) The problem is that if you<br>
try to use link aggregation, some packets (up to half!) will be<br>
lost. TCP and other protocols fare poorly in this situation.<br>
<br>
It’s possible I’ve totally misunderstood what you’re trying to do, in<br>
which case I apologize.<br>
<br>
The idle thing is a red herring: the cpu is waiting for work to do,<br>
probably because packets haven’t arrived (or were dropped by the<br>
hypervisor!) I wouldn’t read too much into that except that your<br>
network stack is in trouble. I’d look a bit more closely at the<br>
kstats for tcp — I suspect you’ll see retransmits or out of order<br>
values that are unusually high — if so this may help validate my<br>
theory above.<br>
<br>
- Garrett<br>
<br>
</span><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="">
On Mar 1, 2015, at 9:03 PM, W Verb via illumos-developer</span>
<<a href="mailto:developer@lists.illumos.org" target="_blank">developer@lists.illumos.org</a>> wrote:<br><div><div class="h5">
<br>
Hello all,<br>
<br>
<br>
Well, I no longer blame the ixgbe driver for the problems I'm seeing.<br>
<br>
<br>
I tried Joerg's updated driver, which didn't improve the issue. So<br>
I went back to the drawing board and rebuilt the server from scratch.<br>
<br>
What I noted is that if I have only a single 1-gig physical<br>
interface active on the ESXi host, everything works as expected.<br>
As soon as I enable two interfaces, I start seeing the performance<br>
problems I've described.<br>
<br>
Response pauses from the server that I see in TCPdumps are still<br>
leading me to believe the problem is delay on the server side, so<br>
I ran a series of kernel dtraces and produced some flamegraphs.<br>
<br>
<br>
This was taken during a read operation with two active 10G<br>
interfaces on the server, with a single target being shared by two<br>
TPGs, one TPG for each 10G physical port. The host device has two<br>
1G ports enabled, with VLANs separating the active ports into<br>
10G/1G pairs. ESXi is set to multipath using both VLANs with a<br>
round-robin IO interval of 1.<br>
<br>
<a href="https://drive.google.com/file/d/0BwyUMjibonYQd3ZYOGh4d2pteGs/view?usp=sharing" target="_blank">https://drive.google.com/file/<u></u>d/<u></u>0BwyUMjibonYQd3ZYOGh4d2pteGs/<u></u>view?usp=sharing</a><br>
<br>
<br>
This was taken during a write operation:<br>
<br>
<a href="https://drive.google.com/file/d/0BwyUMjibonYQMnBtU1Q2SXM2ams/view?usp=sharing" target="_blank">https://drive.google.com/file/<u></u>d/<u></u>0BwyUMjibonYQMnBtU1Q2SXM2ams/<u></u>view?usp=sharing</a><br>
<br>
<br>
I then rebooted the server and disabled C-State, ACPI T-State, and<br>
general EIST (Turbo boost) functionality in the CPU.<br>
<br>
When I attempted to boot my guest VM, the iSCSI transfer<br>
gradually ground to a halt during the boot loading process, and<br>
the guest OS never did complete its boot process.<br>
<br>
Here is a flamegraph taken while iSCSI is slowly dying:<br>
<br>
<a href="https://drive.google.com/file/d/0BwyUMjibonYQM21JeFZPX3dZWTg/view?usp=sharing" target="_blank">https://drive.google.com/file/<u></u>d/<u></u>0BwyUMjibonYQM21JeFZPX3dZWTg/<u></u>view?usp=sharing</a><br>
<br>
<br>
I edited out cpu_idle_adaptive from the dtrace output and<br>
regenerated the slowdown graph:<br>
<br>
<a href="https://drive.google.com/file/d/0BwyUMjibonYQbTVwV3NvXzlPS1E/view?usp=sharing" target="_blank">https://drive.google.com/file/<u></u>d/<u></u>0BwyUMjibonYQbTVwV3NvXzlPS1E/<u></u>view?usp=sharing</a><br>
<br>
<br>
I then edited cpu_idle_adaptive out of the speedy write operation<br>
and regenerated that graph:<br>
<br>
<a href="https://drive.google.com/file/d/0BwyUMjibonYQeWFYM0pCMDZ1X2s/view?usp=sharing" target="_blank">https://drive.google.com/file/<u></u>d/<u></u>0BwyUMjibonYQeWFYM0pCMDZ1X2s/<u></u>view?usp=sharing</a><br>
<br>
<br>
I have zero experience with interpreting flamegraphs, but the most<br>
significant difference I see between the slow read example and the<br>
fast write example is in unix`thread_start --> unix`idle. There's<br>
a good chunk of "unix`i86_mwait" in the read example that is not<br>
present in the write example at all.<br>
<br>
Disabling the l2arc cache device didn't make a difference, and I<br>
had to reenable EIST support on the CPU to get my VMs to boot.<br>
<br>
I am seeing a variety of bug reports going back to 2010 regarding<br>
excessive mwait operations, with the suggested solutions usually<br>
being to set "cpupm enable poll-mode" in power.conf. That change<br>
also had no effect on speed.<br>
<br>
-Warren V<br>
<br>
<br>
<br>
<br>
-----Original Message-----<br>
<br>
From: Chris Siebenmann [mailto:<a href="mailto:cks@cs.toronto.edu" target="_blank">cks@cs.toronto.edu</a>]<br>
<br>
Sent: Monday, February 23, 2015 8:30 AM<br>
<br>
To: W Verb<br>
<br>
Cc: <a href="mailto:omnios-discuss@lists.omniti.com" target="_blank">omnios-discuss@lists.omniti.com</a>; <a href="mailto:cks@cs.toronto.edu" target="_blank">cks@cs.toronto.edu</a><br></div></div><span class="">
<br>
Subject: Re: [OmniOS-discuss] The ixgbe driver, Lindsay Lohan, and<br>
the Greek economy<br>
<br>
<br>
> Chris, thanks for your specific details. I'd appreciate it if you<br>
<br>
> could tell me which copper NIC you tried, as well as to pass on the<br>
<br>
> iSCSI tuning parameters.<br>
<br>
<br>
Our copper NIC experience is with onboard X540-AT2 ports on<br>
SuperMicro hardware (which have the guaranteed 10-20 msec lock<br>
hold) and dual-port 82599EB TN cards (which have some sort of<br>
driver/hardware failure under load that eventually leads to<br>
2-second lock holds). I can't recommend either with the current<br>
driver; we had to revert to 1G networking in order to get stable<br>
servers.<br>
<br>
<br>
The iSCSI parameter modifications we do, across both initiators<br>
and targets, are:<br>
<br>
<br>
initialr2t no<br>
<br>
firstburstlength 128k<br>
<br>
maxrecvdataseglen 128k [only on Linux backends]<br>
<br></span>
maxxmitdataseglen 128k [only on Linux backends]<span class=""><br>
<br>
<br>
The OmniOS initiator doesn't need tuning for more than the first<br>
two parameters; on the Linux backends we tune up all four. My<br>
extended thoughts on these tuning parameters and why we touch them<br>
can be found here:<br>
<br>
<br>
<a href="http://utcc.utoronto.ca/~cks/space/blog/tech/UnderstandingiSCSIProtocol" target="_blank">http://utcc.utoronto.ca/~cks/<u></u>space/blog/tech/<u></u>UnderstandingiSCSIProtocol</a><br>
<br>
<a href="http://utcc.utoronto.ca/~cks/space/blog/tech/LikelyISCSITuning" target="_blank">http://utcc.utoronto.ca/~cks/<u></u>space/blog/tech/<u></u>LikelyISCSITuning</a><br>
<br>
<br>
The short version is that these parameters probably only make a<br>
small difference but their overall goal is to do 128KB ZFS reads<br>
and writes in single iSCSI operations (although they will be<br>
fragmented at the TCP layer) and to do iSCSI writes without a<br>
back-and-forth delay between initiator and target (that's<br>
'initialr2t no').<br>
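<br>
To give a feel for the TCP-level fragmentation mentioned here, this sketch counts how many TCP segments one 128KB transfer needs at a few MTUs.<br>
It assumes 20-byte IPv4 and 20-byte TCP headers with no options and ignores iSCSI PDU headers, so real counts will be slightly higher.<br>
<br>
```python
import math

IP_TCP_HEADER_BYTES = 40  # assumed: 20-byte IPv4 + 20-byte TCP, no options


def segments_needed(payload_bytes, mtu):
    """Number of TCP segments needed to carry payload_bytes at the given MTU."""
    mss = mtu - IP_TCP_HEADER_BYTES  # payload bytes that fit in one segment
    return math.ceil(payload_bytes / mss)


for mtu in (1500, 3000, 9000):
    print(mtu, segments_needed(128 * 1024, mtu))
```
<br>
Under these assumptions a 128KB operation takes 90 segments at an MTU of 1500, 45 at 3000 and 15 at 9000.<br>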
<br>
<br>
I think basically everyone should use InitialR2T set to no and in<br>
fact that it should be the software default. These days only<br>
unusually limited iSCSI targets should need it to be otherwise and<br>
they can change their setting for it (initiator and target must<br>
both agree to it being 'no', so either can veto it).<br>
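<br>
For reference, RFC 3720 negotiates InitialR2T with a boolean OR result function and a default of Yes. A minimal sketch of that rule (the function name is made up for illustration):<br>
<br>
```python
def negotiate_initial_r2t(initiator_wants_r2t, target_wants_r2t):
    """Boolean-OR negotiation: R2T stays required if either side asks for it."""
    return initiator_wants_r2t or target_wants_r2t


for ini in (True, False):
    for tgt in (True, False):
        result = "Yes" if negotiate_initial_r2t(ini, tgt) else "No"
        print(f"initiator={ini!s:5} target={tgt!s:5} InitialR2T={result}")
```
<br>
Only the combination where both sides offer No ends up with InitialR2T=No, which is why the setting has to be changed on both the initiator and the target.<br>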
<br>
<br>
- cks<br>
<br>
<br>
<br>
On Mon, Feb 23, 2015 at 8:21 AM, Joerg Goltermann <<a href="mailto:jg@osn.de" target="_blank">jg@osn.de</a>> wrote:<br></span><div><div class="h5">
<br>
Hi,<br>
<br>
I think your problem is caused by your link properties or your<br>
switch settings. In general the standard ixgbe seems to perform<br>
well.<br>
<br>
I had trouble after changing the default flow control settings<br>
to "bi" and this was my motivation to update the ixgbe driver a long<br>
time ago.<br>
After I updated our systems to ixgbe 2.5.8 I never had any<br>
problems ....<br>
<br>
Make sure your switch has support for jumbo frames and you use<br>
the same mtu on all ports, otherwise the smallest will be used.<br>
<br>
What switch do you use? I can tell you nice horror stories about<br>
different vendors....<br>
<br>
- Joerg<br>
<br>
On 23.02.2015 10:31, W Verb wrote:<br>
<br>
Thank you Joerg,<br>
<br>
I've downloaded the package and will try it tomorrow.<br>
<br>
The only thing I can add at this point is that upon review of my<br>
testing, I may have performed my "pkg -u" between the initial quad-gig<br>
performance test and installing the 10G NIC. So this may be a new<br>
problem introduced in the latest updates.<br>
<br>
Those of you who are running 10G and have not upgraded to the latest<br>
kernel, etc, might want to do some additional testing before running the<br>
update.<br>
<br>
-Warren V<br>
<br>
On Mon, Feb 23, 2015 at 1:15 AM, Joerg Goltermann <<a href="mailto:jg@osn.de" target="_blank">jg@osn.de</a>> wrote:<br></div></div><span class="">
<br>
Hi,<br>
<br>
I remember there was a problem with the flow control settings in the<br>
ixgbe driver, so I updated it a long time ago for our internal servers to<br>
2.5.8.<br>
Last weekend I integrated the latest changes from the FreeBSD driver<br>
to bring the illumos ixgbe to 2.5.25 but I had no time to test it, so it's<br>
completely untested!<br>
<br>
<br>
If you would like to give the latest driver a try you can fetch the<br>
kernel modules from<br></span>
<a href="https://cloud.osn.de/index.php/s/Fb4so9RsNnXA7r9" target="_blank">https://cloud.osn.de/index.php/s/Fb4so9RsNnXA7r9</a><span class=""><br>
<br>
Clone your boot environment, place the modules in the new environment<br>
and update the boot-archive of the new BE.<br>
<br>
- Joerg<br>
<br>
<br>
<br>
<br>
<br>
On 23.02.2015 02:54, W Verb wrote:<br>
<br>
By the way, to those of you who have working setups: please send me<br>
your pool/volume settings, interface linkprops, and any kernel tuning<br>
parameters you may have set.<br>
<br>
Thanks,<br>
Warren V<br>
<br>
On Sat, Feb 21, 2015 at 7:59 AM, Schweiss, Chip <<a href="mailto:chip@innovates.com" target="_blank">chip@innovates.com</a>> wrote:<br></span><div><div class="h5">
<br>
I can't say I totally agree with your performance assessment. I run Intel<br>
X520 in all my OmniOS boxes.<br>
<br>
Here is a capture of nfssvrtop I made while running many storage vMotions<br>
between two OmniOS boxes hosting NFS datastores. This is a 10 host VMware<br>
cluster. Both OmniOS boxes are dual 10G connected with copper twin-ax to<br>
the in rack Nexus 5010.<br>
<br>
VMware does 100% sync writes, I use ZeusRAM SSDs for log devices.<br>
<br>
-Chip<br>
<br>
2014 Apr 24 08:05:51, load: 12.64, read: 17330243 KB, swrite: 15985 KB, awrite: 1875455 KB<br>
<br>
Ver Client        NFSOPS  Reads SWrites AWrites Commits   Rd_bw SWr_bw AWr_bw  Rd_t SWr_t AWr_t Com_t Align%<br>
4   10.28.17.105       0      0       0       0       0       0      0      0     0     0     0     0      0<br>
4   10.28.17.215       0      0       0       0       0       0      0      0     0     0     0     0      0<br>
4   10.28.17.213       0      0       0       0       0       0      0      0     0     0     0     0      0<br>
4   10.28.16.151       0      0       0       0       0       0      0      0     0     0     0     0      0<br>
4   all                1      0       0       0       0       0      0      0     0     0     0     0      0<br>
3   10.28.16.175       3      0       3       0       0       1     11      0  4806    48     0     0     85<br>
3   10.28.16.183       6      0       6       0       0       3    162      0   549   124     0     0     73<br>
3   10.28.16.180      11      0      10       0       0       3     27      0   776    89     0     0     67<br>
3   10.28.16.176      28      2      26       0       0      10    405      0  2572   198     0     0    100<br>
3   10.28.16.178    4606   4602       4       0       0  294534      3      0   723    49     0     0     99<br>
3   10.28.16.179    4905   4879      26       0       0  312208    311      0   735   271     0     0     99<br>
3   10.28.16.181    5515   5502      13       0       0  352107     77      0    89    87     0     0     99<br>
3   10.28.16.184   12095  12059      10       0       0  763014     39      0   249   147     0     0     99<br>
3   10.28.58.1     15401   6040     116    6354      53  191605    474 202346   192    96   144    83     99<br></div></div>
3   all            42574  33086     217    6354      53 1913488   1582 202300   348   138   153   105     99<span class=""><br>
<br>
<br>
<br>
<br>
<br>
On Fri, Feb 20, 2015 at 11:46 PM, W Verb <<a href="mailto:wverb73@gmail.com" target="_blank">wverb73@gmail.com</a>> wrote:<br></span><div><div class="h5">
<br>
<br>
Hello All,<br>
<br>
Thank you for your replies.<br>
I tried a few things, and found the following:<br>
<br>
1: Disabling hyperthreading support in the BIOS drops performance overall<br>
by a factor of 4.<br>
2: Disabling VT support also seems to have some effect, although it<br>
appears to be minor. But this has the amusing side effect of fixing the<br>
hangs I've been experiencing with fast reboot. Probably by disabling kvm.<br>
3: The performance tests are a bit tricky to quantify because of caching<br>
effects. In fact, I'm not entirely sure what is happening here. It's just<br>
best to describe what I'm seeing:<br>
<br>
The commands I'm using to test are<br>
dd if=/dev/zero of=./test.dd bs=2M count=5000<br>
dd of=/dev/null if=./test.dd bs=2M count=5000<br>
The host vm is running CentOS 6.6, and has the latest vmtools installed.<br>
There is a host cache on an SSD local to the host that is also in place.<br>
Disabling the host cache didn't immediately have an effect as far as I could<br>
see.<br>
<br>
The host MTU is set to 3000 on all iSCSI interfaces for all tests.<br>
<br>
Test 1: Right after reboot, with an ixgbe MTU of 9000, the write test<br>
yields an average speed over three tests of 137MB/s. The read test yields an<br>
average over three tests of 5MB/s.<br>
<br>
Test 2: After setting "ifconfig ixgbe0 mtu 3000", the write tests yield<br>
140MB/s, and the read tests yield 53MB/s. It's important to note here that<br>
if I cut the read test short at only 2-3GB, I get results upwards of<br>
350MB/s, which I assume is local cache-related distortion.<br>
<br>
Test 3: MTU of 1500. Read tests are up to 156 MB/s. Write tests yield<br>
about 142MB/s.<br>
Test 4: MTU of 1000: Read test at 182MB/s.<br>
Test 5: MTU of 900: Read test at 130 MB/s.<br>
Test 6: MTU of 1000: Read test at 160MB/s. Write tests are now<br>
consistently at about 300MB/s.<br>
Test 7: MTU of 1200: Read test at 124MB/s.<br>
Test 8: MTU of 1000: Read test at 161MB/s. Write at 261MB/s.<br>
<br>
A few final notes:<br>
L1ARC grabs about 10GB of RAM during the tests, so there's definitely some<br>
read caching going on.<br>
The write operations are easier to observe with iostat, and I'm seeing io<br>
rates that closely correlate with the network write speeds.<br>
<br>
<br>
Chris, thanks for your specific details. I'd appreciate it if you could<br>
tell me which copper NIC you tried, as well as to pass on the iSCSI tuning<br>
parameters.<br>
<br>
I've ordered an Intel EXPX9502AFXSR, which uses the 82598 chip instead of<br>
the 82599 in the X520. If I get similar results with my fiber transceivers,<br>
I'll see if I can get a hold of copper ones.<br>
<br>
But I should mention that I did indeed look at PHY/MAC error rates, and<br>
they are nil.<br>
<br>
-Warren V<br>
<br>
On Fri, Feb 20, 2015 at 7:25 PM, Chris Siebenmann <<a href="mailto:cks@cs.toronto.edu" target="_blank">cks@cs.toronto.edu</a>> wrote:<br></div></div><div><div class="h5">
<br>
<br>
> After installation and configuration, I observed all kinds of bad behavior<br>
> in the network traffic between the hosts and the server. All of this bad<br>
> behavior is traced to the ixgbe driver on the storage server. Without going<br>
> into the full troubleshooting process, here are my takeaways:<br>
<br>
[...]<br>
<br>
For what it's worth, we managed to achieve much better line rates on<br>
copper 10G ixgbe hardware of various descriptions between OmniOS<br>
and CentOS 7 (I don't think we ever tested OmniOS to OmniOS). I don't<br>
believe OmniOS could do TCP at full line rate but I think we managed 700+<br>
Mbytes/sec on both transmit and receive and we got basically disk-limited<br>
speeds with iSCSI (across multiple disks on multi-disk mirrored pools,<br>
OmniOS iSCSI initiator, Linux iSCSI targets).<br>
<br>
I don't believe we did any specific kernel tuning (and in fact some of<br>
our attempts to fiddle ixgbe driver parameters blew up in our face).<br>
We did tune iSCSI connection parameters to increase various buffer<br>
sizes so that ZFS could do even large single operations in single iSCSI<br>
transactions. (More details available if people are interested.)<br>
<br>
> 10: At the wire level, the speed problems are clearly due to pauses in<br>
> response time by omnios. At 9000 byte frame sizes, I see a good number<br>
> of duplicate ACKs and fast retransmits during read operations (when<br>
> omnios is transmitting). But below about a 4100-byte MTU on omnios<br>
> (which seems to correlate to 4096-byte iSCSI block transfers), the<br>
> transmission errors fade away and we only see the transmission pause<br>
> problem.<br>
<br>
<br>
This is what really attracted my attention. In our OmniOS setup, our<br>
specific Intel hardware had ixgbe driver issues that could cause<br>
activity stalls during once-a-second link heartbeat checks. This<br>
obviously had an effect at the TCP and iSCSI layers. My initial message<br>
to illumos-developer sparked a potentially interesting discussion:<br>
<br>
<br></div></div>
<a href="http://www.listbox.com/member/archive/182179/2014/10/sort/time_rev/page/16/entry/6:405/20141003125035:6357079A-4B1D-11E4-A39C-D534381BA44D/" target="_blank">http://www.listbox.com/member/archive/182179/2014/10/sort/time_rev/page/16/entry/6:405/20141003125035:6357079A-4B1D-11E4-A39C-D534381BA44D/</a><span class=""><br>
<br>
If you think this is a possibility in your setup, I've put the DTrace<br>
script I used to hunt for this up on the web:<br>
<br></span>
<a href="http://www.cs.toronto.edu/~cks/src/omnios-ixgbe/ixgbe_delay.d" target="_blank">http://www.cs.toronto.edu/~cks/src/omnios-ixgbe/ixgbe_delay.d</a><span class=""><br>
<br>
This isn't the only potential source of driver stalls by any means; it's<br>
just the one I found. You may also want to look at lockstat in general, as<br>
the information it reported is what led us to look specifically at the<br>
ixgbe code here.<br>
<br>
(If you suspect kernel/driver issues, lockstat combined with the kernel<br>
source is a really excellent resource.)<br>
<br>
- cks<br>
<br>
<br>
<br>
<br>
<br></span>
_______________________________________________<br>
OmniOS-discuss mailing list<br>
<a href="mailto:OmniOS-discuss@lists.omniti.com" target="_blank">OmniOS-discuss@lists.omniti.com</a><br>
<a href="http://lists.omniti.com/mailman/listinfo/omnios-discuss" target="_blank">http://lists.omniti.com/mailman/listinfo/omnios-discuss</a><br>
<br>
<br>
<span class=""><br>
<br>
<br>
--<br>
OSN Online Service Nuernberg GmbH, Bucher Str. 78, 90408 Nuernberg<br>
Tel: <a href="tel:%2B49%20911%2039905-0" value="+49911399050" target="_blank">+49 911 39905-0</a> - Fax: +49 911 39905-55 -<br></span>
<a href="http://www.osn.de" target="_blank">http://www.osn.de</a><span class=""><br>
HRB 15022 Nuernberg, USt-Id: DE189301263, GF: Joerg Goltermann<br>
<br>
<br>
<br>
<br></span>
*illumos-developer* | Archives<br>
<<a href="https://www.listbox.com/member/archive/182179/=now" target="_blank">https://www.listbox.com/member/archive/182179/=now</a>><br>
<<a href="https://www.listbox.com/member/archive/rss/182179/21239177-3604570e" target="_blank">https://www.listbox.com/member/archive/rss/182179/21239177-3604570e</a>><br>
| Modify Your Subscription<br>
[Powered by Listbox] <<a href="http://www.listbox.com/" target="_blank">http://www.listbox.com/</a>><br>
<br>
</blockquote>
<br>
<br>
<br>
</blockquote><div class=""><div class="h5">
<br>
-- <br>
OSN Online Service Nuernberg GmbH, Bucher Str. 78, 90408 Nuernberg<br>
Tel: <a href="tel:%2B49%20911%2039905-0" value="+49911399050" target="_blank">+49 911 39905-0</a> - Fax: <a href="tel:%2B49%20911%2039905-55" value="+499113990555" target="_blank">+49 911 39905-55</a> - <a href="http://www.osn.de" target="_blank">http://www.osn.de</a><br>
HRB 15022 Nuernberg, USt-Id: DE189301263, GF: Joerg Goltermann<br>
</div></div></blockquote></div><br></div></div>