<div dir="ltr"><div><div><div>Hello all,<br></div>I am not using layer 2 flow control. The switch carries line-rate 10G traffic without error.<br><br></div>I think I have found the issue via lockstat. The first lockstat is taken during a multipath read:<br><br><br></div>lockstat -kWP sleep 30<br><div><div><div><br>Adaptive mutex spin: 21331 events in 30.020 seconds (711 events/sec)<br><br>Count indv cuml rcnt     nsec Hottest Lock           Caller<br>-------------------------------------------------------------------------------<br> 9306  44%  44% 0.00     1557 htable_mutex+0x370     htable_release<br> 6307  23%  68% 0.00     1207 htable_mutex+0x108     htable_lookup<br>  596   7%  75% 0.00     4100 0xffffff0931705188     cv_wait<br>  349   5%  80% 0.00     4437 0xffffff0931705188     taskq_thread<br>  704   2%  82% 0.00      995 0xffffff0935de3c50     dbuf_create<br><br></div><div>The hash table being read here I would guess is the tcp connection hash table.<br></div><div><br></div><div>When lockstat is run during a multipath write operation, I get:<br><br>Adaptive mutex spin: 1097341 events in 30.016 seconds (36558 events/sec)<br><br>Count indv cuml rcnt     nsec Hottest Lock           Caller<br>-------------------------------------------------------------------------------<br>210752  28%  28% 0.00     4781 0xffffff0931705188     taskq_thread<br>174471  22%  50% 0.00     4476 0xffffff0931705188     cv_wait<br>127183  10%  61% 0.00     2871 0xffffff096f29b510     zio_notify_parent<br>176066  10%  70% 0.00     1922 0xffffff0931705188     taskq_dispatch_ent<br>105134   9%  80% 0.00     3110 0xffffff096ffdbf10     zio_remove_child<br>67512   4%  83% 0.00     1938 0xffffff096f3db4b0     zio_add_child<br>45736   3%  86% 0.00     2239 0xffffff0935de3c50     dbuf_destroy<br>27781   3%  89% 0.00     3416 0xffffff0935de3c50     dbuf_create<br>38536   2%  91% 0.00     2122 0xffffff0935de3b70     dnode_rele<br>27841   2%  93% 0.00     2423 0xffffff0935de3b70     dnode_diduse_space<br>19020   2%  95% 0.00     3046 0xffffff09d9e305e0     dbuf_rele<br>14627   1%  96% 0.00     3632 dbuf_hash_table+0x4f8  dbuf_find<br><br><br><br></div><div>Writes are not performing htable lookups, while reads are.<br><br></div><div>-Warren V<br></div><div><br><br><br><br><br></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Mar 2, 2015 at 3:14 AM, Joerg Goltermann <span dir="ltr"><<a href="mailto:jg@osn.de" target="_blank">jg@osn.de</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>

On Mon, Mar 2, 2015 at 3:14 AM, Joerg Goltermann <jg@osn.de> wrote:

Hi,

I would try *one* TPG which includes both interface addresses,
and I would double-check for packet drops on the Catalyst.

The 3560 supports only receive flow control, which means that a
sending 10Gbit port can easily overload a 1Gbit port.
Do you have flow control enabled?

 - Joerg
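For reference, a rough sketch of both suggestions. The TPG name, portal
addresses, and target IQN below are placeholders for this setup, and the
Catalyst commands are from memory, so treat all of it as untested:

# COMSTAR: put both 10G portal addresses into a single TPG and bind the
# target to it, instead of one TPG per interface
itadm create-tpg tpg-mpio 10.10.10.1 10.10.11.1
itadm modify-target -t tpg-mpio iqn.2010-09.org.example:target0
itadm list-target -v

# on the Catalyst (IOS): look for drops/pauses on the 1G ports facing ESXi
show flowcontrol interface gigabitEthernet 0/1
show interfaces gigabitEthernet 0/1 counters errors
show interfaces gigabitEthernet 0/1 | include drops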

On 02.03.2015 09:22, W Verb via illumos-developer wrote:
Hello Garrett,

No, no 802.3ad going on in this config.

Here is a basic schematic:

https://drive.google.com/file/d/0BwyUMjibonYQVkVqcE5OQUJyUUU/view?usp=sharing

Here is the Nexenta MPIO iSCSI Setup Document that I used as a guide:

https://drive.google.com/file/d/0BwyUMjibonYQbjEyUTBjN2tTNWM/view?usp=sharing

Note that I am using an MTU of 3000 on both the 10G and 1G NICs. The switch is
set to allow 9148-byte frames, and I'm not seeing any errors/buffer overruns
on the switch.
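For anyone following along, the MTU pieces involved look roughly like this.
ixgbe0 and vmk1 are the link names used in this thread, vSwitch1 is a
placeholder, and the commands are an untested sketch:

# OmniOS: check/set the link MTU (the link may need to be unplumbed first)
dladm show-linkprop -p mtu ixgbe0 ixgbe1
dladm set-linkprop -p mtu=3000 ixgbe0

# ESXi: the vSwitch and the iSCSI vmkernel port both need the larger MTU
esxcli network vswitch standard set -v vSwitch1 -m 3000
esxcli network ip interface set -i vmk1 -m 3000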

Here is a screenshot of a packet capture from a read operation on the guest OS
(from its local drive, which is actually a VMDK file on the storage server).
In this example, only a single 1G ESXi kernel interface (vmk1) is bound to the
software iSCSI initiator.

https://drive.google.com/file/d/0BwyUMjibonYQa2NYdXhpZkpkbU0/view?usp=sharing

Note that there's a nice, well-behaved window-sizing process taking place. The
ESXi decreases the scaled window by 11 or 12 for each ACK, then bumps it back
up to 512.

Here is a similar screenshot of a single-interface write operation:

https://drive.google.com/file/d/0BwyUMjibonYQbU1RZHRnakxDSFU/view?usp=sharing

There are no pauses or gaps in the transmission rate in the single-interface
transfers.


In the next screenshots, I have enabled an additional 1G interface on the ESXi
host and bound it to the iSCSI initiator. The new interface is bound to a
separate physical port, uses a different VLAN on the switch, and talks to a
different 10G port on the storage server.

First, let's look at a write operation on the guest OS, which happily pumps
data at near line rate to the storage server.

Here is a sequence-number trace diagram. Note how the transfer has a nice,
smooth increment rate over the entire transfer.

https://drive.google.com/file/d/0BwyUMjibonYQWHNIa0drWnNxMmM/view?usp=sharing

Here are screenshots from packet captures on both 1G interfaces:

https://drive.google.com/file/d/0BwyUMjibonYQRWhyVVQ4djNaU3c/view?usp=sharing
https://drive.google.com/file/d/0BwyUMjibonYQaTVjTEtTRloyR2c/view?usp=sharing

Note how we again see nice, smooth window adjustment, and no gaps in
transmission.


But now let's look at the problematic two-interface read operation. First, the
sequence graph:

https://drive.google.com/file/d/0BwyUMjibonYQTzdFVWdQMWZ6LUU/view?usp=sharing

As you can see, there are gaps and jumps in the transmission throughout the
transfer. It is very illustrative to look at captures of the gaps, which are
occurring on both interfaces:

https://drive.google.com/file/d/0BwyUMjibonYQc0VISXN6eVFwQzg/view?usp=sharing
https://drive.google.com/file/d/0BwyUMjibonYQVFREUHp3TGFiUU0/view?usp=sharing

As you can see, there are ~0.4-second pauses in transmission from the storage
server, which kills the transfer rate. It's clear that the ESXi box ACKs the
prior iSCSI operation to completion, then makes a new LUN request, which the
storage server immediately replies to. The ESXi ACKs the response packet from
the storage server, then waits... and waits... and waits... until eventually
the storage server starts transmitting again.

Because the pause happens while the ESXi client is waiting for a packet from
the storage server, that tells me that the gaps are not an artifact of traffic
being switched between the two active interfaces, but are actually indicative
of short hangs occurring on the server.

Having a pause or two in transmission is no big deal, but in my case it is
happening constantly, and it drops my overall read transfer rate down to
20-60 MB/s, which is slower than the single-interface transfer rate
(~90-100 MB/s).

Decreasing the MTU makes the pauses shorter; increasing it makes the pauses
longer.

Another interesting thing is that if I set the multipath I/O interval to 3
operations instead of 1, I get better throughput. In other words, the less
frequently I swap IP addresses on my iSCSI requests from the ESXi host, the
fewer pauses I see.
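The knob in question is the round-robin path-switching frequency on the ESXi
side; something like the following changes it from switching after every I/O
to every third I/O (the device ID is a placeholder, untested here):

# list the iSCSI devices and their current path selection policy
esxcli storage nmp device list

# switch paths every 3 I/Os instead of after every single I/O
esxcli storage nmp psp roundrobin deviceconfig set \
    --device=naa.600144f0xxxxxxxx --type=iops --iops=3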

Basically, COMSTAR seems to choke each time an iSCSI request from a new IP
arrives.

Because the single-interface transfer is near line rate, that tells me that
the storage system (mpt_sas, zfs, etc.) is working fine. It's only when
multiple paths are attempted that iSCSI falls on its face during reads.

All of these captures were taken without a cache device attached to the
storage zpool, so this isn't looking like some kind of ZFS ARC problem. As
mentioned previously, local transfers to/from the zpool show ~300-500 MB/s
rates over long (10 GB+) transfers.

-Warren V

On Sun, Mar 1, 2015 at 9:11 PM, Garrett D'Amore <garrett@damore.org> wrote:

I'm not sure I've followed properly. You have *two* interfaces. You are not
trying to provision these in an aggr are you? As far as I'm aware, VMware does
not support 802.3ad link aggregations. (It's possible that you can make it
work with ESXi if you give the entire NIC to the guest -- but I'm skeptical.)
The problem is that if you try to use link aggregation, some packets (up to
half!) will be lost. TCP and other protocols fare poorly in this situation.

It's possible I've totally misunderstood what you're trying to do, in which
case I apologize.

The idle thing is a red herring -- the CPU is waiting for work to do, probably
because packets haven't arrived (or were dropped by the hypervisor!). I
wouldn't read too much into that except that your network stack is in trouble.
I'd look a bit more closely at the kstats for tcp -- I suspect you'll see
retransmit or out-of-order values that are unusually high; if so, this may
help validate my theory above.

- Garrett
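A quick way to check that on the OmniOS box is to snapshot the TCP MIB
counters around a slow read test; a rough sketch, with the caveat that the
exact counter names vary a little between releases:

# snapshot the interesting TCP counters, run the slow read test, then diff
kstat -m tcp -n tcp | egrep -i 'retrans|dupack|unorder' > tcp.before
# ... run the multipath read test ...
kstat -m tcp -n tcp | egrep -i 'retrans|dupack|unorder' > tcp.after
diff tcp.before tcp.after

# or just look at the protocol summary
netstat -sP tcp | egrep -i 'retrans|dup'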

On Mar 1, 2015, at 9:03 PM, W Verb via illumos-developer
<developer@lists.illumos.org> wrote:

Hello all,

Well, I no longer blame the ixgbe driver for the problems I'm seeing.

I tried Joerg's updated driver, which didn't improve the issue. So I went back
to the drawing board and rebuilt the server from scratch.

What I noted is that if I have only a single 1-gig physical interface active
on the ESXi host, everything works as expected. As soon as I enable two
interfaces, I start seeing the performance problems I've described.

Response pauses from the server that I see in tcpdump captures still lead me
to believe the problem is a delay on the server side, so I ran a series of
kernel DTrace profiles and produced some flamegraphs.
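For anyone who wants to reproduce the graphs below, the usual recipe is a
timed profile of on-CPU kernel stacks fed through Brendan Gregg's FlameGraph
scripts; a rough sketch (the script names assume a checkout of the FlameGraph
tools):

# sample on-CPU kernel stacks at 997 Hz for 30 seconds
dtrace -x stackframes=100 \
       -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-30s { exit(0); }' \
       -o kernel.stacks

# fold the stacks and render the SVG
./stackcollapse.pl kernel.stacks > kernel.folded
./flamegraph.pl kernel.folded > read-path.svg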

This was taken during a read operation with two active 10G interfaces on the
server, with a single target being shared by two TPGs, one TPG for each 10G
physical port. The host device has two 1G ports enabled, with VLANs separating
the active ports into 10G/1G pairs. ESXi is set to multipath using both VLANs
with a round-robin I/O interval of 1.

https://drive.google.com/file/d/0BwyUMjibonYQd3ZYOGh4d2pteGs/view?usp=sharing

This was taken during a write operation:

https://drive.google.com/file/d/0BwyUMjibonYQMnBtU1Q2SXM2ams/view?usp=sharing

I then rebooted the server and disabled C-state, ACPI T-state, and general
EIST (Turbo Boost) functionality in the CPU.

When I attempted to boot my guest VM, the iSCSI transfer gradually ground to a
halt during the boot loading process, and the guest OS never did complete its
boot process.

Here is a flamegraph taken while iSCSI is slowly dying:

https://drive.google.com/file/d/0BwyUMjibonYQM21JeFZPX3dZWTg/view?usp=sharing

I edited out cpu_idle_adaptive from the dtrace output and regenerated the
slowdown graph:

https://drive.google.com/file/d/0BwyUMjibonYQbTVwV3NvXzlPS1E/view?usp=sharing

I then edited cpu_idle_adaptive out of the speedy write operation and
regenerated that graph:

https://drive.google.com/file/d/0BwyUMjibonYQeWFYM0pCMDZ1X2s/view?usp=sharing

I have zero experience with interpreting flamegraphs, but the most significant
difference I see between the slow read example and the fast write example is
in unix`thread_start --> unix`idle. There's a good chunk of "unix`i86_mwait"
in the read example that is not present in the write example at all.

Disabling the L2ARC cache device didn't make a difference, and I had to
re-enable EIST support on the CPU to get my VMs to boot.

I am seeing a variety of bug reports going back to 2010 regarding excessive
mwait operations, with the suggested solution usually being to set
"cpupm enable poll-mode" in power.conf. That change also had no effect on
speed.
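For completeness, that change is a one-line edit to /etc/power.conf, applied
with pmconfig; disabling cpupm entirely is the other obvious variant to try
(untested sketch):

# /etc/power.conf - either poll the CPUs instead of using deep idle states ...
cpupm enable poll-mode
# ... or take CPU power management out of the picture altogether
#cpupm disable

# apply the new power.conf
pmconfig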

-Warren V


-----Original Message-----
From: Chris Siebenmann [mailto:cks@cs.toronto.edu]
Sent: Monday, February 23, 2015 8:30 AM
To: W Verb
Cc: omnios-discuss@lists.omniti.com; cks@cs.toronto.edu
Subject: Re: [OmniOS-discuss] The ixgbe driver, Lindsay Lohan, and the Greek
economy

> Chris, thanks for your specific details. I'd appreciate it if you
> could tell me which copper NIC you tried, as well as to pass on the
> iSCSI tuning parameters.

Our copper NIC experience is with onboard X540-AT2 ports on SuperMicro
hardware (which have the guaranteed 10-20 msec lock hold) and dual-port
82599EB TN cards (which have some sort of driver/hardware failure under load
that eventually leads to 2-second lock holds). I can't recommend either with
the current driver; we had to revert to 1G networking in order to get stable
servers.

The iSCSI parameter modifications we do, across both initiators and targets,
are:

initialr2t          no
firstburstlength    128k
maxrecvdataseglen   128k   [only on Linux backends]
maxxmitdataseglen   128k   [only on Linux backends]

The OmniOS initiator doesn't need tuning for more than the first two
parameters; on the Linux backends we tune up all four. My extended thoughts on
these tuning parameters and why we touch them can be found here:

http://utcc.utoronto.ca/~cks/space/blog/tech/UnderstandingiSCSIProtocol
http://utcc.utoronto.ca/~cks/space/blog/tech/LikelyISCSITuning

The short version is that these parameters probably only make a small
difference, but their overall goal is to do 128KB ZFS reads and writes in
single iSCSI operations (although they will be fragmented at the TCP layer)
and to do iSCSI writes without a back-and-forth delay between initiator and
target (that's 'initialr2t no').

I think basically everyone should use InitialR2T set to no, and in fact that
it should be the software default. These days only unusually limited iSCSI
targets should need it to be otherwise, and they can change their setting for
it (initiator and target must both agree to it being 'yes', so either can veto
it).

 - cks

On Mon, Feb 23, 2015 at 8:21 AM, Joerg Goltermann <jg@osn.de> wrote:

Hi,

I think your problem is caused by your link properties or your switch
settings. In general the standard ixgbe seems to perform well.

I had trouble after changing the default flow control settings to "bi", and
that was my motivation to update the ixgbe driver a long time ago. After I
updated our systems to ixgbe 2.5.8 I never had any problems.

Make sure your switch has support for jumbo frames and that you use the same
MTU on all ports, otherwise the smallest will be used.

What switch do you use? I can tell you nice horror stories about different
vendors....

 - Joerg

On 23.02.2015 10:31, W Verb wrote:

Thank you Joerg,

I've downloaded the package and will try it tomorrow.

The only thing I can add at this point is that upon review of my testing, I
may have performed my "pkg -u" between the initial quad-gig performance test
and installing the 10G NIC. So this may be a new problem introduced in the
latest updates.

Those of you who are running 10G and have not upgraded to the latest kernel,
etc., might want to do some additional testing before running the update.

-Warren V

On Mon, Feb 23, 2015 at 1:15 AM, Joerg Goltermann <jg@osn.de> wrote:
Hi,

I remember there was a problem with the flow control settings in the ixgbe
driver, so I updated it a long time ago for our internal servers to 2.5.8.
Last weekend I integrated the latest changes from the FreeBSD driver to bring
the illumos ixgbe to 2.5.25, but I had no time to test it, so it's completely
untested!

If you would like to give the latest driver a try you can fetch the kernel
modules from https://cloud.osn.de/index.php/s/Fb4so9RsNnXA7r9

Clone your boot environment, place the modules in the new environment and
update the boot-archive of the new BE.
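A minimal sketch of that procedure (the BE name is arbitrary and the 64-bit
module path is an assumption; the module itself comes from the archive above):

# create and mount a clone of the current boot environment
beadm create ixgbe-2.5.25
beadm mount ixgbe-2.5.25 /mnt

# drop the new 64-bit module in place and rebuild the BE's boot archive
cp ixgbe /mnt/kernel/drv/amd64/ixgbe
bootadm update-archive -R /mnt

# activate the new BE and reboot into it; the old BE stays as a fallback
beadm unmount ixgbe-2.5.25
beadm activate ixgbe-2.5.25
init 6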

 - Joerg


On 23.02.2015 02:54, W Verb wrote:

By the way, to those of you who have working setups: please send me your
pool/volume settings, interface linkprops, and any kernel tuning parameters
you may have set.

Thanks,
Warren V

On Sat, Feb 21, 2015 at 7:59 AM, Schweiss, Chip <chip@innovates.com> wrote:
I can't say I totally agree with your performance assessment. I run Intel
X520s in all my OmniOS boxes.

Here is a capture of nfssvrtop I made while running many storage vMotions
between two OmniOS boxes hosting NFS datastores. This is a 10-host VMware
cluster. Both OmniOS boxes are dual 10G connected with copper twinax to the
in-rack Nexus 5010.

VMware does 100% sync writes; I use ZeusRAM SSDs for log devices.

-Chip

2014 Apr 24 08:05:51, load: 12.64, read: 17330243 KB, swrite: 15985 KB,
awrite: 1875455 KB

Ver  Client        NFSOPS  Reads SWrites AWrites Commits   Rd_bw SWr_bw  AWr_bw  Rd_t SWr_t AWr_t Com_t Align%
4    10.28.17.105       0      0       0       0       0       0      0       0     0     0     0     0      0
4    10.28.17.215       0      0       0       0       0       0      0       0     0     0     0     0      0
4    10.28.17.213       0      0       0       0       0       0      0       0     0     0     0     0      0
4    10.28.16.151       0      0       0       0       0       0      0       0     0     0     0     0      0
4    all                1      0       0       0       0       0      0       0     0     0     0     0      0
3    10.28.16.175       3      0       3       0       0       1     11       0  4806    48     0     0     85
3    10.28.16.183       6      0       6       0       0       3    162       0   549   124     0     0     73
3    10.28.16.180      11      0      10       0       0       3     27       0   776    89     0     0     67
3    10.28.16.176      28      2      26       0       0      10    405       0  2572   198     0     0    100
3    10.28.16.178    4606   4602       4       0       0  294534      3       0   723    49     0     0     99
3    10.28.16.179    4905   4879      26       0       0  312208    311       0   735   271     0     0     99
3    10.28.16.181    5515   5502      13       0       0  352107     77       0    89    87     0     0     99
3    10.28.16.184   12095  12059      10       0       0  763014     39       0   249   147     0     0     99
3    10.28.58.1     15401   6040     116    6354      53  191605    474  202346   192    96   144    83     99
3    all            42574  33086     217    6354      53 1913488   1582  202300   348   138   153   105     99

On Fri, Feb 20, 2015 at 11:46 PM, W Verb <wverb73@gmail.com> wrote:
Hello All,

Thank you for your replies.
I tried a few things, and found the following:

1: Disabling hyperthreading support in the BIOS drops performance overall by a
factor of 4.
2: Disabling VT support also seems to have some effect, although it appears to
be minor. But this has the amusing side effect of fixing the hangs I've been
experiencing with fast reboot, probably by disabling kvm.
3: The performance tests are a bit tricky to quantify because of caching
effects. In fact, I'm not entirely sure what is happening here. It's best to
just describe what I'm seeing:

The commands I'm using to test are
dd if=/dev/zero of=./test.dd bs=2M count=5000
dd of=/dev/null if=./test.dd bs=2M count=5000

The guest VM is running CentOS 6.6 and has the latest vmtools installed. There
is also a host cache on an SSD local to the ESXi host. Disabling the host
cache didn't immediately have an effect as far as I could see.
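One way to take the guest page cache out of these numbers (assuming GNU dd
inside the CentOS guest) is to repeat the same tests with direct I/O and a
data set larger than guest RAM, e.g.:

# write test, bypassing the guest page cache and forcing a final flush
dd if=/dev/zero of=./test.dd bs=2M count=5000 oflag=direct conv=fsync

# read test, again bypassing the guest page cache
dd if=./test.dd of=/dev/null bs=2M count=5000 iflag=direct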

The host MTU was set to 3000 on all iSCSI interfaces for all tests.

Test 1: Right after reboot, with an ixgbe MTU of 9000, the write test yields
an average speed over three tests of 137 MB/s. The read test yields an average
over three tests of 5 MB/s.

Test 2: After setting "ifconfig ixgbe0 mtu 3000", the write tests yield
140 MB/s, and the read tests yield 53 MB/s. It's important to note here that
if I cut the read test short at only 2-3 GB, I get results upwards of
350 MB/s, which I assume is local cache-related distortion.

Test 3: MTU of 1500. Read tests are up to 156 MB/s. Write tests yield about
142 MB/s.
Test 4: MTU of 1000: Read test at 182 MB/s.
Test 5: MTU of 900: Read test at 130 MB/s.
Test 6: MTU of 1000: Read test at 160 MB/s. Write tests are now consistently
at about 300 MB/s.
Test 7: MTU of 1200: Read test at 124 MB/s.
Test 8: MTU of 1000: Read test at 161 MB/s. Write at 261 MB/s.

A few final notes:
L1ARC grabs about 10 GB of RAM during the tests, so there's definitely some
read caching going on.
The write operations are easier to observe with iostat, and I'm seeing I/O
rates that closely correlate with the network write speeds.

Chris, thanks for your specific details. I'd appreciate it if you could tell
me which copper NIC you tried, as well as to pass on the iSCSI tuning
parameters.

I've ordered an Intel EXPX9502AFXSR, which uses the 82598 chip instead of the
82599 in the X520. If I get similar results with my fiber transceivers, I'll
see if I can get a hold of copper ones.

But I should mention that I did indeed look at PHY/MAC error rates, and they
are nil.
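For reference, those link-level error and pause counters can be watched on the
OmniOS side with something like the following (the exact ixgbe kstat statistic
names are a guess and vary by driver version):

# per-link packet/error counters, refreshed every 5 seconds
dladm show-link -s -i 5 ixgbe0

# negotiated speed/duplex/flow-control state
dladm show-ether ixgbe0

# raw driver counters, filtered for errors and drops
kstat -m ixgbe | egrep -i 'err|drop|nobuf'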

-Warren V

On Fri, Feb 20, 2015 at 7:25 PM, Chris Siebenmann <cks@cs.toronto.edu> wrote:

> After installation and configuration, I observed all kinds of bad behavior
> in the network traffic between the hosts and the server. All of this bad
> behavior is traced to the ixgbe driver on the storage server. Without going
> into the full troubleshooting process, here are my takeaways:
[...]

For what it's worth, we managed to achieve much better line rates on copper
10G ixgbe hardware of various descriptions between OmniOS and CentOS 7 (I
don't think we ever tested OmniOS to OmniOS). I don't believe OmniOS could do
TCP at full line rate but I think we managed 700+ Mbytes/sec on both transmit
and receive and we got basically disk-limited speeds with iSCSI (across
multiple disks on multi-disk mirrored pools, OmniOS iSCSI initiator, Linux
iSCSI targets).

I don't believe we did any specific kernel tuning (and in fact some of our
attempts to fiddle ixgbe driver parameters blew up in our face). We did tune
iSCSI connection parameters to increase various buffer sizes so that ZFS could
do even large single operations in single iSCSI transactions. (More details
available if people are interested.)

> 10: At the wire level, the speed problems are clearly due to pauses in
> response time by omnios. At 9000 byte frame sizes, I see a good number of
> duplicate ACKs and fast retransmits during read operations (when omnios is
> transmitting). But below about a 4100-byte MTU on omnios (which seems to
> correlate to 4096-byte iSCSI block transfers), the transmission errors fade
> away and we only see the transmission pause problem.

This is what really attracted my attention. In our OmniOS setup, our specific
Intel hardware had ixgbe driver issues that could cause activity stalls during
once-a-second link heartbeat checks. This obviously had an effect at the TCP
and iSCSI layers. My initial message to illumos-developer sparked a
potentially interesting discussion:

http://www.listbox.com/member/archive/182179/2014/10/sort/time_rev/page/16/entry/6:405/20141003125035:6357079A-4B1D-11E4-A39C-D534381BA44D/

If you think this is a possibility in your setup, I've put the DTrace script I
used to hunt for this up on the web:

http://www.cs.toronto.edu/~cks/src/omnios-ixgbe/ixgbe_delay.d

This isn't the only potential source of driver stalls by any means, it's just
the one I found. You may also want to look at lockstat in general, as
information it reported is what led us to look specifically at the ixgbe code
here.

(If you suspect kernel/driver issues, lockstat combined with kernel source is
a really excellent resource.)

 - cks

--
OSN Online Service Nuernberg GmbH, Bucher Str. 78, 90408 Nuernberg
Tel: +49 911 39905-0 - Fax: +49 911 39905-55 - http://www.osn.de
HRB 15022 Nuernberg, USt-Id: DE189301263, GF: Joerg Goltermann