[OmniOS-discuss] Slow NFS speeds at rsize > 128k

Richard Elling richard.elling at richardelling.com
Wed Jan 7 23:01:28 UTC 2015


> On Jan 7, 2015, at 1:21 PM, Stephan Budach <stephan.budach at jvm.de> wrote:
> 
> Am 07.01.15 um 21:48 schrieb Richard Elling:
>> 
>>> On Jan 7, 2015, at 12:11 PM, Stephan Budach <stephan.budach at jvm.de> wrote:
>>> 
>>> Am 07.01.15 um 18:00 schrieb Richard Elling:
>>>> 
>>>>> On Jan 7, 2015, at 2:28 AM, Stephan Budach <stephan.budach at JVM.DE> wrote:
>>>>> 
>>>>> Hello everyone,
>>>>> 
>>>>> I am sharing my zfs via NFS to a couple of OVM nodes. I noticed really bad NFS read performance when rsize goes beyond 128k, whereas performance is just fine at 32k. The issue is that the ovs-agent, which performs the actual mount, doesn't accept or pass any NFS mount options to the NFS server.
>>>> 
>>>> The other issue is that illumos/Solaris on x86 tuning of server-side size settings does
>>>> not work because the compiler optimizes away the tunables. There is a trivial fix, but it
>>>> requires a rebuild.
>>>> 
>>>>> To give some numbers, an rsize of 1 MB results in a read throughput of approx. 2 MB/s, whereas an rsize of 32k gives me 110 MB/s. Mounting an NFS export from an OEL 6u4 box has no issues with this, as read speeds from that export are 108+ MB/s regardless of the rsize of the NFS mount.
>>>> 
>>>> Brendan wrote about a similar issue in the Dtrace book as a case study. See chapter 5
>>>> case study on ZFS 8KB mirror reads.
>>>> 
>>>>> 
>>>>> The OmniOS box is currently connected to a 10GbE port on our core 6509, but the NFS client is connected through a 1GbE port only. MTU is 1500 and currently cannot be raised.
>>>>> Does anyone have a tip on why an rsize of 64k+ results in such a performance drop?
>>>> 
>>>> It is entirely due to optimizations for small I/O going way back to the 1980s.
>>>>  -- richard
>>> But doesn't that mean that Oracle Solaris will have the same issue, or has Oracle addressed that in recent Solaris versions? Not that I am intending to switch over, but that would be something I'd like to give my SR engineer to chew on…
>> 
>> Look for yourself :-)
>> In "broken" systems, such as this Solaris 11.1 system:
>> # echo nfs3_tsize::dis | mdb -k
>> nfs3_tsize:                     pushq  %rbp
>> nfs3_tsize+1:                   movq   %rsp,%rbp
>> nfs3_tsize+4:                   subq   $0x8,%rsp
>> nfs3_tsize+8:                   movq   %rdi,-0x8(%rbp)
>> nfs3_tsize+0xc:                 movl   (%rdi),%eax
>> nfs3_tsize+0xe:                 leal   -0x2(%rax),%ecx
>> nfs3_tsize+0x11:                cmpl   $0x1,%ecx
>> nfs3_tsize+0x14:                jbe    +0x12    <nfs3_tsize+0x28>
>> nfs3_tsize+0x16:                cmpl   $0x5,%eax
>> nfs3_tsize+0x19:                movl   $0x100000,%eax
>> nfs3_tsize+0x1e:                movl   $0x8000,%ecx
>> nfs3_tsize+0x23:                cmovl.ne %ecx,%eax
>> nfs3_tsize+0x26:                jmp    +0x5     <nfs3_tsize+0x2d>
>> nfs3_tsize+0x28:                movl   $0x100000,%eax
>> nfs3_tsize+0x2d:                leave  
>> nfs3_tsize+0x2e:                ret    
>> 
>> at +0x19 you'll notice the hardwired 1 MB (0x100000)
> Ouch! Is that from an NFS client or server?

server

> Or rather, I know that the NFS server negotiates the options with the client, and if no options are passed from the client to the server, the server sets up the connection with its defaults.

the server and client negotiate, so both can have defaults

> So, this S11.1 output - is that from the NFS server? If yes, it would mean that the NFS server would go with the 1 MB rsize/wsize, since the Oracle VM Server has not provided any options to it.

You are not mistaken. AFAIK, this has been broken in Solaris x86 for more than 10 years.
Fortunately, most people can adjust on the client side, unless you're running ESX or something
that is difficult to adjust... like you seem to be.
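For clients that do let you adjust, capping the transfer size on the mount itself is the usual workaround. A sketch, assuming NFSv3 and placeholder server/path names:

```shell
# Cap NFSv3 transfer sizes at 32k on the client side; the effective value
# is negotiated as min(client request, server limit).
# "omnios-server:/tank/repo" and "/mnt/repo" are example names.
mount -o vers=3,rsize=32768,wsize=32768 omnios-server:/tank/repo /mnt/repo

# On Linux clients, verify what was actually negotiated:
grep rsize /proc/mounts
```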

>> 
>> by contrast, on a proper system
>> # echo nfs3_tsize::dis | mdb -k
>> nfs3_tsize:                     pushq  %rbp
>> nfs3_tsize+1:                   movq   %rsp,%rbp
>> nfs3_tsize+4:                   subq   $0x10,%rsp
>> nfs3_tsize+8:                   movq   %rdi,-0x8(%rbp)
>> nfs3_tsize+0xc:                 movl   (%rdi),%edx
>> nfs3_tsize+0xe:                 leal   -0x2(%rdx),%eax
>> nfs3_tsize+0x11:                cmpl   $0x1,%eax
>> nfs3_tsize+0x14:                jbe    +0x12    <nfs3_tsize+0x28>
>> nfs3_tsize+0x16:                movl   -0x37f8ea60(%rip),%eax   <nfs3_max_transfer_size_rdma>
>> nfs3_tsize+0x1c:                cmpl   $0x5,%edx
>> nfs3_tsize+0x1f:                cmovl.ne -0x37f8ea72(%rip),%eax <nfs3_max_transfer_size_clts>
>> nfs3_tsize+0x26:                leave  
>> nfs3_tsize+0x27:                ret    
>> nfs3_tsize+0x28:                movl   -0x37f8ea76(%rip),%eax   <nfs3_max_transfer_size_cots>
>> nfs3_tsize+0x2e:                leave  
>> nfs3_tsize+0x2f:                ret    
>> 
>> where you can actually tune it according to the Solaris Tunable Parameters guide.
>> 
>> NB, we fixed this years ago at Nexenta and I'm certain it has not been upstreamed. There are
>> a number of other related fixes, all of the same nature. If someone is inclined to upstream,
>> contact me directly.
>> 
>> Once fixed, you'll be able to change the server's settings for negotiating the rsize/wsize with
>> the clients. Many NAS vendors use smaller limits, and IMHO it is a good idea anyway. For 
>> example, see http://blog.richardelling.com/2012/04/latency-and-io-size-cars-vs-trains.html
>>  -- richard
>> 
> I am mostly satisfied with a transfer size of 32k. This NFS export is used as a storage repository for the vdisk images and approx. 80 guests access them, so the I/O is random and smaller I/Os are preferred anyway. However, the NFS export from the OEL box just doesn't have this massive performance hit, even with an rsize/wsize of 1 MB.

Yes, this is not the only issue you're facing. Even with modest hardware and OOB settings, it is
easy to soak 1GbE. For ZFS backends, we use 128k as the max rsize/wsize, since that is a
practical upper limit (even though you can have larger block sizes in ZFS).
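On a kernel where the tunables have not been compiled away (i.e. with the fix discussed above), the server-side cap can be pinned in /etc/system. A sketch; the tunable names come from the Solaris Tunable Parameters guide and the disassembly above, and 128k matches the limit mentioned here:

```shell
# /etc/system fragment: cap NFSv3 transfer sizes at 128k on the server.
# Takes effect after reboot; only works where the tunables survive compilation.
set nfs:nfs3_max_transfer_size = 131072
set nfs:nfs3_max_transfer_size_cots = 131072
```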

Here are the OOB TCP parameters we use:
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   max_buf               rw   16777216     16777216     1048576      8192-1073741824
tcp   recv_buf              rw   1250000      1250000      1048576      2048-16777216
tcp   sack                  rw   active       --           active       never,passive,
                                                                        active
tcp   send_buf              rw   1250000      1250000      128000       4096-16777216
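On illumos these properties can be set with ipadm; a sketch using the values from the table above:

```shell
# Set the TCP buffer properties shown above (ipadm persists them across reboot)
ipadm set-prop -p max_buf=16777216 tcp
ipadm set-prop -p recv_buf=1250000 tcp
ipadm set-prop -p send_buf=1250000 tcp
ipadm set-prop -p sack=active tcp

# Verify:
ipadm show-prop -p max_buf,recv_buf,send_buf,sack tcp
```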

no real magic here, but if you measure your network closely and it doesn't change much, then
you can pre-set the values from your BDP.
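As a rough worked example of sizing a buffer from the bandwidth-delay product (the link speed and RTT here are illustrative assumptions, not measurements from this thread):

```shell
#!/bin/sh
# BDP = bandwidth (bytes/s) * round-trip time (s).
# Example: 1 GbE (125,000,000 bytes/s) with an assumed 10 ms RTT.
BW=125000000
RTT_MS=10
BDP=$(( BW * RTT_MS / 1000 ))
echo "BDP = ${BDP} bytes"    # 1250000 bytes, matching the recv_buf above
```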

And, of course, following the USE method (utilization, saturation, errors), check for errors... I can't count the number of
times bad transceivers, cabling, or switch settings have tripped people up.
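On illumos, a quick error sweep might look like this (counter names vary by NIC driver):

```shell
# Link-level statistics; IERRORS/OERRORS should stay at zero
dladm show-link -s

# Per-NIC kstats; look for error counters (crc_errors and friends)
kstat -p -c net | grep -i err

# TCP retransmission statistics; rising retransmits suggest path problems
netstat -s -P tcp | grep -i retrans
```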
 -- richard

>> 
>>> 
>>> In any case, the first bummer is that Oracle chose not to have its ovs-agent be capable of accepting and passing the NFS mount options…
>>> 
>>> Cheers,
>>> budy
>> 
> Thanks,
> budy

