[OmniOS-discuss] Slow NFS speeds at rsize > 128k

Stephan Budach stephan.budach at JVM.DE
Thu Jan 8 10:25:57 UTC 2015


Am 08.01.15 um 00:01 schrieb Richard Elling:
>
>> On Jan 7, 2015, at 1:21 PM, Stephan Budach <stephan.budach at jvm.de 
>> <mailto:stephan.budach at jvm.de>> wrote:
>>
>> Am 07.01.15 um 21:48 schrieb Richard Elling:
>>>
>>>> On Jan 7, 2015, at 12:11 PM, Stephan Budach <stephan.budach at jvm.de 
>>>> <mailto:stephan.budach at jvm.de>> wrote:
>>>>
>>>> Am 07.01.15 um 18:00 schrieb Richard Elling:
>>>>>
>>>>>> On Jan 7, 2015, at 2:28 AM, Stephan Budach <stephan.budach at JVM.DE 
>>>>>> <mailto:stephan.budach at JVM.DE>> wrote:
>>>>>>
>>>>>> Hello everyone,
>>>>>>
>>>>>> I am sharing my zfs via NFS to a couple of OVM nodes. I noticed 
>>>>>> really bad NFS read performance when rsize goes beyond 128k, 
>>>>>> whereas the performance is just fine at 32k. The issue is that 
>>>>>> the ovs-agent, which performs the actual mount, doesn't 
>>>>>> accept or pass any NFS mount options to the NFS server.
>>>>>
>>>>> The other issue is that on illumos/Solaris x86, tuning the 
>>>>> server-side size settings does not work, because the compiler 
>>>>> optimizes away the tunables. There is a trivial fix, but it 
>>>>> requires a rebuild.
>>>>>
>>>>>> To give some numbers: an rsize of 1 MB results in a read 
>>>>>> throughput of approx. 2 MB/s, whereas an rsize of 32k gives me 
>>>>>> 110 MB/s. Mounting an NFS export from an OEL 6u4 box has no 
>>>>>> issues with this, as the read speeds from that export are 
>>>>>> 108+ MB/s regardless of the rsize of the NFS mount.
>>>>>
>>>>> Brendan wrote about a similar issue in the DTrace book as a case 
>>>>> study. See the Chapter 5 case study on ZFS 8 KB mirror reads.
>>>>>
>>>>>>
>>>>>> The OmniOS box is currently connected to a 10GbE port on our core 
>>>>>> 6509, but the NFS client is connected through a 1GbE port only. 
>>>>>> MTU is at 1500 and cannot currently be raised.
>>>>>> Does anyone have a tip as to why an rsize of 64k+ results in such 
>>>>>> a performance drop?
>>>>>
>>>>> It is entirely due to optimizations for small I/O going way back 
>>>>> to the 1980s.
>>>>>  -- richard
>>>> But doesn't that mean that Oracle Solaris has the same issue, or 
>>>> has Oracle addressed it in recent Solaris versions? Not that I 
>>>> intend to switch over, but that would be something I'd like to 
>>>> give my SR engineer to chew on…
>>>
>>> Look for yourself :-)
>>> In "broken" systems, such as this Solaris 11.1 system:
>>> # echo nfs3_tsize::dis | mdb -k
>>> nfs3_tsize:                     pushq  %rbp
>>> nfs3_tsize+1:                   movq   %rsp,%rbp
>>> nfs3_tsize+4:                   subq   $0x8,%rsp
>>> nfs3_tsize+8:                   movq   %rdi,-0x8(%rbp)
>>> nfs3_tsize+0xc:                 movl   (%rdi),%eax
>>> nfs3_tsize+0xe:                 leal   -0x2(%rax),%ecx
>>> nfs3_tsize+0x11:                cmpl   $0x1,%ecx
>>> nfs3_tsize+0x14:                jbe    +0x12 <nfs3_tsize+0x28>
>>> nfs3_tsize+0x16:                cmpl   $0x5,%eax
>>> nfs3_tsize+0x19:                movl   $0x100000,%eax
>>> nfs3_tsize+0x1e:                movl   $0x8000,%ecx
>>> nfs3_tsize+0x23:                cmovl.ne %ecx,%eax
>>> nfs3_tsize+0x26:                jmp    +0x5 <nfs3_tsize+0x2d>
>>> nfs3_tsize+0x28:                movl   $0x100000,%eax
>>> nfs3_tsize+0x2d:                leave
>>> nfs3_tsize+0x2e:                ret
>>>
>>> At +0x19 you'll notice the hardwired 1 MB (0x100000).
>> Ouch! Is that from a NFS client or server?
>
> server
>
>> Or rather, I know that the NFS server negotiates the options with the 
>> client, and if no options are passed from the client to the server, 
>> the server sets up the connection with its defaults.
>
> the server and client negotiate, so both can have defaults
>
>> So, this S11.1 output - is that from the NFS server? If yes, it would 
>> mean that the NFS server would go with the 1 MB rsize/wsize, since the 
>> OracleVM Server has not provided any options to it.
>
> You are not mistaken. AFAIK, this has been broken in Solaris x86 for 
> more than 10 years.
> Fortunately, most people can adjust on the client side, unless you're 
> running ESX or something
> that is difficult to adjust... like you seem to be.
Yes, I am - and my current workaround is to remount the NFS shares 
manually, prior to starting any guests that reside on those shares. This 
is so dumb of Oracle… I have raised an ER for that, since this is the 
only way to make sure this scenario can reliably work in any NFS 
environment, but that's of course totally off-topic. ;)
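For the record, the manual remount would look something like this on the OVM (Linux) client - server name and repository path are placeholders, and 32k matches the transfer size that performed well above:

```
# Remount with explicit transfer sizes before starting the guests
# (hypothetical server/path, shown for illustration only).
mount -o remount,rsize=32768,wsize=32768 \
    nfssrv:/export/repo /OVS/Repositories/repo
```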
>
>>>
>>> by contrast, on a proper system
>>> # echo nfs3_tsize::dis | mdb -k
>>> nfs3_tsize:                     pushq  %rbp
>>> nfs3_tsize+1:                   movq   %rsp,%rbp
>>> nfs3_tsize+4:                   subq   $0x10,%rsp
>>> nfs3_tsize+8:                   movq   %rdi,-0x8(%rbp)
>>> nfs3_tsize+0xc:                 movl   (%rdi),%edx
>>> nfs3_tsize+0xe:                 leal   -0x2(%rdx),%eax
>>> nfs3_tsize+0x11:                cmpl   $0x1,%eax
>>> nfs3_tsize+0x14:                jbe    +0x12 <nfs3_tsize+0x28>
>>> nfs3_tsize+0x16:                movl   -0x37f8ea60(%rip),%eax <nfs3_max_transfer_size_rdma>
>>> nfs3_tsize+0x1c:                cmpl   $0x5,%edx
>>> nfs3_tsize+0x1f:                cmovl.ne -0x37f8ea72(%rip),%eax <nfs3_max_transfer_size_clts>
>>> nfs3_tsize+0x26:                leave
>>> nfs3_tsize+0x27:                ret
>>> nfs3_tsize+0x28:                movl   -0x37f8ea76(%rip),%eax <nfs3_max_transfer_size_cots>
>>> nfs3_tsize+0x2e:                leave
>>> nfs3_tsize+0x2f:                ret
>>>
>>> where you can actually tune it according to the Solaris Tunable 
>>> Parameters guide.
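On such a fixed build, the server-side maximum could then be set persistently via /etc/system. A hedged sketch: nfs3_max_transfer_size is the name documented in the Solaris Tunable Parameters guide, the _cots variant is the symbol visible in the disassembly above, and 131072 (128k) is just an example value.

```
* Example /etc/system entries (reboot required); values in bytes.
set nfs:nfs3_max_transfer_size = 131072
set nfs:nfs3_max_transfer_size_cots = 131072
```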
>>>
>>> NB, we fixed this years ago at Nexenta and I'm certain it has not 
>>> been upstreamed. There are a number of other related fixes, all of 
>>> the same nature. If someone is inclined to upstream them, contact 
>>> me directly.
>>>
>>> Once fixed, you'll be able to change the server's settings for 
>>> negotiating the rsize/wsize with the clients. Many NAS vendors use 
>>> smaller limits, and IMHO it is a good idea anyway. For example, see 
>>> http://blog.richardelling.com/2012/04/latency-and-io-size-cars-vs-trains.html
>>>  -- richard
>>>
>> I am mostly satisfied with a transfer size of 32k. As this NFS share 
>> is used as a storage repository for the vdisk images and approx. 80 
>> guests are accessing those, the I/O is random and smaller I/Os are 
>> preferred anyway. However, the NFS export from the OEL box just 
>> doesn't take this massive performance hit, even with a rsize/wsize 
>> of 1 MB.
>
> Yes, this is not the only issue you're facing. Even with modest 
> hardware and OOB settings, it is
> easy to soak 1GbE. For ZFS backends, we use 128k as the max 
> rsize/wsize, since that is a
> practical upper limit (even though you can have larger block sizes in 
> ZFS).
>
> Here are the OOB tcp parameters we use:
> PROTO PROPERTY   PERM  CURRENT    PERSISTENT  DEFAULT   POSSIBLE
> tcp   max_buf    rw    16777216   16777216    1048576   8192-1073741824
> tcp   recv_buf   rw    1250000    1250000     1048576   2048-16777216
> tcp   sack       rw    active     --          active    never,passive,active
> tcp   send_buf   rw    1250000    1250000     128000    4096-16777216
>
> no real magic here, but if you measure your network closely and it 
> doesn't change much, then
> you can pre-set the values from your BDP.
>
> And, of course, following the USE method, check for errors... I 
> can't count the number of times bad transceivers, cabling, or 
> switch settings have tripped people up.
>  -- richard
>
Thanks for sharing your insights. Do you think the situation will 
improve once we finish our network transition from our (mostly) 1GbE 
network infrastructure to Nexus gear running 10GbE?


Thanks,
budy


-- 
Stephan Budach
Managing Director
Jung von Matt/it-services GmbH
Glashüttenstraße 79
20357 Hamburg


Tel: +49 40-4321-1353
Fax: +49 40-4321-1114
E-Mail: stephan.budach at jvm.de
Internet: http://www.jvm.com

Geschäftsführer: Stephan Budach
AG HH HRB 98380
