[OmniOS-discuss] Testing RSF-1 with zpool/nfs HA

Andrew Gabriel illumos at cucumber.demon.co.uk
Thu Feb 18 08:29:46 UTC 2016


On 18/02/2016 06:13, Stephan Budach wrote:
> Hi,
>
> I have been test driving RSF-1 for the last week to accomplish the 
> following:
>
> - cluster a zpool that is made up of 8 mirrored vdevs, based on 
> 8 x 2 SSD mirrors served via iSCSI from another OmniOS box
> - export an NFS share from the above zpool via a VIP
> - have RSF-1 provide the failover and VIP moving
> - use the NFS share as a repository for my Oracle VM guests and vdisks
>
> The setup seems to work fine, but I have one issue that I can't seem 
> to get solved. Whenever I fail over the zpool, any in-flight NFS data 
> stalls for some unpredictable time. Sometimes it takes not much longer 
> than the "move" time of the resources, but sometimes it takes up to 
> 5 minutes until the NFS client on my VM server becomes responsive again.
>
> So, when I issue a simple ls -l on the folder of the vdisks while the 
> switchover is happening, the command sometimes completes in 18 to 20 
> seconds, but sometimes ls will just sit there for minutes.
>
> I wonder if there's anything I could do about that. I have already 
> played with several timeouts, NFS-wise and TCP-wise, but nothing seems 
> to have any effect on this issue. Does anyone know some tricks to 
> speed up recovery of the in-flight data?

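As a first check, it is worth confirming which mount options and timeouts 
are actually in effect on the client while you test, rather than the ones 
you intended to set. A minimal sketch (the mount point /mnt/vmrepo is a 
placeholder for your repository mount):

    # On the NFS client (the Oracle VM server):
    nfsstat -m                     # shows proto, vers, timeo, retrans per NFS mount
    time ls -l /mnt/vmrepo/vdisks  # time the stall while a switchover is running
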
I would capture a snoop trace on both sides of the cluster and see 
what's happening. In this case, I would run snoop in non-promiscuous 
mode, at least initially, to avoid picking up any frames that the IP 
stack is going to discard.
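
A minimal sketch of such a capture, run on both heads around a failover 
(the interface name ixgbe0 and the file path are assumptions; substitute 
your own):

    # -P = non-promiscuous mode, -d = interface, -o = write raw capture file
    snoop -P -d ixgbe0 -o /var/tmp/failover.snoop
    # afterwards, replay the capture with absolute timestamps
    snoop -i /var/tmp/failover.snoop -t a | less

Capturing without a filter also keeps the ARP traffic for the VIP in the 
trace, which is relevant to the next point.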

Can you look at the ARP cache on the client during the stall?
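
For example, on the client while an ls is hanging (the VIP address 
192.0.2.10 is a placeholder; the second form is for a Linux client):

    arp -an | grep 192.0.2.10          # Solaris/illumos client
    ip neigh show | grep 192.0.2.10    # Linux client

If the entry still points at the old head's MAC address, the client will 
stall until that entry is updated or times out, which would suggest the 
gratuitous ARP normally sent when the VIP moves is not reaching the client.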

BTW, if you have 2 clustered heads both relying on another single system 
providing the iSCSI, that's a strange setup: the iSCSI box is itself a 
single point of failure, so it may be giving you less availability (and 
less performance) than serving NFS directly from the SSD system without 
clustering.

-- 
Andrew

