[OmniOS-discuss] r151012 nlockmgr fails to start

Jacob Vosmaer contact at jacobvosmaer.nl
Mon Oct 20 22:53:11 UTC 2014


I appear to be having the same problem on an HP MicroServer N54L. Messages:

/sbin/dhcpagent[95]: [ID 778557 daemon.warning] configure_v4_lease: no IP
broadcast specified for bge0, making best guess
rootnex: [ID 349649 kern.info] iscsi0 at root
genunix: [ID 936769 kern.info] iscsi0 is /iscsi
pseudo: [ID 129642 kern.info] pseudo-device: dtrace0
genunix: [ID 936769 kern.info] dtrace0 is /pseudo/dtrace at 0
klmmod: [ID 814159 kern.notice] NOTICE: Failed to connect to local statd
(rpcerr=5)
/usr/lib/nfs/lockd[473]: [ID 491006 daemon.error] Cannot establish NLM
service over <file desc. 9, protocol udp> : I/O error. Exiting
svc.startd[10]: [ID 652011 daemon.warning]
svc:/network/nfs/nlockmgr:default: Method "/lib/svc/method/nlockmgr" failed
with exit status 1.
klmmod: [ID 814159 kern.notice] NOTICE: Failed to connect to local statd
(rpcerr=5)
/usr/lib/nfs/lockd[534]: [ID 491006 daemon.error] Cannot establish NLM
service over <file desc. 9, protocol udp> : I/O error. Exiting
svc.startd[10]: [ID 652011 daemon.warning]
svc:/network/nfs/nlockmgr:default: Method "/lib/svc/method/nlockmgr" failed
with exit status 1.
pseudo: [ID 129642 kern.info] pseudo-device: pool0
genunix: [ID 936769 kern.info] pool0 is /pseudo/pool at 0
klmmod: [ID 814159 kern.notice] NOTICE: Failed to connect to local statd
(rpcerr=5)
/usr/lib/nfs/lockd[537]: [ID 491006 daemon.error] Cannot establish NLM
service over <file desc. 9, protocol udp> : I/O error. Exiting
svc.startd[10]: [ID 652011 daemon.warning]
svc:/network/nfs/nlockmgr:default: Method "/lib/svc/method/nlockmgr" failed
with exit status 1.
svc.startd[10]: [ID 748625 daemon.error] network/nfs/nlockmgr:default
failed: transitioned to maintenance (see 'svcs -xv' for details)

Rebooting sometimes makes the nlockmgr problem go away, but sometimes it
does not. If there is anything I can do diagnostics-wise, let me know; I
have no clue where to start.
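For what it's worth, here is the sort of diagnostic sequence I plan to try
(a sketch only; these are the standard illumos service FMRIs and log paths,
and if I read the RPC error codes right, rpcerr=5 is RPC_TIMEDOUT, i.e.
lockd timing out while trying to reach statd):

```shell
# Show why nlockmgr is unhappy and what it depends on
svcs -xv svc:/network/nfs/nlockmgr:default
svcs -l svc:/network/nfs/status:default    # statd must be online first

# Check whether statd actually registered with rpcbind
# (program 100024 is "status"; if nothing shows, lockd has nothing to talk to)
rpcinfo -p localhost | grep -w status

# Read the method logs for the failing services
tail -20 /var/svc/log/network-nfs-nlockmgr:default.log
tail -20 /var/svc/log/network-nfs-status:default.log

# Restart in dependency order (statd before lockd)
svcadm disable nlockmgr nfs/status
svcadm enable nfs/status
svcadm enable nlockmgr
svcadm clear nlockmgr    # only needed if it is stuck in maintenance
```

If someone knows a better way to see why lockd's RPC to statd times out,
I'm all ears.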

2014-10-10 21:09 GMT+02:00 Schweiss, Chip <chip at innovates.com>:

> Apparently something common in my OmniOS setup is triggering this.  I
> have no idea what yet, and I'm still green at digging through this kind
> of issue.
>
> On one of my VMs for doing script development, I exported the data pool
> planning to test importing it with a different cache location, and the
> problem immediately happened.   Now I cannot get nlockmgr to start at all
> on this VM.   I tried disabling all nfs services and re-enabling.  Still
> failing with /usr/lib/nfs/lockd[862]: [ID 491006 daemon.error] Cannot
> establish NLM service over <file desc. 9, protocol udp> : I/O error. Exiting
>
> root at ZFSsendTest1:/root# svcs -a|grep nfs
> disabled       13:47:05 svc:/network/nfs/log:default
> disabled       13:47:11 svc:/network/nfs/rquota:default
> disabled       13:55:05 svc:/network/nfs/server:default
> disabled       13:55:32 svc:/network/nfs/nlockmgr:default
> disabled       13:55:32 svc:/network/nfs/mapid:default
> disabled       13:55:32 svc:/network/nfs/status:default
> disabled       13:55:32 svc:/network/nfs/client:default
> disabled       13:55:57 svc:/network/nfs/cbd:default
> root at ZFSsendTest1:/root# svcadm enable svc:/network/nfs/status:default
> svc:/network/nfs/cbd:default svc:/network/nfs/mapid:default
> svc:/network/nfs/server:default svc:/network/nfs/nlockmgr:default
> root at ZFSsendTest1:/root# svcs -a|grep nfs
> disabled       13:47:05 svc:/network/nfs/log:default
> disabled       13:47:11 svc:/network/nfs/rquota:default
> disabled       13:55:32 svc:/network/nfs/client:default
> online         13:56:56 svc:/network/nfs/status:default
> online         13:56:56 svc:/network/nfs/cbd:default
> online         13:56:56 svc:/network/nfs/mapid:default
> offline        13:56:56 svc:/network/nfs/server:default
> offline*       13:56:56 svc:/network/nfs/nlockmgr:default
> root at ZFSsendTest1:/root# svcs -a|grep nfs
> disabled       13:47:05 svc:/network/nfs/log:default
> disabled       13:47:11 svc:/network/nfs/rquota:default
> disabled       13:55:32 svc:/network/nfs/client:default
> online         13:56:56 svc:/network/nfs/status:default
> online         13:56:56 svc:/network/nfs/cbd:default
> online         13:56:56 svc:/network/nfs/mapid:default
> offline        13:56:56 svc:/network/nfs/server:default
> maintenance    13:58:11 svc:/network/nfs/nlockmgr:default
>
> This VM has never had RSF-1 on it, so that definitely isn't the trigger.
> This VM has never exhibited this problem before today.  It has been
> rebooted many times.
>
> I wonder if the problem is triggered by exporting a pool with NFS exports
> that have active client connections.   That is always the case on my
> production systems.   This VM has one NFS client that was connected when I
> exported the pool.
>
> Now nlockmgr dies and goes into maintenance mode regardless of whether I
> import the data pool.
>
> Any advice on where to dig for better diagnosis of this would be
> helpful.   If any developers would like to get access to this VM I'd be
> happy to arrange that too.
>
> -Chip
>
>
> On Fri, Oct 10, 2014 at 9:26 AM, Richard Elling <
> richard.elling at richardelling.com> wrote:
>
>>
>> On Oct 10, 2014, at 6:15 AM, "Schweiss, Chip" <chip at innovates.com> wrote:
>>
>>
>> On Thu, Oct 9, 2014 at 9:54 PM, Dan McDonald <danmcd at omniti.com> wrote:
>>
>>>
>>> On Oct 9, 2014, at 10:23 PM, Schweiss, Chip <chip at innovates.com> wrote:
>>>
>>> > Just tried my 2nd system.  On r151010, nlockmgr starts after clearing
>>> maintenance mode; on r151012 it will not start at all.  nfs/status was
>>> enabled and online.
>>> >
>>> > The commonality I see on the two systems I have tried is they are both
>>> part of an HA cluster.   So they don't import the pool at boot, but RSF-1
>>> imports it with cache mapped to a different location.
>>>
>>> That could be something HA Inc. needs to further test.  We don't
>>> directly support RSF-1, after all.
>>>
>>>
>> I think there really isn't anything different from an auto-imported pool.
>> I'm suspecting that using an alternate cache location may be triggering
>> something else to go wrong in nlockmgr.
>>
>>
>> no, these are totally separate subsystems. RSF-1 imports the pool. NFS
>> sharing is started by the zpool command, in userland, via sharing after the
>> dataset is mounted. You can do the same procedure manually... no magic
>> pixie dust needed.
>>
>>
>> Here's the command RSF-1 uses to import the pool:
>> zpool import -c /opt/HAC/RSF-1/etc/volume-cache/nrgpool.cache \
>>   -o cachefile=/opt/HAC/RSF-1/etc/volume-cache/nrgpool.cache-live \
>>   -o failmode=panic nrgpool
>>
>> After the pool import it puts the IP addresses back and is done.  That
>> happens in less than 1 second.
>>
>> In the meantime the NFS services auto-start and nlockmgr starts spinning.
>>
>>
>> Perhaps share doesn't properly start all of the services? Does it work ok
>> if you manually "svcadm enable" all of the NFS services?
>>
>>
>>   -- richard
>>
>>
>>
>>
>>> > nlockmgr is becoming a real show stopper.
>>>
>>> svcadm disable nlockmgr nfs/status
>>> svcadm enable nfs/status
>>> svcadm enable nlockmgr
>>>
>>> You may wish to discuss this on illumos as well; I'm not sure who else
>>> is seeing this besides me (once) and you (seemingly many times).
>>>
>>
>> I did that this time, no joy.   Today I'm working on a virtual setup with
>> HA to see if I can get this reproduced on r151012.
>>
>> I thought this nlockmgr problem was related to having lots of NFS exports
>> until I ran into it on my SSD pool.  That pool used to fail over in
>> about 3-5 seconds.  Now nlockmgr sits in a spinning state for a few
>> minutes and fails every time.  Clearing the maintenance state brings it
>> back nearly instantly.  That is on r151010; on r151012 it fails every
>> time.
>>
>> Hopefully I can reproduce and I'll start a new thread copying Illumos too.
>>
>> -Chip
>>
>>
>> _______________________________________________
>> OmniOS-discuss mailing list
>> OmniOS-discuss at lists.omniti.com
>> http://lists.omniti.com/mailman/listinfo/omnios-discuss
>>
>>
>

