[OmniOS-discuss] Clues for tracking down a drastic ZFS fs space difference?

Chris Siebenmann cks at cs.toronto.edu
Fri May 15 15:51:11 UTC 2015


Several weeks ago I reported:
>  We have a filesystem/dataset with no snapshots, no subordinate
> filesystems, nothing complicated (and no compression), that has a
> drastic difference in space used between what df/zfs list/etc report
> at the ZFS level and what du reports at the filesystem level. [...]

(At the time ZFS reported 70.5 GB used and du reported 17 GB.)
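One way to spot this condition is to compare the dataset's ZFS-level accounting against a file-level du. A minimal sketch of the comparison; the dataset name, mountpoint, and byte figures below are illustrative assumptions, not taken from our actual system:

```shell
#!/bin/sh
# Sketch: report the gap between what ZFS has charged to a dataset and
# what du can see at the file level.

# space_gap takes two byte counts and prints their difference.
space_gap() {
    echo $(( $1 - $2 ))
}

# In practice the inputs would come from commands like (illustrative):
#   zfs get -Hp -o value used tank/export/home   -> bytes ZFS has charged
#   du -sk /export/home                          -> kilobytes du can see
# On a dataset with no snapshots and no compression, a large positive gap
# suggests space pinned by open-but-deleted files.
space_gap 75700000000 18250000000    # roughly our 70.5 GB vs 17 GB case
```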

 With the assistance of George Wilson of Delphix, we've now identified
the cause: nlockmgr was apparently holding references to now-deleted
files inside the kernel, which prevented ZFS from reclaiming their
space. Because the references were held entirely in the kernel, they
weren't visible to tools like fuser. Restarting nlockmgr immediately
released them and dropped usage to what it should be.

 Delphix has fixed this in their version of the NLM code but has not
yet pushed the fix upstream. The commit's comment summarizes the
problem:

	A busy client will prevent the idle timeout from ever being
	reached but may have stale holds associated with it. If these
	stale holds are for vnodes which have been removed they will
	prevent the file system from being able to reclaim the file's
	space.

 George Wilson's initial reply to me on the illumos-zfs mailing list
is:
	http://permalink.gmane.org/gmane.os.illumos.zfs/4836

(His message includes a link to the Delphix commit.)

 Obviously this is only a concern for people providing NFS service from
OmniOS machines, but if that is your environment you may want to watch
for this issue and consider periodic precautionary nlockmgr restarts or
the like until the fix is pushed upstream and incorporated into an
OmniOS update.
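One way to implement such precautionary restarts is a root crontab entry that bounces the lock manager during a quiet period. A sketch, assuming the standard illumos service FMRI and that a weekly early-Sunday restart is acceptable in your environment:

```shell
# root crontab fragment (assumption: 05:00 Sunday is a quiet time here)
0 5 * * 0 /usr/sbin/svcadm restart svc:/network/nfs/nlockmgr:default
```

A lock manager restart drops and re-establishes client lock state, so it is worth confirming how your NFS clients behave across one before scheduling it unattended.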

	- cks

