[OmniOS-discuss] How do non-rpool ZFS filesystems get mounted?

Chris Siebenmann cks at cs.toronto.edu
Wed Mar 5 21:23:56 UTC 2014


 With the aid of DTrace (and the Illumos source) I have tracked down what is
going on and where the race is. The short version is that the 'zfs mount
-a' in /lib/svc/method/fs-local is racing with syseventd's ZFS module.
I have a DTrace capture (well, several of them) that shows this clearly:

	http://www.cs.toronto.edu/~cks/t/fs-local-mounttrace.txt

(produced by http://www.cs.toronto.edu/~cks/t/mounttrace.d which I
started at the top of /lib/svc/method/fs-local.)
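
 For the curious, the core of such a trace is small. Here is a
minimal sketch (not the actual mounttrace.d above; the mount(2)
argument order is my assumption) that reports every mount(2) call
along with who made it, which is enough to tell 'zfs mount -a' and
syseventd apart:

	#!/usr/sbin/dtrace -s
	#pragma D option quiet

	/* report every mount(2) call with the caller's identity;
	   on illumos arg0 is the special/dataset and arg1 the
	   mountpoint */
	syscall::mount:entry
	{
		printf("%Y %s[%d] mount %s on %s\n", walltimestamp,
		    execname, pid, copyinstr(arg0), copyinstr(arg1));
	}

	/* note mount(2) calls that fail, eg after losing the race */
	syscall::mount:return
	/errno != 0/
	{
		printf("%Y %s[%d] mount failed, errno %d\n",
		    walltimestamp, execname, pid, errno);
	}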

 Looking at various things suggests that this may be happening partly
because these additional pools are on iSCSI disks and the iSCSI disks
seem to be taking a bit of time to show up (I've never fully understood
how iSCSI disks are probed by Illumos). This may make it spiritually
related to the bug that Bryan Horstmann-Allen mentioned in that both
result in delayed device appearances.
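
 (As an aside, if you want to see when the iSCSI LUNs actually become
visible to the system, you can ask SMF and the initiator directly with
something like the following; I haven't checked how early in boot
either gives useful answers:

	svcs network/iscsi/initiator
	iscsiadm list target -S

The second command lists the discovered targets along with their LUNs
and OS device names, at least on the versions I'm familiar with.)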

 The following is a longer explanation of the race and assumes you
have some familiarity with Illumos ZFS kernel internals.

- pools present in /etc/zfs/zpool.cache are loaded into the kernel
  very early in boot, but they are not initialized or activated.
  This is done in spa_config_load(), which calls spa_add(); that
  leaves each pool with spa->spa_state = POOL_STATE_UNINITIALIZED.

- inactive pools are activated through spa_activate(), which is
  called (among other times) whenever you open a pool. By a chain
  of calls this happens any time you make a ZFS IOCTL that involves
  a pool name (see the DTrace sketch after this list):
	zfsdev_ioctl() -> pool_status_check() -> spa_open() -> etc.

- 'zfs mount -a' of course does ZFS IOCTLs that involve pools
  because it wants to get pool configurations to find out what
  datasets it might have to mount. As such, it activates all
  additional pools present in zpool.cache when it runs (assuming
  that their vdev configuration is good, of course).

- when a pool is activated this way in our environment, events of
  some sort are delivered to syseventd. I don't know enough about
  syseventd to say exactly what sort of events they are; they may
  well be iSCSI disk 'device appeared' messages. I have a very
  verbose syseventd debugging dump but I don't know enough to see
  anything useful in it.

- when syseventd gets these events, its ZFS module decides that it
  too should mount (aka 'activate') all datasets for the newly-active
  pools.
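
 To watch the activation half of this directly, a sketch along the
following lines can show which process drives spa_activate() for each
pool and through what kernel call chain (this uses fbt, so the probe
is tied to kernel implementation details; the argument layout here is
my assumption, not something from the capture above):

	#!/usr/sbin/dtrace -s
	#pragma D option quiet

	/* report who activates which pool, plus the kernel stack,
	   which should show the zfsdev_ioctl() -> ... -> spa_open()
	   chain described above */
	fbt::spa_activate:entry
	{
		printf("%Y %s[%d] activating pool %s\n", walltimestamp,
		    execname, pid, stringof(args[0]->spa_name));
		stack();
	}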

 At this point a multithreaded syseventd and 'zfs mount -a' are
racing to see who can mount all of the pool datasets, creating two
failure modes for 'zfs mount -a'. The first failure mode is simply
that syseventd wins the race and fully mounts a filesystem before 'zfs
mount -a' looks at it, triggering the 'directory is not empty' safety
check. The second failure mode is that syseventd and 'zfs mount -a'
both call mount() on the same filesystem at the same time and syseventd
is the one that succeeds. In this case mount() itself will return an
error and 'zfs mount -a' will report:

	cannot mount 'fs3-test-02': mountpoint or dataset is busy
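
 (The first failure mode shows up with the other usual libzfs
complaint, something like the following, with a made-up mountpoint
here:

	cannot mount '/fs3/test-02': directory is not empty

so the two cases are easy to tell apart.)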

	- cks

