[OmniOS-discuss] A problem and puzzle with disappearing ZFS snapshots

Chris Siebenmann cks at cs.toronto.edu
Fri Jan 6 22:22:57 UTC 2017


 We have an automated system for making regular (roughly hourly)
snapshots of some especially important filesystems where we want
fast restores. This has been running smoothly for some time and
without problems. However, starting this week we have twice gone
to do from-snapshot restores on one of the filesystems involved
and discovered that almost all of the snapshots are mysteriously
missing.

 By 'missing' I mean that they aren't present in either
<fs>/.zfs/snapshot or in 'zfs list -r -t all <fs>', which
as far as I know means they don't exist at all.

 By 'mysteriously' I mean that not only did the snapshot-making
process not report any errors to us, but 'zpool history' reports
that the snapshot commands happened and there were no matching
snapshot deletions. In addition this has only been happening on
one of the filesystems that gets snapshots; all of the other ones
(which are all in the same pool) have everything present.

 'zpool history' recent output for the filesystem is:

	2017-01-06.06:10:01 zfs snapshot fs0-admin-02/h/105@Fri-06
	2017-01-06.07:10:01 zfs snapshot fs0-admin-02/h/105@Fri-07
	2017-01-06.08:10:01 zfs snapshot fs0-admin-02/h/105@Fri-08
	2017-01-06.09:10:01 zfs snapshot fs0-admin-02/h/105@Fri-09
	2017-01-06.10:10:01 zfs snapshot fs0-admin-02/h/105@Fri-10
	2017-01-06.11:10:01 zfs snapshot fs0-admin-02/h/105@Fri-11
	2017-01-06.12:10:01 zfs snapshot fs0-admin-02/h/105@Fri-12
	2017-01-06.13:10:01 zfs snapshot fs0-admin-02/h/105@Fri-13
	2017-01-06.14:10:01 zfs snapshot fs0-admin-02/h/105@Fri-14
	2017-01-06.15:10:01 zfs snapshot fs0-admin-02/h/105@Fri-15
	2017-01-06.16:10:01 zfs snapshot fs0-admin-02/h/105@Fri-16
	2017-01-06.16:45:55 zfs snapshot fs0-admin-02/h/105@Fri-16

 The actual snapshots present in the pool are:

	NAME                        USED  AVAIL  REFER  MOUNTPOINT
	fs0-admin-02/h/105@Fri-15   604M      -   343G  -
	fs0-admin-02/h/105@Fri-16  23.4M      -   343G  -

(The second @Fri-16 snapshot was made when we discovered that the first
one was missing.)
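
 For anyone who wants to run the same comparison on their own pool,
here is a minimal Python sketch of the cross-check described above:
it reads saved 'zpool history' text and saved 'zfs list -t snapshot'
text and reports snapshots that were logged as created, never logged
as destroyed, and yet are not listed. The function names are made up
for illustration, and the parsing assumes the simple one-snapshot-per-
command form shown in the output above; it is not a general parser
for every 'zfs snapshot' or 'zfs destroy' invocation.

```python
import re

# Assumed line shape: "<timestamp> zfs snapshot [flags] pool/fs@name".
# Commands with several snapshot arguments or option values are not handled.
SNAPSHOT_CMD = re.compile(r"^\S+\s+zfs snapshot\s+(?:-\S+\s+)*(\S+@\S+)\s*$")
DESTROY_CMD = re.compile(r"^\S+\s+zfs destroy\s+(?:-\S+\s+)*(\S+@\S+)\s*$")


def logged_snapshots(history_text):
    """Snapshots the history says were created and not later destroyed."""
    alive = []
    for line in history_text.splitlines():
        line = line.strip()
        created = SNAPSHOT_CMD.match(line)
        if created:
            # A repeated creation (like the second @Fri-16) is kept once.
            if created.group(1) not in alive:
                alive.append(created.group(1))
            continue
        destroyed = DESTROY_CMD.match(line)
        if destroyed and destroyed.group(1) in alive:
            alive.remove(destroyed.group(1))
    return alive


def listed_snapshots(list_text):
    """Snapshot names from the first column of 'zfs list' output."""
    names = set()
    for line in list_text.splitlines():
        fields = line.split()
        # The header row and plain filesystems have no '@' in the name.
        if fields and "@" in fields[0]:
            names.add(fields[0])
    return names


def missing_snapshots(history_text, list_text):
    """Logged-as-alive snapshots that the listing does not show."""
    present = listed_snapshots(list_text)
    return [name for name in logged_snapshots(history_text)
            if name not in present]
```

 Fed the history and listing shown above, missing_snapshots() would
return the nine names from @Fri-06 through @Fri-14.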

 As far as I can tell from 'zpool history', no errant broad 'zfs
destroy' operations have been done against the pool that might have
swept up these snapshots as a side effect. (I don't think that's
even possible, but ...)

(Also, because of how our automation for this operates, I'm
confident that none of the @Fri-NN snapshots existed before
they were nominally created. If they had appeared in eg 'zfs
list' output, the automation would have deleted them before
trying to recreate them.)

 The fileserver in question has not suffered a power failure or
crash since before this started happening.

 Does anyone have any idea what could be happening here? For example,
is there some way in which snapshots can be removed without that being
logged in 'zpool history'?

(I'm scrubbing the pool now, so far without errors.)

 Thanks in advance.

	- cks
PS: we're on OmniOS r151014, kernel rev omnios-f090f73. Yes, I know, it's
    old. We like stability whenever possible, and testing & mostly
    qualifying upgrades takes a lot of work. (We can't be sure an upgrade
    works until we're running it in production, either; we can't reproduce
    production loads and stresses in testing.)

