[OmniOS-discuss] kernel panic

Dan McDonald danmcd at omniti.com
Wed Apr 16 17:44:46 UTC 2014


On Apr 16, 2014, at 12:39 PM, Kevin Swab <Kevin.Swab at colostate.edu> wrote:
> <SNIP!>


> Traversing all blocks to verify checksums ...
> 
> assertion failed for thread 0xfffffd7fff162a40, thread-id 1: c <
> SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT, file
> ../../../uts/common/fs/zfs/zio.c, line 226
> Abort (core dumped)
> #
> 
> # zpool import -F -f -o readonly=on -R /mnt data1
> plankton console login:
> panic[cpu1]/thread=ffffff000ef07c40: BAD TRAP: type=e (#pf Page fault)
> rp=ffffff000ef07530 addr=278 occurred in module "unix" due to a NULL
> pointer dereference

Interesting.

We've seen one (just one) panic just like this in-house.  In our case, some very strange corruption was written to disk, and ZFS couldn't cope.  I have a request out to the ZFS community to improve the coping mechanisms.  :)

I've some dumb questions:

	1.) Earlier in the thread, you mention these are SATA drives.  When the panic occurred, were they attached via AHCI?  Or to a controller of some sort?  You mention you tried attaching these disks to an mpt_sas controller to try to recover them.  Our machine was using plain SATA drives attached via AHCI.

	2.) Is the kernel coredump available?  If this is what we were seeing, I'd VERY much like to see what your corruption actually looks like. Knowing might help us root-cause the corruption in the first place.

The corruption is of the blkptr_t, in particular its size, which ZFS now assumes is sane.  zdb indicates this via an assertion failure; a non-debug kernel will just panic when it goes dereferencing a pointer in hyperspace.  The coping mechanism involved would throw an IO error if an insane size is read off disk.
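
A minimal sketch of that kind of check, written as standalone C rather than as the actual illumos change.  The constants are assumptions mirroring the 2014-era limits (512-byte minimum shift, 128 KB maximum block), and checked_buf_index() is a hypothetical helper name, not a real ZFS function.  The idea is to validate the size before computing the index that the assertion in zio.c guards, and to return EIO instead of tripping the assertion or indexing out of range.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, mirroring the 2014-era illumos limits. */
#define SPA_MINBLOCKSHIFT	9	/* 512-byte units */
#define SPA_MAXBLOCKSHIFT	17	/* 128 KB maximum block */
#define SPA_MAXBLOCKSIZE	(1ULL << SPA_MAXBLOCKSHIFT)

/*
 * Hypothetical helper: turn a block size read off disk into a buffer
 * cache index.  Returns 0 and stores the index on success, or EIO if
 * the size is outside the sane range.
 */
static int
checked_buf_index(uint64_t size, size_t *idx)
{
	size_t c;

	/* Reject sizes that a healthy blkptr_t could never describe. */
	if (size == 0 || size > SPA_MAXBLOCKSIZE)
		return (EIO);

	c = (size_t)((size - 1) >> SPA_MINBLOCKSHIFT);

	/* The condition from the failed assertion; it now always holds. */
	assert(c < (SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT));

	*idx = c;
	return (0);
}
```

A caller would propagate the EIO up as a read failure for that block, which is why other code paths that also trust the size would need the same treatment.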

The biggest question, of course, is how the corruption was introduced.  THAT's why I want to see your coredump.  If your corruption is close to ours (ours has a disk name, of all things, scribbled there), we likely share a common source of corruption.

> I really want to recover the data on this pool if at all possible.  I
> can provide crash dumps if needed.  Barring recovery, I would at least
> like to understand what went wrong so I can avoid doing it again in the
> future.

If we can get a version of ZFS that can cope with corrupted blkptrs, that may help in recovery.

I know *IN THIS PARTICULAR CODEPATH* how to cope, but I'm concerned it would expose other errors, and even with a read-only import, I don't want to perform such experiments on a customer's data.  :)

It does seem, however, that our box is in the same state, so I will try it there.  If I have success, I can share the modified "zfs" module.

Dan


