[OmniOS-discuss] kernel panic

Wed Apr 16 19:32:23 UTC 2014

Hello Dan - Thanks for your help, I really appreciate it!  Answers to
your questions are inline below....

On 04/16/2014 11:44 AM, Dan McDonald wrote:
> 
> On Apr 16, 2014, at 12:39 PM, Kevin Swab <Kevin.Swab at colostate.edu> wrote:
>> <SNIP!>
> 
> 
>> Traversing all blocks to verify checksums ...
>>
>> assertion failed for thread 0xfffffd7fff162a40, thread-id 1: c <
>> SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT, file
>> ../../../uts/common/fs/zfs/zio.c, line 226
>> Abort (core dumped)
>> #
>>
>> # zpool import -F -f -o readonly=on -R /mnt data1
>> plankton console login:
>> panic[cpu1]/thread=ffffff000ef07c40: BAD TRAP: type=e (#pf Page fault)
>> rp=ffffff000ef07530 addr=278 occurred in module "unix" due to a NULL
>> pointer dereference
> 
> Interesting.
> 
> We've seen one (just one) panic just like this in-house.  In our case, some very strange corruption was written to disk, and ZFS couldn't cope.  I have a request out to the ZFS community to improve the coping mechanisms.  :)
> 
> I've some dumb questions:
> 
> 	1.) Earlier in the thread, you mention these are SATA drives.  When the panic occurred, were they attached via AHCI?  Or to a controller of some sort?  You mention you tried attaching these disks to an mpt_sas controller to try and recover them.  Our machine was using plain SATA drives attached via AHCI.

Yes, at the time of the initial panic, the drives were attached to
motherboard SATA ports that are configured to run in AHCI mode.  At the
current time, they are in a test machine at work attached via mpt_sas.

> 
> 	2.) Is the kernel coredump available?  If this is what we were seeing, I'd VERY much like to see what your corruption actually looks like. Knowing might help us root-cause the corruption in the first place.

I believe the original crash dump files are available on my home
fileserver; I'll check tonight.  I can reproduce the crash at will on my
test system at work and have those crash dump files available now.
Which would you like to see?

> The corruption is of the blkptr_t, in particular its size, which ZFS now assumes is sane.  zdb indicates this via an assertion failure; a non-debug kernel will just panic when it goes dereferencing a pointer in hyperspace.  The coping mechanism involved would throw an IO error if an insane size is read off disk.
> 
> The biggest question, of course, is how the corruption was introduced.  THAT's why I want to see your coredump.  If your corruption is close to ours - ours has a disk name of all things scribbled there - we share a common source of corruption.
> 
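For readers following along, the coping mechanism described above (return an IO error instead of asserting or panicking when a block size read off disk is insane) could be sketched roughly as follows.  This is only an illustration, not the actual ZFS change: the function name and the SKETCH_* constants are hypothetical, and the shift values (9 and 17, i.e. 512-byte minimum and 128 KiB maximum blocks) are assumptions standing in for SPA_MINBLOCKSHIFT and SPA_MAXBLOCKSIZE from the assertion message.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed stand-ins for the SPA_* limits named in the zdb assertion. */
#define SKETCH_MINBLOCKSHIFT 9   /* assumption: 512-byte minimum block */
#define SKETCH_MAXBLOCKSHIFT 17  /* assumption: 128 KiB maximum block */
#define SKETCH_MAXBLOCKSIZE  ((uint64_t)1 << SKETCH_MAXBLOCKSHIFT)

/*
 * Hypothetical helper: validate a block size taken from an on-disk
 * block pointer.  Returns 0 if the size is within bounds, or EIO if it
 * is not, so the caller can fail the read instead of indexing a buffer
 * cache with a garbage value.
 */
static int
sketch_check_block_size(uint64_t size)
{
	uint64_t c;

	/* A zero size would underflow the (size - 1) computation below. */
	if (size == 0)
		return (EIO);

	/* Same index computation that the assertion message refers to. */
	c = (size - 1) >> SKETCH_MINBLOCKSHIFT;
	if (c >= (SKETCH_MAXBLOCKSIZE >> SKETCH_MINBLOCKSHIFT))
		return (EIO);

	return (0);
}
```

With these assumed limits, a size of 512 or 131072 bytes passes, while 0 or anything above 131072 bytes is rejected with EIO.
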
>> I really want to recover the data on this pool if at all possible.  I
>> can provide crash dumps if needed.  Barring recovery, I would at least
>> like to understand what went wrong so I can avoid doing it again in the
>> future.
> 
> If we can get a version of ZFS that can cope with corrupted blkptrs, that may help in recovery.
> 
> I know *IN THIS PARTICULAR CODEPATH* how to cope, but I'm concerned it would expose other errors, and even read-only, I don't want to perform such experiments on a customer's data.  :)
> 

I appreciate your caution, but without a fix of some kind, my data's
gone anyway so I'm willing to experiment...

> It does seem, however, that our box is in the same state, so I will try it there.  If I have success, I can share the modified "zfs" module.
> 
> Dan
> 

Thanks!  That would be great.  Let me know what I can do to help...

-- 
-------------------------------------------------------------------
Kevin Swab                          UNIX Systems Administrator
ACNS                                Colorado State University
Phone: (970)491-6572                Email: Kevin.Swab at ColoState.EDU
GPG Fingerprint: 7026 3F66 A970 67BD 6F17  8EB8 8A7D 142F 2392 791C