[OmniOS-discuss] kernel panic

Kevin Swab Kevin.Swab at ColoState.EDU
Wed Apr 16 16:39:21 UTC 2014


Any thoughts on this one?  I can provide more info if that helps.
The system is all desktop-grade hardware: a Core i3-540 CPU and 8 GB
of non-ECC RAM.  The pool in question is a 3-disk raidz built on
Toshiba DT01ACA3 3 TB SATA drives attached to the motherboard SATA
ports.  The pool worked fine for about 12 months prior to the panic.
It originally had dedup enabled, but the stack trace from an isolated
panic about 2 months ago pointed at dedup problems, so I turned it
off.  (As I understand it, disabling dedup only affects new writes,
so blocks written while it was on are still referenced through the
DDT.)

In an attempt to rule out hardware problems, I've tried the following
(the commands are sketched after the list):

- Ran memtest86+ for about 30 hours, no errors found
- Ran SMART long tests on all three drives, no errors
- Read each drive end-to-end with 'dd' to /dev/null, no errors
  reported by dd or by the iostat error counters
- Put the drives in another machine with an LSI SAS controller, same
  result
- dd'ed the contents of the drives onto 3 borrowed SAS drives and
  attempted to import the pool from there, same result
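
For the record, the drive checks went roughly like this (device names
illustrative; smartctl comes from the smartmontools package, which
isn't in the base install):

# smartctl -t long /dev/rdsk/c2t3d0p0    (repeat for each drive)
# smartctl -a /dev/rdsk/c2t3d0p0         (check results when done)
# dd if=/dev/rdsk/c2t3d0p0 of=/dev/null bs=1024k
# iostat -en                             (check the error counters)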

I found this page with steps that solved a similar problem for someone else:

http://sigtar.com/2009/10/19/opensolaris-zfs-recovery-after-kernel-panic/
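
If I'm reading that page right, its recovery steps boil down to
temporarily relaxing ZFS's assertion handling via /etc/system before
the import.  These are one-shot recovery settings, to be removed once
the data is off the pool, and I'm assuming the tunables still behave
the same on current OmniOS:

# cat >> /etc/system <<'EOF'
* temporary ZFS recovery tunables - remove after recovery!
set zfs:zfs_recover=1
set aok=1
EOF
# reboot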

Importing the pool read-only as suggested there still results in a
kernel panic, and the 'zdb' command it mentions dumps core before
completing:

# zpool import
   pool: data1
     id: 17144127232233481271
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        data1       ONLINE
          raidz1-0  ONLINE
            c2t3d0  ONLINE
            c2t2d0  ONLINE
            c2t4d0  ONLINE
# zdb -e -bcsvL data1

Traversing all blocks to verify checksums ...

assertion failed for thread 0xfffffd7fff162a40, thread-id 1: c <
SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT, file
../../../uts/common/fs/zfs/zio.c, line 226
Abort (core dumped)
#
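
One thing I haven't fully explored: zdb takes an -A option to relax
its assertion handling (-A ignores ASSERTs, -AA enables panic
recovery, -AAA does both, if I'm reading the usage message right).
That might let the traversal get further, though I can't say whether
it gets past this particular assert:

# zdb -e -AAA -bcsvL data1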

# zpool import -F -f -o readonly=on -R /mnt data1
plankton console login:
panic[cpu1]/thread=ffffff000ef07c40: BAD TRAP: type=e (#pf Page fault)
rp=ffffff000ef07530 addr=278 occurred in module "unix" due to a NULL
pointer dereference

sched: #pf Page fault
Bad kernel fault at addr=0x278
pid=0, pc=0xfffffffffb85ed1b, sp=0xffffff000ef07628, eflags=0x10246
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe>
cr4: 26f8<vmxe,xmme,fxsr,pge,mce,pae,pse,de>
cr2: 278  cr3: bc00000  cr8: c

        rdi:              278 rsi:                4 rdx: ffffff000ef07c40
        rcx:                0  r8: ffffff02d9168840  r9:                2
        rax:                0 rbx:              278 rbp: ffffff000ef07680
        r10: fffffffffb8540bc r11: ffffff02d91b7000 r12:                0
        r13:                1 r14:                4 r15:                0
        fsb:                0 gsb: ffffff02cbb4dac0  ds:               4b
         es:               4b  fs:                0  gs:              1c3
        trp:                e err:                2 rip: fffffffffb85ed1b
         cs:               30 rfl:            10246 rsp: ffffff000ef07628
         ss:               38

ffffff000ef07410 unix:die+df ()
ffffff000ef07520 unix:trap+db3 ()
ffffff000ef07530 unix:cmntrap+e6 ()
ffffff000ef07680 unix:mutex_enter+b ()
ffffff000ef076a0 zfs:zio_buf_alloc+25 ()
ffffff000ef076e0 zfs:arc_get_data_buf+1d0 ()
ffffff000ef07730 zfs:arc_buf_alloc+b5 ()
ffffff000ef07820 zfs:arc_read+42b ()
ffffff000ef07880 zfs:traverse_prefetch_metadata+9d ()
ffffff000ef07970 zfs:traverse_visitbp+38b ()
ffffff000ef07a00 zfs:traverse_dnode+8b ()
ffffff000ef07af0 zfs:traverse_visitbp+5fd ()
ffffff000ef07b90 zfs:traverse_prefetch_thread+79 ()
ffffff000ef07c20 genunix:taskq_d_thread+b7 ()
ffffff000ef07c30 unix:thread_start+8 ()

syncing file systems... done
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
 0:44 100% done
100% done: 146470 pages dumped, dump succeeded
rebooting...

Most other 'zdb' commands I've tried also dump core.
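The possible exceptions are the low-level dumps that read the disks
directly rather than walking the pool (the slice name here assumes
the usual whole-disk EFI label):

# zdb -l /dev/rdsk/c2t3d0s0      (print the vdev labels)
# zdb -e -u data1                (print the active uberblock)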

I really want to recover the data on this pool if at all possible.  I
can provide crash dumps if needed.  Barring recovery, I'd at least
like to understand what went wrong so I can avoid a repeat in the
future.
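
For anyone who'd rather pull the panic details out of a dump
themselves, this is roughly the procedure (assuming the compressed
dump is /var/crash/unknown/vmdump.1):

# savecore -vf /var/crash/unknown/vmdump.1   (expands to unix.1/vmcore.1)
# mdb unix.1 vmcore.1
> ::panicinfo
> ::stack
> $q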

Please, can anyone help?
Thanks - Kevin


On 04/07/2014 09:32 PM, Kevin Swab wrote:
> I've got OmniOS 151008j running on a home file server, and the other day
> it went into a reboot loop, displaying a kernel panic on the console
> just after the kernel banner was printed.
> 
> The panic message on screen showed some zfs function calls, so
> following that lead, I booted off the install media, mounted my root
> pool and removed /etc/zfs/zpool.cache (sketched below).  The system
> was able to boot after that, but when I attempt to import the pool
> containing my data, it panics again.
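> 
> Roughly, the cache-file removal looked like this (mount point and BE
> name illustrative, run from the install-media shell):
> 
>   # zpool import -R /a rpool
>   # zfs mount rpool/ROOT/omnios      (BE name is a guess)
>   # rm /a/etc/zfs/zpool.cache
>   # reboot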
> 
> FMD shows that a reboot occurred after a kernel panic, and says more
> info is available from fmdump.  Here's the stack trace from 'fmdump':
> 
> # fmdump -Vp -u 38f6aa49-6c97-4675-b526-e455b1ae215b
> TIME                           UUID                                 SUNW-MSG-ID
> Apr 07 2014 21:03:45.097921000 38f6aa49-6c97-4675-b526-e455b1ae215b SUNOS-8000-KL
> 
>   TIME                 CLASS                                       ENA
>   Apr 07 21:03:45.0237 ireport.os.sunos.panic.dump_available       0x0000000000000000
>   Apr 07 21:03:03.8496 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000
> 
> nvlist version: 0
>         version = 0x0
>         class = list.suspect
>         uuid = 38f6aa49-6c97-4675-b526-e455b1ae215b
>         code = SUNOS-8000-KL
>         diag-time = 1396926225 62791
>         de = fmd:///module/software-diagnosis
>         fault-list-sz = 0x1
>         fault-list = (array of embedded nvlists)
>         (start fault-list[0])
>         nvlist version: 0
>                 version = 0x0
>                 class = defect.sunos.kernel.panic
>                 certainty = 0x64
>                 asru = sw:///:path=/var/crash/unknown/.38f6aa49-6c97-4675-b526-e455b1ae215b
>                 resource = sw:///:path=/var/crash/unknown/.38f6aa49-6c97-4675-b526-e455b1ae215b
>                 savecore-succcess = 1
>                 dump-dir = /var/crash/unknown
>                 dump-files = vmdump.1
>                 os-instance-uuid = 38f6aa49-6c97-4675-b526-e455b1ae215b
>                 panicstr = BAD TRAP: type=e (#pf Page fault)
> rp=ffffff000fadafc0 addr=2b8 occurred in module "unix" due to a NULL
> pointer dereference
>                 panicstack = unix:die+df () | unix:trap+db3 () |
> unix:cmntrap+e6 () | unix:mutex_enter+b () | zfs:zio_buf_alloc+25 () |
> zfs:arc_get_data_buf+2b8 () | zfs:arc_buf_alloc+b5 () | zfs:arc_read+42b
> () | zfs:dsl_scan_prefetch+a7 () | zfs:dsl_scan_recurse+16f () |
> zfs:dsl_scan_visitbp+eb () | zfs:dsl_scan_visitdnode+bd () |
> zfs:dsl_scan_recurse+439 () | zfs:dsl_scan_visitbp+eb () |
> zfs:dsl_scan_visit_rootbp+61 () | zfs:dsl_scan_visit+26b () |
> zfs:dsl_scan_sync+12f () | zfs:spa_sync+334 () | zfs:txg_sync_thread+227
> () | unix:thread_start+8 () |
>                 crashtime = 1396801998
>                 panic-time = Sun Apr  6 10:33:18 2014 MDT
>         (end fault-list[0])
> 
>         fault-status = 0x1
>         severity = Major
>         __ttl = 0x1
>         __tod = 0x53436711 0x5d627e8
> 
> 
> 
> I'd really like to recover the data on that pool if possible.  Any
> suggestions on what I can try next?
> 
> Thanks,
> Kevin
> 

-- 
-------------------------------------------------------------------
Kevin Swab                          UNIX Systems Administrator
ACNS                                Colorado State University
Phone: (970)491-6572                Email: Kevin.Swab at ColoState.EDU
GPG Fingerprint: 7026 3F66 A970 67BD 6F17  8EB8 8A7D 142F 2392 791C

