<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 20.01.15 at 14:15, Stephan Budach
wrote:<br>
</div>
<blockquote cite="mid:54BE54D6.509@jvm.de" type="cite">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<font face="Helvetica, Arial, sans-serif">Hi guys,<br>
<br>
we just experienced a lock-up on one of our OmniOS r006 boxes;
we had to reset it to get it working again. This box
runs on a SuperMicro storage server, and our check_mk client
had been checking it with smartctl every 10 minutes.<br>
<br>
Looking through the logs, I found these messages being
repeatedly written to them…<br>
<br>
Dec 20 03:18:17 nfsvmpool01 scsi: [ID 107833 kern.warning]
WARNING: /scsi_vhci/disk@g5000cca22bc46337 (sd12):<br>
Dec 20 03:18:17 nfsvmpool01 Error for Command: &lt;undecoded
cmd 0x85&gt; Error Level: Recovered<br>
Dec 20 03:18:17 nfsvmpool01 scsi: [ID 107833 kern.notice]
Requested Block: 0 Error Block: 0<br>
Dec 20 03:18:17 nfsvmpool01 scsi: [ID 107833 kern.notice]
Vendor: ATA Serial Number:
PK1361<br>
Dec 20 03:18:17 nfsvmpool01 scsi: [ID 107833 kern.notice]
Sense Key: Soft_Error<br>
Dec 20 03:18:17 nfsvmpool01 scsi: [ID 107833 kern.notice]
ASC: 0x0 (&lt;vendor unique code 0x0&gt;), ASCQ: 0x1d, FRU: 0x0<br>
Dec 20 03:18:19 nfsvmpool01 scsi: [ID 107833 kern.warning]
WARNING: /scsi_vhci/disk@g5000cca22bc4e51d (sd11):<br>
Dec 20 03:18:19 nfsvmpool01 Error for Command: &lt;undecoded
cmd 0x85&gt; Error Level: Recovered<br>
Dec 20 03:18:19 nfsvmpool01 scsi: [ID 107833 kern.notice]
Requested Block: 0 Error Block: 0<br>
Dec 20 03:18:19 nfsvmpool01 scsi: [ID 107833 kern.notice]
Vendor: ATA Serial Number:
PK1361<br>
Dec 20 03:18:19 nfsvmpool01 scsi: [ID 107833 kern.notice]
Sense Key: Soft_Error<br>
Dec 20 03:18:19 nfsvmpool01 scsi: [ID 107833 kern.notice]
ASC: 0x0 (&lt;vendor unique code 0x0&gt;), ASCQ: 0x1d, FRU: 0x0<br>
Dec 20 03:18:21 nfsvmpool01 scsi: [ID 107833 kern.warning]
WARNING: /scsi_vhci/disk@g5000cca22bc512c5 (sd3):<br>
Dec 20 03:18:21 nfsvmpool01 Error for Command: &lt;undecoded
cmd 0x85&gt; Error Level: Recovered<br>
<br>
Could it be that the use of smartctl somehow caused that
lock-up?<br>
<br>
Thanks,<br>
budy</font></blockquote>
Seems that this was the real issue:<br>
<br>
=> this was smartctl: Jan 20 13:14:04 nfsvmpool01 scsi: [ID
107833 kern.notice] ASC: 0x3a (medium not present - tray
closed), ASCQ: 0x1, FRU: 0x0<br>
Jan 20 13:18:58 nfsvmpool01 scsi: [ID 107833 kern.warning] WARNING:
/pci@0,0/pci8086,3c08@3/pci1000,3020@0 (mpt_sas1):<br>
Jan 20 13:18:58 nfsvmpool01 MPT Firmware Fault, code: 2651<br>
Jan 20 13:19:00 nfsvmpool01 scsi: [ID 365881 kern.info]
/pci@0,0/pci8086,3c08@3/pci1000,3020@0 (mpt_sas1):<br>
Jan 20 13:19:00 nfsvmpool01 mpt1 Firmware version v15.0.0.0 (?)<br>
Jan 20 13:19:00 nfsvmpool01 scsi: [ID 365881 kern.info]
/pci@0,0/pci8086,3c08@3/pci1000,3020@0 (mpt_sas1):<br>
Jan 20 13:19:00 nfsvmpool01 mpt1: IOC Operational.<br>
=> System reset: Jan 20 13:30:45 nfsvmpool01 genunix: [ID 540533
kern.notice] ^MSunOS Release 5.11 Version omnios-b281e50 64-bit<br>
Jan 20 13:30:45 nfsvmpool01 genunix: [ID 877030 kern.notice]
Copyright (c) 1983, 2010, Oracle and/or its affiliates. All rights
reserved.<br>
<br>
I googled that fault a bit and came up with this entry
from the LSI SCS Engineering Release Notice:<br>
<br>
(SCGCQ00257616 - Port of SCGCQ00237417)<br>
HEADLINE: Controller may fault on bad response with incomplete write
data transfer<br>
<br>
DESC OF CHANGE: When completing a write IO with incomplete data
transfer with bad status, clean the IO from the transmit hardware to
prevent it from accessing an invalid memory address while attempting
to service the already-completed IO.<br>
<br>
TO REPRODUCE: Run heavy write IO against a very large topology of
SAS drives. Repeatedly cause multiple drives to send response frames
containing sense data for outstanding IOs before the initiator has
finished transferring the write data for the IOs<br>
<br>
ISSUE DESC: If a SAS drive sends a response frame with response or
sense data for a write command before the transfer length specified
in the last XferReady frame is satisfied, an 0xD04 or 0x2651 fault
may occur.<br>
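<br>
A minimal sketch of how one could scan syslog lines for such firmware
faults and flag the two codes named in the notice above. It assumes
that the driver prints the fault code in hex without a 0x prefix, as in
the "code: 2651" line from the log; the function and constant names are
made up for illustration.<br>
<pre>
```python
import re

# Fault codes named in the release notice quoted above (hex digits only).
# Assumption: the log prints the code in hex without a "0x" prefix.
KNOWN_FAULT_CODES = {"d04", "2651"}

# Matches lines such as:
#   Jan 20 13:18:58 nfsvmpool01 MPT Firmware Fault, code: 2651
FAULT_RE = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]{8})\s+(\S+)\s+"
    r"MPT Firmware Fault, code: ([0-9a-fA-F]+)"
)


def find_mpt_faults(lines):
    """Return (timestamp, host, code, known) tuples for MPT fault lines."""
    hits = []
    for line in lines:
        match = FAULT_RE.search(line.strip())
        if match:
            timestamp, host, code = match.groups()
            code = code.lower()
            hits.append((timestamp, host, code, code in KNOWN_FAULT_CODES))
    return hits


if __name__ == "__main__":
    sample = [
        "Jan 20 13:18:58 nfsvmpool01 MPT Firmware Fault, code: 2651",
        "Jan 20 13:19:00 nfsvmpool01 mpt1: IOC Operational.",
    ]
    for hit in find_mpt_faults(sample):
        print(hit)
```
</pre>
On a live box, the lines would come from /var/adm/messages instead of
the sample list.<br>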
<br>
The question is, why did the box lock up? It seems that only one of
the LSI HBAs was affected, and my zpools are spread entirely across
two HBAs, except for the log and cache devices:<br>
<tt><br>
</tt><tt>root@nfsvmpool01:/var/adm# zpool status sasTank</tt><tt><br>
</tt><tt> pool: sasTank</tt><tt><br>
</tt><tt> state: ONLINE</tt><tt><br>
</tt><tt> scan: scrub repaired 0 in 0h8m with 0 errors on Wed Dec
24 09:21:40 2014</tt><tt><br>
</tt><tt>config:</tt><tt><br>
</tt><tt><br>
</tt><tt> NAME STATE READ WRITE
CKSUM</tt><tt><br>
</tt><tt> sasTank ONLINE 0 0
0</tt><tt><br>
</tt><tt> mirror-0 ONLINE 0 0
0</tt><tt><br>
</tt><tt> c2t5000CCA04106EAA5d0 ONLINE 0 0
0</tt><tt><br>
</tt><tt> c5t5000CCA04106EE41d0 ONLINE 0 0
0</tt><tt><br>
</tt><tt> mirror-1 ONLINE 0 0
0</tt><tt><br>
</tt><tt> c3t5000CCA02A9BE9E1d0 ONLINE 0 0
0</tt><tt><br>
</tt><tt> c6t5000CCA02ADEE805d0 ONLINE 0 0
0</tt><tt><br>
</tt><tt> mirror-2 ONLINE 0 0
0</tt><tt><br>
</tt><tt> c4t5000CCA04106EF21d0 ONLINE 0 0
0</tt><tt><br>
</tt><tt> c7t5000CCA04106C1F5d0 ONLINE 0 0
0</tt><tt><br>
</tt><tt> logs</tt><tt><br>
</tt><tt> c1t5001517803D653E2d0p1 ONLINE 0 0
0</tt><tt><br>
</tt><tt> c1t5001517803D83760d0p1 ONLINE 0 0
0</tt><tt><br>
</tt><tt> cache</tt><tt><br>
</tt><tt> c1t50015179596C5A85d0 ONLINE 0 0
0</tt><tt><br>
</tt><tt><br>
</tt><tt>errors: No known data errors</tt><tt><br>
</tt><tt><br>
</tt><tt>root@nfsvmpool01:/var/adm# zpool status sataTank</tt><tt><br>
</tt><tt> pool: sataTank</tt><tt><br>
</tt><tt> state: ONLINE</tt><tt><br>
</tt><tt> scan: scrub repaired 0 in 10h39m with 0 errors on Wed Dec
24 20:22:27 2014</tt><tt><br>
</tt><tt>config:</tt><tt><br>
</tt><tt><br>
</tt><tt> NAME STATE READ WRITE
CKSUM</tt><tt><br>
</tt><tt> sataTank ONLINE 0 0
0</tt><tt><br>
</tt><tt> mirror-0 ONLINE 0 0
0</tt><tt><br>
</tt><tt> c1t5000CCA22BC4E51Dd0 ONLINE 0 0
0</tt><tt><br>
</tt><tt> c1t5000CCA22BC512C5d0 ONLINE 0 0
0</tt><tt><br>
</tt><tt> mirror-1 ONLINE 0 0
0</tt><tt><br>
</tt><tt> c1t5000CCA22BC51BADd0 ONLINE 0 0
0</tt><tt><br>
</tt><tt> c1t5000CCA22BC46337d0 ONLINE 0 0
0</tt><tt><br>
</tt><tt> mirror-2 ONLINE 0 0
0</tt><tt><br>
</tt><tt> c1t5000CCA22BC51BB9d0 ONLINE 0 0
0</tt><tt><br>
</tt><tt> c1t5000CCA23DED646Fd0 ONLINE 0 0
0</tt><tt><br>
</tt><tt> logs</tt><tt><br>
</tt><tt> c1t5001517803D653E2d0p2 ONLINE 0 0
0</tt><tt><br>
</tt><tt> c1t5001517803D83760d0p2 ONLINE 0 0
0</tt><tt><br>
</tt><tt> cache</tt><tt><br>
</tt><tt> c1t5001517803D00E64d0 ONLINE 0 0
0</tt><tt><br>
</tt><tt><br>
</tt><tt>errors: No known data errors</tt><br>
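<br>
To see at a glance how the devices in such output are spread, one could
group them by the cN prefix of the device name. This is only a sketch:
the function name is made up, and the controller number in a device name
need not correspond to one physical HBA (for example under scsi_vhci
multipathing), so the result is a starting point, not proof.<br>
<pre>
```python
import re
from collections import defaultdict

# Device names such as c2t5000CCA04106EAA5d0 or c1t5001517803D653E2d0p1
DEVICE_RE = re.compile(r"^(c\d+)t[0-9A-Fa-f]+d\d+(?:[ps]\d+)?$")


def devices_by_controller(status_text):
    """Group device names from 'zpool status' text by their cN prefix."""
    groups = defaultdict(list)
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        match = DEVICE_RE.match(fields[0])
        if match:
            groups[match.group(1)].append(fields[0])
    return dict(groups)


if __name__ == "__main__":
    sample = """
      mirror-0 ONLINE 0 0 0
        c2t5000CCA04106EAA5d0 ONLINE 0 0 0
        c5t5000CCA04106EE41d0 ONLINE 0 0 0
      logs
        c1t5001517803D653E2d0p1 ONLINE 0 0 0
    """
    for controller, devices in sorted(devices_by_controller(sample).items()):
        print(controller, devices)
```
</pre>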
<br>
Cheers,<br>
budy<br>
<br>
<br>
</body>
</html>