<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 22.08.15 at 19:02, Doug Hughes wrote:<br>
</div>
<blockquote cite="mid:55D8AB14.3010705@will.to" type="cite">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
I've been experiencing spontaneous checksum failures/corruption on
read at the zvol level recently on a box running r12 as well. None
of the disks show any errors; all of the errors show up at the
zvol level until all the disks in the vol get marked as degraded,
and then a reboot clears it up. Repeated scrubs find files to
delete, but after additional heavy read I/O activity, more
checksum-on-read errors occur and more files need to be removed.
So far on r14 I haven't seen this, but I'm keeping an eye on it.<br>
<br>
The write activity on this server is very low. I'm currently
trying to evacuate it with zfs send | mbuffer to another host over
10G, so the read activity is very high and sustained over a long
period of time, since I have to move about 10TB.<br>
<br>
</blockquote>
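For reference, the send/receive pipeline quoted above can be sketched roughly as follows. The pool and snapshot names (tank/vol01@evac), the target host (nfsvmpool08) and the port are hypothetical stand-ins, and mbuffer's -s/-m values are just plausible choices to smooth the bursty zvol reads over the 10G link:

```shell
# Sketch of a zfs send | mbuffer evacuation (hypothetical names throughout).
SNAP="tank/vol01@evac"
TARGET="nfsvmpool08"
# On the receiver, something like:
#   mbuffer -s 128k -m 1G -I 9090 | zfs receive -uvF tank/vol01
# On the sender:
cmd="zfs send -R $SNAP | mbuffer -s 128k -m 1G -O $TARGET:9090"
echo "$cmd"
```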
This morning, I received another of these zvol errors, which was
also reported up to my RAC cluster. I haven't fully checked that
yet, but I think ASM/ADVM simply issued a re-read and was happy
with the result. Otherwise ASM would have issued a read against the
mirror side and probably taken the "faulty" failure group
offline, which it didn't.<br>
<br>
However, I was wondering how to get some more information from the
STMF framework and found a post describing how to read from the
STMF trace buffer…<br>
<br>
<tt>root@nfsvmpool07:/root# echo '*stmf_trace_buf/s' | mdb -k |
more</tt><tt><br>
</tt><tt>0xffffff090f828000: :0002579: Imported the LU 600144f090860e6b0000550c3a290001</tt><tt><br>
</tt><tt>:0002580: Imported the LU 600144f090860e6b0000550c3e240002</tt><tt><br>
</tt><tt>:0002581: Imported the LU 600144f090860e6b0000550c3e270003</tt><tt><br>
</tt><tt>:0002603: Imported the LU 600144f090860e6b000055925a120001</tt><tt><br>
</tt><tt>:0002604: Imported the LU 600144f090860e6b000055a50ebf0002</tt><tt><br>
</tt><tt>:0002604: Imported the LU 600144f090860e6b000055a8f7d70003</tt><tt><br>
</tt><tt>:0002605: Imported the LU 600144f090860e6b000055a8f7e30004</tt><tt><br>
</tt><tt>:150815416: UIO_READ failed, ret = 5, resid = 131072</tt><tt><br>
</tt><tt>:224314824: UIO_READ failed, ret = 5, resid = 131072</tt><tt><br>
</tt><br>
So, this basically shows two read errors, which is consistent with
the incidents I had on this system. Unfortunately, this doesn't buy
me much more, since I don't know how to track this down further, but
it seems that COMSTAR had issues reading from the zvol (ret = 5 is
errno EIO).<br>
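Those trace lines can be decoded mechanically; here is a small awk sketch (the sample lines are pasted from the output above, and the errno-to-name mapping only covers the EIO case seen here):

```shell
# Decode the UIO_READ failures from the STMF trace buffer output above.
# ret = 5 is errno EIO, i.e. the sbd backing-store read from the zvol
# failed with an I/O error; resid is the number of bytes left unread.
result=$(awk -F'ret = |, resid = ' '/UIO_READ failed/ {
    err = ($2 == 5) ? "EIO" : ("errno " $2)
    printf "read error: %s, %s bytes unread\n", err, $3
}' <<'EOF'
:150815416: UIO_READ failed, ret = 5, resid = 131072
:224314824: UIO_READ failed, ret = 5, resid = 131072
EOF
)
echo "$result"
```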
<br>
Is it possible to debug this further?<br>
<br>
<blockquote cite="mid:55D8AB14.3010705@will.to" type="cite"> <br>
<div class="moz-cite-prefix">On 8/21/2015 2:06 AM, wuffers wrote:<br>
</div>
<blockquote
cite="mid:CA+tR_Kyic7J3HL00SV3e_NvbC3viMr8soWQ5QTrth9JJeGiA-g@mail.gmail.com"
type="cite">
<div dir="ltr">Oh, the PSOD is not caused by the corruption in
ZFS - I suspect it was the other way around (VMware host PSOD
-> ZFS corruption). I've experienced the PSOD before; it
may be related to I/O issues which I outlined in another post
here:
<div><a moz-do-not-send="true"
href="http://lists.omniti.com/pipermail/omnios-discuss/2015-June/005222.html">http://lists.omniti.com/pipermail/omnios-discuss/2015-June/005222.html</a></div>
<div><br>
</div>
<div>Nobody chimed in, but it's an ongoing issue. I need to
dedicate more time to troubleshooting, but other projects are
taking my attention right now (coupled with a personal house
move, time is at a premium!).<br>
<div><br>
</div>
<div>Also, I've had many improper shutdowns of the hosts and
VMs, and this was the first time I've seen a ZFS
corruption. </div>
<div><br>
</div>
<div>I know I'm repeating myself, but my question is still:</div>
<div>- Can I safely use this block device again now that it
reports no errors? Again, I've moved all data off of it,
and there are no other signs of hardware issues. Recreate
it? <br>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Wed, Aug 19, 2015 at 12:49
PM, Stephan Budach <span dir="ltr"><<a
moz-do-not-send="true"
class="moz-txt-link-abbreviated"
href="mailto:stephan.budach@jvm.de">stephan.budach@jvm.de</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Hi
Joerg,<br>
<br>
On 19.08.15 at 14:59, Joerg Goltermann wrote:
<div>
<div><br>
<blockquote class="gmail_quote" style="margin:0
0 0 .8ex;border-left:1px #ccc
solid;padding-left:1ex"> Hi,<br>
<br>
The PSOD you got can cause the problems on
your Exchange database.<br>
<br>
Can you check the ESXi logs for the root cause
of the PSOD?<br>
<br>
I never got a PSOD on such a "corruption". I
still think this is<br>
a "cosmetic" bug, but this should be verified
by one of the ZFS<br>
developers ...<br>
<br>
- Joerg</blockquote>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</blockquote>
<br>
<br>
</body>
</html>