[OmniOS-discuss] ZFS data corruption

Stephan Budach stephan.budach at JVM.DE
Wed Aug 19 16:49:05 UTC 2015


Hi Joerg,

Am 19.08.15 um 14:59 schrieb Joerg Goltermann:
> Hi,
>
> The PSOD you got could be the cause of the problems on your Exchange database.
>
> Can you check the ESXi logs for the root cause of the PSOD?
>
> I never got a PSOD on such a "corruption". I still think this is
> a "cosmetic" bug, but this should be verified by one of the ZFS
> developers ...
>
>  - Joerg
>
> On 17.08.2015 15:48, wuffers wrote:
>>
>> On Mon, Aug 17, 2015 at 8:04 AM, Joerg Goltermann <jg at osn.de> wrote:
>>
>>     Hi,
>>
>>     we have the same problems. It first occurred about 6 months ago;
>>     I wrote several mails to the zfs list, but I was not able
>>     to solve the problem.
>>
>>     The last mail was http://permalink.gmane.org/gmane.os.illumos.zfs/4883
>>     I tried to debug the issue, but my zfs knowledge is not deep enough.
>>
>>     Hopefully we can solve this nasty thing now ....
>>
>>
>>     In my case I am quite sure this is not a real corruption; it's a retry
>>     with "strange" flags which caused my "errors". Maybe this IO is very
>>     slow, which can cause problems on the hosts, but I have never seen
>>     any real problems....
>>
>>
>> One of the VMs on that datastore was Exchange, and it definitely had
>> issues. I had to evacuate and move several mailboxes to another
>> database, and repair some of them (users were reporting strange issues
>> like not being able to move emails to existing folders).
>>
>> I don't think it's a coincidence that a VM that was on that block device
>> suddenly had weird issues (and the Exchange VM was consuming the largest
>> amount of space in that datastore).
>>
>>
>>     On 16.08.2015 19:11, Stephan Budach wrote:
>>
>>         So, did your first scrub reveal any error at all? Mine didn't,
>>         and I suspect that you issued a zpool clear prior to scrubbing,
>>         which made the errors go away on both of my two zpools…
>>
>>         I'd say that you had exactly the same error as me.
>>
>>
>> I am 100% certain I did not issue a zpool clear. I ran the scrub only
>> once (as it takes ~8 days for it to go through in my case).
>>
>>    pool: tank
>>   state: ONLINE
>>    scan: scrub repaired 0 in 184h28m with 0 errors on Wed Aug  5 06:38:32 2015
>>
>

I don't think that this is entirely true, though. Just today, I got 
another of these "bogus" ZFS errors. Just like the other two, this one 
was removed by a zpool clear/zpool scrub. However, this error occurred 
on the zvol which hosts one of the RAC CSS voting disks, so nothing was 
propagated up to my consumer DGs/VGs.
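
For the record, the sequence that makes such an error go away is roughly 
this (the pool name is just a placeholder; the affected zvol shows up 
under "errors:" in the status output before the clear):

    zpool status -v tank    # shows the "bogus" error against the zvol
    zpool clear tank        # reset the error counters
    zpool scrub tank        # scrub completes with 0 errors
    zpool status -v tank    # the error is gone afterwards

The fact that a full scrub finds nothing is what makes me doubt that the 
data on disk was ever really bad.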

The first two occurrences were noticed by ASM, but as I am running 
mirrored disk groups, this error didn't punch through to my consumers. 
I guess that, if I didn't have those mirrored DGs, the read error 
(since that is what had been reported on the RAC nodes) might very well 
have affected the VMs running off that NFS store.
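
(For completeness: the redundancy of the disk groups can be checked on 
the RAC nodes with asmcmd; it is the NORMAL redundancy that saved me here.)

    asmcmd lsdg    # the Type column shows EXTERN/NORMAL/HIGH redundancy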

Looking at what we've got here, I don't think that we're actually 
dealing with real disk errors, as those should have been reported as 
read errors, or maybe as checksum errors. This must be something else, 
as it only seems to affect zvols and iSCSI targets. Maybe I will create 
a LUN backed by a plain file and hook that up to the same RAC cluster; 
if I don't get any of these errors with that, it has to be something in 
the way the zvol is accessed. Maybe it's all COMSTAR's fault entirely…
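
In case anyone wants to try the same comparison, something along these 
lines should do on OmniOS (names and sizes are just placeholders, and I 
have not actually run this yet):

    # file-backed LU instead of a zvol
    mkfile 100g /tank/lun0.img
    stmfadm create-lu /tank/lun0.img
    stmfadm add-view <lu-guid>

    # the current zvol-backed LUs were created roughly like this
    zfs create -V 100g tank/lun0
    stmfadm create-lu /dev/zvol/rdsk/tank/lun0

If the file-backed LU never shows the error while the zvol-backed one 
does, that would point at the zvol/COMSTAR path rather than at the disks.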

Cheers,
Stephan

