[OmniOS-discuss] iscsi timeouts

Tue Jan 21 22:18:51 UTC 2014

On 1/21/14, 10:16 PM, Saso Kiselkov wrote:
> On 1/21/14, 10:09 PM, Saso Kiselkov wrote:
>> On 1/21/14, 10:01 PM, Tobias Oetiker wrote:
>>> Hi Nld,
>>>
>>> Today Narayan Desai wrote:
>>>
>>>> Sorry, I should have given the requisite "yes, I know that this is a recipe
>>>> for sadness, for I too have experienced said sadness".
>>>>
>>>> That said, we've seen this kind of problem when there was a device in a
>>>> vdev that was dying a slow death. There wouldn't necessarily be any sign,
>>>> aside from insanely high service times on an individual device in the pool.
>>>> From this, I assume that ZFS is still sensitive to variation in underlying
>>>> drive performance.
>>>>
>>>> Tobi, what do your drive service times look like?
>>>>  -nld
>>>
>>> the drives seem fine, SMART is not reporting anything out of the
>>> ordinary and also iostat -En shows 0 on all counts
>>>
>>> I don't think it is a disk issue, but rather something connected
>>> with the network ...
>>>
>>> At times the machine becomes unreachable for some time, and then it
>>> is possible to log in via console and all seems well internally.
>>> setting the network interface offline and then online again using
>>> the dladm tool brings the connectivity back immediately. waiting
>>> helps as well ... since the problem sorts itself out after a few
>>> seconds to minutes ...
>>>
>>> we just had another 'off the net' period for 30 minutes
>>>
>>> unfortunately omnios itself does not seem to realize that something
>>> is off, at least dmesg does not show any kernel messages about this
>>> problem ...
>>>
>>> we have several systems running on the S2600CP MB ... this is the
>>> only one showing problems ...
>>>
>>> the next thing I intend to do is to upgrade the MB firmware since I
>>> found that this box has an older version than the other ones ...
>>>
>>> System Configuration: Intel Corporation S2600CP
>>> BIOS Configuration: Intel Corp. SE5C600.86B.01.06.0002.110120121539 11/01/2012
>>>
>>> other ideas are most welcome!
>>
>> You mentioned a couple of e-mails back that you're using Intel I350s.
>> Can you verify that your kernel has:
>>
>> commit 43ae55058ad99c869a9ae39d039490e8a3680520
>> Author: Dan McDonald <danmcd at nexenta.com>
>> Date:   Thu Feb 7 19:27:18 2013 -0500
>>
>>     3534 Disable EEE support in igb for I350
>>     Reviewed by: Robert Mustacchi <rm at joyent.com>
>>     Reviewed by: Jason King <jason.brian.king at gmail.com>
>>     Reviewed by: Marcel Telka <marcel at telka.sk>
>>     Reviewed by: Sebastien Roy <sebastien.roy at delphix.com>
>>     Approved by: Richard Lowe <richlowe at richlowe.net>
>>
>> I guess you can check for this string at runtime:
>> $ strings /kernel/drv/amd64/igb | grep _eee_support
>>
>> If it is missing, then it could be the buggy EEE support that's throwing
>> your link out of whack here.
> 
> Never mind, I missed your description of the KVM guests being reachable
> while only the host goes offline... Did snoop show anything arriving at
> the host while it is offline?

However, on second thought, you did mention that you're running
crossover between two hosts, which would match the description of the
EEE issue:

https://illumos.org/issues/3534
"The energy efficient Ethernet (EEE) support in Intel's I350 GigE NIC
drops link on directly-attached link cases."

Anyhow, make sure you're running the EEE fix.
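
A rough sketch of that check as a small shell function, in case it is
handy to run on several boxes. The function name check_eee_fix is made
up for illustration, and it greps the driver binary directly rather
than going through strings(1); it assumes, as I guessed earlier, that
the _eee_support string only appears in drivers built with the fix.

```shell
#!/bin/sh
# Hypothetical helper: report whether an igb driver binary contains the
# _eee_support string. The default path is the one from the strings(1)
# command quoted earlier in this thread.
check_eee_fix() {
    drv="${1:-/kernel/drv/amd64/igb}"

    # Bail out early if the driver file cannot be read at all.
    if [ ! -r "$drv" ]; then
        echo "cannot read $drv"
        return 2
    fi

    # grep -q only sets the exit status, so it is safe on a binary file.
    if grep -q '_eee_support' "$drv"; then
        echo "_eee_support found in $drv"
        return 0
    fi

    echo "_eee_support missing from $drv"
    return 1
}
```

Calling check_eee_fix with no argument checks the default path; pass a
different path to check another copy of the driver. Exit status 0 means
the string was found, 1 means it was not, 2 means the file was
unreadable.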

-- 
Saso