[OmniOS-discuss] ixgbe: breaking aggr on 10GbE X540-T2

Stephan Budach stephan.budach at jvm.de
Wed Jan 18 16:16:25 UTC 2017


On 18.01.17 at 09:01, Dale Ghent wrote:
>> On Jan 18, 2017, at 2:38 AM, Stephan Budach <stephan.budach at jvm.de> wrote:
>>
>> On 17.01.17 at 23:09, Dale Ghent wrote:
>>>> On Jan 17, 2017, at 2:39 PM, Stephan Budach <stephan.budach at JVM.DE> wrote:
>>>>
>>>> On 17.01.17 at 17:37, Dale Ghent wrote:
>>>>
>>>>>> On Jan 17, 2017, at 11:31 AM, Stephan Budach <stephan.budach at JVM.DE> wrote:
>>>>>>
>>>>>> Hi Dale,
>>>>>>
>>>>>> On 17.01.17 at 17:22, Dale Ghent wrote:
>>>>>>
>>>>>>>> On Jan 17, 2017, at 11:12 AM, Stephan Budach <stephan.budach at JVM.DE> wrote:
>>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> I am sorry, but I have to dig up this old topic, since I now have three hosts running OmniOS 018/020 which show these pesky issues with flapping ixgbeN links on my Nexus FEXes…
>>>>>>>>
>>>>>>>> Does anyone know if any changes have been made to the ixgbe driver since 06/2016?
>>>>>>>>
>>>>>>> Since June 2016? Yes! A large update to the ixgbe driver happened in August. This added X550 support and also brought the Intel Shared Code it uses from its 2012 vintage up to date. The updated driver is available in 014 and later.
>>>>>>>
>>>>>>> /dale
>>>>>>>
>>>>>> do you know of any way to find out why three of my boxes are flapping their 10GbE ports? It actually happens not only in aggr mode, but with single links as well. Last week it presumably caused one of my RSF-1 nodes to panic, since it couldn't reach its iSCSI LUNs anymore. The thing is that somewhere down the line, the ixgbe driver seems happy to configure one port at 1GbE instead of 10GbE, which stops the flapping, but which nevertheless breaks the VPC on my Nexus.
>>>>>>
>>>>>> In syslog, this looks like this:
>>>>>>
>>>>>> ...
>>>>>> Jan 17 14:46:07 zfsha02gh79 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 1000 Mbps, full duplex
>>>>>> Jan 17 14:46:21 zfsha02gh79 mac: [ID 486395 kern.info] NOTICE: ixgbe3 link down
>>>>>> Jan 17 14:46:22 zfsha02gh79 mac: [ID 435574 kern.info] NOTICE: ixgbe3 link up, 10000 Mbps, full duplex
>>>>>> Jan 17 14:46:22 zfsha02gh79 mac: [ID 486395 kern.info] NOTICE: ixgbe3 link down
>>>>>> Jan 17 14:46:26 zfsha02gh79 mac: [ID 435574 kern.info] NOTICE: ixgbe3 link up, 10000 Mbps, full duplex
>>>>>> Jan 17 14:52:22 zfsha02gh79 mac: [ID 486395 kern.info] NOTICE: ixgbe3 link down
>>>>>> Jan 17 14:52:22 zfsha02gh79 mac: [ID 435574 kern.info] NOTICE: ixgbe3 link up, 10000 Mbps, full duplex
>>>>>> Jan 17 14:52:22 zfsha02gh79 mac: [ID 486395 kern.info] NOTICE: ixgbe3 link down
>>>>>> Jan 17 14:52:32 zfsha02gh79 mac: [ID 435574 kern.info] NOTICE: ixgbe3 link up, 10000 Mbps, full duplex
>>>>>> Jan 17 14:54:50 zfsha02gh79 mac: [ID 486395 kern.info] NOTICE: ixgbe3 link down
>>>>>> Jan 17 14:54:55 zfsha02gh79 mac: [ID 435574 kern.info] NOTICE: ixgbe3 link up, 10000 Mbps, full duplex
>>>>>> Jan 17 14:58:12 zfsha02gh79 mac: [ID 486395 kern.info] NOTICE: ixgbe3 link down
>>>>>> Jan 17 14:58:16 zfsha02gh79 mac: [ID 435574 kern.info] NOTICE: ixgbe3 link up, 10000 Mbps, full duplex
>>>>>> Jan 17 14:59:46 zfsha02gh79 mac: [ID 486395 kern.info] NOTICE: ixgbe3 link down
>>>>>>
>>>>>> Note the entry at 14:46:07, where the system settles on a 1GbE connection…
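>>>>>>
>>>>>> To quantify the flapping, a crude but sufficient check is to count the down events in syslog, for example (ixgbe3 being just one of the affected ports):
>>>>>>
>>>>>> grep -c 'ixgbe3 link down' /var/adm/messages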
>>>>>>
>>>>>>
>>>>> Sounds like a cabling issue? Are the runs too long, or are you not using CAT6a? Flapping at 10Gb and then settling at 1Gb would indicate a cabling issue to me. The driver will always try to link at the fastest speed that the local controller and the remote peer can negotiate... it will not proactively downgrade the link speed. If the speed drops, it is because that is all the controller managed to negotiate with the remote peer.
>>>>>
>>>>> Are you using jumbo frames or anything outside of a normal 1500 MTU link?
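>>>>>
>>>>> You can check which speeds the port currently enables and advertises, plus its MTU, with dladm; for example, with ixgbe3 standing in for the affected port:
>>>>>
>>>>> dladm show-linkprop -p en_10gfdx_cap,adv_10gfdx_cap,en_1000fdx_cap,adv_1000fdx_cap,mtu ixgbe3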
>>>>>
>>>>> /dale
>>>>>
>>>>>
>>>> The cables were actually purchased specifically as Cat6 cables. They run about 2m, not more. It could be the cables, but I am running a couple of those and afaik I only get these issues on these three nodes. I can try some other cables, but I had hoped to be able to get some kind of debug messages from the driver.
>>>>
>>> The chip provides no reason for a LoS or a downgrade of the link. For configuration issues it interrupts only on a few things. "LSC" (Link Status Change) interrupts are one of those things, and they are what tell the driver to interrogate the chip for its current speed; beyond that, the hardware provides no further details. Any details regarding why the PHY had to re-train the link are completely hidden from the driver.
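>>>
>>> If you want to watch those transitions outside of syslog, the mac layer should expose the link state as a kstat; assuming the usual "link" kstat naming, something like this would sample it once per second:
>>>
>>> kstat -p link:0:ixgbe3:link_state 1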
>>>
>>> Are these X540 interfaces actually built into the motherboard, or are they separate PCIe cards? Also, CAT6 alone might not be enough, and the magnetics on the older X540 might not be able to eke out a 10Gb connection, even at 2m. I would remove all doubt of cabling being an issue by replacing them with CAT6a. Beware of cable vendors who sell CAT6 cables as "CAT6a". It could also be an issue with the modular jacks on the ends.
>>>
>>> Since you mentioned "after 6/2016" for the ixgbe driver, have you tried the newer one yet? Large portions of it were re-written and re-factored, and many bugs were fixed, including in portions that touch the X540, since the new X550 is also copper and the two models share some logic related to that.
>>>
>>> /dale
>>>
>> Thanks for clarifying that. I just checked the cables: they are classified as Cat6a and come from a respectable German vendor. Not that this is any guarantee, but at least they're not bulk ware from China. ;)
>>
>> Some of the X540s are onboard on Supermicro X10 boards, others are on a genuine Intel PCIe adapter. I will try some other cables; maybe the ones I got were somewhat faulty. However, this leaves the user only a few options for finding out what is actually wrong with the connection, doesn't it?
>>
>> Regarding the OmniOS release, I will update my RSF-1 node to the latest r18; the other two new nodes are actually on r20 and thus should already have the new driver installed.
> If the ixgbe package installed on your systems has a timestamp after July 19 2016, it will have the updated code.
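>
> Assuming the stock package name, you can check that with something like:
>
> pkg info driver/network/ixgbe
>
> and look at the packaging date in the output.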
>
> Regarding the X540s which are integrated on some of your SMCI X10 boards, does a 10Gb link remain stable after you issue the following two commands in the shown order:
>
> dladm set-linkprop -p en_10gfdx_cap=0 ixgbeN
> dladm set-linkprop -p en_10gfdx_cap=1 ixgbeN
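>
> Afterwards, confirm what the port actually re-negotiated, for instance with:
>
> dladm show-phys ixgbeN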
>
> Also, check flowctrl:
>
> dladm show-linkprop -p flowctrl
>
> For your ixgbe devices, this should be the default of "no".
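>
> If it is not, you can set it back per interface, e.g.:
>
> dladm set-linkprop -p flowctrl=no ixgbeN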
>
> /dale
I tried all of that on each of the four interfaces, but it doesn't seem
to help. I will get some new cables - Dominik suggested a brand that I
also know - and I will have my current cables tested against the
Cat6a spec.

If that should also lead nowhere, can anyone suggest a working 10GbE
dual-port copper adapter for OmniOS, e.g. the QLE3442-RJ-CK or some
Broadcom card?

Thanks,
Stephan
