[OmniOS-discuss] System hangs every few days

Fri Mar 7 20:28:11 UTC 2014

I was at about six months of uptime when I added some new SSDs for cache
on the motherboard SATA ports. They weren't recognized via hot-plug, so I
rebooted over the weekend. I added the cache devices, and all seemed good.

Five days later, the system was locked. No kernel panic, just a frozen
console and no network access. Not ping-able.

Looking through the logs, I saw mostly just the typical (and benign?)
netatalk messages:

------

mDNSResponder: [ID 702911 daemon.error] ERROR: getOptRdata - unknown opt 4

mDNSResponder: [ID 702911 daemon.error] Correcting TTL from 4500 to 3600 for  312 nexus

-------

Etc.

But also, something new right before the crash:

------

Mar  7 11:18:28 colossus mac: [ID 486395 kern.info] NOTICE: igb3 link down

Mar  7 11:18:28 colossus mac: [ID 486395 kern.info] NOTICE: igb4 link down

Mar  7 11:18:28 colossus mac: [ID 486395 kern.info] NOTICE: igb2 link down

Mar  7 11:18:28 colossus mac: [ID 486395 kern.info] NOTICE: igb5 link down

Mar  7 11:18:28 colossus mac: [ID 486395 kern.info] NOTICE: aggr1000 link down

Mar  7 11:18:30 colossus mac: [ID 435574 kern.info] NOTICE: igb3 link up, 1000 Mbps, full duplex

Mar  7 11:18:30 colossus mac: [ID 435574 kern.info] NOTICE: aggr1000 link up, 1000 Mbps, full duplex

Mar  7 11:18:30 colossus mac: [ID 435574 kern.info] NOTICE: igb2 link up, 1000 Mbps, full duplex

Mar  7 11:18:30 colossus mac: [ID 435574 kern.info] NOTICE: igb4 link up, 1000 Mbps, full duplex

Mar  7 11:18:30 colossus mac: [ID 435574 kern.info] NOTICE: igb5 link up, 1000 Mbps, full duplex

Mar  7 11:18:35 colossus mac: [ID 486395 kern.info] NOTICE: igb3 link down

Mar  7 11:18:35 colossus mac: [ID 486395 kern.info] NOTICE: igb4 link down

Mar  7 11:18:35 colossus mac: [ID 486395 kern.info] NOTICE: igb2 link down

Mar  7 11:18:36 colossus mac: [ID 486395 kern.info] NOTICE: igb5 link down

Mar  7 11:18:36 colossus mac: [ID 486395 kern.info] NOTICE: aggr1000 link down

------

This goes on indefinitely: interfaces going down and coming up, over and
over. All of the igb interfaces listed here are part of an aggregate group
(although it's actually called aggr1, not aggr1000?). The other interfaces
(2 additional igb and 4 ixgbe ports) did not log error messages, but by
this point the server is unresponsive both via ssh over the network and at
the console. Interestingly, however, established file-sharing connections
over the unaffected interfaces continue to function for quite a while after
the lockup, all night in one case. This includes AFP, SMB, and iSCSI
(giving me enough time to shut down my virtual machines and log off some
key clients). In other words, the zpools are functional, and so are enough
services to keep existing connections of that type alive. Establishing new
connections over those protocols after the incident doesn't appear to be
possible. A hard reboot is necessary to regain access to the console and
permit new connections.
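
For anyone wanting to quantify the flapping rather than eyeball it, here is
a minimal Python sketch that tallies link up/down notices per interface. It
is only an illustration: the function name count_link_events and the SAMPLE
data are hypothetical, and it assumes each log entry sits on a single line
in the same format as the excerpt above (for example, as read from
/var/adm/messages).

```python
import re
from collections import Counter

# Matches mac link notices such as:
#   Mar  7 11:18:28 colossus mac: [ID 486395 kern.info] NOTICE: igb3 link down
LINK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+mac:.*NOTICE:\s+"
    r"(?P<link>\S+)\s+link\s+(?P<state>up|down)\b"
)


def count_link_events(lines):
    """Return {link_name: Counter({'up': n, 'down': m})} for mac link notices."""
    events = {}
    for line in lines:
        m = LINK_RE.match(line)
        if m:
            events.setdefault(m.group("link"), Counter())[m.group("state")] += 1
    return events


# Hypothetical sample, copied from a few of the entries quoted above.
SAMPLE = """\
Mar  7 11:18:28 colossus mac: [ID 486395 kern.info] NOTICE: igb3 link down
Mar  7 11:18:28 colossus mac: [ID 486395 kern.info] NOTICE: aggr1000 link down
Mar  7 11:18:30 colossus mac: [ID 435574 kern.info] NOTICE: igb3 link up, 1000 Mbps, full duplex
Mar  7 11:18:35 colossus mac: [ID 486395 kern.info] NOTICE: igb3 link down
""".splitlines()

if __name__ == "__main__":
    for link, counts in sorted(count_link_events(SAMPLE).items()):
        print(f"{link}: down={counts['down']} up={counts['up']}")
```

To run it against a real log instead of SAMPLE, pass an open file object
(iterating a file yields lines) to count_link_events. Interfaces that never
appear in the result, such as the ixgbe ports here, logged no link changes.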

My initial thought was that it could be an issue with the switch, but that
seems unlikely because I have other LACP groups that are unaffected. I also
doubt it's a coincidence that this only started happening right after that
initial reboot.

Since that reboot, this crash has happened three times. The first, as I
noted, was five days after the reconfiguration, but now they seem to be
happening slightly more frequently, although they're always several days
apart.

I'm considering reverting to a base install and rebuilding the system
config this weekend, as it's very basic, but I'm still curious whether
anyone has seen this type of behavior before.

Regards,

Chris