[OmniOS-discuss] OmniOS random freezes

Niels Goossens ngoossens at gmail.com
Mon Jan 6 22:50:18 UTC 2014


All,

Short introduction: I'm a non-professional user running a home lab
(actually in a Sun 900 rack, but that's a different story). One of my
machines is a Supermicro X9SCL-F motherboard with a Xeon 1230v2 running the
latest OmniOS release, r151008 - 6de5e81.

I chose OmniOS because it lets one machine act as FC target and KVM host
simultaneously. In my home lab, where the electricity bill is always an
issue, this is a good thing.

I'm running Cloudstack in my home lab, divided over three machines: two
Supermicro X9SCL-F boards with Xeon 1230v2 act as compute hosts (running a
basic Ubuntu 13.10 with KVM, because Cloudstack needs libvirt for
controlling the hosts), and one machine acts as Cloudstack controller. The
controller is the OmniOS machine, and it runs a few VMs (mysql, bind,
Cloudstack-management, alfresco, plex).

The machine itself has 32 GB of RAM and 5x 1 TB drives on the motherboard
SATA controller (in raidz1). The boot drive is a dedicated 320 GB SATA drive
on the same onboard SATA controller. Two PCIe cards are installed: one
QLogic 2460 FC card and one IBM-branded Intel PRO/1000 PT dual port.

My OmniOS host experiences random freezes. These appear out of nowhere. I
will list the steps I've taken to isolate the issue so far.

1. Is the system under load when the freeze occurs?
A. No. I've let the machine run idle the past few days, and it will freeze
anyway.

2. Is there anything in /var/adm/messages?
A. Not anymore, but there used to be. I've seen the following:

Jan  4 20:40:16 controller mac: [ID 469746 kern.info] NOTICE: e1000g2 registered
Jan  4 20:40:19 controller mac: [ID 435574 kern.info] NOTICE: e1000g2 link up, 1000 Mbps, full duplex
Jan  4 20:49:30 controller mac: [ID 486395 kern.info] NOTICE: e1000g2 link down
Jan  4 20:49:30 controller in.routed[1382]: [ID 238047 daemon.warning] interface e1000g2 to 10.10.3.8 turned off
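
To see how long the link stays up before it drops, a small script along
these lines could be run over /var/adm/messages. This is only a sketch: it
assumes Python is available, the function names are made up for
illustration, and the year has to be passed in because syslog timestamps do
not carry one (a log that spans New Year is not handled).

```python
import re
from datetime import datetime

# Matches the mac "link up" / "link down" notices shown above.
LINK_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+mac:.*NOTICE:\s+(\S+)\s+link\s+(up|down)"
)


def link_events(lines, year=2014):
    """Return (timestamp, interface, state) for every link up/down notice.

    The year is an assumption supplied by the caller; syslog omits it.
    """
    events = []
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            stamp = datetime.strptime(
                f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S"
            )
            events.append((stamp, match.group(2), match.group(3)))
    return events


def up_durations(events):
    """Seconds each interface stayed up between 'link up' and the next 'link down'."""
    last_up = {}
    durations = []
    for stamp, nic, state in events:
        if state == "up":
            last_up[nic] = stamp
        elif nic in last_up:
            durations.append((nic, (stamp - last_up.pop(nic)).total_seconds()))
    return durations


sample = [
    "Jan  4 20:40:16 controller mac: [ID 469746 kern.info] NOTICE: e1000g2 registered",
    "Jan  4 20:40:19 controller mac: [ID 435574 kern.info] NOTICE: e1000g2 link up, 1000 Mbps, full duplex",
    "Jan  4 20:49:30 controller mac: [ID 486395 kern.info] NOTICE: e1000g2 link down",
]

print(up_durations(link_events(sample)))  # [('e1000g2', 551.0)]
```

For the entries above, the link was up for 551 seconds (a little over nine
minutes) before it went down.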

I've disabled e1000g2 (simply "ifconfig e1000g2 down") and the warnings
have disappeared. This leads me to suspect the Intel NIC, though.

I've also seen lots of these entries in /var/adm/messages:

Jan  5 11:08:21 controller ahci: [ID 117845 kern.warning] WARNING: satapkt
0xffffff0935ec3900: cmd_reg = 0xb0 features_reg = 0x0 sec_count_msb = 0x0
lba_low_msb = 0x4f lba_mid_msb = 0x4f lba_high_msb = 0x0 sec_count_lsb =
0x0 lba_low_lsb = 0x0 lba_mid_lsb = 0x4f lba_high_lsb = 0xc2 device_reg =
0x0 addr_type = 0x4 cmd_flags = 0x12
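
For what it's worth, the register values in that warning can be decoded. In
the ATA command set, opcode 0xb0 is SMART, and SMART commands carry the
signature 0x4f / 0xc2 in the LBA mid and LBA high registers, which matches
the lsb values above. Below is a minimal Python sketch of that check. The
helper names are hypothetical, it only looks at the fields printed in the
warning, and it says nothing about why the command was logged.

```python
import re

# ATA opcode for SMART and the signature SMART commands place in LBA mid/high.
ATA_CMD_SMART = 0xB0
SMART_LBA_MID = 0x4F
SMART_LBA_HIGH = 0xC2

# Picks up every "name = 0x.." pair; the satapkt address has no "=" and is skipped.
FIELD_RE = re.compile(r"(\w+)\s*=\s*(0x[0-9a-fA-F]+)")


def parse_satapkt(message):
    """Return {register_name: int} for each register printed in the warning."""
    return {name: int(value, 16) for name, value in FIELD_RE.findall(message)}


def is_smart_command(regs):
    """True if the command register and LBA signature identify a SMART command."""
    return (
        regs.get("cmd_reg") == ATA_CMD_SMART
        and regs.get("lba_mid_lsb") == SMART_LBA_MID
        and regs.get("lba_high_lsb") == SMART_LBA_HIGH
    )


warning = (
    "satapkt 0xffffff0935ec3900: cmd_reg = 0xb0 features_reg = 0x0 "
    "sec_count_msb = 0x0 lba_low_msb = 0x4f lba_mid_msb = 0x4f "
    "lba_high_msb = 0x0 sec_count_lsb = 0x0 lba_low_lsb = 0x0 "
    "lba_mid_lsb = 0x4f lba_high_lsb = 0xc2 device_reg = 0x0 "
    "addr_type = 0x4 cmd_flags = 0x12"
)

regs = parse_satapkt(warning)
print(is_smart_command(regs))  # True
```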

There was an additional PCIe SATA card in this machine. I have since
removed it, and the warnings went away.

3. Is the pool healthy?
A. The drives are consumer-grade SATA drives and about 3 years old. They
have not been used heavily: they used to be in my OpenSolaris-based NAS
before I upgraded that to something bigger. smartctl reports the SMART
status of all drives as OK. There are no other log entries that lead me to
believe a drive is bad. zpool status is OK.

4. Is there anything in Supermicro IPMI?
A. The following, which has occurred only twice now:

2013/11/23 19:53:28 Correctable Memory ECC @ DIMM2A(CPU1) - Asserted
2013/11/23 19:53:29 Uncorrectable Memory ECC @ DIMM2A(CPU1) - Asserted
2013/11/23 22:28:56 Correctable Memory ECC @ DIMM2A(CPU1) - Asserted
2013/11/23 22:28:57 Uncorrectable Memory ECC @ DIMM2A(CPU1) - Asserted

Even though I'd rather not see this error, I'm not alarmed considering it
has not occurred since.

5. Are core dump or crash files available?
A. I set up dumpadm and coreadm only today. There are no core files in /
and no crash files in /var/crash. There are no log entries in /var/log, and
nothing in /var/adm/messages either: the last entry there is from hours
before the machine freezes.

Now, I am not a full-blown sysadmin, but I do have some experience with
Solaris, so I know my way around a little bit. I'm starting to feel stuck,
however, and I need help in further isolating this issue. Any help is
seriously appreciated!

Thanks for any advice and kind regards,

Niels Goossens