<div dir="ltr">We were primarily using the machines for serving iscsi to VMs, and we'd see bad cascading failures (iscsi lun timeouts would cause the watchdog to kick in on the linux hosts, resetting the initiator; meanwhile the VM would decide that its virtio devices were dead, requiring a client reboot). In some cases the problems would happen across all luns, in others it would be just particular luns. I assume this followed the severity of the situation with the failing drive (or the number of failing drives, before we got aggressive about replacement). Similarly, we'd see a range of behaviors with local pool commands, ranging from everything looking alright to zpool commands hanging or running *extremely* slowly.<div><br></div><div>I'd hacked up some quick scripts to correlate info from the different sources. They are here:</div><div><a href="https://github.com/narayandesai/diy-lsi">https://github.com/narayandesai/diy-lsi</a></div><div>They may or may not be portable, but they demonstrate all of the info gathering methods we found useful. Another thing that was useful was maintaining a pool inventory (stored somewhere else) with device addresses, serial numbers, and jbod bay mappings. Having to map that out when things are falling apart is seriously sad times.</div><div><br></div><div>fwiw, you might still be ok with seagate drives; we were only using the self-check predictive failure flag, as opposed to anything more complicated.</div><div>good luck</div><div> -nld</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Mar 31, 2015 at 5:08 AM, Matej Zerovnik <span dir="ltr">&lt;<a href="mailto:matej@zunaj.si" target="_blank">matej@zunaj.si</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br> On 27. 03. 2015 16:13, Narayan Desai wrote:<br> <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Having been on the receiving end of similar advice, it is a frustrating situation to be in, since you have (and will likely continue to have) the hardware in production, without much option for replacement.<br> <br> When we had systems like this, we had a lot of success being aggressive in swapping out disks that were showing signs of going bad, even before critical failures occurred. We also looked at SMART statistics, and aggressively replaced the drives those flagged as well. This made the situation manageable. Basically, having sata drives in sas expanders means the system is brittle, and you should treat it as such. Look for:<br> - errors in iostat -En<br> - high service times in iostat -xnz<br> - smartctl (this causes harmless sense messages when devices are probed, but it is easy enough to ignore these)<br> - any errors reported out of lsiutil, showing either problems with cabling/enclosures, or devices<br> - decode any sense errors reported by the lsi driver<br> <br> Aggressively replace devices implicated by these, and hope for the best. The best may or may not be what you're hoping for, but may be livable; it was for us.<br> <br> </blockquote></span> When errors happened to you, were you able to use the pool itself, with only the iscsi target frozen, or did you have trouble with the pool itself as well?<br> <br> Because on our end, when the iscsi target freezes, the zpool is perfectly ok. We can access it and use it locally, but the iscsi target is frozen and can't be restarted.<br> <br> I will check my system with iostat and smartctl, but we are using seagate drives, so some of the smartctl stats are useless at first sight :)<span class="HOEnZb"><font color="#888888"><br> <br> Matej<br> </font></span></blockquote></div><br></div>
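<div>For the pool inventory mentioned in the reply above, a minimal sketch of what such a lookup could look like, assuming the inventory is kept as a simple list of records off the storage host. The <code>Disk</code> fields, the <code>find_disk</code> and <code>describe</code> helpers, and all device addresses, serial numbers, and enclosure names are made up for illustration; they are not taken from the diy-lsi scripts.</div>

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Disk:
    """One inventory record; every field name here is a hypothetical choice."""
    pool: str
    device: str      # OS device address as reported by iostat/zpool
    serial: str      # drive serial number
    enclosure: str   # jbod identifier
    bay: int         # physical bay within the jbod


# Illustrative data only; these are not real devices.
INVENTORY = [
    Disk("tank", "c5t5000C50000000001d0", "Z1F00001", "jbod-a", 3),
    Disk("tank", "c5t5000C50000000002d0", "Z1F00002", "jbod-a", 4),
    Disk("tank", "c5t5000C50000000003d0", "Z1F00003", "jbod-b", 0),
]


def find_disk(inventory: Iterable[Disk], key: str) -> Optional[Disk]:
    """Return the record whose device address or serial matches key.

    Matching is case-insensitive; returns None when nothing matches.
    """
    needle = key.strip().lower()
    for disk in inventory:
        if needle in (disk.device.lower(), disk.serial.lower()):
            return disk
    return None


def describe(disk: Disk) -> str:
    """Format a record as a one-line pointer to the physical bay."""
    return (f"{disk.serial} ({disk.device}) is in {disk.enclosure} "
            f"bay {disk.bay}, pool {disk.pool}")
```

<div>Given a device name from an error report, <code>describe(find_disk(INVENTORY, name))</code> would then point at the physical bay to pull, without having to probe the failing system.</div>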