<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">Hello Josten,<div class=""><br class=""></div><div class=""><br class=""><div><blockquote type="cite" class=""><div class="">On 26 May 2015, at 22:18, Anon <<a href="mailto:anon@omniti.com" class="">anon@omniti.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class="">Hi Matej,<br class=""><br class="">Do you have sar running on your system? I'd recommend running it at a short interval so that you can get historical disk statistics. You can use this info to rule out whether or not it's the disks. You can also use iotop -P to get a real-time view of %IO to see if it's the disks, or zpool iostat -v 1.<br class=""></div></div></blockquote><div><br class=""></div><div>I didn’t have sar or iotop running, but I did have 'iostat -xn' and 'zpool iostat -v 1' running when things stopped working, and there is nothing unusual in their output. Write ops suddenly fall to 0 and that’s it. Reads are still happening, and judging by network traffic there is outgoing traffic even while I’m unable to write to the ZFS filesystem (even locally on the server). I created a simple text file, so the next time the system hangs I will be able to check whether the filesystem is still readable (currently I only have iSCSI volumes, so I’m unable to check that locally on the server).</div><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><br class="">Also, do you have a baseline performance benchmark, and do you know whether you're meeting/exceeding it? The baseline should cover both random and sequential IO; you can use bonnie++ to get this information.<br class=""></div></div></blockquote><div><br class=""></div><div>I can say with 99.99% certainty that I’m exceeding the performance of the pool itself. It’s a single raidz2 vdev with 50 hard drives and 70 connected clients.
Some are idling, but 10-20 clients are pushing data to the server. I know the zpool configuration is very bad, but that’s a legacy I can’t change easily. I’m already syncing data to another server with 7 vdevs, but since this server is so busy, the transfers are happening VERY slowly (read: the zfs sync is doing 10 MB/s).</div><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><br class="">Are you able to share your ZFS configuration and iSCSI configuration?<br class=""></div></div></blockquote><div><br class=""></div>Sure! Here are the ZFS settings:</div><div><br class=""></div><div>zfs get all data:</div><div><div>NAME PROPERTY VALUE SOURCE</div><div>data type filesystem -</div><div>data creation Fri Oct 25 20:26 2013 -</div><div>data used 104T -</div><div>data available 61.6T -</div><div>data referenced 1.09M -</div><div>data compressratio 1.08x -</div><div>data mounted yes -</div><div>data quota none default</div><div>data reservation none default</div><div>data recordsize 128K default</div><div>data mountpoint /volumes/data received</div><div>data sharenfs off default</div><div>data checksum on default</div><div>data compression off received</div><div>data atime off local</div><div>data devices on default</div><div>data exec on default</div><div>data setuid on default</div><div>data readonly off local</div><div>data zoned off default</div><div>data snapdir hidden default</div><div>data aclmode discard default</div><div>data aclinherit restricted default</div><div>data canmount on default</div><div>data xattr on default</div><div>data copies 1 default</div><div>data version 5 -</div><div>data utf8only off -</div><div>data normalization none -</div><div>data casesensitivity sensitive -</div><div>data vscan off default</div><div>data nbmand off default</div><div>data sharesmb off default</div><div>data refquota none default</div><div>data refreservation none default</div><div>data primarycache all default</div><div>data secondarycache all default</div><div>data 
usedbysnapshots 0 -</div><div>data usedbydataset 1.09M -</div><div>data usedbychildren 104T -</div><div>data usedbyrefreservation 0 -</div><div>data logbias latency default</div><div>data dedup off local</div><div>data mlslabel none default</div><div>data sync standard default</div><div>data refcompressratio 1.00x -</div><div>data written 1.09M -</div><div>data logicalused 98.1T -</div><div>data logicalreferenced 398K -</div><div>data filesystem_limit none default</div><div>data snapshot_limit none default</div><div>data filesystem_count none default</div><div>data snapshot_count none default</div><div>data redundant_metadata all default</div><div>data nms:dedup-dirty on received</div><div>data nms:description datauporabnikov received</div><div><br class=""></div><div>I’m not sure which iSCSI configuration you want/need. But as far as I could tell during the last 'freeze', iSCSI is not the problem, since I’m unable to write to the ZFS volume even when working locally on the server itself.</div><div><br class=""></div><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><br class="">For iSCSI, can you take a look at this: <a href="http://docs.oracle.com/cd/E23824_01/html/821-1459/fpjwy.html#fsume" class="">http://docs.oracle.com/cd/E23824_01/html/821-1459/fpjwy.html#fsume</a><br class=""></div></div></blockquote><div><br class=""></div>Interesting. I tried running 'iscsiadm list target', but it doesn’t return anything. There is also nothing in /var/adm/messages, as usual :) But the target service is online (according to svcs), and clients are connected and passing traffic.</div><div><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><br class="">Do you have detailed logs for the clients experiencing the issues? 
If not, are you able to enable verbose logging (such as debug-level logs)?<br class=""></div></div></blockquote><div><br class=""></div><div>I have the clients’ logs, but they mostly just report losing connections and reconnecting:</div><div><br class=""></div><div>Example 1:</div><div>Apr 29 10:33:53 eee kernel: connection1:0: detected conn error (1021)<br class="">Apr 29 10:33:54 eee iscsid: Kernel reported iSCSI connection 1:0 error (1021 - ISCSI_ERR_SCSI_EH_SESSION_RST: Session was dropped as a result of SCSI error recovery) state (3)<br class="">Apr 29 10:33:56 eee iscsid: connection1:0 is operational after recovery (1 attempts)<br class="">Apr 29 10:36:37 eee kernel: connection1:0: detected conn error (1021)<br class="">Apr 29 10:36:37 eee iscsid: Kernel reported iSCSI connection 1:0 error (1021 - ISCSI_ERR_SCSI_EH_SESSION_RST: Session was dropped as a result of SCSI error recovery) state (3)<br class="">Apr 29 10:36:40 eee iscsid: connection1:0 is operational after recovery (1 attempts)</div><div>Apr 29 10:36:50 eee kernel: sd 3:0:0:0: Device offlined - not ready after error recovery<br class="">Apr 29 10:36:51 eee kernel: sd 3:0:0:0: Device offlined - not ready after error recovery<br class="">Apr 29 10:36:51 eee kernel: sd 3:0:0:0: Device offlined - not ready after error recovery</div><div><br class=""></div><div>Example 2:</div><div>Apr 16 08:41:40 vf kernel: connection1:0: pdu (op 0x5e itt 0x1) rejected. Reason code 0x7<br class="">Apr 16 08:43:11 vf kernel: connection1:0: pdu (op 0x5e itt 0x1) rejected. Reason code 0x7<br class="">Apr 16 08:44:13 vf kernel: connection1:0: pdu (op 0x5e itt 0x1) rejected. 
Reason code 0x7<br class="">Apr 16 08:45:51 vf kernel: connection1:0: detected conn error (1021)<br class="">Apr 16 08:45:51 317 iscsid: Kernel reported iSCSI connection 1:0 error (1021 - ISCSI_ERR_SCSI_EH_SESSION_RST: Session was dropped as a result of SCSI error recovery) state (3)<br class="">Apr 16 08:45:53 vf iscsid: connection1:0 is operational after recovery (1 attempts)</div><div><br class=""></div><div><br class=""></div><div>I’m already in contact with OmniTI regarding our new build, but in the meantime I would love for our clients to be able to use the storage, so I’m trying to resolve the current issue somehow…</div><div><br class=""></div><div>Matej</div><div><br class=""></div><div><br class=""></div></div></div></body></html>