On Jan 26, 2015, at 5:16 PM, W Verb <wverb73@gmail.com> wrote:

> Hello All,
>
> I am mildly confused by something iostat does when displaying statistics
> for a zpool. Before I begin rooting through the iostat source, does anyone
> have an idea of why I am seeing high "wait" and "wsvc_t" values for "ppool"
> when my devices apparently are not busy? I would have assumed that the
> stats for the pool would be the sum of the stats for the vdevs...

Welcome to queuing theory! ;-)

First, iostat knows nothing about the devices being measured. It is really
just a processor for kstats of type KSTAT_TYPE_IO (see the kstat(3KSTAT) man
page for discussion). For that type, you get a two-queue set. In many cases
two queues are a fine model, but when there is only one interesting queue,
developers sometimes choose to put the less interesting info in the "wait"
queue.

Second, it is the responsibility of the developer to define the queues. In
the case of pools, the queues are defined as:

	wait = vdev_queue_io_add() until vdev_queue_io_remove()
	run  = vdev_queue_pending_add() until vdev_queue_pending_remove()

The run queue is closer to the actual measured I/O to the vdev (the juicy
performance bits). The wait queue is closer to the transaction engine and
includes time spent waiting for aggregation. Thus we expect the wait queue
numbers to be higher, especially for async workloads. But since I/Os can and
do get aggregated before being sent to the vdev, the wait queue is not a very
useful measure of overall performance. In other words, "optimizing" it away
could actually hurt performance.

In general, worry about the run queues and don't worry so much about the
wait queues.
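If you want to see where those columns come from, here is a rough sketch. I
am assuming the pool's io kstat shows up as zfs:0:<poolname> on your build
(that is where iostat found your ppool row); check with "kstat -l -m zfs" if
the instance or name differs:

	# dump the raw KSTAT_TYPE_IO counters behind the ppool row
	kstat -p zfs:0:ppool

	# iostat turns two snapshots of those counters into its columns,
	# roughly:
	#   wait   = delta(wlentime) / delta(snaptime)   avg wait-queue length
	#   actv   = delta(rlentime) / delta(snaptime)   avg run-queue length
	#   wsvc_t = wait / (r/s + w/s) * 1000           ms per op in wait queue
	#   asvc_t = actv / (r/s + w/s) * 1000           ms per op in run queue

The last two are just Little's law: time in queue = queue length / throughput.
Plug in your ppool line: 837.4 / (10.0 + 6046.0) ops/sec is about 138 ms,
which is exactly the wsvc_t you are seeing. That time is spent sitting in the
pool's wait queue (aggregation and friends), not on the disks, which is why
the leaf devices look nearly idle. It also shows why the pool row is not a
simple sum of the per-device rows -- it is measured at a different layer.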
NB: iostat calls "run" queues "active" queues. You say tomato, I say 'mater.
 -- richard

>                     extended device statistics
>     r/s    w/s   kr/s     kw/s  wait actv wsvc_t asvc_t  %w  %b device
>    10.0 9183.0   40.5 344942.0   0.0  1.8    0.0    0.2   0 178 c4
>     1.0  187.0    4.0  19684.0   0.0  0.1    0.0    0.5   0   8 c4t5000C5006A597B93d0
>     2.0  199.0   12.0  20908.0   0.0  0.1    0.0    0.6   0  12 c4t5000C500653DE049d0
>     2.0  197.0    8.0  20788.0   0.0  0.2    0.0    0.8   0  15 c4t5000C5003607D87Bd0
>     0.0  202.0    0.0  20908.0   0.0  0.1    0.0    0.6   0  11 c4t5000C5006A5903A2d0
>     0.0  189.0    0.0  19684.0   0.0  0.1    0.0    0.5   0  10 c4t5000C500653DEE58d0
>     5.0  957.0   16.5   1966.5   0.0  0.1    0.0    0.1   0   7 c4t50026B723A07AC78d0
>     0.0  201.0    0.0  20787.9   0.0  0.1    0.0    0.7   0  14 c4t5000C5003604ED37d0
>     0.0    0.0    0.0      0.0   0.0  0.0    0.0    0.0   0   0 c4t5000C500653E447Ad0
>     0.0 3525.0    0.0 110107.7   0.0  0.5    0.0    0.2   0  51 c4t500253887000690Dd0
>     0.0 3526.0    0.0 110107.7   0.0  0.5    0.0    0.1   1  50 c4t5002538870006917d0
>    10.0 6046.0   40.5 344941.5 837.4  1.9  138.3    0.3  23  67 ppool
>
> For those following the VAAI thread, this is the system I will be using as
> my testbed.
>
> Here is the structure of ppool (taken at a different time than above):
>
> root@sanbox:/root# zpool iostat -v ppool
>                             capacity     operations    bandwidth
> pool                      alloc   free   read  write   read  write
> ------------------------  -----  -----  -----  -----  -----  -----
> ppool                      191G  7.97T     23    637   140K  15.0M
>   mirror                  63.5G  2.66T      7    133  46.3K   840K
>     c4t5000C5006A597B93d0      -      -      1     13  24.3K   844K
>     c4t5000C500653DEE58d0      -      -      1     13  24.1K   844K
>   mirror                  63.6G  2.66T      7    133  46.5K   839K
>     c4t5000C5006A5903A2d0      -      -      1     13  24.0K   844K
>     c4t5000C500653DE049d0      -      -      1     13  24.6K   844K
>   mirror                  63.5G  2.66T      7    133  46.8K   839K
>     c4t5000C5003607D87Bd0      -      -      1     13  24.5K   843K
>     c4t5000C5003604ED37d0      -      -      1     13  24.4K   843K
> logs                          -      -      -      -      -      -
>   mirror                   301M   222G      0    236      0  12.5M
>     c4t5002538870006917d0      -      -      0    236      5  12.5M
>     c4t500253887000690Dd0      -      -      0    236      5  12.5M
> cache                         -      -      -      -      -      -
>   c4t50026B723A07AC78d0   62.3G  11.4G     19    113  83.0K  1.07M
> ------------------------  -----  -----  -----  -----  -----  -----
>
> root@sanbox:/root# zfs get all ppool
> NAME   PROPERTY              VALUE                  SOURCE
> ppool  type                  filesystem             -
> ppool  creation              Sat Jan 24 18:37 2015  -
> ppool  used                  5.16T                  -
> ppool  available             2.74T                  -
> ppool  referenced            96K                    -
> ppool  compressratio         1.51x                  -
> ppool  mounted               yes                    -
> ppool  quota                 none                   default
> ppool  reservation           none                   default
> ppool  recordsize            128K                   default
> ppool  mountpoint            /ppool                 default
> ppool  sharenfs              off                    default
> ppool  checksum              on                     default
> ppool  compression           lz4                    local
> ppool  atime                 on                     default
> ppool  devices               on                     default
class="">ppool exec on default<br class="">ppool setuid on default<br class="">ppool readonly off default<br class="">ppool zoned off default<br class="">ppool snapdir hidden default<br class="">ppool aclmode discard default<br class="">ppool aclinherit restricted default<br class="">ppool canmount on default<br class="">ppool xattr on default<br class="">ppool copies 1 default<br class="">ppool version 5 -<br class="">ppool utf8only off -<br class="">ppool normalization none -<br class="">ppool casesensitivity sensitive -<br class="">ppool vscan off default<br class="">ppool nbmand off default<br class="">ppool sharesmb off default<br class="">ppool refquota none default<br class="">ppool refreservation none default<br class="">ppool primarycache all default<br class="">ppool secondarycache all default<br class="">ppool usedbysnapshots 0 -<br class="">ppool usedbydataset 96K -<br class="">ppool usedbychildren 5.16T -<br class="">ppool usedbyrefreservation 0 -<br class="">ppool logbias latency default<br class="">ppool dedup off default<br class="">ppool mlslabel none default<br class="">ppool sync standard local<br class="">ppool refcompressratio 1.00x -<br class="">ppool written 96K -<br class="">ppool logicalused 445G -<br class="">ppool logicalreferenced 9.50K -<br class="">ppool filesystem_limit none default<br class="">ppool snapshot_limit none default<br class="">ppool filesystem_count none default<br class="">ppool snapshot_count none default<br class="">ppool redundant_metadata all default</span><br class=""><br class=""></div><div class="">Currently, <span style="font-family:monospace,monospace" class="">ppool</span> contains a single 5TB zvol that I am hosting as an iSCSI block device. At the zdev level, I have ensured that the ashift is 12 for all devices, all physical devices are 4k-native SATA, and the cache/log SSDs are also set for 4k. The block sizes are manually set in <span style="font-family:monospace,monospace" class="">sd.conf</span>, and confirmed with "<span style="font-family:monospace,monospace" class="">echo ::sd_state | mdb -k | egrep '(^un|_blocksize)'</span>". The zvol blocksize is 4k, and the iSCSI block transfer size is 512B (not that it matters).<br class=""><br class=""></div><div class="">All drives contain a single Solaris2 partition with an EFI label, and are properly aligned:<br class=""><span style="font-family:monospace,monospace" class="">format> verify<br class=""><br class="">Volume name = < ><br class="">ascii name = <ATA-ST3000DM001-1CH1-CC27-2.73TB><br class="">bytes/sector = 512<br class="">sectors = 5860533167<br class="">accessible sectors = 5860533134<br class="">Part Tag Flag First Sector Size Last Sector<br class=""> 0 usr wm 256 2.73TB 5860516750 <br class=""> 1 unassigned wm 0 0 0<br class=""> 2 unassigned wm 0 0 0<br class=""> 3 unassigned wm 0 0 0<br class=""> 4 unassigned wm 0 0 0<br class=""> 5 unassigned wm 0 0 0<br class=""> 6 unassigned wm 0 0 0<br class=""> 8 reserved wm 5860516751 8.00MB 5860533134 </span><br class=""></div><div class=""><br class=""></div><div class="">I scrubbed the pool last night, which completed without error. 
From "<span style="font-family:monospace,monospace" class="">zdb ppool</span>", I have extracted (with minor formatting):<br class=""></div><div class=""><span style="font-family:monospace,monospace" class=""><br class=""> capacity operations bandwidth ---- errors ----<br class="">description used avail read write read write read write cksum<br class="">ppool 339G 7.82T 26.6K 0 175M 0 0 0 5<br class=""> mirror 113G 2.61T 8.87K 0 58.5M 0 0 0 2<br class=""> /dev/dsk/c4t5000C5006A597B93d0s0 3.15K 0 48.8M 0 0 0 2<br class=""> /dev/dsk/c4t5000C500653DEE58d0s0 3.10K 0 49.0M 0 0 0 2<br class=""> <br class=""> mirror 113G 2.61T 8.86K 0 58.5M 0 0 0 8<br class=""> /dev/dsk/c4t5000C5006A5903A2d0s0 3.12K 0 48.7M 0 0 0 8<br class=""> /dev/dsk/c4t5000C500653DE049d0s0 3.08K 0 48.9M 0 0 0 8<br class=""> <br class=""> mirror 113G 2.61T 8.86K 0 58.5M 0 0 0 10<br class=""> /dev/dsk/c4t5000C5003607D87Bd0s0 2.48K 0 48.8M 0 0 0 10<br class=""> /dev/dsk/c4t5000C5003604ED37d0s0 2.47K 0 48.9M 0 0 0 10<br class=""> <br class=""> log mirror 44.0K 222G 0 0 37 0 0 0 0<br class=""> /dev/dsk/c4t5002538870006917d0s0 0 0 290 0 0 0 0<br class=""> /dev/dsk/c4t500253887000690Dd0s0 0 0 290 0 0 0 0<br class=""> Cache<br class=""> /dev/dsk/c4t50026B723A07AC78d0s0<br class=""> 0 73.8G 0 0 35 0 0 0 0<br class=""> Spare<br class=""> /dev/dsk/c4t5000C500653E447Ad0s0 4 0 136K 0 0 0 0</span><br class=""><br class=""></div><div class="">This shows a few checksum errors, which is not consistent with the output of "<span style="font-family:monospace,monospace" class="">zfs status -v</span>", and "<span style="font-family:monospace,monospace" class="">iostat -eE</span>" shows no physical error count. I again see the discrepancy between the "<span style="font-family:monospace,monospace" class="">ppool</span>" value and what I would expect, which would be a sum of the <span style="font-family:monospace,monospace" class="">cksum</span> errors for each vdev.<br class=""><br class=""></div><div class="">I also observed a ton of leaked space, which I expect from a live pool, as well as a single:<br class=""><span style="font-family:monospace,monospace" class="">db_blkptr_cb: Got error 50 reading <96, 1, 2, 3fc8> DVA[0]=<1:1dc4962000:1000> DVA[1]=<2:1dc4654000:1000> [L2 zvol object] fletcher4 lz4 LE contiguous unique double size=4000L/a00P birth=52386L/52386P fill=4825 cksum=c70e8a7765:f2a </span><br class=""><span style="font-family:monospace,monospace" class="">dce34f59c:c8a289b51fe11d:7e0af40fe154aab4 -- skipping</span><br class=""></div><div class=""><br class=""><br class=""></div><div class="">By the way, I also found:<br class=""><span style="font-family:monospace,monospace" class=""><br class="">Uberblock:<br class=""> magic = 000000000<b class="">0bab10c</b></span><br class=""><br class=""></div><div class="">Wow. Just wow.<br class=""></div><div class=""><br class=""><br class=""></div><div class="">-Warren V<br class=""></div><div class=""><br class=""></div></div>
> _______________________________________________
> OmniOS-discuss mailing list
> OmniOS-discuss@lists.omniti.com
> http://lists.omniti.com/mailman/listinfo/omnios-discuss