[OmniOS-discuss] Mildly confusing ZFS iostat output

Richard Elling richard.elling at richardelling.com
Tue Jan 27 04:14:14 UTC 2015


> On Jan 26, 2015, at 5:16 PM, W Verb <wverb73 at gmail.com> wrote:
> 
> Hello All,
> 
> I am mildly confused by something iostat does when displaying statistics for a zpool. Before I begin rooting through the iostat source, does anyone have an idea why I am seeing high "wait" and "wsvc_t" values for "ppool" when my devices apparently are not busy? I would have assumed that the stats for the pool would be the sum of the stats for the vdevs....

welcome to queuing theory! ;-)

First, iostat knows nothing about the devices being measured. It is really just a processor
for kstats of type KSTAT_TYPE_IO (see the kstat(3KSTAT) man page for discussion). For that
type, you get a two-queue set. In many cases two queues are a fine model, but when there is
only one interesting queue, developers sometimes choose to put the less interesting info in the
"wait" queue.

Second, it is the responsibility of the developer to define the queues. In the case of pools,
the queues are defined as:
	wait = vdev_queue_io_add() until vdev_queue_io_remove()
	run = vdev_queue_pending_add() until vdev_queue_pending_remove()

The run queue is closer to the actual measured I/O to the vdev (the juicy performance bits).
The wait queue is closer to the transaction engine and includes time spent waiting for aggregation.
Thus we expect the wait queue to be higher, especially for async workloads. But since I/Os
can and do get aggregated prior to being sent to the vdev, the wait queue is not a very useful
measure of overall performance. In other words, optimizing it away could actually hurt performance.

In general, worry about the run queues and don't worry so much about the wait queues.
NB, iostat calls "run" queues "active" queues. You say Tomato, I say 'mater.
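
To put numbers on it from your sample below: Little's law says the average time spent in a queue
is its average length divided by the arrival rate. For ppool that rate is 10.0 + 6046.0 = 6056
ops/sec, so wsvc_t = 837.4 / 6056 ops/sec = 0.138 sec (138 ms), and asvc_t = 1.9 / 6056 ops/sec =
0.3 ms, which matches the 138.3 and 0.3 that iostat printed. The scary wsvc_t is just a long wait
queue divided by the same op rate, not evidence that the vdevs themselves are slow.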
 -- richard


> 
>                     extended device statistics
>     r/s    w/s   kr/s     kw/s  wait actv wsvc_t asvc_t  %w  %b device
>    10.0 9183.0   40.5 344942.0   0.0  1.8    0.0    0.2   0 178 c4
>     1.0  187.0    4.0  19684.0   0.0  0.1    0.0    0.5   0   8 c4t5000C5006A597B93d0
>     2.0  199.0   12.0  20908.0   0.0  0.1    0.0    0.6   0  12 c4t5000C500653DE049d0
>     2.0  197.0    8.0  20788.0   0.0  0.2    0.0    0.8   0  15 c4t5000C5003607D87Bd0
>     0.0  202.0    0.0  20908.0   0.0  0.1    0.0    0.6   0  11 c4t5000C5006A5903A2d0
>     0.0  189.0    0.0  19684.0   0.0  0.1    0.0    0.5   0  10 c4t5000C500653DEE58d0
>     5.0  957.0   16.5   1966.5   0.0  0.1    0.0    0.1   0   7 c4t50026B723A07AC78d0
>     0.0  201.0    0.0  20787.9   0.0  0.1    0.0    0.7   0  14 c4t5000C5003604ED37d0
>     0.0    0.0    0.0      0.0   0.0  0.0    0.0    0.0   0   0 c4t5000C500653E447Ad0
>     0.0 3525.0    0.0 110107.7   0.0  0.5    0.0    0.2   0  51 c4t500253887000690Dd0
>     0.0 3526.0    0.0 110107.7   0.0  0.5    0.0    0.1   1  50 c4t5002538870006917d0
>    10.0 6046.0   40.5 344941.5 837.4  1.9  138.3    0.3  23  67 ppool
> 
> 
> For those following the VAAI thread, this is the system I will be using as my testbed.
> 
> Here is the structure of ppool (taken at a different time than above):
> 
> root at sanbox:/root# zpool iostat -v ppool
>                               capacity     operations    bandwidth
> pool                       alloc   free   read  write   read  write
> -------------------------  -----  -----  -----  -----  -----  -----
> ppool                       191G  7.97T     23    637   140K  15.0M
>   mirror                   63.5G  2.66T      7    133  46.3K   840K
>     c4t5000C5006A597B93d0      -      -      1     13  24.3K   844K
>     c4t5000C500653DEE58d0      -      -      1     13  24.1K   844K
>   mirror                   63.6G  2.66T      7    133  46.5K   839K
>     c4t5000C5006A5903A2d0      -      -      1     13  24.0K   844K
>     c4t5000C500653DE049d0      -      -      1     13  24.6K   844K
>   mirror                   63.5G  2.66T      7    133  46.8K   839K
>     c4t5000C5003607D87Bd0      -      -      1     13  24.5K   843K
>     c4t5000C5003604ED37d0      -      -      1     13  24.4K   843K
> logs                           -      -      -      -      -      -
>   mirror                    301M   222G      0    236      0  12.5M
>     c4t5002538870006917d0      -      -      0    236      5  12.5M
>     c4t500253887000690Dd0      -      -      0    236      5  12.5M
> cache                          -      -      -      -      -      -
>   c4t50026B723A07AC78d0    62.3G  11.4G     19    113  83.0K  1.07M
> -------------------------  -----  -----  -----  -----  -----  -----
> 
> root at sanbox:/root# zfs get all ppool
> NAME   PROPERTY              VALUE                  SOURCE
> ppool  type                  filesystem             -
> ppool  creation              Sat Jan 24 18:37 2015  -
> ppool  used                  5.16T                  -
> ppool  available             2.74T                  -
> ppool  referenced            96K                    -
> ppool  compressratio         1.51x                  -
> ppool  mounted               yes                    -
> ppool  quota                 none                   default
> ppool  reservation           none                   default
> ppool  recordsize            128K                   default
> ppool  mountpoint            /ppool                 default
> ppool  sharenfs              off                    default
> ppool  checksum              on                     default
> ppool  compression           lz4                    local
> ppool  atime                 on                     default
> ppool  devices               on                     default
> ppool  exec                  on                     default
> ppool  setuid                on                     default
> ppool  readonly              off                    default
> ppool  zoned                 off                    default
> ppool  snapdir               hidden                 default
> ppool  aclmode               discard                default
> ppool  aclinherit            restricted             default
> ppool  canmount              on                     default
> ppool  xattr                 on                     default
> ppool  copies                1                      default
> ppool  version               5                      -
> ppool  utf8only              off                    -
> ppool  normalization         none                   -
> ppool  casesensitivity       sensitive              -
> ppool  vscan                 off                    default
> ppool  nbmand                off                    default
> ppool  sharesmb              off                    default
> ppool  refquota              none                   default
> ppool  refreservation        none                   default
> ppool  primarycache          all                    default
> ppool  secondarycache        all                    default
> ppool  usedbysnapshots       0                      -
> ppool  usedbydataset         96K                    -
> ppool  usedbychildren        5.16T                  -
> ppool  usedbyrefreservation  0                      -
> ppool  logbias               latency                default
> ppool  dedup                 off                    default
> ppool  mlslabel              none                   default
> ppool  sync                  standard               local
> ppool  refcompressratio      1.00x                  -
> ppool  written               96K                    -
> ppool  logicalused           445G                   -
> ppool  logicalreferenced     9.50K                  -
> ppool  filesystem_limit      none                   default
> ppool  snapshot_limit        none                   default
> ppool  filesystem_count      none                   default
> ppool  snapshot_count        none                   default
> ppool  redundant_metadata    all                    default
> 
> Currently, ppool contains a single 5TB zvol that I am hosting as an iSCSI block device. At the vdev level, I have ensured that the ashift is 12 for all devices; all physical devices are 4k-native SATA, and the cache/log SSDs are also set for 4k. The block sizes are manually set in sd.conf and confirmed with "echo ::sd_state | mdb -k | egrep '(^un|_blocksize)'". The zvol blocksize is 4k, and the iSCSI block transfer size is 512B (not that it matters).
> 
> All drives contain a single Solaris2 partition with an EFI label, and are properly aligned:
> format> verify
> 
> Volume name = <        >
> ascii name  = <ATA-ST3000DM001-1CH1-CC27-2.73TB>
> bytes/sector    =  512
> sectors = 5860533167
> accessible sectors = 5860533134
> Part      Tag    Flag     First Sector          Size          Last Sector
>   0        usr    wm               256         2.73TB           5860516750   
>   1 unassigned    wm                 0            0                0
>   2 unassigned    wm                 0            0                0
>   3 unassigned    wm                 0            0                0
>   4 unassigned    wm                 0            0                0
>   5 unassigned    wm                 0            0                0
>   6 unassigned    wm                 0            0                0
>   8   reserved    wm        5860516751         8.00MB           5860533134 
> 
> I scrubbed the pool last night, which completed without error. From "zdb ppool", I have extracted (with minor formatting):
> 
>                              capacity  operations   bandwidth  ---- errors ----
> description                used avail  read write  read write  read write cksum
> ppool                      339G 7.82T 26.6K     0  175M     0     0     0     5
>   mirror                   113G 2.61T 8.87K     0 58.5M     0     0     0     2
>     /dev/dsk/c4t5000C5006A597B93d0s0  3.15K     0 48.8M     0     0     0     2
>     /dev/dsk/c4t5000C500653DEE58d0s0  3.10K     0 49.0M     0     0     0     2
>   
>   mirror                   113G 2.61T 8.86K     0 58.5M     0     0     0     8
>     /dev/dsk/c4t5000C5006A5903A2d0s0  3.12K     0 48.7M     0     0     0     8
>     /dev/dsk/c4t5000C500653DE049d0s0  3.08K     0 48.9M     0     0     0     8
>   
>   mirror                   113G 2.61T 8.86K     0 58.5M     0     0     0    10
>     /dev/dsk/c4t5000C5003607D87Bd0s0  2.48K     0 48.8M     0     0     0    10
>     /dev/dsk/c4t5000C5003604ED37d0s0  2.47K     0 48.9M     0     0     0    10
>   
>   log mirror              44.0K  222G     0     0    37     0     0     0     0
>     /dev/dsk/c4t5002538870006917d0s0      0     0   290     0     0     0     0
>     /dev/dsk/c4t500253887000690Dd0s0      0     0   290     0     0     0     0
>   Cache
>   /dev/dsk/c4t50026B723A07AC78d0s0
>                               0 73.8G     0     0    35     0     0     0     0
>   Spare
>   /dev/dsk/c4t5000C500653E447Ad0s0        4     0  136K     0     0     0     0
> 
> This shows a few checksum errors, which is not consistent with the output of "zpool status -v", and "iostat -eE" shows no physical error count. I again see a discrepancy between the "ppool" value and what I would expect, namely the sum of the cksum errors for each vdev.
> 
> I also observed a ton of leaked space, which I expect from a live pool, as well as a single:
> db_blkptr_cb: Got error 50 reading <96, 1, 2, 3fc8> DVA[0]=<1:1dc4962000:1000> DVA[1]=<2:1dc4654000:1000> [L2 zvol object] fletcher4 lz4 LE contiguous unique double size=4000L/a00P birth=52386L/52386P fill=4825 cksum=c70e8a7765:f2adce34f59c:c8a289b51fe11d:7e0af40fe154aab4 -- skipping
> 
> 
> By the way, I also found:
> 
> Uberblock:
>         magic = 0000000000bab10c
> 
> Wow. Just wow.
> 
> 
> -Warren V
> 
> _______________________________________________
> OmniOS-discuss mailing list
> OmniOS-discuss at lists.omniti.com
> http://lists.omniti.com/mailman/listinfo/omnios-discuss
