[OmniOS-discuss] Building a new storage

Richard Elling richard.elling at richardelling.com
Fri Apr 10 15:19:20 UTC 2015


> On Apr 10, 2015, at 3:17 AM, Matej Zerovnik <matej at zunaj.si> wrote:
> 
> We are currently thinking of rebuilding our SAN, since we made some mistakes on the first build. But before we begin, we would like to plan accordingly, so I'm wondering how to measure some data (L2ARC and ZIL usage, current IOPS, ...) the right way.
> 
> We currently have a single raidz2 pool built out of 50 SATA drives (Seagate Constellation), 2x Intel S3700 100GB as ZIL and 2x Intel S3700 100GB as L2ARC.
> 
> For the new system, we plan to use an IBM 3550 M4 server with 256GB of memory and an LSI SAS 9207-8e HBA. We will have around 70-80 SAS 4TB drives in JBOD cases and, if we need them, some SSDs for ZIL and L2ARC.

[sidebar conversation: I've experienced bad results with WD Black 4TB SAS drives]

> 
> Questions:
> 
> 1.)
> How do we measure the average IOPS of the current system? 'zpool iostat poolname 1' gives me weird numbers saying the current drives perform around 300 read ops and 100 write ops per second. The drives are 7200 rpm SATA drives, so I know they can't perform that many IOPS.

Sure they can. The measurable peak will be closer to 20,000 IOPS for a 7,200 rpm drive at 512 bytes.
For HDDs, the biggest impact on response time is long seeks, and the placement algorithms for ZFS
bias towards the outer cylinders. From outer to inner cylinders, bandwidth usually drops by about 30%
and random IOPS become impacted by longer seeks. Also, writes are cached in the drive, so you rarely
see seek penalties for writes. This leads to a false sense of performance. This is why we often use the
rule of thumb that an HDD can deliver 100 IOPS @ 4KB and 100 MB/sec -- good enough for
back-of-the-envelope capacity planning, but not as good as real, long-term measurement.
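
As a back-of-the-envelope example, applying that rule of thumb to the 70-80 drive plan above
(my arithmetic, ignoring RAID layout, caching, and the HBA/expander path):

   70 drives x 100 IOPS   ~=  7,000 random 4KB IOPS, raw
   70 drives x 100 MB/sec ~=  7,000 MB/sec streaming, raw

The vdev layout, caching, and the data path will all take bites out of those numbers, which is
why real, long-term measurement of your workload wins.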

But the more interesting question is: where do you plan to measure the IOPS? The backend stats
as seen by iostat and zpool iostat are difficult to use because they do not account for caching, and
the writes are coalesced. Write coalescing is particularly important for people who insist on counting
IOPS because, by default, 32 random 4KB writes will be coalesced into one 128KB write. Let's
take a closer look at your data...

> Output from iostat -vx (only some drives are pasted):
> Code:
> device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
> data   36621,9 25740,2 19288,6 66191,0 197,6 25,9    3,6  40  77 
> sd18    276,3  104,8  145,2   83,3  0,0  0,6    1,5   0  36 

This version of iostat doesn't show average sizes :-( but you can calculate them from the data :-)

For pool data (data written from the pool to disks, not data written into the pool):
average write size = 66191,0 / 25740,2 = 2.5 KB
average read size = 19288,6 / 36621,9 = 526 bytes

For sd18:
average write size = 83,3 / 104,8 = 794 bytes
average read size = 145,2 / 276,3 = 525 bytes

From this we can suggest:
1. avoid 4KB sector sized disks for this configuration and workload
2. look further up the stack to determine why such small physical I/Os are being used (see the sketch below)
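
One hedged way to chase item 2 with DTrace: the io provider can show the full distribution of
physical I/O sizes per device, rather than just the averages, e.g.

# dtrace -n 'io:::start { @size[args[1]->dev_statname] = quantize(args[0]->b_bcount); }'

For the logical sizes the iSCSI clients are actually issuing, a tool like iscsisvrtop (mentioned
below) is the better view.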


> sd19    283,3  106,7  152,1   83,3  0,0  0,6    1,5   0  24 
> sd20    281,3  101,8  146,7   79,8  0,0  0,5    1,4   0  35 
> sd21    286,3  117,7  146,7   84,3  0,0  0,3    0,7   0  21 
> sd22    283,3   85,8  144,2   81,3  0,0  0,5    1,3   0  32 
> sd23    275,3  116,7  139,7   82,8  0,0  0,3    0,8   0  21 
> sd24    280,3  106,7  155,6   84,3  0,0  0,6    1,6   0  25 
> sd25    288,3  106,7  148,6   86,3  0,0  0,4    1,0   0  24 
> sd26    269,4  110,7  137,2   91,8  0,0  0,5    1,3   0  24 
> sd27    272,4   87,8  141,7   78,3  0,0  0,7    1,8   0  34 
> sd28    236,4  115,7  219,0   84,8  0,0  0,9    2,5   0  26 
> sd29    235,4  108,7  228,5   83,8  0,0  0,9    2,7   0  33
> Output of 'zpool iostat -v data 1 | grep drive_id'
> Code:
>                               capacity     operations    bandwidth
>     pool                   alloc   free   read  write   read  write
>     c8t5000C5004FD18DE9d0      -      -    573    220   663K   607K
>     c8t5000C5004FD18DE9d0      -      -    563      0   318K      0
>     c8t5000C5004FD18DE9d0      -      -    586    314   361K   806K
>     c8t5000C5004FD18DE9d0      -      -    567    445   373K  1,02M
>     c8t5000C5004FD18DE9d0      -      -    464     25   299K  17,9K
>     c8t5000C5004FD18DE9d0      -      -    552      2   326K  3,68K
>     c8t5000C5004FD18DE9d0      -      -    421     41   249K  31,3K
>     c8t5000C5004FD18DE9d0      -      -    492    400   391K   944K
>     c8t5000C5004FD18DE9d0      -      -    313    148   242K   337K
>     c8t5000C5004FD18DE9d0      -      -    330    163   360K   390K
>     c8t5000C5004FD18DE9d0      -      -    655     23   577K  21,5K
> Is it just me, or are those too many IOPS for those drives to handle even in theory, let alone in practice? How do I get the right measurement?

To measure IOPS written into the pool, look at fsstat for file systems. For iSCSI, this isn't
quite so easy to gather, so we use DTrace; see iscsisvrtop as an example.
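
For example (a sketch; the intervals are just illustrative):

# fsstat zfs 1        # logical read/write ops and bytes, per second, for all ZFS file systems
# ./iscsisvrtop 10    # per-initiator iSCSI read/write IOPS, bandwidth, and latency every 10s

fsstat counts operations as the clients issue them, before the ARC and before write coalescing,
which is usually the number you actually want for capacity planning.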

> 
> 2.)
> Current ARC utilization on our system:
> Code:
> ARC Efficency:
>          Cache Access Total:             2134027465
>          Cache Hit Ratio:      64%       1381755042     [Defined State for buffer]
>          Cache Miss Ratio:     35%       752272423      [Undefined State for Buffer]
>          REAL Hit Ratio:       56%       1199175179     [MRU/MFU Hits Only]
> Code:
> ./arcstat.pl -f read,hits,miss,hit%,l2read,l2hits,l2miss,l2hit%,arcsz,l2size 1 2>/dev/null
> read  hits  miss  hit%  l2read  l2hits  l2miss  l2hit%  arcsz  l2size  
>    1     1     0   100       0       0       0       0   213G    235G  
> 4.8K  3.0K  1.9K    61    1.9K      40    1.8K       2   213G    235G  
> 4.3K  2.7K  1.6K    62    1.6K      35    1.5K       2   213G    235G  
> 2.5K   853  1.6K    34    1.6K      45    1.6K       2   213G    235G  
> 5.1K  3.0K  2.2K    57    2.2K      49    2.1K       2   213G    235G  
> 6.5K  4.4K  2.1K    68    2.1K      30    2.0K       1   213G    235G  
> 5.0K  2.5K  2.5K    49    2.5K      44    2.5K       1   213G    235G  
>  11K  8.5K  2.8K    75    2.8K      13    2.8K       0   213G    235G  
> 6.4K  4.8K  1.6K    74    1.6K      57    1.6K       3   213G    235G  
> 2.3K  1.1K  1.2K    46    1.2K      88    1.1K       7   213G    235G  
> 1.9K   532  1.3K    28    1.3K      83    1.2K       6   213G    235G
> As we can see, there are almost no L2ARC cache hits. What can be the reason for that? Is our L2ARC cache too small, or is the data on our storage just too random to be cached? I don't know what is on our iSCSI shares, since they are for outside customers, but as far as I know, it's mostly backups and some live data.

Unfortunately, most versions of arcstat do not measure what you want to know. The measurement
you're looking for is the reason for eviction. These are measured as kstats:
# kstat -p ::arcstats:evict_l2\*
zfs:0:arcstats:evict_l2_cached  0
zfs:0:arcstats:evict_l2_eligible        2224128
zfs:0:arcstats:evict_l2_ineligible      4096

For this example system, you can see:
+ nothing evicted from the ARC was already cached in L2 (mostly because there is no L2 :-)
+ 2224128 bytes of evicted data were eligible to be satisfied from an L2 cache
+ 4096 bytes of evicted data were not eligible

This example system can benefit from an L2 cache.
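
To see how fast those counters move on a live system, kstat will also sample on an interval
(a sketch; the interval and count are arbitrary):

# kstat -p ::arcstats:evict_l2_eligible 60 5

If evict_l2_eligible grows quickly compared to evict_l2_ineligible, an L2ARC has data it could
usefully hold; if most evicted bytes are ineligible, adding L2ARC won't help much.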

> 
> 3.)
> As far as ZIL goes, do we need it?

From the data below, yes, it will help.
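
As a rough sizing sketch (my arithmetic, from the zilstat peaks below): the log device only needs
to hold the sync writes from the last couple of transaction groups, so at a burst rate of roughly
220 MB/sec and a 5 second txg interval, a few GB of separate log is plenty. Capacity is not the
issue; pick the log device for write latency and endurance.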

> I think I read somewhere that the ZIL can only store 8k blocks and that you have to 'format' iSCSI drives accordingly. Is that still the case?

This was never the case; where did you read it?

> Output from 'zilstat':
> Code:
>    N-Bytes  N-Bytes/s N-Max-Rate    B-Bytes  B-Bytes/s B-Max-Rate    ops  <=4kB 4-32kB >=32kB
>          0          0          0          0          0          0      0      0      0      0
>          0          0          0          0          0          0      0      0      0      0
>     178352     178352     178352     262144     262144     262144      2      0      0      2
>  134823992  134823992  134823992  221380608  221380608  221380608   1689      0      0   1689
>  102893848  102893848  102893848  168427520  168427520  168427520   1285      0      0   1285
>          0          0          0          0          0          0      0      0      0      0
>       4472       4472       4472     131072     131072     131072      1      0      0      1
>          0          0          0          0          0          0      0      0      0      0
>      41904      41904      41904     262144     262144     262144      2      0      0      2
>  134963824  134963824  134963824  221511680  221511680  221511680   1690      0      0   1690
>          0          0          0          0          0          0      0      0      0      0
>          0          0          0          0          0          0      0      0      0      0
>          0          0          0          0          0          0      0      0      0      0
>          0          0          0          0          0          0      0      0      0      0
>   32789896   32789896   32789896   53346304   53346304   53346304    407      0      0    407
>   25467912   25467912   25467912   41811968   41811968   41811968    319      0      0    319
> Given the stats, is a ZIL even necessary? When I'm running zilstat, I see big ops every 5s. Why is that? I know the system is supposed to flush data from memory to spindles every 5s, but that shouldn't show up as a ZIL flush, is that correct?
> 
> 4.)
> How should we put the drives together to get the best IOPS/capacity ratio out of them? We were thinking of 7 RAIDZ2 vdevs with 10 drives each. That way we would get around a 224TB pool.

This depends on the workload. For more IOPS, use more vdevs.
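
A hedged back-of-the-envelope, using the 100 IOPS/drive rule above and the fact that a raidz2
vdev delivers roughly the random-read IOPS of a single drive:

   7 x 10-drive raidz2   ~=  7 vdevs x 100 IOPS ~=   700 random read IOPS, ~224TB usable
   35 x 2-drive mirrors  ~= 35 vdevs x 100 IOPS ~= 3,500 random read IOPS, ~140TB usable

So the raidz2 layout buys capacity at the cost of small random I/O; which matters more depends
on how much of the iSCSI workload misses the ARC/L2ARC.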

> 
> 5.)
> In case we decide to go with 4 JBOD cases, would it be better to build 2 pools, just so that in case 1 pool has a hiccup, we won't lose all the data?

This is a common configuration: two SAS pools + two servers configured such that either server can
serve either pool.
 -- richard

> 
> What else am I not considering?
> 
> Thanks, Matej
> _______________________________________________
> OmniOS-discuss mailing list
> OmniOS-discuss at lists.omniti.com
> http://lists.omniti.com/mailman/listinfo/omnios-discuss
