[OmniOS-discuss] NVMe Performance

Josh Coombs jcoombs at staff.gwi.net
Fri Apr 8 03:35:33 UTC 2016


Hi all,

I recently kitbashed a backup storage dump box based on OmniOS, a couple of
retired servers, and a few new bits to improve its performance.

- HP DL360 G6 with dual Xeon 5540s, 80GB RAM
- The onboard HP SAS controller hosts the root pool on its RAID 5 of SAS
disks; not ideal, but the card doesn't have an IT mode.
- LSI 9208e for main storage
- - Chenbro SAS expander in an Addonics 20 bay SATA shelf
- - 5 x WD 6TB Red drives in a RAIDz3 pool (layout sketched just after this list)
- 4 x onboard GigE, LAG'd together
- dual-port QLogic 4Gb FC card for later use with my VMware farm
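
For reference, the data pool layout is roughly the sketch below.  The pool
and device names are placeholders, not the actual ones on this box:

# zpool create tank raidz3 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0
# zpool status tank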

As I noted, I was able to get a few new bits for the box: three Intel DC
P3600 400GB PCIe NVMe SSDs.  I figured I could use two as a mirrored log
device, one as cache, and have a lovely setup.
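
The plan, roughly, would be something like the following once the pool is
in place (again, device names are just placeholders):

# zpool add tank log mirror c3t0d0 c3t1d0
# zpool add tank cache c3t2d0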

Initial testing with dd, before adding the NVMe drives, shows I can slam a
sustained 300MB/s into the SATA pool when dumping to a file in one of my
ZFS datasets:

# dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 37.3299 s, 288 MB/s

# dd if=/dev/zero of=bigfile bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.593213 s, 1.8 GB/s

For short bursts, the box caches nicely with the RAM I was able to scrounge
up.  I forgot I had one ZFS dataset set to gzip-9 compression and was very
impressed to see it sustain 1.7 GB/s over the course of dumping 107 GB of
zeros.  Nice!
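
(With all-zero input the compressed blocks shrink to almost nothing, so
that number is mostly a measure of CPU and ARC rather than the disks.)
Checking and setting compression per dataset is just the following; the
dataset name here is a placeholder:

# zfs get compression tank/backups
# zfs set compression=gzip-9 tank/backups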

Setting up a simple single-device pool on one NVMe card and repeating the
test on it, I get:

# dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 16.1904 s, 663 MB/s
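
For reference, that test pool was just a single device, created along these
lines (names are placeholders), with the dd run against a file on it:

# zpool create nvtest c3t0d0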

Not quite the blazing performance I was expecting.  I was hoping they could
sustain at least twice that.  If I make a zpool out of all three and bump my
test to 53GB, it sustains 1.1 GB/s, so there is a little performance scaling
from spreading the work across all three, but again nowhere near what I was
anticipating.
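
While those tests run I can also watch how evenly the writes spread across
the three devices with something like this (pool name is a placeholder):

# zpool iostat -v nvtest 1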

I also did a raw write to one of the units after making sure it wasn't in
use by any pools:
# dd if=/dev/zero of=/dev/rdsk/c3t1d0 bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 22.2461 s, 483 MB/s

If I get the system loaded enough, doing that results in transport errors
being logged for the unit while I'm hammering it with dd.
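
When the transport errors show up I can pull more detail with the usual
tools, something along these lines:

# iostat -En
# fmdump -eV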

Digging around, I found some FreeBSD discussions where they observed their
NVMe driver falling back to legacy interrupt mode on systems with lots of
CPU cores once it ran out of MSI-X vectors.  Given that I've got 8 physical
cores and a lot of devices on the PCIe bus, I don't know whether that is a
possibility here or not.  I've not poked at the driver source yet, as to be
honest I wouldn't know what I was looking for.
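
If it would help, I believe something like the following should show whether
the nvme instances got MSI-X vectors or fell back to fixed interrupts; I
haven't dug into the driver, so treat this as a guess at the right tools:

# echo ::interrupts | mdb -k
# intrstat 5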

I also understand that the NVMe driver is pretty new to illumos, so I
shouldn't expect it to be a rocket yet.  I just figured I'd share what I've
observed so far, to see whether it matches what I should expect or whether
there is additional testing work I can do to help improve the driver's
performance down the road.
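
One test I'm happy to run if it's useful: a single dd is one writer with one
outstanding stream, and my understanding is NVMe really wants parallelism,
so several writers at once might tell us more (file names and counts here
are arbitrary):

# for i in 1 2 3 4; do dd if=/dev/zero of=bigfile.$i bs=1M count=4096 & done; wait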

Josh Coombs