[OmniOS-discuss] Status of TRIM support?

Jim Klimov jimklimov at cos.ru
Wed May 28 06:37:45 UTC 2014


On 28 May 2014 at 3:11:07 CEST, Dan Swartzendruber <dswartz at druber.com> wrote:
>
>So I've been running with sync=disabled on my vSphere NFS datastore. I've
>been willing to do so because I have a big-ass UPS and do hourly backups.
>But I'm thinking of going to an active/passive connection to my JBOD,
>following Saso's blog post at zfs-create.blogspot.com. Here's why I think
>I can't keep using sync=disabled (I would love to have my logic sanity
>checked). If you switch manually from host A to B, all is well, since
>before host A exports the pool, any pending writes will be completed (so
>even though we lied to vSphere, it's okay). On the other hand, if host A
>crashes/hangs and host B takes over, forcibly importing the pool, you
>could end up with the following scenario: vSphere issues writes for
>blocks A, B, C, D and E. A and B have been written. C and D were sent to
>host A and ACKed, so vSphere thinks all is well. Host A has not yet
>committed blocks C and D to disk. Host B imports the pool, assumes the
>virtual IP for the NFS share, and vSphere reconnects to the datastore.
>Since it thinks it has written blocks A-D, it then issues a write for
>block E. Host B commits that to disk. vSphere thinks blocks A-E were
>written to disk, when in fact blocks C and D were not. Silent data
>corruption, and as far as I can tell no way to know it happened, so if I
>ever did have a forced failover I would have to roll back every single VM
>to the last known good snapshot. Anyway, I decided to see what would
>happen write-wise with an SLOG SSD. I took a Samsung 840 Pro that had
>been used for L2ARC and made it a log device. I ran CrystalDiskMark
>before and after. Prior to the SLOG I was getting about 90 MB/s (gigabit
>Ethernet), which is pretty good. Afterward it went down to 8 MB/s! I
>pulled the SSD, plugged it into my Windows 7 workstation, formatted it
>and deleted the partition, which should have TRIM'ed it. I reinserted it
>as SLOG and re-ran the test: 50 MB/s. Still not great, but this is after
>all an MLC device, not SLC, and that's probably 'good enough'. Looking at
>open-zfs.org, it looks like out of illumos, FreeBSD and ZoL, only FreeBSD
>has TRIM now. I don't want to have to re-TRIM the thing every few weeks
>(or however long it takes). Does over-provisioning help?

My couple of cents:
1) The L2ARC and ZIL use cases are somewhat special since they write data as a ring buffer. Logical LBAs with neighbouring addresses are unusually likely (for an SSD) to land in the same hardware pages, both when first written and when rewritten. So there should be relatively little fragmentation (and little if any copy-on-write data relocation by the firmware to free up pages for reprogramming). Over-provisioning can help here because there are spare pages available: pages no longer actively used would be reserved by the firmware for logical sectors known to be zero or unused, and it should be quick about erasing them.
On the other hand, the L2ARC is likely to use all of whatever range of storage you give it, and to use it actively. Unlike an over-sized ZIL, there is nothing ZFS could TRIM/UNMAP in advance. It might still help to mark data ranges with TRIM before overwriting them, just so the SSD knows it can and should recycle the pages involved.
Writes to both are sequential (and possibly in large chunks), while reads of the L2ARC are randomly sized and located, and reads of the ZIL are sequential and rare enough not to count as a performance factor =)
2) Can you rerun your tests with a manually over-provisioned SSD (with empty space reserved by partitioning/slicing) and see if the results vary? The question is probably more about a change in tendencies than about absolute numbers, if the latter degrade even after a hardware TRIM. A rough command sketch follows below.
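
As a rough illustration only (pool and device names below are invented, and the actual re-slicing is done interactively in format(1M)), something along these lines should let you leave part of the SSD unused as spare area:

  # remove the SSD as a whole-disk log device (example names)
  zpool remove tank c5t0d0
  # in format(1M), interactively shrink slice 0 to, say, half the SSD,
  # leaving the rest of the flash untouched for the firmware to recycle
  format c5t0d0
  # re-add only that slice as the log device and re-run the benchmark
  zpool add tank log c5t0d0s0
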
3) For failovers like those, did you consider a mirror over iSCSI devices (possibly zvols, or even raw disks to avoid some of the ZFS-in-ZFS lockups discussed recently) exported from both boxes? That way, writes into the application pool that stores your VMs would land on both boxes, with the current head node distributing them to its neighbour, probably at some cost in latency - though dedicated direct networking between the boxes would lower the impact on performance. Failover would then rely on a really up-to-date version of the pool, possibly including a mirrored ZIL with pieces from both boxes. I wonder whether you might (or, performance-wise, should) share and re-attach the L2ARCs the same way upon failover? A rough sketch of the setup follows after the next paragraph.
I think this was discussed while Saso was developing and publishing his solution, and perhaps discarded for some reason, so search the zfs-discuss archives (probably) from 1-2 years back for more hints. Or perhaps he has developed some new insights and opinions in the meantime ;)
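
Very roughly, and only as a sketch of the iSCSI-mirror idea from (3) - all pool, volume and address names are invented, and I am quoting the COMSTAR commands from memory, so check the man pages:

  ## On each storage box: carve out a zvol and export it over iSCSI
  zfs create -V 500g localpool/vmstore-lun
  svcadm enable -r svc:/system/stmf:default
  svcadm enable -r svc:/network/iscsi/target:default
  sbdadm create-lu /dev/zvol/rdsk/localpool/vmstore-lun
  stmfadm add-view <GUID printed by sbdadm>
  itadm create-target

  ## On the current head node: discover both boxes and build the
  ## application pool as a mirror across them
  iscsiadm add discovery-address 192.168.10.1:3260 192.168.10.2:3260
  iscsiadm modify discovery --sendtargets enable
  zpool create vmpool mirror <LUN device from box 1> <LUN device from box 2>
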

HTH,
//Jim Klimov 
--
Typos courtesy of K-9 Mail on my Samsung Android

