<div dir="ltr">Be sure you have the following fix. Without it, I recall seeing spins in the ZPL similar to that stack trace. With only one CPU, a spinning kernel thread can make it very hard for other threads to run.<div> </div><div> commit e722410c49fe67cbf0f639cbcc288bd6cbcf7dd1<br>Author: Matthew Ahrens &lt;<a href="mailto:mahrens@delphix.com">mahrens@delphix.com</a>&gt;<br>Date: Tue Nov 26 13:47:33 2013 -0500<br><br>4347 ZPL can use dmu_tx_assign(TXG_WAIT)<br>Reviewed by: George Wilson &lt;<a href="mailto:george.wilson@delphix.com">george.wilson@delphix.com</a>&gt;<br>Reviewed by: Adam Leventhal &lt;<a href="mailto:ahl@delphix.com">ahl@delphix.com</a>&gt;<br>Reviewed by: Dan McDonald &lt;<a href="mailto:danmcd@nexenta.com">danmcd@nexenta.com</a>&gt;<br>Reviewed by: Boris Protopopov &lt;<a href="mailto:boris.protopopov@nexenta.com">boris.protopopov@nexenta.com</a>&gt;<br>Approved by: Dan McDonald &lt;<a href="mailto:danmcd@nexenta.com">danmcd@nexenta.com</a>&gt;</div></div><div class="gmail_extra"> <div class="gmail_quote">On Thu, Dec 5, 2013 at 8:14 PM, Saso Kiselkov &lt;<a href="mailto:skiselkov.ml@gmail.com" target="_blank">skiselkov.ml@gmail.com</a>&gt; wrote: <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I'm investigating a bizarre hang which I noticed by accident on the latest stable OmniOS release. When I'm running in VMware Fusion on a 1-CPU VM and doing any significant write I/O to the pool (e.g. just dd'ing something around is enough to trigger this), the VM will, with 100% certainty, hang. Console input works, but all userspace programs are stopped and nothing responds (e.g. attempting to telnet to sshd over the network establishes the socket, but then sshd doesn't print the version string).<br><br>
Using some DTrace-fu and kmdb, I was able to trace it (roughly; the exact stack trace changes between hangs, which is mighty weird in itself):<br><br>atomic_dec_32_nv+8()<br>dbuf_read+0x179(ffffff00d2393600, ffffff00c72f98f0, a)<br>dmu_tx_check_ioerr+0x76(ffffff00c72f98f0, ffffff00d2279cf0, 0, 1e0)<br>dmu_tx_count_write+0x395(ffffff00ce0536e0, 3c04000, 4000)<br>dmu_tx_hold_write+0x5a(ffffff00d1a55300, 4009, 3c04000, 4000)<br>zfs_write+0x3e3(ffffff00d09ef540, ffffff00028e7e60, 0, ffffff00cd511748, 0)<br>fop_write+0x5b(ffffff00d09ef540, ffffff00028e7e60, 0, ffffff00cd511748, 0)<br>write+0x250(1, 440660, 4000)<br>sys_syscall+0x17a()<br><br>(usually the trace is identical up to dmu_tx_hold_write)<br><br>I can definitely confirm that this doesn't happen on OmniOS r151006, and it doesn't happen on my vanilla kernels either. My suspicion is that something got botched in the "OMNIOS#72 Integrate Joyent updated zone write throttle" commit, but I can't put my finger on it. Can somebody please confirm this?<br><br>Cheers,<br>--<br>Saso </blockquote></div> </div>