From mboxrd@z Thu Jan 1 00:00:00 1970
From: John Weekes
Subject: Re: OOM problems
Date: Thu, 18 Nov 2010 23:27:10 -0800
Message-ID: <4CE626CE.7050408@nuclearfallout.net>
References: <4CDE44E2.2060807@nuclearfallout.net>
 <4FA716B1526C7C4DB0375C6DADBC4EA38D80702C25@LONPMAILBOX01.citrite.net>
 <4CDE4C08.70309@nuclearfallout.net>
 <4FA716B1526C7C4DB0375C6DADBC4EA38D80702C2E@LONPMAILBOX01.citrite.net>
 <4CE1037402000078000222F0@vpn.id2.novell.com>
 <1289814037.21694.22.camel@ramone>
 <4CE1751F.9020202@nuclearfallout.net>
 <4CE2E163.2090809@nuclearfallout.net>
 <4FA716B1526C7C4DB0375C6DADBC4EA38D80702E0E@LONPMAILBOX01.citrite.net>
 <4CE450E7.9010508@nuclearfallout.net>
 <1290043433.11102.1742.camel@agari.van.xensource.com>
 <4CE49D98.2030402@nuclearfallout.net>
 <1290053337.18200.28.camel@agari.van.xensource.com>
 <4CE4D285.5060500@nuclearfallout.net>
 <1290076883.6481.178.camel@ramone>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
In-Reply-To: <1290076883.6481.178.camel@ramone>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Daniel Stodden
Cc: Ian Pratt, "xen-devel@lists.xensource.com", Jan Beulich
List-Id: xen-devel@lists.xenproject.org

Daniel, thank you for the help and in-depth information, as well as the
test code off-list. The corruption problem with blktap2 O_DIRECT is
easily reproducible for me on multiple machines, so I hope that we'll be
able to nail this one down pretty quickly.

To follow up on my question about the potential performance difference
between blktap2 without O_DIRECT and loop (both of which use the page
cache), I ran some tests inside a sparse file-backed domU, timing the
copy of a folder containing 7419 files and folders of mixed sizes
totalling 1.6 GB. Loop returned this:

real 1m18.257s
user 0m0.050s
sys 0m6.550s

While tapdisk2 aio w/o O_DIRECT clocked in at:

real 0m55.373s
user 0m0.050s
sys 0m6.690s

With each, I saw a few more seconds of disk activity on dom0, since
dirty_ratio was set to 2. I ran the tests several times and dropped
caches on dom0 between each one; all of the results were within a second
or two of each other.

This represents a significant ~41% performance bump for that particular
workload (78.3s vs. 55.4s of wall-clock time, i.e. roughly 29% less time
for the same copy). In light of this, I would recommend to anyone who is
using "file:" that they try out tapdisk2 aio with a modified block-aio.c
to remove O_DIRECT, and see how it goes. If you find results similar to
mine, it might be worth modifying this into another blktap2 driver.

-John

On 11/18/2010 2:41 AM, Daniel Stodden wrote:
> On Thu, 2010-11-18 at 02:15 -0500, John Weekes wrote:
>>> I think [XCP blktap] should work fine, or wouldn't ask. If not, lemme know.
>> k.
>>
>>>> In my last bit of troubleshooting, I took O_DIRECT out of the open call
>>>> in tools/blktap2/drivers/block-aio.c, and preliminary testing indicates
>>>> that this might have eliminated the problem with corruption. I'm testing
>>>> further now, but could there be an issue with alignment (since the
>>>> kernel is apparently very strict about it with direct I/O)?
>>> Nope. It is, but they're 4k-aligned all over the place. You'd see syslog
>>> yelling quite miserably in cases like that. Keeping an eye on syslog
>>> (the daemon and kern facilities) is a generally good idea btw.
>> I've been doing that and haven't seen any unusual output so far, which I
>> guess is good.
>>
>>>> (Removing this flag also brings the page cache back into use, of
>>>> course.)
>>> I/O-wise it's not much different from the file:-path. Meaning it should
>>> have carried you directly back into the Oom realm.
>> Does it make a difference that it's not using "loop" and instead the CPU
>> usage (and presumably some blocking) occurs in user-space?
> It's certainly a different path taken. I just meant to say file access
> has about the same properties, so you're likely back to the original
> issue.
>
>> There's not too much information on this out there, but it seems as
>> though the OOM issue might be at least somewhat loop device-specific.
>> One document that references loop OOM problems that I found is this one:
>> http://sources.redhat.com/lvm2/wiki/DMLoop.
>> My initial take on it was that it might be saying that it mattered when
>> these things were being done in the kernel, but now I'm not so certain --
>>
>> ".. [their method and loop] submit[s] [I/O requests] via a kernel thread
>> to the VFS layer using traditional I/O calls (read, write etc.). This
>> has the advantage that it should work with any file system type
>> supported by the Linux VFS (including networked file systems), but has
>> some drawbacks that may affect performance and scalability. This is
>> because it is hard to predict what a file system may attempt to do when
>> an I/O request is submitted; for example, it may need to allocate memory
>> to handle the request and the loopback driver has no control over this.
>> Particularly under low-memory or intensive I/O scenarios this can lead
>> to out of memory (OOM) problems or deadlocks as the kernel tries to make
>> memory available to the VFS layer while satisfying a request from the
>> block layer. "
>>
>> Would there be an advantage to using blktap/blktap2 over loop, if I
>> leave off O_DIRECT? Would it be faster, or anything like that?
> No, it's essentially the same thing.
> Both blktap and loopdevs sit on the
> vfs in a similar fashion, without O_DIRECT even more so. The deadlocking
> and OOM hazards are also the same, btw.
>
> Deadlocks are a fairly general problem whenever you layer two subsystems
> depending on the same resource on top of each other. Both in the blktap
> and loopback case the system has several opportunities to hang itself,
> because there's even more stuff stacked than normal. The layers are, top
> to bottom:
>
> (1) potential caching of {tap/loop}dev writes (Xen doesn't do that)
> (2) The block device, which needs some minimum amount of memory to run
>     its request queue
> (3) Cached writes on the file layer
> (4) The filesystem needs memory to launder those pages
> (5) The disk's block device, equivalent to 2.
> (6) The driver running the data transfers.
>
> The shared resource is memory. Now consider what happens when upper
> layers in combination grab everything the lower layers need to make
> progress. The upper layers can't roll back, so they won't release their
> memory before that happens. So we're stuck.
>
> It shouldn't happen, the kernel has a bunch of mechanisms to prevent
> that. It obviously doesn't quite work here.
>
> That's why I'm suggesting that the most obvious fix for your case is to
> limit the cache dirtying rate.
>
>>> Just reducing the cpu count alone sounds like sth worth trying even on a
>>> production box, if the current state of things already tends to take the
>>> system down. Also, the dirty_ratio sysctl should be pretty safe to tweak
>>> at runtime.
>> That's good to hear.
>>
>>>> The default for dirty_ratio is 20. I tried halving that to 10, but it
>>>> didn't help.
>>> Still too much. That's meant to be %/task. Try 2, with 1.5G that's still
>>> a decent 30M write cache and should block all out of 24 disks after some
>>> 700M, worst case. Or so I think...
>> Ah, ok. I was thinking that it was global.
>> With a small per-process
>> cache like that, it becomes much closer to AIO for writes, but at least
>> the leftover memory could still be used for the read cache.
> I agree it doesn't do what you want. I have no idea why there's no
> global limit, seriously.
>
> Note that in theory, 24*2% would still approach the oom state you were
> in with the log you sent. I think it's going to be less likely though.
> With all guests going mad at the same time, it may still not be low
> enough. In case that happens, you could resort to pumping even more
> memory into dom0.
>
> Daniel
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
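
For anyone who wants to check the arithmetic behind the numbers in this
thread, here is a minimal Python sketch. The function names are
hypothetical and only for illustration. dirty_cache_mb follows the
reading given in the quoted text (dirty_ratio as a per-writer percentage
of dom0 memory, with 1.5G taken as 1500M); that is an assumption, not a
description of how the kernel actually computes its thresholds.

```python
def parse_time(stamp: str) -> float:
    """Convert a `time` stamp such as '1m18.257s' into seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def speedup_percent(slow: str, fast: str) -> float:
    """Throughput gain of the faster run over the slower one, in percent."""
    return (parse_time(slow) / parse_time(fast) - 1.0) * 100.0


def dirty_cache_mb(total_mb: float, dirty_ratio: int, writers: int = 1) -> float:
    """Dirty-page budget in MB, ASSUMING dirty_ratio is a per-writer
    percentage of total memory (the reading used in the thread)."""
    return total_mb * dirty_ratio / 100.0 * writers


if __name__ == "__main__":
    # loop vs. tapdisk2 aio without O_DIRECT, "real" times from the tests
    print(f"speedup: {speedup_percent('1m18.257s', '0m55.373s'):.1f}%")
    # dirty_ratio=2 on a dom0 with 1.5G, for one writer and for 24 disks
    print(f"per writer: {dirty_cache_mb(1500, 2):.0f}M")
    print(f"24 writers: {dirty_cache_mb(1500, 2, writers=24):.0f}M")
```

The first figure matches the ~41% quoted for the copy test, and the last
two correspond to the "30M write cache" and "some 700M" worst case
mentioned for dirty_ratio=2.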