From mboxrd@z Thu Jan  1 00:00:00 1970
From: John Weekes <lists.xen@nuclearfallout.net>
Subject: Re: OOM problems
Date: Thu, 18 Nov 2010 23:27:10 -0800
Message-ID: <4CE626CE.7050408@nuclearfallout.net>
References: <4CDE44E2.2060807@nuclearfallout.net>	<4FA716B1526C7C4DB0375C6DADBC4EA38D80702C25@LONPMAILBOX01.citrite.net>	<4CDE4C08.70309@nuclearfallout.net>	<4FA716B1526C7C4DB0375C6DADBC4EA38D80702C2E@LONPMAILBOX01.citrite.net>	<4CE1037402000078000222F0@vpn.id2.novell.com>	<1289814037.21694.22.camel@ramone>	<4CE1751F.9020202@nuclearfallout.net>	<4CE2E163.2090809@nuclearfallout.net>	<4FA716B1526C7C4DB0375C6DADBC4EA38D80702E0E@LONPMAILBOX01.citrite.net>	<4CE450E7.9010508@nuclearfallout.net>	<1290043433.11102.1742.camel@agari.van.xensource.com>	<4CE49D98.2030402@nuclearfallout.net>	<1290053337.18200.28.camel@agari.van.xensource.com>	<4CE4D285.5060500@nuclearfallout.net>
	<1290076883.6481.178.camel@ramone>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <1290076883.6481.178.camel@ramone>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Daniel Stodden <daniel.stodden@citrix.com>
Cc: Ian Pratt <Ian.Pratt@eu.citrix.com>, "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>, Jan Beulich <JBeulich@novell.com>
List-Id: xen-devel@lists.xenproject.org

Daniel, thank you for the help and in-depth information, as well as the 
test code off-list. The corruption problem with blktap2 O_DIRECT is 
easily reproducible for me on multiple machines, so I hope that we'll be 
able to nail this one down pretty quickly.

To follow up on my question about the potential performance difference 
between blktap2 without O_DIRECT and loop (both of which use the page 
cache), I ran some tests inside a sparse file-backed domU, timing the 
copy of a folder containing 7419 files and folders of mixed sizes 
totalling 1.6 GB. With loop, the result was:

real    1m18.257s
user    0m0.050s
sys     0m6.550s

While tapdisk2 aio w/o O_DIRECT clocked in at:

real    0m55.373s
user    0m0.050s
sys     0m6.690s

With each, I saw a few more seconds of disk activity on dom0, since 
dirty_ratio was set to 2. I ran the tests several times and dropped 
caches on dom0 between each one; all of the results were within a second 
or two of each other.
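
As a rough sanity check on that dirty_ratio setting, the arithmetic from 
Daniel's message quoted below (2% of about 1.5G, 24 disks) can be 
sketched as follows. This takes his per-task reading of dirty_ratio at 
face value; the function name and the 1536 MiB figure are assumptions 
for illustration only, not measurements from this box.

```python
def dirty_cache_estimate(mem_mib: float, dirty_ratio_pct: float, writers: int):
    """Return (per-writer dirty cache, worst-case total) in MiB.

    Assumes dirty_ratio applies per task, as suggested in the quoted
    message; this is an estimate, not what the kernel guarantees.
    """
    per_writer = mem_mib * dirty_ratio_pct / 100.0
    return per_writer, per_writer * writers


# Hypothetical inputs: 1.5G of dom0 memory, dirty_ratio=2, 24 disks.
per_writer, worst_case = dirty_cache_estimate(1536, 2, 24)
print(f"per writer: {per_writer:.1f} MiB, worst case: {worst_case:.0f} MiB")
```

That lands near the "30M" and "some 700M" figures in the quoted text.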

That is roughly a 41% throughput improvement for this particular 
workload (78.3s vs. 55.4s, or about 29% less wall-clock time). In light 
of this, I would recommend that anyone using "file:" try tapdisk2 aio 
with block-aio.c modified to remove O_DIRECT, and see how it goes. If 
you find results similar to mine, it might be worth turning this into a 
separate blktap2 driver.
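
For anyone comparing their own numbers, here is a minimal sketch of how 
the percentage can be derived from the two "real" times above. The 
function names are hypothetical; only the timings come from my tests.

```python
def speedup_percent(baseline_s: float, candidate_s: float) -> float:
    """Throughput gain of the candidate over the baseline, in percent."""
    return (baseline_s / candidate_s - 1.0) * 100.0


def time_saved_percent(baseline_s: float, candidate_s: float) -> float:
    """Reduction in wall-clock time relative to the baseline, in percent."""
    return (1.0 - candidate_s / baseline_s) * 100.0


loop_s = 1 * 60 + 18.257      # "real" time with loop
tapdisk2_s = 0 * 60 + 55.373  # "real" time with tapdisk2 aio w/o O_DIRECT

print(f"throughput gain: {speedup_percent(loop_s, tapdisk2_s):.1f}%")
print(f"time saved:      {time_saved_percent(loop_s, tapdisk2_s):.1f}%")
```

The two figures differ because one is relative to the faster run and the 
other to the slower one, so it helps to say which one is being quoted.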

-John

On 11/18/2010 2:41 AM, Daniel Stodden wrote:
> On Thu, 2010-11-18 at 02:15 -0500, John Weekes wrote:
>>> I think [XCP blktap] should work fine, or wouldn't ask. If not, lemme know.
>> k.
>>
>>>> In my last bit of troubleshooting, I took O_DIRECT out of the open call
>>>> in tools/blktap2/drivers/block-aio.c, and preliminary testing indicates
>>>> that this might have eliminated the problem with corruption. I'm testing
>>>> further now, but could there be an issue with alignment (since the
>>>> kernel is apparently very strict about it with direct I/O)?
>>> Nope. It is, but they're 4k-aligned all over the place. You'd see syslog
>>> yelling quite miserably in cases like that. Keeping an eye on syslog
>>> (the daemon and kern facilities) is a generally good idea btw.
>> I've been doing that and haven't seen any unusual output so far, which I
>> guess is good.
>>
>>>> (Removing
>>>> this flag also brings back in use of the page cache, of course.)
>>> I/O-wise it's not much different from the file:-path. Meaning it should
>>> have carried you directly back into the Oom realm.
>> Does it make a difference that it's not using "loop" and instead the CPU
>> usage (and presumably some blocking) occurs in user-space?
> It's certainly a different path taken. I just meant to say file access
> has about the same properties, so you're likely back to the original
> issue.
>
>> There's not too much information on this out there, but it seems as
>> though the OOM issue might be at least somewhat loop device-specific.
>> One document that references loop OOM problems that I found is this one:
>> http://sources.redhat.com/lvm2/wiki/DMLoop.
>> My initial take on it was that it might be saying that it mattered
>> when these things were being done in the kernel, but now I'm not so
>> certain --
>>
>> ".. [their method and loop] submit[s] [I/O requests] via a kernel thread
>> to the VFS layer using traditional I/O calls (read, write etc.). This
>> has the advantage that it should work with any file system type
>> supported by the Linux VFS (including networked file systems), but has
>> some drawbacks that may affect performance and scalability. This is
>> because it is hard to predict what a file system may attempt to do when
>> an I/O request is submitted; for example, it may need to allocate memory
>> to handle the request and the loopback driver has no control over this.
>> Particularly under low-memory or intensive I/O scenarios this can lead
>> to out of memory (OOM) problems or deadlocks as the kernel tries to make
>> memory available to the VFS layer while satisfying a request from the
>> block layer. "
>>
>> Would there be an advantage to using blktap/blktap2 over loop, if I
>> leave off O_DIRECT? Would it be faster, or anything like that?
> No, it's essentially the same thing. Both blktap and loopdevs sit on the
> vfs in a similar fashion, without O_DIRECT even more so. The deadlocking
> and OOM hazards are also the same, btw.
>
> Deadlocks are a fairly general problem whenever you layer two subsystems
> depending on the same resource on top of each other. Both in the blktap
> and loopback case the system has several opportunities to hang itself,
> because there's even more stuff stacked than normal. The layers are, top
> to bottom
>
>   (1) potential caching of {tap/loop}dev writes (Xen doesn't do that)
>   (2) The block device, which needs some minimum amount of memory to run
>       its request queue
>   (3) Cached writes on the file layer
>   (4) The filesystem needs memory to launder those pages
>   (5) The disk's block device, equivalent to 2.
>   (6) The driver running the data transfers.
>
> The shared resource is memory. Now consider what happens when the upper
> layers in combination grab everything the lower layers need to make
> progress. The upper layers can't roll back, so they won't release their
> memory until the lower layers have made progress. So we're stuck.
>
> It shouldn't happen, the kernel has a bunch of mechanisms to prevent
> that. It obviously doesn't quite work here.
>
> That's why I'm suggesting that the most obvious fix for your case is to
> limit the cache dirtying rate.
>
>>> Just reducing the cpu count alone sounds like sth worth trying even on a
>>> production box, if the current state of things already tends to take the
>>> system down. Also, the dirty_ratio sysctl should be pretty safe to tweak
>>> at runtime.
>> That's good to hear.
>>
>>>> The default for dirty_ratio is 20. I tried halving that to 10, but it
>>>> didn't help.
>>> Still too much. That's meant to be %/task. Try 2, with 1.5G that's still
>>> a decent 30M write cache and should block all out of 24 disks after some
>>> 700M, worst case. Or so I think...
>> Ah, ok. I was thinking that it was global. With a small per-process
>> cache like that, it becomes much closer to AIO for writes, but at least
>> the leftover memory could still be used for the read cache.
> I agree it doesn't do what you want. I have no idea why there's no
> global limit, seriously.
>
> Note that in theory, 24*2% would still approach the oom state you were
> in with the log you sent. I think it's going to be less likely though.
> With all guests going mad at the same time, it may still not be low
> enough. In case that happens, you could resort to pumping even more
> memory into dom0.
>
> Daniel