From: John Weekes
Subject: Re: OOM problems
Date: Tue, 16 Nov 2010 11:54:11 -0800
Message-ID: <4CE2E163.2090809@nuclearfallout.net>
In-Reply-To: <4CE1751F.9020202@nuclearfallout.net>
References: <4CDE44E2.2060807@nuclearfallout.net>
 <4FA716B1526C7C4DB0375C6DADBC4EA38D80702C25@LONPMAILBOX01.citrite.net>
 <4CDE4C08.70309@nuclearfallout.net>
 <4FA716B1526C7C4DB0375C6DADBC4EA38D80702C2E@LONPMAILBOX01.citrite.net>
 <4CE1037402000078000222F0@vpn.id2.novell.com>
 <1289814037.21694.22.camel@ramone>
 <4CE1751F.9020202@nuclearfallout.net>
To: Daniel Stodden
Cc: Ian Pratt, "xen-devel@lists.xensource.com", Jan Beulich
List-Id: xen-devel@lists.xenproject.org

Performance is noticeably lower with aio on these bursty write
workloads; I've been getting a number of complaints.

I see that 2.6.36 has some page_writeback changes:

http://www.kernel.org/diff/diffview.cgi?file=%2Fpub%2Flinux%2Fkernel%2Fv2.6%2Fpatch-2.6.36.bz2;z=8379

Any thoughts on whether these would make a difference for the problems
with "file:"? I'm still trying to find a way to reproduce the issue in
the lab, so I'd have to test the patch in production -- that's not a
tantalizing prospect, unless there is a real chance that it will affect
it.

-John

On 11/15/2010 9:59 AM, John Weekes wrote:
>
>> They are throttled, but the single control I'm aware of
>> is /proc/sys/vm/dirty_ratio (or dirty_bytes, nowadays). Which is only
>> per process, not a global limit. Could well be that's part of the
>> problem -- outwitting mm with just too many writers on too many cores?
>>
>> We had a bit of trouble when switching dom0 to 2.6.32, buffered writes
>> made it much easier than with e.g. 2.6.27 to drive everybody else into
>> costly reclaims.
>>
>> The Oom shown here reports about ~650M in dirty pages. The fact alone
>> that this counts as an oom condition doesn't sound quite right in
>> itself. That qemu might just have dared to ask at the wrong point in
>> time.
>>
>> Just to get an idea -- how many guests did this box carry?
>
> It carries about two dozen guests, with a mix of mostly HVMs (all
> stubdom-based, some with PV-on-HVM drivers) and some PV.
>
> This problem occurred more often for me under 2.6.32 than 2.6.31, I
> noticed. Since I made the switch to aio, I haven't seen a crash, but
> it hasn't been long enough for that to mean much.
>
> Having extra caching in the dom0 is nice because it allows for domUs
> to get away with having small amounts of free memory, while still
> having very good (much faster than hardware) write performance. If you
> have a large number of domUs that are all memory-constrained and use
> the disk in infrequent, large bursts, this can work out pretty well,
> since the big communal pool provides a better value proposition than
> giving each domU a few more megabytes of RAM.
>
> If the OOM problem isn't something that can be fixed, it might be a
> good idea to print out a warning to the user when a domain using
> "file:" is started. Or, to go a step further and automatically run
> "file" based domains as though "aio" was specified, possibly with a
> warning and a way to override that behavior. It's not really intuitive
> that "file" would cause crashes.
>
> -John
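
To make the suggestion at the end of the quoted message concrete, here is a rough sketch of what "warn on file:, or run it as aio with a way to override" could look like. It is not taken from the Xen toolstack: normalize_disk_spec and its allow_buffered flag are hypothetical names, and the "tap:aio:" prefix is assumed to be the blktap form that the "aio" setting above refers to.

```python
import warnings

FILE_PREFIX = "file:"
# Assumption: "tap:aio:" is the prefix that selects the aio backend in an
# xm-style disk specification. Adjust it if your toolstack differs.
AIO_PREFIX = "tap:aio:"


def normalize_disk_spec(spec, allow_buffered=False):
    """Return the disk spec to use for a domain (hypothetical helper).

    Specs that do not start with "file:" are returned unchanged.
    A "file:" spec is rewritten to the aio prefix with a warning, unless
    allow_buffered is True, in which case it is kept but still warned about.
    """
    if not spec.startswith(FILE_PREFIX):
        return spec

    if allow_buffered:
        # The user explicitly asked for dom0-buffered writes; only warn.
        warnings.warn(
            '"file:" disks are buffered in dom0 and may contribute to '
            "dom0 OOM conditions under bursty write load: " + spec
        )
        return spec

    rewritten = AIO_PREFIX + spec[len(FILE_PREFIX):]
    warnings.warn(
        '"file:" disk treated as aio (pass allow_buffered=True to '
        "override): " + spec + " -> " + rewritten
    )
    return rewritten
```

Usage would be a one-line call per entry in the domain's disk list before the domain is started. The override exists because, as noted above, the dom0 cache is a real performance benefit for memory-constrained domUs with bursty writes, so some users will want to keep it.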