From mboxrd@z Thu Jan 1 00:00:00 1970
From: James Dingwall
Subject: Re: Kernel 3.11 / 3.12 OOM killer and Xen ballooning
Date: Wed, 15 Jan 2014 16:35:02 +0000
Message-ID: <52D6B8B6.5070302@zynstra.com>
References: <52A602E5.3080300@zynstra.com>
 <20131209214816.GA3000@phenom.dumpdata.com>
 <52A72AB8.9060707@zynstra.com>
 <20131210152746.GF3184@phenom.dumpdata.com>
 <52A812B0.6060607@oracle.com>
 <52A89334.3090007@zynstra.com>
 <52B18F44.2030500@oracle.com>
 <52B3443F.5060704@zynstra.com>
 <52B3B6D7.50606@oracle.com>
 <52BBEBEF.8040509@zynstra.com>
 <52C50661.7060900@oracle.com>
 <52CBC700.1060602@zynstra.com>
 <52CE7E67.5080708@oracle.com>
 <52D64B87.6000400@zynstra.com>
 <52D69E0B.5020006@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <52D69E0B.5020006@oracle.com>
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Bob Liu
Cc: xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

Bob Liu wrote:
> On 01/15/2014 04:49 PM, James Dingwall wrote:
>> Bob Liu wrote:
>>> On 01/07/2014 05:21 PM, James Dingwall wrote:
>>>> Bob Liu wrote:
>>>>> Could you confirm that this problem doesn't exist if loading tmem with
>>>>> selfshrinking=0 during compile gcc? It seems that you are compiling
>>>>> difference packages during your testing.
>>>>> This will help to figure out whether selfshrinking is the root cause.
>>>> Got an oom with selfshrinking=0, again during a gcc compile.
>>>> Unfortunately I don't have a single test case which demonstrates the
>>>> problem but as I mentioned before it will generally show up under
>>>> compiles of large packages such as glibc, kdelibs, gcc etc.
>>>>
>>> So the root cause is not because enabled selfshrinking.
>>> Then what I can think of is that the xen-selfballoon driver was too
>>> aggressive, too many pages were ballooned out which caused heavy memory
>>> pressure to guest OS.
>>> And kswapd started to reclaim page until most of pages were
>>> unreclaimable(all_unreclaimable=yes for all zones), then OOM Killer was
>>> triggered.
>>> In theory the balloon driver should give back ballooned out pages to
>>> guest OS, but I'm afraid this procedure is not fast enough.
>>>
>>> My suggestion is reserve a min memory for your guest OS so that the
>>> xen-selfballoon won't be so aggressive.
>>> You can do it through parameters selfballoon_reserved_mb or
>>> selfballoon_min_usable_mb.
>> I am still getting OOM errors with both of these set to 32 so I'll try
>> another bump to 64. I think that if I do find values which prevent it
>> though then it is more of a work around than a fix because it still
>> suggests that swap is not being used when ballooning is no longer
> Yes, it's like a work around. But I don't think there is a better way.
>
> From the recent OOM log your reported:
> [ 8212.940769] Free swap = 1925576kB
> [ 8212.940770] Total swap = 2097148kB
>
> [504638.442136] Free swap = 1868108kB
> [504638.442137] Total swap = 2097148kB
>
> 171572KB and 229040KB data are swapped out to swap disk, I think there
> are already significantly values for guest OS has only ~300M usable memory.
> guest OS can't find out pages suitable for swap any more after so many
> pages are swapped out, although at that time the swap device still have
> enough space.
>
> The OOM may not be triggered if the balloon driver can give back memory
> to guest OS fast enough but I think it's unrealistic.
> So the best way is reserve more memory to guest OS.
>
>> capable of satisfying the request.
>> I've also got an Ubuntu Saucy (3.11
>> kernel) guest running on the dom0 with tmem activated so I'm going to
>> see if I can find a comparable workload to see if I get the same issue
>> with a different kernel configuration.
>>

I've done a bit more testing and seem to have found an extra condition
which is affecting the OOM behaviour in my guests. All my Gentoo guests
have swap space backed by a dm-crypt volume and if I remove this layer
then things seem to be behaving much more reliably. In my Ubuntu guests
I have plain swap space and so far I haven't been able to trigger an OOM
condition. Is it possible that it is the dm-crypt layer failing to get
working memory when swapping something in/out and causing the error?

James