From mboxrd@z Thu Jan 1 00:00:00 1970
From: James Dingwall
Subject: Re: Kernel 3.11 / 3.12 OOM killer and Xen ballooning
Date: Tue, 7 Jan 2014 09:21:04 +0000
Message-ID: <52CBC700.1060602@zynstra.com>
References: <52A602E5.3080300@zynstra.com> <20131209214816.GA3000@phenom.dumpdata.com> <52A72AB8.9060707@zynstra.com> <20131210152746.GF3184@phenom.dumpdata.com> <52A812B0.6060607@oracle.com> <52A89334.3090007@zynstra.com> <52B18F44.2030500@oracle.com> <52B3443F.5060704@zynstra.com> <52B3B6D7.50606@oracle.com> <52BBEBEF.8040509@zynstra.com> <52C50661.7060900@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <52C50661.7060900@oracle.com>
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Bob Liu
Cc: xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

Bob Liu wrote:
> On 12/26/2013 04:42 PM, James Dingwall wrote:
>> Bob Liu wrote:
>>> On 12/20/2013 03:08 AM, James Dingwall wrote:
>>>> Bob Liu wrote:
>>>>> On 12/12/2013 12:30 AM, James Dingwall wrote:
>>>>>> Bob Liu wrote:
>>>>>>> On 12/10/2013 11:27 PM, Konrad Rzeszutek Wilk wrote:
>>>>>>>> On Tue, Dec 10, 2013 at 02:52:40PM +0000, James Dingwall wrote:
>>>>>>>>> Konrad Rzeszutek Wilk wrote:
>>>>>>>>>> On Mon, Dec 09, 2013 at 05:50:29PM +0000, James Dingwall wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Since 3.11 I have noticed that the OOM killer quite frequently
>>>>>>>>>>> triggers in my Xen guest domains which use ballooning to
>>>>>>>>>>> increase/decrease their memory allocation according to their
>>>>>>>>>>> requirements. One example domain I have has a maximum memory
>>>>>>>>>>> setting of ~1.5Gb but it usually idles at ~300Mb, it is also
>>>>>>>>>>> configured with 2Gb swap which is almost 100% free.
>>>>>>>>>>>
>>>>>>>>>>> # free
>>>>>>>>>>>              total       used       free     shared    buffers     cached
>>>>>>>>>>> Mem:        272080     248108      23972          0       1448      63064
>>>>>>>>>>> -/+ buffers/cache:     183596      88484
>>>>>>>>>>> Swap:      2097148          8    2097140
>>>>>>>>>>>
>>>>>>>>>>> There is plenty of available free memory in the hypervisor to
>>>>>>>>>>> balloon to the maximum size:
>>>>>>>>>>> # xl info | grep free_mem
>>>>>>>>>>> free_memory            : 14923
>>>>>>>>>>>
>>>>>>>>>>> An example trace (they are always the same) from the oom killer in
>>>>>>>>>>> 3.12 is added below. So far I have not been able to reproduce this
>>>>>>>>>>> at will so it is difficult to start bisecting it to see if a
>>>>>>>>>>> particular change introduced this. However it does seem that the
>>>>>>>>>>> behaviour is wrong because a) ballooning could give the guest more
>>>>>>>>>>> memory, b) there is lots of swap available which could be used as a
>>>>>>>>>>> fallback.
>>>>>>>> Keep in mind that swap with tmem is actually no more swap. Heh, that
>>>>>>>> sounds odd - but basically pages that are destined for swap end up
>>>>>>>> going in the tmem code which pipes them up to the hypervisor.
>>>>>>>>
>>>>>>>>>>> If other information could help or there are more tests that I could
>>>>>>>>>>> run then please let me know.
>>>>>>>>>> I presume you have enabled 'tmem' both in the hypervisor and in
>>>>>>>>>> the guest right?
>>>>>>>>> Yes, domU and dom0 both have the tmem module loaded and tmem
>>>>>>>>> tmem_dedup=on tmem_compress=on is given on the xen command line.
>>>>>>>> Excellent.
>>>>>>>> The odd thing is that your swap is not used that much, but
>>>>>>>> it should be (as that is part of what the self-balloon is supposed
>>>>>>>> to do).
>>>>>>>>
>>>>>>>> Bob, you had a patch for the logic of how self-balloon is supposed
>>>>>>>> to account for the slab - would this be relevant to this problem?
>>>>>>>>
>>>>>>> Perhaps, I have attached the patch.
>>>>>>> James, could you please apply it and try your application again? You
>>>>>>> have to rebuild the guest kernel.
>>>>>>> Oh, and also take a look at whether frontswap is in use, you can
>>>>>>> check it by watching "cat /sys/kernel/debug/frontswap/*".
>>>>>> I have tested this patch with a workload where I have previously seen
>>>>>> failures and so far so good. I'll try to keep a guest with it stressed
>>>>>> to see if I do get any problems. I don't know if it is expected but I
>>>>> By the way, besides the longer kswapd time, does this patch work well
>>>>> during your stress testing?
>>>>>
>>>>> Have you seen the OOM killer triggered quite frequently again? (with
>>>>> selfshrink=true)
>>>>>
>>>>> Thanks,
>>>>> -Bob
>>>> It was looking good until today (selfshrink=true). The trace below is
>>>> from during a compile of subversion; it looks like the memory has
>>>> ballooned to almost the maximum permissible but even under pressure the
>>>> swap disk has hardly come into use.
>>>>
>>> So if without selfshrink the swap disk can be used a lot?
>>>
>>> If that's the case, I'm afraid the frontswap-selfshrink in
>>> xen-selfballoon did something incorrect.
>>>
>>> Could you please try this patch, which makes the frontswap-selfshrink
>>> slower and adds a printk for debugging.
>>> Please still keep selfshrink=true in your test, with or without my
>>> previous patch.
>>> Thanks a lot!
>>>
>> The oom trace below was triggered during a compile of gcc. I have the
>> full dmesg from boot which shows all the printks, please let me know if
>> you would like to see that.
>>
> Sorry for the late response.
> Could you confirm that this problem doesn't exist when loading tmem with
> selfshrinking=0 during a gcc compile? It seems that you are compiling
> different packages during your testing.
> This will help to figure out whether selfshrinking is the root cause.

Got an oom with selfshrinking=0, again during a gcc compile. Unfortunately
I don't have a single test case which demonstrates the problem, but as I
mentioned before it will generally show up during compiles of large
packages such as glibc, kdelibs, gcc etc.

I don't know if this is a separate or related issue, but over the holidays
I also had a problem with six of the guests on my system where kswapd was
running at 100% and had clocked up >9000 minutes of cpu time even though
there was otherwise no load on them. Of the guests I restarted yesterday
in this state, two have already got into the same state again; they are
running a kernel with the first patch that you sent.
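For reference, the tmem state in one of the affected guests is below. It is
just the tmem module parameters, plus the frontswap counters Bob pointed at
earlier; a rough sketch of how to capture it (the loop is only a convenience
I use here, and the debugfs path assumes debugfs is mounted at
/sys/kernel/debug):

# Dump the Y/N flags exported by the tmem module:
for f in /sys/module/tmem/parameters/*; do
    echo "$f $(cat "$f")"
done

# Frontswap activity counters, if available:
grep . /sys/kernel/debug/frontswap/*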
/sys/module/tmem/parameters/cleancache      Y
/sys/module/tmem/parameters/frontswap       Y
/sys/module/tmem/parameters/selfballooning  Y
/sys/module/tmem/parameters/selfshrinking   N

James

[ 8212.940520] cc1plus invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
[ 8212.940529] CPU: 1 PID: 23678 Comm: cc1plus Tainted: G W 3.12.5 #88
[ 8212.940532]  ffff88001e38cdf8 ffff88000094f968 ffffffff8148f200 ffff88001f90e8e8
[ 8212.940536]  ffff88001e38c8c0 ffff88000094fa08 ffffffff8148ccf7 ffff88000094f9b8
[ 8212.940538]  ffffffff810f8d97 ffff88000094f998 ffffffff81006dc8 ffff88000094f9a8
[ 8212.940542] Call Trace:
[ 8212.940554]  [] dump_stack+0x46/0x58
[ 8212.940558]  [] dump_header.isra.9+0x6d/0x1cc
[ 8212.940564]  [] ? super_cache_count+0xa8/0xb8
[ 8212.940569]  [] ? xen_clocksource_read+0x20/0x22
[ 8212.940573]  [] ? xen_clocksource_get_cycles+0x9/0xb
[ 8212.940578]  [] ? _raw_spin_unlock_irqrestore+0x47/0x62
[ 8212.940583]  [] ? ___ratelimit+0xcb/0xe8
[ 8212.940588]  [] oom_kill_process+0x70/0x2fd
[ 8212.940592]  [] ? zone_reclaimable+0x11/0x1e
[ 8212.940597]  [] ? has_ns_capability_noaudit+0x12/0x19
[ 8212.940600]  [] ? has_capability_noaudit+0x12/0x14
[ 8212.940603]  [] out_of_memory+0x31b/0x34e
[ 8212.940608]  [] __alloc_pages_nodemask+0x65b/0x792
[ 8212.940612]  [] alloc_pages_vma+0xd0/0x10c
[ 8212.940617]  [] read_swap_cache_async+0x70/0x120
[ 8212.940620]  [] swapin_readahead+0x90/0xd4
[ 8212.940623]  [] ? pte_mfn_to_pfn+0x59/0xcb
[ 8212.940627]  [] handle_mm_fault+0x8a4/0xd54
[ 8212.940630]  [] ? xen_clocksource_read+0x20/0x22
[ 8212.940634]  [] ? sched_clock+0x9/0xd
[ 8212.940638]  [] ? sched_clock_local+0x12/0x75
[ 8212.940641]  [] ? arch_vtime_task_switch+0x81/0x86
[ 8212.940646]  [] __do_page_fault+0x3d8/0x437
[ 8212.940649]  [] ? xen_clocksource_read+0x20/0x22
[ 8212.940652]  [] ? sched_clock+0x9/0xd
[ 8212.940654]  [] ? sched_clock_local+0x12/0x75
[ 8212.940658]  [] ? __acct_update_integrals+0xb4/0xbf
[ 8212.940661]  [] ? acct_account_cputime+0x17/0x19
[ 8212.940663]  [] ? account_user_time+0x67/0x92
[ 8212.940666]  [] ? vtime_account_user+0x4d/0x52
[ 8212.940669]  [] do_page_fault+0x1a/0x5a
[ 8212.940674]  [] ? rcu_user_enter+0xe/0x10
[ 8212.940677]  [] page_fault+0x28/0x30
[ 8212.940679] Mem-Info:
[ 8212.940681] Node 0 DMA per-cpu:
[ 8212.940684] CPU    0: hi:    0, btch:   1 usd:   0
[ 8212.940685] CPU    1: hi:    0, btch:   1 usd:   0
[ 8212.940686] Node 0 DMA32 per-cpu:
[ 8212.940688] CPU    0: hi:  186, btch:  31 usd: 116
[ 8212.940690] CPU    1: hi:  186, btch:  31 usd: 124
[ 8212.940691] Node 0 Normal per-cpu:
[ 8212.940693] CPU    0: hi:    0, btch:   1 usd:   0
[ 8212.940694] CPU    1: hi:    0, btch:   1 usd:   0
[ 8212.940700] active_anon:105765 inactive_anon:105882 isolated_anon:0
 active_file:8412 inactive_file:8612 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 free:1143 slab_reclaimable:3575 slab_unreclaimable:3464
 mapped:3792 shmem:6 pagetables:2534 bounce:0
 free_cma:0 totalram:246132 balloontarget:306242
[ 8212.940702] Node 0 DMA free:1964kB min:88kB low:108kB high:132kB active_anon:5092kB inactive_anon:5328kB active_file:416kB inactive_file:608kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:15392kB mlocked:0kB dirty:0kB writeback:0kB mapped:320kB shmem:0kB slab_reclaimable:252kB slab_unreclaimable:492kB kernel_stack:120kB pagetables:252kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:26951 all_unreclaimable? yes
[ 8212.940711] lowmem_reserve[]: 0 469 469 469
[ 8212.940715] Node 0 DMA32 free:2608kB min:2728kB low:3408kB high:4092kB active_anon:181456kB inactive_anon:181528kB active_file:22296kB inactive_file:22644kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:507904kB managed:466364kB mlocked:0kB dirty:0kB writeback:0kB mapped:8628kB shmem:20kB slab_reclaimable:10756kB slab_unreclaimable:12548kB kernel_stack:1688kB pagetables:8876kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:612393 all_unreclaimable? yes
[ 8212.940722] lowmem_reserve[]: 0 0 0 0
[ 8212.940725] Node 0 Normal free:0kB min:0kB low:0kB high:0kB active_anon:236512kB inactive_anon:236672kB active_file:10936kB inactive_file:11196kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:524288kB managed:502772kB mlocked:0kB dirty:0kB writeback:0kB mapped:6220kB shmem:4kB slab_reclaimable:3292kB slab_unreclaimable:816kB kernel_stack:64kB pagetables:1008kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:745963 all_unreclaimable? yes
[ 8212.940732] lowmem_reserve[]: 0 0 0 0
[ 8212.940735] Node 0 DMA: 1*4kB (R) 0*8kB 4*16kB (R) 1*32kB (R) 1*64kB (R) 2*128kB (R) 0*256kB 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB = 1956kB
[ 8212.940747] Node 0 DMA32: 652*4kB (U) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2608kB
[ 8212.940756] Node 0 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 8212.940765] 16847 total pagecache pages
[ 8212.940766] 8381 pages in swap cache
[ 8212.940768] Swap cache stats: add 741397, delete 733016, find 250268/342284
[ 8212.940769] Free swap  = 1925576kB
[ 8212.940770] Total swap = 2097148kB
[ 8212.951044] 262143 pages RAM
[ 8212.951046] 11939 pages reserved
[ 8212.951047] 540820 pages shared
[ 8212.951048] 240248 pages non-shared
[ 8212.951050] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[ 8212.951310] Out of memory: Kill process 23721 (cc1plus) score 119 or sacrifice child
[ 8212.951313] Killed process 23721 (cc1plus) total-vm:530268kB, anon-rss:350980kB, file-rss:9408kB
[54810.683658] kjournald starting.  Commit interval 5 seconds
[54810.684381] EXT3-fs (xvda1): using internal journal
[54810.684402] EXT3-fs (xvda1): mounted filesystem with writeback data mode
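For completeness, since the original report noted there is plenty of free
memory in the hypervisor: the guest's balloon target can also be raised by
hand from dom0 while a guest is wedged like this. A minimal sketch with the
stock xl toolstack; "guestname" is a placeholder for the real domain name and
1536m is just an example roughly matching the domain's configured maximum:

# From dom0: check free hypervisor memory and the guest's current allocation
xl info | grep free_memory
xl list guestname

# Push the guest's balloon target up towards its maximum
xl mem-set guestname 1536m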