From mboxrd@z Thu Jan 1 00:00:00 1970
From: James Dingwall
Subject: Re: Kernel 3.11 / 3.12 OOM killer and Xen ballooning
Date: Tue, 7 Jan 2014 09:21:04 +0000
Message-ID: <52CBC700.1060602@zynstra.com>
References: <52A602E5.3080300@zynstra.com> <20131209214816.GA3000@phenom.dumpdata.com> <52A72AB8.9060707@zynstra.com> <20131210152746.GF3184@phenom.dumpdata.com> <52A812B0.6060607@oracle.com> <52A89334.3090007@zynstra.com> <52B18F44.2030500@oracle.com> <52B3443F.5060704@zynstra.com> <52B3B6D7.50606@oracle.com> <52BBEBEF.8040509@zynstra.com> <52C50661.7060900@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <52C50661.7060900@oracle.com>
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Bob Liu
Cc: xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

Bob Liu wrote:
> On 12/26/2013 04:42 PM, James Dingwall wrote:
>> Bob Liu wrote:
>>> On 12/20/2013 03:08 AM, James Dingwall wrote:
>>>> Bob Liu wrote:
>>>>> On 12/12/2013 12:30 AM, James Dingwall wrote:
>>>>>> Bob Liu wrote:
>>>>>>> On 12/10/2013 11:27 PM, Konrad Rzeszutek Wilk wrote:
>>>>>>>> On Tue, Dec 10, 2013 at 02:52:40PM +0000, James Dingwall wrote:
>>>>>>>>> Konrad Rzeszutek Wilk wrote:
>>>>>>>>>> On Mon, Dec 09, 2013 at 05:50:29PM +0000, James Dingwall wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Since 3.11 I have noticed that the OOM killer quite frequently
>>>>>>>>>>> triggers in my Xen guest domains which use ballooning to
>>>>>>>>>>> increase/decrease their memory allocation according to their
>>>>>>>>>>> requirements. One example domain I have has a maximum memory
>>>>>>>>>>> setting of ~1.5Gb but it usually idles at ~300Mb, it is also
>>>>>>>>>>> configured with 2Gb swap which is almost 100% free.
>>>>>>>>>>>
>>>>>>>>>>> # free
>>>>>>>>>>>              total       used       free     shared    buffers     cached
>>>>>>>>>>> Mem:        272080     248108      23972          0       1448      63064
>>>>>>>>>>> -/+ buffers/cache:     183596      88484
>>>>>>>>>>> Swap:      2097148          8    2097140
>>>>>>>>>>>
>>>>>>>>>>> There is plenty of available free memory in the hypervisor to
>>>>>>>>>>> balloon to the maximum size:
>>>>>>>>>>> # xl info | grep free_mem
>>>>>>>>>>> free_memory            : 14923
>>>>>>>>>>>
>>>>>>>>>>> An example trace (they are always the same) from the oom killer in
>>>>>>>>>>> 3.12 is added below. So far I have not been able to reproduce this
>>>>>>>>>>> at will so it is difficult to start bisecting it to see if a
>>>>>>>>>>> particular change introduced this. However it does seem that the
>>>>>>>>>>> behaviour is wrong because a) ballooning could give the guest more
>>>>>>>>>>> memory, b) there is lots of swap available which could be used as a
>>>>>>>>>>> fallback.
>>>>>>>> Keep in mind that swap with tmem is actually no more swap. Heh, that
>>>>>>>> sounds odd - but basically pages that are destined for swap end up
>>>>>>>> going in the tmem code which pipes them up to the hypervisor.
>>>>>>>>
>>>>>>>>>>> If other information could help or there are more tests that I could
>>>>>>>>>>> run then please let me know.
>>>>>>>>>> I presume you have enabled 'tmem' both in the hypervisor and in
>>>>>>>>>> the guest right?
>>>>>>>>> Yes, domU and dom0 both have the tmem module loaded and tmem
>>>>>>>>> tmem_dedup=on tmem_compress=on is given on the xen command line.
>>>>>>>> Excellent.
>>>>>>>> The odd thing is that your swap is not used that much, but
>>>>>>>> it should be (as that is part of what the self-balloon is supposed
>>>>>>>> to do).
>>>>>>>>
>>>>>>>> Bob, you had a patch for the logic of how self-balloon is supposed
>>>>>>>> to account for the slab - would this be relevant to this problem?
>>>>>>>>
>>>>>>> Perhaps, I have attached the patch.
>>>>>>> James, could you please apply it and try your application again? You
>>>>>>> have to rebuild the guest kernel.
>>>>>>> Oh, and also take a look at whether frontswap is in use, you can
>>>>>>> check it by watching "cat /sys/kernel/debug/frontswap/*".
>>>>>> I have tested this patch with a workload where I have previously seen
>>>>>> failures and so far so good. I'll try to keep a guest with it stressed
>>>>>> to see if I do get any problems. I don't know if it is expected but I
>>>>> By the way, besides the longer kswapd time, does this patch work well
>>>>> during your stress testing?
>>>>>
>>>>> Have you seen the OOM killer triggered quite frequently again? (with
>>>>> selfshrink=true)
>>>>>
>>>>> Thanks,
>>>>> -Bob
>>>> It was looking good until today (selfshrink=true). The trace below is
>>>> from during a compile of subversion; it looks like the memory has
>>>> ballooned to almost the maximum permissible but even under pressure the
>>>> swap disk has hardly come into use.
>>>>
>>> So if without selfshrink the swap disk can be used a lot?
>>>
>>> If that's the case, I'm afraid the frontswap-selfshrink in
>>> xen-selfballoon did something incorrect.
>>>
>>> Could you please try this patch, which makes the frontswap-selfshrink
>>> slower and adds a printk for debugging.
>>> Please still keep selfshrink=true in your test, with or without my
>>> previous patch.
>>> Thanks a lot!
>>>
>> The oom trace below was triggered during a compile of gcc. I have the
>> full dmesg from boot which shows all the printks, please let me know if
>> you would like to see that.
>>
> Sorry for the late response.
> Could you confirm that this problem doesn't exist when loading tmem with
> selfshrinking=0 during a gcc compile? It seems that you are compiling
> different packages during your testing.
> This will help to figure out whether selfshrinking is the root cause.

Got an oom with selfshrinking=0, again during a gcc compile. Unfortunately
I don't have a single test case which demonstrates the problem, but as I
mentioned before it will generally show up during compiles of large
packages such as glibc, kdelibs, gcc etc.

I don't know if this is a separate or related issue, but over the holidays
I also had a problem with six of the guests on my system where kswapd was
running at 100% and had clocked up >9000 minutes of cpu time even though
there was otherwise no load on them. Of the guests I restarted yesterday
in this state, two have already got into the same state again; they are
running a kernel with the first patch that you sent.
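For reference, the tmem state in one of the affected guests is below. It is
just the tmem module parameters, plus the frontswap counters Bob pointed at
earlier; a rough sketch of how to capture it (the loop is only a convenience
I use here, and the debugfs path assumes debugfs is mounted at
/sys/kernel/debug):

# Dump the Y/N flags exported by the tmem module:
for f in /sys/module/tmem/parameters/*; do
    echo "$f $(cat "$f")"
done

# Frontswap activity counters, if available:
grep . /sys/kernel/debug/frontswap/*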
/sys/module/tmem/parameters/cleancache      Y
/sys/module/tmem/parameters/frontswap       Y
/sys/module/tmem/parameters/selfballooning  Y
/sys/module/tmem/parameters/selfshrinking   N

James

[ 8212.940520] cc1plus invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
[ 8212.940529] CPU: 1 PID: 23678 Comm: cc1plus Tainted: G W 3.12.5 #88
[ 8212.940532]  ffff88001e38cdf8 ffff88000094f968 ffffffff8148f200 ffff88001f90e8e8
[ 8212.940536]  ffff88001e38c8c0 ffff88000094fa08 ffffffff8148ccf7 ffff88000094f9b8
[ 8212.940538]  ffffffff810f8d97 ffff88000094f998 ffffffff81006dc8 ffff88000094f9a8
[ 8212.940542] Call Trace:
[ 8212.940554]  [] dump_stack+0x46/0x58
[ 8212.940558]  [] dump_header.isra.9+0x6d/0x1cc
[ 8212.940564]  [] ? super_cache_count+0xa8/0xb8
[ 8212.940569]  [] ? xen_clocksource_read+0x20/0x22
[ 8212.940573]  [] ? xen_clocksource_get_cycles+0x9/0xb
[ 8212.940578]  [] ? _raw_spin_unlock_irqrestore+0x47/0x62
[ 8212.940583]  [] ? ___ratelimit+0xcb/0xe8
[ 8212.940588]  [] oom_kill_process+0x70/0x2fd
[ 8212.940592]  [] ? zone_reclaimable+0x11/0x1e
[ 8212.940597]  [] ? has_ns_capability_noaudit+0x12/0x19
[ 8212.940600]  [] ? has_capability_noaudit+0x12/0x14
[ 8212.940603]  [] out_of_memory+0x31b/0x34e
[ 8212.940608]  [] __alloc_pages_nodemask+0x65b/0x792
[ 8212.940612]  [] alloc_pages_vma+0xd0/0x10c
[ 8212.940617]  [] read_swap_cache_async+0x70/0x120
[ 8212.940620]  [] swapin_readahead+0x90/0xd4
[ 8212.940623]  [] ? pte_mfn_to_pfn+0x59/0xcb
[ 8212.940627]  [] handle_mm_fault+0x8a4/0xd54
[ 8212.940630]  [] ? xen_clocksource_read+0x20/0x22
[ 8212.940634]  [] ? sched_clock+0x9/0xd
[ 8212.940638]  [] ? sched_clock_local+0x12/0x75
[ 8212.940641]  [] ? arch_vtime_task_switch+0x81/0x86
[ 8212.940646]  [] __do_page_fault+0x3d8/0x437
[ 8212.940649]  [] ? xen_clocksource_read+0x20/0x22
[ 8212.940652]  [] ? sched_clock+0x9/0xd
[ 8212.940654]  [] ? sched_clock_local+0x12/0x75
[ 8212.940658]  [] ? __acct_update_integrals+0xb4/0xbf
[ 8212.940661]  [] ? acct_account_cputime+0x17/0x19
[ 8212.940663]  [] ? account_user_time+0x67/0x92
[ 8212.940666]  [] ? vtime_account_user+0x4d/0x52
[ 8212.940669]  [] do_page_fault+0x1a/0x5a
[ 8212.940674]  [] ? rcu_user_enter+0xe/0x10
[ 8212.940677]  [] page_fault+0x28/0x30
[ 8212.940679] Mem-Info:
[ 8212.940681] Node 0 DMA per-cpu:
[ 8212.940684] CPU    0: hi:    0, btch:   1 usd:   0
[ 8212.940685] CPU    1: hi:    0, btch:   1 usd:   0
[ 8212.940686] Node 0 DMA32 per-cpu:
[ 8212.940688] CPU    0: hi:  186, btch:  31 usd: 116
[ 8212.940690] CPU    1: hi:  186, btch:  31 usd: 124
[ 8212.940691] Node 0 Normal per-cpu:
[ 8212.940693] CPU    0: hi:    0, btch:   1 usd:   0
[ 8212.940694] CPU    1: hi:    0, btch:   1 usd:   0
[ 8212.940700] active_anon:105765 inactive_anon:105882 isolated_anon:0
 active_file:8412 inactive_file:8612 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 free:1143 slab_reclaimable:3575 slab_unreclaimable:3464
 mapped:3792 shmem:6 pagetables:2534 bounce:0
 free_cma:0 totalram:246132 balloontarget:306242
[ 8212.940702] Node 0 DMA free:1964kB min:88kB low:108kB high:132kB active_anon:5092kB inactive_anon:5328kB active_file:416kB inactive_file:608kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:15392kB mlocked:0kB dirty:0kB writeback:0kB mapped:320kB shmem:0kB slab_reclaimable:252kB slab_unreclaimable:492kB kernel_stack:120kB pagetables:252kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:26951 all_unreclaimable? yes
[ 8212.940711] lowmem_reserve[]: 0 469 469 469
[ 8212.940715] Node 0 DMA32 free:2608kB min:2728kB low:3408kB high:4092kB active_anon:181456kB inactive_anon:181528kB active_file:22296kB inactive_file:22644kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:507904kB managed:466364kB mlocked:0kB dirty:0kB writeback:0kB mapped:8628kB shmem:20kB slab_reclaimable:10756kB slab_unreclaimable:12548kB kernel_stack:1688kB pagetables:8876kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:612393 all_unreclaimable? yes
[ 8212.940722] lowmem_reserve[]: 0 0 0 0
[ 8212.940725] Node 0 Normal free:0kB min:0kB low:0kB high:0kB active_anon:236512kB inactive_anon:236672kB active_file:10936kB inactive_file:11196kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:524288kB managed:502772kB mlocked:0kB dirty:0kB writeback:0kB mapped:6220kB shmem:4kB slab_reclaimable:3292kB slab_unreclaimable:816kB kernel_stack:64kB pagetables:1008kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:745963 all_unreclaimable? yes
[ 8212.940732] lowmem_reserve[]: 0 0 0 0
[ 8212.940735] Node 0 DMA: 1*4kB (R) 0*8kB 4*16kB (R) 1*32kB (R) 1*64kB (R) 2*128kB (R) 0*256kB 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB = 1956kB
[ 8212.940747] Node 0 DMA32: 652*4kB (U) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2608kB
[ 8212.940756] Node 0 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 8212.940765] 16847 total pagecache pages
[ 8212.940766] 8381 pages in swap cache
[ 8212.940768] Swap cache stats: add 741397, delete 733016, find 250268/342284
[ 8212.940769] Free swap  = 1925576kB
[ 8212.940770] Total swap = 2097148kB
[ 8212.951044] 262143 pages RAM
[ 8212.951046] 11939 pages reserved
[ 8212.951047] 540820 pages shared
[ 8212.951048] 240248 pages non-shared
[ 8212.951050] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[ 8212.951310] Out of memory: Kill process 23721 (cc1plus) score 119 or sacrifice child
[ 8212.951313] Killed process 23721 (cc1plus) total-vm:530268kB, anon-rss:350980kB, file-rss:9408kB
[54810.683658] kjournald starting.  Commit interval 5 seconds
[54810.684381] EXT3-fs (xvda1): using internal journal
[54810.684402] EXT3-fs (xvda1): mounted filesystem with writeback data mode
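For completeness, since the original report noted there is plenty of free
memory in the hypervisor: the guest's balloon target can also be raised by
hand from dom0 while a guest is wedged like this. A minimal sketch with the
stock xl toolstack; "guestname" is a placeholder for the real domain name and
1536m is just an example roughly matching the domain's configured maximum:

# From dom0: check free hypervisor memory and the guest's current allocation
xl info | grep free_memory
xl list guestname

# Push the guest's balloon target up towards its maximum
xl mem-set guestname 1536m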