From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f177.google.com (mail-pd0-f177.google.com [209.85.192.177]) by kanga.kvack.org (Postfix) with ESMTP id 988706B0036 for ; Tue, 2 Sep 2014 15:05:46 -0400 (EDT) Received: by mail-pd0-f177.google.com with SMTP id r10so9159861pdi.22 for ; Tue, 02 Sep 2014 12:05:46 -0700 (PDT) Received: from blackbird.sr71.net (www.sr71.net. [198.145.64.142]) by mx.google.com with ESMTP id o8si7284107pdr.159.2014.09.02.12.05.43 for ; Tue, 02 Sep 2014 12:05:43 -0700 (PDT) Message-ID: <54061505.8020500@sr71.net> Date: Tue, 02 Sep 2014 12:05:41 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: regression caused by cgroups optimization in 3.17-rc2 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner , Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Linus Torvalds , Andrew Morton , LKML , Linux-MM I'm seeing a pretty large regression in 3.17-rc2 vs 3.16 coming from the memory cgroups code. This is on a kernel with cgroups enabled at compile time, but not _used_ for anything. See the green lines in the graph: https://www.sr71.net/~dave/intel/regression-from-05b843012.png The workload is a little parallel microbenchmark doing page faults: > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault2.c The hardware is an 8-socket Westmere box with 160 hardware threads. For some reason, this does not affect the version of the microbenchmark which is doing completely anonymous page faults. I bisected it down to this commit: > commit 05b8430123359886ef6a4146fba384e30d771b3f > Author: Johannes Weiner > Date: Wed Aug 6 16:05:59 2014 -0700 > > mm: memcontrol: use root_mem_cgroup res_counter > > Due to an old optimization to keep expensive res_counter changes at a > minimum, the root_mem_cgroup res_counter is never charged; there is no > limit at that level anyway, and any statistics can be generated on > demand by summing up the counters of all other cgroups. > > However, with per-cpu charge caches, res_counter operations do not even > show up in profiles anymore, so this optimization is no longer > necessary. > > Remove it to simplify the code. It does not revert cleanly because of the hunks below. The code in those hunks was removed, so I tried running without properly merging them and it spews warnings because counter->usage is seen going negative. So, it doesn't appear we can quickly revert this. > --- mm/memcontrol.c > +++ mm/memcontrol.c > @@ -3943,7 +3947,7 @@ > * replacement page, so leave it alone when phasing out the > * page that is unused after the migration. > */ > - if (!end_migration) > + if (!end_migration && !mem_cgroup_is_root(memcg)) > mem_cgroup_do_uncharge(memcg, nr_pages, ctype); > > return memcg; > @@ -4076,7 +4080,8 @@ > * We uncharge this because swap is freed. This memcg can > * be obsolete one. We avoid calling css_tryget_online(). > */ > - res_counter_uncharge(&memcg->memsw, PAGE_SIZE); > + if (!mem_cgroup_is_root(memcg)) > + res_counter_uncharge(&memcg->memsw, PAGE_SIZE); > mem_cgroup_swap_statistics(memcg, false); > css_put(&memcg->css); > } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f52.google.com (mail-pa0-f52.google.com [209.85.220.52]) by kanga.kvack.org (Postfix) with ESMTP id 89D9B6B0036 for ; Tue, 2 Sep 2014 16:19:04 -0400 (EDT) Received: by mail-pa0-f52.google.com with SMTP id eu11so15519635pac.25 for ; Tue, 02 Sep 2014 13:19:04 -0700 (PDT) Received: from mga09.intel.com (mga09.intel.com. [134.134.136.24]) by mx.google.com with ESMTP id gp10si7326906pbd.244.2014.09.02.13.19.03 for ; Tue, 02 Sep 2014 13:19:03 -0700 (PDT) Message-ID: <5406262F.4050705@intel.com> Date: Tue, 02 Sep 2014 13:18:55 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> In-Reply-To: <54061505.8020500@sr71.net> Content-Type: multipart/mixed; boundary="------------020108010704030702030400" Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen , Johannes Weiner , Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Linus Torvalds , Andrew Morton , LKML , Linux-MM This is a multi-part message in MIME format. --------------020108010704030702030400 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit On 09/02/2014 12:05 PM, Dave Hansen wrote: > It does not revert cleanly because of the hunks below. The code in > those hunks was removed, so I tried running without properly merging > them and it spews warnings because counter->usage is seen going negative. > > So, it doesn't appear we can quickly revert this. I'm fairly confident that I missed some of the cases (especially in the charge-moving code), but the attached patch does at least work around the regression for me. It restores the original performance, or at least gets _close_ to it. --------------020108010704030702030400 Content-Type: text/x-patch; name="try-partial-revert-of-root-charge-regression-patch.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename*0="try-partial-revert-of-root-charge-regression-patch.patch" --- b/mm/memcontrol.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff -puN mm/memcontrol.c~try-partial-revert-of-root-charge-regression-patch mm/memcontrol.c --- a/mm/memcontrol.c~try-partial-revert-of-root-charge-regression-patch 2014-09-02 12:20:11.209527453 -0700 +++ b/mm/memcontrol.c 2014-09-02 13:10:28.756736862 -0700 @@ -2534,6 +2534,8 @@ static int try_charge(struct mem_cgroup unsigned long long size; int ret = 0; + if (mem_cgroup_is_root(memcg)) + goto done; retry: if (consume_stock(memcg, nr_pages)) goto done; @@ -2640,6 +2642,9 @@ static void __mem_cgroup_cancel_local_ch { unsigned long bytes = nr_pages * PAGE_SIZE; + if (mem_cgroup_is_root(memcg)) + return; + res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes); if (do_swap_account) res_counter_uncharge_until(&memcg->memsw, @@ -6440,6 +6445,9 @@ void mem_cgroup_commit_charge(struct pag VM_BUG_ON_PAGE(!page->mapping, page); VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page); + if (mem_cgroup_is_root(memcg)) + return; + if (mem_cgroup_disabled()) return; /* @@ -6484,6 +6492,9 @@ void mem_cgroup_cancel_charge(struct pag { unsigned int nr_pages = 1; + if (mem_cgroup_is_root(memcg)) + return; + if (mem_cgroup_disabled()) return; /* @@ -6509,6 +6520,9 @@ static void uncharge_batch(struct mem_cg { unsigned long flags; + if (mem_cgroup_is_root(memcg)) + return; + if (nr_mem) res_counter_uncharge(&memcg->res, nr_mem * PAGE_SIZE); if (nr_memsw) _ --------------020108010704030702030400-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f53.google.com (mail-pa0-f53.google.com [209.85.220.53]) by kanga.kvack.org (Postfix) with ESMTP id F2A686B0036 for ; Tue, 2 Sep 2014 16:57:28 -0400 (EDT) Received: by mail-pa0-f53.google.com with SMTP id fa1so15764021pad.12 for ; Tue, 02 Sep 2014 13:57:28 -0700 (PDT) Received: from blackbird.sr71.net ([2001:19d0:2:6:209:6bff:fe9a:902]) by mx.google.com with ESMTP id np1si7899516pdb.31.2014.09.02.13.57.24 for ; Tue, 02 Sep 2014 13:57:24 -0700 (PDT) Message-ID: <54062F32.5070504@sr71.net> Date: Tue, 02 Sep 2014 13:57:22 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> In-Reply-To: <5406262F.4050705@intel.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen , Johannes Weiner , Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Linus Torvalds , Andrew Morton , LKML , Linux-MM I, of course, forgot to include the most important detail. This appears to be pretty run-of-the-mill spinlock contention in the resource counter code. Nearly 80% of the CPU is spent spinning in the charge or uncharge paths in the kernel. It is apparently spinning on res_counter->lock in both the charge and uncharge paths. It already does _some_ batching here on the free side, but that apparently breaks down after ~40 threads. It's a no-brainer since the patch in question removed an optimization skipping the charging, and now we're seeing overhead from the charging. Here's the first entry from perf top: 80.18% 80.18% [kernel] [k] _raw_spin_lock | --- _raw_spin_lock | |--66.59%-- res_counter_uncharge_until | res_counter_uncharge | uncharge_batch | uncharge_list | mem_cgroup_uncharge_list | release_pages | free_pages_and_swap_cache | tlb_flush_mmu_free | | | |--90.12%-- unmap_single_vma | | unmap_vmas | | unmap_region | | do_munmap | | vm_munmap | | sys_munmap | | system_call_fastpath | | __GI___munmap | | | --9.88%-- tlb_flush_mmu | tlb_finish_mmu | unmap_region | do_munmap | vm_munmap | sys_munmap | system_call_fastpath | __GI___munmap | |--46.13%-- __res_counter_charge | res_counter_charge | try_charge | mem_cgroup_try_charge | | | |--99.89%-- do_cow_fault | | handle_mm_fault | | __do_page_fault | | do_page_fault | | page_fault | | testcase | --0.11%-- [...] | |--1.14%-- do_cow_fault | handle_mm_fault | __do_page_fault | do_page_fault | page_fault | testcase --8217937613.29%-- [...] -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lb0-f171.google.com (mail-lb0-f171.google.com [209.85.217.171]) by kanga.kvack.org (Postfix) with ESMTP id C81FD6B0036 for ; Tue, 2 Sep 2014 18:18:39 -0400 (EDT) Received: by mail-lb0-f171.google.com with SMTP id 10so113203lbg.16 for ; Tue, 02 Sep 2014 15:18:38 -0700 (PDT) Received: from zene.cmpxchg.org (zene.cmpxchg.org. [2a01:238:4224:fa00:ca1f:9ef3:caee:a2bd]) by mx.google.com with ESMTPS id l18si6620414lbg.26.2014.09.02.15.18.37 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 02 Sep 2014 15:18:37 -0700 (PDT) Date: Tue, 2 Sep 2014 18:18:14 -0400 From: Johannes Weiner Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140902221814.GA18069@cmpxchg.org> References: <54061505.8020500@sr71.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54061505.8020500@sr71.net> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Linus Torvalds , Andrew Morton , LKML , Linux-MM Hi Dave, On Tue, Sep 02, 2014 at 12:05:41PM -0700, Dave Hansen wrote: > I'm seeing a pretty large regression in 3.17-rc2 vs 3.16 coming from the > memory cgroups code. This is on a kernel with cgroups enabled at > compile time, but not _used_ for anything. See the green lines in the > graph: > > https://www.sr71.net/~dave/intel/regression-from-05b843012.png > > The workload is a little parallel microbenchmark doing page faults: Ouch. > > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault2.c > > The hardware is an 8-socket Westmere box with 160 hardware threads. For > some reason, this does not affect the version of the microbenchmark > which is doing completely anonymous page faults. > > I bisected it down to this commit: > > > commit 05b8430123359886ef6a4146fba384e30d771b3f > > Author: Johannes Weiner > > Date: Wed Aug 6 16:05:59 2014 -0700 > > > > mm: memcontrol: use root_mem_cgroup res_counter > > > > Due to an old optimization to keep expensive res_counter changes at a > > minimum, the root_mem_cgroup res_counter is never charged; there is no > > limit at that level anyway, and any statistics can be generated on > > demand by summing up the counters of all other cgroups. > > > > However, with per-cpu charge caches, res_counter operations do not even > > show up in profiles anymore, so this optimization is no longer > > necessary. > > > > Remove it to simplify the code. Accounting new pages is buffered through per-cpu caches, but taking them off the counters on free is not, so I'm guessing that above a certain allocation rate the cost of locking and changing the counters takes over. Is there a chance you could profile this to see if locks and res_counter-related operations show up? I can't reproduce this complete breakdown on my smaller test gear, but I do see an improvement with the following patch: --- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f169.google.com (mail-pd0-f169.google.com [209.85.192.169]) by kanga.kvack.org (Postfix) with ESMTP id C1E4A6B0036 for ; Tue, 2 Sep 2014 18:36:33 -0400 (EDT) Received: by mail-pd0-f169.google.com with SMTP id y10so240184pdj.14 for ; Tue, 02 Sep 2014 15:36:33 -0700 (PDT) Received: from blackbird.sr71.net (www.sr71.net. [198.145.64.142]) by mx.google.com with ESMTP id rk5si7785638pab.204.2014.09.02.15.36.32 for ; Tue, 02 Sep 2014 15:36:32 -0700 (PDT) Message-ID: <5406466D.1020000@sr71.net> Date: Tue, 02 Sep 2014 15:36:29 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <20140902221814.GA18069@cmpxchg.org> In-Reply-To: <20140902221814.GA18069@cmpxchg.org> Content-Type: multipart/mixed; boundary="------------090300080702060409030200" Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Linus Torvalds , Andrew Morton , LKML , Linux-MM This is a multi-part message in MIME format. --------------090300080702060409030200 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit On 09/02/2014 03:18 PM, Johannes Weiner wrote: > Accounting new pages is buffered through per-cpu caches, but taking > them off the counters on free is not, so I'm guessing that above a > certain allocation rate the cost of locking and changing the counters > takes over. Is there a chance you could profile this to see if locks > and res_counter-related operations show up? It looks pretty much the same, although it might have equalized the charge and uncharge sides a bit. Full 'perf top' output attached. --------------090300080702060409030200 Content-Type: text/plain; charset=UTF-8; name="perf-20140902-johannes-take1.txt" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="perf-20140902-johannes-take1.txt" PerfTop: 275580 irqs/sec kernel:98.0% exact: 0.0% [4000Hz cycles], (all, 160 CPUs) ------------------------------------------------------------------------------- 68.10% 68.10% [kernel] [k] _raw_spin_lock | --- _raw_spin_lock | |--57.35%-- __res_counter_charge | res_counter_charge | try_charge | mem_cgroup_try_charge | | | |--99.93%-- do_cow_fault | | handle_mm_fault | | __do_page_fault | | do_page_fault | | page_fault | | testcase | --0.07%-- [...] | |--53.93%-- res_counter_uncharge_until | res_counter_uncharge | refill_stock | uncharge_batch | uncharge_list | mem_cgroup_uncharge_list | release_pages | free_pages_and_swap_cache | tlb_flush_mmu_free | | | |--98.62%-- unmap_single_vma | | unmap_vmas | | unmap_region | | do_munmap | | vm_munmap | | sys_munmap | | system_call_fastpath | | __GI___munmap | | | --1.38%-- tlb_flush_mmu | tlb_finish_mmu | unmap_region | do_munmap | vm_munmap | sys_munmap | system_call_fastpath | __GI___munmap | |--2.18%-- do_cow_fault | handle_mm_fault | __do_page_fault | do_page_fault | page_fault | testcase --9307219025.55%-- [...] 64.00% 1.34% page_fault2_processes [.] testcase | --- testcase 62.64% 0.37% [kernel] [k] page_fault | --- page_fault | |--114.21%-- testcase --10118485450.89%-- [...] 62.27% 0.01% [kernel] [k] do_page_fault | --- do_page_fault | |--114.28%-- page_fault | | | |--99.93%-- testcase | --0.07%-- [...] --10178525138.33%-- [...] 62.25% 0.07% [kernel] [k] __do_page_fault | --- __do_page_fault | |--114.27%-- do_page_fault | page_fault | | | |--99.93%-- testcase | --0.07%-- [...] --10182230022.41%-- [...] 62.08% 0.26% [kernel] [k] handle_mm_fault | --- handle_mm_fault | |--114.28%-- __do_page_fault | do_page_fault | page_fault | | | |--99.94%-- testcase | --0.06%-- [...] --10210709377.69%-- [...] 44.32% 0.22% [kernel] [k] do_cow_fault | --- do_cow_fault | |--114.28%-- handle_mm_fault | __do_page_fault | do_page_fault | page_fault | testcase --14302986980.21%-- [...] 35.23% 0.00% [kernel] [k] sys_munmap | --- sys_munmap system_call_fastpath __GI___munmap 34.84% 0.04% [kernel] [k] mem_cgroup_try_charge | --- mem_cgroup_try_charge | |--114.18%-- do_cow_fault | handle_mm_fault | __do_page_fault | do_page_fault | page_fault | testcase --18195666899.12%-- [...] 34.74% 0.49% [kernel] [k] unmap_single_vma | --- unmap_single_vma unmap_vmas unmap_region do_munmap vm_munmap sys_munmap system_call_fastpath __GI___munmap 34.66% 0.00% [kernel] [k] tlb_flush_mmu_free | --- tlb_flush_mmu_free | |--112.70%-- unmap_single_vma | unmap_vmas | unmap_region | do_munmap | vm_munmap | sys_munmap | system_call_fastpath | __GI___munmap | |--1.59%-- tlb_flush_mmu | tlb_finish_mmu | unmap_region | do_munmap | vm_munmap | sys_munmap | system_call_fastpath | __GI___munmap --18285481100.94%-- [...] 34.66% 0.12% [kernel] [k] free_pages_and_swap_cache | --- free_pages_and_swap_cache tlb_flush_mmu_free | |--112.70%-- unmap_single_vma | unmap_vmas | unmap_region | do_munmap | vm_munmap | sys_munmap | system_call_fastpath | __GI___munmap | |--1.59%-- tlb_flush_mmu | tlb_finish_mmu | unmap_region | do_munmap | vm_munmap | sys_munmap | system_call_fastpath | __GI___munmap --18286531934.97%-- [...] 34.61% 0.04% [kernel] [k] try_charge | --- try_charge | |--114.26%-- mem_cgroup_try_charge | | | |--99.93%-- do_cow_fault | | handle_mm_fault | | __do_page_fault | | do_page_fault | | page_fault | | testcase | --0.07%-- [...] --18312751937.56%-- [...] 34.57% 0.00% [kernel] [k] res_counter_charge | --- res_counter_charge | |--114.27%-- try_charge | mem_cgroup_try_charge | | | |--99.93%-- do_cow_fault | | handle_mm_fault | | __do_page_fault | | do_page_fault | | page_fault | | testcase | --0.07%-- [...] --18334689838.42%-- [...] 34.56% 0.08% [kernel] [k] release_pages | --- release_pages | |--114.21%-- free_pages_and_swap_cache | tlb_flush_mmu_free | | | |--98.61%-- unmap_single_vma | | unmap_vmas | | unmap_region | | do_munmap | | vm_munmap | | sys_munmap | | system_call_fastpath | | __GI___munmap | | | --1.39%-- tlb_flush_mmu | tlb_finish_mmu | unmap_region | do_munmap | vm_munmap | sys_munmap | system_call_fastpath | __GI___munmap --18340488029.00%-- [...] 34.26% 0.08% [kernel] [k] __res_counter_charge | --- __res_counter_charge | |--114.28%-- res_counter_charge | try_charge | mem_cgroup_try_charge | | | |--99.93%-- do_cow_fault | | handle_mm_fault | | __do_page_fault | | do_page_fault | | page_fault | | testcase | --0.07%-- [...] --18502823676.22%-- [...] 33.45% 0.00% [kernel] [k] mem_cgroup_uncharge_list | --- mem_cgroup_uncharge_list | |--114.28%-- release_pages | free_pages_and_swap_cache | tlb_flush_mmu_free | | | |--98.62%-- unmap_single_vma | | unmap_vmas | | unmap_region | | do_munmap | | vm_munmap | | sys_munmap | | system_call_fastpath | | __GI___munmap | | | --1.38%-- tlb_flush_mmu | tlb_finish_mmu | unmap_region | do_munmap | vm_munmap | sys_munmap | system_call_fastpath | __GI___munmap --18949679654.78%-- [...] 33.45% 0.01% [kernel] [k] uncharge_list | --- uncharge_list | |--114.29%-- mem_cgroup_uncharge_list | release_pages | free_pages_and_swap_cache | tlb_flush_mmu_free | | | |--98.62%-- unmap_single_vma | | unmap_vmas | | unmap_region | | do_munmap | | vm_munmap | | sys_munmap | | system_call_fastpath | | __GI___munmap | | | --1.38%-- tlb_flush_mmu | tlb_finish_mmu | unmap_region | do_munmap | vm_munmap | sys_munmap | system_call_fastpath | __GI___munmap --18951232568.52%-- [...] 33.43% 0.02% [kernel] [k] uncharge_batch | --- uncharge_batch | |--114.28%-- uncharge_list | mem_cgroup_uncharge_list | release_pages | free_pages_and_swap_cache | tlb_flush_mmu_free | | | |--98.62%-- unmap_single_vma | | unmap_vmas | | unmap_region | | do_munmap | | vm_munmap | | sys_munmap | | system_call_fastpath | | __GI___munmap | | | --1.38%-- tlb_flush_mmu | tlb_finish_mmu | unmap_region | do_munmap | vm_munmap | sys_munmap | system_call_fastpath | __GI___munmap --18962452122.61%-- [...] 32.43% 0.01% [kernel] [k] refill_stock | --- refill_stock | |--114.27%-- uncharge_batch | uncharge_list | mem_cgroup_uncharge_list | release_pages | free_pages_and_swap_cache | tlb_flush_mmu_free | | | |--98.62%-- unmap_single_vma | | unmap_vmas | | unmap_region | | do_munmap | | vm_munmap | | sys_munmap | | system_call_fastpath | | __GI___munmap | | | --1.38%-- tlb_flush_mmu | tlb_finish_mmu | unmap_region | do_munmap | vm_munmap | sys_munmap | system_call_fastpath | __GI___munmap --19543678319.29%-- [...] --------------090300080702060409030200-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-la0-f53.google.com (mail-la0-f53.google.com [209.85.215.53]) by kanga.kvack.org (Postfix) with ESMTP id 821C16B0036 for ; Tue, 2 Sep 2014 20:10:31 -0400 (EDT) Received: by mail-la0-f53.google.com with SMTP id q1so239564lam.12 for ; Tue, 02 Sep 2014 17:10:30 -0700 (PDT) Received: from zene.cmpxchg.org (zene.cmpxchg.org. [2a01:238:4224:fa00:ca1f:9ef3:caee:a2bd]) by mx.google.com with ESMTPS id y8si551530lag.85.2014.09.02.17.10.29 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 02 Sep 2014 17:10:29 -0700 (PDT) Date: Tue, 2 Sep 2014 20:10:09 -0400 From: Johannes Weiner Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140903001009.GA25970@cmpxchg.org> References: <54061505.8020500@sr71.net> <20140902221814.GA18069@cmpxchg.org> <5406466D.1020000@sr71.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5406466D.1020000@sr71.net> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Linus Torvalds , Andrew Morton , LKML , Linux-MM On Tue, Sep 02, 2014 at 03:36:29PM -0700, Dave Hansen wrote: > On 09/02/2014 03:18 PM, Johannes Weiner wrote: > > Accounting new pages is buffered through per-cpu caches, but taking > > them off the counters on free is not, so I'm guessing that above a > > certain allocation rate the cost of locking and changing the counters > > takes over. Is there a chance you could profile this to see if locks > > and res_counter-related operations show up? > > It looks pretty much the same, although it might have equalized the > charge and uncharge sides a bit. Full 'perf top' output attached. That looks like a partial profile, where did the page allocator, page zeroing etc. go? Because the distribution among these listed symbols doesn't seem all that crazy: > PerfTop: 275580 irqs/sec kernel:98.0% exact: 0.0% [4000Hz cycles], (all, 160 CPUs) > ------------------------------------------------------------------------------- > > 68.10% 68.10% [kernel] [k] _raw_spin_lock > | > --- _raw_spin_lock > | > |--57.35%-- __res_counter_charge > | res_counter_charge > | try_charge > | mem_cgroup_try_charge > | | > | |--99.93%-- do_cow_fault > | | handle_mm_fault > | | __do_page_fault > | | do_page_fault > | | page_fault > | | testcase > | --0.07%-- [...] > | > |--53.93%-- res_counter_uncharge_until > | res_counter_uncharge > | refill_stock > | uncharge_batch > | uncharge_list > | mem_cgroup_uncharge_list > | release_pages > | free_pages_and_swap_cache > | tlb_flush_mmu_free > | | > | |--98.62%-- unmap_single_vma > | | unmap_vmas > | | unmap_region > | | do_munmap > | | vm_munmap > | | sys_munmap > | | system_call_fastpath > | | __GI___munmap > | | > | --1.38%-- tlb_flush_mmu > | tlb_finish_mmu > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > | > |--2.18%-- do_cow_fault > | handle_mm_fault > | __do_page_fault > | do_page_fault > | page_fault > | testcase > --9307219025.55%-- [...] > > 64.00% 1.34% page_fault2_processes [.] testcase > | > --- testcase > > 62.64% 0.37% [kernel] [k] page_fault > | > --- page_fault > | > |--114.21%-- testcase > --10118485450.89%-- [...] > > 62.27% 0.01% [kernel] [k] do_page_fault > | > --- do_page_fault > | > |--114.28%-- page_fault > | | > | |--99.93%-- testcase > | --0.07%-- [...] > --10178525138.33%-- [...] > > 62.25% 0.07% [kernel] [k] __do_page_fault > | > --- __do_page_fault > | > |--114.27%-- do_page_fault > | page_fault > | | > | |--99.93%-- testcase > | --0.07%-- [...] > --10182230022.41%-- [...] > > 62.08% 0.26% [kernel] [k] handle_mm_fault > | > --- handle_mm_fault > | > |--114.28%-- __do_page_fault > | do_page_fault > | page_fault > | | > | |--99.94%-- testcase > | --0.06%-- [...] > --10210709377.69%-- [...] > > 44.32% 0.22% [kernel] [k] do_cow_fault > | > --- do_cow_fault > | > |--114.28%-- handle_mm_fault > | __do_page_fault > | do_page_fault > | page_fault > | testcase > --14302986980.21%-- [...] > > 35.23% 0.00% [kernel] [k] sys_munmap > | > --- sys_munmap > system_call_fastpath > __GI___munmap > > 34.84% 0.04% [kernel] [k] mem_cgroup_try_charge > | > --- mem_cgroup_try_charge > | > |--114.18%-- do_cow_fault > | handle_mm_fault > | __do_page_fault > | do_page_fault > | page_fault > | testcase > --18195666899.12%-- [...] > > 34.74% 0.49% [kernel] [k] unmap_single_vma > | > --- unmap_single_vma > unmap_vmas > unmap_region > do_munmap > vm_munmap > sys_munmap > system_call_fastpath > __GI___munmap > > 34.66% 0.00% [kernel] [k] tlb_flush_mmu_free > | > --- tlb_flush_mmu_free > | > |--112.70%-- unmap_single_vma > | unmap_vmas > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > | > |--1.59%-- tlb_flush_mmu > | tlb_finish_mmu > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > --18285481100.94%-- [...] > > 34.66% 0.12% [kernel] [k] free_pages_and_swap_cache > | > --- free_pages_and_swap_cache > tlb_flush_mmu_free > | > |--112.70%-- unmap_single_vma > | unmap_vmas > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > | > |--1.59%-- tlb_flush_mmu > | tlb_finish_mmu > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > --18286531934.97%-- [...] > > 34.61% 0.04% [kernel] [k] try_charge > | > --- try_charge > | > |--114.26%-- mem_cgroup_try_charge > | | > | |--99.93%-- do_cow_fault > | | handle_mm_fault > | | __do_page_fault > | | do_page_fault > | | page_fault > | | testcase > | --0.07%-- [...] > --18312751937.56%-- [...] > > 34.57% 0.00% [kernel] [k] res_counter_charge > | > --- res_counter_charge > | > |--114.27%-- try_charge > | mem_cgroup_try_charge > | | > | |--99.93%-- do_cow_fault > | | handle_mm_fault > | | __do_page_fault > | | do_page_fault > | | page_fault > | | testcase > | --0.07%-- [...] > --18334689838.42%-- [...] > > 34.56% 0.08% [kernel] [k] release_pages > | > --- release_pages > | > |--114.21%-- free_pages_and_swap_cache > | tlb_flush_mmu_free > | | > | |--98.61%-- unmap_single_vma > | | unmap_vmas > | | unmap_region > | | do_munmap > | | vm_munmap > | | sys_munmap > | | system_call_fastpath > | | __GI___munmap > | | > | --1.39%-- tlb_flush_mmu > | tlb_finish_mmu > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > --18340488029.00%-- [...] > > 34.26% 0.08% [kernel] [k] __res_counter_charge > | > --- __res_counter_charge > | > |--114.28%-- res_counter_charge > | try_charge > | mem_cgroup_try_charge > | | > | |--99.93%-- do_cow_fault > | | handle_mm_fault > | | __do_page_fault > | | do_page_fault > | | page_fault > | | testcase > | --0.07%-- [...] > --18502823676.22%-- [...] > > 33.45% 0.00% [kernel] [k] mem_cgroup_uncharge_list > | > --- mem_cgroup_uncharge_list > | > |--114.28%-- release_pages > | free_pages_and_swap_cache > | tlb_flush_mmu_free > | | > | |--98.62%-- unmap_single_vma > | | unmap_vmas > | | unmap_region > | | do_munmap > | | vm_munmap > | | sys_munmap > | | system_call_fastpath > | | __GI___munmap > | | > | --1.38%-- tlb_flush_mmu > | tlb_finish_mmu > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > --18949679654.78%-- [...] > > 33.45% 0.01% [kernel] [k] uncharge_list > | > --- uncharge_list > | > |--114.29%-- mem_cgroup_uncharge_list > | release_pages > | free_pages_and_swap_cache > | tlb_flush_mmu_free > | | > | |--98.62%-- unmap_single_vma > | | unmap_vmas > | | unmap_region > | | do_munmap > | | vm_munmap > | | sys_munmap > | | system_call_fastpath > | | __GI___munmap > | | > | --1.38%-- tlb_flush_mmu > | tlb_finish_mmu > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > --18951232568.52%-- [...] > > 33.43% 0.02% [kernel] [k] uncharge_batch > | > --- uncharge_batch > | > |--114.28%-- uncharge_list > | mem_cgroup_uncharge_list > | release_pages > | free_pages_and_swap_cache > | tlb_flush_mmu_free > | | > | |--98.62%-- unmap_single_vma > | | unmap_vmas > | | unmap_region > | | do_munmap > | | vm_munmap > | | sys_munmap > | | system_call_fastpath > | | __GI___munmap > | | > | --1.38%-- tlb_flush_mmu > | tlb_finish_mmu > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > --18962452122.61%-- [...] > > 32.43% 0.01% [kernel] [k] refill_stock > | > --- refill_stock > | > |--114.27%-- uncharge_batch > | uncharge_list > | mem_cgroup_uncharge_list > | release_pages > | free_pages_and_swap_cache > | tlb_flush_mmu_free > | | > | |--98.62%-- unmap_single_vma > | | unmap_vmas > | | unmap_region > | | do_munmap > | | vm_munmap > | | sys_munmap > | | system_call_fastpath > | | __GI___munmap > | | > | --1.38%-- tlb_flush_mmu > | tlb_finish_mmu > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > --19543678319.29%-- [...] -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-vc0-f178.google.com (mail-vc0-f178.google.com [209.85.220.178]) by kanga.kvack.org (Postfix) with ESMTP id 62FFF6B0036 for ; Tue, 2 Sep 2014 20:20:56 -0400 (EDT) Received: by mail-vc0-f178.google.com with SMTP id la4so8135834vcb.9 for ; Tue, 02 Sep 2014 17:20:56 -0700 (PDT) Received: from mail-vc0-x235.google.com (mail-vc0-x235.google.com [2607:f8b0:400c:c03::235]) by mx.google.com with ESMTPS id s10si254747vdv.68.2014.09.02.17.20.55 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 02 Sep 2014 17:20:55 -0700 (PDT) Received: by mail-vc0-f181.google.com with SMTP id ij19so7864746vcb.26 for ; Tue, 02 Sep 2014 17:20:55 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20140903001009.GA25970@cmpxchg.org> References: <54061505.8020500@sr71.net> <20140902221814.GA18069@cmpxchg.org> <5406466D.1020000@sr71.net> <20140903001009.GA25970@cmpxchg.org> Date: Tue, 2 Sep 2014 17:20:55 -0700 Message-ID: Subject: Re: regression caused by cgroups optimization in 3.17-rc2 From: Linus Torvalds Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Dave Hansen , Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Andrew Morton , LKML , Linux-MM On Tue, Sep 2, 2014 at 5:10 PM, Johannes Weiner wrote: > > That looks like a partial profile, where did the page allocator, page > zeroing etc. go? Because the distribution among these listed symbols > doesn't seem all that crazy: Please argue this *after* the commit has been reverted. You guys can try to make the memcontrol batching actually work and scale later. It's not appropriate to argue against major regressions when reported and bisected by users. Showing the spinlock at the top of the profile is very much crazy (apparently taking 68% of all cpu time), when it's all useless make-believe work. I don't understand why you wouldn't call that crazy. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f41.google.com (mail-pa0-f41.google.com [209.85.220.41]) by kanga.kvack.org (Postfix) with ESMTP id F02306B0036 for ; Tue, 2 Sep 2014 20:30:40 -0400 (EDT) Received: by mail-pa0-f41.google.com with SMTP id lj1so15952146pab.14 for ; Tue, 02 Sep 2014 17:30:40 -0700 (PDT) Received: from blackbird.sr71.net (www.sr71.net. [198.145.64.142]) by mx.google.com with ESMTP id la16si8124380pab.171.2014.09.02.17.30.39 for ; Tue, 02 Sep 2014 17:30:40 -0700 (PDT) Message-ID: <5406612E.8040802@sr71.net> Date: Tue, 02 Sep 2014 17:30:38 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <20140902221814.GA18069@cmpxchg.org> <5406466D.1020000@sr71.net> <20140903001009.GA25970@cmpxchg.org> In-Reply-To: <20140903001009.GA25970@cmpxchg.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Linus Torvalds , Andrew Morton , LKML , Linux-MM On 09/02/2014 05:10 PM, Johannes Weiner wrote: > On Tue, Sep 02, 2014 at 03:36:29PM -0700, Dave Hansen wrote: >> On 09/02/2014 03:18 PM, Johannes Weiner wrote: >>> Accounting new pages is buffered through per-cpu caches, but taking >>> them off the counters on free is not, so I'm guessing that above a >>> certain allocation rate the cost of locking and changing the counters >>> takes over. Is there a chance you could profile this to see if locks >>> and res_counter-related operations show up? >> >> It looks pretty much the same, although it might have equalized the >> charge and uncharge sides a bit. Full 'perf top' output attached. > > That looks like a partial profile, where did the page allocator, page > zeroing etc. go? Because the distribution among these listed symbols > doesn't seem all that crazy: Perf was only outputting the top 20 functions. Believe it or not, page zeroing and the rest of the allocator path wasn't even in the path of the top 20 functions because there is so much lock contention. Here's a longer run of 'perf top' along with the top 100 functions: http://www.sr71.net/~dave/intel/perf-top-1409702817.txt.gz you can at least see copy_page_rep in there. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-la0-f51.google.com (mail-la0-f51.google.com [209.85.215.51]) by kanga.kvack.org (Postfix) with ESMTP id E508B6B0036 for ; Tue, 2 Sep 2014 21:33:37 -0400 (EDT) Received: by mail-la0-f51.google.com with SMTP id gl10so8910567lab.10 for ; Tue, 02 Sep 2014 18:33:37 -0700 (PDT) Received: from zene.cmpxchg.org (zene.cmpxchg.org. [2a01:238:4224:fa00:ca1f:9ef3:caee:a2bd]) by mx.google.com with ESMTPS id g9si6909706lbv.86.2014.09.02.18.33.35 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 02 Sep 2014 18:33:35 -0700 (PDT) Date: Tue, 2 Sep 2014 21:33:17 -0400 From: Johannes Weiner Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140903013317.GA26086@cmpxchg.org> References: <54061505.8020500@sr71.net> <20140902221814.GA18069@cmpxchg.org> <5406466D.1020000@sr71.net> <20140903001009.GA25970@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Dave Hansen , Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Andrew Morton , LKML , Linux-MM On Tue, Sep 02, 2014 at 05:20:55PM -0700, Linus Torvalds wrote: > On Tue, Sep 2, 2014 at 5:10 PM, Johannes Weiner wrote: > > > > That looks like a partial profile, where did the page allocator, page > > zeroing etc. go? Because the distribution among these listed symbols > > doesn't seem all that crazy: > > Please argue this *after* the commit has been reverted. You guys can > try to make the memcontrol batching actually work and scale later. > It's not appropriate to argue against major regressions when reported > and bisected by users. I'll send a clean revert later. > Showing the spinlock at the top of the profile is very much crazy > (apparently taking 68% of all cpu time), when it's all useless > make-believe work. I don't understand why you wouldn't call that > crazy. If you limit perf to a subset of symbols, it will show a relative distribution between them, i.e: perf top --symbols kfree,memset during some disk access: PerfTop: 1292 irqs/sec kernel:84.4% exact: 0.0% [4000Hz cycles], (all, 4 CPUs) ------------------------------------------------------------------------------- 56.23% [kernel] [k] kfree 41.86% [kernel] [k] memset 1.91% libc-2.19.so [.] memset kfree isn't eating 56% of "all cpu time" here, and it wasn't clear to me whether Dave filtered symbols from only memcontrol.o, memory.o, and mmap.o in a similar way. I'm not arguing against the regression, I'm just trying to make sense of the numbers from the *patched* kernel. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f51.google.com (mail-pa0-f51.google.com [209.85.220.51]) by kanga.kvack.org (Postfix) with ESMTP id A55286B0036 for ; Tue, 2 Sep 2014 23:15:07 -0400 (EDT) Received: by mail-pa0-f51.google.com with SMTP id rd3so16433103pab.10 for ; Tue, 02 Sep 2014 20:15:07 -0700 (PDT) Received: from blackbird.sr71.net (www.sr71.net. [198.145.64.142]) by mx.google.com with ESMTP id ci1si8596883pad.150.2014.09.02.20.15.06 for ; Tue, 02 Sep 2014 20:15:06 -0700 (PDT) Message-ID: <540687B9.7070305@sr71.net> Date: Tue, 02 Sep 2014 20:15:05 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <20140902221814.GA18069@cmpxchg.org> <5406466D.1020000@sr71.net> <20140903001009.GA25970@cmpxchg.org> <20140903013317.GA26086@cmpxchg.org> In-Reply-To: <20140903013317.GA26086@cmpxchg.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner , Linus Torvalds Cc: Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Andrew Morton , LKML , Linux-MM On 09/02/2014 06:33 PM, Johannes Weiner wrote: > kfree isn't eating 56% of "all cpu time" here, and it wasn't clear to > me whether Dave filtered symbols from only memcontrol.o, memory.o, and > mmap.o in a similar way. I'm not arguing against the regression, I'm > just trying to make sense of the numbers from the *patched* kernel. I guess I could have included it in the description, but that was a pretty vanilla run: perf top --call-graph=fp --stdio > foo.txt I didn't use any filtering. I didn't even know I _could_ filter. :) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f169.google.com (mail-wi0-f169.google.com [209.85.212.169]) by kanga.kvack.org (Postfix) with ESMTP id 324D86B0035 for ; Thu, 4 Sep 2014 10:27:28 -0400 (EDT) Received: by mail-wi0-f169.google.com with SMTP id n3so7655125wiv.2 for ; Thu, 04 Sep 2014 07:27:26 -0700 (PDT) Received: from mail-wg0-x232.google.com (mail-wg0-x232.google.com [2a00:1450:400c:c00::232]) by mx.google.com with ESMTPS id u12si2597768wiv.33.2014.09.04.07.27.24 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 04 Sep 2014 07:27:25 -0700 (PDT) Received: by mail-wg0-f50.google.com with SMTP id x12so10361067wgg.9 for ; Thu, 04 Sep 2014 07:27:23 -0700 (PDT) Date: Thu, 4 Sep 2014 16:27:21 +0200 From: Michal Hocko Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140904142721.GB14548@dhcp22.suse.cz> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54062F32.5070504@sr71.net> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Johannes Weiner , Hugh Dickins , Dave Hansen , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML [Sorry to reply so late] On Tue 02-09-14 13:57:22, Dave Hansen wrote: > I, of course, forgot to include the most important detail. This appears > to be pretty run-of-the-mill spinlock contention in the resource counter > code. Nearly 80% of the CPU is spent spinning in the charge or uncharge > paths in the kernel. It is apparently spinning on res_counter->lock in > both the charge and uncharge paths. > > It already does _some_ batching here on the free side, but that > apparently breaks down after ~40 threads. > > It's a no-brainer since the patch in question removed an optimization > skipping the charging, and now we're seeing overhead from the charging. > > Here's the first entry from perf top: > > 80.18% 80.18% [kernel] [k] _raw_spin_lock > | > --- _raw_spin_lock > | > |--66.59%-- res_counter_uncharge_until > | res_counter_uncharge > | uncharge_batch > | uncharge_list > | mem_cgroup_uncharge_list > | release_pages > | free_pages_and_swap_cache Ouch. free_pages_and_swap_cache completely kills the uncharge batching because it reduces it to PAGEVEC_SIZE batches. I think we really do not need PAGEVEC_SIZE batching anymore. We are already batching on tlb_gather layer. That one is limited so I think the below should be safe but I have to think about this some more. There is a risk of prolonged lru_lock wait times but the number of pages is limited to 10k and the heavy work is done outside of the lock. If this is really a problem then we can tear LRU part and the actual freeing/uncharging into a separate functions in this path. Could you test with this half baked patch, please? I didn't get to test it myself unfortunately. --- diff --git a/mm/swap_state.c b/mm/swap_state.c index ef1f39139b71..154444918685 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -265,18 +265,12 @@ void free_page_and_swap_cache(struct page *page) void free_pages_and_swap_cache(struct page **pages, int nr) { struct page **pagep = pages; + int i; lru_add_drain(); - while (nr) { - int todo = min(nr, PAGEVEC_SIZE); - int i; - - for (i = 0; i < todo; i++) - free_swap_cache(pagep[i]); - release_pages(pagep, todo, false); - pagep += todo; - nr -= todo; - } + for (i = 0; i < nr; i++) + free_swap_cache(pagep[i]); + release_pages(pagep, nr, false); } /* -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lb0-f180.google.com (mail-lb0-f180.google.com [209.85.217.180]) by kanga.kvack.org (Postfix) with ESMTP id 8D2936B0035 for ; Thu, 4 Sep 2014 11:09:04 -0400 (EDT) Received: by mail-lb0-f180.google.com with SMTP id w7so11529221lbi.25 for ; Thu, 04 Sep 2014 08:09:03 -0700 (PDT) Received: from zene.cmpxchg.org (zene.cmpxchg.org. [2a01:238:4224:fa00:ca1f:9ef3:caee:a2bd]) by mx.google.com with ESMTPS id df2si7871443lac.21.2014.09.04.08.09.01 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Thu, 04 Sep 2014 08:09:02 -0700 (PDT) Date: Thu, 4 Sep 2014 11:08:46 -0400 From: Johannes Weiner Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140904150846.GA10794@cmpxchg.org> References: <54061505.8020500@sr71.net> <20140902221814.GA18069@cmpxchg.org> <5406466D.1020000@sr71.net> <20140903001009.GA25970@cmpxchg.org> <5406612E.8040802@sr71.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5406612E.8040802@sr71.net> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Linus Torvalds , Andrew Morton , LKML , Linux-MM On Tue, Sep 02, 2014 at 05:30:38PM -0700, Dave Hansen wrote: > On 09/02/2014 05:10 PM, Johannes Weiner wrote: > > On Tue, Sep 02, 2014 at 03:36:29PM -0700, Dave Hansen wrote: > >> On 09/02/2014 03:18 PM, Johannes Weiner wrote: > >>> Accounting new pages is buffered through per-cpu caches, but taking > >>> them off the counters on free is not, so I'm guessing that above a > >>> certain allocation rate the cost of locking and changing the counters > >>> takes over. Is there a chance you could profile this to see if locks > >>> and res_counter-related operations show up? > >> > >> It looks pretty much the same, although it might have equalized the > >> charge and uncharge sides a bit. Full 'perf top' output attached. > > > > That looks like a partial profile, where did the page allocator, page > > zeroing etc. go? Because the distribution among these listed symbols > > doesn't seem all that crazy: > > Perf was only outputting the top 20 functions. Believe it or not, page > zeroing and the rest of the allocator path wasn't even in the path of > the top 20 functions because there is so much lock contention. > > Here's a longer run of 'perf top' along with the top 100 functions: > > http://www.sr71.net/~dave/intel/perf-top-1409702817.txt.gz > > you can at least see copy_page_rep in there. Thanks for the clarification, that is truly horrible. Does the following revert restore performance in your case? --- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f179.google.com (mail-pd0-f179.google.com [209.85.192.179]) by kanga.kvack.org (Postfix) with ESMTP id 754356B0036 for ; Thu, 4 Sep 2014 16:28:09 -0400 (EDT) Received: by mail-pd0-f179.google.com with SMTP id g10so2092370pdj.38 for ; Thu, 04 Sep 2014 13:28:09 -0700 (PDT) Received: from blackbird.sr71.net ([2001:19d0:2:6:209:6bff:fe9a:902]) by mx.google.com with ESMTP id ss2si563497pbc.160.2014.09.04.13.27.32 for ; Thu, 04 Sep 2014 13:27:33 -0700 (PDT) Message-ID: <5408CB2E.3080101@sr71.net> Date: Thu, 04 Sep 2014 13:27:26 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> In-Reply-To: <20140904142721.GB14548@dhcp22.suse.cz> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: Johannes Weiner , Hugh Dickins , Dave Hansen , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML On 09/04/2014 07:27 AM, Michal Hocko wrote: > Ouch. free_pages_and_swap_cache completely kills the uncharge batching > because it reduces it to PAGEVEC_SIZE batches. > > I think we really do not need PAGEVEC_SIZE batching anymore. We are > already batching on tlb_gather layer. That one is limited so I think > the below should be safe but I have to think about this some more. There > is a risk of prolonged lru_lock wait times but the number of pages is > limited to 10k and the heavy work is done outside of the lock. If this > is really a problem then we can tear LRU part and the actual > freeing/uncharging into a separate functions in this path. > > Could you test with this half baked patch, please? I didn't get to test > it myself unfortunately. 3.16 settled out at about 11.5M faults/sec before the regression. This patch gets it back up to about 10.5M, which is good. The top spinlock contention in the kernel is still from the resource counter code via mem_cgroup_commit_charge(), though. I'm running Johannes' patch now. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f180.google.com (mail-pd0-f180.google.com [209.85.192.180]) by kanga.kvack.org (Postfix) with ESMTP id 0F4ED6B0037 for ; Thu, 4 Sep 2014 16:50:39 -0400 (EDT) Received: by mail-pd0-f180.google.com with SMTP id p10so14280600pdj.25 for ; Thu, 04 Sep 2014 13:50:39 -0700 (PDT) Received: from blackbird.sr71.net ([2001:19d0:2:6:209:6bff:fe9a:902]) by mx.google.com with ESMTP id g3si67879pdi.100.2014.09.04.13.50.37 for ; Thu, 04 Sep 2014 13:50:37 -0700 (PDT) Message-ID: <5408D09A.5030000@sr71.net> Date: Thu, 04 Sep 2014 13:50:34 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <20140902221814.GA18069@cmpxchg.org> <5406466D.1020000@sr71.net> <20140903001009.GA25970@cmpxchg.org> <5406612E.8040802@sr71.net> <20140904150846.GA10794@cmpxchg.org> In-Reply-To: <20140904150846.GA10794@cmpxchg.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Linus Torvalds , Andrew Morton , LKML , Linux-MM On 09/04/2014 08:08 AM, Johannes Weiner wrote: > Dave Hansen reports a massive scalability regression in an uncontained > page fault benchmark with more than 30 concurrent threads, which he > bisected down to 05b843012335 ("mm: memcontrol: use root_mem_cgroup > res_counter") and pin-pointed on res_counter spinlock contention. > > That change relied on the per-cpu charge caches to mostly swallow the > res_counter costs, but it's apparent that the caches don't scale yet. > > Revert memcg back to bypassing res_counters on the root level in order > to restore performance for uncontained workloads. A quick sniff test shows performance coming back to what it was around 3.16 with this patch. I'll run a more thorough set of tests and verify that it's working well. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f52.google.com (mail-pa0-f52.google.com [209.85.220.52]) by kanga.kvack.org (Postfix) with ESMTP id 8BF546B0036 for ; Thu, 4 Sep 2014 18:55:50 -0400 (EDT) Received: by mail-pa0-f52.google.com with SMTP id eu11so21079755pac.11 for ; Thu, 04 Sep 2014 15:55:50 -0700 (PDT) Received: from mga02.intel.com (mga02.intel.com. [134.134.136.20]) by mx.google.com with ESMTP id qa6si564992pdb.55.2014.09.04.15.55.48 for ; Thu, 04 Sep 2014 15:55:49 -0700 (PDT) Message-ID: <5408ED7A.5010908@intel.com> Date: Thu, 04 Sep 2014 15:53:46 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> In-Reply-To: <5408CB2E.3080101@sr71.net> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen , Michal Hocko Cc: Johannes Weiner , Hugh Dickins , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML On 09/04/2014 01:27 PM, Dave Hansen wrote: > On 09/04/2014 07:27 AM, Michal Hocko wrote: >> Ouch. free_pages_and_swap_cache completely kills the uncharge batching >> because it reduces it to PAGEVEC_SIZE batches. >> >> I think we really do not need PAGEVEC_SIZE batching anymore. We are >> already batching on tlb_gather layer. That one is limited so I think >> the below should be safe but I have to think about this some more. There >> is a risk of prolonged lru_lock wait times but the number of pages is >> limited to 10k and the heavy work is done outside of the lock. If this >> is really a problem then we can tear LRU part and the actual >> freeing/uncharging into a separate functions in this path. >> >> Could you test with this half baked patch, please? I didn't get to test >> it myself unfortunately. > > 3.16 settled out at about 11.5M faults/sec before the regression. This > patch gets it back up to about 10.5M, which is good. The top spinlock > contention in the kernel is still from the resource counter code via > mem_cgroup_commit_charge(), though. > > I'm running Johannes' patch now. This looks pretty good. The area where it plateaus (above 80 threads where hyperthreading kicks in) might be a bit slower than it was in 3.16, but that could easily be from other things. > https://www.sr71.net/~dave/intel/bb.html?1=3.16.0-rc4-g67b9d76/&2=3.17.0-rc3-g57b252f Feel free to add my Tested-by: -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f174.google.com (mail-we0-f174.google.com [74.125.82.174]) by kanga.kvack.org (Postfix) with ESMTP id E7BBB6B0037 for ; Fri, 5 Sep 2014 04:04:08 -0400 (EDT) Received: by mail-we0-f174.google.com with SMTP id u57so11346085wes.19 for ; Fri, 05 Sep 2014 01:04:08 -0700 (PDT) Received: from mail-wi0-x236.google.com (mail-wi0-x236.google.com [2a00:1450:400c:c05::236]) by mx.google.com with ESMTPS id jw2si790785wjc.74.2014.09.05.01.04.06 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 05 Sep 2014 01:04:07 -0700 (PDT) Received: by mail-wi0-f182.google.com with SMTP id z2so2466225wiv.9 for ; Fri, 05 Sep 2014 01:04:06 -0700 (PDT) Date: Fri, 5 Sep 2014 10:04:04 +0200 From: Michal Hocko Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140905080404.GA26243@dhcp22.suse.cz> References: <54061505.8020500@sr71.net> <20140902221814.GA18069@cmpxchg.org> <5406466D.1020000@sr71.net> <20140903001009.GA25970@cmpxchg.org> <5406612E.8040802@sr71.net> <20140904150846.GA10794@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140904150846.GA10794@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Dave Hansen , Hugh Dickins , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML On Thu 04-09-14 11:08:46, Johannes Weiner wrote: [...] > From 6fa7599054868cd0df940d7b0973dd64f8acb0b5 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Thu, 4 Sep 2014 10:04:34 -0400 > Subject: [patch] mm: memcontrol: revert use of root_mem_cgroup res_counter > > Dave Hansen reports a massive scalability regression in an uncontained > page fault benchmark with more than 30 concurrent threads, which he > bisected down to 05b843012335 ("mm: memcontrol: use root_mem_cgroup > res_counter") and pin-pointed on res_counter spinlock contention. > > That change relied on the per-cpu charge caches to mostly swallow the > res_counter costs, but it's apparent that the caches don't scale yet. > > Revert memcg back to bypassing res_counters on the root level in order > to restore performance for uncontained workloads. > > Reported-by: Dave Hansen > Signed-off-by: Johannes Weiner The revert looks good to me. Acked-by: Michal Hocko > --- > mm/memcontrol.c | 103 ++++++++++++++++++++++++++++++++++++++++++-------------- > 1 file changed, 78 insertions(+), 25 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index ec4dcf1b9562..085dc6d2f876 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2534,6 +2534,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, > unsigned long long size; > int ret = 0; > > + if (mem_cgroup_is_root(memcg)) > + goto done; > retry: > if (consume_stock(memcg, nr_pages)) > goto done; > @@ -2611,9 +2613,7 @@ nomem: > if (!(gfp_mask & __GFP_NOFAIL)) > return -ENOMEM; > bypass: > - memcg = root_mem_cgroup; > - ret = -EINTR; > - goto retry; > + return -EINTR; > > done_restock: > if (batch > nr_pages) > @@ -2626,6 +2626,9 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages) > { > unsigned long bytes = nr_pages * PAGE_SIZE; > > + if (mem_cgroup_is_root(memcg)) > + return; > + > res_counter_uncharge(&memcg->res, bytes); > if (do_swap_account) > res_counter_uncharge(&memcg->memsw, bytes); > @@ -2640,6 +2643,9 @@ static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg, > { > unsigned long bytes = nr_pages * PAGE_SIZE; > > + if (mem_cgroup_is_root(memcg)) > + return; > + > res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes); > if (do_swap_account) > res_counter_uncharge_until(&memcg->memsw, > @@ -4093,6 +4099,46 @@ out: > return retval; > } > > +static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg, > + enum mem_cgroup_stat_index idx) > +{ > + struct mem_cgroup *iter; > + long val = 0; > + > + /* Per-cpu values can be negative, use a signed accumulator */ > + for_each_mem_cgroup_tree(iter, memcg) > + val += mem_cgroup_read_stat(iter, idx); > + > + if (val < 0) /* race ? */ > + val = 0; > + return val; > +} > + > +static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) > +{ > + u64 val; > + > + if (!mem_cgroup_is_root(memcg)) { > + if (!swap) > + return res_counter_read_u64(&memcg->res, RES_USAGE); > + else > + return res_counter_read_u64(&memcg->memsw, RES_USAGE); > + } > + > + /* > + * Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS > + * as well as in MEM_CGROUP_STAT_RSS_HUGE. > + */ > + val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE); > + val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS); > + > + if (swap) > + val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAP); > + > + return val << PAGE_SHIFT; > +} > + > + > static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, > struct cftype *cft) > { > @@ -4102,8 +4148,12 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, > > switch (type) { > case _MEM: > + if (name == RES_USAGE) > + return mem_cgroup_usage(memcg, false); > return res_counter_read_u64(&memcg->res, name); > case _MEMSWAP: > + if (name == RES_USAGE) > + return mem_cgroup_usage(memcg, true); > return res_counter_read_u64(&memcg->memsw, name); > case _KMEM: > return res_counter_read_u64(&memcg->kmem, name); > @@ -4572,10 +4622,7 @@ static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap) > if (!t) > goto unlock; > > - if (!swap) > - usage = res_counter_read_u64(&memcg->res, RES_USAGE); > - else > - usage = res_counter_read_u64(&memcg->memsw, RES_USAGE); > + usage = mem_cgroup_usage(memcg, swap); > > /* > * current_threshold points to threshold just below or equal to usage. > @@ -4673,10 +4720,10 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg, > > if (type == _MEM) { > thresholds = &memcg->thresholds; > - usage = res_counter_read_u64(&memcg->res, RES_USAGE); > + usage = mem_cgroup_usage(memcg, false); > } else if (type == _MEMSWAP) { > thresholds = &memcg->memsw_thresholds; > - usage = res_counter_read_u64(&memcg->memsw, RES_USAGE); > + usage = mem_cgroup_usage(memcg, true); > } else > BUG(); > > @@ -4762,10 +4809,10 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg, > > if (type == _MEM) { > thresholds = &memcg->thresholds; > - usage = res_counter_read_u64(&memcg->res, RES_USAGE); > + usage = mem_cgroup_usage(memcg, false); > } else if (type == _MEMSWAP) { > thresholds = &memcg->memsw_thresholds; > - usage = res_counter_read_u64(&memcg->memsw, RES_USAGE); > + usage = mem_cgroup_usage(memcg, true); > } else > BUG(); > > @@ -5525,9 +5572,9 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css) > * core guarantees its existence. > */ > } else { > - res_counter_init(&memcg->res, &root_mem_cgroup->res); > - res_counter_init(&memcg->memsw, &root_mem_cgroup->memsw); > - res_counter_init(&memcg->kmem, &root_mem_cgroup->kmem); > + res_counter_init(&memcg->res, NULL); > + res_counter_init(&memcg->memsw, NULL); > + res_counter_init(&memcg->kmem, NULL); > /* > * Deeper hierachy with use_hierarchy == false doesn't make > * much sense so let cgroup subsystem know about this > @@ -5969,8 +6016,9 @@ static void __mem_cgroup_clear_mc(void) > /* we must fixup refcnts and charges */ > if (mc.moved_swap) { > /* uncharge swap account from the old cgroup */ > - res_counter_uncharge(&mc.from->memsw, > - PAGE_SIZE * mc.moved_swap); > + if (!mem_cgroup_is_root(mc.from)) > + res_counter_uncharge(&mc.from->memsw, > + PAGE_SIZE * mc.moved_swap); > > for (i = 0; i < mc.moved_swap; i++) > css_put(&mc.from->css); > @@ -5979,8 +6027,9 @@ static void __mem_cgroup_clear_mc(void) > * we charged both to->res and to->memsw, so we should > * uncharge to->res. > */ > - res_counter_uncharge(&mc.to->res, > - PAGE_SIZE * mc.moved_swap); > + if (!mem_cgroup_is_root(mc.to)) > + res_counter_uncharge(&mc.to->res, > + PAGE_SIZE * mc.moved_swap); > /* we've already done css_get(mc.to) */ > mc.moved_swap = 0; > } > @@ -6345,7 +6394,8 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry) > rcu_read_lock(); > memcg = mem_cgroup_lookup(id); > if (memcg) { > - res_counter_uncharge(&memcg->memsw, PAGE_SIZE); > + if (!mem_cgroup_is_root(memcg)) > + res_counter_uncharge(&memcg->memsw, PAGE_SIZE); > mem_cgroup_swap_statistics(memcg, false); > css_put(&memcg->css); > } > @@ -6509,12 +6559,15 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout, > { > unsigned long flags; > > - if (nr_mem) > - res_counter_uncharge(&memcg->res, nr_mem * PAGE_SIZE); > - if (nr_memsw) > - res_counter_uncharge(&memcg->memsw, nr_memsw * PAGE_SIZE); > - > - memcg_oom_recover(memcg); > + if (!mem_cgroup_is_root(memcg)) { > + if (nr_mem) > + res_counter_uncharge(&memcg->res, > + nr_mem * PAGE_SIZE); > + if (nr_memsw) > + res_counter_uncharge(&memcg->memsw, > + nr_memsw * PAGE_SIZE); > + memcg_oom_recover(memcg); > + } > > local_irq_save(flags); > __this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS], nr_anon); > -- > 2.0.4 > > -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f52.google.com (mail-wg0-f52.google.com [74.125.82.52]) by kanga.kvack.org (Postfix) with ESMTP id 7A4AB6B0036 for ; Fri, 5 Sep 2014 05:25:42 -0400 (EDT) Received: by mail-wg0-f52.google.com with SMTP id m15so11342751wgh.23 for ; Fri, 05 Sep 2014 02:25:41 -0700 (PDT) Received: from mail-wi0-x229.google.com (mail-wi0-x229.google.com [2a00:1450:400c:c05::229]) by mx.google.com with ESMTPS id hj19si2169361wib.36.2014.09.05.02.25.40 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 05 Sep 2014 02:25:40 -0700 (PDT) Received: by mail-wi0-f169.google.com with SMTP id n3so461819wiv.2 for ; Fri, 05 Sep 2014 02:25:40 -0700 (PDT) Date: Fri, 5 Sep 2014 11:25:37 +0200 From: Michal Hocko Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140905092537.GC26243@dhcp22.suse.cz> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5408CB2E.3080101@sr71.net> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Johannes Weiner , Hugh Dickins , Dave Hansen , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML On Thu 04-09-14 13:27:26, Dave Hansen wrote: > On 09/04/2014 07:27 AM, Michal Hocko wrote: > > Ouch. free_pages_and_swap_cache completely kills the uncharge batching > > because it reduces it to PAGEVEC_SIZE batches. > > > > I think we really do not need PAGEVEC_SIZE batching anymore. We are > > already batching on tlb_gather layer. That one is limited so I think > > the below should be safe but I have to think about this some more. There > > is a risk of prolonged lru_lock wait times but the number of pages is > > limited to 10k and the heavy work is done outside of the lock. If this > > is really a problem then we can tear LRU part and the actual > > freeing/uncharging into a separate functions in this path. > > > > Could you test with this half baked patch, please? I didn't get to test > > it myself unfortunately. > > 3.16 settled out at about 11.5M faults/sec before the regression. This > patch gets it back up to about 10.5M, which is good. Dave, would you be willing to test the following patch as well? I do not have a huge machine at hand right now. It would be great if you could run the same load within a !root memcg. We have basically the same sub-optimality there as well. The root bypass will be re-introduced for now but I think we can make the regular memcg load better regardless and this would be also preparation for later root bypass removal again. --- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f181.google.com (mail-wi0-f181.google.com [209.85.212.181]) by kanga.kvack.org (Postfix) with ESMTP id D6B3A6B0037 for ; Fri, 5 Sep 2014 05:28:46 -0400 (EDT) Received: by mail-wi0-f181.google.com with SMTP id e4so2618556wiv.8 for ; Fri, 05 Sep 2014 02:28:46 -0700 (PDT) Received: from mail-wg0-x22a.google.com (mail-wg0-x22a.google.com [2a00:1450:400c:c00::22a]) by mx.google.com with ESMTPS id s20si2144015wiv.55.2014.09.05.02.28.43 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 05 Sep 2014 02:28:44 -0700 (PDT) Received: by mail-wg0-f42.google.com with SMTP id b13so11336884wgh.25 for ; Fri, 05 Sep 2014 02:28:43 -0700 (PDT) Date: Fri, 5 Sep 2014 11:28:41 +0200 From: Michal Hocko Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140905092841.GD26243@dhcp22.suse.cz> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <5408ED7A.5010908@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5408ED7A.5010908@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Dave Hansen , Johannes Weiner , Hugh Dickins , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML On Thu 04-09-14 15:53:46, Dave Hansen wrote: > On 09/04/2014 01:27 PM, Dave Hansen wrote: > > On 09/04/2014 07:27 AM, Michal Hocko wrote: > >> Ouch. free_pages_and_swap_cache completely kills the uncharge batching > >> because it reduces it to PAGEVEC_SIZE batches. > >> > >> I think we really do not need PAGEVEC_SIZE batching anymore. We are > >> already batching on tlb_gather layer. That one is limited so I think > >> the below should be safe but I have to think about this some more. There > >> is a risk of prolonged lru_lock wait times but the number of pages is > >> limited to 10k and the heavy work is done outside of the lock. If this > >> is really a problem then we can tear LRU part and the actual > >> freeing/uncharging into a separate functions in this path. > >> > >> Could you test with this half baked patch, please? I didn't get to test > >> it myself unfortunately. > > > > 3.16 settled out at about 11.5M faults/sec before the regression. This > > patch gets it back up to about 10.5M, which is good. The top spinlock > > contention in the kernel is still from the resource counter code via > > mem_cgroup_commit_charge(), though. > > > > I'm running Johannes' patch now. > > This looks pretty good. The area where it plateaus (above 80 threads > where hyperthreading kicks in) might be a bit slower than it was in > 3.16, but that could easily be from other things. Good news indeed. But I think it would be safer to apply Johannes' revert for now. Both changes are still worth having anyway because they have potential to improve memcg case. > > https://www.sr71.net/~dave/intel/bb.html?1=3.16.0-rc4-g67b9d76/&2=3.17.0-rc3-g57b252f > > Feel free to add my Tested-by: Thanks a lot! I have posted another patch which reduces the batching for LRU handling because this would be too risky. So I haven't added your Tested-by yet. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lb0-f177.google.com (mail-lb0-f177.google.com [209.85.217.177]) by kanga.kvack.org (Postfix) with ESMTP id 78FA86B0036 for ; Fri, 5 Sep 2014 08:35:40 -0400 (EDT) Received: by mail-lb0-f177.google.com with SMTP id l4so855826lbv.8 for ; Fri, 05 Sep 2014 05:35:39 -0700 (PDT) Received: from zene.cmpxchg.org (zene.cmpxchg.org. [2a01:238:4224:fa00:ca1f:9ef3:caee:a2bd]) by mx.google.com with ESMTPS id o7si2814005lbi.59.2014.09.05.05.35.34 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 05 Sep 2014 05:35:34 -0700 (PDT) Date: Fri, 5 Sep 2014 08:35:17 -0400 From: Johannes Weiner Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140905123517.GA21208@cmpxchg.org> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5408CB2E.3080101@sr71.net> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Michal Hocko , Hugh Dickins , Dave Hansen , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML On Thu, Sep 04, 2014 at 01:27:26PM -0700, Dave Hansen wrote: > On 09/04/2014 07:27 AM, Michal Hocko wrote: > > Ouch. free_pages_and_swap_cache completely kills the uncharge batching > > because it reduces it to PAGEVEC_SIZE batches. > > > > I think we really do not need PAGEVEC_SIZE batching anymore. We are > > already batching on tlb_gather layer. That one is limited so I think > > the below should be safe but I have to think about this some more. There > > is a risk of prolonged lru_lock wait times but the number of pages is > > limited to 10k and the heavy work is done outside of the lock. If this > > is really a problem then we can tear LRU part and the actual > > freeing/uncharging into a separate functions in this path. > > > > Could you test with this half baked patch, please? I didn't get to test > > it myself unfortunately. > > 3.16 settled out at about 11.5M faults/sec before the regression. This > patch gets it back up to about 10.5M, which is good. The top spinlock > contention in the kernel is still from the resource counter code via > mem_cgroup_commit_charge(), though. Thanks for testing, that looks a lot better. But commit doesn't touch resource counters - did you mean try_charge() or uncharge() by any chance? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lb0-f182.google.com (mail-lb0-f182.google.com [209.85.217.182]) by kanga.kvack.org (Postfix) with ESMTP id D6BF96B0036 for ; Fri, 5 Sep 2014 10:47:44 -0400 (EDT) Received: by mail-lb0-f182.google.com with SMTP id u10so13655344lbd.13 for ; Fri, 05 Sep 2014 07:47:41 -0700 (PDT) Received: from zene.cmpxchg.org (zene.cmpxchg.org. [2a01:238:4224:fa00:ca1f:9ef3:caee:a2bd]) by mx.google.com with ESMTPS id ab3si3381200lbc.52.2014.09.05.07.47.38 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 05 Sep 2014 07:47:39 -0700 (PDT) Date: Fri, 5 Sep 2014 10:47:23 -0400 From: Johannes Weiner Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140905144723.GB13392@cmpxchg.org> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <20140905092537.GC26243@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140905092537.GC26243@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: Dave Hansen , Hugh Dickins , Dave Hansen , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML On Fri, Sep 05, 2014 at 11:25:37AM +0200, Michal Hocko wrote: > @@ -900,10 +900,10 @@ void lru_add_drain_all(void) > * grabbed the page via the LRU. If it did, give up: shrink_inactive_list() > * will free it. > */ > -void release_pages(struct page **pages, int nr, bool cold) > +static void release_lru_pages(struct page **pages, int nr, > + struct list_head *pages_to_free) > { > int i; > - LIST_HEAD(pages_to_free); > struct zone *zone = NULL; > struct lruvec *lruvec; > unsigned long uninitialized_var(flags); > @@ -943,11 +943,26 @@ void release_pages(struct page **pages, int nr, bool cold) > /* Clear Active bit in case of parallel mark_page_accessed */ > __ClearPageActive(page); > > - list_add(&page->lru, &pages_to_free); > + list_add(&page->lru, pages_to_free); > } > if (zone) > spin_unlock_irqrestore(&zone->lru_lock, flags); > +} > +/* > + * Batched page_cache_release(). Frees and uncharges all given pages > + * for which the reference count drops to 0. > + */ > +void release_pages(struct page **pages, int nr, bool cold) > +{ > + LIST_HEAD(pages_to_free); > > + while (nr) { > + int batch = min(nr, PAGEVEC_SIZE); > + > + release_lru_pages(pages, batch, &pages_to_free); > + pages += batch; > + nr -= batch; > + } We might be able to process a lot more pages in one go if nobody else needs the lock or the CPU. Can't we just cycle the lock or reschedule if necessary? diff --git a/mm/swap.c b/mm/swap.c index 6b2dc3897cd5..ee0cf21dd521 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -944,6 +944,15 @@ void release_pages(struct page **pages, int nr, bool cold) __ClearPageActive(page); list_add(&page->lru, &pages_to_free); + + if (should_resched() || + (zone && spin_needbreak(&zone->lru_lock))) { + if (zone) { + spin_unlock_irqrestore(&zone->lru_lock, flags); + zone = NULL; + } + cond_resched(); + } } if (zone) spin_unlock_irqrestore(&zone->lru_lock, flags); diff --git a/mm/swap_state.c b/mm/swap_state.c index 3e0ec83d000c..c487ca4682a4 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -262,19 +262,12 @@ void free_page_and_swap_cache(struct page *page) */ void free_pages_and_swap_cache(struct page **pages, int nr) { - struct page **pagep = pages; + int i; lru_add_drain(); - while (nr) { - int todo = min(nr, PAGEVEC_SIZE); - int i; - - for (i = 0; i < todo; i++) - free_swap_cache(pagep[i]); - release_pages(pagep, todo, false); - pagep += todo; - nr -= todo; - } + for (i = 0; i < nr; i++) + free_swap_cache(pages[i]); + release_pages(pages, nr, false); } /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f52.google.com (mail-wg0-f52.google.com [74.125.82.52]) by kanga.kvack.org (Postfix) with ESMTP id D9E7E6B0036 for ; Fri, 5 Sep 2014 11:39:52 -0400 (EDT) Received: by mail-wg0-f52.google.com with SMTP id m15so11851132wgh.35 for ; Fri, 05 Sep 2014 08:39:52 -0700 (PDT) Received: from mail-we0-x22e.google.com (mail-we0-x22e.google.com [2a00:1450:400c:c03::22e]) by mx.google.com with ESMTPS id r6si4014014wif.64.2014.09.05.08.39.51 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 05 Sep 2014 08:39:51 -0700 (PDT) Received: by mail-we0-f174.google.com with SMTP id w61so541399wes.33 for ; Fri, 05 Sep 2014 08:39:50 -0700 (PDT) Date: Fri, 5 Sep 2014 17:39:48 +0200 From: Michal Hocko Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140905153948.GH26243@dhcp22.suse.cz> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <20140905092537.GC26243@dhcp22.suse.cz> <20140905144723.GB13392@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140905144723.GB13392@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Dave Hansen , Hugh Dickins , Dave Hansen , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML On Fri 05-09-14 10:47:23, Johannes Weiner wrote: > On Fri, Sep 05, 2014 at 11:25:37AM +0200, Michal Hocko wrote: > > @@ -900,10 +900,10 @@ void lru_add_drain_all(void) > > * grabbed the page via the LRU. If it did, give up: shrink_inactive_list() > > * will free it. > > */ > > -void release_pages(struct page **pages, int nr, bool cold) > > +static void release_lru_pages(struct page **pages, int nr, > > + struct list_head *pages_to_free) > > { > > int i; > > - LIST_HEAD(pages_to_free); > > struct zone *zone = NULL; > > struct lruvec *lruvec; > > unsigned long uninitialized_var(flags); > > @@ -943,11 +943,26 @@ void release_pages(struct page **pages, int nr, bool cold) > > /* Clear Active bit in case of parallel mark_page_accessed */ > > __ClearPageActive(page); > > > > - list_add(&page->lru, &pages_to_free); > > + list_add(&page->lru, pages_to_free); > > } > > if (zone) > > spin_unlock_irqrestore(&zone->lru_lock, flags); > > +} > > +/* > > + * Batched page_cache_release(). Frees and uncharges all given pages > > + * for which the reference count drops to 0. > > + */ > > +void release_pages(struct page **pages, int nr, bool cold) > > +{ > > + LIST_HEAD(pages_to_free); > > > > + while (nr) { > > + int batch = min(nr, PAGEVEC_SIZE); > > + > > + release_lru_pages(pages, batch, &pages_to_free); > > + pages += batch; > > + nr -= batch; > > + } > > We might be able to process a lot more pages in one go if nobody else > needs the lock or the CPU. Can't we just cycle the lock or reschedule > if necessary? Is it safe to cond_resched here for all callers? I hope it is but there are way too many callers to check so I am not 100% sure. Besides that spin_needbreak doesn't seem to be available for all architectures. git grep "arch_spin_is_contended(" -- arch/ arch/arm/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock) arch/arm64/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock) arch/ia64/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock) arch/mips/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock) arch/x86/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock) Moreover it doesn't seem to do anything for !CONFIG_PREEMPT but this should be trivial to fix. I am also not sure this will work well in all cases. If we have a heavy reclaim activity on other CPUs then this path might be interrupted too often resulting in too much lock bouncing. So I guess we want at least few pages to be processed in one run. On the other hand if the lock is not contended then doing batches and retake the lock shouldn't add too much overhead, no? > diff --git a/mm/swap.c b/mm/swap.c > index 6b2dc3897cd5..ee0cf21dd521 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -944,6 +944,15 @@ void release_pages(struct page **pages, int nr, bool cold) > __ClearPageActive(page); > > list_add(&page->lru, &pages_to_free); > + > + if (should_resched() || > + (zone && spin_needbreak(&zone->lru_lock))) { > + if (zone) { > + spin_unlock_irqrestore(&zone->lru_lock, flags); > + zone = NULL; > + } > + cond_resched(); > + } > } > if (zone) > spin_unlock_irqrestore(&zone->lru_lock, flags); > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 3e0ec83d000c..c487ca4682a4 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -262,19 +262,12 @@ void free_page_and_swap_cache(struct page *page) > */ > void free_pages_and_swap_cache(struct page **pages, int nr) > { > - struct page **pagep = pages; > + int i; > > lru_add_drain(); > - while (nr) { > - int todo = min(nr, PAGEVEC_SIZE); > - int i; > - > - for (i = 0; i < todo; i++) > - free_swap_cache(pagep[i]); > - release_pages(pagep, todo, false); > - pagep += todo; > - nr -= todo; > - } > + for (i = 0; i < nr; i++) > + free_swap_cache(pages[i]); > + release_pages(pages, nr, false); > } > > /* -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f44.google.com (mail-pa0-f44.google.com [209.85.220.44]) by kanga.kvack.org (Postfix) with ESMTP id E67C56B0036 for ; Mon, 8 Sep 2014 11:48:25 -0400 (EDT) Received: by mail-pa0-f44.google.com with SMTP id kx10so3362456pab.3 for ; Mon, 08 Sep 2014 08:48:25 -0700 (PDT) Received: from mga09.intel.com (mga09.intel.com. [134.134.136.24]) by mx.google.com with ESMTP id bs9si18541973pdb.145.2014.09.08.08.48.23 for ; Mon, 08 Sep 2014 08:48:23 -0700 (PDT) Message-ID: <540DCF99.2070900@intel.com> Date: Mon, 08 Sep 2014 08:47:37 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <20140905123517.GA21208@cmpxchg.org> In-Reply-To: <20140905123517.GA21208@cmpxchg.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner , Dave Hansen Cc: Michal Hocko , Hugh Dickins , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML On 09/05/2014 05:35 AM, Johannes Weiner wrote: > On Thu, Sep 04, 2014 at 01:27:26PM -0700, Dave Hansen wrote: >> On 09/04/2014 07:27 AM, Michal Hocko wrote: >>> Ouch. free_pages_and_swap_cache completely kills the uncharge batching >>> because it reduces it to PAGEVEC_SIZE batches. >>> >>> I think we really do not need PAGEVEC_SIZE batching anymore. We are >>> already batching on tlb_gather layer. That one is limited so I think >>> the below should be safe but I have to think about this some more. There >>> is a risk of prolonged lru_lock wait times but the number of pages is >>> limited to 10k and the heavy work is done outside of the lock. If this >>> is really a problem then we can tear LRU part and the actual >>> freeing/uncharging into a separate functions in this path. >>> >>> Could you test with this half baked patch, please? I didn't get to test >>> it myself unfortunately. >> >> 3.16 settled out at about 11.5M faults/sec before the regression. This >> patch gets it back up to about 10.5M, which is good. The top spinlock >> contention in the kernel is still from the resource counter code via >> mem_cgroup_commit_charge(), though. > > Thanks for testing, that looks a lot better. > > But commit doesn't touch resource counters - did you mean try_charge() > or uncharge() by any chance? I don't have the perf output that I was looking at when I said this, but here's the path that I think I was referring to. The inlining makes this non-obvious, but this memcg_check_events() calls mem_cgroup_update_tree() which is contending on mctz->lock. So, you were right, it's not the resource counters code, it's a lock in 'struct mem_cgroup_tree_per_zone'. But, the contention isn't _that_ high (2% of CPU) in this case. But, that is 2% that we didn't see before. > 1.87% 1.87% [kernel] [k] _raw_spin_lock_irqsave > | > --- _raw_spin_lock_irqsave > | > |--107.09%-- memcg_check_events > | | > | |--79.98%-- mem_cgroup_commit_charge > | | | > | | |--99.81%-- do_cow_fault > | | | handle_mm_fault > | | | __do_page_fault > | | | do_page_fault > | | | page_fault > | | | testcase > | | --0.19%-- [...] -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f48.google.com (mail-wg0-f48.google.com [74.125.82.48]) by kanga.kvack.org (Postfix) with ESMTP id 495C36B0044 for ; Tue, 9 Sep 2014 10:50:58 -0400 (EDT) Received: by mail-wg0-f48.google.com with SMTP id m15so3154874wgh.7 for ; Tue, 09 Sep 2014 07:50:57 -0700 (PDT) Received: from gum.cmpxchg.org (gum.cmpxchg.org. [85.214.110.215]) by mx.google.com with ESMTPS id s6si18328100wiy.42.2014.09.09.07.50.56 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 09 Sep 2014 07:50:56 -0700 (PDT) Date: Tue, 9 Sep 2014 10:50:44 -0400 From: Johannes Weiner Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140909145044.GA16027@cmpxchg.org> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <20140905123517.GA21208@cmpxchg.org> <540DCF99.2070900@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <540DCF99.2070900@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Dave Hansen , Michal Hocko , Hugh Dickins , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML On Mon, Sep 08, 2014 at 08:47:37AM -0700, Dave Hansen wrote: > On 09/05/2014 05:35 AM, Johannes Weiner wrote: > > On Thu, Sep 04, 2014 at 01:27:26PM -0700, Dave Hansen wrote: > >> On 09/04/2014 07:27 AM, Michal Hocko wrote: > >>> Ouch. free_pages_and_swap_cache completely kills the uncharge batching > >>> because it reduces it to PAGEVEC_SIZE batches. > >>> > >>> I think we really do not need PAGEVEC_SIZE batching anymore. We are > >>> already batching on tlb_gather layer. That one is limited so I think > >>> the below should be safe but I have to think about this some more. There > >>> is a risk of prolonged lru_lock wait times but the number of pages is > >>> limited to 10k and the heavy work is done outside of the lock. If this > >>> is really a problem then we can tear LRU part and the actual > >>> freeing/uncharging into a separate functions in this path. > >>> > >>> Could you test with this half baked patch, please? I didn't get to test > >>> it myself unfortunately. > >> > >> 3.16 settled out at about 11.5M faults/sec before the regression. This > >> patch gets it back up to about 10.5M, which is good. The top spinlock > >> contention in the kernel is still from the resource counter code via > >> mem_cgroup_commit_charge(), though. > > > > Thanks for testing, that looks a lot better. > > > > But commit doesn't touch resource counters - did you mean try_charge() > > or uncharge() by any chance? > > I don't have the perf output that I was looking at when I said this, but > here's the path that I think I was referring to. The inlining makes > this non-obvious, but this memcg_check_events() calls > mem_cgroup_update_tree() which is contending on mctz->lock. > > So, you were right, it's not the resource counters code, it's a lock in > 'struct mem_cgroup_tree_per_zone'. But, the contention isn't _that_ > high (2% of CPU) in this case. But, that is 2% that we didn't see before. > > > 1.87% 1.87% [kernel] [k] _raw_spin_lock_irqsave > > | > > --- _raw_spin_lock_irqsave > > | > > |--107.09%-- memcg_check_events > > | | > > | |--79.98%-- mem_cgroup_commit_charge > > | | | > > | | |--99.81%-- do_cow_fault > > | | | handle_mm_fault > > | | | __do_page_fault > > | | | do_page_fault > > | | | page_fault > > | | | testcase > > | | --0.19%-- [...] The mctz->lock is only taken when there is, or has been, soft limit excess. However, the soft limit defaults to infinity, so unless you set it explicitly on the root level, I can't see how this could be mctz->lock contention. It's more plausible that this is the res_counter lock for testing soft limit excess - for me, both these locks get inlined into check_events, could you please double check you got the right lock? As the limit defaults to infinity, and really doesn't mean anything on the root level it's idiotic to test it, we can easily eliminate that. With the patch below, I don't have that trace show up in the profile anymore. Could you please give it a try? You also said that this cost hasn't been there before, but I do see that trace in both v3.16 and v3.17-rc3 with roughly the same impact (although my machines show less contention than yours). Could you please double check that this is in fact a regression independent of 05b843012335 ("mm: memcontrol: use root_mem_cgroup res_counter")? Thanks! --- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f51.google.com (mail-pa0-f51.google.com [209.85.220.51]) by kanga.kvack.org (Postfix) with ESMTP id 403BD6B0096 for ; Tue, 9 Sep 2014 14:23:24 -0400 (EDT) Received: by mail-pa0-f51.google.com with SMTP id kx10so5930507pab.10 for ; Tue, 09 Sep 2014 11:23:23 -0700 (PDT) Received: from blackbird.sr71.net ([2001:19d0:2:6:209:6bff:fe9a:902]) by mx.google.com with ESMTP id t5si24542612pdd.122.2014.09.09.11.23.17 for ; Tue, 09 Sep 2014 11:23:17 -0700 (PDT) Message-ID: <540F4592.9030408@sr71.net> Date: Tue, 09 Sep 2014 11:23:14 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <20140905123517.GA21208@cmpxchg.org> <540DCF99.2070900@intel.com> <20140909145044.GA16027@cmpxchg.org> In-Reply-To: <20140909145044.GA16027@cmpxchg.org> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner , Dave Hansen Cc: Michal Hocko , Hugh Dickins , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML On 09/09/2014 07:50 AM, Johannes Weiner wrote: > The mctz->lock is only taken when there is, or has been, soft limit > excess. However, the soft limit defaults to infinity, so unless you > set it explicitly on the root level, I can't see how this could be > mctz->lock contention. > > It's more plausible that this is the res_counter lock for testing soft > limit excess - for me, both these locks get inlined into check_events, > could you please double check you got the right lock? I got the wrong lock. Here's how it looks after mainline, plus your free_pages_and_swap_cache() patch: Samples: 2M of event 'cycles', Event count (approx.): 51647128377 + 60.60% 1.33% page_fault2_processes [.] testcase a?? + 59.14% 0.41% [kernel] [k] page_fault a?? + 58.72% 0.01% [kernel] [k] do_page_fault a?? + 58.70% 0.08% [kernel] [k] __do_page_fault a?? + 58.50% 0.29% [kernel] [k] handle_mm_fault a?? + 40.14% 0.28% [kernel] [k] do_cow_fault a?? - 34.56% 34.56% [kernel] [k] _raw_spin_lock a?? - _raw_spin_lock a?? - 78.11% __res_counter_charge a?? res_counter_charge a?? try_charge a?? - mem_cgroup_try_charge a?? + 99.99% do_cow_fault a?? - 10.30% res_counter_uncharge_until a?? res_counter_uncharge a?? uncharge_batch a?? uncharge_list a?? mem_cgroup_uncharge_list a?? release_pages a?? + 4.75% free_pcppages_bulk a?? + 3.65% do_cow_fault a?? + 2.24% get_page_from_freelist a?? > You also said that this cost hasn't been there before, but I do see > that trace in both v3.16 and v3.17-rc3 with roughly the same impact > (although my machines show less contention than yours). Could you > please double check that this is in fact a regression independent of > 05b843012335 ("mm: memcontrol: use root_mem_cgroup res_counter")? Here's the same workload on the same machine with only Johannes' revert applied: - 35.92% 35.92% [kernel] [k] _raw_spin_lock a?? - _raw_spin_lock a?? - 49.09% get_page_from_freelist a?? - __alloc_pages_nodemask a?? + 99.90% alloc_pages_vma a?? - 43.67% free_pcppages_bulk a?? - 100.00% free_hot_cold_page a?? + 99.93% free_hot_cold_page_list a?? - 7.08% do_cow_fault a?? handle_mm_fault a?? __do_page_fault a?? do_page_fault a?? page_fault a?? testcase a?? So I think it's probably part of the same regression. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f171.google.com (mail-wi0-f171.google.com [209.85.212.171]) by kanga.kvack.org (Postfix) with ESMTP id A329B6B0036 for ; Wed, 10 Sep 2014 12:29:50 -0400 (EDT) Received: by mail-wi0-f171.google.com with SMTP id hi2so6628363wib.4 for ; Wed, 10 Sep 2014 09:29:50 -0700 (PDT) Received: from mail-wg0-x22c.google.com (mail-wg0-x22c.google.com [2a00:1450:400c:c00::22c]) by mx.google.com with ESMTPS id ev6si21188898wjb.171.2014.09.10.09.29.38 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 10 Sep 2014 09:29:39 -0700 (PDT) Received: by mail-wg0-f44.google.com with SMTP id y10so4401139wgg.15 for ; Wed, 10 Sep 2014 09:29:38 -0700 (PDT) Date: Wed, 10 Sep 2014 18:29:36 +0200 From: Michal Hocko Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140910162936.GI25219@dhcp22.suse.cz> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <20140905092537.GC26243@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140905092537.GC26243@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Johannes Weiner , Hugh Dickins , Dave Hansen , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML On Fri 05-09-14 11:25:37, Michal Hocko wrote: > On Thu 04-09-14 13:27:26, Dave Hansen wrote: > > On 09/04/2014 07:27 AM, Michal Hocko wrote: > > > Ouch. free_pages_and_swap_cache completely kills the uncharge batching > > > because it reduces it to PAGEVEC_SIZE batches. > > > > > > I think we really do not need PAGEVEC_SIZE batching anymore. We are > > > already batching on tlb_gather layer. That one is limited so I think > > > the below should be safe but I have to think about this some more. There > > > is a risk of prolonged lru_lock wait times but the number of pages is > > > limited to 10k and the heavy work is done outside of the lock. If this > > > is really a problem then we can tear LRU part and the actual > > > freeing/uncharging into a separate functions in this path. > > > > > > Could you test with this half baked patch, please? I didn't get to test > > > it myself unfortunately. > > > > 3.16 settled out at about 11.5M faults/sec before the regression. This > > patch gets it back up to about 10.5M, which is good. > > Dave, would you be willing to test the following patch as well? I do not > have a huge machine at hand right now. It would be great if you could I was playing with 48CPU with 32G of RAM machine but the res_counter lock didn't show up in the traces much (this was with 96 processes doing mmap (256M private file, faul, unmap in parallel): |--0.75%-- __res_counter_charge | res_counter_charge | try_charge | mem_cgroup_try_charge | | | |--81.56%-- do_cow_fault | | handle_mm_fault | | __do_page_fault | | do_page_fault | | page_fault [...] | | | --18.44%-- __add_to_page_cache_locked | add_to_page_cache_lru | mpage_readpages | ext4_readpages | __do_page_cache_readahead | ondemand_readahead | page_cache_async_readahead | filemap_fault | __do_fault | do_cow_fault | handle_mm_fault | __do_page_fault | do_page_fault | page_fault Nothing really changed in that regards when I reduced mmap size to 128M and run with 4*CPUs. I do not have a bigger machine to play with unfortunately. I think the patch makes sense on its own. I would really appreciate if you could give it a try on your machine with !root memcg case to see how much it helped. I would expect similar results to your previous testing without the revert and Johannes' patch. [...] -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f178.google.com (mail-pd0-f178.google.com [209.85.192.178]) by kanga.kvack.org (Postfix) with ESMTP id 315D96B0036 for ; Wed, 10 Sep 2014 12:58:12 -0400 (EDT) Received: by mail-pd0-f178.google.com with SMTP id p10so8632426pdj.37 for ; Wed, 10 Sep 2014 09:58:11 -0700 (PDT) Received: from mga02.intel.com (mga02.intel.com. [134.134.136.20]) by mx.google.com with ESMTP id p9si27950485pdr.169.2014.09.10.09.58.10 for ; Wed, 10 Sep 2014 09:58:11 -0700 (PDT) Message-ID: <54108314.80400@intel.com> Date: Wed, 10 Sep 2014 09:57:56 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <20140905092537.GC26243@dhcp22.suse.cz> <20140910162936.GI25219@dhcp22.suse.cz> In-Reply-To: <20140910162936.GI25219@dhcp22.suse.cz> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko , Dave Hansen Cc: Johannes Weiner , Hugh Dickins , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML On 09/10/2014 09:29 AM, Michal Hocko wrote: > I do not have a bigger machine to play with unfortunately. I think the > patch makes sense on its own. I would really appreciate if you could > give it a try on your machine with !root memcg case to see how much it > helped. I would expect similar results to your previous testing without > the revert and Johannes' patch. So you want to see before/after this patch: Subject: [PATCH] mm, memcg: Do not kill release batching in free_pages_and_swap_cache And you want it on top of a kernel with the revert or without? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f176.google.com (mail-we0-f176.google.com [74.125.82.176]) by kanga.kvack.org (Postfix) with ESMTP id E33776B003B for ; Wed, 10 Sep 2014 13:05:10 -0400 (EDT) Received: by mail-we0-f176.google.com with SMTP id q58so5554576wes.7 for ; Wed, 10 Sep 2014 10:05:10 -0700 (PDT) Received: from mail-wi0-x22d.google.com (mail-wi0-x22d.google.com [2a00:1450:400c:c05::22d]) by mx.google.com with ESMTPS id gq8si21489329wjc.23.2014.09.10.10.05.09 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 10 Sep 2014 10:05:09 -0700 (PDT) Received: by mail-wi0-f173.google.com with SMTP id em10so2563594wid.0 for ; Wed, 10 Sep 2014 10:05:09 -0700 (PDT) Date: Wed, 10 Sep 2014 19:05:06 +0200 From: Michal Hocko Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140910170506.GL25219@dhcp22.suse.cz> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <20140905092537.GC26243@dhcp22.suse.cz> <20140910162936.GI25219@dhcp22.suse.cz> <54108314.80400@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54108314.80400@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: Dave Hansen , Johannes Weiner , Hugh Dickins , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML On Wed 10-09-14 09:57:56, Dave Hansen wrote: > On 09/10/2014 09:29 AM, Michal Hocko wrote: > > I do not have a bigger machine to play with unfortunately. I think the > > patch makes sense on its own. I would really appreciate if you could > > give it a try on your machine with !root memcg case to see how much it > > helped. I would expect similar results to your previous testing without > > the revert and Johannes' patch. > > So you want to see before/after this patch: > > Subject: [PATCH] mm, memcg: Do not kill release batching in > free_pages_and_swap_cache > > And you want it on top of a kernel with the revert or without? Revert doesn't make any difference if you run the load inside a memcg (without any limit set). So just before and after the patch would be sufficient. Thanks a lot Dave! -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755063AbaIBTFo (ORCPT ); Tue, 2 Sep 2014 15:05:44 -0400 Received: from www.sr71.net ([198.145.64.142]:52537 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754781AbaIBTFn (ORCPT ); Tue, 2 Sep 2014 15:05:43 -0400 Message-ID: <54061505.8020500@sr71.net> Date: Tue, 02 Sep 2014 12:05:41 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: Johannes Weiner , Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Linus Torvalds , Andrew Morton , LKML , Linux-MM Subject: regression caused by cgroups optimization in 3.17-rc2 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org I'm seeing a pretty large regression in 3.17-rc2 vs 3.16 coming from the memory cgroups code. This is on a kernel with cgroups enabled at compile time, but not _used_ for anything. See the green lines in the graph: https://www.sr71.net/~dave/intel/regression-from-05b843012.png The workload is a little parallel microbenchmark doing page faults: > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault2.c The hardware is an 8-socket Westmere box with 160 hardware threads. For some reason, this does not affect the version of the microbenchmark which is doing completely anonymous page faults. I bisected it down to this commit: > commit 05b8430123359886ef6a4146fba384e30d771b3f > Author: Johannes Weiner > Date: Wed Aug 6 16:05:59 2014 -0700 > > mm: memcontrol: use root_mem_cgroup res_counter > > Due to an old optimization to keep expensive res_counter changes at a > minimum, the root_mem_cgroup res_counter is never charged; there is no > limit at that level anyway, and any statistics can be generated on > demand by summing up the counters of all other cgroups. > > However, with per-cpu charge caches, res_counter operations do not even > show up in profiles anymore, so this optimization is no longer > necessary. > > Remove it to simplify the code. It does not revert cleanly because of the hunks below. The code in those hunks was removed, so I tried running without properly merging them and it spews warnings because counter->usage is seen going negative. So, it doesn't appear we can quickly revert this. > --- mm/memcontrol.c > +++ mm/memcontrol.c > @@ -3943,7 +3947,7 @@ > * replacement page, so leave it alone when phasing out the > * page that is unused after the migration. > */ > - if (!end_migration) > + if (!end_migration && !mem_cgroup_is_root(memcg)) > mem_cgroup_do_uncharge(memcg, nr_pages, ctype); > > return memcg; > @@ -4076,7 +4080,8 @@ > * We uncharge this because swap is freed. This memcg can > * be obsolete one. We avoid calling css_tryget_online(). > */ > - res_counter_uncharge(&memcg->memsw, PAGE_SIZE); > + if (!mem_cgroup_is_root(memcg)) > + res_counter_uncharge(&memcg->memsw, PAGE_SIZE); > mem_cgroup_swap_statistics(memcg, false); > css_put(&memcg->css); > } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755248AbaIBU5Z (ORCPT ); Tue, 2 Sep 2014 16:57:25 -0400 Received: from www.sr71.net ([198.145.64.142]:53384 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755169AbaIBU5Y (ORCPT ); Tue, 2 Sep 2014 16:57:24 -0400 Message-ID: <54062F32.5070504@sr71.net> Date: Tue, 02 Sep 2014 13:57:22 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: Dave Hansen , Johannes Weiner , Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Linus Torvalds , Andrew Morton , LKML , Linux-MM Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> In-Reply-To: <5406262F.4050705@intel.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org I, of course, forgot to include the most important detail. This appears to be pretty run-of-the-mill spinlock contention in the resource counter code. Nearly 80% of the CPU is spent spinning in the charge or uncharge paths in the kernel. It is apparently spinning on res_counter->lock in both the charge and uncharge paths. It already does _some_ batching here on the free side, but that apparently breaks down after ~40 threads. It's a no-brainer since the patch in question removed an optimization skipping the charging, and now we're seeing overhead from the charging. Here's the first entry from perf top: 80.18% 80.18% [kernel] [k] _raw_spin_lock | --- _raw_spin_lock | |--66.59%-- res_counter_uncharge_until | res_counter_uncharge | uncharge_batch | uncharge_list | mem_cgroup_uncharge_list | release_pages | free_pages_and_swap_cache | tlb_flush_mmu_free | | | |--90.12%-- unmap_single_vma | | unmap_vmas | | unmap_region | | do_munmap | | vm_munmap | | sys_munmap | | system_call_fastpath | | __GI___munmap | | | --9.88%-- tlb_flush_mmu | tlb_finish_mmu | unmap_region | do_munmap | vm_munmap | sys_munmap | system_call_fastpath | __GI___munmap | |--46.13%-- __res_counter_charge | res_counter_charge | try_charge | mem_cgroup_try_charge | | | |--99.89%-- do_cow_fault | | handle_mm_fault | | __do_page_fault | | do_page_fault | | page_fault | | testcase | --0.11%-- [...] | |--1.14%-- do_cow_fault | handle_mm_fault | __do_page_fault | do_page_fault | page_fault | testcase --8217937613.29%-- [...] From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755463AbaIBWSh (ORCPT ); Tue, 2 Sep 2014 18:18:37 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:59918 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751326AbaIBWSf (ORCPT ); Tue, 2 Sep 2014 18:18:35 -0400 Date: Tue, 2 Sep 2014 18:18:14 -0400 From: Johannes Weiner To: Dave Hansen Cc: Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Linus Torvalds , Andrew Morton , LKML , Linux-MM Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140902221814.GA18069@cmpxchg.org> References: <54061505.8020500@sr71.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54061505.8020500@sr71.net> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Dave, On Tue, Sep 02, 2014 at 12:05:41PM -0700, Dave Hansen wrote: > I'm seeing a pretty large regression in 3.17-rc2 vs 3.16 coming from the > memory cgroups code. This is on a kernel with cgroups enabled at > compile time, but not _used_ for anything. See the green lines in the > graph: > > https://www.sr71.net/~dave/intel/regression-from-05b843012.png > > The workload is a little parallel microbenchmark doing page faults: Ouch. > > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault2.c > > The hardware is an 8-socket Westmere box with 160 hardware threads. For > some reason, this does not affect the version of the microbenchmark > which is doing completely anonymous page faults. > > I bisected it down to this commit: > > > commit 05b8430123359886ef6a4146fba384e30d771b3f > > Author: Johannes Weiner > > Date: Wed Aug 6 16:05:59 2014 -0700 > > > > mm: memcontrol: use root_mem_cgroup res_counter > > > > Due to an old optimization to keep expensive res_counter changes at a > > minimum, the root_mem_cgroup res_counter is never charged; there is no > > limit at that level anyway, and any statistics can be generated on > > demand by summing up the counters of all other cgroups. > > > > However, with per-cpu charge caches, res_counter operations do not even > > show up in profiles anymore, so this optimization is no longer > > necessary. > > > > Remove it to simplify the code. Accounting new pages is buffered through per-cpu caches, but taking them off the counters on free is not, so I'm guessing that above a certain allocation rate the cost of locking and changing the counters takes over. Is there a chance you could profile this to see if locks and res_counter-related operations show up? I can't reproduce this complete breakdown on my smaller test gear, but I do see an improvement with the following patch: --- >>From 29a51326c24b7bb45d17e9c864c34506f10868f6 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Tue, 2 Sep 2014 18:11:39 -0400 Subject: [patch] mm: memcontrol: use per-cpu caches for uncharging --- mm/memcontrol.c | 28 ++++++++++++++++++++++------ 1 file changed, 22 insertions(+), 6 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ec4dcf1b9562..cb79ecff399d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2365,6 +2365,7 @@ static void drain_stock(struct memcg_stock_pcp *stock) res_counter_uncharge(&old->res, bytes); if (do_swap_account) res_counter_uncharge(&old->memsw, bytes); + memcg_oom_recover(old); stock->nr_pages = 0; } stock->cached = NULL; @@ -2405,6 +2406,13 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) stock->cached = memcg; } stock->nr_pages += nr_pages; + if (stock->nr_pages > CHARGE_BATCH * 4) { + res_counter_uncharge(&memcg->res, CHARGE_BATCH); + if (do_swap_account) + res_counter_uncharge(&memcg->memsw, CHARGE_BATCH); + memcg_oom_recover(memcg); + stock->nr_pages -= CHARGE_BATCH; + } put_cpu_var(memcg_stock); } @@ -6509,12 +6517,20 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout, { unsigned long flags; - if (nr_mem) - res_counter_uncharge(&memcg->res, nr_mem * PAGE_SIZE); - if (nr_memsw) - res_counter_uncharge(&memcg->memsw, nr_memsw * PAGE_SIZE); - - memcg_oom_recover(memcg); + /* + * The percpu caches count both memory and memsw charges in a + * single conuter, but there might be less memsw charges when + * some of the pages have been swapped out. + */ + if (nr_mem == nr_memsw) + refill_stock(memcg, nr_mem); + else { + if (nr_mem) + res_counter_uncharge(&memcg->res, nr_mem * PAGE_SIZE); + if (nr_memsw) + res_counter_uncharge(&memcg->memsw, nr_memsw * PAGE_SIZE); + memcg_oom_recover(memcg); + } local_irq_save(flags); __this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS], nr_anon); -- 2.0.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754288AbaICAKb (ORCPT ); Tue, 2 Sep 2014 20:10:31 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:59936 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753940AbaICAKa (ORCPT ); Tue, 2 Sep 2014 20:10:30 -0400 Date: Tue, 2 Sep 2014 20:10:09 -0400 From: Johannes Weiner To: Dave Hansen Cc: Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Linus Torvalds , Andrew Morton , LKML , Linux-MM Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140903001009.GA25970@cmpxchg.org> References: <54061505.8020500@sr71.net> <20140902221814.GA18069@cmpxchg.org> <5406466D.1020000@sr71.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5406466D.1020000@sr71.net> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Sep 02, 2014 at 03:36:29PM -0700, Dave Hansen wrote: > On 09/02/2014 03:18 PM, Johannes Weiner wrote: > > Accounting new pages is buffered through per-cpu caches, but taking > > them off the counters on free is not, so I'm guessing that above a > > certain allocation rate the cost of locking and changing the counters > > takes over. Is there a chance you could profile this to see if locks > > and res_counter-related operations show up? > > It looks pretty much the same, although it might have equalized the > charge and uncharge sides a bit. Full 'perf top' output attached. That looks like a partial profile, where did the page allocator, page zeroing etc. go? Because the distribution among these listed symbols doesn't seem all that crazy: > PerfTop: 275580 irqs/sec kernel:98.0% exact: 0.0% [4000Hz cycles], (all, 160 CPUs) > ------------------------------------------------------------------------------- > > 68.10% 68.10% [kernel] [k] _raw_spin_lock > | > --- _raw_spin_lock > | > |--57.35%-- __res_counter_charge > | res_counter_charge > | try_charge > | mem_cgroup_try_charge > | | > | |--99.93%-- do_cow_fault > | | handle_mm_fault > | | __do_page_fault > | | do_page_fault > | | page_fault > | | testcase > | --0.07%-- [...] > | > |--53.93%-- res_counter_uncharge_until > | res_counter_uncharge > | refill_stock > | uncharge_batch > | uncharge_list > | mem_cgroup_uncharge_list > | release_pages > | free_pages_and_swap_cache > | tlb_flush_mmu_free > | | > | |--98.62%-- unmap_single_vma > | | unmap_vmas > | | unmap_region > | | do_munmap > | | vm_munmap > | | sys_munmap > | | system_call_fastpath > | | __GI___munmap > | | > | --1.38%-- tlb_flush_mmu > | tlb_finish_mmu > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > | > |--2.18%-- do_cow_fault > | handle_mm_fault > | __do_page_fault > | do_page_fault > | page_fault > | testcase > --9307219025.55%-- [...] > > 64.00% 1.34% page_fault2_processes [.] testcase > | > --- testcase > > 62.64% 0.37% [kernel] [k] page_fault > | > --- page_fault > | > |--114.21%-- testcase > --10118485450.89%-- [...] > > 62.27% 0.01% [kernel] [k] do_page_fault > | > --- do_page_fault > | > |--114.28%-- page_fault > | | > | |--99.93%-- testcase > | --0.07%-- [...] > --10178525138.33%-- [...] > > 62.25% 0.07% [kernel] [k] __do_page_fault > | > --- __do_page_fault > | > |--114.27%-- do_page_fault > | page_fault > | | > | |--99.93%-- testcase > | --0.07%-- [...] > --10182230022.41%-- [...] > > 62.08% 0.26% [kernel] [k] handle_mm_fault > | > --- handle_mm_fault > | > |--114.28%-- __do_page_fault > | do_page_fault > | page_fault > | | > | |--99.94%-- testcase > | --0.06%-- [...] > --10210709377.69%-- [...] > > 44.32% 0.22% [kernel] [k] do_cow_fault > | > --- do_cow_fault > | > |--114.28%-- handle_mm_fault > | __do_page_fault > | do_page_fault > | page_fault > | testcase > --14302986980.21%-- [...] > > 35.23% 0.00% [kernel] [k] sys_munmap > | > --- sys_munmap > system_call_fastpath > __GI___munmap > > 34.84% 0.04% [kernel] [k] mem_cgroup_try_charge > | > --- mem_cgroup_try_charge > | > |--114.18%-- do_cow_fault > | handle_mm_fault > | __do_page_fault > | do_page_fault > | page_fault > | testcase > --18195666899.12%-- [...] > > 34.74% 0.49% [kernel] [k] unmap_single_vma > | > --- unmap_single_vma > unmap_vmas > unmap_region > do_munmap > vm_munmap > sys_munmap > system_call_fastpath > __GI___munmap > > 34.66% 0.00% [kernel] [k] tlb_flush_mmu_free > | > --- tlb_flush_mmu_free > | > |--112.70%-- unmap_single_vma > | unmap_vmas > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > | > |--1.59%-- tlb_flush_mmu > | tlb_finish_mmu > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > --18285481100.94%-- [...] > > 34.66% 0.12% [kernel] [k] free_pages_and_swap_cache > | > --- free_pages_and_swap_cache > tlb_flush_mmu_free > | > |--112.70%-- unmap_single_vma > | unmap_vmas > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > | > |--1.59%-- tlb_flush_mmu > | tlb_finish_mmu > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > --18286531934.97%-- [...] > > 34.61% 0.04% [kernel] [k] try_charge > | > --- try_charge > | > |--114.26%-- mem_cgroup_try_charge > | | > | |--99.93%-- do_cow_fault > | | handle_mm_fault > | | __do_page_fault > | | do_page_fault > | | page_fault > | | testcase > | --0.07%-- [...] > --18312751937.56%-- [...] > > 34.57% 0.00% [kernel] [k] res_counter_charge > | > --- res_counter_charge > | > |--114.27%-- try_charge > | mem_cgroup_try_charge > | | > | |--99.93%-- do_cow_fault > | | handle_mm_fault > | | __do_page_fault > | | do_page_fault > | | page_fault > | | testcase > | --0.07%-- [...] > --18334689838.42%-- [...] > > 34.56% 0.08% [kernel] [k] release_pages > | > --- release_pages > | > |--114.21%-- free_pages_and_swap_cache > | tlb_flush_mmu_free > | | > | |--98.61%-- unmap_single_vma > | | unmap_vmas > | | unmap_region > | | do_munmap > | | vm_munmap > | | sys_munmap > | | system_call_fastpath > | | __GI___munmap > | | > | --1.39%-- tlb_flush_mmu > | tlb_finish_mmu > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > --18340488029.00%-- [...] > > 34.26% 0.08% [kernel] [k] __res_counter_charge > | > --- __res_counter_charge > | > |--114.28%-- res_counter_charge > | try_charge > | mem_cgroup_try_charge > | | > | |--99.93%-- do_cow_fault > | | handle_mm_fault > | | __do_page_fault > | | do_page_fault > | | page_fault > | | testcase > | --0.07%-- [...] > --18502823676.22%-- [...] > > 33.45% 0.00% [kernel] [k] mem_cgroup_uncharge_list > | > --- mem_cgroup_uncharge_list > | > |--114.28%-- release_pages > | free_pages_and_swap_cache > | tlb_flush_mmu_free > | | > | |--98.62%-- unmap_single_vma > | | unmap_vmas > | | unmap_region > | | do_munmap > | | vm_munmap > | | sys_munmap > | | system_call_fastpath > | | __GI___munmap > | | > | --1.38%-- tlb_flush_mmu > | tlb_finish_mmu > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > --18949679654.78%-- [...] > > 33.45% 0.01% [kernel] [k] uncharge_list > | > --- uncharge_list > | > |--114.29%-- mem_cgroup_uncharge_list > | release_pages > | free_pages_and_swap_cache > | tlb_flush_mmu_free > | | > | |--98.62%-- unmap_single_vma > | | unmap_vmas > | | unmap_region > | | do_munmap > | | vm_munmap > | | sys_munmap > | | system_call_fastpath > | | __GI___munmap > | | > | --1.38%-- tlb_flush_mmu > | tlb_finish_mmu > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > --18951232568.52%-- [...] > > 33.43% 0.02% [kernel] [k] uncharge_batch > | > --- uncharge_batch > | > |--114.28%-- uncharge_list > | mem_cgroup_uncharge_list > | release_pages > | free_pages_and_swap_cache > | tlb_flush_mmu_free > | | > | |--98.62%-- unmap_single_vma > | | unmap_vmas > | | unmap_region > | | do_munmap > | | vm_munmap > | | sys_munmap > | | system_call_fastpath > | | __GI___munmap > | | > | --1.38%-- tlb_flush_mmu > | tlb_finish_mmu > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > --18962452122.61%-- [...] > > 32.43% 0.01% [kernel] [k] refill_stock > | > --- refill_stock > | > |--114.27%-- uncharge_batch > | uncharge_list > | mem_cgroup_uncharge_list > | release_pages > | free_pages_and_swap_cache > | tlb_flush_mmu_free > | | > | |--98.62%-- unmap_single_vma > | | unmap_vmas > | | unmap_region > | | do_munmap > | | vm_munmap > | | sys_munmap > | | system_call_fastpath > | | __GI___munmap > | | > | --1.38%-- tlb_flush_mmu > | tlb_finish_mmu > | unmap_region > | do_munmap > | vm_munmap > | sys_munmap > | system_call_fastpath > | __GI___munmap > --19543678319.29%-- [...] From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754805AbaICAU5 (ORCPT ); Tue, 2 Sep 2014 20:20:57 -0400 Received: from mail-vc0-f182.google.com ([209.85.220.182]:47480 "EHLO mail-vc0-f182.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752248AbaICAU4 (ORCPT ); Tue, 2 Sep 2014 20:20:56 -0400 MIME-Version: 1.0 In-Reply-To: <20140903001009.GA25970@cmpxchg.org> References: <54061505.8020500@sr71.net> <20140902221814.GA18069@cmpxchg.org> <5406466D.1020000@sr71.net> <20140903001009.GA25970@cmpxchg.org> Date: Tue, 2 Sep 2014 17:20:55 -0700 X-Google-Sender-Auth: obvHnoT1tha5C2YVCpjaDJZ30uE Message-ID: Subject: Re: regression caused by cgroups optimization in 3.17-rc2 From: Linus Torvalds To: Johannes Weiner Cc: Dave Hansen , Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Andrew Morton , LKML , Linux-MM Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Sep 2, 2014 at 5:10 PM, Johannes Weiner wrote: > > That looks like a partial profile, where did the page allocator, page > zeroing etc. go? Because the distribution among these listed symbols > doesn't seem all that crazy: Please argue this *after* the commit has been reverted. You guys can try to make the memcontrol batching actually work and scale later. It's not appropriate to argue against major regressions when reported and bisected by users. Showing the spinlock at the top of the profile is very much crazy (apparently taking 68% of all cpu time), when it's all useless make-believe work. I don't understand why you wouldn't call that crazy. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755851AbaICAak (ORCPT ); Tue, 2 Sep 2014 20:30:40 -0400 Received: from www.sr71.net ([198.145.64.142]:55002 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754491AbaICAaj (ORCPT ); Tue, 2 Sep 2014 20:30:39 -0400 Message-ID: <5406612E.8040802@sr71.net> Date: Tue, 02 Sep 2014 17:30:38 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: Johannes Weiner CC: Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Linus Torvalds , Andrew Morton , LKML , Linux-MM Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <20140902221814.GA18069@cmpxchg.org> <5406466D.1020000@sr71.net> <20140903001009.GA25970@cmpxchg.org> In-Reply-To: <20140903001009.GA25970@cmpxchg.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 09/02/2014 05:10 PM, Johannes Weiner wrote: > On Tue, Sep 02, 2014 at 03:36:29PM -0700, Dave Hansen wrote: >> On 09/02/2014 03:18 PM, Johannes Weiner wrote: >>> Accounting new pages is buffered through per-cpu caches, but taking >>> them off the counters on free is not, so I'm guessing that above a >>> certain allocation rate the cost of locking and changing the counters >>> takes over. Is there a chance you could profile this to see if locks >>> and res_counter-related operations show up? >> >> It looks pretty much the same, although it might have equalized the >> charge and uncharge sides a bit. Full 'perf top' output attached. > > That looks like a partial profile, where did the page allocator, page > zeroing etc. go? Because the distribution among these listed symbols > doesn't seem all that crazy: Perf was only outputting the top 20 functions. Believe it or not, page zeroing and the rest of the allocator path wasn't even in the path of the top 20 functions because there is so much lock contention. Here's a longer run of 'perf top' along with the top 100 functions: http://www.sr71.net/~dave/intel/perf-top-1409702817.txt.gz you can at least see copy_page_rep in there. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755561AbaICBdg (ORCPT ); Tue, 2 Sep 2014 21:33:36 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:59947 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751302AbaICBde (ORCPT ); Tue, 2 Sep 2014 21:33:34 -0400 Date: Tue, 2 Sep 2014 21:33:17 -0400 From: Johannes Weiner To: Linus Torvalds Cc: Dave Hansen , Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Andrew Morton , LKML , Linux-MM Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140903013317.GA26086@cmpxchg.org> References: <54061505.8020500@sr71.net> <20140902221814.GA18069@cmpxchg.org> <5406466D.1020000@sr71.net> <20140903001009.GA25970@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Sep 02, 2014 at 05:20:55PM -0700, Linus Torvalds wrote: > On Tue, Sep 2, 2014 at 5:10 PM, Johannes Weiner wrote: > > > > That looks like a partial profile, where did the page allocator, page > > zeroing etc. go? Because the distribution among these listed symbols > > doesn't seem all that crazy: > > Please argue this *after* the commit has been reverted. You guys can > try to make the memcontrol batching actually work and scale later. > It's not appropriate to argue against major regressions when reported > and bisected by users. I'll send a clean revert later. > Showing the spinlock at the top of the profile is very much crazy > (apparently taking 68% of all cpu time), when it's all useless > make-believe work. I don't understand why you wouldn't call that > crazy. If you limit perf to a subset of symbols, it will show a relative distribution between them, i.e: perf top --symbols kfree,memset during some disk access: PerfTop: 1292 irqs/sec kernel:84.4% exact: 0.0% [4000Hz cycles], (all, 4 CPUs) ------------------------------------------------------------------------------- 56.23% [kernel] [k] kfree 41.86% [kernel] [k] memset 1.91% libc-2.19.so [.] memset kfree isn't eating 56% of "all cpu time" here, and it wasn't clear to me whether Dave filtered symbols from only memcontrol.o, memory.o, and mmap.o in a similar way. I'm not arguing against the regression, I'm just trying to make sense of the numbers from the *patched* kernel. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754184AbaICDPJ (ORCPT ); Tue, 2 Sep 2014 23:15:09 -0400 Received: from www.sr71.net ([198.145.64.142]:55792 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751730AbaICDPG (ORCPT ); Tue, 2 Sep 2014 23:15:06 -0400 Message-ID: <540687B9.7070305@sr71.net> Date: Tue, 02 Sep 2014 20:15:05 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: Johannes Weiner , Linus Torvalds CC: Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Andrew Morton , LKML , Linux-MM Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <20140902221814.GA18069@cmpxchg.org> <5406466D.1020000@sr71.net> <20140903001009.GA25970@cmpxchg.org> <20140903013317.GA26086@cmpxchg.org> In-Reply-To: <20140903013317.GA26086@cmpxchg.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 09/02/2014 06:33 PM, Johannes Weiner wrote: > kfree isn't eating 56% of "all cpu time" here, and it wasn't clear to > me whether Dave filtered symbols from only memcontrol.o, memory.o, and > mmap.o in a similar way. I'm not arguing against the regression, I'm > just trying to make sense of the numbers from the *patched* kernel. I guess I could have included it in the description, but that was a pretty vanilla run: perf top --call-graph=fp --stdio > foo.txt I didn't use any filtering. I didn't even know I _could_ filter. :) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753207AbaIDO13 (ORCPT ); Thu, 4 Sep 2014 10:27:29 -0400 Received: from mail-we0-f176.google.com ([74.125.82.176]:62829 "EHLO mail-we0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752923AbaIDO11 (ORCPT ); Thu, 4 Sep 2014 10:27:27 -0400 Date: Thu, 4 Sep 2014 16:27:21 +0200 From: Michal Hocko To: Dave Hansen Cc: Johannes Weiner , Hugh Dickins , Dave Hansen , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140904142721.GB14548@dhcp22.suse.cz> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54062F32.5070504@sr71.net> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org [Sorry to reply so late] On Tue 02-09-14 13:57:22, Dave Hansen wrote: > I, of course, forgot to include the most important detail. This appears > to be pretty run-of-the-mill spinlock contention in the resource counter > code. Nearly 80% of the CPU is spent spinning in the charge or uncharge > paths in the kernel. It is apparently spinning on res_counter->lock in > both the charge and uncharge paths. > > It already does _some_ batching here on the free side, but that > apparently breaks down after ~40 threads. > > It's a no-brainer since the patch in question removed an optimization > skipping the charging, and now we're seeing overhead from the charging. > > Here's the first entry from perf top: > > 80.18% 80.18% [kernel] [k] _raw_spin_lock > | > --- _raw_spin_lock > | > |--66.59%-- res_counter_uncharge_until > | res_counter_uncharge > | uncharge_batch > | uncharge_list > | mem_cgroup_uncharge_list > | release_pages > | free_pages_and_swap_cache Ouch. free_pages_and_swap_cache completely kills the uncharge batching because it reduces it to PAGEVEC_SIZE batches. I think we really do not need PAGEVEC_SIZE batching anymore. We are already batching on tlb_gather layer. That one is limited so I think the below should be safe but I have to think about this some more. There is a risk of prolonged lru_lock wait times but the number of pages is limited to 10k and the heavy work is done outside of the lock. If this is really a problem then we can tear LRU part and the actual freeing/uncharging into a separate functions in this path. Could you test with this half baked patch, please? I didn't get to test it myself unfortunately. --- diff --git a/mm/swap_state.c b/mm/swap_state.c index ef1f39139b71..154444918685 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -265,18 +265,12 @@ void free_page_and_swap_cache(struct page *page) void free_pages_and_swap_cache(struct page **pages, int nr) { struct page **pagep = pages; + int i; lru_add_drain(); - while (nr) { - int todo = min(nr, PAGEVEC_SIZE); - int i; - - for (i = 0; i < todo; i++) - free_swap_cache(pagep[i]); - release_pages(pagep, todo, false); - pagep += todo; - nr -= todo; - } + for (i = 0; i < nr; i++) + free_swap_cache(pagep[i]); + release_pages(pagep, nr, false); } /* -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752656AbaIDPJE (ORCPT ); Thu, 4 Sep 2014 11:09:04 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:60033 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750742AbaIDPJC (ORCPT ); Thu, 4 Sep 2014 11:09:02 -0400 Date: Thu, 4 Sep 2014 11:08:46 -0400 From: Johannes Weiner To: Dave Hansen Cc: Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Linus Torvalds , Andrew Morton , LKML , Linux-MM Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140904150846.GA10794@cmpxchg.org> References: <54061505.8020500@sr71.net> <20140902221814.GA18069@cmpxchg.org> <5406466D.1020000@sr71.net> <20140903001009.GA25970@cmpxchg.org> <5406612E.8040802@sr71.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5406612E.8040802@sr71.net> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Sep 02, 2014 at 05:30:38PM -0700, Dave Hansen wrote: > On 09/02/2014 05:10 PM, Johannes Weiner wrote: > > On Tue, Sep 02, 2014 at 03:36:29PM -0700, Dave Hansen wrote: > >> On 09/02/2014 03:18 PM, Johannes Weiner wrote: > >>> Accounting new pages is buffered through per-cpu caches, but taking > >>> them off the counters on free is not, so I'm guessing that above a > >>> certain allocation rate the cost of locking and changing the counters > >>> takes over. Is there a chance you could profile this to see if locks > >>> and res_counter-related operations show up? > >> > >> It looks pretty much the same, although it might have equalized the > >> charge and uncharge sides a bit. Full 'perf top' output attached. > > > > That looks like a partial profile, where did the page allocator, page > > zeroing etc. go? Because the distribution among these listed symbols > > doesn't seem all that crazy: > > Perf was only outputting the top 20 functions. Believe it or not, page > zeroing and the rest of the allocator path wasn't even in the path of > the top 20 functions because there is so much lock contention. > > Here's a longer run of 'perf top' along with the top 100 functions: > > http://www.sr71.net/~dave/intel/perf-top-1409702817.txt.gz > > you can at least see copy_page_rep in there. Thanks for the clarification, that is truly horrible. Does the following revert restore performance in your case? --- >>From 6fa7599054868cd0df940d7b0973dd64f8acb0b5 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Thu, 4 Sep 2014 10:04:34 -0400 Subject: [patch] mm: memcontrol: revert use of root_mem_cgroup res_counter Dave Hansen reports a massive scalability regression in an uncontained page fault benchmark with more than 30 concurrent threads, which he bisected down to 05b843012335 ("mm: memcontrol: use root_mem_cgroup res_counter") and pin-pointed on res_counter spinlock contention. That change relied on the per-cpu charge caches to mostly swallow the res_counter costs, but it's apparent that the caches don't scale yet. Revert memcg back to bypassing res_counters on the root level in order to restore performance for uncontained workloads. Reported-by: Dave Hansen Signed-off-by: Johannes Weiner --- mm/memcontrol.c | 103 ++++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 78 insertions(+), 25 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ec4dcf1b9562..085dc6d2f876 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2534,6 +2534,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned long long size; int ret = 0; + if (mem_cgroup_is_root(memcg)) + goto done; retry: if (consume_stock(memcg, nr_pages)) goto done; @@ -2611,9 +2613,7 @@ nomem: if (!(gfp_mask & __GFP_NOFAIL)) return -ENOMEM; bypass: - memcg = root_mem_cgroup; - ret = -EINTR; - goto retry; + return -EINTR; done_restock: if (batch > nr_pages) @@ -2626,6 +2626,9 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages) { unsigned long bytes = nr_pages * PAGE_SIZE; + if (mem_cgroup_is_root(memcg)) + return; + res_counter_uncharge(&memcg->res, bytes); if (do_swap_account) res_counter_uncharge(&memcg->memsw, bytes); @@ -2640,6 +2643,9 @@ static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg, { unsigned long bytes = nr_pages * PAGE_SIZE; + if (mem_cgroup_is_root(memcg)) + return; + res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes); if (do_swap_account) res_counter_uncharge_until(&memcg->memsw, @@ -4093,6 +4099,46 @@ out: return retval; } +static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg, + enum mem_cgroup_stat_index idx) +{ + struct mem_cgroup *iter; + long val = 0; + + /* Per-cpu values can be negative, use a signed accumulator */ + for_each_mem_cgroup_tree(iter, memcg) + val += mem_cgroup_read_stat(iter, idx); + + if (val < 0) /* race ? */ + val = 0; + return val; +} + +static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) +{ + u64 val; + + if (!mem_cgroup_is_root(memcg)) { + if (!swap) + return res_counter_read_u64(&memcg->res, RES_USAGE); + else + return res_counter_read_u64(&memcg->memsw, RES_USAGE); + } + + /* + * Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS + * as well as in MEM_CGROUP_STAT_RSS_HUGE. + */ + val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE); + val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS); + + if (swap) + val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAP); + + return val << PAGE_SHIFT; +} + + static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, struct cftype *cft) { @@ -4102,8 +4148,12 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, switch (type) { case _MEM: + if (name == RES_USAGE) + return mem_cgroup_usage(memcg, false); return res_counter_read_u64(&memcg->res, name); case _MEMSWAP: + if (name == RES_USAGE) + return mem_cgroup_usage(memcg, true); return res_counter_read_u64(&memcg->memsw, name); case _KMEM: return res_counter_read_u64(&memcg->kmem, name); @@ -4572,10 +4622,7 @@ static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap) if (!t) goto unlock; - if (!swap) - usage = res_counter_read_u64(&memcg->res, RES_USAGE); - else - usage = res_counter_read_u64(&memcg->memsw, RES_USAGE); + usage = mem_cgroup_usage(memcg, swap); /* * current_threshold points to threshold just below or equal to usage. @@ -4673,10 +4720,10 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg, if (type == _MEM) { thresholds = &memcg->thresholds; - usage = res_counter_read_u64(&memcg->res, RES_USAGE); + usage = mem_cgroup_usage(memcg, false); } else if (type == _MEMSWAP) { thresholds = &memcg->memsw_thresholds; - usage = res_counter_read_u64(&memcg->memsw, RES_USAGE); + usage = mem_cgroup_usage(memcg, true); } else BUG(); @@ -4762,10 +4809,10 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg, if (type == _MEM) { thresholds = &memcg->thresholds; - usage = res_counter_read_u64(&memcg->res, RES_USAGE); + usage = mem_cgroup_usage(memcg, false); } else if (type == _MEMSWAP) { thresholds = &memcg->memsw_thresholds; - usage = res_counter_read_u64(&memcg->memsw, RES_USAGE); + usage = mem_cgroup_usage(memcg, true); } else BUG(); @@ -5525,9 +5572,9 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css) * core guarantees its existence. */ } else { - res_counter_init(&memcg->res, &root_mem_cgroup->res); - res_counter_init(&memcg->memsw, &root_mem_cgroup->memsw); - res_counter_init(&memcg->kmem, &root_mem_cgroup->kmem); + res_counter_init(&memcg->res, NULL); + res_counter_init(&memcg->memsw, NULL); + res_counter_init(&memcg->kmem, NULL); /* * Deeper hierachy with use_hierarchy == false doesn't make * much sense so let cgroup subsystem know about this @@ -5969,8 +6016,9 @@ static void __mem_cgroup_clear_mc(void) /* we must fixup refcnts and charges */ if (mc.moved_swap) { /* uncharge swap account from the old cgroup */ - res_counter_uncharge(&mc.from->memsw, - PAGE_SIZE * mc.moved_swap); + if (!mem_cgroup_is_root(mc.from)) + res_counter_uncharge(&mc.from->memsw, + PAGE_SIZE * mc.moved_swap); for (i = 0; i < mc.moved_swap; i++) css_put(&mc.from->css); @@ -5979,8 +6027,9 @@ static void __mem_cgroup_clear_mc(void) * we charged both to->res and to->memsw, so we should * uncharge to->res. */ - res_counter_uncharge(&mc.to->res, - PAGE_SIZE * mc.moved_swap); + if (!mem_cgroup_is_root(mc.to)) + res_counter_uncharge(&mc.to->res, + PAGE_SIZE * mc.moved_swap); /* we've already done css_get(mc.to) */ mc.moved_swap = 0; } @@ -6345,7 +6394,8 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry) rcu_read_lock(); memcg = mem_cgroup_lookup(id); if (memcg) { - res_counter_uncharge(&memcg->memsw, PAGE_SIZE); + if (!mem_cgroup_is_root(memcg)) + res_counter_uncharge(&memcg->memsw, PAGE_SIZE); mem_cgroup_swap_statistics(memcg, false); css_put(&memcg->css); } @@ -6509,12 +6559,15 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout, { unsigned long flags; - if (nr_mem) - res_counter_uncharge(&memcg->res, nr_mem * PAGE_SIZE); - if (nr_memsw) - res_counter_uncharge(&memcg->memsw, nr_memsw * PAGE_SIZE); - - memcg_oom_recover(memcg); + if (!mem_cgroup_is_root(memcg)) { + if (nr_mem) + res_counter_uncharge(&memcg->res, + nr_mem * PAGE_SIZE); + if (nr_memsw) + res_counter_uncharge(&memcg->memsw, + nr_memsw * PAGE_SIZE); + memcg_oom_recover(memcg); + } local_irq_save(flags); __this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS], nr_anon); -- 2.0.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755288AbaIDUuh (ORCPT ); Thu, 4 Sep 2014 16:50:37 -0400 Received: from www.sr71.net ([198.145.64.142]:48236 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753271AbaIDUug (ORCPT ); Thu, 4 Sep 2014 16:50:36 -0400 Message-ID: <5408D09A.5030000@sr71.net> Date: Thu, 04 Sep 2014 13:50:34 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: Johannes Weiner CC: Michal Hocko , Hugh Dickins , Tejun Heo , Vladimir Davydov , Linus Torvalds , Andrew Morton , LKML , Linux-MM Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <20140902221814.GA18069@cmpxchg.org> <5406466D.1020000@sr71.net> <20140903001009.GA25970@cmpxchg.org> <5406612E.8040802@sr71.net> <20140904150846.GA10794@cmpxchg.org> In-Reply-To: <20140904150846.GA10794@cmpxchg.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 09/04/2014 08:08 AM, Johannes Weiner wrote: > Dave Hansen reports a massive scalability regression in an uncontained > page fault benchmark with more than 30 concurrent threads, which he > bisected down to 05b843012335 ("mm: memcontrol: use root_mem_cgroup > res_counter") and pin-pointed on res_counter spinlock contention. > > That change relied on the per-cpu charge caches to mostly swallow the > res_counter costs, but it's apparent that the caches don't scale yet. > > Revert memcg back to bypassing res_counters on the root level in order > to restore performance for uncontained workloads. A quick sniff test shows performance coming back to what it was around 3.16 with this patch. I'll run a more thorough set of tests and verify that it's working well. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755419AbaIDU13 (ORCPT ); Thu, 4 Sep 2014 16:27:29 -0400 Received: from www.sr71.net ([198.145.64.142]:47993 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752920AbaIDU12 (ORCPT ); Thu, 4 Sep 2014 16:27:28 -0400 Message-ID: <5408CB2E.3080101@sr71.net> Date: Thu, 04 Sep 2014 13:27:26 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: Michal Hocko CC: Johannes Weiner , Hugh Dickins , Dave Hansen , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> In-Reply-To: <20140904142721.GB14548@dhcp22.suse.cz> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 09/04/2014 07:27 AM, Michal Hocko wrote: > Ouch. free_pages_and_swap_cache completely kills the uncharge batching > because it reduces it to PAGEVEC_SIZE batches. > > I think we really do not need PAGEVEC_SIZE batching anymore. We are > already batching on tlb_gather layer. That one is limited so I think > the below should be safe but I have to think about this some more. There > is a risk of prolonged lru_lock wait times but the number of pages is > limited to 10k and the heavy work is done outside of the lock. If this > is really a problem then we can tear LRU part and the actual > freeing/uncharging into a separate functions in this path. > > Could you test with this half baked patch, please? I didn't get to test > it myself unfortunately. 3.16 settled out at about 11.5M faults/sec before the regression. This patch gets it back up to about 10.5M, which is good. The top spinlock contention in the kernel is still from the resource counter code via mem_cgroup_commit_charge(), though. I'm running Johannes' patch now. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756137AbaIDWzv (ORCPT ); Thu, 4 Sep 2014 18:55:51 -0400 Received: from mga09.intel.com ([134.134.136.24]:47890 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751520AbaIDWzt (ORCPT ); Thu, 4 Sep 2014 18:55:49 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.04,469,1406617200"; d="scan'208";a="598206660" Message-ID: <5408ED7A.5010908@intel.com> Date: Thu, 04 Sep 2014 15:53:46 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: Dave Hansen , Michal Hocko CC: Johannes Weiner , Hugh Dickins , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> In-Reply-To: <5408CB2E.3080101@sr71.net> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 09/04/2014 01:27 PM, Dave Hansen wrote: > On 09/04/2014 07:27 AM, Michal Hocko wrote: >> Ouch. free_pages_and_swap_cache completely kills the uncharge batching >> because it reduces it to PAGEVEC_SIZE batches. >> >> I think we really do not need PAGEVEC_SIZE batching anymore. We are >> already batching on tlb_gather layer. That one is limited so I think >> the below should be safe but I have to think about this some more. There >> is a risk of prolonged lru_lock wait times but the number of pages is >> limited to 10k and the heavy work is done outside of the lock. If this >> is really a problem then we can tear LRU part and the actual >> freeing/uncharging into a separate functions in this path. >> >> Could you test with this half baked patch, please? I didn't get to test >> it myself unfortunately. > > 3.16 settled out at about 11.5M faults/sec before the regression. This > patch gets it back up to about 10.5M, which is good. The top spinlock > contention in the kernel is still from the resource counter code via > mem_cgroup_commit_charge(), though. > > I'm running Johannes' patch now. This looks pretty good. The area where it plateaus (above 80 threads where hyperthreading kicks in) might be a bit slower than it was in 3.16, but that could easily be from other things. > https://www.sr71.net/~dave/intel/bb.html?1=3.16.0-rc4-g67b9d76/&2=3.17.0-rc3-g57b252f Feel free to add my Tested-by: From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756378AbaIEIEP (ORCPT ); Fri, 5 Sep 2014 04:04:15 -0400 Received: from mail-we0-f171.google.com ([74.125.82.171]:64230 "EHLO mail-we0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756063AbaIEIEI (ORCPT ); Fri, 5 Sep 2014 04:04:08 -0400 Date: Fri, 5 Sep 2014 10:04:04 +0200 From: Michal Hocko To: Johannes Weiner Cc: Dave Hansen , Hugh Dickins , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140905080404.GA26243@dhcp22.suse.cz> References: <54061505.8020500@sr71.net> <20140902221814.GA18069@cmpxchg.org> <5406466D.1020000@sr71.net> <20140903001009.GA25970@cmpxchg.org> <5406612E.8040802@sr71.net> <20140904150846.GA10794@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140904150846.GA10794@cmpxchg.org> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 04-09-14 11:08:46, Johannes Weiner wrote: [...] > From 6fa7599054868cd0df940d7b0973dd64f8acb0b5 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Thu, 4 Sep 2014 10:04:34 -0400 > Subject: [patch] mm: memcontrol: revert use of root_mem_cgroup res_counter > > Dave Hansen reports a massive scalability regression in an uncontained > page fault benchmark with more than 30 concurrent threads, which he > bisected down to 05b843012335 ("mm: memcontrol: use root_mem_cgroup > res_counter") and pin-pointed on res_counter spinlock contention. > > That change relied on the per-cpu charge caches to mostly swallow the > res_counter costs, but it's apparent that the caches don't scale yet. > > Revert memcg back to bypassing res_counters on the root level in order > to restore performance for uncontained workloads. > > Reported-by: Dave Hansen > Signed-off-by: Johannes Weiner The revert looks good to me. Acked-by: Michal Hocko > --- > mm/memcontrol.c | 103 ++++++++++++++++++++++++++++++++++++++++++-------------- > 1 file changed, 78 insertions(+), 25 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index ec4dcf1b9562..085dc6d2f876 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2534,6 +2534,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, > unsigned long long size; > int ret = 0; > > + if (mem_cgroup_is_root(memcg)) > + goto done; > retry: > if (consume_stock(memcg, nr_pages)) > goto done; > @@ -2611,9 +2613,7 @@ nomem: > if (!(gfp_mask & __GFP_NOFAIL)) > return -ENOMEM; > bypass: > - memcg = root_mem_cgroup; > - ret = -EINTR; > - goto retry; > + return -EINTR; > > done_restock: > if (batch > nr_pages) > @@ -2626,6 +2626,9 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages) > { > unsigned long bytes = nr_pages * PAGE_SIZE; > > + if (mem_cgroup_is_root(memcg)) > + return; > + > res_counter_uncharge(&memcg->res, bytes); > if (do_swap_account) > res_counter_uncharge(&memcg->memsw, bytes); > @@ -2640,6 +2643,9 @@ static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg, > { > unsigned long bytes = nr_pages * PAGE_SIZE; > > + if (mem_cgroup_is_root(memcg)) > + return; > + > res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes); > if (do_swap_account) > res_counter_uncharge_until(&memcg->memsw, > @@ -4093,6 +4099,46 @@ out: > return retval; > } > > +static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg, > + enum mem_cgroup_stat_index idx) > +{ > + struct mem_cgroup *iter; > + long val = 0; > + > + /* Per-cpu values can be negative, use a signed accumulator */ > + for_each_mem_cgroup_tree(iter, memcg) > + val += mem_cgroup_read_stat(iter, idx); > + > + if (val < 0) /* race ? */ > + val = 0; > + return val; > +} > + > +static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) > +{ > + u64 val; > + > + if (!mem_cgroup_is_root(memcg)) { > + if (!swap) > + return res_counter_read_u64(&memcg->res, RES_USAGE); > + else > + return res_counter_read_u64(&memcg->memsw, RES_USAGE); > + } > + > + /* > + * Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS > + * as well as in MEM_CGROUP_STAT_RSS_HUGE. > + */ > + val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE); > + val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS); > + > + if (swap) > + val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAP); > + > + return val << PAGE_SHIFT; > +} > + > + > static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, > struct cftype *cft) > { > @@ -4102,8 +4148,12 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, > > switch (type) { > case _MEM: > + if (name == RES_USAGE) > + return mem_cgroup_usage(memcg, false); > return res_counter_read_u64(&memcg->res, name); > case _MEMSWAP: > + if (name == RES_USAGE) > + return mem_cgroup_usage(memcg, true); > return res_counter_read_u64(&memcg->memsw, name); > case _KMEM: > return res_counter_read_u64(&memcg->kmem, name); > @@ -4572,10 +4622,7 @@ static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap) > if (!t) > goto unlock; > > - if (!swap) > - usage = res_counter_read_u64(&memcg->res, RES_USAGE); > - else > - usage = res_counter_read_u64(&memcg->memsw, RES_USAGE); > + usage = mem_cgroup_usage(memcg, swap); > > /* > * current_threshold points to threshold just below or equal to usage. > @@ -4673,10 +4720,10 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg, > > if (type == _MEM) { > thresholds = &memcg->thresholds; > - usage = res_counter_read_u64(&memcg->res, RES_USAGE); > + usage = mem_cgroup_usage(memcg, false); > } else if (type == _MEMSWAP) { > thresholds = &memcg->memsw_thresholds; > - usage = res_counter_read_u64(&memcg->memsw, RES_USAGE); > + usage = mem_cgroup_usage(memcg, true); > } else > BUG(); > > @@ -4762,10 +4809,10 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg, > > if (type == _MEM) { > thresholds = &memcg->thresholds; > - usage = res_counter_read_u64(&memcg->res, RES_USAGE); > + usage = mem_cgroup_usage(memcg, false); > } else if (type == _MEMSWAP) { > thresholds = &memcg->memsw_thresholds; > - usage = res_counter_read_u64(&memcg->memsw, RES_USAGE); > + usage = mem_cgroup_usage(memcg, true); > } else > BUG(); > > @@ -5525,9 +5572,9 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css) > * core guarantees its existence. > */ > } else { > - res_counter_init(&memcg->res, &root_mem_cgroup->res); > - res_counter_init(&memcg->memsw, &root_mem_cgroup->memsw); > - res_counter_init(&memcg->kmem, &root_mem_cgroup->kmem); > + res_counter_init(&memcg->res, NULL); > + res_counter_init(&memcg->memsw, NULL); > + res_counter_init(&memcg->kmem, NULL); > /* > * Deeper hierachy with use_hierarchy == false doesn't make > * much sense so let cgroup subsystem know about this > @@ -5969,8 +6016,9 @@ static void __mem_cgroup_clear_mc(void) > /* we must fixup refcnts and charges */ > if (mc.moved_swap) { > /* uncharge swap account from the old cgroup */ > - res_counter_uncharge(&mc.from->memsw, > - PAGE_SIZE * mc.moved_swap); > + if (!mem_cgroup_is_root(mc.from)) > + res_counter_uncharge(&mc.from->memsw, > + PAGE_SIZE * mc.moved_swap); > > for (i = 0; i < mc.moved_swap; i++) > css_put(&mc.from->css); > @@ -5979,8 +6027,9 @@ static void __mem_cgroup_clear_mc(void) > * we charged both to->res and to->memsw, so we should > * uncharge to->res. > */ > - res_counter_uncharge(&mc.to->res, > - PAGE_SIZE * mc.moved_swap); > + if (!mem_cgroup_is_root(mc.to)) > + res_counter_uncharge(&mc.to->res, > + PAGE_SIZE * mc.moved_swap); > /* we've already done css_get(mc.to) */ > mc.moved_swap = 0; > } > @@ -6345,7 +6394,8 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry) > rcu_read_lock(); > memcg = mem_cgroup_lookup(id); > if (memcg) { > - res_counter_uncharge(&memcg->memsw, PAGE_SIZE); > + if (!mem_cgroup_is_root(memcg)) > + res_counter_uncharge(&memcg->memsw, PAGE_SIZE); > mem_cgroup_swap_statistics(memcg, false); > css_put(&memcg->css); > } > @@ -6509,12 +6559,15 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout, > { > unsigned long flags; > > - if (nr_mem) > - res_counter_uncharge(&memcg->res, nr_mem * PAGE_SIZE); > - if (nr_memsw) > - res_counter_uncharge(&memcg->memsw, nr_memsw * PAGE_SIZE); > - > - memcg_oom_recover(memcg); > + if (!mem_cgroup_is_root(memcg)) { > + if (nr_mem) > + res_counter_uncharge(&memcg->res, > + nr_mem * PAGE_SIZE); > + if (nr_memsw) > + res_counter_uncharge(&memcg->memsw, > + nr_memsw * PAGE_SIZE); > + memcg_oom_recover(memcg); > + } > > local_irq_save(flags); > __this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS], nr_anon); > -- > 2.0.4 > > -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756581AbaIEJZo (ORCPT ); Fri, 5 Sep 2014 05:25:44 -0400 Received: from mail-wg0-f41.google.com ([74.125.82.41]:34769 "EHLO mail-wg0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756045AbaIEJZl (ORCPT ); Fri, 5 Sep 2014 05:25:41 -0400 Date: Fri, 5 Sep 2014 11:25:37 +0200 From: Michal Hocko To: Dave Hansen Cc: Johannes Weiner , Hugh Dickins , Dave Hansen , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140905092537.GC26243@dhcp22.suse.cz> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5408CB2E.3080101@sr71.net> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 04-09-14 13:27:26, Dave Hansen wrote: > On 09/04/2014 07:27 AM, Michal Hocko wrote: > > Ouch. free_pages_and_swap_cache completely kills the uncharge batching > > because it reduces it to PAGEVEC_SIZE batches. > > > > I think we really do not need PAGEVEC_SIZE batching anymore. We are > > already batching on tlb_gather layer. That one is limited so I think > > the below should be safe but I have to think about this some more. There > > is a risk of prolonged lru_lock wait times but the number of pages is > > limited to 10k and the heavy work is done outside of the lock. If this > > is really a problem then we can tear LRU part and the actual > > freeing/uncharging into a separate functions in this path. > > > > Could you test with this half baked patch, please? I didn't get to test > > it myself unfortunately. > > 3.16 settled out at about 11.5M faults/sec before the regression. This > patch gets it back up to about 10.5M, which is good. Dave, would you be willing to test the following patch as well? I do not have a huge machine at hand right now. It would be great if you could run the same load within a !root memcg. We have basically the same sub-optimality there as well. The root bypass will be re-introduced for now but I think we can make the regular memcg load better regardless and this would be also preparation for later root bypass removal again. --- >>From 17c4c8710f3ec37fe04866bd18dbc68a0f47b560 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 5 Sep 2014 11:16:17 +0200 Subject: [PATCH] mm, memcg: Do not kill release batching in free_pages_and_swap_cache free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks. This is not a big deal for the normal release path but it completely kills memcg uncharge batching which reduces res_counter spin_lock contention. Dave has noticed this with his page fault scalability test case on a large machine when the lock was basically dominating on all CPUs: 80.18% 80.18% [kernel] [k] _raw_spin_lock | --- _raw_spin_lock | |--66.59%-- res_counter_uncharge_until | res_counter_uncharge | uncharge_batch | uncharge_list | mem_cgroup_uncharge_list | release_pages | free_pages_and_swap_cache | tlb_flush_mmu_free | | | |--90.12%-- unmap_single_vma | | unmap_vmas | | unmap_region | | do_munmap | | vm_munmap | | sys_munmap | | system_call_fastpath | | __GI___munmap | | | --9.88%-- tlb_flush_mmu | tlb_finish_mmu | unmap_region | do_munmap | vm_munmap | sys_munmap | system_call_fastpath | __GI___munmap In his case the load was running in the root memcg and that part has been handled by reverting 05b843012335 ("mm: memcontrol: use root_mem_cgroup res_counter") because this is a clear regression. But it makes sense to help loads which are running within a memcg as well. So this is basically a performance optimization. There is no reason to limit release_pages to PAGEVEC_SIZE batches other than lru_lock held times. This logic, however, can be moved inside the function. mem_cgroup_uncharge_list and free_hot_cold_page_list do not hold any lock for the whole pages_to_free list so it is safe to call them in a single run. Page reference count and LRU handling is moved to release_lru_pages and that is run in PAGEVEC_SIZE batches. Reported-by: Dave Hansen Signed-off-by: Michal Hocko --- mm/swap.c | 27 +++++++++++++++++++++------ mm/swap_state.c | 14 ++++---------- 2 files changed, 25 insertions(+), 16 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 6b2dc3897cd5..8af99dd68dd2 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -888,9 +888,9 @@ void lru_add_drain_all(void) } /* - * Batched page_cache_release(). Decrement the reference count on all the - * passed pages. If it fell to zero then remove the page from the LRU and - * free it. + * Batched lru release. Decrement the reference count on all the passed pages. + * If it fell to zero then remove the page from the LRU and add it to the given + * list to be freed by the caller. * * Avoid taking zone->lru_lock if possible, but if it is taken, retain it * for the remainder of the operation. @@ -900,10 +900,10 @@ void lru_add_drain_all(void) * grabbed the page via the LRU. If it did, give up: shrink_inactive_list() * will free it. */ -void release_pages(struct page **pages, int nr, bool cold) +static void release_lru_pages(struct page **pages, int nr, + struct list_head *pages_to_free) { int i; - LIST_HEAD(pages_to_free); struct zone *zone = NULL; struct lruvec *lruvec; unsigned long uninitialized_var(flags); @@ -943,11 +943,26 @@ void release_pages(struct page **pages, int nr, bool cold) /* Clear Active bit in case of parallel mark_page_accessed */ __ClearPageActive(page); - list_add(&page->lru, &pages_to_free); + list_add(&page->lru, pages_to_free); } if (zone) spin_unlock_irqrestore(&zone->lru_lock, flags); +} +/* + * Batched page_cache_release(). Frees and uncharges all given pages + * for which the reference count drops to 0. + */ +void release_pages(struct page **pages, int nr, bool cold) +{ + LIST_HEAD(pages_to_free); + while (nr) { + int batch = min(nr, PAGEVEC_SIZE); + + release_lru_pages(pages, batch, &pages_to_free); + pages += batch; + nr -= batch; + } mem_cgroup_uncharge_list(&pages_to_free); free_hot_cold_page_list(&pages_to_free, cold); } diff --git a/mm/swap_state.c b/mm/swap_state.c index ef1f39139b71..154444918685 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -265,18 +265,12 @@ void free_page_and_swap_cache(struct page *page) void free_pages_and_swap_cache(struct page **pages, int nr) { struct page **pagep = pages; + int i; lru_add_drain(); - while (nr) { - int todo = min(nr, PAGEVEC_SIZE); - int i; - - for (i = 0; i < todo; i++) - free_swap_cache(pagep[i]); - release_pages(pagep, todo, false); - pagep += todo; - nr -= todo; - } + for (i = 0; i < nr; i++) + free_swap_cache(pagep[i]); + release_pages(pagep, nr, false); } /* -- 2.1.0 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932104AbaIEJ2s (ORCPT ); Fri, 5 Sep 2014 05:28:48 -0400 Received: from mail-we0-f179.google.com ([74.125.82.179]:56428 "EHLO mail-we0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756183AbaIEJ2p (ORCPT ); Fri, 5 Sep 2014 05:28:45 -0400 Date: Fri, 5 Sep 2014 11:28:41 +0200 From: Michal Hocko To: Dave Hansen Cc: Dave Hansen , Johannes Weiner , Hugh Dickins , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140905092841.GD26243@dhcp22.suse.cz> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <5408ED7A.5010908@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5408ED7A.5010908@intel.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 04-09-14 15:53:46, Dave Hansen wrote: > On 09/04/2014 01:27 PM, Dave Hansen wrote: > > On 09/04/2014 07:27 AM, Michal Hocko wrote: > >> Ouch. free_pages_and_swap_cache completely kills the uncharge batching > >> because it reduces it to PAGEVEC_SIZE batches. > >> > >> I think we really do not need PAGEVEC_SIZE batching anymore. We are > >> already batching on tlb_gather layer. That one is limited so I think > >> the below should be safe but I have to think about this some more. There > >> is a risk of prolonged lru_lock wait times but the number of pages is > >> limited to 10k and the heavy work is done outside of the lock. If this > >> is really a problem then we can tear LRU part and the actual > >> freeing/uncharging into a separate functions in this path. > >> > >> Could you test with this half baked patch, please? I didn't get to test > >> it myself unfortunately. > > > > 3.16 settled out at about 11.5M faults/sec before the regression. This > > patch gets it back up to about 10.5M, which is good. The top spinlock > > contention in the kernel is still from the resource counter code via > > mem_cgroup_commit_charge(), though. > > > > I'm running Johannes' patch now. > > This looks pretty good. The area where it plateaus (above 80 threads > where hyperthreading kicks in) might be a bit slower than it was in > 3.16, but that could easily be from other things. Good news indeed. But I think it would be safer to apply Johannes' revert for now. Both changes are still worth having anyway because they have potential to improve memcg case. > > https://www.sr71.net/~dave/intel/bb.html?1=3.16.0-rc4-g67b9d76/&2=3.17.0-rc3-g57b252f > > Feel free to add my Tested-by: Thanks a lot! I have posted another patch which reduces the batching for LRU handling because this would be too risky. So I haven't added your Tested-by yet. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932288AbaIEMfo (ORCPT ); Fri, 5 Sep 2014 08:35:44 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:60095 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757037AbaIEMfl (ORCPT ); Fri, 5 Sep 2014 08:35:41 -0400 Date: Fri, 5 Sep 2014 08:35:17 -0400 From: Johannes Weiner To: Dave Hansen Cc: Michal Hocko , Hugh Dickins , Dave Hansen , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140905123517.GA21208@cmpxchg.org> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5408CB2E.3080101@sr71.net> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Sep 04, 2014 at 01:27:26PM -0700, Dave Hansen wrote: > On 09/04/2014 07:27 AM, Michal Hocko wrote: > > Ouch. free_pages_and_swap_cache completely kills the uncharge batching > > because it reduces it to PAGEVEC_SIZE batches. > > > > I think we really do not need PAGEVEC_SIZE batching anymore. We are > > already batching on tlb_gather layer. That one is limited so I think > > the below should be safe but I have to think about this some more. There > > is a risk of prolonged lru_lock wait times but the number of pages is > > limited to 10k and the heavy work is done outside of the lock. If this > > is really a problem then we can tear LRU part and the actual > > freeing/uncharging into a separate functions in this path. > > > > Could you test with this half baked patch, please? I didn't get to test > > it myself unfortunately. > > 3.16 settled out at about 11.5M faults/sec before the regression. This > patch gets it back up to about 10.5M, which is good. The top spinlock > contention in the kernel is still from the resource counter code via > mem_cgroup_commit_charge(), though. Thanks for testing, that looks a lot better. But commit doesn't touch resource counters - did you mean try_charge() or uncharge() by any chance? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757150AbaIEOrn (ORCPT ); Fri, 5 Sep 2014 10:47:43 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:60123 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751479AbaIEOrm (ORCPT ); Fri, 5 Sep 2014 10:47:42 -0400 Date: Fri, 5 Sep 2014 10:47:23 -0400 From: Johannes Weiner To: Michal Hocko Cc: Dave Hansen , Hugh Dickins , Dave Hansen , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140905144723.GB13392@cmpxchg.org> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <20140905092537.GC26243@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140905092537.GC26243@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Sep 05, 2014 at 11:25:37AM +0200, Michal Hocko wrote: > @@ -900,10 +900,10 @@ void lru_add_drain_all(void) > * grabbed the page via the LRU. If it did, give up: shrink_inactive_list() > * will free it. > */ > -void release_pages(struct page **pages, int nr, bool cold) > +static void release_lru_pages(struct page **pages, int nr, > + struct list_head *pages_to_free) > { > int i; > - LIST_HEAD(pages_to_free); > struct zone *zone = NULL; > struct lruvec *lruvec; > unsigned long uninitialized_var(flags); > @@ -943,11 +943,26 @@ void release_pages(struct page **pages, int nr, bool cold) > /* Clear Active bit in case of parallel mark_page_accessed */ > __ClearPageActive(page); > > - list_add(&page->lru, &pages_to_free); > + list_add(&page->lru, pages_to_free); > } > if (zone) > spin_unlock_irqrestore(&zone->lru_lock, flags); > +} > +/* > + * Batched page_cache_release(). Frees and uncharges all given pages > + * for which the reference count drops to 0. > + */ > +void release_pages(struct page **pages, int nr, bool cold) > +{ > + LIST_HEAD(pages_to_free); > > + while (nr) { > + int batch = min(nr, PAGEVEC_SIZE); > + > + release_lru_pages(pages, batch, &pages_to_free); > + pages += batch; > + nr -= batch; > + } We might be able to process a lot more pages in one go if nobody else needs the lock or the CPU. Can't we just cycle the lock or reschedule if necessary? diff --git a/mm/swap.c b/mm/swap.c index 6b2dc3897cd5..ee0cf21dd521 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -944,6 +944,15 @@ void release_pages(struct page **pages, int nr, bool cold) __ClearPageActive(page); list_add(&page->lru, &pages_to_free); + + if (should_resched() || + (zone && spin_needbreak(&zone->lru_lock))) { + if (zone) { + spin_unlock_irqrestore(&zone->lru_lock, flags); + zone = NULL; + } + cond_resched(); + } } if (zone) spin_unlock_irqrestore(&zone->lru_lock, flags); diff --git a/mm/swap_state.c b/mm/swap_state.c index 3e0ec83d000c..c487ca4682a4 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -262,19 +262,12 @@ void free_page_and_swap_cache(struct page *page) */ void free_pages_and_swap_cache(struct page **pages, int nr) { - struct page **pagep = pages; + int i; lru_add_drain(); - while (nr) { - int todo = min(nr, PAGEVEC_SIZE); - int i; - - for (i = 0; i < todo; i++) - free_swap_cache(pagep[i]); - release_pages(pagep, todo, false); - pagep += todo; - nr -= todo; - } + for (i = 0; i < nr; i++) + free_swap_cache(pages[i]); + release_pages(pages, nr, false); } /* From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752017AbaIEPjx (ORCPT ); Fri, 5 Sep 2014 11:39:53 -0400 Received: from mail-we0-f170.google.com ([74.125.82.170]:42178 "EHLO mail-we0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750832AbaIEPjw (ORCPT ); Fri, 5 Sep 2014 11:39:52 -0400 Date: Fri, 5 Sep 2014 17:39:48 +0200 From: Michal Hocko To: Johannes Weiner Cc: Dave Hansen , Hugh Dickins , Dave Hansen , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140905153948.GH26243@dhcp22.suse.cz> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <20140905092537.GC26243@dhcp22.suse.cz> <20140905144723.GB13392@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140905144723.GB13392@cmpxchg.org> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 05-09-14 10:47:23, Johannes Weiner wrote: > On Fri, Sep 05, 2014 at 11:25:37AM +0200, Michal Hocko wrote: > > @@ -900,10 +900,10 @@ void lru_add_drain_all(void) > > * grabbed the page via the LRU. If it did, give up: shrink_inactive_list() > > * will free it. > > */ > > -void release_pages(struct page **pages, int nr, bool cold) > > +static void release_lru_pages(struct page **pages, int nr, > > + struct list_head *pages_to_free) > > { > > int i; > > - LIST_HEAD(pages_to_free); > > struct zone *zone = NULL; > > struct lruvec *lruvec; > > unsigned long uninitialized_var(flags); > > @@ -943,11 +943,26 @@ void release_pages(struct page **pages, int nr, bool cold) > > /* Clear Active bit in case of parallel mark_page_accessed */ > > __ClearPageActive(page); > > > > - list_add(&page->lru, &pages_to_free); > > + list_add(&page->lru, pages_to_free); > > } > > if (zone) > > spin_unlock_irqrestore(&zone->lru_lock, flags); > > +} > > +/* > > + * Batched page_cache_release(). Frees and uncharges all given pages > > + * for which the reference count drops to 0. > > + */ > > +void release_pages(struct page **pages, int nr, bool cold) > > +{ > > + LIST_HEAD(pages_to_free); > > > > + while (nr) { > > + int batch = min(nr, PAGEVEC_SIZE); > > + > > + release_lru_pages(pages, batch, &pages_to_free); > > + pages += batch; > > + nr -= batch; > > + } > > We might be able to process a lot more pages in one go if nobody else > needs the lock or the CPU. Can't we just cycle the lock or reschedule > if necessary? Is it safe to cond_resched here for all callers? I hope it is but there are way too many callers to check so I am not 100% sure. Besides that spin_needbreak doesn't seem to be available for all architectures. git grep "arch_spin_is_contended(" -- arch/ arch/arm/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock) arch/arm64/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock) arch/ia64/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock) arch/mips/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock) arch/x86/include/asm/spinlock.h:static inline int arch_spin_is_contended(arch_spinlock_t *lock) Moreover it doesn't seem to do anything for !CONFIG_PREEMPT but this should be trivial to fix. I am also not sure this will work well in all cases. If we have a heavy reclaim activity on other CPUs then this path might be interrupted too often resulting in too much lock bouncing. So I guess we want at least few pages to be processed in one run. On the other hand if the lock is not contended then doing batches and retake the lock shouldn't add too much overhead, no? > diff --git a/mm/swap.c b/mm/swap.c > index 6b2dc3897cd5..ee0cf21dd521 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -944,6 +944,15 @@ void release_pages(struct page **pages, int nr, bool cold) > __ClearPageActive(page); > > list_add(&page->lru, &pages_to_free); > + > + if (should_resched() || > + (zone && spin_needbreak(&zone->lru_lock))) { > + if (zone) { > + spin_unlock_irqrestore(&zone->lru_lock, flags); > + zone = NULL; > + } > + cond_resched(); > + } > } > if (zone) > spin_unlock_irqrestore(&zone->lru_lock, flags); > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 3e0ec83d000c..c487ca4682a4 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -262,19 +262,12 @@ void free_page_and_swap_cache(struct page *page) > */ > void free_pages_and_swap_cache(struct page **pages, int nr) > { > - struct page **pagep = pages; > + int i; > > lru_add_drain(); > - while (nr) { > - int todo = min(nr, PAGEVEC_SIZE); > - int i; > - > - for (i = 0; i < todo; i++) > - free_swap_cache(pagep[i]); > - release_pages(pagep, todo, false); > - pagep += todo; > - nr -= todo; > - } > + for (i = 0; i < nr; i++) > + free_swap_cache(pages[i]); > + release_pages(pages, nr, false); > } > > /* -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754271AbaIHPsZ (ORCPT ); Mon, 8 Sep 2014 11:48:25 -0400 Received: from mga09.intel.com ([134.134.136.24]:29526 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753797AbaIHPsX (ORCPT ); Mon, 8 Sep 2014 11:48:23 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.04,486,1406617200"; d="scan'208";a="570039486" Message-ID: <540DCF99.2070900@intel.com> Date: Mon, 08 Sep 2014 08:47:37 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: Johannes Weiner , Dave Hansen CC: Michal Hocko , Hugh Dickins , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <20140905123517.GA21208@cmpxchg.org> In-Reply-To: <20140905123517.GA21208@cmpxchg.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 09/05/2014 05:35 AM, Johannes Weiner wrote: > On Thu, Sep 04, 2014 at 01:27:26PM -0700, Dave Hansen wrote: >> On 09/04/2014 07:27 AM, Michal Hocko wrote: >>> Ouch. free_pages_and_swap_cache completely kills the uncharge batching >>> because it reduces it to PAGEVEC_SIZE batches. >>> >>> I think we really do not need PAGEVEC_SIZE batching anymore. We are >>> already batching on tlb_gather layer. That one is limited so I think >>> the below should be safe but I have to think about this some more. There >>> is a risk of prolonged lru_lock wait times but the number of pages is >>> limited to 10k and the heavy work is done outside of the lock. If this >>> is really a problem then we can tear LRU part and the actual >>> freeing/uncharging into a separate functions in this path. >>> >>> Could you test with this half baked patch, please? I didn't get to test >>> it myself unfortunately. >> >> 3.16 settled out at about 11.5M faults/sec before the regression. This >> patch gets it back up to about 10.5M, which is good. The top spinlock >> contention in the kernel is still from the resource counter code via >> mem_cgroup_commit_charge(), though. > > Thanks for testing, that looks a lot better. > > But commit doesn't touch resource counters - did you mean try_charge() > or uncharge() by any chance? I don't have the perf output that I was looking at when I said this, but here's the path that I think I was referring to. The inlining makes this non-obvious, but this memcg_check_events() calls mem_cgroup_update_tree() which is contending on mctz->lock. So, you were right, it's not the resource counters code, it's a lock in 'struct mem_cgroup_tree_per_zone'. But, the contention isn't _that_ high (2% of CPU) in this case. But, that is 2% that we didn't see before. > 1.87% 1.87% [kernel] [k] _raw_spin_lock_irqsave > | > --- _raw_spin_lock_irqsave > | > |--107.09%-- memcg_check_events > | | > | |--79.98%-- mem_cgroup_commit_charge > | | | > | | |--99.81%-- do_cow_fault > | | | handle_mm_fault > | | | __do_page_fault > | | | do_page_fault > | | | page_fault > | | | testcase > | | --0.19%-- [...] From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757142AbaIIOvS (ORCPT ); Tue, 9 Sep 2014 10:51:18 -0400 Received: from gum.cmpxchg.org ([85.214.110.215]:48605 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757010AbaIIOvO (ORCPT ); Tue, 9 Sep 2014 10:51:14 -0400 Date: Tue, 9 Sep 2014 10:50:44 -0400 From: Johannes Weiner To: Dave Hansen Cc: Dave Hansen , Michal Hocko , Hugh Dickins , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140909145044.GA16027@cmpxchg.org> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <20140905123517.GA21208@cmpxchg.org> <540DCF99.2070900@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <540DCF99.2070900@intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Sep 08, 2014 at 08:47:37AM -0700, Dave Hansen wrote: > On 09/05/2014 05:35 AM, Johannes Weiner wrote: > > On Thu, Sep 04, 2014 at 01:27:26PM -0700, Dave Hansen wrote: > >> On 09/04/2014 07:27 AM, Michal Hocko wrote: > >>> Ouch. free_pages_and_swap_cache completely kills the uncharge batching > >>> because it reduces it to PAGEVEC_SIZE batches. > >>> > >>> I think we really do not need PAGEVEC_SIZE batching anymore. We are > >>> already batching on tlb_gather layer. That one is limited so I think > >>> the below should be safe but I have to think about this some more. There > >>> is a risk of prolonged lru_lock wait times but the number of pages is > >>> limited to 10k and the heavy work is done outside of the lock. If this > >>> is really a problem then we can tear LRU part and the actual > >>> freeing/uncharging into a separate functions in this path. > >>> > >>> Could you test with this half baked patch, please? I didn't get to test > >>> it myself unfortunately. > >> > >> 3.16 settled out at about 11.5M faults/sec before the regression. This > >> patch gets it back up to about 10.5M, which is good. The top spinlock > >> contention in the kernel is still from the resource counter code via > >> mem_cgroup_commit_charge(), though. > > > > Thanks for testing, that looks a lot better. > > > > But commit doesn't touch resource counters - did you mean try_charge() > > or uncharge() by any chance? > > I don't have the perf output that I was looking at when I said this, but > here's the path that I think I was referring to. The inlining makes > this non-obvious, but this memcg_check_events() calls > mem_cgroup_update_tree() which is contending on mctz->lock. > > So, you were right, it's not the resource counters code, it's a lock in > 'struct mem_cgroup_tree_per_zone'. But, the contention isn't _that_ > high (2% of CPU) in this case. But, that is 2% that we didn't see before. > > > 1.87% 1.87% [kernel] [k] _raw_spin_lock_irqsave > > | > > --- _raw_spin_lock_irqsave > > | > > |--107.09%-- memcg_check_events > > | | > > | |--79.98%-- mem_cgroup_commit_charge > > | | | > > | | |--99.81%-- do_cow_fault > > | | | handle_mm_fault > > | | | __do_page_fault > > | | | do_page_fault > > | | | page_fault > > | | | testcase > > | | --0.19%-- [...] The mctz->lock is only taken when there is, or has been, soft limit excess. However, the soft limit defaults to infinity, so unless you set it explicitly on the root level, I can't see how this could be mctz->lock contention. It's more plausible that this is the res_counter lock for testing soft limit excess - for me, both these locks get inlined into check_events, could you please double check you got the right lock? As the limit defaults to infinity, and really doesn't mean anything on the root level it's idiotic to test it, we can easily eliminate that. With the patch below, I don't have that trace show up in the profile anymore. Could you please give it a try? You also said that this cost hasn't been there before, but I do see that trace in both v3.16 and v3.17-rc3 with roughly the same impact (although my machines show less contention than yours). Could you please double check that this is in fact a regression independent of 05b843012335 ("mm: memcontrol: use root_mem_cgroup res_counter")? Thanks! --- >>From 465c5caa0628d640c2493e9d849dc9a1f0b373a4 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Tue, 9 Sep 2014 09:25:20 -0400 Subject: [patch] mm: memcontrol: do not track soft limit excess on the root level Dave encounters res_counter lock contention from memcg_check_events() when running a multi-threaded page fault benchmark in the root group. This lock is taken to maintain the tree of soft limit excessors, which is used by global reclaim to prioritize excess groups. But that makes no sense on the root level - parent to all other groups, and so all this overhead is unnecessary. Skip it. [ The soft limit really shouldn't even be settable on the root level, but it's been like that forever, so don't risk breaking dopy user space over this now. ] Reported-by: Dave Hansen Signed-off-by: Johannes Weiner --- mm/memcontrol.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 085dc6d2f876..b4de17e4f267 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1013,10 +1013,11 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) /* threshold event is triggered in finer grain than soft limit */ if (unlikely(mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_THRESH))) { - bool do_softlimit; + bool do_softlimit = false; bool do_numainfo __maybe_unused; - do_softlimit = mem_cgroup_event_ratelimit(memcg, + if (!mem_cgroup_is_root(memcg)) + do_softlimit = mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_SOFTLIMIT); #if MAX_NUMNODES > 1 do_numainfo = mem_cgroup_event_ratelimit(memcg, -- 2.0.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751900AbaIISXU (ORCPT ); Tue, 9 Sep 2014 14:23:20 -0400 Received: from www.sr71.net ([198.145.64.142]:37438 "EHLO blackbird.sr71.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751774AbaIISXR (ORCPT ); Tue, 9 Sep 2014 14:23:17 -0400 Message-ID: <540F4592.9030408@sr71.net> Date: Tue, 09 Sep 2014 11:23:14 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: Johannes Weiner , Dave Hansen CC: Michal Hocko , Hugh Dickins , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <20140905123517.GA21208@cmpxchg.org> <540DCF99.2070900@intel.com> <20140909145044.GA16027@cmpxchg.org> In-Reply-To: <20140909145044.GA16027@cmpxchg.org> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 09/09/2014 07:50 AM, Johannes Weiner wrote: > The mctz->lock is only taken when there is, or has been, soft limit > excess. However, the soft limit defaults to infinity, so unless you > set it explicitly on the root level, I can't see how this could be > mctz->lock contention. > > It's more plausible that this is the res_counter lock for testing soft > limit excess - for me, both these locks get inlined into check_events, > could you please double check you got the right lock? I got the wrong lock. Here's how it looks after mainline, plus your free_pages_and_swap_cache() patch: Samples: 2M of event 'cycles', Event count (approx.): 51647128377 + 60.60% 1.33% page_fault2_processes [.] testcase ▒ + 59.14% 0.41% [kernel] [k] page_fault ◆ + 58.72% 0.01% [kernel] [k] do_page_fault ▒ + 58.70% 0.08% [kernel] [k] __do_page_fault ▒ + 58.50% 0.29% [kernel] [k] handle_mm_fault ▒ + 40.14% 0.28% [kernel] [k] do_cow_fault ▒ - 34.56% 34.56% [kernel] [k] _raw_spin_lock ▒ - _raw_spin_lock ▒ - 78.11% __res_counter_charge ▒ res_counter_charge ▒ try_charge ▒ - mem_cgroup_try_charge ▒ + 99.99% do_cow_fault ▒ - 10.30% res_counter_uncharge_until ▒ res_counter_uncharge ▒ uncharge_batch ▒ uncharge_list ▒ mem_cgroup_uncharge_list ▒ release_pages ▒ + 4.75% free_pcppages_bulk ▒ + 3.65% do_cow_fault ▒ + 2.24% get_page_from_freelist ▒ > You also said that this cost hasn't been there before, but I do see > that trace in both v3.16 and v3.17-rc3 with roughly the same impact > (although my machines show less contention than yours). Could you > please double check that this is in fact a regression independent of > 05b843012335 ("mm: memcontrol: use root_mem_cgroup res_counter")? Here's the same workload on the same machine with only Johannes' revert applied: - 35.92% 35.92% [kernel] [k] _raw_spin_lock ▒ - _raw_spin_lock ▒ - 49.09% get_page_from_freelist ▒ - __alloc_pages_nodemask ▒ + 99.90% alloc_pages_vma ▒ - 43.67% free_pcppages_bulk ▒ - 100.00% free_hot_cold_page ▒ + 99.93% free_hot_cold_page_list ▒ - 7.08% do_cow_fault ▒ handle_mm_fault ▒ __do_page_fault ▒ do_page_fault ▒ page_fault ▒ testcase ▒ So I think it's probably part of the same regression. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752777AbaIJQ3n (ORCPT ); Wed, 10 Sep 2014 12:29:43 -0400 Received: from mail-wg0-f47.google.com ([74.125.82.47]:60958 "EHLO mail-wg0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752459AbaIJQ3j (ORCPT ); Wed, 10 Sep 2014 12:29:39 -0400 Date: Wed, 10 Sep 2014 18:29:36 +0200 From: Michal Hocko To: Dave Hansen Cc: Johannes Weiner , Hugh Dickins , Dave Hansen , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140910162936.GI25219@dhcp22.suse.cz> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <20140905092537.GC26243@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140905092537.GC26243@dhcp22.suse.cz> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 05-09-14 11:25:37, Michal Hocko wrote: > On Thu 04-09-14 13:27:26, Dave Hansen wrote: > > On 09/04/2014 07:27 AM, Michal Hocko wrote: > > > Ouch. free_pages_and_swap_cache completely kills the uncharge batching > > > because it reduces it to PAGEVEC_SIZE batches. > > > > > > I think we really do not need PAGEVEC_SIZE batching anymore. We are > > > already batching on tlb_gather layer. That one is limited so I think > > > the below should be safe but I have to think about this some more. There > > > is a risk of prolonged lru_lock wait times but the number of pages is > > > limited to 10k and the heavy work is done outside of the lock. If this > > > is really a problem then we can tear LRU part and the actual > > > freeing/uncharging into a separate functions in this path. > > > > > > Could you test with this half baked patch, please? I didn't get to test > > > it myself unfortunately. > > > > 3.16 settled out at about 11.5M faults/sec before the regression. This > > patch gets it back up to about 10.5M, which is good. > > Dave, would you be willing to test the following patch as well? I do not > have a huge machine at hand right now. It would be great if you could I was playing with 48CPU with 32G of RAM machine but the res_counter lock didn't show up in the traces much (this was with 96 processes doing mmap (256M private file, faul, unmap in parallel): |--0.75%-- __res_counter_charge | res_counter_charge | try_charge | mem_cgroup_try_charge | | | |--81.56%-- do_cow_fault | | handle_mm_fault | | __do_page_fault | | do_page_fault | | page_fault [...] | | | --18.44%-- __add_to_page_cache_locked | add_to_page_cache_lru | mpage_readpages | ext4_readpages | __do_page_cache_readahead | ondemand_readahead | page_cache_async_readahead | filemap_fault | __do_fault | do_cow_fault | handle_mm_fault | __do_page_fault | do_page_fault | page_fault Nothing really changed in that regards when I reduced mmap size to 128M and run with 4*CPUs. I do not have a bigger machine to play with unfortunately. I think the patch makes sense on its own. I would really appreciate if you could give it a try on your machine with !root memcg case to see how much it helped. I would expect similar results to your previous testing without the revert and Johannes' patch. [...] -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752527AbaIJQ6N (ORCPT ); Wed, 10 Sep 2014 12:58:13 -0400 Received: from mga02.intel.com ([134.134.136.20]:43888 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751688AbaIJQ6L (ORCPT ); Wed, 10 Sep 2014 12:58:11 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.04,500,1406617200"; d="scan'208";a="600953121" Message-ID: <54108314.80400@intel.com> Date: Wed, 10 Sep 2014 09:57:56 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: Michal Hocko , Dave Hansen CC: Johannes Weiner , Hugh Dickins , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML Subject: Re: regression caused by cgroups optimization in 3.17-rc2 References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <20140905092537.GC26243@dhcp22.suse.cz> <20140910162936.GI25219@dhcp22.suse.cz> In-Reply-To: <20140910162936.GI25219@dhcp22.suse.cz> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 09/10/2014 09:29 AM, Michal Hocko wrote: > I do not have a bigger machine to play with unfortunately. I think the > patch makes sense on its own. I would really appreciate if you could > give it a try on your machine with !root memcg case to see how much it > helped. I would expect similar results to your previous testing without > the revert and Johannes' patch. So you want to see before/after this patch: Subject: [PATCH] mm, memcg: Do not kill release batching in free_pages_and_swap_cache And you want it on top of a kernel with the revert or without? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752410AbaIJRFM (ORCPT ); Wed, 10 Sep 2014 13:05:12 -0400 Received: from mail-we0-f179.google.com ([74.125.82.179]:40526 "EHLO mail-we0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751511AbaIJRFK (ORCPT ); Wed, 10 Sep 2014 13:05:10 -0400 Date: Wed, 10 Sep 2014 19:05:06 +0200 From: Michal Hocko To: Dave Hansen Cc: Dave Hansen , Johannes Weiner , Hugh Dickins , Tejun Heo , Linux-MM , Andrew Morton , Linus Torvalds , Vladimir Davydov , LKML Subject: Re: regression caused by cgroups optimization in 3.17-rc2 Message-ID: <20140910170506.GL25219@dhcp22.suse.cz> References: <54061505.8020500@sr71.net> <5406262F.4050705@intel.com> <54062F32.5070504@sr71.net> <20140904142721.GB14548@dhcp22.suse.cz> <5408CB2E.3080101@sr71.net> <20140905092537.GC26243@dhcp22.suse.cz> <20140910162936.GI25219@dhcp22.suse.cz> <54108314.80400@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54108314.80400@intel.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 10-09-14 09:57:56, Dave Hansen wrote: > On 09/10/2014 09:29 AM, Michal Hocko wrote: > > I do not have a bigger machine to play with unfortunately. I think the > > patch makes sense on its own. I would really appreciate if you could > > give it a try on your machine with !root memcg case to see how much it > > helped. I would expect similar results to your previous testing without > > the revert and Johannes' patch. > > So you want to see before/after this patch: > > Subject: [PATCH] mm, memcg: Do not kill release batching in > free_pages_and_swap_cache > > And you want it on top of a kernel with the revert or without? Revert doesn't make any difference if you run the load inside a memcg (without any limit set). So just before and after the patch would be sufficient. Thanks a lot Dave! -- Michal Hocko SUSE Labs