* [patch 0/4] mm: memcontrol: populate unified hierarchy interface
@ 2014-08-04 21:14 Johannes Weiner
2014-08-04 21:14 ` [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests Johannes Weiner
` (4 more replies)
0 siblings, 5 replies; 23+ messages in thread
From: Johannes Weiner @ 2014-08-04 21:14 UTC (permalink / raw)
To: Andrew Morton
Cc: Michal Hocko, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
Hi,
the ongoing versioning of the cgroup user interface gives us a chance
to clean up the memcg control interface and fix a lot of
inconsistencies and ugliness that crept in over time.
This series adds a minimal set of control files to the new memcg
interface to get basic memcg functionality going in unified hierarchy:
- memory.current: a read-only file that shows current memory usage.
- memory.high: a file that allows setting a high limit on the memory
usage. This is an elastic limit, which is enforced via direct
reclaim, so allocators are throttled once it's reached, but it can
be exceeded and does not trigger OOM kills. This should be a much
more suitable default upper boundary for the majority of use cases
that are better off with some elasticity than with sudden OOM kills.
- memory.max: a file that allows setting a maximum limit on memory
usage which is ultimately enforced by OOM killing tasks in the
group. This is for setups that want strict isolation at the cost of
task death above a certain point. However, even those can still
combine the max limit with the high limit to approach OOM situations
gracefully and with time to intervene.
- memory.vmstat: vmstat-style per-memcg statistics. Very minimal for
now (lru stats, allocations and frees, faults), but fixing
fundamental issues of the old memory.stat file, including gross
misnomers like pgpgin/pgpgout for pages charged/uncharged etc.
Documentation/cgroups/unified-hierarchy.txt | 18 +++
include/linux/res_counter.h | 29 +++++
include/linux/swap.h | 3 +-
kernel/res_counter.c | 3 +
mm/memcontrol.c | 177 +++++++++++++++++++++++++---
mm/vmscan.c | 3 +-
6 files changed, 216 insertions(+), 17 deletions(-)
^ permalink raw reply	[flat|nested] 23+ messages in thread

* [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
  2014-08-04 21:14 [patch 0/4] mm: memcontrol: populate unified hierarchy interface Johannes Weiner
@ 2014-08-04 21:14 ` Johannes Weiner
  [not found]   ` <1407186897-21048-2-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  2014-08-04 21:14 ` [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy Johannes Weiner
  ` (3 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Johannes Weiner @ 2014-08-04 21:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, Tejun Heo, linux-mm, cgroups, linux-kernel

Instead of passing the request size to direct reclaim, memcg just
manually loops around reclaiming SWAP_CLUSTER_MAX pages until the
charge can succeed.  That potentially wastes scan progress when huge
page allocations require multiple invocations, which always have to
restart from the default scan priority.

Pass the request size as a reclaim target to direct reclaim and leave
it to that code to reach the goal.  Charging will still have to loop
in case concurrent allocations steal reclaim effort, but at least it
doesn't have to loop to meet even the basic request size.  This also
prepares the memcg reclaim API for use with the planned high limit,
to reclaim excess with a single call.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/swap.h |  3 ++-
 mm/memcontrol.c      | 15 +++++++++------
 mm/vmscan.c          |  3 ++-
 3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1b72060f093a..473a3ae4cdd6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -327,7 +327,8 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
-extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
+extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+						  unsigned long nr_pages,
 						  gfp_t gfp_mask, bool noswap);
 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ec4dcf1b9562..ddffeeda2d52 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1793,6 +1793,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 }
 
 static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
+					unsigned long nr_pages,
 					gfp_t gfp_mask,
 					unsigned long flags)
 {
@@ -1808,7 +1809,8 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
 	for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
 		if (loop)
 			drain_all_stock_async(memcg);
-		total += try_to_free_mem_cgroup_pages(memcg, gfp_mask, noswap);
+		total += try_to_free_mem_cgroup_pages(memcg, nr_pages,
+						      gfp_mask, noswap);
 		/*
 		 * Allow limit shrinkers, which are triggered directly
 		 * by userspace, to catch signals and stop reclaim
@@ -1816,7 +1818,7 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
 		 */
 		if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
 			break;
-		if (mem_cgroup_margin(memcg))
+		if (mem_cgroup_margin(memcg) >= nr_pages)
 			break;
 		/*
 		 * If nothing was reclaimed after two attempts, there
@@ -2572,7 +2574,8 @@ retry:
 	if (!(gfp_mask & __GFP_WAIT))
 		goto nomem;
 
-	nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
+	nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, nr_pages,
+					  gfp_mask, flags);
 
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		goto retry;
@@ -3718,7 +3721,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		mem_cgroup_reclaim(memcg, GFP_KERNEL,
+		mem_cgroup_reclaim(memcg, 1, GFP_KERNEL,
 				   MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
@@ -3777,7 +3780,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		mem_cgroup_reclaim(memcg, GFP_KERNEL,
+		mem_cgroup_reclaim(memcg, 1, GFP_KERNEL,
 				   MEM_CGROUP_RECLAIM_NOSWAP |
 				   MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
@@ -4028,7 +4031,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 		if (signal_pending(current))
 			return -EINTR;
 
-		progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL,
+		progress = try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
 							false);
 		if (!progress) {
 			nr_retries--;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d698f4f7b0f2..7db33f100db4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2747,6 +2747,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
 }
 
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
 					   bool noswap)
 {
@@ -2754,7 +2755,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	unsigned long nr_reclaimed;
 	int nid;
 	struct scan_control sc = {
-		.nr_to_reclaim = SWAP_CLUSTER_MAX,
+		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
 		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 		.target_mem_cgroup = memcg,
-- 
2.0.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 23+ messages in thread
[parent not found: <1407186897-21048-2-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>]
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
  [not found] ` <1407186897-21048-2-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2014-08-07 13:08   ` Michal Hocko
  [not found]     ` <20140807130822.GB12730-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2014-08-07 13:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Mon 04-08-14 17:14:54, Johannes Weiner wrote:
> Instead of passing the request size to direct reclaim, memcg just
> manually loops around reclaiming SWAP_CLUSTER_MAX pages until the
> charge can succeed. That potentially wastes scan progress when huge
> page allocations require multiple invocations, which always have to
> restart from the default scan priority.
>
> Pass the request size as a reclaim target to direct reclaim and leave
> it to that code to reach the goal.

THP charge then will ask for 512 pages to be (direct) reclaimed.  That
is _a lot_ and I would expect long stalls to achieve this target.  I
would also expect quick priority drop down and potential over-reclaim
for small and moderately sized memcgs (e.g. memcg with 1G worth of pages
would need to drop down below DEF_PRIORITY-2 to have a chance to scan
that many pages).  All that done for a charge which can fall back to a
single page charge.

The current code is quite hostile to THP when we are close to the limit
but solving this by introducing long stalls instead doesn't sound like a
proper approach to me.

> Charging will still have to loop in case concurrent allocations steal
> reclaim effort, but at least it doesn't have to loop to meet even the
> basic request size. This also prepares the memcg reclaim API for use
> with the planned high limit, to reclaim excess with a single call.
[...]

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
[parent not found: <20140807130822.GB12730-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
  [not found]     ` <20140807130822.GB12730-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2014-08-07 15:31       ` Johannes Weiner
  2014-08-07 16:10         ` Greg Thelen
  2014-08-08 12:32         ` Michal Hocko
  0 siblings, 2 replies; 23+ messages in thread
From: Johannes Weiner @ 2014-08-07 15:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Thu, Aug 07, 2014 at 03:08:22PM +0200, Michal Hocko wrote:
> On Mon 04-08-14 17:14:54, Johannes Weiner wrote:
> > Instead of passing the request size to direct reclaim, memcg just
> > manually loops around reclaiming SWAP_CLUSTER_MAX pages until the
> > charge can succeed. That potentially wastes scan progress when huge
> > page allocations require multiple invocations, which always have to
> > restart from the default scan priority.
> >
> > Pass the request size as a reclaim target to direct reclaim and leave
> > it to that code to reach the goal.
>
> THP charge then will ask for 512 pages to be (direct) reclaimed.  That
> is _a lot_ and I would expect long stalls to achieve this target.  I
> would also expect quick priority drop down and potential over-reclaim
> for small and moderately sized memcgs (e.g. memcg with 1G worth of pages
> would need to drop down below DEF_PRIORITY-2 to have a chance to scan
> that many pages).  All that done for a charge which can fall back to a
> single page charge.
>
> The current code is quite hostile to THP when we are close to the limit
> but solving this by introducing long stalls instead doesn't sound like a
> proper approach to me.

THP latencies are actually the same when comparing high limit nr_pages
reclaim with the current hard limit SWAP_CLUSTER_MAX reclaim, although
system time is reduced with the high limit.

High limit reclaim with SWAP_CLUSTER_MAX has better fault latency but
it doesn't actually contain the workload - with 1G high and a 4G load,
the consumption at the end of the run is 3.7G.

So what I'm proposing works and is of equal quality from a THP POV.
This change is complicated enough when we stick to the facts, let's
not make up things based on gut feeling.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
  2014-08-07 15:31 ` Johannes Weiner
@ 2014-08-07 16:10   ` Greg Thelen
  2014-08-08 12:47     ` Michal Hocko
  2014-08-08 12:32   ` Michal Hocko
  1 sibling, 1 reply; 23+ messages in thread
From: Greg Thelen @ 2014-08-07 16:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Thu, Aug 07 2014, Johannes Weiner wrote:
> On Thu, Aug 07, 2014 at 03:08:22PM +0200, Michal Hocko wrote:
>> On Mon 04-08-14 17:14:54, Johannes Weiner wrote:
>> > Instead of passing the request size to direct reclaim, memcg just
>> > manually loops around reclaiming SWAP_CLUSTER_MAX pages until the
>> > charge can succeed. That potentially wastes scan progress when huge
>> > page allocations require multiple invocations, which always have to
>> > restart from the default scan priority.
>> >
>> > Pass the request size as a reclaim target to direct reclaim and leave
>> > it to that code to reach the goal.
>>
>> THP charge then will ask for 512 pages to be (direct) reclaimed.  That
>> is _a lot_ and I would expect long stalls to achieve this target.  I
>> would also expect quick priority drop down and potential over-reclaim
>> for small and moderately sized memcgs (e.g. memcg with 1G worth of pages
>> would need to drop down below DEF_PRIORITY-2 to have a chance to scan
>> that many pages).  All that done for a charge which can fall back to a
>> single page charge.
>>
>> The current code is quite hostile to THP when we are close to the limit
>> but solving this by introducing long stalls instead doesn't sound like a
>> proper approach to me.
>
> THP latencies are actually the same when comparing high limit nr_pages
> reclaim with the current hard limit SWAP_CLUSTER_MAX reclaim, although
> system time is reduced with the high limit.
>
> High limit reclaim with SWAP_CLUSTER_MAX has better fault latency but
> it doesn't actually contain the workload - with 1G high and a 4G load,
> the consumption at the end of the run is 3.7G.
>
> So what I'm proposing works and is of equal quality from a THP POV.
> This change is complicated enough when we stick to the facts, let's
> not make up things based on gut feeling.

I think that high order non-THP page allocations also benefit from
this.  Such allocations don't have a small page fallback.

This may be in flux, but linux-next shows me that:
* mem_cgroup_reclaim() frees at least SWAP_CLUSTER_MAX (32) pages.
* try_charge() calls mem_cgroup_reclaim() indefinitely for costly (3)
  or smaller orders, assuming that something is reclaimed on each
  iteration.
* try_charge() uses a loop of MEM_CGROUP_RECLAIM_RETRIES (5) for
  larger-than-costly orders.

So for larger-than-costly allocations, try_charge() should be able to
reclaim 160 (5*32) pages, which satisfies an order:7 allocation.  But
for order:8+ allocations, try_charge() and mem_cgroup_reclaim() are
too eager to give up without something like this.  So I think this
patch is a step in the right direction.

Coincidentally, we've recently been experimenting with something like
this.  Though we didn't modify the interface between
mem_cgroup_reclaim() and try_to_free_mem_cgroup_pages() - instead we
looped within mem_cgroup_reclaim() until nr_pages of margin were
found.  But I have no objection to the proposed plumbing of nr_pages
all the way into try_to_free_mem_cgroup_pages().

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
  2014-08-07 16:10 ` Greg Thelen
@ 2014-08-08 12:47   ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2014-08-08 12:47 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Johannes Weiner, Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Thu 07-08-14 09:10:43, Greg Thelen wrote:
> On Thu, Aug 07 2014, Johannes Weiner wrote:
[...]
> > So what I'm proposing works and is of equal quality from a THP POV.
> > This change is complicated enough when we stick to the facts, let's
> > not make up things based on gut feeling.
>
> I think that high order non-THP page allocations also benefit from this.
> Such allocations don't have a small page fallback.
>
> This may be in flux, but linux-next shows me that:
> * mem_cgroup_reclaim() frees at least SWAP_CLUSTER_MAX (32) pages.
> * try_charge() calls mem_cgroup_reclaim() indefinitely for costly (3)
>   or smaller orders, assuming that something is reclaimed on each
>   iteration.
> * try_charge() uses a loop of MEM_CGROUP_RECLAIM_RETRIES (5) for
>   larger-than-costly orders.

Unless there is __GFP_NORETRY, which fails the charge after the first
round of unsuccessful reclaim.  This is the case regardless of
nr_pages, but only THP are charged with __GFP_NORETRY currently.

> So for larger-than-costly allocations, try_charge() should be able to
> reclaim 160 (5*32) pages, which satisfies an order:7 allocation.  But
> for order:8+ allocations, try_charge() and mem_cgroup_reclaim() are
> too eager to give up without something like this.  So I think this
> patch is a step in the right direction.

I think we should be careful about charges which are OK to fail
because there is a fallback for them (THP).  The only other high-order
charges are coming from kmem, and I am not yet sure what to do about
those without memcg-specific slab reclaim.  I wouldn't make this
discussion more complicated for that case now.

> Coincidentally, we've recently been experimenting with something like
> this.  Though we didn't modify the interface between
> mem_cgroup_reclaim() and try_to_free_mem_cgroup_pages() - instead we
> looped within mem_cgroup_reclaim() until nr_pages of margin were
> found.  But I have no objection to the proposed plumbing of nr_pages
> all the way into try_to_free_mem_cgroup_pages().

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
  2014-08-07 15:31 ` Johannes Weiner
  2014-08-07 16:10   ` Greg Thelen
@ 2014-08-08 12:32   ` Michal Hocko
  2014-08-08 13:26     ` Johannes Weiner
  1 sibling, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2014-08-08 12:32 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Thu 07-08-14 11:31:41, Johannes Weiner wrote:
> On Thu, Aug 07, 2014 at 03:08:22PM +0200, Michal Hocko wrote:
> > On Mon 04-08-14 17:14:54, Johannes Weiner wrote:
> > > Instead of passing the request size to direct reclaim, memcg just
> > > manually loops around reclaiming SWAP_CLUSTER_MAX pages until the
> > > charge can succeed. That potentially wastes scan progress when huge
> > > page allocations require multiple invocations, which always have to
> > > restart from the default scan priority.
> > >
> > > Pass the request size as a reclaim target to direct reclaim and leave
> > > it to that code to reach the goal.
> >
> > THP charge then will ask for 512 pages to be (direct) reclaimed.  That
> > is _a lot_ and I would expect long stalls to achieve this target.  I
> > would also expect quick priority drop down and potential over-reclaim
> > for small and moderately sized memcgs (e.g. memcg with 1G worth of pages
> > would need to drop down below DEF_PRIORITY-2 to have a chance to scan
> > that many pages).  All that done for a charge which can fall back to a
> > single page charge.
> >
> > The current code is quite hostile to THP when we are close to the limit
> > but solving this by introducing long stalls instead doesn't sound like a
> > proper approach to me.
>
> THP latencies are actually the same when comparing high limit nr_pages
> reclaim with the current hard limit SWAP_CLUSTER_MAX reclaim,

Are you sure about this?  I fail to see how they can be the same, as THP
allocations/charges are __GFP_NORETRY, so there is only one reclaim
round for the hard limit reclaim, followed by the charge failure if
it is not successful.

> although system time is reduced with the high limit.
> High limit reclaim with SWAP_CLUSTER_MAX has better fault latency but
> it doesn't actually contain the workload - with 1G high and a 4G load,
> the consumption at the end of the run is 3.7G.

Wouldn't it help to simply fail the charge and allow the charger to
fall back for THP allocations if the usage is above the high limit too
much?  The follow-up single page charge fallback would still be
throttled.

> So what I'm proposing works and is of equal quality from a THP POV.
> This change is complicated enough when we stick to the facts, let's
> not make up things based on gut feeling.

Agreed, and I would expect those _facts_ to be part of the changelog.

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
  2014-08-08 12:32 ` Michal Hocko
@ 2014-08-08 13:26   ` Johannes Weiner
  2014-08-11  7:49     ` Michal Hocko
  2014-08-13 14:59     ` Michal Hocko
  0 siblings, 2 replies; 23+ messages in thread
From: Johannes Weiner @ 2014-08-08 13:26 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Fri, Aug 08, 2014 at 02:32:58PM +0200, Michal Hocko wrote:
> On Thu 07-08-14 11:31:41, Johannes Weiner wrote:
> > On Thu, Aug 07, 2014 at 03:08:22PM +0200, Michal Hocko wrote:
> > > On Mon 04-08-14 17:14:54, Johannes Weiner wrote:
> > > > Instead of passing the request size to direct reclaim, memcg just
> > > > manually loops around reclaiming SWAP_CLUSTER_MAX pages until the
> > > > charge can succeed. That potentially wastes scan progress when huge
> > > > page allocations require multiple invocations, which always have to
> > > > restart from the default scan priority.
> > > >
> > > > Pass the request size as a reclaim target to direct reclaim and leave
> > > > it to that code to reach the goal.
> > >
> > > THP charge then will ask for 512 pages to be (direct) reclaimed.  That
> > > is _a lot_ and I would expect long stalls to achieve this target.  I
> > > would also expect quick priority drop down and potential over-reclaim
> > > for small and moderately sized memcgs (e.g. memcg with 1G worth of pages
> > > would need to drop down below DEF_PRIORITY-2 to have a chance to scan
> > > that many pages).  All that done for a charge which can fall back to a
> > > single page charge.
> > >
> > > The current code is quite hostile to THP when we are close to the limit
> > > but solving this by introducing long stalls instead doesn't sound like a
> > > proper approach to me.
> >
> > THP latencies are actually the same when comparing high limit nr_pages
> > reclaim with the current hard limit SWAP_CLUSTER_MAX reclaim,
>
> Are you sure about this?  I fail to see how they can be the same, as THP
> allocations/charges are __GFP_NORETRY, so there is only one reclaim
> round for the hard limit reclaim, followed by the charge failure if
> it is not successful.

I use this test program that faults in anon pages, reports average and
max for every 512-page chunk (THP size), then reports the aggregate at
the end:

memory.max:

avg=18729us max=450625us

real	0m14.335s
user	0m0.157s
sys	0m6.307s

memory.high:

avg=18676us max=457499us

real	0m14.375s
user	0m0.046s
sys	0m4.294s

> > although system time is reduced with the high limit.
> > High limit reclaim with SWAP_CLUSTER_MAX has better fault latency but
> > it doesn't actually contain the workload - with 1G high and a 4G load,
> > the consumption at the end of the run is 3.7G.
>
> Wouldn't it help to simply fail the charge and allow the charger to
> fall back for THP allocations if the usage is above the high limit too
> much?  The follow-up single page charge fallback would still be
> throttled.

This is about defining the limit semantics in unified hierarchy, and
not really the time or place to optimize THP charge latency.

What are you trying to accomplish here?

> > So what I'm proposing works and is of equal quality from a THP POV.
> > This change is complicated enough when we stick to the facts, let's
> > not make up things based on gut feeling.
>
> Agreed, and I would expect those _facts_ to be part of the changelog.

You made unfounded claims about THP allocation latencies, and I showed
the numbers to refute them, but that doesn't make any of this relevant
for the changelogs of these patches.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
  2014-08-08 13:26 ` Johannes Weiner
@ 2014-08-11  7:49   ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2014-08-11  7:49 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Fri 08-08-14 09:26:35, Johannes Weiner wrote:
> On Fri, Aug 08, 2014 at 02:32:58PM +0200, Michal Hocko wrote:
> > On Thu 07-08-14 11:31:41, Johannes Weiner wrote:
[...]
> > > although system time is reduced with the high limit.
> > > High limit reclaim with SWAP_CLUSTER_MAX has better fault latency but
> > > it doesn't actually contain the workload - with 1G high and a 4G load,
> > > the consumption at the end of the run is 3.7G.
> >
> > Wouldn't it help to simply fail the charge and allow the charger to
> > fall back for THP allocations if the usage is above the high limit too
> > much?  The follow-up single page charge fallback would still be
> > throttled.
>
> This is about defining the limit semantics in unified hierarchy, and
> not really the time or place to optimize THP charge latency.
>
> What are you trying to accomplish here?

Well, there are two things.  The first is that this patch changes the
way THP are charged against the hard limit without any data in the
changelog to back it up.  This is the primary concern.

The other part is the high limit behavior for a large excess.  You
have chosen to reclaim the entire excess even when quite a lot of
pages might have to be direct reclaimed.  This is potentially
dangerous because the excess might be really huge (consider multiple
tasks charging THPs simultaneously on many CPUs).  Do you really want
to direct reclaim nr_online_cpus * 512 pages in a single direct
reclaim pass, and on all of those CPUs?  This is an extreme case, all
right, but the point stands.  There has to be a certain cap.

Also, it seems that the primary source of trouble is THP, so the
question is: do we really want to push hard to reclaim enough charges,
or would we rather fail the THP charge and go with the single page
retry?

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests 2014-08-08 13:26 ` Johannes Weiner 2014-08-11 7:49 ` Michal Hocko @ 2014-08-13 14:59 ` Michal Hocko 2014-08-13 20:41 ` Johannes Weiner 1 sibling, 1 reply; 23+ messages in thread From: Michal Hocko @ 2014-08-13 14:59 UTC (permalink / raw) To: Johannes Weiner; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel On Fri 08-08-14 09:26:35, Johannes Weiner wrote: > On Fri, Aug 08, 2014 at 02:32:58PM +0200, Michal Hocko wrote: > > On Thu 07-08-14 11:31:41, Johannes Weiner wrote: [...] > > > THP latencies are actually the same when comparing high limit nr_pages > > > reclaim with the current hard limit SWAP_CLUSTER_MAX reclaim, > > > > Are you sure about this? I fail to see how they can be same as THP > > allocations/charges are __GFP_NORETRY so there is only one reclaim > > round for the hard limit reclaim followed by the charge failure if > > it is not successful. > > I use this test program that faults in anon pages, reports average and > max for every 512-page chunk (THP size), then reports the aggregate at > the end: > > memory.max: > > avg=18729us max=450625us > > real 0m14.335s > user 0m0.157s > sys 0m6.307s > > memory.high: > > avg=18676us max=457499us > > real 0m14.375s > user 0m0.046s > sys 0m4.294s I was playing with something like that as well. mmap 800MB anon mapping in 256MB memcg (kvm guest had 1G RAM and 2G swap so the global reclaim doesn't trigger and the host 2G free memory), start faulting in from THP aligned address and measured each fault. Then I was recording mm_vmscan_lru_shrink_inactive and mm_vmscan_memcg_reclaim_{begin,end} tracepoints to see how the reclaim went. I was testing two setups 1) fault in every 4k page 2) fault in only 2M aligned addresses. The first simulates the case where successful THP allocation saves follow up 511 fallback charges and so the excessive reclaim might pay off. 
The second one simulates potential time wasting when memory is used extremely sparsely and any latencies would be unwelcome.

(new refers to nr_reclaim target, old to SWAP_CLUSTER_MAX; thponly faults only 2M-aligned addresses, otherwise every 4k page is faulted)

vmstat says:
out.256m.new-thponly.vmstat.after:pswpin 44
out.256m.new-thponly.vmstat.after:pswpout 154681
out.256m.new-thponly.vmstat.after:thp_fault_alloc 399
out.256m.new-thponly.vmstat.after:thp_fault_fallback 0
out.256m.new-thponly.vmstat.after:thp_split 302

out.256m.old-thponly.vmstat.after:pswpin 28
out.256m.old-thponly.vmstat.after:pswpout 31271
out.256m.old-thponly.vmstat.after:thp_fault_alloc 149
out.256m.old-thponly.vmstat.after:thp_fault_fallback 250
out.256m.old-thponly.vmstat.after:thp_split 61

out.256m.new.vmstat.after:pswpin 48
out.256m.new.vmstat.after:pswpout 169530
out.256m.new.vmstat.after:thp_fault_alloc 399
out.256m.new.vmstat.after:thp_fault_fallback 0
out.256m.new.vmstat.after:thp_split 331

out.256m.old.vmstat.after:pswpin 47
out.256m.old.vmstat.after:pswpout 156514
out.256m.old.vmstat.after:thp_fault_alloc 127
out.256m.old.vmstat.after:thp_fault_fallback 272
out.256m.old.vmstat.after:thp_split 127

As expected, new managed to fault in all requests as THP without a single fallback allocation, while with the old reclaim we got to the limit and then most of the THP charges failed and fell back to single-page charges. Note the increased swapout activity for new: it is almost 5x more for thponly and +8% with per-page faults. This looks like fallout from the over-reclaim at smaller priorities. The tracepoints tell us the priority at which each reclaim round ended:

- trace.new-thponly
  Count Priority
      1 3
      2 5
    159 6
     24 7
- trace.old-thponly
    230 10
      1 11
      1 12
      1 3
     39 9

Again, as expected, the priority drops much further for new.
- trace.new
    229 0
      3 12
- trace.old
    294 0
      2 1
     25 10
      1 11
      3 12
      8 2
      8 3
     20 4
     33 5
     21 6
     43 7
   1286 8
   1279 9

And here as well, we have to reclaim much more because we make many more charges, so the load benefits a bit from the high reclaim target.

The mm_vmscan_memcg_reclaim_end tracepoint also tells us how many pages were reclaimed during each run; the cumulative numbers are:
- trace.new-thponly: 139029
- trace.old-thponly: 11344
- trace.new: 139687
- trace.old: 139887

time -v says:
out.256m.new-thponly.time: System time (seconds): 1.50
out.256m.new-thponly.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.56
out.256m.old-thponly.time: System time (seconds): 0.45
out.256m.old-thponly.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.76

out.256m.new.time: System time (seconds): 1.45
out.256m.new.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.12
out.256m.old.time: System time (seconds): 2.08
out.256m.old.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.26

I guess this is expected as well. Sparse access doesn't amortize the costly reclaim for each charged THP. On the other hand it can help a bit if the whole mmap is populated.

If we compare fault latencies then we get the following:
- the worst latency [ms]:
  out.256m.new-thponly 1991
  out.256m.old-thponly 1838
  out.256m.new 6197
  out.256m.old 5538
- top 5 worst latencies (sum in [ms]):
  out.256m.new-thponly 5694
  out.256m.old-thponly 3168
  out.256m.new 9498
  out.256m.old 8291
- top 10:
  out.256m.new-thponly 7139
  out.256m.old-thponly 3193
  out.256m.new 11786
  out.256m.old 9347
- top 100:
  out.256m.new-thponly 13035
  out.256m.old-thponly 3434
  out.256m.new 14634
  out.256m.old 12881

I think this shows that my concern about excessive reclaim and stalls is real, and that it is worse when the memory is used sparsely.
It is true it might help when the whole THP section is used, so the additional cost is amortized, but the more sparsely each THP section is used, the higher the overhead you are adding without userspace actually asking for it.

-- Michal Hocko SUSE Labs
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests 2014-08-13 14:59 ` Michal Hocko @ 2014-08-13 20:41 ` Johannes Weiner 2014-08-14 16:12 ` Michal Hocko 0 siblings, 1 reply; 23+ messages in thread From: Johannes Weiner @ 2014-08-13 20:41 UTC (permalink / raw) To: Michal Hocko; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel On Wed, Aug 13, 2014 at 04:59:04PM +0200, Michal Hocko wrote: > On Fri 08-08-14 09:26:35, Johannes Weiner wrote: > > On Fri, Aug 08, 2014 at 02:32:58PM +0200, Michal Hocko wrote: > > > On Thu 07-08-14 11:31:41, Johannes Weiner wrote: > [...] > > > > THP latencies are actually the same when comparing high limit nr_pages > > > > reclaim with the current hard limit SWAP_CLUSTER_MAX reclaim, > > > > > > Are you sure about this? I fail to see how they can be same as THP > > > allocations/charges are __GFP_NORETRY so there is only one reclaim > > > round for the hard limit reclaim followed by the charge failure if > > > it is not successful. > > > > I use this test program that faults in anon pages, reports average and > > max for every 512-page chunk (THP size), then reports the aggregate at > > the end: > > > > memory.max: > > > > avg=18729us max=450625us > > > > real 0m14.335s > > user 0m0.157s > > sys 0m6.307s > > > > memory.high: > > > > avg=18676us max=457499us > > > > real 0m14.375s > > user 0m0.046s > > sys 0m4.294s > > I was playing with something like that as well. mmap 800MB anon mapping > in 256MB memcg (kvm guest had 1G RAM and 2G swap so the global reclaim > doesn't trigger and the host 2G free memory), start faulting in from > THP aligned address and measured each fault. Then I was recording > mm_vmscan_lru_shrink_inactive and mm_vmscan_memcg_reclaim_{begin,end} > tracepoints to see how the reclaim went. > > I was testing two setups > 1) fault in every 4k page > 2) fault in only 2M aligned addresses. 
> > The first simulates the case where successful THP allocation saves > follow up 511 fallback charges and so the excessive reclaim might > pay off. > The second one simulates potential time wasting when memory is used > extremely sparsely and any latencies would be unwelcome. > > (new refers to nr_reclaim target, old to SWAP_CLUSTER_MAX, thponly > faults only 2M aligned addresses, 4k pages are faulted otherwise) > > vmstat says: > out.256m.new-thponly.vmstat.after:pswpin 44 > out.256m.new-thponly.vmstat.after:pswpout 154681 > out.256m.new-thponly.vmstat.after:thp_fault_alloc 399 > out.256m.new-thponly.vmstat.after:thp_fault_fallback 0 > out.256m.new-thponly.vmstat.after:thp_split 302 > > out.256m.old-thponly.vmstat.after:pswpin 28 > out.256m.old-thponly.vmstat.after:pswpout 31271 > out.256m.old-thponly.vmstat.after:thp_fault_alloc 149 > out.256m.old-thponly.vmstat.after:thp_fault_fallback 250 > out.256m.old-thponly.vmstat.after:thp_split 61 > > out.256m.new.vmstat.after:pswpin 48 > out.256m.new.vmstat.after:pswpout 169530 > out.256m.new.vmstat.after:thp_fault_alloc 399 > out.256m.new.vmstat.after:thp_fault_fallback 0 > out.256m.new.vmstat.after:thp_split 331 > > out.256m.old.vmstat.after:pswpin 47 > out.256m.old.vmstat.after:pswpout 156514 > out.256m.old.vmstat.after:thp_fault_alloc 127 > out.256m.old.vmstat.after:thp_fault_fallback 272 > out.256m.old.vmstat.after:thp_split 127 > > As expected new managed to fault in all requests as THP without a single > fallback allocation while with the old reclaim we got to the limit and > then most of the THP charges failed and fallen back to single page > charge. In a more realistic workload, global reclaim and compaction have to invest a decent amount of work to create the 2MB page in the first place. You would be wasting this work on the off-chance that only a small part of the THP is actually used. 
Once that 2MB page is already assembled, I don't think the burden on memcg is high enough to justify wasting that work speculatively. If we really have a latency issue here, I think the right solution is to attempt the charge first - because it's much less work - and only if it succeeds allocate and commit an actual 2MB physical page. But I have yet to be convinced that there is a practical issue here. Who uses only 4k out of every 2MB area and enables THP? The 'thponly' scenario is absurd. > Note the increased swapout activity for new. It is almost 5x more for > thponly and +8% with per-page faults. This looks like a fallout from the > over-reclaim in smaller priorities. > > - trace.new > 229 0 > 3 12 > - trace.old > 294 0 > 2 1 > 25 10 > 1 11 > 3 12 > 8 2 > 8 3 > 20 4 > 33 5 > 21 6 > 43 7 > 1286 8 > 1279 9 > > And here as well, we have to reclaim much more because we do much more > charges so the load benefits a bit from the high reclaim target. > > mm_vmscan_memcg_reclaim_end tracepoint tells us also how many pages were > reclaimed during each run and the cummulative numbers are: > - trace.new-thponly: 139029 > - trace.old-thponly: 11344 > - trace.new: 139687 > - trace.old: 139887 Here the number of reclaimed pages is actually lower in new, so I'm guessing the increase in swapouts above is variation between runs, as there doesn't seem to be a significant amount of cache in that group. > time -v says: > out.256m.new-thponly.time: System time (seconds): 1.50 > out.256m.new-thponly.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.56 > out.256m.old-thponly.time: System time (seconds): 0.45 > out.256m.old-thponly.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.76 > > out.256m.new.time: System time (seconds): 1.45 > out.256m.new.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.12 > out.256m.old.time: System time (seconds): 2.08 > out.256m.old.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.26 > > I guess this is expected as well. 
Sparse access doesn't amortize the > costly reclaim for each charged THP. On the other hand it can help a bit > if the whole mmap is populated. > > If we compare fault latencies then we get the following: > - the worst latency [ms]: > out.256m.new-thponly 1991 > out.256m.old-thponly 1838 > out.256m.new 6197 > out.256m.old 5538 > > - top 5 worst latencies (sum in [ms]): > out.256m.new-thponly 5694 > out.256m.old-thponly 3168 > out.256m.new 9498 > out.256m.old 8291 > > - top 10 > out.256m.new-thponly 7139 > out.256m.old-thponly 3193 > out.256m.new 11786 > out.256m.old 9347 > > - top 100 > out.256m.new-thponly 13035 > out.256m.old-thponly 3434 > out.256m.new 14634 > out.256m.old 12881 > > I think this shows up that my concern about excessive reclaim and stalls > is real and it is worse when the memory is used sparsely. It is true it > might help when the whole THP section is used and so the additional cost > is amortized but the more sparsely each THP section is used the higher > overhead you are adding without userspace actually asking for it. THP is expected to have some overhead in terms of initial fault cost and space efficiency, don't use it when you get little to no benefit from it. It can be argued that my patch moves that breakeven point a little bit, but the THP-positive end of the spectrum is much better off: THP coverage goes from 37% to 100%, while reclaim efficiency is significantly improved and system time significantly reduced. You demonstrated a THP-workload that really benefits from my change, and another workload that shouldn't be using THP in the first place. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests 2014-08-13 20:41 ` Johannes Weiner @ 2014-08-14 16:12 ` Michal Hocko 0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2014-08-14 16:12 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Wed 13-08-14 16:41:34, Johannes Weiner wrote:
> On Wed, Aug 13, 2014 at 04:59:04PM +0200, Michal Hocko wrote:
[...]
> > I think this shows that my concern about excessive reclaim and stalls
> > is real and it is worse when the memory is used sparsely. It is true it
> > might help when the whole THP section is used and so the additional cost
> > is amortized but the more sparsely each THP section is used the higher
> > overhead you are adding without userspace actually asking for it.
>
> THP is expected to have some overhead in terms of initial fault cost

Yes, but that overhead should be as small as possible. Direct reclaim with such a big target will lead to all kinds of problems.

> and space efficiency, don't use it when you get little to no benefit
> from it.

Do you really expect that all such users will use MADV_NOHUGEPAGE just to prevent reclaim stalls? This sounds unrealistic to me. Instead we will end up with THP disabled globally, the same way we saw it when THP was introduced and caused all kinds of reclaim issues.

> It can be argued that my patch moves that breakeven point a
> little bit, but the THP-positive end of the spectrum is much better
> off: THP coverage goes from 37% to 100%, while reclaim efficiency is
> significantly improved and system time significantly reduced.

I didn't see significantly improved reclaim efficiency; the only difference was that reclaim happened fewer times. The system time is reduced, but the elapsed time is less than 1% better in the per-page walk and more than 3 times worse for the other extreme.
> You demonstrated a THP-workload that really benefits from my change,
> and another workload that shouldn't be using THP in the first place.

I do not think that the presented test case is appropriate for any reclaim decision evaluation. A linear used-once walker usually benefits from excessive reclaim in general. The only point I wanted to raise is that the numbers look much worse when the memory is used sparsely, and thponly is the obvious worst case.

So if you want to increase the THP charge success rate, then back it with real numbers from real loads and show that the potential regressions are unlikely and outweighed by the overall improvements. Until then, NACK to this patch from me. The change is too risky.

Besides that, I believe you do not need this change for the high limit, because it can fail the charge for excessive THP charges just as the hard limit does. So you do not have the high-limit escape problem. As mentioned in the other email, reclaiming the whole high-limit excess as a target is even more risky because a heavy parallel load on many CPUs can cause a large excess and direct reclaim of much more than 512 pages.

-- Michal Hocko SUSE Labs
* [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy 2014-08-04 21:14 [patch 0/4] mm: memcontrol: populate unified hierarchy interface Johannes Weiner 2014-08-04 21:14 ` [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests Johannes Weiner @ 2014-08-04 21:14 ` Johannes Weiner 2014-08-07 13:36 ` Michal Hocko 2014-08-04 21:14 ` [patch 3/4] mm: memcontrol: add memory.max " Johannes Weiner ` (2 subsequent siblings) 4 siblings, 1 reply; 23+ messages in thread From: Johannes Weiner @ 2014-08-04 21:14 UTC (permalink / raw) To: Andrew Morton; +Cc: Michal Hocko, Tejun Heo, linux-mm, cgroups, linux-kernel Provide the most fundamental interface necessary for memory cgroups to be useful in the default hierarchy: report the current usage and allow setting an upper limit on it. The upper limit, set in memory.high, is not a strict OOM limit and can be breached under pressure. But once it's breached, allocators are throttled and forced into reclaim to clean up the excess. This has many advantages over more traditional hard upper limits and is thus more suitable as the default upper boundary: First, the limit is artificial and not due to shortness of actual memory, so invoking the OOM killer to enforce it seems excessive for the majority of usecases. It's much preferable to breach the limit temporarily and throttle the allocators, which in turn gives managing software a chance to detect pressure in the group and intervene by re-evaluating the limit, migrating the job to another machine, hot-plugging memory etc., without fearing interfering OOM kills. A secondary concern is allocation fairness: requiring the limit to always be met allows the reclaim efforts of one allocator to be stolen by concurrent allocations. Most usecases would prefer temporarily exceeding the default upper limit by a few pages over starving random allocators indefinitely. 
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- Documentation/cgroups/unified-hierarchy.txt | 6 ++ include/linux/res_counter.h | 29 ++++++++++ kernel/res_counter.c | 3 + mm/memcontrol.c | 87 ++++++++++++++++++++++++++--- 4 files changed, 116 insertions(+), 9 deletions(-) diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt index 4f4563277864..fd4f7f6847f6 100644 --- a/Documentation/cgroups/unified-hierarchy.txt +++ b/Documentation/cgroups/unified-hierarchy.txt @@ -327,6 +327,12 @@ supported and the interface files "release_agent" and - use_hierarchy is on by default and the cgroup file for the flag is not created. +- memory.limit_in_bytes is removed as the primary upper boundary and + replaced with memory.high, a soft upper limit that will put memory + pressure on the group but can be breached in favor of OOM killing. + +- memory.usage_in_bytes is renamed to memory.current to be in line + with the new naming scheme 5. Planned Changes diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h index 56b7bc32db4f..27394cfdf1fe 100644 --- a/include/linux/res_counter.h +++ b/include/linux/res_counter.h @@ -32,6 +32,10 @@ struct res_counter { */ unsigned long long max_usage; /* + * the high limit that creates pressure but can be exceeded + */ + unsigned long long high; + /* * the limit that usage cannot exceed */ unsigned long long limit; @@ -85,6 +89,7 @@ int res_counter_memparse_write_strategy(const char *buf, enum { RES_USAGE, RES_MAX_USAGE, + RES_HIGH, RES_LIMIT, RES_FAILCNT, RES_SOFT_LIMIT, @@ -132,6 +137,19 @@ u64 res_counter_uncharge(struct res_counter *counter, unsigned long val); u64 res_counter_uncharge_until(struct res_counter *counter, struct res_counter *top, unsigned long val); + +static inline unsigned long long res_counter_high(struct res_counter *cnt) +{ + unsigned long long high = 0; + unsigned long flags; + + spin_lock_irqsave(&cnt->lock, flags); + if (cnt->usage > cnt->high) + high 
= cnt->usage - cnt->high; + spin_unlock_irqrestore(&cnt->lock, flags); + return high; +} + /** * res_counter_margin - calculate chargeable space of a counter * @cnt: the counter @@ -193,6 +211,17 @@ static inline void res_counter_reset_failcnt(struct res_counter *cnt) spin_unlock_irqrestore(&cnt->lock, flags); } +static inline int res_counter_set_high(struct res_counter *cnt, + unsigned long long high) +{ + unsigned long flags; + + spin_lock_irqsave(&cnt->lock, flags); + cnt->high = high; + spin_unlock_irqrestore(&cnt->lock, flags); + return 0; +} + static inline int res_counter_set_limit(struct res_counter *cnt, unsigned long long limit) { diff --git a/kernel/res_counter.c b/kernel/res_counter.c index e791130f85a7..26a08be49a3d 100644 --- a/kernel/res_counter.c +++ b/kernel/res_counter.c @@ -17,6 +17,7 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent) { spin_lock_init(&counter->lock); + counter->high = RES_COUNTER_MAX; counter->limit = RES_COUNTER_MAX; counter->soft_limit = RES_COUNTER_MAX; counter->parent = parent; @@ -130,6 +131,8 @@ res_counter_member(struct res_counter *counter, int member) return &counter->usage; case RES_MAX_USAGE: return &counter->max_usage; + case RES_HIGH: + return &counter->high; case RES_LIMIT: return &counter->limit; case RES_FAILCNT: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ddffeeda2d52..5a64fa96c08a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2530,8 +2530,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned int batch = max(CHARGE_BATCH, nr_pages); int nr_retries = MEM_CGROUP_RECLAIM_RETRIES; struct mem_cgroup *mem_over_limit; - struct res_counter *fail_res; unsigned long nr_reclaimed; + struct res_counter *res; unsigned long flags = 0; unsigned long long size; int ret = 0; @@ -2541,16 +2541,16 @@ retry: goto done; size = batch * PAGE_SIZE; - if (!res_counter_charge(&memcg->res, size, &fail_res)) { + if (!res_counter_charge(&memcg->res, size, &res)) { if 
(!do_swap_account) goto done_restock; - if (!res_counter_charge(&memcg->memsw, size, &fail_res)) + if (!res_counter_charge(&memcg->memsw, size, &res)) goto done_restock; res_counter_uncharge(&memcg->res, size); - mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw); + mem_over_limit = mem_cgroup_from_res_counter(res, memsw); flags |= MEM_CGROUP_RECLAIM_NOSWAP; } else - mem_over_limit = mem_cgroup_from_res_counter(fail_res, res); + mem_over_limit = mem_cgroup_from_res_counter(res, res); if (batch > nr_pages) { batch = nr_pages; @@ -2621,6 +2621,20 @@ bypass: done_restock: if (batch > nr_pages) refill_stock(memcg, batch - nr_pages); + + res = &memcg->res; + while (res) { + unsigned long long high = res_counter_high(res); + + if (high) { + unsigned long high_pages = high >> PAGE_SHIFT; + struct mem_cgroup *memcg; + + memcg = mem_cgroup_from_res_counter(res, res); + mem_cgroup_reclaim(memcg, high_pages, gfp_mask, 0); + } + res = res->parent; + } done: return ret; } @@ -5196,7 +5210,7 @@ out_kfree: return ret; } -static struct cftype mem_cgroup_files[] = { +static struct cftype mem_cgroup_legacy_files[] = { { .name = "usage_in_bytes", .private = MEMFILE_PRIVATE(_MEM, RES_USAGE), @@ -5305,7 +5319,7 @@ static struct cftype mem_cgroup_files[] = { }; #ifdef CONFIG_MEMCG_SWAP -static struct cftype memsw_cgroup_files[] = { +static struct cftype memsw_cgroup_legacy_files[] = { { .name = "memsw.usage_in_bytes", .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE), @@ -6250,6 +6264,60 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css) mem_cgroup_from_css(root_css)->use_hierarchy = true; } +static u64 memory_current_read(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + + return res_counter_read_u64(&memcg->res, RES_USAGE); +} + +static u64 memory_high_read(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + + return 
res_counter_read_u64(&memcg->res, RES_HIGH); +} + +static ssize_t memory_high_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + u64 high; + int ret; + + if (mem_cgroup_is_root(memcg)) + return -EINVAL; + + buf = strim(buf); + ret = res_counter_memparse_write_strategy(buf, &high); + if (ret) + return ret; + + ret = res_counter_set_high(&memcg->res, high); + if (ret) + return ret; + + high = res_counter_high(&memcg->res); + if (high) + mem_cgroup_reclaim(memcg, high >> PAGE_SHIFT, GFP_KERNEL, 0); + + return nbytes; +} + +static struct cftype memory_files[] = { + { + .name = "current", + .read_u64 = memory_current_read, + }, + { + .name = "high", + .read_u64 = memory_high_read, + .write = memory_high_write, + }, +}; + struct cgroup_subsys memory_cgrp_subsys = { .css_alloc = mem_cgroup_css_alloc, .css_online = mem_cgroup_css_online, @@ -6260,7 +6328,8 @@ struct cgroup_subsys memory_cgrp_subsys = { .cancel_attach = mem_cgroup_cancel_attach, .attach = mem_cgroup_move_task, .bind = mem_cgroup_bind, - .legacy_cftypes = mem_cgroup_files, + .dfl_cftypes = memory_files, + .legacy_cftypes = mem_cgroup_legacy_files, .early_init = 0, }; @@ -6278,7 +6347,7 @@ __setup("swapaccount=", enable_swap_account); static void __init memsw_file_init(void) { WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys, - memsw_cgroup_files)); + memsw_cgroup_legacy_files)); } static void __init enable_swap_cgroup(void) -- 2.0.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy 2014-08-04 21:14 ` [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy Johannes Weiner @ 2014-08-07 13:36 ` Michal Hocko 2014-08-07 13:39 ` Michal Hocko 2014-08-07 15:47 ` Johannes Weiner 0 siblings, 2 replies; 23+ messages in thread From: Michal Hocko @ 2014-08-07 13:36 UTC (permalink / raw) To: Johannes Weiner; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel On Mon 04-08-14 17:14:55, Johannes Weiner wrote: [...] > @@ -132,6 +137,19 @@ u64 res_counter_uncharge(struct res_counter *counter, unsigned long val); > u64 res_counter_uncharge_until(struct res_counter *counter, > struct res_counter *top, > unsigned long val); > + > +static inline unsigned long long res_counter_high(struct res_counter *cnt) soft limit used res_counter_soft_limit_excess which has quite a long name but at least those two should be consistent. I will post two helper patches which I have used to make this and other operations on res counter easier as a reply to this. > +{ > + unsigned long long high = 0; > + unsigned long flags; > + > + spin_lock_irqsave(&cnt->lock, flags); > + if (cnt->usage > cnt->high) > + high = cnt->usage - cnt->high; > + spin_unlock_irqrestore(&cnt->lock, flags); > + return high; > +} > + > /** > * res_counter_margin - calculate chargeable space of a counter > * @cnt: the counter > @@ -193,6 +211,17 @@ static inline void res_counter_reset_failcnt(struct res_counter *cnt) > spin_unlock_irqrestore(&cnt->lock, flags); > } > > +static inline int res_counter_set_high(struct res_counter *cnt, > + unsigned long long high) > +{ > + unsigned long flags; > + > + spin_lock_irqsave(&cnt->lock, flags); > + cnt->high = high; > + spin_unlock_irqrestore(&cnt->lock, flags); > + return 0; > +} > + [...] 
> @@ -2541,16 +2541,16 @@ retry: > goto done; > > size = batch * PAGE_SIZE; > - if (!res_counter_charge(&memcg->res, size, &fail_res)) { > + if (!res_counter_charge(&memcg->res, size, &res)) { > if (!do_swap_account) > goto done_restock; > - if (!res_counter_charge(&memcg->memsw, size, &fail_res)) > + if (!res_counter_charge(&memcg->memsw, size, &res)) > goto done_restock; > res_counter_uncharge(&memcg->res, size); > - mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw); > + mem_over_limit = mem_cgroup_from_res_counter(res, memsw); > flags |= MEM_CGROUP_RECLAIM_NOSWAP; > } else > - mem_over_limit = mem_cgroup_from_res_counter(fail_res, res); > + mem_over_limit = mem_cgroup_from_res_counter(res, res); > > if (batch > nr_pages) { > batch = nr_pages; > @@ -2621,6 +2621,20 @@ bypass: > done_restock: > if (batch > nr_pages) > refill_stock(memcg, batch - nr_pages); > + > + res = &memcg->res; > + while (res) { > + unsigned long long high = res_counter_high(res); > + > + if (high) { > + unsigned long high_pages = high >> PAGE_SHIFT; > + struct mem_cgroup *memcg; > + > + memcg = mem_cgroup_from_res_counter(res, res); > + mem_cgroup_reclaim(memcg, high_pages, gfp_mask, 0); > + } > + res = res->parent; > + } > done: > return ret; > } Why haven't you followed what we do for hard limit here? 
In my implementation I have the following: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a37465fcd8ae..6a797c740ea5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2529,6 +2529,21 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb, return NOTIFY_OK; } +static bool high_limit_excess(struct mem_cgroup *memcg, + struct mem_cgroup **memcg_over_limit) +{ + struct mem_cgroup *parent = memcg; + + do { + if (res_counter_limit_excess(&parent->res, RES_HIGH_LIMIT)) { + *memcg_over_limit = parent; + return true; + } + } while ((parent = parent_mem_cgroup(parent))); + + return false; +} + static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned int nr_pages) { @@ -2623,6 +2638,10 @@ bypass: goto retry; done_restock: + /* Throttle charger a bit if it is above high limit. */ + if (high_limit_excess(memcg, &mem_over_limit)) + mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags); + if (batch > nr_pages) refill_stock(memcg, batch - nr_pages); done: -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy
  2014-08-07 13:36           ` Michal Hocko
@ 2014-08-07 13:39             ` Michal Hocko
  2014-08-07 15:47             ` Johannes Weiner
  1 sibling, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2014-08-07 13:39 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Thu 07-08-14 15:36:14, Michal Hocko wrote:
> On Mon 04-08-14 17:14:55, Johannes Weiner wrote:
> [...]
> > @@ -132,6 +137,19 @@ u64 res_counter_uncharge(struct res_counter *counter, unsigned long val);
> >  u64 res_counter_uncharge_until(struct res_counter *counter,
> >  			       struct res_counter *top,
> >  			       unsigned long val);
> > +
> > +static inline unsigned long long res_counter_high(struct res_counter *cnt)
> 
> soft limit used res_counter_soft_limit_excess which has quite a long
> name but at least those two should be consistent.
> I will post two helper patches which I have used to make this and other
> operations on res counter easier as a reply to this.

These two have been sleeping in my queue for quite some time. I didn't
get to post them yet, but if you think they make sense I can try to
rebase them on the current tree and post.
---
From 3f3185306b225931a45387f288645ba9044565d0 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Thu, 19 Jun 2014 19:14:31 +0200
Subject: [PATCH 1/2] res_counter: provide res_counter_write_u64

Allow setting a res_counter member to a given value generically. This
will reduce code duplication for the new limits added by this patch
series.

Use the new helper to replace the one-off res_counter_set_soft_limit.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/res_counter.h | 14 ++------------
 kernel/res_counter.c        | 14 ++++++++++++++
 mm/memcontrol.c             |  2 +-
 3 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 56b7bc32db4f..bea7f9f45f7a 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -71,6 +71,8 @@ struct res_counter {
 
 u64 res_counter_read_u64(struct res_counter *counter, int member);
 
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
+
 ssize_t res_counter_read(struct res_counter *counter, int member,
 		const char __user *buf, size_t nbytes, loff_t *pos,
 		int (*read_strategy)(unsigned long long val, char *s));
@@ -208,16 +210,4 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
 	return ret;
 }
 
-static inline int
-res_counter_set_soft_limit(struct res_counter *cnt,
-			   unsigned long long soft_limit)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&cnt->lock, flags);
-	cnt->soft_limit = soft_limit;
-	spin_unlock_irqrestore(&cnt->lock, flags);
-	return 0;
-}
-
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index e791130f85a7..4789c2323a94 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -171,11 +171,25 @@ u64 res_counter_read_u64(struct res_counter *counter, int member)
 
 	return ret;
 }
+
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&counter->lock, flags);
+	*res_counter_member(counter, member) = val;
+	spin_unlock_irqrestore(&counter->lock, flags);
+}
 #else
 u64 res_counter_read_u64(struct res_counter *counter, int member)
 {
 	return *res_counter_member(counter, member);
 }
+
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
+{
+	*res_counter_member(counter, member) = val;
+}
 #endif
 
 int res_counter_memparse_write_strategy(const char *buf,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d1b311687769..1ad5d4a2bc4e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4378,7 +4378,7 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
 		 * control without swap
 		 */
 		if (type == _MEM)
-			ret = res_counter_set_soft_limit(&memcg->res, val);
+			res_counter_write_u64(&memcg->res, name, val);
 		else
 			ret = -EINVAL;
 		break;
-- 
2.1.0.rc1

---
From 8c79f2c209f806b97ec368c3e649ef58caeb7e99 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Thu, 19 Jun 2014 19:42:25 +0200
Subject: [PATCH 2/2] memcg, res_counter: replace res_counter_soft_limit_excess
 by a more generic helper

Later patches in the series will add new limits which we will want to
check for excess as well, so change the current one-off
res_counter_soft_limit_excess to a more generic helper.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/res_counter.h | 21 +++++++++++++++------
 mm/memcontrol.c             | 10 +++++-----
 2 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index bea7f9f45f7a..9015013784fa 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -156,23 +156,32 @@ static inline unsigned long long res_counter_margin(struct res_counter *cnt)
 }
 
 /**
- * Get the difference between the usage and the soft limit
+ * Get the difference between the usage and the limit defined
+ * by the given member
  * @cnt: The counter
  *
- * Returns 0 if usage is less than or equal to soft limit
- * The difference between usage and soft limit, otherwise.
+ * Returns 0 if usage is less than or equal to the limit defined
+ * by member or the difference otherwise.
 */
 static inline unsigned long long
-res_counter_soft_limit_excess(struct res_counter *cnt)
+res_counter_limit_excess(struct res_counter *cnt, int member)
 {
 	unsigned long long excess;
+	unsigned long long limit;
 	unsigned long flags;
 
 	spin_lock_irqsave(&cnt->lock, flags);
-	if (cnt->usage <= cnt->soft_limit)
+	switch (member) {
+	case RES_SOFT_LIMIT:
+		limit = cnt->soft_limit;
+		break;
+	default:
+		BUG();
+	}
+	if (cnt->usage <= limit)
 		excess = 0;
 	else
-		excess = cnt->usage - cnt->soft_limit;
+		excess = cnt->usage - limit;
 	spin_unlock_irqrestore(&cnt->lock, flags);
 	return excess;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ad5d4a2bc4e..75b5db78e9be 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -773,7 +773,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 	 */
 	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
 		mz = mem_cgroup_page_zoneinfo(memcg, page);
-		excess = res_counter_soft_limit_excess(&memcg->res);
+		excess = res_counter_limit_excess(&memcg->res, RES_SOFT_LIMIT);
 		/*
 		 * We have to update the tree if mz is on RB-tree or
 		 * mem is over its softlimit.
@@ -827,7 +827,7 @@ retry:
 	 * position in the tree.
 	 */
 	__mem_cgroup_remove_exceeded(mz, mctz);
-	if (!res_counter_soft_limit_excess(&mz->memcg->res) ||
+	if (!res_counter_limit_excess(&mz->memcg->res, RES_SOFT_LIMIT) ||
 	    !css_tryget_online(&mz->memcg->css))
 		goto retry;
 done:
@@ -1983,7 +1983,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 		.priority = 0,
 	};
 
-	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
+	excess = res_counter_limit_excess(&root_memcg->res, RES_SOFT_LIMIT) >> PAGE_SHIFT;
 
 	while (1) {
 		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
@@ -2014,7 +2014,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 		total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
 						     zone, &nr_scanned);
 		*total_scanned += nr_scanned;
-		if (!res_counter_soft_limit_excess(&root_memcg->res))
+		if (!res_counter_limit_excess(&root_memcg->res, RES_SOFT_LIMIT))
 			break;
 	}
 	mem_cgroup_iter_break(root_memcg, victim);
@@ -3941,7 +3941,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 			} while (1);
 		}
 		__mem_cgroup_remove_exceeded(mz, mctz);
-		excess = res_counter_soft_limit_excess(&mz->memcg->res);
+		excess = res_counter_limit_excess(&mz->memcg->res, RES_SOFT_LIMIT);
 		/*
 		 * One school of thought says that we should not add
 		 * back the node to the tree if reclaim returns 0.
-- 
2.1.0.rc1

-- 
Michal Hocko
SUSE Labs
* Re: [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy
  2014-08-07 13:36           ` Michal Hocko
  2014-08-07 13:39             ` Michal Hocko
@ 2014-08-07 15:47             ` Johannes Weiner
  1 sibling, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2014-08-07 15:47 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Thu, Aug 07, 2014 at 03:36:14PM +0200, Michal Hocko wrote:
> On Mon 04-08-14 17:14:55, Johannes Weiner wrote:
> [...]
> > @@ -132,6 +137,19 @@ u64 res_counter_uncharge(struct res_counter *counter, unsigned long val);
> >  u64 res_counter_uncharge_until(struct res_counter *counter,
> >  			       struct res_counter *top,
> >  			       unsigned long val);
> > +
> > +static inline unsigned long long res_counter_high(struct res_counter *cnt)
> 
> soft limit used res_counter_soft_limit_excess which has quite a long
> name but at least those two should be consistent.

That name is horrible and a result of "soft_limit" being completely
nondescriptive. I really see no point in trying to be consistent with
this stuff that we are trying hard to delete.

> > @@ -2621,6 +2621,20 @@ bypass:
> >  done_restock:
> >  	if (batch > nr_pages)
> >  		refill_stock(memcg, batch - nr_pages);
> > +
> > +	res = &memcg->res;
> > +	while (res) {
> > +		unsigned long long high = res_counter_high(res);
> > +
> > +		if (high) {
> > +			unsigned long high_pages = high >> PAGE_SHIFT;
> > +			struct mem_cgroup *memcg;
> > +
> > +			memcg = mem_cgroup_from_res_counter(res, res);
> > +			mem_cgroup_reclaim(memcg, high_pages, gfp_mask, 0);
> > +		}
> > +		res = res->parent;
> > +	}
> >  done:
> >  	return ret;
> >  }
> 
> Why haven't you followed what we do for hard limit here?

I did.

> In my implementation I have the following:
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a37465fcd8ae..6a797c740ea5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2529,6 +2529,21 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
>  	return NOTIFY_OK;
>  }
>  
> +static bool high_limit_excess(struct mem_cgroup *memcg,
> +			      struct mem_cgroup **memcg_over_limit)
> +{
> +	struct mem_cgroup *parent = memcg;
> +
> +	do {
> +		if (res_counter_limit_excess(&parent->res, RES_HIGH_LIMIT)) {
> +			*memcg_over_limit = parent;
> +			return true;
> +		}
> +	} while ((parent = parent_mem_cgroup(parent)));
> +
> +	return false;
> +}
> +
>  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		      unsigned int nr_pages)
>  {
> @@ -2623,6 +2638,10 @@ bypass:
>  	goto retry;
>  
>  done_restock:
> +	/* Throttle charger a bit if it is above high limit. */
> +	if (high_limit_excess(memcg, &mem_over_limit))
> +		mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);

This is not what the hard limit does. The hard limit, by its nature,
can only be exceeded at one level at a time, so we try to charge, check
the closest limit that was hit, reclaim, then retry. This means we are
reclaiming up the hierarchy to enforce the hard limit on each level.

I do the same here: reclaim up the hierarchy to enforce the high limit
on each level. Your proposal only reclaims the closest offender,
leaving higher hierarchy levels in excess.
* [patch 3/4] mm: memcontrol: add memory.max to default hierarchy
  2014-08-04 21:14 [patch 0/4] mm: memcontrol: populate unified hierarchy interface Johannes Weiner
  2014-08-04 21:14 ` [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests Johannes Weiner
  2014-08-04 21:14 ` [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy Johannes Weiner
@ 2014-08-04 21:14 ` Johannes Weiner
  2014-08-04 21:14 ` [patch 4/4] mm: memcontrol: add memory.vmstat " Johannes Weiner
  [not found] ` <1407186897-21048-1-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  4 siblings, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2014-08-04 21:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, Tejun Heo, linux-mm, cgroups, linux-kernel

There are cases where a strict upper limit on a memcg is required, for
example, when containers are rented out and interference between them
cannot be tolerated.

Provide memory.max, a limit that cannot be breached and will trigger
group-internal OOM killing once page reclaim can no longer enforce it.

This can be combined with the high limit to create a window in which
allocating tasks are throttled, to approach the strict maximum limit
gracefully and with opportunity for the user or admin to intervene.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/cgroups/unified-hierarchy.txt |  4 ++++
 mm/memcontrol.c                             | 35 +++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
index fd4f7f6847f6..6c52c926810f 100644
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -334,6 +334,10 @@ supported and the interface files "release_agent" and
 - memory.usage_in_bytes is renamed to memory.current to be in line
   with the new naming scheme
 
+- memory.max provides a hard upper limit as a last-resort backup to
+  memory.high for situations with aggressive isolation requirements.
+
+
 5. Planned Changes
 
 5-1. CAP for resource control
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5a64fa96c08a..461834c86b94 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6306,6 +6306,36 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static u64 memory_max_read(struct cgroup_subsys_state *css,
+			   struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return res_counter_read_u64(&memcg->res, RES_LIMIT);
+}
+
+static ssize_t memory_max_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	u64 max;
+	int ret;
+
+	if (mem_cgroup_is_root(memcg))
+		return -EINVAL;
+
+	buf = strim(buf);
+	ret = res_counter_memparse_write_strategy(buf, &max);
+	if (ret)
+		return ret;
+
+	ret = mem_cgroup_resize_limit(memcg, max);
+	if (ret)
+		return ret;
+
+	return nbytes;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -6316,6 +6346,11 @@ static struct cftype memory_files[] = {
 		.read_u64 = memory_high_read,
 		.write = memory_high_write,
 	},
+	{
+		.name = "max",
+		.read_u64 = memory_max_read,
+		.write = memory_max_write,
+	},
 };
 
 struct cgroup_subsys memory_cgrp_subsys = {
-- 
2.0.3
* [patch 4/4] mm: memcontrol: add memory.vmstat to default hierarchy
  2014-08-04 21:14 [patch 0/4] mm: memcontrol: populate unified hierarchy interface Johannes Weiner
  ` (2 preceding siblings ...)
  2014-08-04 21:14 ` [patch 3/4] mm: memcontrol: add memory.max " Johannes Weiner
@ 2014-08-04 21:14 ` Johannes Weiner
  [not found] ` <1407186897-21048-1-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  4 siblings, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2014-08-04 21:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, Tejun Heo, linux-mm, cgroups, linux-kernel

Provide basic per-memcg vmstat-style statistics on LRU sizes,
allocated and freed pages, major and minor faults.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/cgroups/unified-hierarchy.txt |  8 ++++++
 mm/memcontrol.c                             | 40 +++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
index 6c52c926810f..180b260c510a 100644
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -337,6 +337,14 @@ supported and the interface files "release_agent" and
 - memory.max provides a hard upper limit as a last-resort backup to
   memory.high for situations with aggressive isolation requirements.
 
+- memory.stat has been replaced by memory.vmstat, which provides
+  page-based statistics in the style of /proc/vmstat.
+
+  As cgroups are now always hierarchical and no longer allow tasks in
+  intermediate levels, the local state is irrelevant and all
+  statistics represent the state of the entire hierarchy rooted at the
+  given group.
+
 5. Planned Changes
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 461834c86b94..6502e1cfc0fc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6336,6 +6336,42 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static u64 tree_events(struct mem_cgroup *memcg, int event)
+{
+	struct mem_cgroup *mi;
+	u64 val = 0;
+
+	for_each_mem_cgroup_tree(mi, memcg)
+		val += mem_cgroup_read_events(mi, event);
+	return val;
+}
+
+static int memory_vmstat_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	struct mem_cgroup *mi;
+	int i;
+
+	for (i = 0; i < NR_LRU_LISTS; i++) {
+		u64 val = 0;
+
+		for_each_mem_cgroup_tree(mi, memcg)
+			val += mem_cgroup_nr_lru_pages(mi, BIT(i));
+		seq_printf(m, "%s %llu\n", vmstat_text[NR_LRU_BASE + i], val);
+	}
+
+	seq_printf(m, "pgalloc %llu\n",
+		   tree_events(memcg, MEM_CGROUP_EVENTS_PGPGIN));
+	seq_printf(m, "pgfree %llu\n",
+		   tree_events(memcg, MEM_CGROUP_EVENTS_PGPGOUT));
+	seq_printf(m, "pgfault %llu\n",
+		   tree_events(memcg, MEM_CGROUP_EVENTS_PGFAULT));
+	seq_printf(m, "pgmajfault %llu\n",
+		   tree_events(memcg, MEM_CGROUP_EVENTS_PGMAJFAULT));
+
+	return 0;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -6351,6 +6387,10 @@ static struct cftype memory_files[] = {
 		.read_u64 = memory_max_read,
 		.write = memory_max_write,
 	},
+	{
+		.name = "vmstat",
+		.seq_show = memory_vmstat_show,
+	},
 };
 
 struct cgroup_subsys memory_cgrp_subsys = {
-- 
2.0.3
* Re: [patch 0/4] mm: memcontrol: populate unified hierarchy interface
  [not found] ` <1407186897-21048-1-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2014-08-05 12:40   ` Michal Hocko
  2014-08-05 13:53     ` Johannes Weiner
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2014-08-05 12:40 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Mon 04-08-14 17:14:53, Johannes Weiner wrote:
> Hi,
> 
> the ongoing versioning of the cgroup user interface gives us a chance
> to clean up the memcg control interface and fix a lot of
> inconsistencies and ugliness that crept in over time.

The first patch doesn't fit into the series and should be posted
separately.

> This series adds a minimal set of control files to the new memcg
> interface to get basic memcg functionality going in unified hierarchy:

Hmm, I posted an RFC for new knobs quite some time ago and the
discussion died with some questions unanswered, and now you are coming
with a new one. I cannot say I would be happy about that.

One of the concerns was renaming knobs which represent the same
functionality as before. I have posted some concerns but haven't heard
back anything. This series doesn't give any rationale for renaming
either. It is true we have a v2 but that doesn't necessarily mean we
should turn everything upside down.

> - memory.current: a read-only file that shows current memory usage.

Even if we go with renaming existing knobs I really hate this name. The
old one was too long but this is not descriptive enough. The same
applies to max and high. I would expect at least "limit" in the name.

> - memory.high: a file that allows setting a high limit on the memory
>   usage. This is an elastic limit, which is enforced via direct
>   reclaim, so allocators are throttled once it's reached, but it can
>   be exceeded and does not trigger OOM kills. This should be a much
>   more suitable default upper boundary for the majority of use cases
>   that are better off with some elasticity than with sudden OOM kills.

I also thought you wanted to have all the new limits in a single
series. My series is sitting idle until we finally come to a conclusion
on the first set of exposed knobs, so I do not understand why you are
coming with it right now.

> - memory.max: a file that allows setting a maximum limit on memory
>   usage which is ultimately enforced by OOM killing tasks in the
>   group. This is for setups that want strict isolation at the cost of
>   task death above a certain point. However, even those can still
>   combine the max limit with the high limit to approach OOM situations
>   gracefully and with time to intervene.
> 
> - memory.vmstat: vmstat-style per-memcg statistics. Very minimal for
>   now (lru stats, allocations and frees, faults), but fixing
>   fundamental issues of the old memory.stat file, including gross
>   misnomers like pgpgin/pgpgout for pages charged/uncharged etc.

I am definitely for exposing LRU stats and have a half-baked patch
sitting and waiting for some polishing. So I agree with the vmstat
part. Putting it into the stat file is not the greatest match, so a
separate file is good here.

>  Documentation/cgroups/unified-hierarchy.txt |  18 +++
>  include/linux/res_counter.h                 |  29 +++++
>  include/linux/swap.h                        |   3 +-
>  kernel/res_counter.c                        |   3 +
>  mm/memcontrol.c                             | 177 +++++++++++++++++++++++++---
>  mm/vmscan.c                                 |   3 +-
>  6 files changed, 216 insertions(+), 17 deletions(-)

-- 
Michal Hocko
SUSE Labs
* Re: [patch 0/4] mm: memcontrol: populate unified hierarchy interface
  2014-08-05 12:40 ` [patch 0/4] mm: memcontrol: populate unified hierarchy interface Michal Hocko
@ 2014-08-05 13:53   ` Johannes Weiner
  [not found]     ` <20140805135325.GB14734-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Weiner @ 2014-08-05 13:53 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Tue, Aug 05, 2014 at 02:40:33PM +0200, Michal Hocko wrote:
> On Mon 04-08-14 17:14:53, Johannes Weiner wrote:
> > Hi,
> > 
> > the ongoing versioning of the cgroup user interface gives us a chance
> > to clean up the memcg control interface and fix a lot of
> > inconsistencies and ugliness that crept in over time.
> 
> The first patch doesn't fit into the series and should be posted
> separately.

It's a prerequisite for the high limit implementation.

> > This series adds a minimal set of control files to the new memcg
> > interface to get basic memcg functionality going in unified hierarchy:
> 
> Hmm, I posted an RFC for new knobs quite some time ago and the
> discussion died with some questions unanswered, and now you are coming
> with a new one. I cannot say I would be happy about that.

I remembered open questions mainly about other things like swappiness,
charge immigration, kmem limits. My bad, I should have checked. Here
are your concerns on these basic knobs from that email:

---

On Thu, Jul 17, 2014 at 03:45:09PM +0200, Michal Hocko wrote:
> On Wed 16-07-14 11:58:14, Johannes Weiner wrote:
> > How about "memory.current"?
> 
> I wanted old users to change the minimum possible when moving to unified
> hierarchy so I didn't touch the old names.
> Why should we make the end users' lives harder? If there is general
> agreement I have no problem with renaming, I just do not think it is
> really necessary, because there is no real reason why configurations
> which do not use any of the deprecated or unified-hierarchy-only
> features shouldn't run in both unified and legacy hierarchies without
> any changes.

There is the rub, though: you can't *not* use new interfaces. We are
getting rid of the hard limit as the default and we really want people
to rethink their configuration in the light of this. And even if you
would just use the hard limit as before, there is no way we can leave
the name 'memory.limit_in_bytes' when we have in fact 4 different
limits.

So I don't see any way how we can stay 100% backward compatible even
with the most basic memcg functionality of setting an upper limit.

And once you acknowledge that current users don't get around *some*
adjustments, we really owe it to new users to present a clean and
consistent interface.

> I do realize that this is a _new_ API so we can do such radical changes
> but I am also aware that some people have to maintain their stacks on
> top of different kernels and it really sucks to maintain two different
> configurations. In such a case it would be easier for those users to
> stay with the legacy mode which is a fair option but I would much rather
> see them move to the new API sooner rather than later.

There is no way you can use the exact same scripts/configurations for
the old and new API at the same time when the most basic way of using
cgroups and memcg changed in v2.

---

> One of the concerns was renaming knobs which represent the same
> functionality as before. I have posted some concerns but haven't heard
> back anything. This series doesn't give any rationale for renaming
> either. It is true we have a v2 but that doesn't necessarily mean we
> should turn everything upside down.

I'm certainly not going out of my way to turn things upside down, but
the old interface is outrageous. I'm sorry if you can't see that it
badly needs to be cleaned up and fixed. This is the time to do that.

> > - memory.current: a read-only file that shows current memory usage.
> 
> Even if we go with renaming existing knobs I really hate this name. The
> old one was too long but this is not descriptive enough. The same
> applies to max and high. I would expect at least "limit" in the name.

Memory cgroups are about accounting and limiting memory usage. That's
all they do. In that context, current, min, low, high, max seem
perfectly descriptive to me; adding usage and limit seems redundant.

We name syscalls creat() and open() and stat() because, while you have
to look at the manpage once, they are easy to remember, easy to type,
and they keep the code using them readable.

memory.usage_in_bytes was the opposite approach: it tried to describe
all there is to this knob in the name itself, assuming tab completion
would help you type that long name. But we are more and more moving
away from ad-hoc scripting of cgroups and I don't want to optimize for
that anymore at the cost of really unwieldy identifiers.

Like with all user interfaces, we should provide a short and catchy
name and then provide the details in the documentation.

> > - memory.high: a file that allows setting a high limit on the memory
> >   usage. This is an elastic limit, which is enforced via direct
> >   reclaim, so allocators are throttled once it's reached, but it can
> >   be exceeded and does not trigger OOM kills. This should be a much
> >   more suitable default upper boundary for the majority of use cases
> >   that are better off with some elasticity than with sudden OOM kills.
> 
> I also thought you wanted to have all the new limits in a single
> series. My series is sitting idle until we finally come to a conclusion
> on the first set of exposed knobs, so I do not understand why you are
> coming with it right now.

I still would like to, but I'm not sure we can get the guarantees
working in time as unified hierarchy leaves its experimental status.

And I'm fairly confident that we know how the upper limits should
behave and that we are no longer going to change that, and that we
have a decent understanding on how the guarantees are going to work.
* Re: [patch 0/4] mm: memcontrol: populate unified hierarchy interface
  [not found]   ` <20140805135325.GB14734-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2014-08-05 15:27     ` Michal Hocko
  2014-08-07 14:21       ` Johannes Weiner
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2014-08-05 15:27 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Tue 05-08-14 09:53:25, Johannes Weiner wrote:
> On Tue, Aug 05, 2014 at 02:40:33PM +0200, Michal Hocko wrote:
> > On Mon 04-08-14 17:14:53, Johannes Weiner wrote:
> > > Hi,
> > > 
> > > the ongoing versioning of the cgroup user interface gives us a chance
> > > to clean up the memcg control interface and fix a lot of
> > > inconsistencies and ugliness that crept in over time.
> > 
> > The first patch doesn't fit into the series and should be posted
> > separately.
> 
> It's a prerequisite for the high limit implementation.

I do not think it is strictly needed. I am not even sure whether the
patch is OK and have to think more about it. I think you can throttle
high limit breachers by SWAP_CLUSTER_MAX for now.

> > > This series adds a minimal set of control files to the new memcg
> > > interface to get basic memcg functionality going in unified hierarchy:
> > 
> > Hmm, I posted an RFC for new knobs quite some time ago and the
> > discussion died with some questions unanswered, and now you are coming
> > with a new one. I cannot say I would be happy about that.
> 
> I remembered open questions mainly about other things like swappiness,
> charge immigration, kmem limits. My bad, I should have checked. Here
> are your concerns on these basic knobs from that email:
> 
> ---
> 
> On Thu, Jul 17, 2014 at 03:45:09PM +0200, Michal Hocko wrote:
> > On Wed 16-07-14 11:58:14, Johannes Weiner wrote:
> > > How about "memory.current"?
> > 
> > I wanted old users to change the minimum possible when moving to unified
> > hierarchy so I didn't touch the old names.
> > Why should we make the end users' lives harder? If there is general
> > agreement I have no problem with renaming, I just do not think it is
> > really necessary, because there is no real reason why configurations
> > which do not use any of the deprecated or unified-hierarchy-only
> > features shouldn't run in both unified and legacy hierarchies without
> > any changes.
> 
> There is the rub, though: you can't *not* use new interfaces. We are
> getting rid of the hard limit as the default and we really want people
> to rethink their configuration in the light of this. And even if you
> would just use the hard limit as before, there is no way we can leave
> the name 'memory.limit_in_bytes' when we have in fact 4 different
> limits.

We could theoretically keep a single limit and turn other limits into
watermarks. I am _not_ suggesting that now, because I haven't thought
it through, but I just think we should discuss other possible ways
before we go on.

> So I don't see any way how we can stay 100% backward compatible even
> with the most basic memcg functionality of setting an upper limit.
> 
> And once you acknowledge that current users don't get around *some*
> adjustments, we really owe it to new users to present a clean and
> consistent interface.
> 
> > I do realize that this is a _new_ API so we can do such radical changes
> > but I am also aware that some people have to maintain their stacks on
> > top of different kernels and it really sucks to maintain two different
> > configurations. In such a case it would be easier for those users to
> > stay with the legacy mode which is a fair option but I would much rather
> > see them move to the new API sooner rather than later.
> 
> There is no way you can use the exact same scripts/configurations for
> the old and new API at the same time when the most basic way of using
> cgroups and memcg changed in v2.

OK, this is a fair point. Cgroups configuration is probably a bigger
problem. If a script relies on a tool/library to set up the hierarchy
then that tool/library can probably do the mapping from old names to
new as well; otherwise it would need to be rewritten, at least for the
cgroup part. I have no idea how tools (e.g. libcgroup, libvirt and
others) will adapt to the new API and support both APIs at the same
time, though.

> ---
> 
> > One of the concerns was renaming knobs which represent the same
> > functionality as before. I have posted some concerns but haven't heard
> > back anything. This series doesn't give any rationale for renaming
> > either. It is true we have a v2 but that doesn't necessarily mean we
> > should turn everything upside down.
> 
> I'm certainly not going out of my way to turn things upside down, but
> the old interface is outrageous. I'm sorry if you can't see that it
> badly needs to be cleaned up and fixed. This is the time to do that.

Of course I can see many problems. But please let's think twice, and
even more times, when doing radical changes. Many decisions sound
reasonable at the time but turn out badly much later.

> > > - memory.current: a read-only file that shows current memory usage.
> > 
> > Even if we go with renaming existing knobs I really hate this name. The
> > old one was too long but this is not descriptive enough. The same
> > applies to max and high. I would expect at least "limit" in the name.
> 
> Memory cgroups are about accounting and limiting memory usage. That's
> all they do. In that context, current, min, low, high, max seem
> perfectly descriptive to me; adding usage and limit seems redundant.

Getting naming right is always a pain and different people will always
have different views. For example, I really do not like memory.current
and would much prefer memory.usage. I am not a native speaker but
`usage' sounds much less ambiguous to me. Whether shorter (without the
_limit suffix) names for limits are better I don't know. They
certainly seem more descriptive with the suffix to me.

> We name syscalls creat() and open() and stat() because, while you have
> to look at the manpage once, they are easy to remember, easy to type,
> and they keep the code using them readable.
> 
> memory.usage_in_bytes was the opposite approach: it tried to describe
> all there is to this knob in the name itself, assuming tab completion
> would help you type that long name. But we are more and more moving
> away from ad-hoc scripting of cgroups and I don't want to optimize for
> that anymore at the cost of really unwieldy identifiers.

I agree with you. _in_bytes is definitely excessive. It can be nicely
demonstrated by the fact that different units are allowed when setting
the value.

> Like with all user interfaces, we should provide a short and catchy
> name and then provide the details in the documentation.
> 
> > > - memory.high: a file that allows setting a high limit on the memory
> > >   usage. This is an elastic limit, which is enforced via direct
> > >   reclaim, so allocators are throttled once it's reached, but it can
> > >   be exceeded and does not trigger OOM kills. This should be a much
> > >   more suitable default upper boundary for the majority of use cases
> > >   that are better off with some elasticity than with sudden OOM kills.
> > 
> > I also thought you wanted to have all the new limits in a single
> > series. My series is sitting idle until we finally come to a conclusion
> > on the first set of exposed knobs, so I do not understand why you are
> > coming with it right now.
> 
> I still would like to, but I'm not sure we can get the guarantees
> working in time as unified hierarchy leaves its experimental status.

That shouldn't happen sooner than in the next (maybe 2) devel cycle(s)
(http://marc.info/?l=linux-kernel&m=140716172618366).

> And I'm fairly confident that we know how the upper limits should
> behave and that we are no longer going to change that, and that we
> have a decent understanding on how the guarantees are going to work.

I think we should first settle the new knobs before we introduce more.
I understand you would like to have the high limit as the preferable
way from the very beginning, but I think that can wait while the new
API is still in devel mode.

-- 
Michal Hocko
SUSE Labs
* Re: [patch 0/4] mm: memcontrol: populate unified hierarchy interface
  2014-08-05 15:27 ` Michal Hocko
@ 2014-08-07 14:21 ` Johannes Weiner
  0 siblings, 0 replies; 23+ messages in thread

From: Johannes Weiner @ 2014-08-07 14:21 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Tue, Aug 05, 2014 at 05:27:40PM +0200, Michal Hocko wrote:
> On Tue 05-08-14 09:53:25, Johannes Weiner wrote:
> > On Tue, Aug 05, 2014 at 02:40:33PM +0200, Michal Hocko wrote:
> > > On Mon 04-08-14 17:14:53, Johannes Weiner wrote:
> > > > Hi,
> > > > 
> > > > the ongoing versioning of the cgroup user interface gives us a chance
> > > > to clean up the memcg control interface and fix a lot of
> > > > inconsistencies and ugliness that crept in over time.
> > > 
> > > The first patch doesn't fit into the series and should be posted
> > > separately.
> > 
> > It's a prerequisite for the high limit implementation.
> 
> I do not think it is strictly needed.  I am even not sure whether the
> patch is OK and have to think more about it.  I think you can throttle
> high limit breachers by SWAP_CLUSTER_MAX for now.

It really doesn't work once you have higher order pages.  THP-heavy
workloads overshoot the high limit by a lot if you reclaim 32 pages
for every 512 charged.  In my tests, with the change in question, even
heavily swapping THP loads consistently stay around the high limit,
whereas without it the memory consumption quickly overshoots.

> > > > This series adds a minimal set of control files to the new memcg
> > > > interface to get basic memcg functionality going in unified hierarchy:
> > > 
> > > Hmm, I have posted an RFC for new knobs quite some time ago and the
> > > discussion died with some questions unanswered, and now you are coming
> > > with a new one.  I cannot say I would be happy about that.
> > 
> > I remembered open questions mainly about other things like swappiness,
> > charge immigration, kmem limits.  My bad, I should have checked.  Here
> > are your concerns on these basic knobs from that email:
> > 
> > ---
> > 
> > On Thu, Jul 17, 2014 at 03:45:09PM +0200, Michal Hocko wrote:
> > > On Wed 16-07-14 11:58:14, Johannes Weiner wrote:
> > > > How about "memory.current"?
> > > 
> > > I wanted old users to change the minimum possible when moving to unified
> > > hierarchy so I didn't touch the old names.
> > > Why should we make the end users' life harder? If there is general
> > > agreement I have no problem with renaming, I just do not think it is
> > > really necessary because there is no real reason why configurations
> > > which do not use any of the deprecated or unified-hierarchy-only
> > > features shouldn't run in both unified and legacy hierarchies without
> > > any changes.
> > 
> > There is the rub, though: you can't *not* use new interfaces.  We are
> > getting rid of the hard limit as the default and we really want people
> > to rethink their configuration in the light of this.  And even if you
> > would just use the hard limit as before, there is no way we can leave
> > the name 'memory.limit_in_bytes' when we have in fact 4 different
> > limits.
> 
> We could theoretically keep a single limit and turn other limits into
> watermarks.  I am _not_ suggesting that now because I haven't thought
> that through but I just think we should discuss other possible ways
> before we go on.

I am definitely open to discuss your alternative suggestions, but for
that you have to actually propose them. :)

The reason why I want existing users to rethink the approach to memory
limiting is that the current hard limit completely fails to partition
and isolate in a practical manner, and we are changing the fundamental
approach here.  Pretending to provide backward compatibility through
the names of control knobs is specious, and will lead to more issues
than it actually solves.

> > > One of the concerns was renaming knobs which represent the same
> > > functionality as before.  I have posted some concerns but haven't heard
> > > back anything.  This series doesn't give any rationale for renaming
> > > either.
> > > It is true we have a v2 but that doesn't necessarily mean we should put
> > > everything upside down.
> > 
> > I'm certainly not going out of my way to turn things upside down, but
> > the old interface is outrageous.  I'm sorry if you can't see that it
> > badly needs to be cleaned up and fixed.  This is the time to do that.
> 
> Of course I can see many problems.  But please let's think twice and
> even more times when doing radical changes.  Many decisions sound
> reasonable at the time but then they turn out bad much later.

These are radical changes, and I'm sorry that my justifications were
very terse.  I've updated this patch to include the following in
Documentation/cgroups/unified-hierarchy.txt:

---

4-3-3. memory

Memory cgroups account and limit the memory consumption of cgroups,
but the current limit semantics make the feature hard to use and
create problems in existing configurations.

4.3.3.1 No more default hard limit

'memory.limit_in_bytes' is the current upper limit that can not be
exceeded under any circumstances.  If it can not be met by direct
reclaim, the tasks in the cgroup are OOM killed.

While this may look like a valid approach to partition the machine, in
practice workloads expand and contract during runtime, and it's
impossible to get the machine-wide configuration right: if users set
this hard limit conservatively, they are plagued by cgroup-internal
OOM kills while another group's memory might be idle.  If they set it
too generously, precious resources are wasted.  As a result, many
users overcommit such that the sum of all hard limits exceeds the
machine size, but this puts the actual burden of containment on global
reclaim and OOM handling.  This led to further extremes, such as the
demand for having global reclaim honor group-specific priorities and
minimums, and the ability to handle global OOM situations from
userspace using task-specific physical memory reserves.  All these
outcomes and developments show the utter failure of hard limits to
practically partition the machine for maximum resource utilization.

In unified hierarchy, the primary means of limiting memory consumption
is 'memory.high'.  It's enforced by direct reclaim but can be exceeded
under severe memory pressure.  Memory pressure created by this limit
still applies mainly to the group itself, but it prefers offloading
the excess to the rest of the system in order to avoid OOM killing.

Configurations can start out by setting this limit to a conservative
estimate of the average workload size and then make upward adjustments
based on monitoring high limit excess, workload performance, and the
global memory situation.

In untrusted environments, users may wish to limit the amount of such
offloading in order to contain malicious workloads.  For that purpose,
a hard upper limit can be set through 'memory.max'.

'memory.pressure_level' was added for userspace to monitor memory
pressure based on reclaim efficiency, but the window between initial
memory pressure and an OOM kill is very short with hard limits.  By
the time high pressure is reported to userspace it's often too late to
still intervene before the group goes OOM, thus severely limiting the
usefulness of this feature for anticipating looming OOM situations.

This new approach to limiting allows packing workloads more densely
based on their average workingset size.  Coinciding peaks of multiple
groups are handled by throttling allocations within the groups rather
than putting the full burden on global reclaim and OOM handling, and
pressure situations build up gracefully and allow better monitoring.

---

> > > > - memory.current: a read-only file that shows current memory usage.
> > > 
> > > Even if we go with renaming existing knobs I really hate this name.  The
> > > old one was too long but this is not descriptive enough.  Same applies to
> > > max and high.  I would expect at least limit in the name.
> > 
> > Memory cgroups are about accounting and limiting memory usage.  That's
> > all they do.  In that context, current, min, low, high, max seem
> > perfectly descriptive to me; adding usage and limit seems redundant.
> 
> Getting naming right is always a pain and different people will always
> have different views.  For example I really do not like memory.current
> and would prefer memory.usage much more.  I am not a native speaker but
> `usage' sounds much less ambiguous to me.  Whether shorter (without a
> _limit suffix) names for limits are better I don't know.  They
> certainly seem more descriptive with the suffix to me.

These knobs control the most fundamental behavior of memory cgroups,
which is accounting and then limiting memory consumption, so I think
we can agree at least that we want something short and poignant here
that stands out compared to secondary controls, feature toggles etc.

The reason I went with memory.current over memory.usage is that it's
more consistent with the limit names I chose, which are memory.high
and memory.max.  Going with memory.usage begs the question what
memory.max applies to, and now you need to add 'usage' or 'limit' to
high/max as well (and 'guarantee' to min/low), which moves us away
from short and poignant towards more specific niche control names.
memory.current, memory.high, memory.max all imply the same thing:
memory consumption - what memory cgroups are fundamentally about.

> > We name syscalls creat() and open() and stat() because, while you have
> > to look at the manpage once, they are easy to remember, easy to type,
> > and they keep the code using them readable.
> > 
> > memory.usage_in_bytes was the opposite approach: it tried to describe
> > all there is to this knob in the name itself, assuming tab completion
> > would help you type that long name.  But we are more and more moving
> > away from ad-hoc scripting of cgroups and I don't want to optimize for
> > that anymore at the cost of really unwieldy identifiers.
> 
> I agree with you.  _in_bytes is definitely excessive.  It can be nicely
> demonstrated by the fact that different units are allowed when setting
> the value.

It's maybe misleading for that reason, but that wasn't my main point.
There are certain things we can imply in the name and either explain
in the documentation or assume from the context, and _in_bytes is one
such piece of information.  It's something that you need to know once
and don't need to be reminded of every time you type that control
name.  Likewise, I'm extending this argument that we don't need to
include 'usage' or 'limit' in any of these basic knobs, because that's
*the* thing that memory cgroups do.  It's up to secondary controls to
pick names that do not create ambiguity with those core controls.
* [patch 0/4] mm: memcontrol: populate unified hierarchy interface v2
@ 2014-08-08 21:38 Johannes Weiner
  2014-08-08 21:38 ` [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy Johannes Weiner
  0 siblings, 1 reply; 23+ messages in thread

From: Johannes Weiner @ 2014-08-08 21:38 UTC (permalink / raw)
To: linux-mm
Cc: Michal Hocko, Greg Thelen, Vladimir Davydov, Tejun Heo, cgroups, linux-kernel

Hi,

memory cgroups are fundamentally broken when it comes to partitioning
the machine for many concurrent jobs.  In real life, workloads expand
and contract over time, and the hard limit is too static to reflect
this - it either wastes memory outside of the group, or wastes memory
inside the group.  As a result, the hard limit is mostly just used to
catch extreme consumption peaks, while workload trimming and balancing
is left to global reclaim and global OOM handling.  That in turn
requires more and more cgroup-awareness on the global level to make up
for the lack of useful policy enforcement on the cgroup level itself.

The ongoing versioning of the cgroup user interface gives us a chance
to fix such brokenness, and also clean up the interface and fix a lot
of the inconsistencies and ugliness that crept in over time.

This series adds a minimal set of control files to version 2 of the
memcg interface, implementing a new approach to machine partitioning.

Version 2 of this series is in response to feedback from Michal.  Some
of the changes are in code, but mostly it improves the documentation
and changelogs to describe the fundamental problems with the original
approach to machine partitioning and makes a case for the new model.

 Documentation/cgroups/unified-hierarchy.txt |  65 ++++++++
 include/linux/res_counter.h                 |  29 ++++
 include/linux/swap.h                        |   6 +-
 kernel/res_counter.c                        |   3 +
 mm/memcontrol.c                             | 250 +++++++++++++++++++---------
 mm/vmscan.c                                 |   7 +-
 6 files changed, 277 insertions(+), 83 deletions(-)
* [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy
  2014-08-08 21:38 [patch 0/4] mm: memcontrol: populate unified hierarchy interface v2 Johannes Weiner
@ 2014-08-08 21:38 ` Johannes Weiner
  0 siblings, 0 replies; 23+ messages in thread

From: Johannes Weiner @ 2014-08-08 21:38 UTC (permalink / raw)
To: linux-mm
Cc: Michal Hocko, Greg Thelen, Vladimir Davydov, Tejun Heo, cgroups, linux-kernel

Provide the most fundamental interface necessary for memory cgroups
to partition the machine for concurrent workloads in unified
hierarchy: report the current usage and allow setting an upper limit
on it.

The upper limit, set in memory.high, is not a strict OOM limit and is
enforced purely by direct reclaim.  This is a deviation from the old
hard upper limit, which history has shown to fail at partitioning a
machine for real workloads in a resource-efficient manner: if chosen
conservatively, the hard limit risks OOM kills; if chosen generously,
memory is underutilized most of the time.  As a result, in practice
the limit is mostly used to contain extremes, and balancing of regular
workingset fluctuations and cache trimming is left to global reclaim
and the global OOM killer, which creates an increasing demand for
complicated cgroup-specific prioritization features in both of them.

The high limit on the other hand is a target size limit that is meant
to trim caches and keep consumption at the average working set size
while providing elasticity for peaks.  This allows memory cgroups to
be useful for workload packing without relying too much on global VM
interventions, except for parallel peaks or inadequate configurations.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/cgroups/unified-hierarchy.txt | 52 +++++++++++++++++
 include/linux/res_counter.h                 | 29 ++++++++++
 kernel/res_counter.c                        |  3 +
 mm/memcontrol.c                             | 89 ++++++++++++++++++++++++++---
 4 files changed, 164 insertions(+), 9 deletions(-)

diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
index 4f4563277864..2d91530b8d6c 100644
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -324,9 +324,61 @@ supported and the interface files "release_agent" and
 
 4-3-3. memory
 
+Memory cgroups account and limit the memory consumption of cgroups,
+but the current limit semantics make the feature hard to use and
+create problems in existing configurations.
+
+4.3.3.1 No more default hard limit
+
+'memory.limit_in_bytes' is the current upper limit that can not be
+exceeded under any circumstances.  If it can not be met by direct
+reclaim, the tasks in the cgroup are OOM killed.
+
+While this may look like a valid approach to partition the machine, in
+practice workloads expand and contract during runtime, and it's
+impossible to get the machine-wide configuration right: if users set
+this hard limit conservatively, they are plagued by cgroup-internal
+OOM kills during peaks while memory might be idle (external waste).
+If they set it too generously, precious resources are either unused or
+wasted on old cache (internal waste).  Because of that, in practice
+users set the hard limit only to handle extremes and then overcommit
+the machine.  This leaves the actual partitioning and group trimming
+to global reclaim and OOM handling, which has led to increasing
+demands for recognizing cgroup policy during global reclaim, and even
+the ability to handle global OOM situations from userspace using
+task-specific memory reserves.  All these outcomes and developments
+show the utter failure of hard limits to effectively partition the
+machine for maximum utilization.
+
+When it comes to monitoring cgroup health, 'memory.pressure_level' was
+added for userspace to monitor memory pressure based on group-internal
+reclaim efficiency.  But as per above the group trimming is mostly
+done by global reclaim and the pressure the group experiences is not
+proportional to its excess.  And once internal pressure actually
+builds, the window between onset and an OOM kill can be very short
+with hard limits - by the time internal pressure is reported to
+userspace, it's often too late to intervene before the group goes OOM.
+Both aspects severely limit the ability to monitor cgroup health,
+detect looming OOM situations, and pinpoint offenders.
+
+In unified hierarchy, the primary means of limiting memory consumption
+is 'memory.high'.  It's enforced by direct reclaim to trim caches and
+keep the workload lean, but can be exceeded during working set peaks.
+This moves the responsibility of partitioning mostly back to memory
+cgroups, and global handling only engages during concurrent peaks.
+
+Configurations can start out by setting this limit to a conservative
+estimate of the average working set size and then make upward
+adjustments based on monitoring high limit excess, workload
+performance, and the global memory situation.
+
+4.3.3.2 Misc changes
+
 - use_hierarchy is on by default and the cgroup file for the flag is
   not created.
 
+- memory.usage_in_bytes is renamed to memory.current to be in line
+  with the new limit naming scheme
+
 5. Planned Changes

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 56b7bc32db4f..27394cfdf1fe 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -32,6 +32,10 @@ struct res_counter {
 	 */
 	unsigned long long max_usage;
 	/*
+	 * the high limit that creates pressure but can be exceeded
+	 */
+	unsigned long long high;
+	/*
 	 * the limit that usage cannot exceed
 	 */
 	unsigned long long limit;
@@ -85,6 +89,7 @@ int res_counter_memparse_write_strategy(const char *buf,
 enum {
 	RES_USAGE,
 	RES_MAX_USAGE,
+	RES_HIGH,
 	RES_LIMIT,
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
@@ -132,6 +137,19 @@ u64 res_counter_uncharge(struct res_counter *counter, unsigned long val);
 u64 res_counter_uncharge_until(struct res_counter *counter,
 			       struct res_counter *top,
 			       unsigned long val);
+
+static inline unsigned long long res_counter_high(struct res_counter *cnt)
+{
+	unsigned long long high = 0;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	if (cnt->usage > cnt->high)
+		high = cnt->usage - cnt->high;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return high;
+}
+
 /**
  * res_counter_margin - calculate chargeable space of a counter
  * @cnt: the counter
@@ -193,6 +211,17 @@ static inline void res_counter_reset_failcnt(struct res_counter *cnt)
 	spin_unlock_irqrestore(&cnt->lock, flags);
 }
 
+static inline int res_counter_set_high(struct res_counter *cnt,
+				       unsigned long long high)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->high = high;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
 static inline int res_counter_set_limit(struct res_counter *cnt,
 					unsigned long long limit)
 {
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index e791130f85a7..26a08be49a3d 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -17,6 +17,7 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
 {
 	spin_lock_init(&counter->lock);
+	counter->high = RES_COUNTER_MAX;
 	counter->limit = RES_COUNTER_MAX;
 	counter->soft_limit = RES_COUNTER_MAX;
 	counter->parent = parent;
@@ -130,6 +131,8 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->usage;
 	case RES_MAX_USAGE:
 		return &counter->max_usage;
+	case RES_HIGH:
+		return &counter->high;
 	case RES_LIMIT:
 		return &counter->limit;
 	case RES_FAILCNT:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4146c0f47ba2..81627387fbd7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2481,8 +2481,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup *mem_over_limit;
-	struct res_counter *fail_res;
 	unsigned long nr_reclaimed;
+	struct res_counter *res;
 	unsigned long long size;
 	bool may_swap = true;
 	bool drained = false;
@@ -2493,16 +2493,16 @@ retry:
 		goto done;
 
 	size = batch * PAGE_SIZE;
-	if (!res_counter_charge(&memcg->res, size, &fail_res)) {
+	if (!res_counter_charge(&memcg->res, size, &res)) {
 		if (!do_swap_account)
 			goto done_restock;
-		if (!res_counter_charge(&memcg->memsw, size, &fail_res))
+		if (!res_counter_charge(&memcg->memsw, size, &res))
 			goto done_restock;
 		res_counter_uncharge(&memcg->res, size);
-		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
+		mem_over_limit = mem_cgroup_from_res_counter(res, memsw);
 		may_swap = false;
 	} else
-		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+		mem_over_limit = mem_cgroup_from_res_counter(res, res);
 
 	if (batch > nr_pages) {
 		batch = nr_pages;
@@ -2579,6 +2579,21 @@ bypass:
 done_restock:
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
+
+	res = &memcg->res;
+	while (res) {
+		unsigned long long high = res_counter_high(res);
+
+		if (high) {
+			unsigned long high_pages = high >> PAGE_SHIFT;
+			struct mem_cgroup *memcg;
+
+			memcg = mem_cgroup_from_res_counter(res, res);
+			try_to_free_mem_cgroup_pages(memcg, high_pages,
+						     gfp_mask, true);
+		}
+		res = res->parent;
+	}
 done:
 	return ret;
 }
@@ -5141,7 +5156,7 @@ out_kfree:
 	return ret;
 }
 
-static struct cftype mem_cgroup_files[] = {
+static struct cftype mem_cgroup_legacy_files[] = {
 	{
 		.name = "usage_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
@@ -5250,7 +5265,7 @@ static struct cftype mem_cgroup_files[] = {
 };
 
 #ifdef CONFIG_MEMCG_SWAP
-static struct cftype memsw_cgroup_files[] = {
+static struct cftype memsw_cgroup_legacy_files[] = {
 	{
 		.name = "memsw.usage_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
@@ -6195,6 +6210,61 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css)
 		mem_cgroup_from_css(root_css)->use_hierarchy = true;
 }
 
+static u64 memory_current_read(struct cgroup_subsys_state *css,
+			       struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return res_counter_read_u64(&memcg->res, RES_USAGE);
+}
+
+static u64 memory_high_read(struct cgroup_subsys_state *css,
+			    struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return res_counter_read_u64(&memcg->res, RES_HIGH);
+}
+
+static ssize_t memory_high_write(struct kernfs_open_file *of,
+				 char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	u64 high;
+	int ret;
+
+	if (mem_cgroup_is_root(memcg))
+		return -EINVAL;
+
+	buf = strim(buf);
+	ret = res_counter_memparse_write_strategy(buf, &high);
+	if (ret)
+		return ret;
+
+	ret = res_counter_set_high(&memcg->res, high);
+	if (ret)
+		return ret;
+
+	high = res_counter_high(&memcg->res);
+	if (high)
+		try_to_free_mem_cgroup_pages(memcg, high >> PAGE_SHIFT,
+					     GFP_KERNEL, true);
+
+	return nbytes;
+}
+
+static struct cftype memory_files[] = {
+	{
+		.name = "current",
+		.read_u64 = memory_current_read,
+	},
+	{
+		.name = "high",
+		.read_u64 = memory_high_read,
+		.write = memory_high_write,
+	},
+};
+
 struct cgroup_subsys memory_cgrp_subsys = {
 	.css_alloc = mem_cgroup_css_alloc,
 	.css_online = mem_cgroup_css_online,
@@ -6205,7 +6275,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.attach = mem_cgroup_move_task,
 	.bind = mem_cgroup_bind,
-	.legacy_cftypes = mem_cgroup_files,
+	.dfl_cftypes = memory_files,
+	.legacy_cftypes = mem_cgroup_legacy_files,
 	.early_init = 0,
 };
 
@@ -6223,7 +6294,7 @@ __setup("swapaccount=", enable_swap_account);
 static void __init memsw_file_init(void)
 {
 	WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys,
-					  memsw_cgroup_files));
+					  memsw_cgroup_legacy_files));
 }
 
 static void __init enable_swap_cgroup(void)
-- 
2.0.3