* [patch 0/4] mm: memcontrol: populate unified hierarchy interface
@ 2014-08-04 21:14 Johannes Weiner
2014-08-04 21:14 ` [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests Johannes Weiner
` (4 more replies)
0 siblings, 5 replies; 23+ messages in thread
From: Johannes Weiner @ 2014-08-04 21:14 UTC (permalink / raw)
To: Andrew Morton
Cc: Michal Hocko, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
Hi,
the ongoing versioning of the cgroup user interface gives us a chance
to clean up the memcg control interface and fix a lot of
inconsistencies and ugliness that crept in over time.
This series adds a minimal set of control files to the new memcg
interface to get basic memcg functionality going in unified hierarchy:
- memory.current: a read-only file that shows current memory usage.
- memory.high: a file that allows setting a high limit on the memory
usage. This is an elastic limit, which is enforced via direct
reclaim, so allocators are throttled once it's reached, but it can
be exceeded and does not trigger OOM kills. This should be a much
more suitable default upper boundary for the majority of use cases
that are better off with some elasticity than with sudden OOM kills.
- memory.max: a file that allows setting a maximum limit on memory
usage which is ultimately enforced by OOM killing tasks in the
group. This is for setups that want strict isolation at the cost of
task death above a certain point. However, even those can still
combine the max limit with the high limit to approach OOM situations
gracefully and with time to intervene.
- memory.vmstat: vmstat-style per-memcg statistics. Very minimal for
now (lru stats, allocations and frees, faults), but fixing
fundamental issues of the old memory.stat file, including gross
misnomers like pgpgin/pgpgout for pages charged/uncharged etc.
Documentation/cgroups/unified-hierarchy.txt | 18 +++
include/linux/res_counter.h | 29 +++++
include/linux/swap.h | 3 +-
kernel/res_counter.c | 3 +
mm/memcontrol.c | 177 +++++++++++++++++++++++++---
mm/vmscan.c | 3 +-
6 files changed, 216 insertions(+), 17 deletions(-)
^ permalink raw reply	[flat|nested] 23+ messages in thread

* [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
  2014-08-04 21:14 [patch 0/4] mm: memcontrol: populate unified hierarchy interface Johannes Weiner
@ 2014-08-04 21:14 ` Johannes Weiner
  [not found]   ` <1407186897-21048-2-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  2014-08-04 21:14 ` [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy Johannes Weiner
  ` (3 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Johannes Weiner @ 2014-08-04 21:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, Tejun Heo, linux-mm, cgroups, linux-kernel

Instead of passing the request size to direct reclaim, memcg just
manually loops around reclaiming SWAP_CLUSTER_MAX pages until the
charge can succeed.  That potentially wastes scan progress when huge
page allocations require multiple invocations, which always have to
restart from the default scan priority.

Pass the request size as a reclaim target to direct reclaim and leave
it to that code to reach the goal.  Charging will still have to loop
in case concurrent allocations steal reclaim effort, but at least it
doesn't have to loop to meet even the basic request size.  This also
prepares the memcg reclaim API for use with the planned high limit,
to reclaim excess with a single call.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/swap.h |  3 ++-
 mm/memcontrol.c      | 15 +++++++++------
 mm/vmscan.c          |  3 ++-
 3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1b72060f093a..473a3ae4cdd6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -327,7 +327,8 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
-extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
+extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+						  unsigned long nr_pages,
 						  gfp_t gfp_mask, bool noswap);
 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ec4dcf1b9562..ddffeeda2d52 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1793,6 +1793,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 }
 
 static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
+					unsigned long nr_pages,
 					gfp_t gfp_mask,
 					unsigned long flags)
 {
@@ -1808,7 +1809,8 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
 	for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
 		if (loop)
 			drain_all_stock_async(memcg);
-		total += try_to_free_mem_cgroup_pages(memcg, gfp_mask, noswap);
+		total += try_to_free_mem_cgroup_pages(memcg, nr_pages,
+						      gfp_mask, noswap);
 		/*
 		 * Allow limit shrinkers, which are triggered directly
 		 * by userspace, to catch signals and stop reclaim
@@ -1816,7 +1818,7 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
 		 */
 		if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
 			break;
-		if (mem_cgroup_margin(memcg))
+		if (mem_cgroup_margin(memcg) >= nr_pages)
 			break;
 		/*
 		 * If nothing was reclaimed after two attempts, there
@@ -2572,7 +2574,8 @@ retry:
 	if (!(gfp_mask & __GFP_WAIT))
 		goto nomem;
 
-	nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
+	nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, nr_pages,
+					  gfp_mask, flags);
 
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		goto retry;
@@ -3718,7 +3721,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		mem_cgroup_reclaim(memcg, GFP_KERNEL,
+		mem_cgroup_reclaim(memcg, 1, GFP_KERNEL,
 				   MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
@@ -3777,7 +3780,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		mem_cgroup_reclaim(memcg, GFP_KERNEL,
+		mem_cgroup_reclaim(memcg, 1, GFP_KERNEL,
 				   MEM_CGROUP_RECLAIM_NOSWAP |
 				   MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
@@ -4028,7 +4031,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 		if (signal_pending(current))
 			return -EINTR;
 
-		progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL,
+		progress = try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
 							false);
 		if (!progress) {
 			nr_retries--;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d698f4f7b0f2..7db33f100db4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2747,6 +2747,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
 }
 
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
 					   bool noswap)
 {
@@ -2754,7 +2755,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	unsigned long nr_reclaimed;
 	int nid;
 	struct scan_control sc = {
-		.nr_to_reclaim = SWAP_CLUSTER_MAX,
+		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
 		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 		.target_mem_cgroup = memcg,
-- 
2.0.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 23+ messages in thread
[parent not found: <1407186897-21048-2-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>]
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
  [not found] ` <1407186897-21048-2-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2014-08-07 13:08   ` Michal Hocko
  [not found]     ` <20140807130822.GB12730-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2014-08-07 13:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Mon 04-08-14 17:14:54, Johannes Weiner wrote:
> Instead of passing the request size to direct reclaim, memcg just
> manually loops around reclaiming SWAP_CLUSTER_MAX pages until the
> charge can succeed. That potentially wastes scan progress when huge
> page allocations require multiple invocations, which always have to
> restart from the default scan priority.
>
> Pass the request size as a reclaim target to direct reclaim and leave
> it to that code to reach the goal.

THP charge then will ask for 512 pages to be (direct) reclaimed.  That
is _a lot_ and I would expect long stalls to achieve this target.  I
would also expect quick priority drop down and potential over-reclaim
for small and moderately sized memcgs (e.g. memcg with 1G worth of pages
would need to drop down below DEF_PRIORITY-2 to have a chance to scan
that many pages).  All that done for a charge which can fall back to a
single page charge.

The current code is quite hostile to THP when we are close to the limit
but solving this by introducing long stalls instead doesn't sound like a
proper approach to me.

> Charging will still have to loop in case concurrent allocations steal
> reclaim effort, but at least it doesn't have to loop to meet even the
> basic request size. This also prepares the memcg reclaim API for use
> with the planned high limit, to reclaim excess with a single call.
[...]

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
[parent not found: <20140807130822.GB12730-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
  [not found]     ` <20140807130822.GB12730-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2014-08-07 15:31       ` Johannes Weiner
  2014-08-07 16:10         ` Greg Thelen
  2014-08-08 12:32         ` Michal Hocko
  0 siblings, 2 replies; 23+ messages in thread
From: Johannes Weiner @ 2014-08-07 15:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Thu, Aug 07, 2014 at 03:08:22PM +0200, Michal Hocko wrote:
> On Mon 04-08-14 17:14:54, Johannes Weiner wrote:
> > Instead of passing the request size to direct reclaim, memcg just
> > manually loops around reclaiming SWAP_CLUSTER_MAX pages until the
> > charge can succeed. That potentially wastes scan progress when huge
> > page allocations require multiple invocations, which always have to
> > restart from the default scan priority.
> >
> > Pass the request size as a reclaim target to direct reclaim and leave
> > it to that code to reach the goal.
>
> THP charge then will ask for 512 pages to be (direct) reclaimed.  That
> is _a lot_ and I would expect long stalls to achieve this target.  I
> would also expect quick priority drop down and potential over-reclaim
> for small and moderately sized memcgs (e.g. memcg with 1G worth of pages
> would need to drop down below DEF_PRIORITY-2 to have a chance to scan
> that many pages).  All that done for a charge which can fall back to a
> single page charge.
>
> The current code is quite hostile to THP when we are close to the limit
> but solving this by introducing long stalls instead doesn't sound like a
> proper approach to me.

THP latencies are actually the same when comparing high limit nr_pages
reclaim with the current hard limit SWAP_CLUSTER_MAX reclaim, although
system time is reduced with the high limit.

High limit reclaim with SWAP_CLUSTER_MAX has better fault latency but
it doesn't actually contain the workload - with 1G high and a 4G load,
the consumption at the end of the run is 3.7G.

So what I'm proposing works and is of equal quality from a THP POV.
This change is complicated enough when we stick to the facts, let's
not make up things based on gut feeling.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
  2014-08-07 15:31 ` Johannes Weiner
@ 2014-08-07 16:10   ` Greg Thelen
  2014-08-08 12:47     ` Michal Hocko
  2014-08-08 12:32   ` Michal Hocko
  1 sibling, 1 reply; 23+ messages in thread
From: Greg Thelen @ 2014-08-07 16:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Thu, Aug 07 2014, Johannes Weiner wrote:
> On Thu, Aug 07, 2014 at 03:08:22PM +0200, Michal Hocko wrote:
>> On Mon 04-08-14 17:14:54, Johannes Weiner wrote:
>> > Instead of passing the request size to direct reclaim, memcg just
>> > manually loops around reclaiming SWAP_CLUSTER_MAX pages until the
>> > charge can succeed. That potentially wastes scan progress when huge
>> > page allocations require multiple invocations, which always have to
>> > restart from the default scan priority.
>> >
>> > Pass the request size as a reclaim target to direct reclaim and leave
>> > it to that code to reach the goal.
>>
>> THP charge then will ask for 512 pages to be (direct) reclaimed.  That
>> is _a lot_ and I would expect long stalls to achieve this target.  I
>> would also expect quick priority drop down and potential over-reclaim
>> for small and moderately sized memcgs (e.g. memcg with 1G worth of pages
>> would need to drop down below DEF_PRIORITY-2 to have a chance to scan
>> that many pages).  All that done for a charge which can fall back to a
>> single page charge.
>>
>> The current code is quite hostile to THP when we are close to the limit
>> but solving this by introducing long stalls instead doesn't sound like a
>> proper approach to me.
>
> THP latencies are actually the same when comparing high limit nr_pages
> reclaim with the current hard limit SWAP_CLUSTER_MAX reclaim, although
> system time is reduced with the high limit.
>
> High limit reclaim with SWAP_CLUSTER_MAX has better fault latency but
> it doesn't actually contain the workload - with 1G high and a 4G load,
> the consumption at the end of the run is 3.7G.
>
> So what I'm proposing works and is of equal quality from a THP POV.
> This change is complicated enough when we stick to the facts, let's
> not make up things based on gut feeling.

I think that high order non-THP page allocations also benefit from
this.  Such allocations don't have a small page fallback.

This may be in flux, but linux-next shows me that:
* mem_cgroup_reclaim() frees at least SWAP_CLUSTER_MAX (32) pages.
* try_charge() calls mem_cgroup_reclaim() indefinitely for costly (3)
  or smaller orders, assuming that something is reclaimed on each
  iteration.
* try_charge() uses a loop of MEM_CGROUP_RECLAIM_RETRIES (5) for
  larger-than-costly orders.

So for larger-than-costly allocations, try_charge() should be able to
reclaim 160 (5*32) pages, which satisfies an order:7 allocation.  But
for order:8+ allocations, try_charge() and mem_cgroup_reclaim() are
too eager to give up without something like this.  So I think this
patch is a step in the right direction.

Coincidentally, we've recently been experimenting with something like
this.  Though we didn't modify the interface between
mem_cgroup_reclaim() and try_to_free_mem_cgroup_pages() - instead we
looped within mem_cgroup_reclaim() until nr_pages of margin were
found.  But I have no objection to the proposed plumbing of nr_pages
all the way into try_to_free_mem_cgroup_pages().

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
  2014-08-07 16:10 ` Greg Thelen
@ 2014-08-08 12:47   ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2014-08-08 12:47 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Johannes Weiner, Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Thu 07-08-14 09:10:43, Greg Thelen wrote:
> On Thu, Aug 07 2014, Johannes Weiner wrote:
[...]
> > So what I'm proposing works and is of equal quality from a THP POV.
> > This change is complicated enough when we stick to the facts, let's
> > not make up things based on gut feeling.
>
> I think that high order non-THP page allocations also benefit from this.
> Such allocations don't have a small page fallback.
>
> This may be in flux, but linux-next shows me that:
> * mem_cgroup_reclaim() frees at least SWAP_CLUSTER_MAX (32) pages.
> * try_charge() calls mem_cgroup_reclaim() indefinitely for costly (3)
>   or smaller orders, assuming that something is reclaimed on each
>   iteration.
> * try_charge() uses a loop of MEM_CGROUP_RECLAIM_RETRIES (5) for
>   larger-than-costly orders.

Unless there is __GFP_NORETRY, which fails the charge after the first
round of unsuccessful reclaim.  This is the case regardless of
nr_pages, but only THP are charged with __GFP_NORETRY currently.

> So for larger-than-costly allocations, try_charge() should be able to
> reclaim 160 (5*32) pages, which satisfies an order:7 allocation.  But
> for order:8+ allocations, try_charge() and mem_cgroup_reclaim() are
> too eager to give up without something like this.  So I think this
> patch is a step in the right direction.

I think we should be careful about charges which are OK to fail
because there is a fallback for them (THP).  The only other high-order
charges are coming from kmem, and I am not yet sure what to do about
those without memcg-specific slab reclaim.  I wouldn't make this
discussion more complicated for that case now.

> Coincidentally, we've recently been experimenting with something like
> this.  Though we didn't modify the interface between
> mem_cgroup_reclaim() and try_to_free_mem_cgroup_pages() - instead we
> looped within mem_cgroup_reclaim() until nr_pages of margin were
> found.  But I have no objection to the proposed plumbing of nr_pages
> all the way into try_to_free_mem_cgroup_pages().

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
  2014-08-07 15:31 ` Johannes Weiner
  2014-08-07 16:10   ` Greg Thelen
@ 2014-08-08 12:32   ` Michal Hocko
  2014-08-08 13:26     ` Johannes Weiner
  1 sibling, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2014-08-08 12:32 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Thu 07-08-14 11:31:41, Johannes Weiner wrote:
> On Thu, Aug 07, 2014 at 03:08:22PM +0200, Michal Hocko wrote:
> > On Mon 04-08-14 17:14:54, Johannes Weiner wrote:
> > > Instead of passing the request size to direct reclaim, memcg just
> > > manually loops around reclaiming SWAP_CLUSTER_MAX pages until the
> > > charge can succeed. That potentially wastes scan progress when huge
> > > page allocations require multiple invocations, which always have to
> > > restart from the default scan priority.
> > >
> > > Pass the request size as a reclaim target to direct reclaim and leave
> > > it to that code to reach the goal.
> >
> > THP charge then will ask for 512 pages to be (direct) reclaimed.  That
> > is _a lot_ and I would expect long stalls to achieve this target.  I
> > would also expect quick priority drop down and potential over-reclaim
> > for small and moderately sized memcgs (e.g. memcg with 1G worth of pages
> > would need to drop down below DEF_PRIORITY-2 to have a chance to scan
> > that many pages).  All that done for a charge which can fall back to a
> > single page charge.
> >
> > The current code is quite hostile to THP when we are close to the limit
> > but solving this by introducing long stalls instead doesn't sound like a
> > proper approach to me.
>
> THP latencies are actually the same when comparing high limit nr_pages
> reclaim with the current hard limit SWAP_CLUSTER_MAX reclaim,

Are you sure about this?  I fail to see how they can be the same, as THP
allocations/charges are __GFP_NORETRY, so there is only one reclaim
round for the hard limit reclaim, followed by the charge failure if
it is not successful.

> although system time is reduced with the high limit.
> High limit reclaim with SWAP_CLUSTER_MAX has better fault latency but
> it doesn't actually contain the workload - with 1G high and a 4G load,
> the consumption at the end of the run is 3.7G.

Wouldn't it help to simply fail the charge and allow the charger to
fall back for THP allocations if the usage is above the high limit too
much?  The follow-up single page charge fallback would still be
throttled.

> So what I'm proposing works and is of equal quality from a THP POV.
> This change is complicated enough when we stick to the facts, let's
> not make up things based on gut feeling.

Agreed, and I would expect those _facts_ to be part of the changelog.

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
  2014-08-08 12:32 ` Michal Hocko
@ 2014-08-08 13:26   ` Johannes Weiner
  2014-08-11  7:49     ` Michal Hocko
  2014-08-13 14:59     ` Michal Hocko
  0 siblings, 2 replies; 23+ messages in thread
From: Johannes Weiner @ 2014-08-08 13:26 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Fri, Aug 08, 2014 at 02:32:58PM +0200, Michal Hocko wrote:
> On Thu 07-08-14 11:31:41, Johannes Weiner wrote:
> > On Thu, Aug 07, 2014 at 03:08:22PM +0200, Michal Hocko wrote:
> > > On Mon 04-08-14 17:14:54, Johannes Weiner wrote:
> > > > Instead of passing the request size to direct reclaim, memcg just
> > > > manually loops around reclaiming SWAP_CLUSTER_MAX pages until the
> > > > charge can succeed. That potentially wastes scan progress when huge
> > > > page allocations require multiple invocations, which always have to
> > > > restart from the default scan priority.
> > > >
> > > > Pass the request size as a reclaim target to direct reclaim and leave
> > > > it to that code to reach the goal.
> > >
> > > THP charge then will ask for 512 pages to be (direct) reclaimed.  That
> > > is _a lot_ and I would expect long stalls to achieve this target.  I
> > > would also expect quick priority drop down and potential over-reclaim
> > > for small and moderately sized memcgs (e.g. memcg with 1G worth of pages
> > > would need to drop down below DEF_PRIORITY-2 to have a chance to scan
> > > that many pages).  All that done for a charge which can fall back to a
> > > single page charge.
> > >
> > > The current code is quite hostile to THP when we are close to the limit
> > > but solving this by introducing long stalls instead doesn't sound like a
> > > proper approach to me.
> >
> > THP latencies are actually the same when comparing high limit nr_pages
> > reclaim with the current hard limit SWAP_CLUSTER_MAX reclaim,
>
> Are you sure about this?  I fail to see how they can be the same, as THP
> allocations/charges are __GFP_NORETRY, so there is only one reclaim
> round for the hard limit reclaim, followed by the charge failure if
> it is not successful.

I use this test program that faults in anon pages, reports average and
max for every 512-page chunk (THP size), then reports the aggregate at
the end:

memory.max:

avg=18729us max=450625us

real	0m14.335s
user	0m0.157s
sys	0m6.307s

memory.high:

avg=18676us max=457499us

real	0m14.375s
user	0m0.046s
sys	0m4.294s

> > although system time is reduced with the high limit.
> > High limit reclaim with SWAP_CLUSTER_MAX has better fault latency but
> > it doesn't actually contain the workload - with 1G high and a 4G load,
> > the consumption at the end of the run is 3.7G.
>
> Wouldn't it help to simply fail the charge and allow the charger to
> fall back for THP allocations if the usage is above the high limit too
> much?  The follow-up single page charge fallback would still be
> throttled.

This is about defining the limit semantics in unified hierarchy, and
not really the time or place to optimize THP charge latency.

What are you trying to accomplish here?

> > So what I'm proposing works and is of equal quality from a THP POV.
> > This change is complicated enough when we stick to the facts, let's
> > not make up things based on gut feeling.
>
> Agreed, and I would expect those _facts_ to be part of the changelog.

You made unfounded claims about THP allocation latencies, and I showed
the numbers to refute them, but that doesn't make any of this relevant
for the changelogs of these patches.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
  2014-08-08 13:26 ` Johannes Weiner
@ 2014-08-11  7:49   ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2014-08-11  7:49 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Fri 08-08-14 09:26:35, Johannes Weiner wrote:
> On Fri, Aug 08, 2014 at 02:32:58PM +0200, Michal Hocko wrote:
> > On Thu 07-08-14 11:31:41, Johannes Weiner wrote:
[...]
> > > although system time is reduced with the high limit.
> > > High limit reclaim with SWAP_CLUSTER_MAX has better fault latency but
> > > it doesn't actually contain the workload - with 1G high and a 4G load,
> > > the consumption at the end of the run is 3.7G.
> >
> > Wouldn't it help to simply fail the charge and allow the charger to
> > fall back for THP allocations if the usage is above the high limit too
> > much?  The follow-up single page charge fallback would still be
> > throttled.
>
> This is about defining the limit semantics in unified hierarchy, and
> not really the time or place to optimize THP charge latency.
>
> What are you trying to accomplish here?

Well, there are two things.  The first is that this patch changes the
way THP are charged against the hard limit without any data in the
changelog to back it up.  This is the primary concern.

The other part is the high limit behavior for a large excess.  You
have chosen to reclaim the entire excess even when quite a lot of
pages might have to be direct reclaimed.  This is potentially
dangerous because the excess might be really huge (consider multiple
tasks charging THPs simultaneously on many CPUs).  Do you really want
to direct reclaim nr_online_cpus * 512 pages in a single direct
reclaim pass, and on all of those CPUs?  This is an extreme case, all
right, but the point stands.  There has to be a certain cap.

Also, it seems that the primary source of trouble is THP, so the
question is: do we really want to push hard to reclaim enough charges,
or would we rather fail the THP charge and go with the single page
retry?

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests 2014-08-08 13:26 ` Johannes Weiner 2014-08-11 7:49 ` Michal Hocko @ 2014-08-13 14:59 ` Michal Hocko 2014-08-13 20:41 ` Johannes Weiner 1 sibling, 1 reply; 23+ messages in thread From: Michal Hocko @ 2014-08-13 14:59 UTC (permalink / raw) To: Johannes Weiner; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel On Fri 08-08-14 09:26:35, Johannes Weiner wrote: > On Fri, Aug 08, 2014 at 02:32:58PM +0200, Michal Hocko wrote: > > On Thu 07-08-14 11:31:41, Johannes Weiner wrote: [...] > > > THP latencies are actually the same when comparing high limit nr_pages > > > reclaim with the current hard limit SWAP_CLUSTER_MAX reclaim, > > > > Are you sure about this? I fail to see how they can be same as THP > > allocations/charges are __GFP_NORETRY so there is only one reclaim > > round for the hard limit reclaim followed by the charge failure if > > it is not successful. > > I use this test program that faults in anon pages, reports average and > max for every 512-page chunk (THP size), then reports the aggregate at > the end: > > memory.max: > > avg=18729us max=450625us > > real 0m14.335s > user 0m0.157s > sys 0m6.307s > > memory.high: > > avg=18676us max=457499us > > real 0m14.375s > user 0m0.046s > sys 0m4.294s I was playing with something like that as well. mmap 800MB anon mapping in 256MB memcg (kvm guest had 1G RAM and 2G swap so the global reclaim doesn't trigger and the host 2G free memory), start faulting in from THP aligned address and measured each fault. Then I was recording mm_vmscan_lru_shrink_inactive and mm_vmscan_memcg_reclaim_{begin,end} tracepoints to see how the reclaim went. I was testing two setups 1) fault in every 4k page 2) fault in only 2M aligned addresses. The first simulates the case where successful THP allocation saves follow up 511 fallback charges and so the excessive reclaim might pay off. 
The second one simulates potential time wasting when memory is used extremely sparsely and any latencies would be unwelcome.

(new refers to nr_reclaim target, old to SWAP_CLUSTER_MAX; thponly faults only 2M-aligned addresses, otherwise every 4k page is faulted)

vmstat says:
out.256m.new-thponly.vmstat.after:pswpin 44
out.256m.new-thponly.vmstat.after:pswpout 154681
out.256m.new-thponly.vmstat.after:thp_fault_alloc 399
out.256m.new-thponly.vmstat.after:thp_fault_fallback 0
out.256m.new-thponly.vmstat.after:thp_split 302

out.256m.old-thponly.vmstat.after:pswpin 28
out.256m.old-thponly.vmstat.after:pswpout 31271
out.256m.old-thponly.vmstat.after:thp_fault_alloc 149
out.256m.old-thponly.vmstat.after:thp_fault_fallback 250
out.256m.old-thponly.vmstat.after:thp_split 61

out.256m.new.vmstat.after:pswpin 48
out.256m.new.vmstat.after:pswpout 169530
out.256m.new.vmstat.after:thp_fault_alloc 399
out.256m.new.vmstat.after:thp_fault_fallback 0
out.256m.new.vmstat.after:thp_split 331

out.256m.old.vmstat.after:pswpin 47
out.256m.old.vmstat.after:pswpout 156514
out.256m.old.vmstat.after:thp_fault_alloc 127
out.256m.old.vmstat.after:thp_fault_fallback 272
out.256m.old.vmstat.after:thp_split 127

As expected, new managed to fault in all requests as THP without a single fallback allocation, while with the old reclaim we got to the limit and then most of the THP charges failed and fell back to single-page charges. Note the increased swapout activity for new: it is almost 5x more for thponly and +8% with per-page faults. This looks like fallout from the over-reclaim at smaller priorities. The tracepoints tell us the priority at which each reclaim round ended:

- trace.new-thponly
  Count Priority
      1 3
      2 5
    159 6
     24 7
- trace.old-thponly
    230 10
      1 11
      1 12
      1 3
     39 9

Again, as expected, the priority drops much further for new.
- trace.new
    229 0
      3 12
- trace.old
    294 0
      2 1
     25 10
      1 11
      3 12
      8 2
      8 3
     20 4
     33 5
     21 6
     43 7
   1286 8
   1279 9

And here as well, we have to reclaim much more because we make many more charges, so the load benefits a bit from the high reclaim target.

The mm_vmscan_memcg_reclaim_end tracepoint also tells us how many pages were reclaimed during each run; the cumulative numbers are:
- trace.new-thponly: 139029
- trace.old-thponly: 11344
- trace.new: 139687
- trace.old: 139887

time -v says:
out.256m.new-thponly.time: System time (seconds): 1.50
out.256m.new-thponly.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.56
out.256m.old-thponly.time: System time (seconds): 0.45
out.256m.old-thponly.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.76

out.256m.new.time: System time (seconds): 1.45
out.256m.new.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.12
out.256m.old.time: System time (seconds): 2.08
out.256m.old.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.26

I guess this is expected as well. Sparse access doesn't amortize the costly reclaim for each charged THP. On the other hand it can help a bit if the whole mmap is populated.

If we compare fault latencies then we get the following:
- the worst latency [ms]:
  out.256m.new-thponly 1991
  out.256m.old-thponly 1838
  out.256m.new 6197
  out.256m.old 5538
- top 5 worst latencies (sum in [ms]):
  out.256m.new-thponly 5694
  out.256m.old-thponly 3168
  out.256m.new 9498
  out.256m.old 8291
- top 10:
  out.256m.new-thponly 7139
  out.256m.old-thponly 3193
  out.256m.new 11786
  out.256m.old 9347
- top 100:
  out.256m.new-thponly 13035
  out.256m.old-thponly 3434
  out.256m.new 14634
  out.256m.old 12881

I think this shows that my concern about excessive reclaim and stalls is real, and that it is worse when the memory is used sparsely.
It is true it might help when the whole THP section is used, so the additional cost is amortized, but the more sparsely each THP section is used, the higher the overhead you are adding without userspace actually asking for it.

-- Michal Hocko SUSE Labs
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests 2014-08-13 14:59 ` Michal Hocko @ 2014-08-13 20:41 ` Johannes Weiner 2014-08-14 16:12 ` Michal Hocko 0 siblings, 1 reply; 23+ messages in thread From: Johannes Weiner @ 2014-08-13 20:41 UTC (permalink / raw) To: Michal Hocko; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel On Wed, Aug 13, 2014 at 04:59:04PM +0200, Michal Hocko wrote: > On Fri 08-08-14 09:26:35, Johannes Weiner wrote: > > On Fri, Aug 08, 2014 at 02:32:58PM +0200, Michal Hocko wrote: > > > On Thu 07-08-14 11:31:41, Johannes Weiner wrote: > [...] > > > > THP latencies are actually the same when comparing high limit nr_pages > > > > reclaim with the current hard limit SWAP_CLUSTER_MAX reclaim, > > > > > > Are you sure about this? I fail to see how they can be same as THP > > > allocations/charges are __GFP_NORETRY so there is only one reclaim > > > round for the hard limit reclaim followed by the charge failure if > > > it is not successful. > > > > I use this test program that faults in anon pages, reports average and > > max for every 512-page chunk (THP size), then reports the aggregate at > > the end: > > > > memory.max: > > > > avg=18729us max=450625us > > > > real 0m14.335s > > user 0m0.157s > > sys 0m6.307s > > > > memory.high: > > > > avg=18676us max=457499us > > > > real 0m14.375s > > user 0m0.046s > > sys 0m4.294s > > I was playing with something like that as well. mmap 800MB anon mapping > in 256MB memcg (kvm guest had 1G RAM and 2G swap so the global reclaim > doesn't trigger and the host 2G free memory), start faulting in from > THP aligned address and measured each fault. Then I was recording > mm_vmscan_lru_shrink_inactive and mm_vmscan_memcg_reclaim_{begin,end} > tracepoints to see how the reclaim went. > > I was testing two setups > 1) fault in every 4k page > 2) fault in only 2M aligned addresses. 
> > The first simulates the case where successful THP allocation saves > follow up 511 fallback charges and so the excessive reclaim might > pay off. > The second one simulates potential time wasting when memory is used > extremely sparsely and any latencies would be unwelcome. > > (new refers to nr_reclaim target, old to SWAP_CLUSTER_MAX, thponly > faults only 2M aligned addresses, 4k pages are faulted otherwise) > > vmstat says: > out.256m.new-thponly.vmstat.after:pswpin 44 > out.256m.new-thponly.vmstat.after:pswpout 154681 > out.256m.new-thponly.vmstat.after:thp_fault_alloc 399 > out.256m.new-thponly.vmstat.after:thp_fault_fallback 0 > out.256m.new-thponly.vmstat.after:thp_split 302 > > out.256m.old-thponly.vmstat.after:pswpin 28 > out.256m.old-thponly.vmstat.after:pswpout 31271 > out.256m.old-thponly.vmstat.after:thp_fault_alloc 149 > out.256m.old-thponly.vmstat.after:thp_fault_fallback 250 > out.256m.old-thponly.vmstat.after:thp_split 61 > > out.256m.new.vmstat.after:pswpin 48 > out.256m.new.vmstat.after:pswpout 169530 > out.256m.new.vmstat.after:thp_fault_alloc 399 > out.256m.new.vmstat.after:thp_fault_fallback 0 > out.256m.new.vmstat.after:thp_split 331 > > out.256m.old.vmstat.after:pswpin 47 > out.256m.old.vmstat.after:pswpout 156514 > out.256m.old.vmstat.after:thp_fault_alloc 127 > out.256m.old.vmstat.after:thp_fault_fallback 272 > out.256m.old.vmstat.after:thp_split 127 > > As expected new managed to fault in all requests as THP without a single > fallback allocation while with the old reclaim we got to the limit and > then most of the THP charges failed and fallen back to single page > charge. In a more realistic workload, global reclaim and compaction have to invest a decent amount of work to create the 2MB page in the first place. You would be wasting this work on the off-chance that only a small part of the THP is actually used. 
Once that 2MB page is already assembled, I don't think the burden on memcg is high enough to justify wasting that work speculatively. If we really have a latency issue here, I think the right solution is to attempt the charge first - because it's much less work - and only if it succeeds allocate and commit an actual 2MB physical page. But I have yet to be convinced that there is a practical issue here. Who uses only 4k out of every 2MB area and enables THP? The 'thponly' scenario is absurd. > Note the increased swapout activity for new. It is almost 5x more for > thponly and +8% with per-page faults. This looks like a fallout from the > over-reclaim in smaller priorities. > > - trace.new > 229 0 > 3 12 > - trace.old > 294 0 > 2 1 > 25 10 > 1 11 > 3 12 > 8 2 > 8 3 > 20 4 > 33 5 > 21 6 > 43 7 > 1286 8 > 1279 9 > > And here as well, we have to reclaim much more because we do much more > charges so the load benefits a bit from the high reclaim target. > > mm_vmscan_memcg_reclaim_end tracepoint tells us also how many pages were > reclaimed during each run and the cummulative numbers are: > - trace.new-thponly: 139029 > - trace.old-thponly: 11344 > - trace.new: 139687 > - trace.old: 139887 Here the number of reclaimed pages is actually lower in new, so I'm guessing the increase in swapouts above is variation between runs, as there doesn't seem to be a significant amount of cache in that group. > time -v says: > out.256m.new-thponly.time: System time (seconds): 1.50 > out.256m.new-thponly.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.56 > out.256m.old-thponly.time: System time (seconds): 0.45 > out.256m.old-thponly.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.76 > > out.256m.new.time: System time (seconds): 1.45 > out.256m.new.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.12 > out.256m.old.time: System time (seconds): 2.08 > out.256m.old.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.26 > > I guess this is expected as well. 
Sparse access doesn't amortize the > costly reclaim for each charged THP. On the other hand it can help a bit > if the whole mmap is populated. > > If we compare fault latencies then we get the following: > - the worst latency [ms]: > out.256m.new-thponly 1991 > out.256m.old-thponly 1838 > out.256m.new 6197 > out.256m.old 5538 > > - top 5 worst latencies (sum in [ms]): > out.256m.new-thponly 5694 > out.256m.old-thponly 3168 > out.256m.new 9498 > out.256m.old 8291 > > - top 10 > out.256m.new-thponly 7139 > out.256m.old-thponly 3193 > out.256m.new 11786 > out.256m.old 9347 > > - top 100 > out.256m.new-thponly 13035 > out.256m.old-thponly 3434 > out.256m.new 14634 > out.256m.old 12881 > > I think this shows up that my concern about excessive reclaim and stalls > is real and it is worse when the memory is used sparsely. It is true it > might help when the whole THP section is used and so the additional cost > is amortized but the more sparsely each THP section is used the higher > overhead you are adding without userspace actually asking for it. THP is expected to have some overhead in terms of initial fault cost and space efficiency, don't use it when you get little to no benefit from it. It can be argued that my patch moves that breakeven point a little bit, but the THP-positive end of the spectrum is much better off: THP coverage goes from 37% to 100%, while reclaim efficiency is significantly improved and system time significantly reduced. You demonstrated a THP-workload that really benefits from my change, and another workload that shouldn't be using THP in the first place. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests 2014-08-13 20:41 ` Johannes Weiner @ 2014-08-14 16:12 ` Michal Hocko 0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2014-08-14 16:12 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Wed 13-08-14 16:41:34, Johannes Weiner wrote:
> On Wed, Aug 13, 2014 at 04:59:04PM +0200, Michal Hocko wrote:
[...]
> > I think this shows that my concern about excessive reclaim and stalls
> > is real and it is worse when the memory is used sparsely. It is true it
> > might help when the whole THP section is used and so the additional cost
> > is amortized but the more sparsely each THP section is used the higher
> > overhead you are adding without userspace actually asking for it.
>
> THP is expected to have some overhead in terms of initial fault cost

Yes, but that overhead should be as small as possible. Direct reclaim with such a big target will lead to all kinds of problems.

> and space efficiency, don't use it when you get little to no benefit
> from it.

Do you really expect that all such users will use MADV_NOHUGEPAGE just to prevent reclaim stalls? This sounds unrealistic to me. Instead we will end up with THP disabled globally, the same way we saw it when THP was introduced and caused all kinds of reclaim issues.

> It can be argued that my patch moves that breakeven point a
> little bit, but the THP-positive end of the spectrum is much better
> off: THP coverage goes from 37% to 100%, while reclaim efficiency is
> significantly improved and system time significantly reduced.

I didn't see significantly improved reclaim efficiency; the only difference was that reclaim happened fewer times. The system time is reduced, but the elapsed time is less than 1% better in the per-page walk and more than 3 times worse for the other extreme.
> You demonstrated a THP-workload that really benefits from my change,
> and another workload that shouldn't be using THP in the first place.

I do not think that the presented test case is appropriate for any reclaim decision evaluation. A linear used-once walker usually benefits from excessive reclaim in general. The only point I wanted to raise is that the numbers look much worse when the memory is used sparsely, and thponly is the obvious worst case.

So if you want to increase the THP charge success rate, then back it with real numbers from real loads and show that the potential regressions are unlikely and outweighed by the overall improvements. Until then, NACK to this patch from me. The change is too risky.

Besides that, I believe you do not need this change for the high limit, because it can fail the charge for excessive THP charges just as the hard limit does. So you do not have the high-limit escape problem. As mentioned in the other email, reclaiming the whole high-limit excess as a target is even more risky because a heavy parallel load on many CPUs can cause a large excess and direct reclaim of much more than 512 pages.

-- Michal Hocko SUSE Labs
* [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy 2014-08-04 21:14 [patch 0/4] mm: memcontrol: populate unified hierarchy interface Johannes Weiner 2014-08-04 21:14 ` [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests Johannes Weiner @ 2014-08-04 21:14 ` Johannes Weiner 2014-08-07 13:36 ` Michal Hocko 2014-08-04 21:14 ` [patch 3/4] mm: memcontrol: add memory.max " Johannes Weiner ` (2 subsequent siblings) 4 siblings, 1 reply; 23+ messages in thread From: Johannes Weiner @ 2014-08-04 21:14 UTC (permalink / raw) To: Andrew Morton; +Cc: Michal Hocko, Tejun Heo, linux-mm, cgroups, linux-kernel Provide the most fundamental interface necessary for memory cgroups to be useful in the default hierarchy: report the current usage and allow setting an upper limit on it. The upper limit, set in memory.high, is not a strict OOM limit and can be breached under pressure. But once it's breached, allocators are throttled and forced into reclaim to clean up the excess. This has many advantages over more traditional hard upper limits and is thus more suitable as the default upper boundary: First, the limit is artificial and not due to shortness of actual memory, so invoking the OOM killer to enforce it seems excessive for the majority of usecases. It's much preferable to breach the limit temporarily and throttle the allocators, which in turn gives managing software a chance to detect pressure in the group and intervene by re-evaluating the limit, migrating the job to another machine, hot-plugging memory etc., without fearing interfering OOM kills. A secondary concern is allocation fairness: requiring the limit to always be met allows the reclaim efforts of one allocator to be stolen by concurrent allocations. Most usecases would prefer temporarily exceeding the default upper limit by a few pages over starving random allocators indefinitely. 
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- Documentation/cgroups/unified-hierarchy.txt | 6 ++ include/linux/res_counter.h | 29 ++++++++++ kernel/res_counter.c | 3 + mm/memcontrol.c | 87 ++++++++++++++++++++++++++--- 4 files changed, 116 insertions(+), 9 deletions(-) diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt index 4f4563277864..fd4f7f6847f6 100644 --- a/Documentation/cgroups/unified-hierarchy.txt +++ b/Documentation/cgroups/unified-hierarchy.txt @@ -327,6 +327,12 @@ supported and the interface files "release_agent" and - use_hierarchy is on by default and the cgroup file for the flag is not created. +- memory.limit_in_bytes is removed as the primary upper boundary and + replaced with memory.high, a soft upper limit that will put memory + pressure on the group but can be breached in favor of OOM killing. + +- memory.usage_in_bytes is renamed to memory.current to be in line + with the new naming scheme 5. Planned Changes diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h index 56b7bc32db4f..27394cfdf1fe 100644 --- a/include/linux/res_counter.h +++ b/include/linux/res_counter.h @@ -32,6 +32,10 @@ struct res_counter { */ unsigned long long max_usage; /* + * the high limit that creates pressure but can be exceeded + */ + unsigned long long high; + /* * the limit that usage cannot exceed */ unsigned long long limit; @@ -85,6 +89,7 @@ int res_counter_memparse_write_strategy(const char *buf, enum { RES_USAGE, RES_MAX_USAGE, + RES_HIGH, RES_LIMIT, RES_FAILCNT, RES_SOFT_LIMIT, @@ -132,6 +137,19 @@ u64 res_counter_uncharge(struct res_counter *counter, unsigned long val); u64 res_counter_uncharge_until(struct res_counter *counter, struct res_counter *top, unsigned long val); + +static inline unsigned long long res_counter_high(struct res_counter *cnt) +{ + unsigned long long high = 0; + unsigned long flags; + + spin_lock_irqsave(&cnt->lock, flags); + if (cnt->usage > cnt->high) + high 
= cnt->usage - cnt->high; + spin_unlock_irqrestore(&cnt->lock, flags); + return high; +} + /** * res_counter_margin - calculate chargeable space of a counter * @cnt: the counter @@ -193,6 +211,17 @@ static inline void res_counter_reset_failcnt(struct res_counter *cnt) spin_unlock_irqrestore(&cnt->lock, flags); } +static inline int res_counter_set_high(struct res_counter *cnt, + unsigned long long high) +{ + unsigned long flags; + + spin_lock_irqsave(&cnt->lock, flags); + cnt->high = high; + spin_unlock_irqrestore(&cnt->lock, flags); + return 0; +} + static inline int res_counter_set_limit(struct res_counter *cnt, unsigned long long limit) { diff --git a/kernel/res_counter.c b/kernel/res_counter.c index e791130f85a7..26a08be49a3d 100644 --- a/kernel/res_counter.c +++ b/kernel/res_counter.c @@ -17,6 +17,7 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent) { spin_lock_init(&counter->lock); + counter->high = RES_COUNTER_MAX; counter->limit = RES_COUNTER_MAX; counter->soft_limit = RES_COUNTER_MAX; counter->parent = parent; @@ -130,6 +131,8 @@ res_counter_member(struct res_counter *counter, int member) return &counter->usage; case RES_MAX_USAGE: return &counter->max_usage; + case RES_HIGH: + return &counter->high; case RES_LIMIT: return &counter->limit; case RES_FAILCNT: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ddffeeda2d52..5a64fa96c08a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2530,8 +2530,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned int batch = max(CHARGE_BATCH, nr_pages); int nr_retries = MEM_CGROUP_RECLAIM_RETRIES; struct mem_cgroup *mem_over_limit; - struct res_counter *fail_res; unsigned long nr_reclaimed; + struct res_counter *res; unsigned long flags = 0; unsigned long long size; int ret = 0; @@ -2541,16 +2541,16 @@ retry: goto done; size = batch * PAGE_SIZE; - if (!res_counter_charge(&memcg->res, size, &fail_res)) { + if (!res_counter_charge(&memcg->res, size, &res)) { if 
(!do_swap_account) goto done_restock; - if (!res_counter_charge(&memcg->memsw, size, &fail_res)) + if (!res_counter_charge(&memcg->memsw, size, &res)) goto done_restock; res_counter_uncharge(&memcg->res, size); - mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw); + mem_over_limit = mem_cgroup_from_res_counter(res, memsw); flags |= MEM_CGROUP_RECLAIM_NOSWAP; } else - mem_over_limit = mem_cgroup_from_res_counter(fail_res, res); + mem_over_limit = mem_cgroup_from_res_counter(res, res); if (batch > nr_pages) { batch = nr_pages; @@ -2621,6 +2621,20 @@ bypass: done_restock: if (batch > nr_pages) refill_stock(memcg, batch - nr_pages); + + res = &memcg->res; + while (res) { + unsigned long long high = res_counter_high(res); + + if (high) { + unsigned long high_pages = high >> PAGE_SHIFT; + struct mem_cgroup *memcg; + + memcg = mem_cgroup_from_res_counter(res, res); + mem_cgroup_reclaim(memcg, high_pages, gfp_mask, 0); + } + res = res->parent; + } done: return ret; } @@ -5196,7 +5210,7 @@ out_kfree: return ret; } -static struct cftype mem_cgroup_files[] = { +static struct cftype mem_cgroup_legacy_files[] = { { .name = "usage_in_bytes", .private = MEMFILE_PRIVATE(_MEM, RES_USAGE), @@ -5305,7 +5319,7 @@ static struct cftype mem_cgroup_files[] = { }; #ifdef CONFIG_MEMCG_SWAP -static struct cftype memsw_cgroup_files[] = { +static struct cftype memsw_cgroup_legacy_files[] = { { .name = "memsw.usage_in_bytes", .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE), @@ -6250,6 +6264,60 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css) mem_cgroup_from_css(root_css)->use_hierarchy = true; } +static u64 memory_current_read(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + + return res_counter_read_u64(&memcg->res, RES_USAGE); +} + +static u64 memory_high_read(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + + return 
res_counter_read_u64(&memcg->res, RES_HIGH); +} + +static ssize_t memory_high_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + u64 high; + int ret; + + if (mem_cgroup_is_root(memcg)) + return -EINVAL; + + buf = strim(buf); + ret = res_counter_memparse_write_strategy(buf, &high); + if (ret) + return ret; + + ret = res_counter_set_high(&memcg->res, high); + if (ret) + return ret; + + high = res_counter_high(&memcg->res); + if (high) + mem_cgroup_reclaim(memcg, high >> PAGE_SHIFT, GFP_KERNEL, 0); + + return nbytes; +} + +static struct cftype memory_files[] = { + { + .name = "current", + .read_u64 = memory_current_read, + }, + { + .name = "high", + .read_u64 = memory_high_read, + .write = memory_high_write, + }, +}; + struct cgroup_subsys memory_cgrp_subsys = { .css_alloc = mem_cgroup_css_alloc, .css_online = mem_cgroup_css_online, @@ -6260,7 +6328,8 @@ struct cgroup_subsys memory_cgrp_subsys = { .cancel_attach = mem_cgroup_cancel_attach, .attach = mem_cgroup_move_task, .bind = mem_cgroup_bind, - .legacy_cftypes = mem_cgroup_files, + .dfl_cftypes = memory_files, + .legacy_cftypes = mem_cgroup_legacy_files, .early_init = 0, }; @@ -6278,7 +6347,7 @@ __setup("swapaccount=", enable_swap_account); static void __init memsw_file_init(void) { WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys, - memsw_cgroup_files)); + memsw_cgroup_legacy_files)); } static void __init enable_swap_cgroup(void) -- 2.0.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy 2014-08-04 21:14 ` [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy Johannes Weiner @ 2014-08-07 13:36 ` Michal Hocko 2014-08-07 13:39 ` Michal Hocko 2014-08-07 15:47 ` Johannes Weiner 0 siblings, 2 replies; 23+ messages in thread From: Michal Hocko @ 2014-08-07 13:36 UTC (permalink / raw) To: Johannes Weiner; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel On Mon 04-08-14 17:14:55, Johannes Weiner wrote: [...] > @@ -132,6 +137,19 @@ u64 res_counter_uncharge(struct res_counter *counter, unsigned long val); > u64 res_counter_uncharge_until(struct res_counter *counter, > struct res_counter *top, > unsigned long val); > + > +static inline unsigned long long res_counter_high(struct res_counter *cnt) soft limit used res_counter_soft_limit_excess which has quite a long name but at least those two should be consistent. I will post two helper patches which I have used to make this and other operations on res counter easier as a reply to this. > +{ > + unsigned long long high = 0; > + unsigned long flags; > + > + spin_lock_irqsave(&cnt->lock, flags); > + if (cnt->usage > cnt->high) > + high = cnt->usage - cnt->high; > + spin_unlock_irqrestore(&cnt->lock, flags); > + return high; > +} > + > /** > * res_counter_margin - calculate chargeable space of a counter > * @cnt: the counter > @@ -193,6 +211,17 @@ static inline void res_counter_reset_failcnt(struct res_counter *cnt) > spin_unlock_irqrestore(&cnt->lock, flags); > } > > +static inline int res_counter_set_high(struct res_counter *cnt, > + unsigned long long high) > +{ > + unsigned long flags; > + > + spin_lock_irqsave(&cnt->lock, flags); > + cnt->high = high; > + spin_unlock_irqrestore(&cnt->lock, flags); > + return 0; > +} > + [...] 
> @@ -2541,16 +2541,16 @@ retry: > goto done; > > size = batch * PAGE_SIZE; > - if (!res_counter_charge(&memcg->res, size, &fail_res)) { > + if (!res_counter_charge(&memcg->res, size, &res)) { > if (!do_swap_account) > goto done_restock; > - if (!res_counter_charge(&memcg->memsw, size, &fail_res)) > + if (!res_counter_charge(&memcg->memsw, size, &res)) > goto done_restock; > res_counter_uncharge(&memcg->res, size); > - mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw); > + mem_over_limit = mem_cgroup_from_res_counter(res, memsw); > flags |= MEM_CGROUP_RECLAIM_NOSWAP; > } else > - mem_over_limit = mem_cgroup_from_res_counter(fail_res, res); > + mem_over_limit = mem_cgroup_from_res_counter(res, res); > > if (batch > nr_pages) { > batch = nr_pages; > @@ -2621,6 +2621,20 @@ bypass: > done_restock: > if (batch > nr_pages) > refill_stock(memcg, batch - nr_pages); > + > + res = &memcg->res; > + while (res) { > + unsigned long long high = res_counter_high(res); > + > + if (high) { > + unsigned long high_pages = high >> PAGE_SHIFT; > + struct mem_cgroup *memcg; > + > + memcg = mem_cgroup_from_res_counter(res, res); > + mem_cgroup_reclaim(memcg, high_pages, gfp_mask, 0); > + } > + res = res->parent; > + } > done: > return ret; > } Why haven't you followed what we do for hard limit here? 
In my implementation I have the following: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a37465fcd8ae..6a797c740ea5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2529,6 +2529,21 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb, return NOTIFY_OK; } +static bool high_limit_excess(struct mem_cgroup *memcg, + struct mem_cgroup **memcg_over_limit) +{ + struct mem_cgroup *parent = memcg; + + do { + if (res_counter_limit_excess(&parent->res, RES_HIGH_LIMIT)) { + *memcg_over_limit = parent; + return true; + } + } while ((parent = parent_mem_cgroup(parent))); + + return false; +} + static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned int nr_pages) { @@ -2623,6 +2638,10 @@ bypass: goto retry; done_restock: + /* Throttle charger a bit if it is above high limit. */ + if (high_limit_excess(memcg, &mem_over_limit)) + mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags); + if (batch > nr_pages) refill_stock(memcg, batch - nr_pages); done: -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy
  2014-08-07 13:36           ` Michal Hocko
@ 2014-08-07 13:39             ` Michal Hocko
  2014-08-07 15:47             ` Johannes Weiner
  1 sibling, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2014-08-07 13:39 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Thu 07-08-14 15:36:14, Michal Hocko wrote:
> On Mon 04-08-14 17:14:55, Johannes Weiner wrote:
> [...]
> > @@ -132,6 +137,19 @@ u64 res_counter_uncharge(struct res_counter *counter, unsigned long val);
> >  u64 res_counter_uncharge_until(struct res_counter *counter,
> >  			       struct res_counter *top,
> >  			       unsigned long val);
> > +
> > +static inline unsigned long long res_counter_high(struct res_counter *cnt)
> 
> soft limit used res_counter_soft_limit_excess which has quite a long
> name but at least those two should be consistent.
> I will post two helper patches which I have used to make this and other
> operations on res counter easier as a reply to this.

These two have been sleeping in my queue for quite some time. I didn't
get to post them yet, but if you think they make sense I can try to
rebase them on the current tree and post.
---
From 3f3185306b225931a45387f288645ba9044565d0 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Thu, 19 Jun 2014 19:14:31 +0200
Subject: [PATCH 1/2] res_counter: provide res_counter_write_u64

Allow setting a res_counter member to a given value generically. This
will reduce code duplication for the new limits added by this patch
series.

Use the new helper to replace the one-off res_counter_set_soft_limit.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/res_counter.h | 14 ++------------
 kernel/res_counter.c        | 14 ++++++++++++++
 mm/memcontrol.c             |  2 +-
 3 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 56b7bc32db4f..bea7f9f45f7a 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -71,6 +71,8 @@ struct res_counter {
 
 u64 res_counter_read_u64(struct res_counter *counter, int member);
 
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
+
 ssize_t res_counter_read(struct res_counter *counter, int member,
 		const char __user *buf, size_t nbytes, loff_t *pos,
 		int (*read_strategy)(unsigned long long val, char *s));
@@ -208,16 +210,4 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
 	return ret;
 }
 
-static inline int
-res_counter_set_soft_limit(struct res_counter *cnt,
-			   unsigned long long soft_limit)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&cnt->lock, flags);
-	cnt->soft_limit = soft_limit;
-	spin_unlock_irqrestore(&cnt->lock, flags);
-	return 0;
-}
-
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index e791130f85a7..4789c2323a94 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -171,11 +171,25 @@ u64 res_counter_read_u64(struct res_counter *counter, int member)
 
 	return ret;
 }
+
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&counter->lock, flags);
+	*res_counter_member(counter, member) = val;
+	spin_unlock_irqrestore(&counter->lock, flags);
+}
 #else
 u64 res_counter_read_u64(struct res_counter *counter, int member)
 {
 	return *res_counter_member(counter, member);
 }
+
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
+{
+	*res_counter_member(counter, member) = val;
+}
 #endif
 
 int res_counter_memparse_write_strategy(const char *buf,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d1b311687769..1ad5d4a2bc4e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4378,7 +4378,7 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
 		 * control without swap
 		 */
 		if (type == _MEM)
-			ret = res_counter_set_soft_limit(&memcg->res, val);
+			res_counter_write_u64(&memcg->res, name, val);
 		else
 			ret = -EINVAL;
 		break;
-- 
2.1.0.rc1

---
From 8c79f2c209f806b97ec368c3e649ef58caeb7e99 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Thu, 19 Jun 2014 19:42:25 +0200
Subject: [PATCH 2/2] memcg, res_counter: replace res_counter_soft_limit_excess
 by a more generic helper

Later patches in the series will add new limits which we will want to
check for excess as well, so change the current one-off
res_counter_soft_limit_excess to a more generic helper.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/res_counter.h | 21 +++++++++++++++------
 mm/memcontrol.c             | 10 +++++-----
 2 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index bea7f9f45f7a..9015013784fa 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -156,23 +156,32 @@ static inline unsigned long long res_counter_margin(struct res_counter *cnt)
 }
 
 /**
- * Get the difference between the usage and the soft limit
+ * Get the difference between the usage and the limit defined
+ * by the given member
  * @cnt: The counter
  *
- * Returns 0 if usage is less than or equal to soft limit
- * The difference between usage and soft limit, otherwise.
+ * Returns 0 if usage is less than or equal to the limit defined
+ * by member or the difference otherwise.
 */
 static inline unsigned long long
-res_counter_soft_limit_excess(struct res_counter *cnt)
+res_counter_limit_excess(struct res_counter *cnt, int member)
 {
 	unsigned long long excess;
+	unsigned long long limit;
 	unsigned long flags;
 
 	spin_lock_irqsave(&cnt->lock, flags);
-	if (cnt->usage <= cnt->soft_limit)
+	switch (member) {
+	case RES_SOFT_LIMIT:
+		limit = cnt->soft_limit;
+		break;
+	default:
+		BUG();
+	}
+	if (cnt->usage <= limit)
 		excess = 0;
 	else
-		excess = cnt->usage - cnt->soft_limit;
+		excess = cnt->usage - limit;
 	spin_unlock_irqrestore(&cnt->lock, flags);
 	return excess;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ad5d4a2bc4e..75b5db78e9be 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -773,7 +773,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 	 */
 	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
 		mz = mem_cgroup_page_zoneinfo(memcg, page);
-		excess = res_counter_soft_limit_excess(&memcg->res);
+		excess = res_counter_limit_excess(&memcg->res, RES_SOFT_LIMIT);
 		/*
 		 * We have to update the tree if mz is on RB-tree or
 		 * mem is over its softlimit.
@@ -827,7 +827,7 @@ retry:
 	 * position in the tree.
 	 */
 	__mem_cgroup_remove_exceeded(mz, mctz);
-	if (!res_counter_soft_limit_excess(&mz->memcg->res) ||
+	if (!res_counter_limit_excess(&mz->memcg->res, RES_SOFT_LIMIT) ||
 	    !css_tryget_online(&mz->memcg->css))
 		goto retry;
 done:
@@ -1983,7 +1983,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 		.priority = 0,
 	};
 
-	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
+	excess = res_counter_limit_excess(&root_memcg->res, RES_SOFT_LIMIT) >> PAGE_SHIFT;
 
 	while (1) {
 		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
@@ -2014,7 +2014,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 		total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
 						     zone, &nr_scanned);
 		*total_scanned += nr_scanned;
-		if (!res_counter_soft_limit_excess(&root_memcg->res))
+		if (!res_counter_limit_excess(&root_memcg->res, RES_SOFT_LIMIT))
 			break;
 	}
 	mem_cgroup_iter_break(root_memcg, victim);
@@ -3941,7 +3941,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 			} while (1);
 		}
 		__mem_cgroup_remove_exceeded(mz, mctz);
-		excess = res_counter_soft_limit_excess(&mz->memcg->res);
+		excess = res_counter_limit_excess(&mz->memcg->res, RES_SOFT_LIMIT);
 		/*
 		 * One school of thought says that we should not add
 		 * back the node to the tree if reclaim returns 0.
-- 
2.1.0.rc1

-- 
Michal Hocko
SUSE Labs
* Re: [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy
  2014-08-07 13:36           ` Michal Hocko
  2014-08-07 13:39             ` Michal Hocko
@ 2014-08-07 15:47             ` Johannes Weiner
  1 sibling, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2014-08-07 15:47 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Thu, Aug 07, 2014 at 03:36:14PM +0200, Michal Hocko wrote:
> On Mon 04-08-14 17:14:55, Johannes Weiner wrote:
> [...]
> > @@ -132,6 +137,19 @@ u64 res_counter_uncharge(struct res_counter *counter, unsigned long val);
> >  u64 res_counter_uncharge_until(struct res_counter *counter,
> >  			       struct res_counter *top,
> >  			       unsigned long val);
> > +
> > +static inline unsigned long long res_counter_high(struct res_counter *cnt)
> 
> soft limit used res_counter_soft_limit_excess which has quite a long
> name but at least those two should be consistent.

That name is horrible and a result of "soft_limit" being completely
nondescriptive. I really see no point in trying to be consistent with
this stuff that we are trying hard to delete.

> > @@ -2621,6 +2621,20 @@ bypass:
> >  done_restock:
> >  	if (batch > nr_pages)
> >  		refill_stock(memcg, batch - nr_pages);
> > +
> > +	res = &memcg->res;
> > +	while (res) {
> > +		unsigned long long high = res_counter_high(res);
> > +
> > +		if (high) {
> > +			unsigned long high_pages = high >> PAGE_SHIFT;
> > +			struct mem_cgroup *memcg;
> > +
> > +			memcg = mem_cgroup_from_res_counter(res, res);
> > +			mem_cgroup_reclaim(memcg, high_pages, gfp_mask, 0);
> > +		}
> > +		res = res->parent;
> > +	}
> >  done:
> >  	return ret;
> >  }
> 
> Why haven't you followed what we do for hard limit here?

I did.

> In my implementation I have the following:
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a37465fcd8ae..6a797c740ea5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2529,6 +2529,21 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
>  	return NOTIFY_OK;
>  }
>  
> +static bool high_limit_excess(struct mem_cgroup *memcg,
> +			      struct mem_cgroup **memcg_over_limit)
> +{
> +	struct mem_cgroup *parent = memcg;
> +
> +	do {
> +		if (res_counter_limit_excess(&parent->res, RES_HIGH_LIMIT)) {
> +			*memcg_over_limit = parent;
> +			return true;
> +		}
> +	} while ((parent = parent_mem_cgroup(parent)));
> +
> +	return false;
> +}
> +
>  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		      unsigned int nr_pages)
>  {
> @@ -2623,6 +2638,10 @@ bypass:
>  	goto retry;
>  
>  done_restock:
> +	/* Throttle charger a bit if it is above high limit. */
> +	if (high_limit_excess(memcg, &mem_over_limit))
> +		mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);

This is not what the hard limit does. The hard limit, by its nature,
can only be exceeded at one level at a time, so we try to charge, check
the closest limit that was hit, reclaim, then retry. This means we are
reclaiming up the hierarchy to enforce the hard limit on each level.

I do the same here: reclaim up the hierarchy to enforce the high limit
on each level. Your proposal only reclaims the closest offender,
leaving higher hierarchy levels in excess.
* [patch 3/4] mm: memcontrol: add memory.max to default hierarchy
  2014-08-04 21:14 [patch 0/4] mm: memcontrol: populate unified hierarchy interface Johannes Weiner
  2014-08-04 21:14 ` [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests Johannes Weiner
  2014-08-04 21:14 ` [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy Johannes Weiner
@ 2014-08-04 21:14 ` Johannes Weiner
  2014-08-04 21:14 ` [patch 4/4] mm: memcontrol: add memory.vmstat " Johannes Weiner
  [not found] ` <1407186897-21048-1-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  4 siblings, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2014-08-04 21:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, Tejun Heo, linux-mm, cgroups, linux-kernel

There are cases where a strict upper limit on a memcg is required, for
example, when containers are rented out and interference between them
cannot be tolerated.

Provide memory.max, a limit that cannot be breached and will trigger
group-internal OOM killing once page reclaim can no longer enforce it.

This can be combined with the high limit to create a window in which
allocating tasks are throttled, to approach the strict maximum limit
gracefully and with opportunity for the user or admin to intervene.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/cgroups/unified-hierarchy.txt |  4 ++++
 mm/memcontrol.c                             | 35 +++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
index fd4f7f6847f6..6c52c926810f 100644
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -334,6 +334,10 @@ supported and the interface files "release_agent" and
 - memory.usage_in_bytes is renamed to memory.current to be in line
   with the new naming scheme
 
+- memory.max provides a hard upper limit as a last-resort backup to
+  memory.high for situations with aggressive isolation requirements.
+
+
 5. Planned Changes
 
 5-1. CAP for resource control
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5a64fa96c08a..461834c86b94 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6306,6 +6306,36 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static u64 memory_max_read(struct cgroup_subsys_state *css,
+			   struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return res_counter_read_u64(&memcg->res, RES_LIMIT);
+}
+
+static ssize_t memory_max_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	u64 max;
+	int ret;
+
+	if (mem_cgroup_is_root(memcg))
+		return -EINVAL;
+
+	buf = strim(buf);
+	ret = res_counter_memparse_write_strategy(buf, &max);
+	if (ret)
+		return ret;
+
+	ret = mem_cgroup_resize_limit(memcg, max);
+	if (ret)
+		return ret;
+
+	return nbytes;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -6316,6 +6346,11 @@ static struct cftype memory_files[] = {
 		.read_u64 = memory_high_read,
 		.write = memory_high_write,
 	},
+	{
+		.name = "max",
+		.read_u64 = memory_max_read,
+		.write = memory_max_write,
+	},
 };
 
 struct cgroup_subsys memory_cgrp_subsys = {
-- 
2.0.3
* [patch 4/4] mm: memcontrol: add memory.vmstat to default hierarchy
  2014-08-04 21:14 [patch 0/4] mm: memcontrol: populate unified hierarchy interface Johannes Weiner
  ` (2 preceding siblings ...)
  2014-08-04 21:14 ` [patch 3/4] mm: memcontrol: add memory.max " Johannes Weiner
@ 2014-08-04 21:14 ` Johannes Weiner
  [not found] ` <1407186897-21048-1-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  4 siblings, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2014-08-04 21:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, Tejun Heo, linux-mm, cgroups, linux-kernel

Provide basic per-memcg vmstat-style statistics on LRU sizes,
allocated and freed pages, major and minor faults.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/cgroups/unified-hierarchy.txt |  8 ++++++
 mm/memcontrol.c                             | 40 +++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
index 6c52c926810f..180b260c510a 100644
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -337,6 +337,14 @@ supported and the interface files "release_agent" and
 - memory.max provides a hard upper limit as a last-resort backup to
   memory.high for situations with aggressive isolation requirements.
 
+- memory.stat has been replaced by memory.vmstat, which provides
+  page-based statistics in the style of /proc/vmstat.
+
+  As cgroups are now always hierarchical and no longer allow tasks in
+  intermediate levels, the local state is irrelevant and all
+  statistics represent the state of the entire hierarchy rooted at the
+  given group.
+
 5. Planned Changes
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 461834c86b94..6502e1cfc0fc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6336,6 +6336,42 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static u64 tree_events(struct mem_cgroup *memcg, int event)
+{
+	struct mem_cgroup *mi;
+	u64 val = 0;
+
+	for_each_mem_cgroup_tree(mi, memcg)
+		val += mem_cgroup_read_events(mi, event);
+	return val;
+}
+
+static int memory_vmstat_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	struct mem_cgroup *mi;
+	int i;
+
+	for (i = 0; i < NR_LRU_LISTS; i++) {
+		u64 val = 0;
+
+		for_each_mem_cgroup_tree(mi, memcg)
+			val += mem_cgroup_nr_lru_pages(mi, BIT(i));
+		seq_printf(m, "%s %llu\n", vmstat_text[NR_LRU_BASE + i], val);
+	}
+
+	seq_printf(m, "pgalloc %llu\n",
+		   tree_events(memcg, MEM_CGROUP_EVENTS_PGPGIN));
+	seq_printf(m, "pgfree %llu\n",
+		   tree_events(memcg, MEM_CGROUP_EVENTS_PGPGOUT));
+	seq_printf(m, "pgfault %llu\n",
+		   tree_events(memcg, MEM_CGROUP_EVENTS_PGFAULT));
+	seq_printf(m, "pgmajfault %llu\n",
+		   tree_events(memcg, MEM_CGROUP_EVENTS_PGMAJFAULT));
+
+	return 0;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -6351,6 +6387,10 @@ static struct cftype memory_files[] = {
 		.read_u64 = memory_max_read,
 		.write = memory_max_write,
 	},
+	{
+		.name = "vmstat",
+		.seq_show = memory_vmstat_show,
+	},
 };
 
 struct cgroup_subsys memory_cgrp_subsys = {
-- 
2.0.3
* Re: [patch 0/4] mm: memcontrol: populate unified hierarchy interface
  [not found] ` <1407186897-21048-1-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2014-08-05 12:40   ` Michal Hocko
  2014-08-05 13:53     ` Johannes Weiner
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2014-08-05 12:40 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Mon 04-08-14 17:14:53, Johannes Weiner wrote:
> Hi,
> 
> the ongoing versioning of the cgroup user interface gives us a chance
> to clean up the memcg control interface and fix a lot of
> inconsistencies and ugliness that crept in over time.

The first patch doesn't fit into the series and should be posted
separately.

> This series adds a minimal set of control files to the new memcg
> interface to get basic memcg functionality going in unified hierarchy:

Hmm, I posted an RFC for new knobs quite some time ago and the
discussion died with some questions unanswered, and now you are coming
with a new one. I cannot say I would be happy about that.

One of the concerns was renaming knobs which represent the same
functionality as before. I have posted some concerns but haven't heard
back anything. This series doesn't give any rationale for renaming
either. It is true we have a v2 but that doesn't necessarily mean we
should turn everything upside down.

> - memory.current: a read-only file that shows current memory usage.

Even if we go with renaming existing knobs I really hate this name. The
old one was too long but this is not descriptive enough. The same
applies to max and high. I would expect at least "limit" in the name.

> - memory.high: a file that allows setting a high limit on the memory
>   usage. This is an elastic limit, which is enforced via direct
>   reclaim, so allocators are throttled once it's reached, but it can
>   be exceeded and does not trigger OOM kills. This should be a much
>   more suitable default upper boundary for the majority of use cases
>   that are better off with some elasticity than with sudden OOM kills.

I also thought you wanted to have all the new limits in a single
series. My series is sitting idle until we finally come to a conclusion
on the first set of exposed knobs, so I do not understand why you are
coming with it right now.

> - memory.max: a file that allows setting a maximum limit on memory
>   usage which is ultimately enforced by OOM killing tasks in the
>   group. This is for setups that want strict isolation at the cost of
>   task death above a certain point. However, even those can still
>   combine the max limit with the high limit to approach OOM situations
>   gracefully and with time to intervene.
> 
> - memory.vmstat: vmstat-style per-memcg statistics. Very minimal for
>   now (lru stats, allocations and frees, faults), but fixing
>   fundamental issues of the old memory.stat file, including gross
>   misnomers like pgpgin/pgpgout for pages charged/uncharged etc.

I am definitely for exposing LRU stats and have a half-baked patch
sitting and waiting for some polishing. So I agree with the vmstat
part. Putting it into the stat file is not the greatest match, so a
separate file is good here.

>  Documentation/cgroups/unified-hierarchy.txt |  18 +++
>  include/linux/res_counter.h                 |  29 +++++
>  include/linux/swap.h                        |   3 +-
>  kernel/res_counter.c                        |   3 +
>  mm/memcontrol.c                             | 177 +++++++++++++++++++++++++---
>  mm/vmscan.c                                 |   3 +-
>  6 files changed, 216 insertions(+), 17 deletions(-)

-- 
Michal Hocko
SUSE Labs
* Re: [patch 0/4] mm: memcontrol: populate unified hierarchy interface
  2014-08-05 12:40 ` [patch 0/4] mm: memcontrol: populate unified hierarchy interface Michal Hocko
@ 2014-08-05 13:53   ` Johannes Weiner
  [not found]     ` <20140805135325.GB14734-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Weiner @ 2014-08-05 13:53 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Tue, Aug 05, 2014 at 02:40:33PM +0200, Michal Hocko wrote:
> On Mon 04-08-14 17:14:53, Johannes Weiner wrote:
> > Hi,
> > 
> > the ongoing versioning of the cgroup user interface gives us a chance
> > to clean up the memcg control interface and fix a lot of
> > inconsistencies and ugliness that crept in over time.
> 
> The first patch doesn't fit into the series and should be posted
> separately.

It's a prerequisite for the high limit implementation.

> > This series adds a minimal set of control files to the new memcg
> > interface to get basic memcg functionality going in unified hierarchy:
> 
> Hmm, I posted an RFC for new knobs quite some time ago and the
> discussion died with some questions unanswered, and now you are coming
> with a new one. I cannot say I would be happy about that.

I remembered open questions mainly about other things like swappiness,
charge immigration, kmem limits. My bad, I should have checked. Here
are your concerns on these basic knobs from that email:

---

On Thu, Jul 17, 2014 at 03:45:09PM +0200, Michal Hocko wrote:
> On Wed 16-07-14 11:58:14, Johannes Weiner wrote:
> > How about "memory.current"?
> 
> I wanted old users to change the minimum possible when moving to unified
> hierarchy so I didn't touch the old names.
> Why should we make the end users' lives harder? If there is general
> agreement I have no problem with renaming, I just do not think it is
> really necessary, because there is no real reason why configurations
> which do not use any of the deprecated or unified-hierarchy-only
> features shouldn't run in both unified and legacy hierarchies without
> any changes.

There is the rub, though: you can't *not* use new interfaces. We are
getting rid of the hard limit as the default and we really want people
to rethink their configuration in the light of this. And even if you
would just use the hard limit as before, there is no way we can leave
the name 'memory.limit_in_bytes' when we have in fact 4 different
limits.

So I don't see any way how we can stay 100% backward compatible even
with the most basic memcg functionality of setting an upper limit.

And once you acknowledge that current users don't get around *some*
adjustments, we really owe it to new users to present a clean and
consistent interface.

> I do realize that this is a _new_ API so we can do such radical changes
> but I am also aware that some people have to maintain their stacks on
> top of different kernels and it really sucks to maintain two different
> configurations. In such a case it would be easier for those users to
> stay with the legacy mode which is a fair option but I would much rather
> see them move to the new API sooner rather than later.

There is no way you can use the exact same scripts/configurations for
the old and new API at the same time when the most basic way of using
cgroups and memcg changed in v2.

---

> One of the concerns was renaming knobs which represent the same
> functionality as before. I have posted some concerns but haven't heard
> back anything. This series doesn't give any rationale for renaming
> either. It is true we have a v2 but that doesn't necessarily mean we
> should turn everything upside down.

I'm certainly not going out of my way to turn things upside down, but
the old interface is outrageous. I'm sorry if you can't see that it
badly needs to be cleaned up and fixed. This is the time to do that.

> > - memory.current: a read-only file that shows current memory usage.
> 
> Even if we go with renaming existing knobs I really hate this name. The
> old one was too long but this is not descriptive enough. The same
> applies to max and high. I would expect at least "limit" in the name.

Memory cgroups are about accounting and limiting memory usage. That's
all they do. In that context, current, min, low, high, max seem
perfectly descriptive to me; adding usage and limit seems redundant.

We name syscalls creat() and open() and stat() because, while you have
to look at the manpage once, they are easy to remember, easy to type,
and they keep the code using them readable.

memory.usage_in_bytes was the opposite approach: it tried to describe
all there is to this knob in the name itself, assuming tab completion
would help you type that long name. But we are more and more moving
away from ad-hoc scripting of cgroups and I don't want to optimize for
that anymore at the cost of really unwieldy identifiers.

Like with all user interfaces, we should provide a short and catchy
name and then provide the details in the documentation.

> > - memory.high: a file that allows setting a high limit on the memory
> >   usage. This is an elastic limit, which is enforced via direct
> >   reclaim, so allocators are throttled once it's reached, but it can
> >   be exceeded and does not trigger OOM kills. This should be a much
> >   more suitable default upper boundary for the majority of use cases
> >   that are better off with some elasticity than with sudden OOM kills.
> 
> I also thought you wanted to have all the new limits in a single
> series. My series is sitting idle until we finally come to a conclusion
> on the first set of exposed knobs, so I do not understand why you are
> coming with it right now.

I still would like to, but I'm not sure we can get the guarantees
working in time as unified hierarchy leaves its experimental status.

And I'm fairly confident that we know how the upper limits should
behave and that we are no longer going to change that, and that we
have a decent understanding on how the guarantees are going to work.
* Re: [patch 0/4] mm: memcontrol: populate unified hierarchy interface
  [not found]   ` <20140805135325.GB14734-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2014-08-05 15:27     ` Michal Hocko
  2014-08-07 14:21       ` Johannes Weiner
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2014-08-05 15:27 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Tue 05-08-14 09:53:25, Johannes Weiner wrote:
> On Tue, Aug 05, 2014 at 02:40:33PM +0200, Michal Hocko wrote:
> > On Mon 04-08-14 17:14:53, Johannes Weiner wrote:
> > > Hi,
> > > 
> > > the ongoing versioning of the cgroup user interface gives us a chance
> > > to clean up the memcg control interface and fix a lot of
> > > inconsistencies and ugliness that crept in over time.
> > 
> > The first patch doesn't fit into the series and should be posted
> > separately.
> 
> It's a prerequisite for the high limit implementation.

I do not think it is strictly needed. I am not even sure whether the
patch is OK and have to think more about it. I think you can throttle
high limit breachers by SWAP_CLUSTER_MAX for now.

> > > This series adds a minimal set of control files to the new memcg
> > > interface to get basic memcg functionality going in unified hierarchy:
> > 
> > Hmm, I posted an RFC for new knobs quite some time ago and the
> > discussion died with some questions unanswered, and now you are coming
> > with a new one. I cannot say I would be happy about that.
> 
> I remembered open questions mainly about other things like swappiness,
> charge immigration, kmem limits. My bad, I should have checked. Here
> are your concerns on these basic knobs from that email:
> 
> ---
> 
> On Thu, Jul 17, 2014 at 03:45:09PM +0200, Michal Hocko wrote:
> > On Wed 16-07-14 11:58:14, Johannes Weiner wrote:
> > > How about "memory.current"?
> > 
> > I wanted old users to change the minimum possible when moving to unified
> > hierarchy so I didn't touch the old names.
> > Why should we make the end users' lives harder? If there is general
> > agreement I have no problem with renaming, I just do not think it is
> > really necessary, because there is no real reason why configurations
> > which do not use any of the deprecated or unified-hierarchy-only
> > features shouldn't run in both unified and legacy hierarchies without
> > any changes.
> 
> There is the rub, though: you can't *not* use new interfaces. We are
> getting rid of the hard limit as the default and we really want people
> to rethink their configuration in the light of this. And even if you
> would just use the hard limit as before, there is no way we can leave
> the name 'memory.limit_in_bytes' when we have in fact 4 different
> limits.

We could theoretically keep a single limit and turn other limits into
watermarks. I am _not_ suggesting that now, because I haven't thought
it through, but I just think we should discuss other possible ways
before we go on.

> So I don't see any way how we can stay 100% backward compatible even
> with the most basic memcg functionality of setting an upper limit.
> 
> And once you acknowledge that current users don't get around *some*
> adjustments, we really owe it to new users to present a clean and
> consistent interface.
> 
> > I do realize that this is a _new_ API so we can do such radical changes
> > but I am also aware that some people have to maintain their stacks on
> > top of different kernels and it really sucks to maintain two different
> > configurations. In such a case it would be easier for those users to
> > stay with the legacy mode which is a fair option but I would much rather
> > see them move to the new API sooner rather than later.
> 
> There is no way you can use the exact same scripts/configurations for
> the old and new API at the same time when the most basic way of using
> cgroups and memcg changed in v2.

OK, this is a fair point. Cgroups configuration is probably a bigger
problem. If a script relies on a tool/library to set up the hierarchy
then that tool/library can probably do the mapping from old names to
new as well; otherwise it would need to be rewritten, at least for the
cgroup part. I have no idea how tools (e.g. libcgroup, libvirt and
others) will adapt to the new API and support both APIs at the same
time, though.

> ---
> 
> > One of the concerns was renaming knobs which represent the same
> > functionality as before. I have posted some concerns but haven't heard
> > back anything. This series doesn't give any rationale for renaming
> > either. It is true we have a v2 but that doesn't necessarily mean we
> > should turn everything upside down.
> 
> I'm certainly not going out of my way to turn things upside down, but
> the old interface is outrageous. I'm sorry if you can't see that it
> badly needs to be cleaned up and fixed. This is the time to do that.

Of course I can see many problems. But please let's think twice, and
even more times, when doing radical changes. Many decisions sound
reasonable at the time but turn out badly much later.

> > > - memory.current: a read-only file that shows current memory usage.
> > 
> > Even if we go with renaming existing knobs I really hate this name. The
> > old one was too long but this is not descriptive enough. The same
> > applies to max and high. I would expect at least "limit" in the name.
> 
> Memory cgroups are about accounting and limiting memory usage. That's
> all they do. In that context, current, min, low, high, max seem
> perfectly descriptive to me; adding usage and limit seems redundant.

Getting naming right is always a pain and different people will always
have different views. For example, I really do not like memory.current
and would much prefer memory.usage. I am not a native speaker but
`usage' sounds much less ambiguous to me. Whether shorter (without the
_limit suffix) names for limits are better I don't know. They
certainly seem more descriptive with the suffix to me.

> We name syscalls creat() and open() and stat() because, while you have
> to look at the manpage once, they are easy to remember, easy to type,
> and they keep the code using them readable.
> 
> memory.usage_in_bytes was the opposite approach: it tried to describe
> all there is to this knob in the name itself, assuming tab completion
> would help you type that long name. But we are more and more moving
> away from ad-hoc scripting of cgroups and I don't want to optimize for
> that anymore at the cost of really unwieldy identifiers.

I agree with you. _in_bytes is definitely excessive. It can be nicely
demonstrated by the fact that different units are allowed when setting
the value.

> Like with all user interfaces, we should provide a short and catchy
> name and then provide the details in the documentation.
> 
> > > - memory.high: a file that allows setting a high limit on the memory
> > >   usage. This is an elastic limit, which is enforced via direct
> > >   reclaim, so allocators are throttled once it's reached, but it can
> > >   be exceeded and does not trigger OOM kills. This should be a much
> > >   more suitable default upper boundary for the majority of use cases
> > >   that are better off with some elasticity than with sudden OOM kills.
> > 
> > I also thought you wanted to have all the new limits in a single
> > series. My series is sitting idle until we finally come to a conclusion
> > on the first set of exposed knobs, so I do not understand why you are
> > coming with it right now.
> 
> I still would like to, but I'm not sure we can get the guarantees
> working in time as unified hierarchy leaves its experimental status.

That shouldn't happen sooner than in the next (maybe 2) devel cycle(s)
(http://marc.info/?l=linux-kernel&m=140716172618366).

> And I'm fairly confident that we know how the upper limits should
> behave and that we are no longer going to change that, and that we
> have a decent understanding on how the guarantees are going to work.

I think we should first settle the new knobs before we introduce more.
I understand you would like to have the high limit as the preferable
way from the very beginning, but I think that can wait while the new
API is still in devel mode.

-- 
Michal Hocko
SUSE Labs
* Re: [patch 0/4] mm: memcontrol: populate unified hierarchy interface
  2014-08-05 15:27 ` Michal Hocko
@ 2014-08-07 14:21 ` Johannes Weiner
  0 siblings, 0 replies; 23+ messages in thread

From: Johannes Weiner @ 2014-08-07 14:21 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, Tejun Heo, linux-mm, cgroups, linux-kernel

On Tue, Aug 05, 2014 at 05:27:40PM +0200, Michal Hocko wrote:
> On Tue 05-08-14 09:53:25, Johannes Weiner wrote:
> > On Tue, Aug 05, 2014 at 02:40:33PM +0200, Michal Hocko wrote:
> > > On Mon 04-08-14 17:14:53, Johannes Weiner wrote:
> > > > Hi,
> > > > 
> > > > the ongoing versioning of the cgroup user interface gives us a chance
> > > > to clean up the memcg control interface and fix a lot of
> > > > inconsistencies and ugliness that crept in over time.
> > > 
> > > The first patch doesn't fit into the series and should be posted
> > > separately.
> > 
> > It's a prerequisite for the high limit implementation.
> 
> I do not think it is strictly needed.  I am even not sure whether the
> patch is OK and have to think more about it.  I think you can throttle
> high limit breachers by SWAP_CLUSTER_MAX for now.

It really doesn't work once you have higher order pages.  THP-heavy
workloads overshoot the high limit by a lot if you reclaim 32 pages
for every 512 charged.  In my tests, with the change in question, even
heavily swapping THP loads consistently stay around the high limit,
whereas without it the memory consumption quickly overshoots.

> > > > This series adds a minimal set of control files to the new memcg
> > > > interface to get basic memcg functionality going in unified hierarchy:
> > > 
> > > Hmm, I have posted an RFC for new knobs quite some time ago and the
> > > discussion died with some questions unanswered, and now you are coming
> > > with a new one.  I cannot say I would be happy about that.
> > 
> > I remembered open questions mainly about other things like swappiness,
> > charge immigration, kmem limits.  My bad, I should have checked.  Here
> > are your concerns on these basic knobs from that email:
> > 
> > ---
> > 
> > On Thu, Jul 17, 2014 at 03:45:09PM +0200, Michal Hocko wrote:
> > > On Wed 16-07-14 11:58:14, Johannes Weiner wrote:
> > > > How about "memory.current"?
> > > 
> > > I wanted old users to change the minimum possible when moving to unified
> > > hierarchy so I didn't touch the old names.
> > > Why should we make the end users' life harder? If there is general
> > > agreement I have no problem with renaming, I just do not think it is
> > > really necessary because there is no real reason why configurations
> > > which do not use any of the deprecated or unified-hierarchy-only
> > > features shouldn't run in both unified and legacy hierarchies without
> > > any changes.
> > 
> > There is the rub, though: you can't *not* use new interfaces.  We are
> > getting rid of the hard limit as the default and we really want people
> > to rethink their configuration in the light of this.  And even if you
> > would just use the hard limit as before, there is no way we can leave
> > the name 'memory.limit_in_bytes' when we have in fact 4 different
> > limits.
> 
> We could theoretically keep a single limit and turn other limits into
> watermarks.  I am _not_ suggesting that now because I haven't thought
> that through but I just think we should discuss other possible ways
> before we go on.

I am definitely open to discuss your alternative suggestions, but for
that you have to actually propose them. :)

The reason why I want existing users to rethink the approach to memory
limiting is that the current hard limit completely fails to partition
and isolate in a practical manner, and we are changing the fundamental
approach here.  Pretending to provide backward compatibility through
the names of control knobs is specious, and will lead to more issues
than it actually solves.

> > > One of the concerns was renaming knobs which represent the same
> > > functionality as before.  I have posted some concerns but haven't heard
> > > back anything.  This series doesn't give any rationale for renaming
> > > either.
> > > It is true we have a v2 but that doesn't necessarily mean we should put
> > > everything upside down.
> > 
> > I'm certainly not going out of my way to turn things upside down, but
> > the old interface is outrageous.  I'm sorry if you can't see that it
> > badly needs to be cleaned up and fixed.  This is the time to do that.
> 
> Of course I can see many problems.  But please let's think twice and
> even more times when doing radical changes.  Many decisions sound
> reasonable at the time but then they turn out bad much later.

These are radical changes, and I'm sorry that my justifications were
very terse.  I've updated this patch to include the following in
Documentation/cgroups/unified-hierarchy.txt:

---

4-3-3. memory

Memory cgroups account and limit the memory consumption of cgroups,
but the current limit semantics make the feature hard to use and
create problems in existing configurations.

4.3.3.1 No more default hard limit

'memory.limit_in_bytes' is the current upper limit that can not be
exceeded under any circumstances.  If it can not be met by direct
reclaim, the tasks in the cgroup are OOM killed.

While this may look like a valid approach to partition the machine, in
practice workloads expand and contract during runtime, and it's
impossible to get the machine-wide configuration right: if users set
this hard limit conservatively, they are plagued by cgroup-internal
OOM kills while another group's memory might be idle.  If they set it
too generously, precious resources are wasted.  As a result, many
users overcommit such that the sum of all hard limits exceeds the
machine size, but this puts the actual burden of containment on global
reclaim and OOM handling.  This led to further extremes, such as the
demand for having global reclaim honor group-specific priorities and
minimums, and the ability to handle global OOM situations from
userspace using task-specific physical memory reserves.  All these
outcomes and developments show the utter failure of hard limits to
practically partition the machine for maximum resource utilization.

In unified hierarchy, the primary means of limiting memory consumption
is 'memory.high'.  It's enforced by direct reclaim but can be exceeded
under severe memory pressure.  Memory pressure created by this limit
still applies mainly to the group itself, but it prefers offloading
the excess to the rest of the system in order to avoid OOM killing.

Configurations can start out by setting this limit to a conservative
estimate of the average workload size and then make upward adjustments
based on monitoring high limit excess, workload performance, and the
global memory situation.

In untrusted environments, users may wish to limit the amount of such
offloading in order to contain malicious workloads.  For that purpose,
a hard upper limit can be set through 'memory.max'.

'memory.pressure_level' was added for userspace to monitor memory
pressure based on reclaim efficiency, but the window between initial
memory pressure and an OOM kill is very short with hard limits.  By
the time high pressure is reported to userspace it's often too late to
still intervene before the group goes OOM, thus severely limiting the
usefulness of this feature for anticipating looming OOM situations.

This new approach to limiting allows packing workloads more densely
based on their average workingset size.  Coinciding peaks of multiple
groups are handled by throttling allocations within the groups rather
than putting the full burden on global reclaim and OOM handling, and
pressure situations build up gracefully and allow better monitoring.

---

> > > > - memory.current: a read-only file that shows current memory usage.
> > > 
> > > Even if we go with renaming existing knobs I really hate this name.  The
> > > old one was too long but this is not descriptive enough.  Same applies to
> > > max and high.  I would expect at least limit in the name.
> > 
> > Memory cgroups are about accounting and limiting memory usage.  That's
> > all they do.  In that context, current, min, low, high, max seem
> > perfectly descriptive to me; adding usage and limit seems redundant.
> 
> Getting naming right is always a pain and different people will always
> have different views.  For example I really do not like memory.current
> and would prefer memory.usage much more.  I am not a native speaker but
> `usage' sounds much less ambiguous to me.  Whether shorter (without a
> _limit suffix) names for limits are better I don't know.  They
> certainly seem more descriptive with the suffix to me.

These knobs control the most fundamental behavior of memory cgroups,
which is accounting and then limiting memory consumption, so I think
we can agree at least that we want something short and poignant here
that stands out compared to secondary controls, feature toggles etc.

The reason I went with memory.current over memory.usage is that it's
more consistent with the limit names I chose, which are memory.high
and memory.max.  Going with memory.usage begs the question what
memory.max applies to, and now you need to add 'usage' or 'limit' to
high/max as well (and 'guarantee' to min/low), which moves us away
from short and poignant towards more specific niche control names.
memory.current, memory.high, memory.max all imply the same thing:
memory consumption - what memory cgroups are fundamentally about.

> > We name syscalls creat() and open() and stat() because, while you have
> > to look at the manpage once, they are easy to remember, easy to type,
> > and they keep the code using them readable.
> > 
> > memory.usage_in_bytes was the opposite approach: it tried to describe
> > all there is to this knob in the name itself, assuming tab completion
> > would help you type that long name.  But we are more and more moving
> > away from ad-hoc scripting of cgroups and I don't want to optimize for
> > that anymore at the cost of really unwieldy identifiers.
> 
> I agree with you.  _in_bytes is definitely excessive.  It can be nicely
> demonstrated by the fact that different units are allowed when setting
> the value.

It's maybe misleading for that reason, but that wasn't my main point.
There are certain things we can imply in the name and either explain
in the documentation or assume from the context, and _in_bytes is one
such piece of information.  It's something that you need to know once
and don't need to be reminded of every time you type that control
name.  Likewise, I'm extending this argument that we don't need to
include 'usage' or 'limit' in any of these basic knobs, because that's
*the* thing that memory cgroups do.  It's up to secondary controls to
pick names that do not create ambiguity with those core controls.
* [patch 0/4] mm: memcontrol: populate unified hierarchy interface v2
@ 2014-08-08 21:38 Johannes Weiner
  2014-08-08 21:38 ` [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy Johannes Weiner
  0 siblings, 1 reply; 23+ messages in thread

From: Johannes Weiner @ 2014-08-08 21:38 UTC (permalink / raw)
To: linux-mm
Cc: Michal Hocko, Greg Thelen, Vladimir Davydov, Tejun Heo, cgroups, linux-kernel

Hi,

memory cgroups are fundamentally broken when it comes to partitioning
the machine for many concurrent jobs.  In real life, workloads expand
and contract over time, and the hard limit is too static to reflect
this - it either wastes memory outside of the group, or wastes memory
inside the group.  As a result, the hard limit is mostly just used to
catch extreme consumption peaks, while workload trimming and balancing
is left to global reclaim and global OOM handling.  That in turn
requires more and more cgroup-awareness on the global level to make up
for the lack of useful policy enforcement on the cgroup level itself.

The ongoing versioning of the cgroup user interface gives us a chance
to fix such brokenness, and also clean up the interface and fix a lot
of the inconsistencies and ugliness that crept in over time.

This series adds a minimal set of control files to version 2 of the
memcg interface, implementing a new approach to machine partitioning.

Version 2 of this series is in response to feedback from Michal.  Some
of the changes are in code, but mostly it improves the documentation
and changelogs to describe the fundamental problems with the original
approach to machine partitioning and makes a case for the new model.

 Documentation/cgroups/unified-hierarchy.txt |  65 ++++++++
 include/linux/res_counter.h                 |  29 ++++
 include/linux/swap.h                        |   6 +-
 kernel/res_counter.c                        |   3 +
 mm/memcontrol.c                             | 250 +++++++++++++++++++---------
 mm/vmscan.c                                 |   7 +-
 6 files changed, 277 insertions(+), 83 deletions(-)
* [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy
  2014-08-08 21:38 [patch 0/4] mm: memcontrol: populate unified hierarchy interface v2 Johannes Weiner
@ 2014-08-08 21:38 ` Johannes Weiner
  0 siblings, 0 replies; 23+ messages in thread

From: Johannes Weiner @ 2014-08-08 21:38 UTC (permalink / raw)
To: linux-mm
Cc: Michal Hocko, Greg Thelen, Vladimir Davydov, Tejun Heo, cgroups, linux-kernel

Provide the most fundamental interface necessary for memory cgroups
to partition the machine for concurrent workloads in unified
hierarchy: report the current usage and allow setting an upper limit
on it.

The upper limit, set in memory.high, is not a strict OOM limit and is
enforced purely by direct reclaim.  This is a deviation from the old
hard upper limit, which history has shown to fail at partitioning a
machine for real workloads in a resource-efficient manner: if chosen
conservatively, the hard limit risks OOM kills; if chosen generously,
memory is underutilized most of the time.  As a result, in practice
the limit is mostly used to contain extremes, and balancing of regular
workingset fluctuations and cache trimming is left to global reclaim
and the global OOM killer, which creates an increasing demand for
complicated cgroup-specific prioritization features in both of them.

The high limit on the other hand is a target size limit that is meant
to trim caches and keep consumption at the average working set size
while providing elasticity for peaks.  This allows memory cgroups to
be useful for workload packing without relying too much on global VM
interventions, except for parallel peaks or inadequate configurations.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/cgroups/unified-hierarchy.txt | 52 +++++++++++++++++
 include/linux/res_counter.h                 | 29 ++++++++++
 kernel/res_counter.c                        |  3 +
 mm/memcontrol.c                             | 89 ++++++++++++++++++++++++++---
 4 files changed, 164 insertions(+), 9 deletions(-)

diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
index 4f4563277864..2d91530b8d6c 100644
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -324,9 +324,61 @@ supported and the interface files "release_agent" and
 
 4-3-3. memory
 
+Memory cgroups account and limit the memory consumption of cgroups,
+but the current limit semantics make the feature hard to use and
+create problems in existing configurations.
+
+4.3.3.1 No more default hard limit
+
+'memory.limit_in_bytes' is the current upper limit that can not be
+exceeded under any circumstances.  If it can not be met by direct
+reclaim, the tasks in the cgroup are OOM killed.
+
+While this may look like a valid approach to partition the machine, in
+practice workloads expand and contract during runtime, and it's
+impossible to get the machine-wide configuration right: if users set
+this hard limit conservatively, they are plagued by cgroup-internal
+OOM kills during peaks while memory might be idle (external waste).
+If they set it too generously, precious resources are either unused or
+wasted on old cache (internal waste).  Because of that, in practice
+users set the hard limit only to handle extremes and then overcommit
+the machine.  This leaves the actual partitioning and group trimming
+to global reclaim and OOM handling, which has led to increasing
+demands for recognizing cgroup policy during global reclaim, and even
+the ability to handle global OOM situations from userspace using
+task-specific memory reserves.  All these outcomes and developments
+show the utter failure of hard limits to effectively partition the
+machine for maximum utilization.
+
+When it comes to monitoring cgroup health, 'memory.pressure_level' was
+added for userspace to monitor memory pressure based on group-internal
+reclaim efficiency.  But as per above the group trimming is mostly
+done by global reclaim and the pressure the group experiences is not
+proportional to its excess.  And once internal pressure actually
+builds, the window between onset and an OOM kill can be very short
+with hard limits - by the time internal pressure is reported to
+userspace, it's often too late to intervene before the group goes OOM.
+Both aspects severely limit the ability to monitor cgroup health,
+detect looming OOM situations, and pinpoint offenders.
+
+In unified hierarchy, the primary means of limiting memory consumption
+is 'memory.high'.  It's enforced by direct reclaim to trim caches and
+keep the workload lean, but can be exceeded during working set peaks.
+This moves the responsibility of partitioning mostly back to memory
+cgroups, and global handling only engages during concurrent peaks.
+
+Configurations can start out by setting this limit to a conservative
+estimate of the average working set size and then make upward
+adjustments based on monitoring high limit excess, workload
+performance, and the global memory situation.
+
+4.3.3.2 Misc changes
+
 - use_hierarchy is on by default and the cgroup file for the flag is
   not created.
 
+- memory.usage_in_bytes is renamed to memory.current to be in line
+  with the new limit naming scheme
+
 5. Planned Changes

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 56b7bc32db4f..27394cfdf1fe 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -32,6 +32,10 @@ struct res_counter {
 	 */
 	unsigned long long max_usage;
 	/*
+	 * the high limit that creates pressure but can be exceeded
+	 */
+	unsigned long long high;
+	/*
 	 * the limit that usage cannot exceed
 	 */
 	unsigned long long limit;
@@ -85,6 +89,7 @@ int res_counter_memparse_write_strategy(const char *buf,
 enum {
 	RES_USAGE,
 	RES_MAX_USAGE,
+	RES_HIGH,
 	RES_LIMIT,
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
@@ -132,6 +137,19 @@ u64 res_counter_uncharge(struct res_counter *counter, unsigned long val);
 u64 res_counter_uncharge_until(struct res_counter *counter,
 			       struct res_counter *top,
 			       unsigned long val);
+
+static inline unsigned long long res_counter_high(struct res_counter *cnt)
+{
+	unsigned long long high = 0;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	if (cnt->usage > cnt->high)
+		high = cnt->usage - cnt->high;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return high;
+}
+
 /**
  * res_counter_margin - calculate chargeable space of a counter
  * @cnt: the counter
@@ -193,6 +211,17 @@ static inline void res_counter_reset_failcnt(struct res_counter *cnt)
 	spin_unlock_irqrestore(&cnt->lock, flags);
 }
 
+static inline int res_counter_set_high(struct res_counter *cnt,
+				       unsigned long long high)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->high = high;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
 static inline int res_counter_set_limit(struct res_counter *cnt,
 					unsigned long long limit)
 {
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index e791130f85a7..26a08be49a3d 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -17,6 +17,7 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
 {
 	spin_lock_init(&counter->lock);
+	counter->high = RES_COUNTER_MAX;
 	counter->limit = RES_COUNTER_MAX;
 	counter->soft_limit = RES_COUNTER_MAX;
 	counter->parent = parent;
@@ -130,6 +131,8 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->usage;
 	case RES_MAX_USAGE:
 		return &counter->max_usage;
+	case RES_HIGH:
+		return &counter->high;
 	case RES_LIMIT:
 		return &counter->limit;
 	case RES_FAILCNT:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4146c0f47ba2..81627387fbd7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2481,8 +2481,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup *mem_over_limit;
-	struct res_counter *fail_res;
 	unsigned long nr_reclaimed;
+	struct res_counter *res;
 	unsigned long long size;
 	bool may_swap = true;
 	bool drained = false;
@@ -2493,16 +2493,16 @@ retry:
 		goto done;
 
 	size = batch * PAGE_SIZE;
-	if (!res_counter_charge(&memcg->res, size, &fail_res)) {
+	if (!res_counter_charge(&memcg->res, size, &res)) {
 		if (!do_swap_account)
 			goto done_restock;
-		if (!res_counter_charge(&memcg->memsw, size, &fail_res))
+		if (!res_counter_charge(&memcg->memsw, size, &res))
 			goto done_restock;
 		res_counter_uncharge(&memcg->res, size);
-		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
+		mem_over_limit = mem_cgroup_from_res_counter(res, memsw);
 		may_swap = false;
 	} else
-		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+		mem_over_limit = mem_cgroup_from_res_counter(res, res);
 
 	if (batch > nr_pages) {
 		batch = nr_pages;
@@ -2579,6 +2579,21 @@ bypass:
 done_restock:
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
+
+	res = &memcg->res;
+	while (res) {
+		unsigned long long high = res_counter_high(res);
+
+		if (high) {
+			unsigned long high_pages = high >> PAGE_SHIFT;
+			struct mem_cgroup *memcg;
+
+			memcg = mem_cgroup_from_res_counter(res, res);
+			try_to_free_mem_cgroup_pages(memcg, high_pages,
+						     gfp_mask, true);
+		}
+		res = res->parent;
+	}
 done:
 	return ret;
 }
@@ -5141,7 +5156,7 @@ out_kfree:
 	return ret;
 }
 
-static struct cftype mem_cgroup_files[] = {
+static struct cftype mem_cgroup_legacy_files[] = {
 	{
 		.name = "usage_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
@@ -5250,7 +5265,7 @@ static struct cftype mem_cgroup_files[] = {
 };
 
 #ifdef CONFIG_MEMCG_SWAP
-static struct cftype memsw_cgroup_files[] = {
+static struct cftype memsw_cgroup_legacy_files[] = {
 	{
 		.name = "memsw.usage_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
@@ -6195,6 +6210,61 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css)
 		mem_cgroup_from_css(root_css)->use_hierarchy = true;
 }
 
+static u64 memory_current_read(struct cgroup_subsys_state *css,
+			       struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return res_counter_read_u64(&memcg->res, RES_USAGE);
+}
+
+static u64 memory_high_read(struct cgroup_subsys_state *css,
+			    struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return res_counter_read_u64(&memcg->res, RES_HIGH);
+}
+
+static ssize_t memory_high_write(struct kernfs_open_file *of,
+				 char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	u64 high;
+	int ret;
+
+	if (mem_cgroup_is_root(memcg))
+		return -EINVAL;
+
+	buf = strim(buf);
+	ret = res_counter_memparse_write_strategy(buf, &high);
+	if (ret)
+		return ret;
+
+	ret = res_counter_set_high(&memcg->res, high);
+	if (ret)
+		return ret;
+
+	high = res_counter_high(&memcg->res);
+	if (high)
+		try_to_free_mem_cgroup_pages(memcg, high >> PAGE_SHIFT,
+					     GFP_KERNEL, true);
+
+	return nbytes;
+}
+
+static struct cftype memory_files[] = {
+	{
+		.name = "current",
+		.read_u64 = memory_current_read,
+	},
+	{
+		.name = "high",
+		.read_u64 = memory_high_read,
+		.write = memory_high_write,
+	},
+};
+
 struct cgroup_subsys memory_cgrp_subsys = {
 	.css_alloc = mem_cgroup_css_alloc,
 	.css_online = mem_cgroup_css_online,
@@ -6205,7 +6275,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.attach = mem_cgroup_move_task,
 	.bind = mem_cgroup_bind,
-	.legacy_cftypes = mem_cgroup_files,
+	.dfl_cftypes = memory_files,
+	.legacy_cftypes = mem_cgroup_legacy_files,
 	.early_init = 0,
 };
 
@@ -6223,7 +6294,7 @@ __setup("swapaccount=", enable_swap_account);
 static void __init memsw_file_init(void)
 {
 	WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys,
-					  memsw_cgroup_files));
+					  memsw_cgroup_legacy_files));
 }
 
 static void __init enable_swap_cgroup(void)
-- 
2.0.3