[PATCHSET] memcg: improve high limit behavior and always enable kmemcg on dfl hier

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCHSET] memcg: improve high limit behavior and always enable kmemcg on dfl hier
@ 2015-08-28 15:25 ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw)
  To: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg

Hello,

This patchset contains the following four patches.

 0001-memcg-fix-over-high-reclaim-amount.patch
 0002-memcg-flatten-task_struct-memcg_oom.patch
 0003-memcg-punt-high-overage-reclaim-to-return-to-userlan.patch
 0004-memcg-always-enable-kmemcg-on-the-default-hierarchy.patch

0001-0002 are simple fix and prep patches.  0003 makes memcg alwyas
punt direct reclaim of high limit overages to return-to-user path.
0004 always enables kmemcg on the default hierarchy.

This patchset is on top of next/akpm 1f3bd508da15
("drivers/w1/w1_int.c: call put_device if device_register fails") and
also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-memcg-dfl

diffstat follows.  Thanks.

 include/linux/memcontrol.h |   16 +++++---
 include/linux/sched.h      |   14 +++----
 include/linux/tracehook.h  |    3 +
 mm/memcontrol.c            |   88 +++++++++++++++++++++++++++++++++++----------
 4 files changed, 91 insertions(+), 30 deletions(-)

--
tejun

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCHSET] memcg: improve high limit behavior and always enable kmemcg on dfl hier
@ 2015-08-28 15:25 ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw)
  To: hannes, mhocko; +Cc: cgroups, linux-mm, vdavydov, kernel-team

Hello,

This patchset contains the following four patches.

 0001-memcg-fix-over-high-reclaim-amount.patch
 0002-memcg-flatten-task_struct-memcg_oom.patch
 0003-memcg-punt-high-overage-reclaim-to-return-to-userlan.patch
 0004-memcg-always-enable-kmemcg-on-the-default-hierarchy.patch

0001-0002 are simple fix and prep patches.  0003 makes memcg alwyas
punt direct reclaim of high limit overages to return-to-user path.
0004 always enables kmemcg on the default hierarchy.

This patchset is on top of next/akpm 1f3bd508da15
("drivers/w1/w1_int.c: call put_device if device_register fails") and
also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-memcg-dfl

diffstat follows.  Thanks.

 include/linux/memcontrol.h |   16 +++++---
 include/linux/sched.h      |   14 +++----
 include/linux/tracehook.h  |    3 +
 mm/memcontrol.c            |   88 +++++++++++++++++++++++++++++++++++----------
 4 files changed, 91 insertions(+), 30 deletions(-)

--
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

[parent not found: <1440775530-18630-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]

* [PATCH 1/4] memcg: fix over-high reclaim amount
  2015-08-28 15:25 ` Tejun Heo
@ 2015-08-28 15:25     ` Tejun Heo
  -1 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw)
  To: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg,
	Tejun Heo

When memory usage is over the high limit, try_charge() performs direct
reclaim; however, it uses the current charging amount @nr_pages as the
reclamation target which is incorrect as we want to reclaim down to
the high limit.  In practice, this doesn't matter all that much
because the minimum target pages that try_to_free_mem_cgroup_pages()
uses is SWAP_CLUSTER_MAX which is rather large.

Fix it by setting the target number of pages to the difference between
the current usage and the high limit.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 mm/memcontrol.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aacc767..18ecf75 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2078,10 +2078,13 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * make the charging task trim their excess contribution.
 	 */
 	do {
-		if (page_counter_read(&memcg->memory) <= memcg->high)
+		unsigned long usage = page_counter_read(&memcg->memory);
+		unsigned long high = ACCESS_ONCE(memcg->high);
+
+		if (usage <= high)
 			continue;
 		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
-		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+		try_to_free_mem_cgroup_pages(memcg, high - usage, gfp_mask, true);
 	} while ((memcg = parent_mem_cgroup(memcg)));
 done:
 	return ret;
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 1/4] memcg: fix over-high reclaim amount
@ 2015-08-28 15:25     ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw)
  To: hannes, mhocko; +Cc: cgroups, linux-mm, vdavydov, kernel-team, Tejun Heo

When memory usage is over the high limit, try_charge() performs direct
reclaim; however, it uses the current charging amount @nr_pages as the
reclamation target which is incorrect as we want to reclaim down to
the high limit.  In practice, this doesn't matter all that much
because the minimum target pages that try_to_free_mem_cgroup_pages()
uses is SWAP_CLUSTER_MAX which is rather large.

Fix it by setting the target number of pages to the difference between
the current usage and the high limit.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 mm/memcontrol.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aacc767..18ecf75 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2078,10 +2078,13 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * make the charging task trim their excess contribution.
 	 */
 	do {
-		if (page_counter_read(&memcg->memory) <= memcg->high)
+		unsigned long usage = page_counter_read(&memcg->memory);
+		unsigned long high = ACCESS_ONCE(memcg->high);
+
+		if (usage <= high)
 			continue;
 		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
-		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+		try_to_free_mem_cgroup_pages(memcg, high - usage, gfp_mask, true);
 	} while ((memcg = parent_mem_cgroup(memcg)));
 done:
 	return ret;
-- 
2.4.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH 1/4] memcg: fix over-high reclaim amount
  2015-08-28 15:25     ` Tejun Heo
  (?)
@ 2015-08-28 17:06     ` Michal Hocko
       [not found]       ` <20150828170612.GA21463-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  -1 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2015-08-28 17:06 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, cgroups, linux-mm, vdavydov, kernel-team

On Fri 28-08-15 11:25:27, Tejun Heo wrote:
> When memory usage is over the high limit, try_charge() performs direct
> reclaim; however, it uses the current charging amount @nr_pages as the
> reclamation target which is incorrect as we want to reclaim down to
> the high limit.  In practice, this doesn't matter all that much
> because the minimum target pages that try_to_free_mem_cgroup_pages()
> uses is SWAP_CLUSTER_MAX which is rather large.
> 
> Fix it by setting the target number of pages to the difference between
> the current usage and the high limit.

I do not think this a better behavior. If you have parallel charges to
the same memcg then you can easilly over-reclaim  because everybody
will reclaim the maximum rather than its contribution.

Sure we can fail to reclaim the target and slowly grow over high limit
but that is to be expected. This is not the max limit which cannot be
breached and external memory pressure/reclaim is there to mitigate that.

> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  mm/memcontrol.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index aacc767..18ecf75 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2078,10 +2078,13 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * make the charging task trim their excess contribution.
>  	 */
>  	do {
> -		if (page_counter_read(&memcg->memory) <= memcg->high)
> +		unsigned long usage = page_counter_read(&memcg->memory);
> +		unsigned long high = ACCESS_ONCE(memcg->high);
> +
> +		if (usage <= high)
>  			continue;
>  		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
> -		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
> +		try_to_free_mem_cgroup_pages(memcg, high - usage, gfp_mask, true);
>  	} while ((memcg = parent_mem_cgroup(memcg)));
>  done:
>  	return ret;
> -- 
> 2.4.3

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

[parent not found: <20150828170612.GA21463-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]

* Re: [PATCH 1/4] memcg: fix over-high reclaim amount
  2015-08-28 17:06     ` Michal Hocko
@ 2015-08-28 18:32           ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 18:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ,
	kernel-team-b10kYP2dOMg

Hello,

On Fri, Aug 28, 2015 at 07:06:13PM +0200, Michal Hocko wrote:
> I do not think this a better behavior. If you have parallel charges to
> the same memcg then you can easilly over-reclaim  because everybody
> will reclaim the maximum rather than its contribution.
> 
> Sure we can fail to reclaim the target and slowly grow over high limit
> but that is to be expected. This is not the max limit which cannot be
> breached and external memory pressure/reclaim is there to mitigate that.

Ah, I see, yeah, over-reclaim can happen.  How about just wrapping the
over-high reclaim with a per-memcg mutex?  Do we gain anything by
putting multiple tasks into the reclaim path?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 1/4] memcg: fix over-high reclaim amount
@ 2015-08-28 18:32           ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 18:32 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, cgroups, linux-mm, vdavydov, kernel-team

Hello,

On Fri, Aug 28, 2015 at 07:06:13PM +0200, Michal Hocko wrote:
> I do not think this a better behavior. If you have parallel charges to
> the same memcg then you can easilly over-reclaim  because everybody
> will reclaim the maximum rather than its contribution.
> 
> Sure we can fail to reclaim the target and slowly grow over high limit
> but that is to be expected. This is not the max limit which cannot be
> breached and external memory pressure/reclaim is there to mitigate that.

Ah, I see, yeah, over-reclaim can happen.  How about just wrapping the
over-high reclaim with a per-memcg mutex?  Do we gain anything by
putting multiple tasks into the reclaim path?

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

[parent not found: <20150828183209.GA9423-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>]

* Re: [PATCH 1/4] memcg: fix over-high reclaim amount
  2015-08-28 18:32           ` Tejun Heo
@ 2015-08-31  7:51               ` Michal Hocko
  -1 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-08-31  7:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ,
	kernel-team-b10kYP2dOMg

On Fri 28-08-15 14:32:09, Tejun Heo wrote:
> Hello,
> 
> On Fri, Aug 28, 2015 at 07:06:13PM +0200, Michal Hocko wrote:
> > I do not think this a better behavior. If you have parallel charges to
> > the same memcg then you can easilly over-reclaim  because everybody
> > will reclaim the maximum rather than its contribution.
> > 
> > Sure we can fail to reclaim the target and slowly grow over high limit
> > but that is to be expected. This is not the max limit which cannot be
> > breached and external memory pressure/reclaim is there to mitigate that.
> 
> Ah, I see, yeah, over-reclaim can happen.  How about just wrapping the
> over-high reclaim with a per-memcg mutex?  Do we gain anything by
> putting multiple tasks into the reclaim path?

The overall reclaim throughput will be higher with the parallel reclaim.
Threads might still get synchronized on the zone lru lock but this is only
for isolating them from the LRU. In a larger hierarchies this even might
not be the case because the hierarchy iterator tries to spread the reclaim
over different memcgs.

So the per-memcg mutex would solve the potential over-reclaim but it
will restrain the reclaim activity unnecessarily. Why is per-contribution
reclaim such a big deal in the first place? If there are runaways
allocation requests like GFP_NOWAIT then we should look after those. And
I would argue that your delayed reclaim idea is a great fit for that. We
just should track how many pages were charged over high limit in the
process context and reclaim that amount on the way out from the kernel.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 1/4] memcg: fix over-high reclaim amount
@ 2015-08-31  7:51               ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-08-31  7:51 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, cgroups, linux-mm, vdavydov, kernel-team

On Fri 28-08-15 14:32:09, Tejun Heo wrote:
> Hello,
> 
> On Fri, Aug 28, 2015 at 07:06:13PM +0200, Michal Hocko wrote:
> > I do not think this a better behavior. If you have parallel charges to
> > the same memcg then you can easilly over-reclaim  because everybody
> > will reclaim the maximum rather than its contribution.
> > 
> > Sure we can fail to reclaim the target and slowly grow over high limit
> > but that is to be expected. This is not the max limit which cannot be
> > breached and external memory pressure/reclaim is there to mitigate that.
> 
> Ah, I see, yeah, over-reclaim can happen.  How about just wrapping the
> over-high reclaim with a per-memcg mutex?  Do we gain anything by
> putting multiple tasks into the reclaim path?

The overall reclaim throughput will be higher with the parallel reclaim.
Threads might still get synchronized on the zone lru lock but this is only
for isolating them from the LRU. In a larger hierarchies this even might
not be the case because the hierarchy iterator tries to spread the reclaim
over different memcgs.

So the per-memcg mutex would solve the potential over-reclaim but it
will restrain the reclaim activity unnecessarily. Why is per-contribution
reclaim such a big deal in the first place? If there are runaways
allocation requests like GFP_NOWAIT then we should look after those. And
I would argue that your delayed reclaim idea is a great fit for that. We
just should track how many pages were charged over high limit in the
process context and reclaim that amount on the way out from the kernel.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

[parent not found: <20150831075133.GA29723-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]

* Re: [PATCH 1/4] memcg: fix over-high reclaim amount
  2015-08-31  7:51               ` Michal Hocko
@ 2015-08-31 13:38                   ` Tejun Heo
  -1 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-31 13:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ,
	kernel-team-b10kYP2dOMg

Hello, Michal.

On Mon, Aug 31, 2015 at 09:51:33AM +0200, Michal Hocko wrote:
> The overall reclaim throughput will be higher with the parallel reclaim.

Is reclaim throughput as determined by CPU cycle bandwidth a
meaningful metric?  I'm having a bit of trouble imagining that this
actually would matter especially given that writeback is single
threaded per bdi_writeback.

Shoving in a large number of threads into the same path which walks
the same data structures when there's no clear benefit doesn't usually
end up buying anything.  Cachelines get thrown back and forth, locks
get contended, CPU cycles which could have been used for a lot of more
useful things get wasted and IO pattern becomes random.  We can still
choose to do that but I think we should have explicit justifications
(e.g. it really harms scalability otherwise).

> Threads might still get synchronized on the zone lru lock but this is only
> for isolating them from the LRU. In a larger hierarchies this even might
> not be the case because the hierarchy iterator tries to spread the reclaim
> over different memcgs.
> 
> So the per-memcg mutex would solve the potential over-reclaim but it
> will restrain the reclaim activity unnecessarily. Why is per-contribution
> reclaim such a big deal in the first place? If there are runaways
> allocation requests like GFP_NOWAIT then we should look after those. And
> I would argue that your delayed reclaim idea is a great fit for that. We
> just should track how many pages were charged over high limit in the
> process context and reclaim that amount on the way out from the kernel.

Per-contribution reclaim is not necessarily a "big deal" but is kinda
mushy on the edges which get more pronounced with async reclaim.

* It still can over reclaim to a considerable extent.  The reclaim
  path uses mean reclaim size of 1M and when the high limit is used as
  the main mechanism for reclaim rather than global limit, many
  threads performing simultaneous 1M reclaims will happen.

* We need to keep track of an additional state.  What if a task
  performs multiple NOWAIT try_charge()'s?  I guess we should add up
  those numbers.

* But even if we do that, what does that actually mean?  These numbers
  are arbitrary in nature.  A thread may have just performed
  high-order allocations back to back at the point where it steps over
  the limit with an order-0 allocation or maybe a different thread
  which wasn't consuming much memory can hit it right after.
  @nr_pages is a convenient number that we can use on the spot which
  will make the consumption converge on the limit but I'm not sure
  this is a number that we should keep track of.

Also, having a central point of control means that we can actually
define policies there - e.g. if the overage is less than 10%, let
tasks through as long as there's at least one reclaiming.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 1/4] memcg: fix over-high reclaim amount
@ 2015-08-31 13:38                   ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-31 13:38 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, cgroups, linux-mm, vdavydov, kernel-team

Hello, Michal.

On Mon, Aug 31, 2015 at 09:51:33AM +0200, Michal Hocko wrote:
> The overall reclaim throughput will be higher with the parallel reclaim.

Is reclaim throughput as determined by CPU cycle bandwidth a
meaningful metric?  I'm having a bit of trouble imagining that this
actually would matter especially given that writeback is single
threaded per bdi_writeback.

Shoving in a large number of threads into the same path which walks
the same data structures when there's no clear benefit doesn't usually
end up buying anything.  Cachelines get thrown back and forth, locks
get contended, CPU cycles which could have been used for a lot of more
useful things get wasted and IO pattern becomes random.  We can still
choose to do that but I think we should have explicit justifications
(e.g. it really harms scalability otherwise).

> Threads might still get synchronized on the zone lru lock but this is only
> for isolating them from the LRU. In a larger hierarchies this even might
> not be the case because the hierarchy iterator tries to spread the reclaim
> over different memcgs.
> 
> So the per-memcg mutex would solve the potential over-reclaim but it
> will restrain the reclaim activity unnecessarily. Why is per-contribution
> reclaim such a big deal in the first place? If there are runaways
> allocation requests like GFP_NOWAIT then we should look after those. And
> I would argue that your delayed reclaim idea is a great fit for that. We
> just should track how many pages were charged over high limit in the
> process context and reclaim that amount on the way out from the kernel.

Per-contribution reclaim is not necessarily a "big deal" but is kinda
mushy on the edges which get more pronounced with async reclaim.

* It still can over reclaim to a considerable extent.  The reclaim
  path uses mean reclaim size of 1M and when the high limit is used as
  the main mechanism for reclaim rather than global limit, many
  threads performing simultaneous 1M reclaims will happen.

* We need to keep track of an additional state.  What if a task
  performs multiple NOWAIT try_charge()'s?  I guess we should add up
  those numbers.

* But even if we do that, what does that actually mean?  These numbers
  are arbitrary in nature.  A thread may have just performed
  high-order allocations back to back at the point where it steps over
  the limit with an order-0 allocation or maybe a different thread
  which wasn't consuming much memory can hit it right after.
  @nr_pages is a convenient number that we can use on the spot which
  will make the consumption converge on the limit but I'm not sure
  this is a number that we should keep track of.

Also, having a central point of control means that we can actually
define policies there - e.g. if the overage is less than 10%, let
tasks through as long as there's at least one reclaiming.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 1/4] memcg: fix over-high reclaim amount
  2015-08-31 13:38                   ` Tejun Heo
  (?)
@ 2015-09-01 12:51                   ` Michal Hocko
       [not found]                     ` <20150901125149.GD8810-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  -1 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2015-09-01 12:51 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, cgroups, linux-mm, vdavydov, kernel-team

On Mon 31-08-15 09:38:40, Tejun Heo wrote:
> Hello, Michal.
> 
> On Mon, Aug 31, 2015 at 09:51:33AM +0200, Michal Hocko wrote:
> > The overall reclaim throughput will be higher with the parallel reclaim.
> 
> Is reclaim throughput as determined by CPU cycle bandwidth a
> meaningful metric? 

Well, considering it has a direct effect on the latency I would consider
it quite meaningful.

> I'm having a bit of trouble imagining that this
> actually would matter especially given that writeback is single
> threaded per bdi_writeback.

Sure, if the LRU contains a lot of dirty pages then the writeback will be
a bottleneck. But LRUs are quite often full of the clean pagecache pages
which can be reclaimed quickly and efficiently.

> Shoving in a large number of threads into the same path which walks
> the same data structures when there's no clear benefit doesn't usually
> end up buying anything. 

I agree that a certain level of throttling is reasonable. We are doing
some of that in the lower layers of the reclaim. It is true that some of
that throttling is specific to the global reclaim and some heuristics
might be applicable on the try_charge level as well but we should be
careful here. Throttling has been quite tricky and there were some
issues where it led to unexpected stalls in the past.

> Cachelines get thrown back and forth, locks
> get contended, CPU cycles which could have been used for a lot of more
> useful things get wasted and IO pattern becomes random.  We can still
> choose to do that but I think we should have explicit justifications
> (e.g. it really harms scalability otherwise).

I do not remember heavy contention on the lru lock and we are not doing
an IO from the direct reclaim (other than swap) so a random IO pattern
shouldn't be an issue as well.

> > Threads might still get synchronized on the zone lru lock but this is only
> > for isolating them from the LRU. In a larger hierarchies this even might
> > not be the case because the hierarchy iterator tries to spread the reclaim
> > over different memcgs.
> > 
> > So the per-memcg mutex would solve the potential over-reclaim but it
> > will restrain the reclaim activity unnecessarily. Why is per-contribution
> > reclaim such a big deal in the first place? If there are runaways
> > allocation requests like GFP_NOWAIT then we should look after those. And
> > I would argue that your delayed reclaim idea is a great fit for that. We
> > just should track how many pages were charged over high limit in the
> > process context and reclaim that amount on the way out from the kernel.
> 
> Per-contribution reclaim is not necessarily a "big deal" but is kinda
> mushy on the edges which get more pronounced with async reclaim.

Yes, I am not saying it is a perfect solution. It has its issues as well.
We have been doing this for ages though and there should be really good
reasons with numbers demonstrating improvements to change it to
something else.

> * It still can over reclaim to a considerable extent.  The reclaim
>   path uses mean reclaim size of 1M and when the high limit is used as
>   the main mechanism for reclaim rather than global limit, many
>   threads performing simultaneous 1M reclaims will happen.

Yes this is something that has been changed recently and I am not sure
the new SWAP_CLUSTER_MAX value fits well into the current memcg direct
reclaim implementation.
I didn't get to measure the effect yet, though, but maybe we will have
to go back to 32 or something small for the memcg reclaim. This is just
an implementation detail, though.

> * We need to keep track of an additional state.  What if a task
>   performs multiple NOWAIT try_charge()'s?  I guess we should add up
>   those numbers.

Yes, that was the idea. Just accumulate nr_pages attempts when the
current > high and then attempt to reclaim them on the way out as you
were suggesting.

> * But even if we do that, what does that actually mean?  These numbers
>   are arbitrary in nature.

nr_pages at least reflects the request size so we, at least
theoretically, throttle larger consumers more.

>   A thread may have just performed
>   high-order allocations back to back at the point where it steps over
>   the limit with an order-0 allocation or maybe a different thread
>   which wasn't consuming much memory can hit it right after.
>   @nr_pages is a convenient number that we can use on the spot which
>   will make the consumption converge on the limit but I'm not sure
>   this is a number that we should keep track of.

I agree that this is an inherently racy environment. It really depends
on who hits the limit and how good are reclaimers at doing their work
when others piggy back on their work.
We have the background reclaim to reduce that effect for the global
case.  Something similar have been discussed in the past for memcg as
well but it hits its own issues as it has to scale with the potentially
large number of memcgs.

> Also, having a central point of control means that we can actually
> define policies there - e.g. if the overage is less than 10%, let
> tasks through as long as there's at least one reclaiming.

I am not sure whether throttling at this level would be more beneficial
than doing that down at the reclaim paths.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

[parent not found: <20150901125149.GD8810-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]

* Re: [PATCH 1/4] memcg: fix over-high reclaim amount
  2015-09-01 12:51                   ` Michal Hocko
@ 2015-09-01 18:33                         ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-09-01 18:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ,
	kernel-team-b10kYP2dOMg

Hello,

On Tue, Sep 01, 2015 at 02:51:50PM +0200, Michal Hocko wrote:
> > Is reclaim throughput as determined by CPU cycle bandwidth a
> > meaningful metric? 
> 
> Well, considering it has a direct effect on the latency I would consider
> it quite meaningful.
>
> > I'm having a bit of trouble imagining that this
> > actually would matter especially given that writeback is single
> > threaded per bdi_writeback.
> 
> Sure, if the LRU contains a lot of dirty pages then the writeback will be
> a bottleneck. But LRUs are quite often full of the clean pagecache pages
> which can be reclaimed quickly and efficiently.

I see.  Hmmm... I can imagine the scheduling latencies from
synchronization being a factor.  Alright, if we decide to do this
return-path reclaiming, I'll update the patch to accumulate nr_pages.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 1/4] memcg: fix over-high reclaim amount
@ 2015-09-01 18:33                         ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-09-01 18:33 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, cgroups, linux-mm, vdavydov, kernel-team

Hello,

On Tue, Sep 01, 2015 at 02:51:50PM +0200, Michal Hocko wrote:
> > Is reclaim throughput as determined by CPU cycle bandwidth a
> > meaningful metric? 
> 
> Well, considering it has a direct effect on the latency I would consider
> it quite meaningful.
>
> > I'm having a bit of trouble imagining that this
> > actually would matter especially given that writeback is single
> > threaded per bdi_writeback.
> 
> Sure, if the LRU contains a lot of dirty pages then the writeback will be
> a bottleneck. But LRUs are quite often full of the clean pagecache pages
> which can be reclaimed quickly and efficiently.

I see.  Hmmm... I can imagine the scheduling latencies from
synchronization being a factor.  Alright, if we decide to do this
return-path reclaiming, I'll update the patch to accumulate nr_pages.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 2/4] memcg: flatten task_struct->memcg_oom
  2015-08-28 15:25 ` Tejun Heo
@ 2015-08-28 15:25     ` Tejun Heo
  -1 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw)
  To: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg,
	Tejun Heo

task_struct->memcg_oom is a sub-struct containing fields which are
used for async memcg oom handling.  Most task_struct fields aren't
packaged this way and it can lead to unnecessary alignment paddings.
This patch flattens it.

* task.memcg_oom.memcg		-> task.memcg_in_oom
* task.memcg_oom.gfp_mask	-> task.memcg_oom_gfp_mask
* task.memcg_oom.order		-> task.memcg_oom_order
* task.memcg_oom.may_oom	-> task.memcg_may_oom

In addition, task.memcg_may_oom is relocated to where other bitfields
are which reduces the size of task_struct.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 include/linux/memcontrol.h | 10 +++++-----
 include/linux/sched.h      | 13 ++++++-------
 mm/memcontrol.c            | 16 ++++++++--------
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ad800e6..3d28656 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -407,19 +407,19 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 
 static inline void mem_cgroup_oom_enable(void)
 {
-	WARN_ON(current->memcg_oom.may_oom);
-	current->memcg_oom.may_oom = 1;
+	WARN_ON(current->memcg_may_oom);
+	current->memcg_may_oom = 1;
 }
 
 static inline void mem_cgroup_oom_disable(void)
 {
-	WARN_ON(!current->memcg_oom.may_oom);
-	current->memcg_oom.may_oom = 0;
+	WARN_ON(!current->memcg_may_oom);
+	current->memcg_may_oom = 0;
 }
 
 static inline bool task_in_memcg_oom(struct task_struct *p)
 {
-	return p->memcg_oom.memcg;
+	return p->memcg_in_oom;
 }
 
 bool mem_cgroup_oom_synchronize(bool wait);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a4ab9da..ef73b54 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1451,7 +1451,9 @@ struct task_struct {
 	unsigned sched_reset_on_fork:1;
 	unsigned sched_contributes_to_load:1;
 	unsigned sched_migrated:1;
-
+#ifdef CONFIG_MEMCG
+	unsigned memcg_may_oom:1;
+#endif
 #ifdef CONFIG_MEMCG_KMEM
 	unsigned memcg_kmem_skip_account:1;
 #endif
@@ -1782,12 +1784,9 @@ struct task_struct {
 	unsigned long trace_recursion;
 #endif /* CONFIG_TRACING */
 #ifdef CONFIG_MEMCG
-	struct memcg_oom_info {
-		struct mem_cgroup *memcg;
-		gfp_t gfp_mask;
-		int order;
-		unsigned int may_oom:1;
-	} memcg_oom;
+	struct mem_cgroup *memcg_in_oom;
+	gfp_t memcg_oom_gfp_mask;
+	int memcg_oom_order;
 #endif
 #ifdef CONFIG_UPROBES
 	struct uprobe_task *utask;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 18ecf75..74abb31 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1652,7 +1652,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
 
 static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 {
-	if (!current->memcg_oom.may_oom)
+	if (!current->memcg_may_oom)
 		return;
 	/*
 	 * We are in the middle of the charge context here, so we
@@ -1669,9 +1669,9 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 	 * and when we know whether the fault was overall successful.
 	 */
 	css_get(&memcg->css);
-	current->memcg_oom.memcg = memcg;
-	current->memcg_oom.gfp_mask = mask;
-	current->memcg_oom.order = order;
+	current->memcg_in_oom = memcg;
+	current->memcg_oom_gfp_mask = mask;
+	current->memcg_oom_order = order;
 }
 
 /**
@@ -1693,7 +1693,7 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
  */
 bool mem_cgroup_oom_synchronize(bool handle)
 {
-	struct mem_cgroup *memcg = current->memcg_oom.memcg;
+	struct mem_cgroup *memcg = current->memcg_in_oom;
 	struct oom_wait_info owait;
 	bool locked;
 
@@ -1721,8 +1721,8 @@ bool mem_cgroup_oom_synchronize(bool handle)
 	if (locked && !memcg->oom_kill_disable) {
 		mem_cgroup_unmark_under_oom(memcg);
 		finish_wait(&memcg_oom_waitq, &owait.wait);
-		mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask,
-					 current->memcg_oom.order);
+		mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
+					 current->memcg_oom_order);
 	} else {
 		schedule();
 		mem_cgroup_unmark_under_oom(memcg);
@@ -1739,7 +1739,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
 		memcg_oom_recover(memcg);
 	}
 cleanup:
-	current->memcg_oom.memcg = NULL;
+	current->memcg_in_oom = NULL;
 	css_put(&memcg->css);
 	return true;
 }
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 2/4] memcg: flatten task_struct->memcg_oom
@ 2015-08-28 15:25     ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw)
  To: hannes, mhocko; +Cc: cgroups, linux-mm, vdavydov, kernel-team, Tejun Heo

task_struct->memcg_oom is a sub-struct containing fields which are
used for async memcg oom handling.  Most task_struct fields aren't
packaged this way and it can lead to unnecessary alignment paddings.
This patch flattens it.

* task.memcg_oom.memcg		-> task.memcg_in_oom
* task.memcg_oom.gfp_mask	-> task.memcg_oom_gfp_mask
* task.memcg_oom.order		-> task.memcg_oom_order
* task.memcg_oom.may_oom	-> task.memcg_may_oom

In addition, task.memcg_may_oom is relocated to where other bitfields
are which reduces the size of task_struct.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/memcontrol.h | 10 +++++-----
 include/linux/sched.h      | 13 ++++++-------
 mm/memcontrol.c            | 16 ++++++++--------
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ad800e6..3d28656 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -407,19 +407,19 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 
 static inline void mem_cgroup_oom_enable(void)
 {
-	WARN_ON(current->memcg_oom.may_oom);
-	current->memcg_oom.may_oom = 1;
+	WARN_ON(current->memcg_may_oom);
+	current->memcg_may_oom = 1;
 }
 
 static inline void mem_cgroup_oom_disable(void)
 {
-	WARN_ON(!current->memcg_oom.may_oom);
-	current->memcg_oom.may_oom = 0;
+	WARN_ON(!current->memcg_may_oom);
+	current->memcg_may_oom = 0;
 }
 
 static inline bool task_in_memcg_oom(struct task_struct *p)
 {
-	return p->memcg_oom.memcg;
+	return p->memcg_in_oom;
 }
 
 bool mem_cgroup_oom_synchronize(bool wait);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a4ab9da..ef73b54 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1451,7 +1451,9 @@ struct task_struct {
 	unsigned sched_reset_on_fork:1;
 	unsigned sched_contributes_to_load:1;
 	unsigned sched_migrated:1;
-
+#ifdef CONFIG_MEMCG
+	unsigned memcg_may_oom:1;
+#endif
 #ifdef CONFIG_MEMCG_KMEM
 	unsigned memcg_kmem_skip_account:1;
 #endif
@@ -1782,12 +1784,9 @@ struct task_struct {
 	unsigned long trace_recursion;
 #endif /* CONFIG_TRACING */
 #ifdef CONFIG_MEMCG
-	struct memcg_oom_info {
-		struct mem_cgroup *memcg;
-		gfp_t gfp_mask;
-		int order;
-		unsigned int may_oom:1;
-	} memcg_oom;
+	struct mem_cgroup *memcg_in_oom;
+	gfp_t memcg_oom_gfp_mask;
+	int memcg_oom_order;
 #endif
 #ifdef CONFIG_UPROBES
 	struct uprobe_task *utask;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 18ecf75..74abb31 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1652,7 +1652,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
 
 static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 {
-	if (!current->memcg_oom.may_oom)
+	if (!current->memcg_may_oom)
 		return;
 	/*
 	 * We are in the middle of the charge context here, so we
@@ -1669,9 +1669,9 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 	 * and when we know whether the fault was overall successful.
 	 */
 	css_get(&memcg->css);
-	current->memcg_oom.memcg = memcg;
-	current->memcg_oom.gfp_mask = mask;
-	current->memcg_oom.order = order;
+	current->memcg_in_oom = memcg;
+	current->memcg_oom_gfp_mask = mask;
+	current->memcg_oom_order = order;
 }
 
 /**
@@ -1693,7 +1693,7 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
  */
 bool mem_cgroup_oom_synchronize(bool handle)
 {
-	struct mem_cgroup *memcg = current->memcg_oom.memcg;
+	struct mem_cgroup *memcg = current->memcg_in_oom;
 	struct oom_wait_info owait;
 	bool locked;
 
@@ -1721,8 +1721,8 @@ bool mem_cgroup_oom_synchronize(bool handle)
 	if (locked && !memcg->oom_kill_disable) {
 		mem_cgroup_unmark_under_oom(memcg);
 		finish_wait(&memcg_oom_waitq, &owait.wait);
-		mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask,
-					 current->memcg_oom.order);
+		mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
+					 current->memcg_oom_order);
 	} else {
 		schedule();
 		mem_cgroup_unmark_under_oom(memcg);
@@ -1739,7 +1739,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
 		memcg_oom_recover(memcg);
 	}
 cleanup:
-	current->memcg_oom.memcg = NULL;
+	current->memcg_in_oom = NULL;
 	css_put(&memcg->css);
 	return true;
 }
-- 
2.4.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 66+ messages in thread

[parent not found: <1440775530-18630-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]

* Re: [PATCH 2/4] memcg: flatten task_struct->memcg_oom
  2015-08-28 15:25     ` Tejun Heo
@ 2015-08-28 17:11         ` Michal Hocko
  -1 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-08-28 17:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ,
	kernel-team-b10kYP2dOMg

On Fri 28-08-15 11:25:28, Tejun Heo wrote:
> task_struct->memcg_oom is a sub-struct containing fields which are
> used for async memcg oom handling.  Most task_struct fields aren't
> packaged this way and it can lead to unnecessary alignment paddings.
> This patch flattens it.
> 
> * task.memcg_oom.memcg		-> task.memcg_in_oom
> * task.memcg_oom.gfp_mask	-> task.memcg_oom_gfp_mask
> * task.memcg_oom.order		-> task.memcg_oom_order
> * task.memcg_oom.may_oom	-> task.memcg_may_oom
> 
> In addition, task.memcg_may_oom is relocated to where other bitfields
> are which reduces the size of task_struct.

OK we will save 8B AFAICS which probably doesn't make much different for
this huge structure. But we already have memcg_kmem_skip_account bit
field there so another one makes sense. That alone would be sufficient
to save those bytes. Regarding the struct, I do not have a strong
opinion. I do not mind removing it.
 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>

> ---
>  include/linux/memcontrol.h | 10 +++++-----
>  include/linux/sched.h      | 13 ++++++-------
>  mm/memcontrol.c            | 16 ++++++++--------
>  3 files changed, 19 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index ad800e6..3d28656 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -407,19 +407,19 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  
>  static inline void mem_cgroup_oom_enable(void)
>  {
> -	WARN_ON(current->memcg_oom.may_oom);
> -	current->memcg_oom.may_oom = 1;
> +	WARN_ON(current->memcg_may_oom);
> +	current->memcg_may_oom = 1;
>  }
>  
>  static inline void mem_cgroup_oom_disable(void)
>  {
> -	WARN_ON(!current->memcg_oom.may_oom);
> -	current->memcg_oom.may_oom = 0;
> +	WARN_ON(!current->memcg_may_oom);
> +	current->memcg_may_oom = 0;
>  }
>  
>  static inline bool task_in_memcg_oom(struct task_struct *p)
>  {
> -	return p->memcg_oom.memcg;
> +	return p->memcg_in_oom;
>  }
>  
>  bool mem_cgroup_oom_synchronize(bool wait);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a4ab9da..ef73b54 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1451,7 +1451,9 @@ struct task_struct {
>  	unsigned sched_reset_on_fork:1;
>  	unsigned sched_contributes_to_load:1;
>  	unsigned sched_migrated:1;
> -
> +#ifdef CONFIG_MEMCG
> +	unsigned memcg_may_oom:1;
> +#endif
>  #ifdef CONFIG_MEMCG_KMEM
>  	unsigned memcg_kmem_skip_account:1;
>  #endif
> @@ -1782,12 +1784,9 @@ struct task_struct {
>  	unsigned long trace_recursion;
>  #endif /* CONFIG_TRACING */
>  #ifdef CONFIG_MEMCG
> -	struct memcg_oom_info {
> -		struct mem_cgroup *memcg;
> -		gfp_t gfp_mask;
> -		int order;
> -		unsigned int may_oom:1;
> -	} memcg_oom;
> +	struct mem_cgroup *memcg_in_oom;
> +	gfp_t memcg_oom_gfp_mask;
> +	int memcg_oom_order;
>  #endif
>  #ifdef CONFIG_UPROBES
>  	struct uprobe_task *utask;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 18ecf75..74abb31 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1652,7 +1652,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
>  
>  static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  {
> -	if (!current->memcg_oom.may_oom)
> +	if (!current->memcg_may_oom)
>  		return;
>  	/*
>  	 * We are in the middle of the charge context here, so we
> @@ -1669,9 +1669,9 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  	 * and when we know whether the fault was overall successful.
>  	 */
>  	css_get(&memcg->css);
> -	current->memcg_oom.memcg = memcg;
> -	current->memcg_oom.gfp_mask = mask;
> -	current->memcg_oom.order = order;
> +	current->memcg_in_oom = memcg;
> +	current->memcg_oom_gfp_mask = mask;
> +	current->memcg_oom_order = order;
>  }
>  
>  /**
> @@ -1693,7 +1693,7 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>   */
>  bool mem_cgroup_oom_synchronize(bool handle)
>  {
> -	struct mem_cgroup *memcg = current->memcg_oom.memcg;
> +	struct mem_cgroup *memcg = current->memcg_in_oom;
>  	struct oom_wait_info owait;
>  	bool locked;
>  
> @@ -1721,8 +1721,8 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  	if (locked && !memcg->oom_kill_disable) {
>  		mem_cgroup_unmark_under_oom(memcg);
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
> -		mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask,
> -					 current->memcg_oom.order);
> +		mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
> +					 current->memcg_oom_order);
>  	} else {
>  		schedule();
>  		mem_cgroup_unmark_under_oom(memcg);
> @@ -1739,7 +1739,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  		memcg_oom_recover(memcg);
>  	}
>  cleanup:
> -	current->memcg_oom.memcg = NULL;
> +	current->memcg_in_oom = NULL;
>  	css_put(&memcg->css);
>  	return true;
>  }
> -- 
> 2.4.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/4] memcg: flatten task_struct->memcg_oom
@ 2015-08-28 17:11         ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-08-28 17:11 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, cgroups, linux-mm, vdavydov, kernel-team

On Fri 28-08-15 11:25:28, Tejun Heo wrote:
> task_struct->memcg_oom is a sub-struct containing fields which are
> used for async memcg oom handling.  Most task_struct fields aren't
> packaged this way and it can lead to unnecessary alignment paddings.
> This patch flattens it.
> 
> * task.memcg_oom.memcg		-> task.memcg_in_oom
> * task.memcg_oom.gfp_mask	-> task.memcg_oom_gfp_mask
> * task.memcg_oom.order		-> task.memcg_oom_order
> * task.memcg_oom.may_oom	-> task.memcg_may_oom
> 
> In addition, task.memcg_may_oom is relocated to where other bitfields
> are which reduces the size of task_struct.

OK we will save 8B AFAICS which probably doesn't make much different for
this huge structure. But we already have memcg_kmem_skip_account bit
field there so another one makes sense. That alone would be sufficient
to save those bytes. Regarding the struct, I do not have a strong
opinion. I do not mind removing it.
 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h | 10 +++++-----
>  include/linux/sched.h      | 13 ++++++-------
>  mm/memcontrol.c            | 16 ++++++++--------
>  3 files changed, 19 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index ad800e6..3d28656 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -407,19 +407,19 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  
>  static inline void mem_cgroup_oom_enable(void)
>  {
> -	WARN_ON(current->memcg_oom.may_oom);
> -	current->memcg_oom.may_oom = 1;
> +	WARN_ON(current->memcg_may_oom);
> +	current->memcg_may_oom = 1;
>  }
>  
>  static inline void mem_cgroup_oom_disable(void)
>  {
> -	WARN_ON(!current->memcg_oom.may_oom);
> -	current->memcg_oom.may_oom = 0;
> +	WARN_ON(!current->memcg_may_oom);
> +	current->memcg_may_oom = 0;
>  }
>  
>  static inline bool task_in_memcg_oom(struct task_struct *p)
>  {
> -	return p->memcg_oom.memcg;
> +	return p->memcg_in_oom;
>  }
>  
>  bool mem_cgroup_oom_synchronize(bool wait);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a4ab9da..ef73b54 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1451,7 +1451,9 @@ struct task_struct {
>  	unsigned sched_reset_on_fork:1;
>  	unsigned sched_contributes_to_load:1;
>  	unsigned sched_migrated:1;
> -
> +#ifdef CONFIG_MEMCG
> +	unsigned memcg_may_oom:1;
> +#endif
>  #ifdef CONFIG_MEMCG_KMEM
>  	unsigned memcg_kmem_skip_account:1;
>  #endif
> @@ -1782,12 +1784,9 @@ struct task_struct {
>  	unsigned long trace_recursion;
>  #endif /* CONFIG_TRACING */
>  #ifdef CONFIG_MEMCG
> -	struct memcg_oom_info {
> -		struct mem_cgroup *memcg;
> -		gfp_t gfp_mask;
> -		int order;
> -		unsigned int may_oom:1;
> -	} memcg_oom;
> +	struct mem_cgroup *memcg_in_oom;
> +	gfp_t memcg_oom_gfp_mask;
> +	int memcg_oom_order;
>  #endif
>  #ifdef CONFIG_UPROBES
>  	struct uprobe_task *utask;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 18ecf75..74abb31 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1652,7 +1652,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
>  
>  static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  {
> -	if (!current->memcg_oom.may_oom)
> +	if (!current->memcg_may_oom)
>  		return;
>  	/*
>  	 * We are in the middle of the charge context here, so we
> @@ -1669,9 +1669,9 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  	 * and when we know whether the fault was overall successful.
>  	 */
>  	css_get(&memcg->css);
> -	current->memcg_oom.memcg = memcg;
> -	current->memcg_oom.gfp_mask = mask;
> -	current->memcg_oom.order = order;
> +	current->memcg_in_oom = memcg;
> +	current->memcg_oom_gfp_mask = mask;
> +	current->memcg_oom_order = order;
>  }
>  
>  /**
> @@ -1693,7 +1693,7 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>   */
>  bool mem_cgroup_oom_synchronize(bool handle)
>  {
> -	struct mem_cgroup *memcg = current->memcg_oom.memcg;
> +	struct mem_cgroup *memcg = current->memcg_in_oom;
>  	struct oom_wait_info owait;
>  	bool locked;
>  
> @@ -1721,8 +1721,8 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  	if (locked && !memcg->oom_kill_disable) {
>  		mem_cgroup_unmark_under_oom(memcg);
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
> -		mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask,
> -					 current->memcg_oom.order);
> +		mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
> +					 current->memcg_oom_order);
>  	} else {
>  		schedule();
>  		mem_cgroup_unmark_under_oom(memcg);
> @@ -1739,7 +1739,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  		memcg_oom_recover(memcg);
>  	}
>  cleanup:
> -	current->memcg_oom.memcg = NULL;
> +	current->memcg_in_oom = NULL;
>  	css_put(&memcg->css);
>  	return true;
>  }
> -- 
> 2.4.3

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
  2015-08-28 15:25 ` Tejun Heo
@ 2015-08-28 15:25     ` Tejun Heo
  -1 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw)
  To: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg,
	Tejun Heo

Currently, try_charge() tries to reclaim memory directly when the high
limit is breached; however, this has a couple issues.

* try_charge() can be invoked from any in-kernel allocation site and
  reclaim path may use considerable amount of stack.  This can lead to
  stack overflows which are extremely difficult to reproduce.

* If the allocation doesn't have __GFP_WAIT, direct reclaim is
  skipped.  If a process performs only speculative allocations, it can
  blow way past the high limit.  This is actually easily reproducible
  by simply doing "find /".  VFS tries speculative !__GFP_WAIT
  allocations first, so as long as there's memory which can be
  consumed without blocking, it can keep allocating memory regardless
  of the high limit.

This patch makes try_charge() always punt the direct reclaim to the
return-to-userland path.  If try_charge() detects that high limit is
breached, it sets current->memcg_over_high to the offending memcg and
schedules execution of mem_cgroup_handle_over_high() which performs
the direct reclaim from the return-to-userland path.

As long as kernel doesn't have a run-away allocation spree, this
should provide enough protection while making kmemcg behave more
consistently.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 include/linux/memcontrol.h |  6 +++++
 include/linux/sched.h      |  1 +
 include/linux/tracehook.h  |  3 +++
 mm/memcontrol.c            | 66 +++++++++++++++++++++++++++++++++++++---------
 4 files changed, 64 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3d28656..8d345a7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -402,6 +402,8 @@ static inline int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
 	return inactive * inactive_ratio < active;
 }
 
+void mem_cgroup_handle_over_high(void);
+
 void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 				struct task_struct *p);
 
@@ -621,6 +623,10 @@ static inline void mem_cgroup_end_page_stat(struct mem_cgroup *memcg)
 {
 }
 
+static inline void mem_cgroup_handle_over_high(void)
+{
+}
+
 static inline void mem_cgroup_oom_enable(void)
 {
 }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ef73b54..c76b71d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1785,6 +1785,7 @@ struct task_struct {
 #endif /* CONFIG_TRACING */
 #ifdef CONFIG_MEMCG
 	struct mem_cgroup *memcg_in_oom;
+	struct mem_cgroup *memcg_over_high; /* reclaim on returning to user */
 	gfp_t memcg_oom_gfp_mask;
 	int memcg_oom_order;
 #endif
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 84d4972..26c1521 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -50,6 +50,7 @@
 #include <linux/ptrace.h>
 #include <linux/security.h>
 #include <linux/task_work.h>
+#include <linux/memcontrol.h>
 struct linux_binprm;
 
 /*
@@ -188,6 +189,8 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
 	smp_mb__after_atomic();
 	if (unlikely(current->task_works))
 		task_work_run();
+
+	mem_cgroup_handle_over_high();
 }
 
 #endif	/* <linux/tracehook.h> */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74abb31..c94b686 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -62,6 +62,7 @@
 #include <linux/oom.h>
 #include <linux/lockdep.h>
 #include <linux/file.h>
+#include <linux/tracehook.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -1963,6 +1964,33 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
+/*
+ * Scheduled by try_charge() to be executed from the userland return path
+ * and reclaims memory over the high limit.
+ */
+void mem_cgroup_handle_over_high(void)
+{
+	struct mem_cgroup *memcg = current->memcg_over_high;
+
+	if (likely(!memcg))
+		return;
+
+	do {
+		unsigned long usage = page_counter_read(&memcg->memory);
+		unsigned long high = ACCESS_ONCE(memcg->high);
+
+		if (usage <= high)
+			continue;
+
+		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
+		try_to_free_mem_cgroup_pages(memcg, usage - high,
+					     GFP_KERNEL, true);
+	} while ((memcg = parent_mem_cgroup(memcg)));
+
+	css_put(&current->memcg_over_high->css);
+	current->memcg_over_high = NULL;
+}
+
 static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		      unsigned int nr_pages)
 {
@@ -2071,21 +2099,27 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	css_get_many(&memcg->css, batch);
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
-	if (!(gfp_mask & __GFP_WAIT))
-		goto done;
+
 	/*
-	 * If the hierarchy is above the normal consumption range,
-	 * make the charging task trim their excess contribution.
+	 * If the hierarchy is above the normal consumption range, schedule
+	 * direct reclaim on returning to userland.  We can perform direct
+	 * reclaim here if __GFP_WAIT; however, punting has the benefit of
+	 * avoiding surprise high stack usages and it's fine to breach the
+	 * high limit temporarily while control stays in kernel.
 	 */
-	do {
-		unsigned long usage = page_counter_read(&memcg->memory);
-		unsigned long high = ACCESS_ONCE(memcg->high);
+	if (!current->memcg_over_high) {
+		struct mem_cgroup *pos = memcg;
 
-		if (usage <= high)
-			continue;
-		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
-		try_to_free_mem_cgroup_pages(memcg, high - usage, gfp_mask, true);
-	} while ((memcg = parent_mem_cgroup(memcg)));
+		do {
+			if (page_counter_read(&pos->memory) > pos->high) {
+				/* make user return path rescan from leaf */
+				css_get(&memcg->css);
+				current->memcg_over_high = memcg;
+				set_notify_resume(current);
+				break;
+			}
+		} while ((pos = parent_mem_cgroup(pos)));
+	}
 done:
 	return ret;
 }
@@ -5053,6 +5087,13 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
 }
 #endif
 
+static void mem_cgroup_exit(struct cgroup_subsys_state *css,
+			    struct cgroup_subsys_state *old_css,
+			    struct task_struct *task)
+{
+	mem_cgroup_handle_over_high();
+}
+
 /*
  * Cgroup retains root cgroups across [un]mount cycles making it necessary
  * to verify whether we're attached to the default hierarchy on each mount
@@ -5223,6 +5264,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.can_attach = mem_cgroup_can_attach,
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.attach = mem_cgroup_move_task,
+	.exit = mem_cgroup_exit,
 	.bind = mem_cgroup_bind,
 	.dfl_cftypes = memory_files,
 	.legacy_cftypes = mem_cgroup_legacy_files,
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
@ 2015-08-28 15:25     ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw)
  To: hannes, mhocko; +Cc: cgroups, linux-mm, vdavydov, kernel-team, Tejun Heo

Currently, try_charge() tries to reclaim memory directly when the high
limit is breached; however, this has a couple issues.

* try_charge() can be invoked from any in-kernel allocation site and
  reclaim path may use considerable amount of stack.  This can lead to
  stack overflows which are extremely difficult to reproduce.

* If the allocation doesn't have __GFP_WAIT, direct reclaim is
  skipped.  If a process performs only speculative allocations, it can
  blow way past the high limit.  This is actually easily reproducible
  by simply doing "find /".  VFS tries speculative !__GFP_WAIT
  allocations first, so as long as there's memory which can be
  consumed without blocking, it can keep allocating memory regardless
  of the high limit.

This patch makes try_charge() always punt the direct reclaim to the
return-to-userland path.  If try_charge() detects that high limit is
breached, it sets current->memcg_over_high to the offending memcg and
schedules execution of mem_cgroup_handle_over_high() which performs
the direct reclaim from the return-to-userland path.

As long as kernel doesn't have a run-away allocation spree, this
should provide enough protection while making kmemcg behave more
consistently.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/memcontrol.h |  6 +++++
 include/linux/sched.h      |  1 +
 include/linux/tracehook.h  |  3 +++
 mm/memcontrol.c            | 66 +++++++++++++++++++++++++++++++++++++---------
 4 files changed, 64 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3d28656..8d345a7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -402,6 +402,8 @@ static inline int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
 	return inactive * inactive_ratio < active;
 }
 
+void mem_cgroup_handle_over_high(void);
+
 void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 				struct task_struct *p);
 
@@ -621,6 +623,10 @@ static inline void mem_cgroup_end_page_stat(struct mem_cgroup *memcg)
 {
 }
 
+static inline void mem_cgroup_handle_over_high(void)
+{
+}
+
 static inline void mem_cgroup_oom_enable(void)
 {
 }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ef73b54..c76b71d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1785,6 +1785,7 @@ struct task_struct {
 #endif /* CONFIG_TRACING */
 #ifdef CONFIG_MEMCG
 	struct mem_cgroup *memcg_in_oom;
+	struct mem_cgroup *memcg_over_high; /* reclaim on returning to user */
 	gfp_t memcg_oom_gfp_mask;
 	int memcg_oom_order;
 #endif
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 84d4972..26c1521 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -50,6 +50,7 @@
 #include <linux/ptrace.h>
 #include <linux/security.h>
 #include <linux/task_work.h>
+#include <linux/memcontrol.h>
 struct linux_binprm;
 
 /*
@@ -188,6 +189,8 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
 	smp_mb__after_atomic();
 	if (unlikely(current->task_works))
 		task_work_run();
+
+	mem_cgroup_handle_over_high();
 }
 
 #endif	/* <linux/tracehook.h> */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74abb31..c94b686 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -62,6 +62,7 @@
 #include <linux/oom.h>
 #include <linux/lockdep.h>
 #include <linux/file.h>
+#include <linux/tracehook.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -1963,6 +1964,33 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
+/*
+ * Scheduled by try_charge() to be executed from the userland return path
+ * and reclaims memory over the high limit.
+ */
+void mem_cgroup_handle_over_high(void)
+{
+	struct mem_cgroup *memcg = current->memcg_over_high;
+
+	if (likely(!memcg))
+		return;
+
+	do {
+		unsigned long usage = page_counter_read(&memcg->memory);
+		unsigned long high = ACCESS_ONCE(memcg->high);
+
+		if (usage <= high)
+			continue;
+
+		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
+		try_to_free_mem_cgroup_pages(memcg, usage - high,
+					     GFP_KERNEL, true);
+	} while ((memcg = parent_mem_cgroup(memcg)));
+
+	css_put(&current->memcg_over_high->css);
+	current->memcg_over_high = NULL;
+}
+
 static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		      unsigned int nr_pages)
 {
@@ -2071,21 +2099,27 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	css_get_many(&memcg->css, batch);
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
-	if (!(gfp_mask & __GFP_WAIT))
-		goto done;
+
 	/*
-	 * If the hierarchy is above the normal consumption range,
-	 * make the charging task trim their excess contribution.
+	 * If the hierarchy is above the normal consumption range, schedule
+	 * direct reclaim on returning to userland.  We can perform direct
+	 * reclaim here if __GFP_WAIT; however, punting has the benefit of
+	 * avoiding surprise high stack usages and it's fine to breach the
+	 * high limit temporarily while control stays in kernel.
 	 */
-	do {
-		unsigned long usage = page_counter_read(&memcg->memory);
-		unsigned long high = ACCESS_ONCE(memcg->high);
+	if (!current->memcg_over_high) {
+		struct mem_cgroup *pos = memcg;
 
-		if (usage <= high)
-			continue;
-		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
-		try_to_free_mem_cgroup_pages(memcg, high - usage, gfp_mask, true);
-	} while ((memcg = parent_mem_cgroup(memcg)));
+		do {
+			if (page_counter_read(&pos->memory) > pos->high) {
+				/* make user return path rescan from leaf */
+				css_get(&memcg->css);
+				current->memcg_over_high = memcg;
+				set_notify_resume(current);
+				break;
+			}
+		} while ((pos = parent_mem_cgroup(pos)));
+	}
 done:
 	return ret;
 }
@@ -5053,6 +5087,13 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
 }
 #endif
 
+static void mem_cgroup_exit(struct cgroup_subsys_state *css,
+			    struct cgroup_subsys_state *old_css,
+			    struct task_struct *task)
+{
+	mem_cgroup_handle_over_high();
+}
+
 /*
  * Cgroup retains root cgroups across [un]mount cycles making it necessary
  * to verify whether we're attached to the default hierarchy on each mount
@@ -5223,6 +5264,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.can_attach = mem_cgroup_can_attach,
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.attach = mem_cgroup_move_task,
+	.exit = mem_cgroup_exit,
 	.bind = mem_cgroup_bind,
 	.dfl_cftypes = memory_files,
 	.legacy_cftypes = mem_cgroup_legacy_files,
-- 
2.4.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 66+ messages in thread

[parent not found: <1440775530-18630-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
  2015-08-28 15:25     ` Tejun Heo
@ 2015-08-28 16:36         ` Vladimir Davydov
  -1 siblings, 0 replies; 66+ messages in thread
From: Vladimir Davydov @ 2015-08-28 16:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kernel-team-b10kYP2dOMg

Hi Tejun,

On Fri, Aug 28, 2015 at 11:25:29AM -0400, Tejun Heo wrote:
> Currently, try_charge() tries to reclaim memory directly when the high
> limit is breached; however, this has a couple issues.
> 
> * try_charge() can be invoked from any in-kernel allocation site and
>   reclaim path may use considerable amount of stack.  This can lead to
>   stack overflows which are extremely difficult to reproduce.

IMO this paragraph does not justify this patch at all, because one will
still invoke direct reclaim from try_charge() on hitting the hard limit.

> 
> * If the allocation doesn't have __GFP_WAIT, direct reclaim is
>   skipped.  If a process performs only speculative allocations, it can
>   blow way past the high limit.  This is actually easily reproducible
>   by simply doing "find /".  VFS tries speculative !__GFP_WAIT
>   allocations first, so as long as there's memory which can be
>   consumed without blocking, it can keep allocating memory regardless
>   of the high limit.

I think there shouldn't normally occur a lot of !__GFP_WAIT allocations
in a row - they should still alternate with normal __GFP_WAIT
allocations. Yes, that means we can breach memory.high threshold for a
short period of time, but it isn't a hard limit, so it looks perfectly
fine to me.

I tried to run `find /` over ext4 in a cgroup with memory.high set to
32M and kmem accounting enabled. With such a setup memory.current never
got higher than 33152K, which is only 384K greater than the memory.high.
Which FS did you use?

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
@ 2015-08-28 16:36         ` Vladimir Davydov
  0 siblings, 0 replies; 66+ messages in thread
From: Vladimir Davydov @ 2015-08-28 16:36 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, mhocko, cgroups, linux-mm, kernel-team

Hi Tejun,

On Fri, Aug 28, 2015 at 11:25:29AM -0400, Tejun Heo wrote:
> Currently, try_charge() tries to reclaim memory directly when the high
> limit is breached; however, this has a couple issues.
> 
> * try_charge() can be invoked from any in-kernel allocation site and
>   reclaim path may use considerable amount of stack.  This can lead to
>   stack overflows which are extremely difficult to reproduce.

IMO this paragraph does not justify this patch at all, because one will
still invoke direct reclaim from try_charge() on hitting the hard limit.

> 
> * If the allocation doesn't have __GFP_WAIT, direct reclaim is
>   skipped.  If a process performs only speculative allocations, it can
>   blow way past the high limit.  This is actually easily reproducible
>   by simply doing "find /".  VFS tries speculative !__GFP_WAIT
>   allocations first, so as long as there's memory which can be
>   consumed without blocking, it can keep allocating memory regardless
>   of the high limit.

I think there shouldn't normally occur a lot of !__GFP_WAIT allocations
in a row - they should still alternate with normal __GFP_WAIT
allocations. Yes, that means we can breach memory.high threshold for a
short period of time, but it isn't a hard limit, so it looks perfectly
fine to me.

I tried to run `find /` over ext4 in a cgroup with memory.high set to
32M and kmem accounting enabled. With such a setup memory.current never
got higher than 33152K, which is only 384K greater than the memory.high.
Which FS did you use?

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
  2015-08-28 16:36         ` Vladimir Davydov
@ 2015-08-28 16:48           ` Tejun Heo
  -1 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 16:48 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kernel-team-b10kYP2dOMg

Hello, Vladimir.

On Fri, Aug 28, 2015 at 07:36:11PM +0300, Vladimir Davydov wrote:
> > * try_charge() can be invoked from any in-kernel allocation site and
> >   reclaim path may use considerable amount of stack.  This can lead to
> >   stack overflows which are extremely difficult to reproduce.
> 
> IMO this paragraph does not justify this patch at all, because one will
> still invoke direct reclaim from try_charge() on hitting the hard limit.

Ah... right, and we can't defer direct reclaim for hard limit.

> > * If the allocation doesn't have __GFP_WAIT, direct reclaim is
> >   skipped.  If a process performs only speculative allocations, it can
> >   blow way past the high limit.  This is actually easily reproducible
> >   by simply doing "find /".  VFS tries speculative !__GFP_WAIT
> >   allocations first, so as long as there's memory which can be
> >   consumed without blocking, it can keep allocating memory regardless
> >   of the high limit.
> 
> I think there shouldn't normally occur a lot of !__GFP_WAIT allocations
> in a row - they should still alternate with normal __GFP_WAIT
> allocations. Yes, that means we can breach memory.high threshold for a
> short period of time, but it isn't a hard limit, so it looks perfectly
> fine to me.
> 
> I tried to run `find /` over ext4 in a cgroup with memory.high set to
> 32M and kmem accounting enabled. With such a setup memory.current never
> got higher than 33152K, which is only 384K greater than the memory.high.
> Which FS did you use?

ext4.  Here, it goes onto happily consuming hundreds of megabytes with
limit set at 32M.  We have quite a few places where !__GFP_WAIT
allocations are performed speculatively in hot paths with fallback
slow paths, so this is bound to happen somewhere.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
@ 2015-08-28 16:48           ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 16:48 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: hannes, mhocko, cgroups, linux-mm, kernel-team

Hello, Vladimir.

On Fri, Aug 28, 2015 at 07:36:11PM +0300, Vladimir Davydov wrote:
> > * try_charge() can be invoked from any in-kernel allocation site and
> >   reclaim path may use considerable amount of stack.  This can lead to
> >   stack overflows which are extremely difficult to reproduce.
> 
> IMO this paragraph does not justify this patch at all, because one will
> still invoke direct reclaim from try_charge() on hitting the hard limit.

Ah... right, and we can't defer direct reclaim for hard limit.

> > * If the allocation doesn't have __GFP_WAIT, direct reclaim is
> >   skipped.  If a process performs only speculative allocations, it can
> >   blow way past the high limit.  This is actually easily reproducible
> >   by simply doing "find /".  VFS tries speculative !__GFP_WAIT
> >   allocations first, so as long as there's memory which can be
> >   consumed without blocking, it can keep allocating memory regardless
> >   of the high limit.
> 
> I think there shouldn't normally occur a lot of !__GFP_WAIT allocations
> in a row - they should still alternate with normal __GFP_WAIT
> allocations. Yes, that means we can breach memory.high threshold for a
> short period of time, but it isn't a hard limit, so it looks perfectly
> fine to me.
> 
> I tried to run `find /` over ext4 in a cgroup with memory.high set to
> 32M and kmem accounting enabled. With such a setup memory.current never
> got higher than 33152K, which is only 384K greater than the memory.high.
> Which FS did you use?

ext4.  Here, it goes onto happily consuming hundreds of megabytes with
limit set at 32M.  We have quite a few places where !__GFP_WAIT
allocations are performed speculatively in hot paths with fallback
slow paths, so this is bound to happen somewhere.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
  2015-08-28 16:48           ` Tejun Heo
  (?)
@ 2015-08-28 20:32           ` Vladimir Davydov
  2015-08-28 20:44               ` Tejun Heo
  2015-08-30 15:52             ` Vladimir Davydov
  -1 siblings, 2 replies; 66+ messages in thread
From: Vladimir Davydov @ 2015-08-28 20:32 UTC (permalink / raw)
  To: Tejun Heo, Andrew Morton
  Cc: hannes, mhocko, cgroups, linux-mm, kernel-team, Joonsoo Kim,
	Christoph Lameter, David Rientjes

On Fri, Aug 28, 2015 at 12:48:19PM -0400, Tejun Heo wrote:
...
> > > * If the allocation doesn't have __GFP_WAIT, direct reclaim is
> > >   skipped.  If a process performs only speculative allocations, it can
> > >   blow way past the high limit.  This is actually easily reproducible
> > >   by simply doing "find /".  VFS tries speculative !__GFP_WAIT
> > >   allocations first, so as long as there's memory which can be
> > >   consumed without blocking, it can keep allocating memory regardless
> > >   of the high limit.
> > 
> > I think there shouldn't normally occur a lot of !__GFP_WAIT allocations
> > in a row - they should still alternate with normal __GFP_WAIT
> > allocations. Yes, that means we can breach memory.high threshold for a
> > short period of time, but it isn't a hard limit, so it looks perfectly
> > fine to me.
> > 
> > I tried to run `find /` over ext4 in a cgroup with memory.high set to
> > 32M and kmem accounting enabled. With such a setup memory.current never
> > got higher than 33152K, which is only 384K greater than the memory.high.
> > Which FS did you use?
> 
> ext4.  Here, it goes onto happily consuming hundreds of megabytes with
> limit set at 32M.  We have quite a few places where !__GFP_WAIT
> allocations are performed speculatively in hot paths with fallback
> slow paths, so this is bound to happen somewhere.

What kind of workload should it be then? `find` will constantly invoke
d_alloc, which issues a GFP_KERNEL allocation and therefore is allowed
to perform reclaim...

OK, I tried to reproduce the issue on the latest mainline kernel and ...
succeeded - memory.current did occasionally jump up to ~55M although
memory.high was set to 32M. Hmm, strange... Started to investigate.
Printed stack traces and found that we don't invoke memcg reclaim on
normal GFP_KERNEL allocations! How is that? The thing is there was a
commit that made SLUB (not VFS or any other kmem user, but core SLUB)
try to allocate high order slab pages w/o __GFP_WAIT for performance
reasons. That broke kmemcg case. Here it goes:

commit 6af3142bed1f520b90f4cdb6cd10bbd16906ce9a
Author: Joonsoo Kim <js1304@gmail.com>
Date:   Tue Aug 25 00:03:52 2015 +0000

    mm/slub: don't wait for high-order page allocation

I suspect your kernel has this commit included, because w/o it I haven't
managed to catch anything nearly as bad as you describe: the memory.high
excess reached 1-2 Mb at max, but never "hundreds of megabytes". If so,
we'd better fix that instead. Actually, it's worth fixing anyway. What
about the patch below?
---
From: Vladimir Davydov <vdavydov@parallels.com>
Date: Fri, 28 Aug 2015 23:17:19 +0300
Subject: [PATCH] mm/slub: don't bypass memcg reclaim for high-order page
 allocation

Commit 6af3142bed1f52 ("mm/slub: don't wait for high-order page
allocation") made allocate_slab() try to allocate high order slab pages
w/o __GFP_WAIT in order to avoid invoking reclaim/compaction when we can
fall back on low order pages. However, it broke kmemcg/memory.high
logic. The latter works as a soft limit: an allocation won't fail if it
is breached, but we call direct reclaim to compensate the excess. W/o
__GFP_WAIT we can't invoke reclaimer and therefore we will just go on,
exceeding memory.high more and more until a normal __GFP_WAIT allocation
is issued.

Since memcg reclaim never triggers compaction, we can pass __GFP_WAIT to
memcg_charge_slab() even on high order page allocations w/o any
performance impact. So let's fix this problem by excluding __GFP_WAIT
only from alloc_pages() while still forwarding it to memcg_charge_slab()
if the context allows.

Fixes: 6af3142bed1f52 ("mm/slub: don't wait for high-order page allocation")
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>

diff --git a/mm/slub.c b/mm/slub.c
index e180f8dcd06d..1b9dbad40272 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1333,6 +1333,9 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
 	if (memcg_charge_slab(s, flags, order))
 		return NULL;
 
+	if ((flags & __GFP_WAIT) && oo_order(oo) > oo_order(s->min))
+		flags = (flags | __GFP_NOMEMALLOC) & ~__GFP_WAIT;
+
 	if (node == NUMA_NO_NODE)
 		page = alloc_pages(flags, order);
 	else
@@ -1364,8 +1367,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	 * so we fall-back to the minimum order allocation.
 	 */
 	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
-	if ((alloc_gfp & __GFP_WAIT) && oo_order(oo) > oo_order(s->min))
-		alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_WAIT;
 
 	page = alloc_slab_page(s, alloc_gfp, node, oo);
 	if (unlikely(!page)) {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
  2015-08-28 20:32           ` Vladimir Davydov
@ 2015-08-28 20:44               ` Tejun Heo
  2015-08-30 15:52             ` Vladimir Davydov
  1 sibling, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 20:44 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, hannes-druUgvl0LCNAfugRpC6u6w,
	mhocko-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kernel-team-b10kYP2dOMg,
	Joonsoo Kim, Christoph Lameter, David Rientjes

Hello,

On Fri, Aug 28, 2015 at 11:32:31PM +0300, Vladimir Davydov wrote:
> What kind of workload should it be then? `find` will constantly invoke
> d_alloc, which issues a GFP_KERNEL allocation and therefore is allowed
> to perform reclaim...
> 
> OK, I tried to reproduce the issue on the latest mainline kernel and ...
> succeeded - memory.current did occasionally jump up to ~55M although
> memory.high was set to 32M. Hmm, strange... Started to investigate.
> Printed stack traces and found that we don't invoke memcg reclaim on
> normal GFP_KERNEL allocations! How is that? The thing is there was a
> commit that made SLUB (not VFS or any other kmem user, but core SLUB)
> try to allocate high order slab pages w/o __GFP_WAIT for performance
> reasons. That broke kmemcg case. Here it goes:

Ah, cool, so it was a bug from slub.  Punting to return path still has
some niceties but if we can't consistently get rid of stack
consumption it's not that attractive.  Let's revisit it later together
with hard limit reclaim.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
@ 2015-08-28 20:44               ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 20:44 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, hannes, mhocko, cgroups, linux-mm, kernel-team,
	Joonsoo Kim, Christoph Lameter, David Rientjes

Hello,

On Fri, Aug 28, 2015 at 11:32:31PM +0300, Vladimir Davydov wrote:
> What kind of workload should it be then? `find` will constantly invoke
> d_alloc, which issues a GFP_KERNEL allocation and therefore is allowed
> to perform reclaim...
> 
> OK, I tried to reproduce the issue on the latest mainline kernel and ...
> succeeded - memory.current did occasionally jump up to ~55M although
> memory.high was set to 32M. Hmm, strange... Started to investigate.
> Printed stack traces and found that we don't invoke memcg reclaim on
> normal GFP_KERNEL allocations! How is that? The thing is there was a
> commit that made SLUB (not VFS or any other kmem user, but core SLUB)
> try to allocate high order slab pages w/o __GFP_WAIT for performance
> reasons. That broke kmemcg case. Here it goes:

Ah, cool, so it was a bug from slub.  Punting to return path still has
some niceties but if we can't consistently get rid of stack
consumption it's not that attractive.  Let's revisit it later together
with hard limit reclaim.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
  2015-08-28 20:44               ` Tejun Heo
  (?)
@ 2015-08-28 22:06               ` Tejun Heo
       [not found]                 ` <20150828220632.GF11089-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
  -1 siblings, 1 reply; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 22:06 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, hannes, mhocko, cgroups, linux-mm, kernel-team,
	Joonsoo Kim, Christoph Lameter, David Rientjes

On Fri, Aug 28, 2015 at 04:44:32PM -0400, Tejun Heo wrote:
> Ah, cool, so it was a bug from slub.  Punting to return path still has
> some niceties but if we can't consistently get rid of stack
> consumption it's not that attractive.  Let's revisit it later together
> with hard limit reclaim.

So, I can't check right now but I'm pretty sure I was using SLAB on my
test config, so this issue may exist there too.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

[parent not found: <20150828220632.GF11089-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>]

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
  2015-08-28 22:06               ` Tejun Heo
@ 2015-08-29  7:59                     ` Vladimir Davydov
  0 siblings, 0 replies; 66+ messages in thread
From: Vladimir Davydov @ 2015-08-29  7:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, hannes-druUgvl0LCNAfugRpC6u6w,
	mhocko-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kernel-team-b10kYP2dOMg,
	Joonsoo Kim, Christoph Lameter, David Rientjes

On Fri, Aug 28, 2015 at 06:06:32PM -0400, Tejun Heo wrote:
> On Fri, Aug 28, 2015 at 04:44:32PM -0400, Tejun Heo wrote:
> > Ah, cool, so it was a bug from slub.  Punting to return path still has
> > some niceties but if we can't consistently get rid of stack
> > consumption it's not that attractive.  Let's revisit it later together
> > with hard limit reclaim.
> 
> So, I can't check right now but I'm pretty sure I was using SLAB on my
> test config, so this issue may exist there too.

Yeah, SLAB is broken too. It was accidentally broken by commit
4167e9b2cf10 ("mm: remove GFP_THISNODE"), which among other things made
SLAB filter out __GFP_WAIT from gfp flags when probing a NUMA node. I'll
take a look what we can do with that.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
@ 2015-08-29  7:59                     ` Vladimir Davydov
  0 siblings, 0 replies; 66+ messages in thread
From: Vladimir Davydov @ 2015-08-29  7:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, hannes, mhocko, cgroups, linux-mm, kernel-team,
	Joonsoo Kim, Christoph Lameter, David Rientjes

On Fri, Aug 28, 2015 at 06:06:32PM -0400, Tejun Heo wrote:
> On Fri, Aug 28, 2015 at 04:44:32PM -0400, Tejun Heo wrote:
> > Ah, cool, so it was a bug from slub.  Punting to return path still has
> > some niceties but if we can't consistently get rid of stack
> > consumption it's not that attractive.  Let's revisit it later together
> > with hard limit reclaim.
> 
> So, I can't check right now but I'm pretty sure I was using SLAB on my
> test config, so this issue may exist there too.

Yeah, SLAB is broken too. It was accidentally broken by commit
4167e9b2cf10 ("mm: remove GFP_THISNODE"), which among other things made
SLAB filter out __GFP_WAIT from gfp flags when probing a NUMA node. I'll
take a look what we can do with that.

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
  2015-08-28 20:32           ` Vladimir Davydov
  2015-08-28 20:44               ` Tejun Heo
@ 2015-08-30 15:52             ` Vladimir Davydov
  1 sibling, 0 replies; 66+ messages in thread
From: Vladimir Davydov @ 2015-08-30 15:52 UTC (permalink / raw)
  To: Tejun Heo, Andrew Morton
  Cc: hannes, mhocko, cgroups, linux-mm, kernel-team, Joonsoo Kim,
	Christoph Lameter, David Rientjes

On Fri, Aug 28, 2015 at 11:32:31PM +0300, Vladimir Davydov wrote:
...
> From: Vladimir Davydov <vdavydov@parallels.com>
> Date: Fri, 28 Aug 2015 23:17:19 +0300
> Subject: [PATCH] mm/slub: don't bypass memcg reclaim for high-order page
>  allocation

Please ignore this patch. I'll rework and resend.

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
  2015-08-28 15:25     ` Tejun Heo
@ 2015-08-28 17:13         ` Michal Hocko
  -1 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-08-28 17:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ,
	kernel-team-b10kYP2dOMg

On Fri 28-08-15 11:25:29, Tejun Heo wrote:
> Currently, try_charge() tries to reclaim memory directly when the high
> limit is breached; however, this has a couple issues.
> 
> * try_charge() can be invoked from any in-kernel allocation site and
>   reclaim path may use considerable amount of stack.  This can lead to
>   stack overflows which are extremely difficult to reproduce.

This is true but I haven't seen any reports for the stack overflow for
quite some time.
 
> * If the allocation doesn't have __GFP_WAIT, direct reclaim is
>   skipped.  If a process performs only speculative allocations, it can
>   blow way past the high limit.  This is actually easily reproducible
>   by simply doing "find /".  VFS tries speculative !__GFP_WAIT
>   allocations first, so as long as there's memory which can be
>   consumed without blocking, it can keep allocating memory regardless
>   of the high limit.

It is a bit confusing that you are talking about direct reclaim but in
fact mean high limit reclaim. But yeah, you are right there is no
protection against GFP_NOWAIT allocations there.

> This patch makes try_charge() always punt the direct reclaim to the
> return-to-userland path.  If try_charge() detects that high limit is
> breached, it sets current->memcg_over_high to the offending memcg and
> schedules execution of mem_cgroup_handle_over_high() which performs
> the direct reclaim from the return-to-userland path.

OK, this is certainly an attractive idea because of allocation requests
with reduced reclaim capabilities. GFP_NOWAIT is not the only one.
GFP_NOFS would be another. With kmem accounting they are much bigger
problem than with regular page faults/page cache. And having full
GFP_KERNEL reclaim context is definitely nice.
I would just argue that this implementation has the same issue as the
other patch in the series which performs high-usage reclaim. I think
that each task should reclaim only its contribution which is trivial
to account.
 
> As long as kernel doesn't have a run-away allocation spree, this
> should provide enough protection while making kmemcg behave more
> consistently.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> ---
>  include/linux/memcontrol.h |  6 +++++
>  include/linux/sched.h      |  1 +
>  include/linux/tracehook.h  |  3 +++
>  mm/memcontrol.c            | 66 +++++++++++++++++++++++++++++++++++++---------
>  4 files changed, 64 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 3d28656..8d345a7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -402,6 +402,8 @@ static inline int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
>  	return inactive * inactive_ratio < active;
>  }
>  
> +void mem_cgroup_handle_over_high(void);
> +
>  void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  				struct task_struct *p);
>  
> @@ -621,6 +623,10 @@ static inline void mem_cgroup_end_page_stat(struct mem_cgroup *memcg)
>  {
>  }
>  
> +static inline void mem_cgroup_handle_over_high(void)
> +{
> +}
> +
>  static inline void mem_cgroup_oom_enable(void)
>  {
>  }
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ef73b54..c76b71d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1785,6 +1785,7 @@ struct task_struct {
>  #endif /* CONFIG_TRACING */
>  #ifdef CONFIG_MEMCG
>  	struct mem_cgroup *memcg_in_oom;
> +	struct mem_cgroup *memcg_over_high; /* reclaim on returning to user */
>  	gfp_t memcg_oom_gfp_mask;
>  	int memcg_oom_order;
>  #endif
> diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
> index 84d4972..26c1521 100644
> --- a/include/linux/tracehook.h
> +++ b/include/linux/tracehook.h
> @@ -50,6 +50,7 @@
>  #include <linux/ptrace.h>
>  #include <linux/security.h>
>  #include <linux/task_work.h>
> +#include <linux/memcontrol.h>
>  struct linux_binprm;
>  
>  /*
> @@ -188,6 +189,8 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
>  	smp_mb__after_atomic();
>  	if (unlikely(current->task_works))
>  		task_work_run();
> +
> +	mem_cgroup_handle_over_high();
>  }
>  
>  #endif	/* <linux/tracehook.h> */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 74abb31..c94b686 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -62,6 +62,7 @@
>  #include <linux/oom.h>
>  #include <linux/lockdep.h>
>  #include <linux/file.h>
> +#include <linux/tracehook.h>
>  #include "internal.h"
>  #include <net/sock.h>
>  #include <net/ip.h>
> @@ -1963,6 +1964,33 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
>  	return NOTIFY_OK;
>  }
>  
> +/*
> + * Scheduled by try_charge() to be executed from the userland return path
> + * and reclaims memory over the high limit.
> + */
> +void mem_cgroup_handle_over_high(void)
> +{
> +	struct mem_cgroup *memcg = current->memcg_over_high;
> +
> +	if (likely(!memcg))
> +		return;
> +
> +	do {
> +		unsigned long usage = page_counter_read(&memcg->memory);
> +		unsigned long high = ACCESS_ONCE(memcg->high);
> +
> +		if (usage <= high)
> +			continue;
> +
> +		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
> +		try_to_free_mem_cgroup_pages(memcg, usage - high,
> +					     GFP_KERNEL, true);
> +	} while ((memcg = parent_mem_cgroup(memcg)));
> +
> +	css_put(&current->memcg_over_high->css);
> +	current->memcg_over_high = NULL;
> +}
> +
>  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		      unsigned int nr_pages)
>  {
> @@ -2071,21 +2099,27 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	css_get_many(&memcg->css, batch);
>  	if (batch > nr_pages)
>  		refill_stock(memcg, batch - nr_pages);
> -	if (!(gfp_mask & __GFP_WAIT))
> -		goto done;
> +
>  	/*
> -	 * If the hierarchy is above the normal consumption range,
> -	 * make the charging task trim their excess contribution.
> +	 * If the hierarchy is above the normal consumption range, schedule
> +	 * direct reclaim on returning to userland.  We can perform direct
> +	 * reclaim here if __GFP_WAIT; however, punting has the benefit of
> +	 * avoiding surprise high stack usages and it's fine to breach the
> +	 * high limit temporarily while control stays in kernel.
>  	 */
> -	do {
> -		unsigned long usage = page_counter_read(&memcg->memory);
> -		unsigned long high = ACCESS_ONCE(memcg->high);
> +	if (!current->memcg_over_high) {
> +		struct mem_cgroup *pos = memcg;
>  
> -		if (usage <= high)
> -			continue;
> -		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
> -		try_to_free_mem_cgroup_pages(memcg, high - usage, gfp_mask, true);
> -	} while ((memcg = parent_mem_cgroup(memcg)));
> +		do {
> +			if (page_counter_read(&pos->memory) > pos->high) {
> +				/* make user return path rescan from leaf */
> +				css_get(&memcg->css);
> +				current->memcg_over_high = memcg;
> +				set_notify_resume(current);
> +				break;
> +			}
> +		} while ((pos = parent_mem_cgroup(pos)));
> +	}
>  done:
>  	return ret;
>  }
> @@ -5053,6 +5087,13 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
>  }
>  #endif
>  
> +static void mem_cgroup_exit(struct cgroup_subsys_state *css,
> +			    struct cgroup_subsys_state *old_css,
> +			    struct task_struct *task)
> +{
> +	mem_cgroup_handle_over_high();
> +}
> +
>  /*
>   * Cgroup retains root cgroups across [un]mount cycles making it necessary
>   * to verify whether we're attached to the default hierarchy on each mount
> @@ -5223,6 +5264,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
>  	.can_attach = mem_cgroup_can_attach,
>  	.cancel_attach = mem_cgroup_cancel_attach,
>  	.attach = mem_cgroup_move_task,
> +	.exit = mem_cgroup_exit,
>  	.bind = mem_cgroup_bind,
>  	.dfl_cftypes = memory_files,
>  	.legacy_cftypes = mem_cgroup_legacy_files,
> -- 
> 2.4.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
@ 2015-08-28 17:13         ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-08-28 17:13 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, cgroups, linux-mm, vdavydov, kernel-team

On Fri 28-08-15 11:25:29, Tejun Heo wrote:
> Currently, try_charge() tries to reclaim memory directly when the high
> limit is breached; however, this has a couple issues.
> 
> * try_charge() can be invoked from any in-kernel allocation site and
>   reclaim path may use considerable amount of stack.  This can lead to
>   stack overflows which are extremely difficult to reproduce.

This is true but I haven't seen any reports for the stack overflow for
quite some time.
 
> * If the allocation doesn't have __GFP_WAIT, direct reclaim is
>   skipped.  If a process performs only speculative allocations, it can
>   blow way past the high limit.  This is actually easily reproducible
>   by simply doing "find /".  VFS tries speculative !__GFP_WAIT
>   allocations first, so as long as there's memory which can be
>   consumed without blocking, it can keep allocating memory regardless
>   of the high limit.

It is a bit confusing that you are talking about direct reclaim but in
fact mean high limit reclaim. But yeah, you are right there is no
protection against GFP_NOWAIT allocations there.

> This patch makes try_charge() always punt the direct reclaim to the
> return-to-userland path.  If try_charge() detects that high limit is
> breached, it sets current->memcg_over_high to the offending memcg and
> schedules execution of mem_cgroup_handle_over_high() which performs
> the direct reclaim from the return-to-userland path.

OK, this is certainly an attractive idea because of allocation requests
with reduced reclaim capabilities. GFP_NOWAIT is not the only one.
GFP_NOFS would be another. With kmem accounting they are much bigger
problem than with regular page faults/page cache. And having full
GFP_KERNEL reclaim context is definitely nice.
I would just argue that this implementation has the same issue as the
other patch in the series which performs high-usage reclaim. I think
that each task should reclaim only its contribution which is trivial
to account.
 
> As long as kernel doesn't have a run-away allocation spree, this
> should provide enough protection while making kmemcg behave more
> consistently.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  include/linux/memcontrol.h |  6 +++++
>  include/linux/sched.h      |  1 +
>  include/linux/tracehook.h  |  3 +++
>  mm/memcontrol.c            | 66 +++++++++++++++++++++++++++++++++++++---------
>  4 files changed, 64 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 3d28656..8d345a7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -402,6 +402,8 @@ static inline int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
>  	return inactive * inactive_ratio < active;
>  }
>  
> +void mem_cgroup_handle_over_high(void);
> +
>  void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  				struct task_struct *p);
>  
> @@ -621,6 +623,10 @@ static inline void mem_cgroup_end_page_stat(struct mem_cgroup *memcg)
>  {
>  }
>  
> +static inline void mem_cgroup_handle_over_high(void)
> +{
> +}
> +
>  static inline void mem_cgroup_oom_enable(void)
>  {
>  }
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ef73b54..c76b71d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1785,6 +1785,7 @@ struct task_struct {
>  #endif /* CONFIG_TRACING */
>  #ifdef CONFIG_MEMCG
>  	struct mem_cgroup *memcg_in_oom;
> +	struct mem_cgroup *memcg_over_high; /* reclaim on returning to user */
>  	gfp_t memcg_oom_gfp_mask;
>  	int memcg_oom_order;
>  #endif
> diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
> index 84d4972..26c1521 100644
> --- a/include/linux/tracehook.h
> +++ b/include/linux/tracehook.h
> @@ -50,6 +50,7 @@
>  #include <linux/ptrace.h>
>  #include <linux/security.h>
>  #include <linux/task_work.h>
> +#include <linux/memcontrol.h>
>  struct linux_binprm;
>  
>  /*
> @@ -188,6 +189,8 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
>  	smp_mb__after_atomic();
>  	if (unlikely(current->task_works))
>  		task_work_run();
> +
> +	mem_cgroup_handle_over_high();
>  }
>  
>  #endif	/* <linux/tracehook.h> */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 74abb31..c94b686 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -62,6 +62,7 @@
>  #include <linux/oom.h>
>  #include <linux/lockdep.h>
>  #include <linux/file.h>
> +#include <linux/tracehook.h>
>  #include "internal.h"
>  #include <net/sock.h>
>  #include <net/ip.h>
> @@ -1963,6 +1964,33 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
>  	return NOTIFY_OK;
>  }
>  
> +/*
> + * Scheduled by try_charge() to be executed from the userland return path
> + * and reclaims memory over the high limit.
> + */
> +void mem_cgroup_handle_over_high(void)
> +{
> +	struct mem_cgroup *memcg = current->memcg_over_high;
> +
> +	if (likely(!memcg))
> +		return;
> +
> +	do {
> +		unsigned long usage = page_counter_read(&memcg->memory);
> +		unsigned long high = ACCESS_ONCE(memcg->high);
> +
> +		if (usage <= high)
> +			continue;
> +
> +		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
> +		try_to_free_mem_cgroup_pages(memcg, usage - high,
> +					     GFP_KERNEL, true);
> +	} while ((memcg = parent_mem_cgroup(memcg)));
> +
> +	css_put(&current->memcg_over_high->css);
> +	current->memcg_over_high = NULL;
> +}
> +
>  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		      unsigned int nr_pages)
>  {
> @@ -2071,21 +2099,27 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	css_get_many(&memcg->css, batch);
>  	if (batch > nr_pages)
>  		refill_stock(memcg, batch - nr_pages);
> -	if (!(gfp_mask & __GFP_WAIT))
> -		goto done;
> +
>  	/*
> -	 * If the hierarchy is above the normal consumption range,
> -	 * make the charging task trim their excess contribution.
> +	 * If the hierarchy is above the normal consumption range, schedule
> +	 * direct reclaim on returning to userland.  We can perform direct
> +	 * reclaim here if __GFP_WAIT; however, punting has the benefit of
> +	 * avoiding surprise high stack usages and it's fine to breach the
> +	 * high limit temporarily while control stays in kernel.
>  	 */
> -	do {
> -		unsigned long usage = page_counter_read(&memcg->memory);
> -		unsigned long high = ACCESS_ONCE(memcg->high);
> +	if (!current->memcg_over_high) {
> +		struct mem_cgroup *pos = memcg;
>  
> -		if (usage <= high)
> -			continue;
> -		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
> -		try_to_free_mem_cgroup_pages(memcg, high - usage, gfp_mask, true);
> -	} while ((memcg = parent_mem_cgroup(memcg)));
> +		do {
> +			if (page_counter_read(&pos->memory) > pos->high) {
> +				/* make user return path rescan from leaf */
> +				css_get(&memcg->css);
> +				current->memcg_over_high = memcg;
> +				set_notify_resume(current);
> +				break;
> +			}
> +		} while ((pos = parent_mem_cgroup(pos)));
> +	}
>  done:
>  	return ret;
>  }
> @@ -5053,6 +5087,13 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
>  }
>  #endif
>  
> +static void mem_cgroup_exit(struct cgroup_subsys_state *css,
> +			    struct cgroup_subsys_state *old_css,
> +			    struct task_struct *task)
> +{
> +	mem_cgroup_handle_over_high();
> +}
> +
>  /*
>   * Cgroup retains root cgroups across [un]mount cycles making it necessary
>   * to verify whether we're attached to the default hierarchy on each mount
> @@ -5223,6 +5264,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
>  	.can_attach = mem_cgroup_can_attach,
>  	.cancel_attach = mem_cgroup_cancel_attach,
>  	.attach = mem_cgroup_move_task,
> +	.exit = mem_cgroup_exit,
>  	.bind = mem_cgroup_bind,
>  	.dfl_cftypes = memory_files,
>  	.legacy_cftypes = mem_cgroup_legacy_files,
> -- 
> 2.4.3

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

[parent not found: <20150828171322.GC21463-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
  2015-08-28 17:13         ` Michal Hocko
@ 2015-08-28 17:56             ` Tejun Heo
  -1 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 17:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ,
	kernel-team-b10kYP2dOMg

Hello,

On Fri, Aug 28, 2015 at 07:13:22PM +0200, Michal Hocko wrote:
> On Fri 28-08-15 11:25:29, Tejun Heo wrote:
> > Currently, try_charge() tries to reclaim memory directly when the high
> > limit is breached; however, this has a couple issues.
> > 
> > * try_charge() can be invoked from any in-kernel allocation site and
> >   reclaim path may use considerable amount of stack.  This can lead to
> >   stack overflows which are extremely difficult to reproduce.
> 
> This is true but I haven't seen any reports for the stack overflow for
> quite some time.

So, this didn't really fix it but xfs had to punt things to workqueues
to avoid stack overflows and IIRC it involved direct reclaim.  Maybe
it's too late but it probably is a good idea to punt this from the
source.

> I would just argue that this implementation has the same issue as the
> other patch in the series which performs high-usage reclaim. I think
> that each task should reclaim only its contribution which is trivial
> to account.

Hmm... I'll respond there.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
@ 2015-08-28 17:56             ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 17:56 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, cgroups, linux-mm, vdavydov, kernel-team

Hello,

On Fri, Aug 28, 2015 at 07:13:22PM +0200, Michal Hocko wrote:
> On Fri 28-08-15 11:25:29, Tejun Heo wrote:
> > Currently, try_charge() tries to reclaim memory directly when the high
> > limit is breached; however, this has a couple issues.
> > 
> > * try_charge() can be invoked from any in-kernel allocation site and
> >   reclaim path may use considerable amount of stack.  This can lead to
> >   stack overflows which are extremely difficult to reproduce.
> 
> This is true but I haven't seen any reports for the stack overflow for
> quite some time.

So, this didn't really fix it but xfs had to punt things to workqueues
to avoid stack overflows and IIRC it involved direct reclaim.  Maybe
it's too late but it probably is a good idea to punt this from the
source.

> I would just argue that this implementation has the same issue as the
> other patch in the series which performs high-usage reclaim. I think
> that each task should reclaim only its contribution which is trivial
> to account.

Hmm... I'll respond there.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
  2015-08-28 17:13         ` Michal Hocko
@ 2015-08-28 20:45             ` Vladimir Davydov
  -1 siblings, 0 replies; 66+ messages in thread
From: Vladimir Davydov @ 2015-08-28 20:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kernel-team-b10kYP2dOMg

On Fri, Aug 28, 2015 at 07:13:22PM +0200, Michal Hocko wrote:
...
> > * If the allocation doesn't have __GFP_WAIT, direct reclaim is
> >   skipped.  If a process performs only speculative allocations, it can
> >   blow way past the high limit.  This is actually easily reproducible
> >   by simply doing "find /".  VFS tries speculative !__GFP_WAIT
> >   allocations first, so as long as there's memory which can be
> >   consumed without blocking, it can keep allocating memory regardless
> >   of the high limit.
> 
> It is a bit confusing that you are talking about direct reclaim but in
> fact mean high limit reclaim. But yeah, you are right there is no
> protection against GFP_NOWAIT allocations there.

Actually, memory.high by itself *is* the protection against GFP_NOWAIT
allocations, similarly to zone watermarks. W/o it we would have no other
choice but fail a GFP_NOWAIT allocation on hitting memory.max. One
should just set it so that

  memory.max - memory.high > [max sum size of !__GFP_WAIT allocations
                              that can normally occur in a row]

That being said, currently I don't see any point in making memory.high
!__GFP_WAIT-safe.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
@ 2015-08-28 20:45             ` Vladimir Davydov
  0 siblings, 0 replies; 66+ messages in thread
From: Vladimir Davydov @ 2015-08-28 20:45 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tejun Heo, hannes, cgroups, linux-mm, kernel-team

On Fri, Aug 28, 2015 at 07:13:22PM +0200, Michal Hocko wrote:
...
> > * If the allocation doesn't have __GFP_WAIT, direct reclaim is
> >   skipped.  If a process performs only speculative allocations, it can
> >   blow way past the high limit.  This is actually easily reproducible
> >   by simply doing "find /".  VFS tries speculative !__GFP_WAIT
> >   allocations first, so as long as there's memory which can be
> >   consumed without blocking, it can keep allocating memory regardless
> >   of the high limit.
> 
> It is a bit confusing that you are talking about direct reclaim but in
> fact mean high limit reclaim. But yeah, you are right there is no
> protection against GFP_NOWAIT allocations there.

Actually, memory.high by itself *is* the protection against GFP_NOWAIT
allocations, similarly to zone watermarks. W/o it we would have no other
choice but fail a GFP_NOWAIT allocation on hitting memory.max. One
should just set it so that

  memory.max - memory.high > [max sum size of !__GFP_WAIT allocations
                              that can normally occur in a row]

That being said, currently I don't see any point in making memory.high
!__GFP_WAIT-safe.

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
  2015-08-28 20:45             ` Vladimir Davydov
@ 2015-08-28 20:53               ` Tejun Heo
  -1 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 20:53 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Michal Hocko, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kernel-team-b10kYP2dOMg

Hello, Vladmir.

On Fri, Aug 28, 2015 at 11:45:54PM +0300, Vladimir Davydov wrote:
> Actually, memory.high by itself *is* the protection against GFP_NOWAIT
> allocations, similarly to zone watermarks. W/o it we would have no other
> choice but fail a GFP_NOWAIT allocation on hitting memory.max. One
> should just set it so that
> 
>   memory.max - memory.high > [max sum size of !__GFP_WAIT allocations
>                               that can normally occur in a row]

While this would be true in many cases, I don't think this is the
intention of the two knobs and the space between high and max can be
filled up by anything which can't be reclaimed - e.g. too many dirty /
writeback pages on a slow device or memlocked pages.  If it were
really the buffer for GFP_NOWAIT, there's no reason to even make it a
separate knob and we *may* change how over-high reclaim behaves in the
future, so let's please not dig ourselves into something too specific.

> That being said, currently I don't see any point in making memory.high
> !__GFP_WAIT-safe.

Yeah, as long as the blow up can't be triggered consistently, it
should be fine.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
@ 2015-08-28 20:53               ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 20:53 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: Michal Hocko, hannes, cgroups, linux-mm, kernel-team

Hello, Vladmir.

On Fri, Aug 28, 2015 at 11:45:54PM +0300, Vladimir Davydov wrote:
> Actually, memory.high by itself *is* the protection against GFP_NOWAIT
> allocations, similarly to zone watermarks. W/o it we would have no other
> choice but fail a GFP_NOWAIT allocation on hitting memory.max. One
> should just set it so that
> 
>   memory.max - memory.high > [max sum size of !__GFP_WAIT allocations
>                               that can normally occur in a row]

While this would be true in many cases, I don't think this is the
intention of the two knobs and the space between high and max can be
filled up by anything which can't be reclaimed - e.g. too many dirty /
writeback pages on a slow device or memlocked pages.  If it were
really the buffer for GFP_NOWAIT, there's no reason to even make it a
separate knob and we *may* change how over-high reclaim behaves in the
future, so let's please not dig ourselves into something too specific.

> That being said, currently I don't see any point in making memory.high
> !__GFP_WAIT-safe.

Yeah, as long as the blow up can't be triggered consistently, it
should be fine.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

[parent not found: <20150828205301.GB11089-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>]

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
  2015-08-28 20:53               ` Tejun Heo
@ 2015-08-28 21:07                   ` Vladimir Davydov
  -1 siblings, 0 replies; 66+ messages in thread
From: Vladimir Davydov @ 2015-08-28 21:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kernel-team-b10kYP2dOMg

On Fri, Aug 28, 2015 at 04:53:01PM -0400, Tejun Heo wrote:
> On Fri, Aug 28, 2015 at 11:45:54PM +0300, Vladimir Davydov wrote:
> > Actually, memory.high by itself *is* the protection against GFP_NOWAIT
> > allocations, similarly to zone watermarks. W/o it we would have no other
> > choice but fail a GFP_NOWAIT allocation on hitting memory.max. One
> > should just set it so that
> > 
> >   memory.max - memory.high > [max sum size of !__GFP_WAIT allocations
> >                               that can normally occur in a row]
> 
> While this would be true in many cases, I don't think this is the
> intention of the two knobs and the space between high and max can be
> filled up by anything which can't be reclaimed - e.g. too many dirty /
> writeback pages on a slow device or memlocked pages.  If it were
> really the buffer for GFP_NOWAIT, there's no reason to even make it a
> separate knob and we *may* change how over-high reclaim behaves in the
> future, so let's please not dig ourselves into something too specific.

Yep, come to think of it, you're right. One might want to use the
memory.high knob as the protection, because currently it is the only way
to protect kmemcg against GFP_NOWAIT failures, but it looks more like
abusing it :-/

We should probably think about introducing some kind of watermarks that
would trigger memcg reclaim, asynchronous or direct, on exceeding
them.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
@ 2015-08-28 21:07                   ` Vladimir Davydov
  0 siblings, 0 replies; 66+ messages in thread
From: Vladimir Davydov @ 2015-08-28 21:07 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Michal Hocko, hannes, cgroups, linux-mm, kernel-team

On Fri, Aug 28, 2015 at 04:53:01PM -0400, Tejun Heo wrote:
> On Fri, Aug 28, 2015 at 11:45:54PM +0300, Vladimir Davydov wrote:
> > Actually, memory.high by itself *is* the protection against GFP_NOWAIT
> > allocations, similarly to zone watermarks. W/o it we would have no other
> > choice but fail a GFP_NOWAIT allocation on hitting memory.max. One
> > should just set it so that
> > 
> >   memory.max - memory.high > [max sum size of !__GFP_WAIT allocations
> >                               that can normally occur in a row]
> 
> While this would be true in many cases, I don't think this is the
> intention of the two knobs and the space between high and max can be
> filled up by anything which can't be reclaimed - e.g. too many dirty /
> writeback pages on a slow device or memlocked pages.  If it were
> really the buffer for GFP_NOWAIT, there's no reason to even make it a
> separate knob and we *may* change how over-high reclaim behaves in the
> future, so let's please not dig ourselves into something too specific.

Yep, come to think of it, you're right. One might want to use the
memory.high knob as the protection, because currently it is the only way
to protect kmemcg against GFP_NOWAIT failures, but it looks more like
abusing it :-/

We should probably think about introducing some kind of watermarks that
would trigger memcg reclaim, asynchronous or direct, on exceeding
them.

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
  2015-08-28 21:07                   ` Vladimir Davydov
@ 2015-08-28 21:14                     ` Tejun Heo
  -1 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 21:14 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Michal Hocko, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kernel-team-b10kYP2dOMg

Hey,

On Sat, Aug 29, 2015 at 12:07:04AM +0300, Vladimir Davydov wrote:
> We should probably think about introducing some kind of watermarks that
> would trigger memcg reclaim, asynchronous or direct, on exceeding
> them.

Yeah, for max + kmemcg case, we eventually should do something similar
to the global case where we try to kick off async reclaim before we
hit the hard wall.  Ultimately, I think punting reclaims to workqueue
or return-path is a good idea anyway, so maybe it can be all part of
the same mechanism.  Given that the high limit is the primary control
mechanism on the default hierarchy, it should be fine for now.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
@ 2015-08-28 21:14                     ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 21:14 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: Michal Hocko, hannes, cgroups, linux-mm, kernel-team

Hey,

On Sat, Aug 29, 2015 at 12:07:04AM +0300, Vladimir Davydov wrote:
> We should probably think about introducing some kind of watermarks that
> would trigger memcg reclaim, asynchronous or direct, on exceeding
> them.

Yeah, for max + kmemcg case, we eventually should do something similar
to the global case where we try to kick off async reclaim before we
hit the hard wall.  Ultimately, I think punting reclaims to workqueue
or return-path is a good idea anyway, so maybe it can be all part of
the same mechanism.  Given that the high limit is the primary control
mechanism on the default hierarchy, it should be fine for now.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
  2015-08-28 15:25 ` Tejun Heo
@ 2015-08-28 15:25     ` Tejun Heo
  -1 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw)
  To: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg,
	Tejun Heo

On the default hierarchy, all memory consumption will be accounted
together and controlled by the same set of limits.  Always enable
kmemcg on the default hierarchy.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 mm/memcontrol.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c94b686..8a5dd01 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4362,6 +4362,13 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	if (ret)
 		return ret;
 
+	/* kmem is always accounted together on the default hierarchy */
+	if (cgroup_on_dfl(css->cgroup)) {
+		ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX);
+		if (ret)
+			return ret;
+	}
+
 	/*
 	 * Make sure the memcg is initialized: mem_cgroup_iter()
 	 * orders reading memcg->initialized against its callers
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
@ 2015-08-28 15:25     ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw)
  To: hannes, mhocko; +Cc: cgroups, linux-mm, vdavydov, kernel-team, Tejun Heo

On the default hierarchy, all memory consumption will be accounted
together and controlled by the same set of limits.  Always enable
kmemcg on the default hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 mm/memcontrol.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c94b686..8a5dd01 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4362,6 +4362,13 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	if (ret)
 		return ret;
 
+	/* kmem is always accounted together on the default hierarchy */
+	if (cgroup_on_dfl(css->cgroup)) {
+		ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX);
+		if (ret)
+			return ret;
+	}
+
 	/*
 	 * Make sure the memcg is initialized: mem_cgroup_iter()
 	 * orders reading memcg->initialized against its callers
-- 
2.4.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 66+ messages in thread

[parent not found: <1440775530-18630-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
  2015-08-28 15:25     ` Tejun Heo
@ 2015-08-28 16:49         ` Vladimir Davydov
  -1 siblings, 0 replies; 66+ messages in thread
From: Vladimir Davydov @ 2015-08-28 16:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kernel-team-b10kYP2dOMg

On Fri, Aug 28, 2015 at 11:25:30AM -0400, Tejun Heo wrote:
> On the default hierarchy, all memory consumption will be accounted
> together and controlled by the same set of limits.  Always enable
> kmemcg on the default hierarchy.

IMO we should introduce a boot time knob for disabling it, because kmem
accounting is still not perfect, besides some users might prefer to go
w/o it for performance reasons.

> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> ---
>  mm/memcontrol.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c94b686..8a5dd01 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4362,6 +4362,13 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  	if (ret)
>  		return ret;
>  
> +	/* kmem is always accounted together on the default hierarchy */
> +	if (cgroup_on_dfl(css->cgroup)) {
> +		ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX);
> +		if (ret)
> +			return ret;
> +	}
> +

This is a wrong place for this. The kernel will panic on an attempt to
create a sub memcg, because memcg_init_kmem already enables kmem
accounting in this case. I guess we should add this hunk to
memcg_propagate_kmem instead.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
@ 2015-08-28 16:49         ` Vladimir Davydov
  0 siblings, 0 replies; 66+ messages in thread
From: Vladimir Davydov @ 2015-08-28 16:49 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, mhocko, cgroups, linux-mm, kernel-team

On Fri, Aug 28, 2015 at 11:25:30AM -0400, Tejun Heo wrote:
> On the default hierarchy, all memory consumption will be accounted
> together and controlled by the same set of limits.  Always enable
> kmemcg on the default hierarchy.

IMO we should introduce a boot time knob for disabling it, because kmem
accounting is still not perfect, besides some users might prefer to go
w/o it for performance reasons.

> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  mm/memcontrol.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c94b686..8a5dd01 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4362,6 +4362,13 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  	if (ret)
>  		return ret;
>  
> +	/* kmem is always accounted together on the default hierarchy */
> +	if (cgroup_on_dfl(css->cgroup)) {
> +		ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX);
> +		if (ret)
> +			return ret;
> +	}
> +

This is a wrong place for this. The kernel will panic on an attempt to
create a sub memcg, because memcg_init_kmem already enables kmem
accounting in this case. I guess we should add this hunk to
memcg_propagate_kmem instead.

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
  2015-08-28 16:49         ` Vladimir Davydov
  (?)
@ 2015-08-28 16:56         ` Tejun Heo
  -1 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 16:56 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: hannes, mhocko, cgroups, linux-mm, kernel-team

Hello,

On Fri, Aug 28, 2015 at 07:49:18PM +0300, Vladimir Davydov wrote:
> On Fri, Aug 28, 2015 at 11:25:30AM -0400, Tejun Heo wrote:
> > On the default hierarchy, all memory consumption will be accounted
> > together and controlled by the same set of limits.  Always enable
> > kmemcg on the default hierarchy.
> 
> IMO we should introduce a boot time knob for disabling it, because kmem
> accounting is still not perfect, besides some users might prefer to go
> w/o it for performance reasons.

Yeah, fair enough.

> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c94b686..8a5dd01 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4362,6 +4362,13 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
> >  	if (ret)
> >  		return ret;
> >  
> > +	/* kmem is always accounted together on the default hierarchy */
> > +	if (cgroup_on_dfl(css->cgroup)) {
> > +		ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> 
> This is a wrong place for this. The kernel will panic on an attempt to
> create a sub memcg, because memcg_init_kmem already enables kmem
> accounting in this case. I guess we should add this hunk to
> memcg_propagate_kmem instead.

Yeap, bypassing "parent is active" test in memcg_propagate_kmem()
seems like the right thing to do.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
  2015-08-28 16:49         ` Vladimir Davydov
@ 2015-08-28 17:14           ` Michal Hocko
  -1 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-08-28 17:14 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Tejun Heo, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kernel-team-b10kYP2dOMg

On Fri 28-08-15 19:49:18, Vladimir Davydov wrote:
> On Fri, Aug 28, 2015 at 11:25:30AM -0400, Tejun Heo wrote:
> > On the default hierarchy, all memory consumption will be accounted
> > together and controlled by the same set of limits.  Always enable
> > kmemcg on the default hierarchy.
> 
> IMO we should introduce a boot time knob for disabling it, because kmem
> accounting is still not perfect, besides some users might prefer to go
> w/o it for performance reasons.

I would even argue for opt-in rather than opt-out.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
@ 2015-08-28 17:14           ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-08-28 17:14 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: Tejun Heo, hannes, cgroups, linux-mm, kernel-team

On Fri 28-08-15 19:49:18, Vladimir Davydov wrote:
> On Fri, Aug 28, 2015 at 11:25:30AM -0400, Tejun Heo wrote:
> > On the default hierarchy, all memory consumption will be accounted
> > together and controlled by the same set of limits.  Always enable
> > kmemcg on the default hierarchy.
> 
> IMO we should introduce a boot time knob for disabling it, because kmem
> accounting is still not perfect, besides some users might prefer to go
> w/o it for performance reasons.

I would even argue for opt-in rather than opt-out.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
  2015-08-28 17:14           ` Michal Hocko
  (?)
@ 2015-08-28 17:41           ` Tejun Heo
  2015-09-01 12:44             ` Michal Hocko
  -1 siblings, 1 reply; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 17:41 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Vladimir Davydov, hannes, cgroups, linux-mm, kernel-team

On Fri, Aug 28, 2015 at 07:14:38PM +0200, Michal Hocko wrote:
> On Fri 28-08-15 19:49:18, Vladimir Davydov wrote:
> > On Fri, Aug 28, 2015 at 11:25:30AM -0400, Tejun Heo wrote:
> > > On the default hierarchy, all memory consumption will be accounted
> > > together and controlled by the same set of limits.  Always enable
> > > kmemcg on the default hierarchy.
> > 
> > IMO we should introduce a boot time knob for disabling it, because kmem
> > accounting is still not perfect, besides some users might prefer to go
> > w/o it for performance reasons.
> 
> I would even argue for opt-in rather than opt-out.

Definitely not.  We wanna put all memory consumptions under the same
roof by default.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
  2015-08-28 17:41           ` Tejun Heo
@ 2015-09-01 12:44             ` Michal Hocko
       [not found]               ` <20150901124459.GC8810-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2015-09-01 12:44 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Vladimir Davydov, hannes, cgroups, linux-mm, kernel-team

On Fri 28-08-15 13:41:40, Tejun Heo wrote:
> On Fri, Aug 28, 2015 at 07:14:38PM +0200, Michal Hocko wrote:
> > On Fri 28-08-15 19:49:18, Vladimir Davydov wrote:
> > > On Fri, Aug 28, 2015 at 11:25:30AM -0400, Tejun Heo wrote:
> > > > On the default hierarchy, all memory consumption will be accounted
> > > > together and controlled by the same set of limits.  Always enable
> > > > kmemcg on the default hierarchy.
> > > 
> > > IMO we should introduce a boot time knob for disabling it, because kmem
> > > accounting is still not perfect, besides some users might prefer to go
> > > w/o it for performance reasons.
> > 
> > I would even argue for opt-in rather than opt-out.
> 
> Definitely not. 

The runtime overhead is not negligible and I do not see why everybody
should be paying that price by default. I can definitely see the reason why
somebody would want to enable the kmem accounting but many users will
probably never care because the kernel footprint would be in the noise
wrt. user memory.

> We wanna put all memory consumptions under the same roof by default.

But I am not sure we will ever achieve this. E.g. hugetlb memory is way
too different to be under the same charging by default IMO. Also all
the random drivers calling into the page allocator directly in the user
context would need to charge explicitly.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

[parent not found: <20150901124459.GC8810-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
  2015-09-01 12:44             ` Michal Hocko
@ 2015-09-01 18:51                   ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-09-01 18:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vladimir Davydov, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kernel-team-b10kYP2dOMg

Hello,

On Tue, Sep 01, 2015 at 02:44:59PM +0200, Michal Hocko wrote:
> The runtime overhead is not negligible and I do not see why everybody
> should be paying that price by default. I can definitely see the reason why
> somebody would want to enable the kmem accounting but many users will
> probably never care because the kernel footprint would be in the noise
> wrt. user memory.

We said the same thing about hierarchy support.  Sure, it's not the
same but I think it's wiser to keep the architectural decisions at a
higher level.  I don't think kmem overhead is that high but if this
actually is a problem we'd need a per-cgroup knob anyway.

> > We wanna put all memory consumptions under the same roof by default.
> 
> But I am not sure we will ever achieve this. E.g. hugetlb memory is way
> too different to be under the same charging by default IMO. Also all
> the random drivers calling into the page allocator directly in the user
> context would need to charge explicitly.

Oh I meant the big ones.  I don't think we'll achieve 100% coverage
either but even just catching the major ones, kmem and tcp socket
buffers, should remove most ambiguities around memory consumption.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
@ 2015-09-01 18:51                   ` Tejun Heo
  0 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-09-01 18:51 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Vladimir Davydov, hannes, cgroups, linux-mm, kernel-team

Hello,

On Tue, Sep 01, 2015 at 02:44:59PM +0200, Michal Hocko wrote:
> The runtime overhead is not negligible and I do not see why everybody
> should be paying that price by default. I can definitely see the reason why
> somebody would want to enable the kmem accounting but many users will
> probably never care because the kernel footprint would be in the noise
> wrt. user memory.

We said the same thing about hierarchy support.  Sure, it's not the
same but I think it's wiser to keep the architectural decisions at a
higher level.  I don't think kmem overhead is that high but if this
actually is a problem we'd need a per-cgroup knob anyway.

> > We wanna put all memory consumptions under the same roof by default.
> 
> But I am not sure we will ever achieve this. E.g. hugetlb memory is way
> too different to be under the same charging by default IMO. Also all
> the random drivers calling into the page allocator directly in the user
> context would need to charge explicitly.

Oh I meant the big ones.  I don't think we'll achieve 100% coverage
either but even just catching the major ones, kmem and tcp socket
buffers, should remove most ambiguities around memory consumption.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
  2015-09-01 18:51                   ` Tejun Heo
  (?)
@ 2015-09-04 13:30                   ` Michal Hocko
       [not found]                     ` <20150904133038.GC8220-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2015-09-04 16:18                     ` Tejun Heo
  -1 siblings, 2 replies; 66+ messages in thread
From: Michal Hocko @ 2015-09-04 13:30 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Vladimir Davydov, hannes, cgroups, linux-mm, kernel-team

On Tue 01-09-15 14:51:57, Tejun Heo wrote:
> Hello,
> 
> On Tue, Sep 01, 2015 at 02:44:59PM +0200, Michal Hocko wrote:
> > The runtime overhead is not negligible and I do not see why everybody
> > should be paying that price by default. I can definitely see the reason why
> > somebody would want to enable the kmem accounting but many users will
> > probably never care because the kernel footprint would be in the noise
> > wrt. user memory.
> 
> We said the same thing about hierarchy support.  Sure, it's not the
> same but I think it's wiser to keep the architectural decisions at a
> higher level.  I don't think kmem overhead is that high but if this
> actually is a problem we'd need a per-cgroup knob anyway.

The overhead was around 4% for the basic kbuild test without ever
triggering the [k]memcg limit last time I checked. This was quite some
time ago and things might have changed since then. Even when this got
better there will still be _some_ overhead because we have to track that
memory and that is not free.

The question really is whether kmem accounting is so generally useful
that the overhead is acceptable and it is should be enabled by
default. From my POV it is a useful mitigation of untrusted users but
many loads simply do not care because they only care about a certain
level of isolation.

I might be wrong here of course but if the default should be switched it
would deserve a better justification with some numbers so that people
can see the possible drawbacks.

I agree that the per-cgroup knob is better than the global one. We
should also find consensus whether the legacy semantic of k < u limit
should be preserved. It made sense to me at the time it was introduced
but I recall that Vladimir found it not really helpful when we discussed
that at LSF. I found it interesting e.g. for the rough task count
limiting use case which people were asking for.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

[parent not found: <20150904133038.GC8220-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
  2015-09-04 13:30                   ` Michal Hocko
@ 2015-09-04 15:38                         ` Vladimir Davydov
  2015-09-04 16:18                     ` Tejun Heo
  1 sibling, 0 replies; 66+ messages in thread
From: Vladimir Davydov @ 2015-09-04 15:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kernel-team-b10kYP2dOMg

On Fri, Sep 04, 2015 at 03:30:38PM +0200, Michal Hocko wrote:
> On Tue 01-09-15 14:51:57, Tejun Heo wrote:
> > Hello,
> > 
> > On Tue, Sep 01, 2015 at 02:44:59PM +0200, Michal Hocko wrote:
> > > The runtime overhead is not negligible and I do not see why everybody
> > > should be paying that price by default. I can definitely see the reason why
> > > somebody would want to enable the kmem accounting but many users will
> > > probably never care because the kernel footprint would be in the noise
> > > wrt. user memory.
> > 
> > We said the same thing about hierarchy support.  Sure, it's not the
> > same but I think it's wiser to keep the architectural decisions at a
> > higher level.  I don't think kmem overhead is that high but if this
> > actually is a problem we'd need a per-cgroup knob anyway.
> 
> The overhead was around 4% for the basic kbuild test without ever
> triggering the [k]memcg limit last time I checked. This was quite some
> time ago and things might have changed since then. Even when this got
> better there will still be _some_ overhead because we have to track that
> memory and that is not free.

Just like there is some overhead if a process is placed in memcg w/o
kmem accounting enabled.

> 
> The question really is whether kmem accounting is so generally useful
> that the overhead is acceptable and it is should be enabled by
> default. From my POV it is a useful mitigation of untrusted users but
> many loads simply do not care because they only care about a certain
> level of isolation.

FWIW, I've seen a useful workload that generated tons of negative
dentries for some reason (if my memory doesn't fail, it was nginx web
server). If one starts such a workload inside a container w/o kmem
accounting, it might evict useful data from other containers. So, even
if a container is trusted, it might be still worth having kmem
accounting enabled.

> 
> I might be wrong here of course but if the default should be switched it
> would deserve a better justification with some numbers so that people
> can see the possible drawbacks.

Personally, I'd prefer to have it switched on by default, because it
would force people test it and report bugs and performance degradation.
If one finds it really crappy, he/she should be able to disable it.

> 
> I agree that the per-cgroup knob is better than the global one. We

Not that sure :-/

> should also find consensus whether the legacy semantic of k < u limit
> should be preserved. It made sense to me at the time it was introduced
> but I recall that Vladimir found it not really helpful when we discussed
> that at LSF. I found it interesting e.g. for the rough task count
> limiting use case which people were asking for.

There is the pids cgroup, which suits this purpose much better.

K < U adds a lot of complexity to reclaim, while it's not clear whether
we really need it. For instance, when you hit K you should reclaim kmem
only, but there is kmem that is pinned by umem, e.g. radix tree nodes or
buffer heads. What should we do with them? Reclaim umem on hitting kmem
limit? IMO ugly.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
@ 2015-09-04 15:38                         ` Vladimir Davydov
  0 siblings, 0 replies; 66+ messages in thread
From: Vladimir Davydov @ 2015-09-04 15:38 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tejun Heo, hannes, cgroups, linux-mm, kernel-team

On Fri, Sep 04, 2015 at 03:30:38PM +0200, Michal Hocko wrote:
> On Tue 01-09-15 14:51:57, Tejun Heo wrote:
> > Hello,
> > 
> > On Tue, Sep 01, 2015 at 02:44:59PM +0200, Michal Hocko wrote:
> > > The runtime overhead is not negligible and I do not see why everybody
> > > should be paying that price by default. I can definitely see the reason why
> > > somebody would want to enable the kmem accounting but many users will
> > > probably never care because the kernel footprint would be in the noise
> > > wrt. user memory.
> > 
> > We said the same thing about hierarchy support.  Sure, it's not the
> > same but I think it's wiser to keep the architectural decisions at a
> > higher level.  I don't think kmem overhead is that high but if this
> > actually is a problem we'd need a per-cgroup knob anyway.
> 
> The overhead was around 4% for the basic kbuild test without ever
> triggering the [k]memcg limit last time I checked. This was quite some
> time ago and things might have changed since then. Even when this got
> better there will still be _some_ overhead because we have to track that
> memory and that is not free.

Just like there is some overhead if a process is placed in memcg w/o
kmem accounting enabled.

> 
> The question really is whether kmem accounting is so generally useful
> that the overhead is acceptable and it is should be enabled by
> default. From my POV it is a useful mitigation of untrusted users but
> many loads simply do not care because they only care about a certain
> level of isolation.

FWIW, I've seen a useful workload that generated tons of negative
dentries for some reason (if my memory doesn't fail, it was nginx web
server). If one starts such a workload inside a container w/o kmem
accounting, it might evict useful data from other containers. So, even
if a container is trusted, it might be still worth having kmem
accounting enabled.

> 
> I might be wrong here of course but if the default should be switched it
> would deserve a better justification with some numbers so that people
> can see the possible drawbacks.

Personally, I'd prefer to have it switched on by default, because it
would force people test it and report bugs and performance degradation.
If one finds it really crappy, he/she should be able to disable it.

> 
> I agree that the per-cgroup knob is better than the global one. We

Not that sure :-/

> should also find consensus whether the legacy semantic of k < u limit
> should be preserved. It made sense to me at the time it was introduced
> but I recall that Vladimir found it not really helpful when we discussed
> that at LSF. I found it interesting e.g. for the rough task count
> limiting use case which people were asking for.

There is the pids cgroup, which suits this purpose much better.

K < U adds a lot of complexity to reclaim, while it's not clear whether
we really need it. For instance, when you hit K you should reclaim kmem
only, but there is kmem that is pinned by umem, e.g. radix tree nodes or
buffer heads. What should we do with them? Reclaim umem on hitting kmem
limit? IMO ugly.

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
  2015-09-04 15:38                         ` Vladimir Davydov
@ 2015-09-07  9:39                           ` Michal Hocko
  -1 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-09-07  9:39 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Tejun Heo, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kernel-team-b10kYP2dOMg

On Fri 04-09-15 18:38:10, Vladimir Davydov wrote:
> On Fri, Sep 04, 2015 at 03:30:38PM +0200, Michal Hocko wrote:
> > On Tue 01-09-15 14:51:57, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Tue, Sep 01, 2015 at 02:44:59PM +0200, Michal Hocko wrote:
> > > > The runtime overhead is not negligible and I do not see why everybody
> > > > should be paying that price by default. I can definitely see the reason why
> > > > somebody would want to enable the kmem accounting but many users will
> > > > probably never care because the kernel footprint would be in the noise
> > > > wrt. user memory.
> > > 
> > > We said the same thing about hierarchy support.  Sure, it's not the
> > > same but I think it's wiser to keep the architectural decisions at a
> > > higher level.  I don't think kmem overhead is that high but if this
> > > actually is a problem we'd need a per-cgroup knob anyway.
> > 
> > The overhead was around 4% for the basic kbuild test without ever
> > triggering the [k]memcg limit last time I checked. This was quite some
> > time ago and things might have changed since then. Even when this got
> > better there will still be _some_ overhead because we have to track that
> > memory and that is not free.
> 
> Just like there is some overhead if a process is placed in memcg w/o
> kmem accounting enabled.

Sure I am not questioning that.

> > The question really is whether kmem accounting is so generally useful
> > that the overhead is acceptable and it is should be enabled by
> > default. From my POV it is a useful mitigation of untrusted users but
> > many loads simply do not care because they only care about a certain
> > level of isolation.
> 
> FWIW, I've seen a useful workload that generated tons of negative
> dentries for some reason (if my memory doesn't fail, it was nginx web
> server). If one starts such a workload inside a container w/o kmem
> accounting, it might evict useful data from other containers. So, even
> if a container is trusted, it might be still worth having kmem
> accounting enabled.

OK, then this is a clear usecase for using kmem mem. I was merely
pointing out that many others will not care about kmem. Is it majority?
I dunno. But I haven't heard any convincing argument that those that
need kmem would form a majority either.

> > I might be wrong here of course but if the default should be switched it
> > would deserve a better justification with some numbers so that people
> > can see the possible drawbacks.
> 
> Personally, I'd prefer to have it switched on by default, because it
> would force people test it and report bugs and performance degradation.
> If one finds it really crappy, he/she should be able to disable it.

I do not think this is the way of introducing new functionality. You do
not want to force users to debug your code and go let it disable if it
is too crappy.

> > I agree that the per-cgroup knob is better than the global one. We
> 
> Not that sure :-/

Why?

> > should also find consensus whether the legacy semantic of k < u limit
> > should be preserved. It made sense to me at the time it was introduced
> > but I recall that Vladimir found it not really helpful when we discussed
> > that at LSF. I found it interesting e.g. for the rough task count
> > limiting use case which people were asking for.
> 
> There is the pids cgroup, which suits this purpose much better.

I am not familiar with this controller. I am not following cgroup
mailing list too closely but I vaguely remember somebody proposing this
controller but I am not sure what is the current status. The last time
this has been discussed there was a general pushback to use kmem
accounting for process count restriction.

> K < U adds a lot of complexity to reclaim, while it's not clear whether
> we really need it. For instance, when you hit K you should reclaim kmem
> only, but there is kmem that is pinned by umem, e.g. radix tree nodes or
> buffer heads. What should we do with them? Reclaim umem on hitting kmem
> limit? IMO ugly.

kmem as a hard limit could simply reclaim slab objects and fail if it
doesn't succeed. Sure many objects might be pinned by other resources
which are not reclaimable but that is possible with the global case as
well.
I can see your argument that the configuration might be quite tricky,
though. If there is a general consensus that kernel memory bound to
resources which need to be controllable will get its own way of
controlling then a separate K limit would indeed be not needed.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
@ 2015-09-07  9:39                           ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-09-07  9:39 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: Tejun Heo, hannes, cgroups, linux-mm, kernel-team

On Fri 04-09-15 18:38:10, Vladimir Davydov wrote:
> On Fri, Sep 04, 2015 at 03:30:38PM +0200, Michal Hocko wrote:
> > On Tue 01-09-15 14:51:57, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Tue, Sep 01, 2015 at 02:44:59PM +0200, Michal Hocko wrote:
> > > > The runtime overhead is not negligible and I do not see why everybody
> > > > should be paying that price by default. I can definitely see the reason why
> > > > somebody would want to enable the kmem accounting but many users will
> > > > probably never care because the kernel footprint would be in the noise
> > > > wrt. user memory.
> > > 
> > > We said the same thing about hierarchy support.  Sure, it's not the
> > > same but I think it's wiser to keep the architectural decisions at a
> > > higher level.  I don't think kmem overhead is that high but if this
> > > actually is a problem we'd need a per-cgroup knob anyway.
> > 
> > The overhead was around 4% for the basic kbuild test without ever
> > triggering the [k]memcg limit last time I checked. This was quite some
> > time ago and things might have changed since then. Even when this got
> > better there will still be _some_ overhead because we have to track that
> > memory and that is not free.
> 
> Just like there is some overhead if a process is placed in memcg w/o
> kmem accounting enabled.

Sure I am not questioning that.

> > The question really is whether kmem accounting is so generally useful
> > that the overhead is acceptable and it is should be enabled by
> > default. From my POV it is a useful mitigation of untrusted users but
> > many loads simply do not care because they only care about a certain
> > level of isolation.
> 
> FWIW, I've seen a useful workload that generated tons of negative
> dentries for some reason (if my memory doesn't fail, it was nginx web
> server). If one starts such a workload inside a container w/o kmem
> accounting, it might evict useful data from other containers. So, even
> if a container is trusted, it might be still worth having kmem
> accounting enabled.

OK, then this is a clear usecase for using kmem mem. I was merely
pointing out that many others will not care about kmem. Is it majority?
I dunno. But I haven't heard any convincing argument that those that
need kmem would form a majority either.

> > I might be wrong here of course but if the default should be switched it
> > would deserve a better justification with some numbers so that people
> > can see the possible drawbacks.
> 
> Personally, I'd prefer to have it switched on by default, because it
> would force people test it and report bugs and performance degradation.
> If one finds it really crappy, he/she should be able to disable it.

I do not think this is the way of introducing new functionality. You do
not want to force users to debug your code and go let it disable if it
is too crappy.

> > I agree that the per-cgroup knob is better than the global one. We
> 
> Not that sure :-/

Why?

> > should also find consensus whether the legacy semantic of k < u limit
> > should be preserved. It made sense to me at the time it was introduced
> > but I recall that Vladimir found it not really helpful when we discussed
> > that at LSF. I found it interesting e.g. for the rough task count
> > limiting use case which people were asking for.
> 
> There is the pids cgroup, which suits this purpose much better.

I am not familiar with this controller. I am not following cgroup
mailing list too closely but I vaguely remember somebody proposing this
controller but I am not sure what is the current status. The last time
this has been discussed there was a general pushback to use kmem
accounting for process count restriction.

> K < U adds a lot of complexity to reclaim, while it's not clear whether
> we really need it. For instance, when you hit K you should reclaim kmem
> only, but there is kmem that is pinned by umem, e.g. radix tree nodes or
> buffer heads. What should we do with them? Reclaim umem on hitting kmem
> limit? IMO ugly.

kmem as a hard limit could simply reclaim slab objects and fail if it
doesn't succeed. Sure many objects might be pinned by other resources
which are not reclaimable but that is possible with the global case as
well.
I can see your argument that the configuration might be quite tricky,
though. If there is a general consensus that kernel memory bound to
resources which need to be controllable will get its own way of
controlling then a separate K limit would indeed be not needed.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
  2015-09-07  9:39                           ` Michal Hocko
  (?)
@ 2015-09-07 10:01                           ` Vladimir Davydov
  2015-09-07 11:03                             ` Michal Hocko
  -1 siblings, 1 reply; 66+ messages in thread
From: Vladimir Davydov @ 2015-09-07 10:01 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tejun Heo, hannes, cgroups, linux-mm, kernel-team

On Mon, Sep 07, 2015 at 11:39:06AM +0200, Michal Hocko wrote:
...
> > > I might be wrong here of course but if the default should be switched it
> > > would deserve a better justification with some numbers so that people
> > > can see the possible drawbacks.
> > 
> > Personally, I'd prefer to have it switched on by default, because it
> > would force people test it and report bugs and performance degradation.
> > If one finds it really crappy, he/she should be able to disable it.
> 
> I do not think this is the way of introducing new functionality. You do
> not want to force users to debug your code and go let it disable if it
> is too crappy.

One must first enable CONFIG_KMEM, which is off by default.

Anyway, we aren't talking about enabling it by default in the legacy
hierarchy, only in the unified hierarchy, which must be explicitly
enabled by passing __DEVEL__save_behavior. I think that's enough.

> 
> > > I agree that the per-cgroup knob is better than the global one. We
> > 
> > Not that sure :-/
> 
> Why?

I'm not sure there is use cases which need having kmem acct enabled in
one cgroup and disabled in another.

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
  2015-09-07 10:01                           ` Vladimir Davydov
@ 2015-09-07 11:03                             ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-09-07 11:03 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: Tejun Heo, hannes, cgroups, linux-mm, kernel-team

On Mon 07-09-15 13:01:10, Vladimir Davydov wrote:
> On Mon, Sep 07, 2015 at 11:39:06AM +0200, Michal Hocko wrote:
> ...
> > > > I might be wrong here of course but if the default should be switched it
> > > > would deserve a better justification with some numbers so that people
> > > > can see the possible drawbacks.
> > > 
> > > Personally, I'd prefer to have it switched on by default, because it
> > > would force people test it and report bugs and performance degradation.
> > > If one finds it really crappy, he/she should be able to disable it.
> > 
> > I do not think this is the way of introducing new functionality. You do
> > not want to force users to debug your code and go let it disable if it
> > is too crappy.
> 
> One must first enable CONFIG_KMEM, which is off by default.

Yes, but let's be realistic here. People tend to use distribution
kernels and so the config option will have to be enabled.
 
> Anyway, we aren't talking about enabling it by default in the legacy
> hierarchy, only in the unified hierarchy, which must be explicitly
> enabled by passing __DEVEL__save_behavior. I think that's enough.

But once it is set like that in default then it will stay even when the
__DEVEL__ part is dropped. 
 
> > > > I agree that the per-cgroup knob is better than the global one. We
> > > 
> > > Not that sure :-/
> > 
> > Why?
> 
> I'm not sure there is use cases which need having kmem acct enabled in
> one cgroup and disabled in another.

What would be the downside though? It is true that all the static keys
will be enabled in these configurations so even the non-enabled path
would pay some overhead but that should be still close to unmeasurable
(I haven't measured that so I might be wrong here).
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
  2015-09-04 13:30                   ` Michal Hocko
       [not found]                     ` <20150904133038.GC8220-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2015-09-04 16:18                     ` Tejun Heo
       [not found]                       ` <20150904161845.GB25329-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  1 sibling, 1 reply; 66+ messages in thread
From: Tejun Heo @ 2015-09-04 16:18 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Vladimir Davydov, hannes, cgroups, linux-mm, kernel-team

Hello, Michal.

On Fri, Sep 04, 2015 at 03:30:38PM +0200, Michal Hocko wrote:
> The overhead was around 4% for the basic kbuild test without ever
> triggering the [k]memcg limit last time I checked. This was quite some
> time ago and things might have changed since then. Even when this got
> better there will still be _some_ overhead because we have to track that
> memory and that is not free.

So, I just ran small scale tests and I don't see any meaningful
difference between kmemcg disabled and enabled for kbuild workload
(limit is never reached in both cases, memory is reclaimed from global
pressure).  The difference in kernel time usage.  I'm sure there's
*some* overhead buried in the noise but given the current
implementation, I can't see how enabling kmem would lead to 4%
overhead in kbuild tests.  It isn't that kernel intensive to begin
with.

> The question really is whether kmem accounting is so generally useful
> that the overhead is acceptable and it is should be enabled by
> default. From my POV it is a useful mitigation of untrusted users but
> many loads simply do not care because they only care about a certain
> level of isolation.

I don't think that's the right way to approach the problem.  Given
that the cost isn't prohibitive, no user only care about a certain
level of isolation willingly.  Distributing memory is what it's all
about after all and memory is memory, user or kernel.  We have kmem
on/off situation for historical reasons and because the early
implementation wasn't good enough to be enabled by default.  I get
that there can be special cases, temporary or otherwise, where
disabling kmem is desirable but that gotta be the exception, not the
norm.

> I might be wrong here of course but if the default should be switched it
> would deserve a better justification with some numbers so that people
> can see the possible drawbacks.
>
> I agree that the per-cgroup knob is better than the global one. We
> should also find consensus whether the legacy semantic of k < u limit
> should be preserved. It made sense to me at the time it was introduced
> but I recall that Vladimir found it not really helpful when we discussed
> that at LSF. I found it interesting e.g. for the rough task count
> limiting use case which people were asking for.

Let's please not hinge major design decisions on short-sighted hacks
and overhead considerations.  If task count is an actual resource
which needs to be regulated separatley, we should add a proper
controller for it and we did.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

[parent not found: <20150904161845.GB25329-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>]

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
  2015-09-04 16:18                     ` Tejun Heo
@ 2015-09-07 10:54                           ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-09-07 10:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vladimir Davydov, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kernel-team-b10kYP2dOMg

On Fri 04-09-15 12:18:45, Tejun Heo wrote:
> Hello, Michal.
> 
> On Fri, Sep 04, 2015 at 03:30:38PM +0200, Michal Hocko wrote:
> > The overhead was around 4% for the basic kbuild test without ever
> > triggering the [k]memcg limit last time I checked. This was quite some
> > time ago and things might have changed since then. Even when this got
> > better there will still be _some_ overhead because we have to track that
> > memory and that is not free.
> 
> So, I just ran small scale tests and I don't see any meaningful
> difference between kmemcg disabled and enabled for kbuild workload
> (limit is never reached in both cases, memory is reclaimed from global
> pressure).  The difference in kernel time usage.  I'm sure there's
> *some* overhead buried in the noise but given the current
> implementation, I can't see how enabling kmem would lead to 4%
> overhead in kbuild tests.  It isn't that kernel intensive to begin
> with.

OK, I've quickly rerun my test on 32CPU machine with 64G of RAM
Elapsed
logs.kmem: min: 68.10 max: 69.27 avg: 68.53 std: 0.53 runs: 3
logs.no.kmem: min: 64.08 [94.1%] max: 68.42 [98.8%] avg: 66.22 [96.6%] std: 1.77 runs: 3
User
logs.kmem: min: 867.68 max: 872.88 avg: 869.49 std: 2.40 runs: 3
logs.no.kmem: min: 865.99 [99.8%] max: 884.94 [101.4%] avg: 874.08 [100.5%] std: 7.98 runs: 3
System
logs.kmem: min: 78.50 max: 78.85 avg: 78.63 std: 0.16 runs: 3
logs.no.kmem: min: 75.36 [96.0%] max: 80.50 [102.1%] avg: 77.91 [99.1%] std: 2.10 runs: 3

The elapsed time is still ~3% worse in average while user and system are
in noise. I haven't checked where he overhead is coming from.
 
> > The question really is whether kmem accounting is so generally useful
> > that the overhead is acceptable and it is should be enabled by
> > default. From my POV it is a useful mitigation of untrusted users but
> > many loads simply do not care because they only care about a certain
> > level of isolation.
> 
> I don't think that's the right way to approach the problem.  Given
> that the cost isn't prohibitive, no user only care about a certain
> level of isolation willingly.

I haven't said it is prohibitive. It is simply non-zero and there is
always cost/benefit that should be considered.

> Distributing memory is what it's all about after all and memory is
> memory, user or kernel.

True except that kmem accounting doesn't cover the whole kernel memory
usage. It is an opt-in mechanism for a _better_ isolation. And the
question really is whether that better isolation is needed/requested by
default.

> We have kmem
> on/off situation for historical reasons and because the early
> implementation wasn't good enough to be enabled by default.  I get
> that there can be special cases, temporary or otherwise, where
> disabling kmem is desirable but that gotta be the exception, not the
> norm.

The default should be the cheapest one IMHO. And our overhead is really
close to 0 if no memcg accounting is enabled thanks to Johannes'
page_counters. Then we have a lightweight form of accounting (only user
memory) which is nicely defined. And then we have an additional opt-in
for a better isolation which involves some kernel memory as well. Why
should we conflate the last two? I mean, if somebody wants an additional
protection then sure, enable kmem and pay an additional overhead but why
to force this on everybody who wants to use memcg?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
@ 2015-09-07 10:54                           ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-09-07 10:54 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Vladimir Davydov, hannes, cgroups, linux-mm, kernel-team

On Fri 04-09-15 12:18:45, Tejun Heo wrote:
> Hello, Michal.
> 
> On Fri, Sep 04, 2015 at 03:30:38PM +0200, Michal Hocko wrote:
> > The overhead was around 4% for the basic kbuild test without ever
> > triggering the [k]memcg limit last time I checked. This was quite some
> > time ago and things might have changed since then. Even when this got
> > better there will still be _some_ overhead because we have to track that
> > memory and that is not free.
> 
> So, I just ran small scale tests and I don't see any meaningful
> difference between kmemcg disabled and enabled for kbuild workload
> (limit is never reached in both cases, memory is reclaimed from global
> pressure).  The difference in kernel time usage.  I'm sure there's
> *some* overhead buried in the noise but given the current
> implementation, I can't see how enabling kmem would lead to 4%
> overhead in kbuild tests.  It isn't that kernel intensive to begin
> with.

OK, I've quickly rerun my test on 32CPU machine with 64G of RAM
Elapsed
logs.kmem: min: 68.10 max: 69.27 avg: 68.53 std: 0.53 runs: 3
logs.no.kmem: min: 64.08 [94.1%] max: 68.42 [98.8%] avg: 66.22 [96.6%] std: 1.77 runs: 3
User
logs.kmem: min: 867.68 max: 872.88 avg: 869.49 std: 2.40 runs: 3
logs.no.kmem: min: 865.99 [99.8%] max: 884.94 [101.4%] avg: 874.08 [100.5%] std: 7.98 runs: 3
System
logs.kmem: min: 78.50 max: 78.85 avg: 78.63 std: 0.16 runs: 3
logs.no.kmem: min: 75.36 [96.0%] max: 80.50 [102.1%] avg: 77.91 [99.1%] std: 2.10 runs: 3

The elapsed time is still ~3% worse in average while user and system are
in noise. I haven't checked where he overhead is coming from.
 
> > The question really is whether kmem accounting is so generally useful
> > that the overhead is acceptable and it is should be enabled by
> > default. From my POV it is a useful mitigation of untrusted users but
> > many loads simply do not care because they only care about a certain
> > level of isolation.
> 
> I don't think that's the right way to approach the problem.  Given
> that the cost isn't prohibitive, no user only care about a certain
> level of isolation willingly.

I haven't said it is prohibitive. It is simply non-zero and there is
always cost/benefit that should be considered.

> Distributing memory is what it's all about after all and memory is
> memory, user or kernel.

True except that kmem accounting doesn't cover the whole kernel memory
usage. It is an opt-in mechanism for a _better_ isolation. And the
question really is whether that better isolation is needed/requested by
default.

> We have kmem
> on/off situation for historical reasons and because the early
> implementation wasn't good enough to be enabled by default.  I get
> that there can be special cases, temporary or otherwise, where
> disabling kmem is desirable but that gotta be the exception, not the
> norm.

The default should be the cheapest one IMHO. And our overhead is really
close to 0 if no memcg accounting is enabled thanks to Johannes'
page_counters. Then we have a lightweight form of accounting (only user
memory) which is nicely defined. And then we have an additional opt-in
for a better isolation which involves some kernel memory as well. Why
should we conflate the last two? I mean, if somebody wants an additional
protection then sure, enable kmem and pay an additional overhead but why
to force this on everybody who wants to use memcg?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
  2015-09-07 10:54                           ` Michal Hocko
  (?)
@ 2015-09-08 18:50                           ` Tejun Heo
  -1 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-09-08 18:50 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Vladimir Davydov, hannes, cgroups, linux-mm, kernel-team

Hello, Michal.

On Mon, Sep 07, 2015 at 12:54:37PM +0200, Michal Hocko wrote:
> OK, I've quickly rerun my test on 32CPU machine with 64G of RAM
> Elapsed
> logs.kmem: min: 68.10 max: 69.27 avg: 68.53 std: 0.53 runs: 3
> logs.no.kmem: min: 64.08 [94.1%] max: 68.42 [98.8%] avg: 66.22 [96.6%] std: 1.77 runs: 3
> User
> logs.kmem: min: 867.68 max: 872.88 avg: 869.49 std: 2.40 runs: 3
> logs.no.kmem: min: 865.99 [99.8%] max: 884.94 [101.4%] avg: 874.08 [100.5%] std: 7.98 runs: 3
> System
> logs.kmem: min: 78.50 max: 78.85 avg: 78.63 std: 0.16 runs: 3
> logs.no.kmem: min: 75.36 [96.0%] max: 80.50 [102.1%] avg: 77.91 [99.1%] std: 2.10 runs: 3
> 
> The elapsed time is still ~3% worse in average while user and system are
> in noise. I haven't checked where he overhead is coming from.

Does the cgroup have memory limit configured?  Unless there are
measurement errors, the only way it'd take noticeably longer w/o
incurring more CPU time is spending time blocked on reclaim and
enabling kmem of course adds to memory pressure, which is the intended
behavior.

> > I don't think that's the right way to approach the problem.  Given
> > that the cost isn't prohibitive, no user only care about a certain
> > level of isolation willingly.
> 
> I haven't said it is prohibitive. It is simply non-zero and there is
> always cost/benefit that should be considered.

We do want to hunt down that 3% but locking into interface is an a lot
larger and longer-term commitment.  The cost sure is non-zero but I'd
be surprised if we can't get that down to something generally
acceptable over time given that the domain switching is a relatively
low-frequency event (scheduling) and it's an area where we can
actively make space to speed trade-off.

> > Distributing memory is what it's all about after all and memory is
> > memory, user or kernel.
> 
> True except that kmem accounting doesn't cover the whole kernel memory
> usage. It is an opt-in mechanism for a _better_ isolation. And the
> question really is whether that better isolation is needed/requested by
> default.

It isn't perfect but kmem and socket buffers do cover most of kernel
memory usage that is accountable to userland.  It isn't just a matter
of better or worse.  The goal of cgroup is providing a "reasonable"
isolation.  Sure, we can decide to ignore some but that should be
because the extra accuracy there doesn't matter in the scheme of
things and thus paying the overhead is pointless; however, users
shouldn't need to worry about the different levels of ambiguous
accuracies which can't even be quantified, at least not by default.

Let's please not get lost in perfectionism.  Sure, it can't be perfect
but that doesn't mean an attainable and clear goal doesn't exist here.

> > We have kmem
> > on/off situation for historical reasons and because the early
> > implementation wasn't good enough to be enabled by default.  I get
> > that there can be special cases, temporary or otherwise, where
> > disabling kmem is desirable but that gotta be the exception, not the
> > norm.
> 
> The default should be the cheapest one IMHO. And our overhead is really

That is a way too simplistic and greedy decision criterion.  I don't
think we want to make interface decisions on that.  Overhead
considerations definitely dictate what we can and can't do and that's
why I said that the cost wasn't prohibitive but there are a whole lot
of other things to consider too including where we wanna be
eventually years down the road.

> close to 0 if no memcg accounting is enabled thanks to Johannes'
> page_counters. Then we have a lightweight form of accounting (only user
> memory) which is nicely defined. And then we have an additional opt-in
> for a better isolation which involves some kernel memory as well. Why
> should we conflate the last two? I mean, if somebody wants an additional
> protection then sure, enable kmem and pay an additional overhead but why
> to force this on everybody who wants to use memcg?

Because it betrays the basic goal of reasonable resource isolation.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
  2015-08-28 15:25     ` Tejun Heo
  (?)
  (?)
@ 2015-11-05 17:30     ` Michal Hocko
  -1 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-11-05 17:30 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, cgroups, linux-mm, vdavydov, kernel-team

Just for the reference. This has been discussed as a part of other email
thread discussed here:
http://lkml.kernel.org/r/20151027122647.GG9891%40dhcp22.suse.cz

I am _really_ sorry for hijacking that one - I didn't intend to do
so but my remark ended up in a full discussion. If I knew it would go
that way I wouldn't even mention it.

On Fri 28-08-15 11:25:30, Tejun Heo wrote:
> On the default hierarchy, all memory consumption will be accounted
> together and controlled by the same set of limits.  Always enable
> kmemcg on the default hierarchy.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  mm/memcontrol.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c94b686..8a5dd01 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4362,6 +4362,13 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  	if (ret)
>  		return ret;
>  
> +	/* kmem is always accounted together on the default hierarchy */
> +	if (cgroup_on_dfl(css->cgroup)) {
> +		ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX);
> +		if (ret)
> +			return ret;
> +	}
> +
>  	/*
>  	 * Make sure the memcg is initialized: mem_cgroup_iter()
>  	 * orders reading memcg->initialized against its callers
> -- 
> 2.4.3

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2015-11-05 17:30 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-08-28 15:25 [PATCHSET] memcg: improve high limit behavior and always enable kmemcg on dfl hier Tejun Heo
2015-08-28 15:25 ` Tejun Heo
     [not found] ` <1440775530-18630-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2015-08-28 15:25   ` [PATCH 1/4] memcg: fix over-high reclaim amount Tejun Heo
2015-08-28 15:25     ` Tejun Heo
2015-08-28 17:06     ` Michal Hocko
     [not found]       ` <20150828170612.GA21463-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-08-28 18:32         ` Tejun Heo
2015-08-28 18:32           ` Tejun Heo
     [not found]           ` <20150828183209.GA9423-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2015-08-31  7:51             ` Michal Hocko
2015-08-31  7:51               ` Michal Hocko
     [not found]               ` <20150831075133.GA29723-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-08-31 13:38                 ` Tejun Heo
2015-08-31 13:38                   ` Tejun Heo
2015-09-01 12:51                   ` Michal Hocko
     [not found]                     ` <20150901125149.GD8810-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-09-01 18:33                       ` Tejun Heo
2015-09-01 18:33                         ` Tejun Heo
2015-08-28 15:25   ` [PATCH 2/4] memcg: flatten task_struct->memcg_oom Tejun Heo
2015-08-28 15:25     ` Tejun Heo
     [not found]     ` <1440775530-18630-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2015-08-28 17:11       ` Michal Hocko
2015-08-28 17:11         ` Michal Hocko
2015-08-28 15:25   ` [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path Tejun Heo
2015-08-28 15:25     ` Tejun Heo
     [not found]     ` <1440775530-18630-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2015-08-28 16:36       ` Vladimir Davydov
2015-08-28 16:36         ` Vladimir Davydov
2015-08-28 16:48         ` Tejun Heo
2015-08-28 16:48           ` Tejun Heo
2015-08-28 20:32           ` Vladimir Davydov
2015-08-28 20:44             ` Tejun Heo
2015-08-28 20:44               ` Tejun Heo
2015-08-28 22:06               ` Tejun Heo
     [not found]                 ` <20150828220632.GF11089-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2015-08-29  7:59                   ` Vladimir Davydov
2015-08-29  7:59                     ` Vladimir Davydov
2015-08-30 15:52             ` Vladimir Davydov
2015-08-28 17:13       ` Michal Hocko
2015-08-28 17:13         ` Michal Hocko
     [not found]         ` <20150828171322.GC21463-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-08-28 17:56           ` Tejun Heo
2015-08-28 17:56             ` Tejun Heo
2015-08-28 20:45           ` Vladimir Davydov
2015-08-28 20:45             ` Vladimir Davydov
2015-08-28 20:53             ` Tejun Heo
2015-08-28 20:53               ` Tejun Heo
     [not found]               ` <20150828205301.GB11089-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2015-08-28 21:07                 ` Vladimir Davydov
2015-08-28 21:07                   ` Vladimir Davydov
2015-08-28 21:14                   ` Tejun Heo
2015-08-28 21:14                     ` Tejun Heo
2015-08-28 15:25   ` [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy Tejun Heo
2015-08-28 15:25     ` Tejun Heo
     [not found]     ` <1440775530-18630-5-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2015-08-28 16:49       ` Vladimir Davydov
2015-08-28 16:49         ` Vladimir Davydov
2015-08-28 16:56         ` Tejun Heo
2015-08-28 17:14         ` Michal Hocko
2015-08-28 17:14           ` Michal Hocko
2015-08-28 17:41           ` Tejun Heo
2015-09-01 12:44             ` Michal Hocko
     [not found]               ` <20150901124459.GC8810-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-09-01 18:51                 ` Tejun Heo
2015-09-01 18:51                   ` Tejun Heo
2015-09-04 13:30                   ` Michal Hocko
     [not found]                     ` <20150904133038.GC8220-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-09-04 15:38                       ` Vladimir Davydov
2015-09-04 15:38                         ` Vladimir Davydov
2015-09-07  9:39                         ` Michal Hocko
2015-09-07  9:39                           ` Michal Hocko
2015-09-07 10:01                           ` Vladimir Davydov
2015-09-07 11:03                             ` Michal Hocko
2015-09-04 16:18                     ` Tejun Heo
     [not found]                       ` <20150904161845.GB25329-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2015-09-07 10:54                         ` Michal Hocko
2015-09-07 10:54                           ` Michal Hocko
2015-09-08 18:50                           ` Tejun Heo
2015-11-05 17:30     ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.