* [PATCH -v2 0/6] memcg: some charge path cleanups + css offline vs. charge race fix
From: Michal Hocko @ 2014-02-04 13:28 UTC
To: Johannes Weiner
Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm
Hi,
this is the second version of the series previously posted here:
http://marc.info/?l=linux-mm&m=138729515304263&w=2. It is based on
3.14-rc1. I am still testing it, but another pair of eyes on it would
be great because this piece of code is really tricky.
The first four patches are an attempt to clean up the memcg charging
path a bit. I am fed up with all the different combinations of mm vs.
memcgp parameters, so I have split the function up into two parts
(see the sketch after the next paragraph):
* charge mm
* charge a known memcg
More details are in patch 2. I think this makes more sense.
It was also quite surprising that code reordering alone, without any
functional changes, made the code smaller by 800B.
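To make the split easier to picture, here is a condensed sketch of the
two new entry points. The signatures follow patch 2; the bodies are
heavily simplified and omit stock consumption, batching, reclaim
retries, and OOM handling:

/* Charge a memcg which is already known and referenced by the caller. */
static int mem_cgroup_try_charge_memcg(gfp_t gfp_mask,
				       unsigned int nr_pages,
				       struct mem_cgroup *memcg,
				       bool oom);

/*
 * Resolve the memcg from the given mm and charge it. Returns the
 * charged memcg so the caller can commit the charge, or NULL on OOM.
 */
static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
						   gfp_t gfp_mask,
						   unsigned int nr_pages,
						   bool oom)
{
	struct mem_cgroup *memcg = try_get_mem_cgroup_from_mm(mm);
	int ret;

	if (!memcg)
		return root_mem_cgroup;	/* bypass the charge */
	ret = mem_cgroup_try_charge_memcg(gfp_mask, nr_pages, memcg, oom);
	css_put(&memcg->css);
	if (ret == -EINTR)
		return root_mem_cgroup;	/* bypass the charge */
	return ret ? NULL : memcg;
}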
Johannes has suggested (http://marc.info/?l=linux-mm&m=139144269917488&w=2)
that mem_cgroup_try_charge_mm is not that helpful because the caller
could resolve the proper memcg by calling try_get_mem_cgroup_from_mm
itself, but the next patch will require a retry if the css becomes
offline, and we do not want to duplicate the same logic in each caller.
Patch #4 addresses the memcg charge vs. memcg offline race which is
currently worked around by 96f1c58d8534 (mm: memcg: fix race condition
between memcg teardown and swapin); the core idea is sketched below.
The last patch reverts that workaround.
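In rough outline, the fix in patch 4 makes the online check
(css_tryget & css_put) and the res counter charge happen in one RCU
read section, so css_offline cannot reparent pages between the two.
A simplified sketch (error paths and the memsw counter are omitted):

/* Caller has to hold the rcu read lock. */
static bool mem_cgroup_is_online(struct mem_cgroup *memcg)
{
	bool online = css_tryget(&memcg->css);

	if (online)
		css_put(&memcg->css);
	return online;
}

	/* charge side, inside mem_cgroup_do_charge() */
	rcu_read_lock();
	if (unlikely(!mem_cgroup_is_online(memcg))) {
		rcu_read_unlock();
		return CHARGE_OFFLINE;	/* caller retries with a fresh memcg */
	}
	ret = res_counter_charge(&memcg->res, csize, &fail_res);
	rcu_read_unlock();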
Changes since v1 (based on comments from Johannes):
- renamed mem_cgroup_bypass_charge to current_bypass_charge
- got rid of the try_get_mem_cgroup_from_mm duplication when the mm is charged
- fixed an rcu_read_lock recursion bug in try_get_mem_cgroup_from_mm
- dropped memcg->offline and replaced it with css_tryget & css_put
- fixed a ref leak in the kmem CHARGE_RETRY path
- cleaned up kmem accounting as well
Michal Hocko (6):
memcg: do not replicate try_get_mem_cgroup_from_mm in __mem_cgroup_try_charge
memcg: cleanup charge routines
memcg: mm == NULL is not allowed for mem_cgroup_try_charge_mm
memcg: make sure that memcg is not offline when charging
memcg, kmem: clean up memcg parameter handling
Revert "mm: memcg: fix race condition between memcg teardown and swapin"
Diffstat says:
include/linux/memcontrol.h | 4 +-
mm/memcontrol.c | 341 +++++++++++++++++++++------------------------
mm/page_alloc.c | 2 +-
3 files changed, 162 insertions(+), 185 deletions(-)
* [PATCH -v2 1/6] memcg: do not replicate try_get_mem_cgroup_from_mm in __mem_cgroup_try_charge
From: Michal Hocko @ 2014-02-04 13:28 UTC
To: Johannes Weiner
Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

Johannes Weiner has pointed out that __mem_cgroup_try_charge duplicates
try_get_mem_cgroup_from_mm for charges which came without a memcg. The
only reason seems to be a tiny optimization when css_tryget is not
called if the charge can be consumed from the stock. Nevertheless
css_tryget is very cheap since it has been reworked to use per-cpu
counting so this optimization doesn't give us anything these days.

So let's drop the code duplication so that the code is more readable.
While we are at it also remove a very confusing comment in
try_get_mem_cgroup_from_mm.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 49 ++++++++-----------------------------------------
 1 file changed, 8 insertions(+), 41 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 53385cd4e6f0..042e4ff36c05 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1081,11 +1081,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
 
 	if (!mm)
 		return NULL;
-	/*
-	 * Because we have no locks, mm->owner's may be being moved to other
-	 * cgroup. We use css_tryget() here even if this looks
-	 * pessimistic (rather than adding locks here).
-	 */
+
 	rcu_read_lock();
 	do {
 		memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
@@ -2759,45 +2755,15 @@ again:
 			goto done;
 		css_get(&memcg->css);
 	} else {
-		struct task_struct *p;
-
-		rcu_read_lock();
-		p = rcu_dereference(mm->owner);
-		/*
-		 * Because we don't have task_lock(), "p" can exit.
-		 * In that case, "memcg" can point to root or p can be NULL with
-		 * race with swapoff. Then, we have small risk of mis-accouning.
-		 * But such kind of mis-account by race always happens because
-		 * we don't have cgroup_mutex(). It's overkill and we allo that
-		 * small race, here.
-		 * (*) swapoff at el will charge against mm-struct not against
-		 * task-struct. So, mm->owner can be NULL.
-		 */
-		memcg = mem_cgroup_from_task(p);
-		if (!memcg)
+		memcg = try_get_mem_cgroup_from_mm(mm);
+		if (!memcg) {
 			memcg = root_mem_cgroup;
-		if (mem_cgroup_is_root(memcg)) {
-			rcu_read_unlock();
-			goto done;
-		}
-		if (consume_stock(memcg, nr_pages)) {
-			/*
-			 * It seems dagerous to access memcg without css_get().
-			 * But considering how consume_stok works, it's not
-			 * necessary. If consume_stock success, some charges
-			 * from this memcg are cached on this cpu. So, we
-			 * don't need to call css_get()/css_tryget() before
-			 * calling consume_stock().
-			 */
-			rcu_read_unlock();
 			goto done;
 		}
-		/* after here, we may be blocked. we need to get refcnt */
-		if (!css_tryget(&memcg->css)) {
-			rcu_read_unlock();
-			goto again;
-		}
-		rcu_read_unlock();
+		if (mem_cgroup_is_root(memcg))
+			goto done_put;
+		if (consume_stock(memcg, nr_pages))
+			goto done_put;
 	}
 
 	do {
@@ -2834,6 +2800,7 @@ again:
 
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
+done_put:
 	css_put(&memcg->css);
 done:
 	*ptr = memcg;
-- 
1.9.rc1

* Re: [PATCH -v2 1/6] memcg: do not replicate try_get_mem_cgroup_from_mm in __mem_cgroup_try_charge
From: Johannes Weiner @ 2014-02-04 15:55 UTC
To: Michal Hocko
Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

On Tue, Feb 04, 2014 at 02:28:55PM +0100, Michal Hocko wrote:
[...]
> @@ -1081,11 +1081,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
>  
>  	if (!mm)
>  		return NULL;

While you're at it, this check also seems unnecessary.

[...]
> +		if (mem_cgroup_is_root(memcg))
> +			goto done_put;
> +		if (consume_stock(memcg, nr_pages))
> +			goto done_put;

These two are actually the same in the if (*ptr) branch.

* Re: [PATCH -v2 1/6] memcg: do not replicate try_get_mem_cgroup_from_mm in __mem_cgroup_try_charge
From: Michal Hocko @ 2014-02-04 16:05 UTC
To: Johannes Weiner
Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

On Tue 04-02-14 10:55:08, Johannes Weiner wrote:
> On Tue, Feb 04, 2014 at 02:28:55PM +0100, Michal Hocko wrote:
[...]
> >  	if (!mm)
> >  		return NULL;
>
> While you're at it, this check also seems unnecessary.

Yes, it will be removed in a later patch. I wanted to have it in a
separate patch for better bisectability, just in case I have really
missed an mm-might-be-NULL case.

[...]
> > +		if (mem_cgroup_is_root(memcg))
> > +			goto done_put;
> > +		if (consume_stock(memcg, nr_pages))
> > +			goto done_put;
>
> These two are actually the same in the if (*ptr) branch.

True, I just wanted to keep the patch minimal and do just a single
thing here. The duplication will vanish in the next patch.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH -v2 1/6] memcg: do not replicate try_get_mem_cgroup_from_mm in __mem_cgroup_try_charge
From: Michal Hocko @ 2014-02-05 13:49 UTC
To: Johannes Weiner
Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

On Tue 04-02-14 17:05:21, Michal Hocko wrote:
> On Tue 04-02-14 10:55:08, Johannes Weiner wrote:
> > On Tue, Feb 04, 2014 at 02:28:55PM +0100, Michal Hocko wrote:
[...]
> > >  	if (!mm)
> > >  		return NULL;
> >
> > While you're at it, this check also seems unnecessary.
>
> Yes, it will be removed in a later patch. I wanted to have it in a
> separate patch for better bisectability, just in case I have really
> missed an mm-might-be-NULL case.

Ohh, I mixed that up with the other mm check. You are right, we can
remove this one as well. Thanks, and sorry for the confusion!
-- 
Michal Hocko
SUSE Labs

* [PATCH -v2 2/6] memcg: cleanup charge routines
From: Michal Hocko @ 2014-02-04 13:28 UTC
To: Johannes Weiner
Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

The current core of memcg charging is wild to say the least.
__mem_cgroup_try_charge, which is at the center, tries to be too
clever and handles two independent cases:
* when the memcg to be charged is known in advance
* when the given mm_struct is charged

The resulting callchains are quite complex. All of the following paths
end up in __mem_cgroup_try_charge:

  memcg_charge_kmem(mm=NULL, memcg)
  mem_cgroup_newpage_charge(mm)
	-> mem_cgroup_charge_common(mm, memcg=NULL)
  mem_cgroup_cache_charge(current->mm)
	-> mem_cgroup_charge_common(mm, memcg=NULL)
	-> __mem_cgroup_try_charge_swapin(...)
  mem_cgroup_try_charge_swapin(mm, memcg=NULL)	/* no swap accounting */
	-> __mem_cgroup_try_charge_swapin(mm=NULL, memcg) /* swap accounting */
						|| (mm, memcg=NULL)
  mem_cgroup_do_precharge(mm=NULL, memcg)

  __mem_cgroup_try_charge
	mem_cgroup_do_charge
		res_counter_charge
		mem_cgroup_reclaim
		mem_cgroup_wait_acct_move
		mem_cgroup_oom

This patch splits __mem_cgroup_try_charge into two logical parts.
mem_cgroup_try_charge_mm is responsible for charges for the given
mm_struct and returns the charged memcg (or NULL under OOM), while
mem_cgroup_try_charge_memcg charges a known memcg and returns an error
code.

The only tricky part which remains is __mem_cgroup_try_charge_swapin,
because it can return 0 if PageCgroupUsed is already set and then we do
not want to commit the charge. This is done with a magic combination of
memcg = NULL and ret = 0. So the function preserves its memcgp parameter
and sets the given memcg to NULL when it sees PageCgroupUsed
(__mem_cgroup_commit_charge_swapin then ignores such a commit).

Not only is the code easier to follow, the change reduces the code size
too:

  size mm/built-in.o.before mm/built-in.o.after
     text    data     bss     dec     hex filename
   464718   83038   49904  597660   91e9c mm/built-in.o.before
   463894   83038   49904  596836   91b64 mm/built-in.o.after

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 205 +++++++++++++++++++++++++++-----------------------------
 1 file changed, 98 insertions(+), 107 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 042e4ff36c05..72fbe0fb3320 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2618,7 +2618,7 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 }
 
 
-/* See __mem_cgroup_try_charge() for details */
+/* See mem_cgroup_do_charge() for details */
 enum {
 	CHARGE_OK,		/* success */
 	CHARGE_RETRY,		/* need to retry but retry is not bad */
@@ -2691,108 +2691,69 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	return CHARGE_NOMEM;
 }
 
+static bool current_bypass_charge(void)
+{
+	/*
+	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
+	 * in system level. So, allow to go ahead dying process in addition to
+	 * MEMDIE process.
+	 */
+	if (unlikely(test_thread_flag(TIF_MEMDIE)
+		     || fatal_signal_pending(current)))
+		return true;
+
+	return false;
+}
+
 /*
- * __mem_cgroup_try_charge() does
- * 1. detect memcg to be charged against from passed *mm and *ptr,
- * 2. update res_counter
- * 3. call memory reclaim if necessary.
- *
- * In some special case, if the task is fatal, fatal_signal_pending() or
- * has TIF_MEMDIE, this function returns -EINTR while writing root_mem_cgroup
- * to *ptr. There are two reasons for this. 1: fatal threads should quit as soon
- * as possible without any hazards. 2: all pages should have a valid
- * pc->mem_cgroup. If mm is NULL and the caller doesn't pass a valid memcg
- * pointer, that is treated as a charge to root_mem_cgroup.
- *
- * So __mem_cgroup_try_charge() will return
- *  0       ... on success, filling *ptr with a valid memcg pointer.
- *  -ENOMEM ... charge failure because of resource limits.
- *  -EINTR  ... if thread is fatal. *ptr is filled with root_mem_cgroup.
+ * mem_cgroup_try_charge_memcg - core of the memcg charging code. The caller
+ * keeps a css reference to the given memcg. We do not charge root_mem_cgroup.
+ * OOM is triggered only if allowed by the given oom parameter (except for
+ * __GFP_NOFAIL when it is ignored).
  *
- * Unlike the exported interface, an "oom" parameter is added. if oom==true,
- * the oom-killer can be invoked.
+ * Returns 0 on success, -ENOMEM when the given memcg is under OOM and -EINTR
+ * when the charge is bypassed (either when fatal signals are pending or
+ * __GFP_NOFAIL allocation cannot be charged).
  */
-static int __mem_cgroup_try_charge(struct mm_struct *mm,
-				   gfp_t gfp_mask,
+static int mem_cgroup_try_charge_memcg(gfp_t gfp_mask,
 				   unsigned int nr_pages,
-				   struct mem_cgroup **ptr,
+				   struct mem_cgroup *memcg,
 				   bool oom)
 {
 	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct mem_cgroup *memcg = NULL;
 	int ret;
 
-	/*
-	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
-	 * in system level. So, allow to go ahead dying process in addition to
-	 * MEMDIE process.
-	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE)
-		     || fatal_signal_pending(current)))
+	if (mem_cgroup_is_root(memcg) || current_bypass_charge())
 		goto bypass;
 
 	if (unlikely(task_in_memcg_oom(current)))
 		goto nomem;
 
+	if (consume_stock(memcg, nr_pages))
+		return 0;
+
 	if (gfp_mask & __GFP_NOFAIL)
 		oom = false;
 
-	/*
-	 * We always charge the cgroup the mm_struct belongs to.
-	 * The mm_struct's mem_cgroup changes on task migration if the
-	 * thread group leader migrates. It's possible that mm is not
-	 * set, if so charge the root memcg (happens for pagecache usage).
-	 */
-	if (!*ptr && !mm)
-		*ptr = root_mem_cgroup;
-again:
-	if (*ptr) { /* css should be a valid one */
-		memcg = *ptr;
-		if (mem_cgroup_is_root(memcg))
-			goto done;
-		if (consume_stock(memcg, nr_pages))
-			goto done;
-		css_get(&memcg->css);
-	} else {
-		memcg = try_get_mem_cgroup_from_mm(mm);
-		if (!memcg) {
-			memcg = root_mem_cgroup;
-			goto done;
-		}
-		if (mem_cgroup_is_root(memcg))
-			goto done_put;
-		if (consume_stock(memcg, nr_pages))
-			goto done_put;
-	}
-
 	do {
 		bool invoke_oom = oom && !nr_oom_retries;
 
 		/* If killed, bypass charge */
-		if (fatal_signal_pending(current)) {
-			css_put(&memcg->css);
+		if (fatal_signal_pending(current))
 			goto bypass;
-		}
 
 		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch,
 					   nr_pages, invoke_oom);
 		switch (ret) {
-		case CHARGE_OK:
-			break;
 		case CHARGE_RETRY: /* not in OOM situation but retry */
 			batch = nr_pages;
-			css_put(&memcg->css);
-			memcg = NULL;
-			goto again;
+			break;
 		case CHARGE_WOULDBLOCK: /* !__GFP_WAIT */
-			css_put(&memcg->css);
 			goto nomem;
 		case CHARGE_NOMEM: /* OOM routine works */
-			if (!oom || invoke_oom) {
-				css_put(&memcg->css);
+			if (!oom || invoke_oom)
 				goto nomem;
-			}
 			nr_oom_retries--;
 			break;
 		}
@@ -2800,22 +2761,51 @@ again:
 
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
-done_put:
-	css_put(&memcg->css);
-done:
-	*ptr = memcg;
+
 	return 0;
 nomem:
-	if (!(gfp_mask & __GFP_NOFAIL)) {
-		*ptr = NULL;
+	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;
-	}
 bypass:
-	*ptr = root_mem_cgroup;
 	return -EINTR;
 }
 
 /*
+ * Charges and returns memcg associated with the given mm (or root_mem_cgroup
+ * if mm is NULL). Returns NULL if memcg is under OOM.
+ */
+static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
+						   gfp_t gfp_mask,
+						   unsigned int nr_pages,
+						   bool oom)
+{
+	struct mem_cgroup *memcg;
+	int ret;
+
+	/*
+	 * We always charge the cgroup the mm_struct belongs to.
+	 * The mm_struct's mem_cgroup changes on task migration if the
+	 * thread group leader migrates. It's possible that mm is not
+	 * set, if so charge the root memcg (happens for pagecache usage).
+	 */
+	if (!mm)
+		goto bypass;
+	memcg = try_get_mem_cgroup_from_mm(mm);
+	if (!memcg)
+		goto bypass;
+
+	ret = mem_cgroup_try_charge_memcg(gfp_mask, nr_pages, memcg, oom);
+	css_put(&memcg->css);
+	if (ret == -EINTR)
+		goto bypass;
+	else if (ret == -ENOMEM)
+		memcg = NULL;
+	return memcg;
+bypass:
+	return root_mem_cgroup;
+}
+
+/*
  * Somemtimes we have to undo a charge we got by try_charge().
  * This function is for that and do uncharge, put css's refcnt.
  * gotten by try_charge().
@@ -3010,21 +3000,19 @@ static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
 static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 {
 	struct res_counter *fail_res;
-	struct mem_cgroup *_memcg;
 	int ret = 0;
 
 	ret = res_counter_charge(&memcg->kmem, size, &fail_res);
 	if (ret)
 		return ret;
 
-	_memcg = memcg;
-	ret = __mem_cgroup_try_charge(NULL, gfp, size >> PAGE_SHIFT,
-				      &_memcg, oom_gfp_allowed(gfp));
+	ret = mem_cgroup_try_charge_memcg(gfp, size >> PAGE_SHIFT,
+					  memcg, oom_gfp_allowed(gfp));
 	if (ret == -EINTR)  {
 		/*
-		 * __mem_cgroup_try_charge() chosed to bypass to root due to
-		 * OOM kill or fatal signal. Since our only options are to
+		 * mem_cgroup_try_charge_memcg() chosed to bypass to root due
+		 * to OOM kill or fatal signal. Since our only options are to
 		 * either fail the allocation or charge it to this cgroup, do
 		 * it as a temporary condition. But we can't fail. From a
 		 * kmem/slab perspective, the cache has already been selected,
@@ -3033,7 +3021,7 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 		 *
 		 * This condition will only trigger if the task entered
 		 * memcg_charge_kmem in a sane state, but was OOM-killed during
-		 * __mem_cgroup_try_charge() above. Tasks that were already
+		 * mem_cgroup_try_charge_memcg() above. Tasks that were already
 		 * dying when the allocation triggers should have been already
 		 * directed to the root cgroup in memcontrol.h
 		 */
@@ -3906,10 +3894,9 @@ out:
 static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 				    gfp_t gfp_mask, enum charge_type ctype)
 {
-	struct mem_cgroup *memcg = NULL;
+	struct mem_cgroup *memcg;
 	unsigned int nr_pages = 1;
 	bool oom = true;
-	int ret;
 
 	if (PageTransHuge(page)) {
 		nr_pages <<= compound_order(page);
@@ -3921,9 +3908,9 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 		oom = false;
 	}
 
-	ret = __mem_cgroup_try_charge(mm, gfp_mask, nr_pages, &memcg, oom);
-	if (ret == -ENOMEM)
-		return ret;
+	memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, nr_pages, oom);
+	if (!memcg)
+		return -ENOMEM;
 	__mem_cgroup_commit_charge(memcg, page, nr_pages, ctype, false);
 	return 0;
 }
@@ -3947,8 +3934,7 @@ int mem_cgroup_newpage_charge(struct page *page,
  * "commit()" or removed by "cancel()"
  */
 static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-					  struct page *page,
-					  gfp_t mask,
+					  struct page *page, gfp_t mask,
 					  struct mem_cgroup **memcgp)
 {
 	struct mem_cgroup *memcg;
@@ -3962,31 +3948,36 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 	 * already charged pages, too.  The USED bit is protected by
 	 * the page lock, which serializes swap cache removal, which
 	 * in turn serializes uncharging.
+	 * Have to set memcg to NULL so that __mem_cgroup_commit_charge_swapin
+	 * will ignore such a page.
 	 */
-	if (PageCgroupUsed(pc))
+	if (PageCgroupUsed(pc)) {
+		*memcgp = NULL;
 		return 0;
+	}
 	if (!do_swap_account)
 		goto charge_cur_mm;
 	memcg = try_get_mem_cgroup_from_page(page);
 	if (!memcg)
 		goto charge_cur_mm;
 	*memcgp = memcg;
-	ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true);
+	ret = mem_cgroup_try_charge_memcg(mask, 1, memcg, true);
 	css_put(&memcg->css);
-	if (ret == -EINTR)
+	if (ret == -EINTR) {
+		*memcgp = root_mem_cgroup;
 		ret = 0;
+	}
 	return ret;
 charge_cur_mm:
-	ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true);
-	if (ret == -EINTR)
-		ret = 0;
-	return ret;
+	*memcgp = mem_cgroup_try_charge_mm(mm, mask, 1, true);
+	if (!*memcgp)
+		return -ENOMEM;
+	return 0;
 }
 
 int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
 				 gfp_t gfp_mask, struct mem_cgroup **memcgp)
 {
-	*memcgp = NULL;
 	if (mem_cgroup_disabled())
 		return 0;
 	/*
@@ -3996,13 +3987,14 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
 	 * there's also a KSM case which does need to charge the page.
 	 */
 	if (!PageSwapCache(page)) {
-		int ret;
+		int ret = 0;
 
-		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, true);
-		if (ret == -EINTR)
-			ret = 0;
+		*memcgp = mem_cgroup_try_charge_mm(mm, gfp_mask, 1, true);
+		if (!*memcgp)
+			ret = -ENOMEM;
 		return ret;
 	}
+
 	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp);
 }
 
@@ -4048,7 +4040,6 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
-	struct mem_cgroup *memcg = NULL;
 	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
 	int ret;
 
@@ -4060,6 +4051,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 	if (!PageSwapCache(page))
 		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
 	else { /* page is swapcache/shmem */
+		struct mem_cgroup *memcg;
 		ret = __mem_cgroup_try_charge_swapin(mm, page,
 						     gfp_mask, &memcg);
 		if (!ret)
@@ -6671,8 +6663,7 @@ one_by_one:
 			batch_count = PRECHARGE_COUNT_AT_ONCE;
 			cond_resched();
 		}
-		ret = __mem_cgroup_try_charge(NULL,
-					      GFP_KERNEL, 1, &memcg, false);
+		ret = mem_cgroup_try_charge_memcg(GFP_KERNEL, 1, memcg, false);
 		if (ret)
 			/* mem_cgroup_clear_mc() will do uncharge later */
 			return ret;
-- 
1.9.rc1

* Re: [PATCH -v2 2/6] memcg: cleanup charge routines
From: Johannes Weiner @ 2014-02-04 16:05 UTC
To: Michal Hocko
Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

On Tue, Feb 04, 2014 at 02:28:56PM +0100, Michal Hocko wrote:
[...]
> +static bool current_bypass_charge(void)
> +{
> +	/*
> +	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
> +	 * in system level. So, allow to go ahead dying process in addition to
> +	 * MEMDIE process.
> +	 */
> +	if (unlikely(test_thread_flag(TIF_MEMDIE)
> +		     || fatal_signal_pending(current)))
> +		return true;
> +
> +	return false;
> +}

I'd just leave it inline at this point, it lines up nicely with the
other pre-charge checks in try_charge, which is at this point short
enough to take this awkward 3-liner.

[...]
> +static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
> +						   gfp_t gfp_mask,
> +						   unsigned int nr_pages,
> +						   bool oom)
> +{
> +	struct mem_cgroup *memcg;
> +	int ret;
> +
> +	/*
> +	 * We always charge the cgroup the mm_struct belongs to.
> +	 * The mm_struct's mem_cgroup changes on task migration if the
> +	 * thread group leader migrates. It's possible that mm is not
> +	 * set, if so charge the root memcg (happens for pagecache usage).
> +	 */
> +	if (!mm)
> +		goto bypass;

Why shuffle it around right before you remove it anyway? Just start
the series off with the patches that delete stuff without having to
restructure anything, get those out of the way.

* Re: [PATCH -v2 2/6] memcg: cleanup charge routines
From: Michal Hocko @ 2014-02-04 16:12 UTC
To: Johannes Weiner
Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

On Tue 04-02-14 11:05:09, Johannes Weiner wrote:
> On Tue, Feb 04, 2014 at 02:28:56PM +0100, Michal Hocko wrote:
[...]
> > +static bool current_bypass_charge(void)
[...]
> I'd just leave it inline at this point, it lines up nicely with the
> other pre-charge checks in try_charge, which is at this point short
> enough to take this awkward 3-liner.

I can keep it inline, of course. I thought having it out of line would
make the bypass conditions more obvious. But as mem_cgroup_is_root is
still checked separately, it is probably not the best thing to do.

[...]
> > +	if (!mm)
> > +		goto bypass;
>
> Why shuffle it around right before you remove it anyway? Just start
> the series off with the patches that delete stuff without having to
> restructure anything, get those out of the way.

As mentioned in the previous email, I wanted to have this condition
removal bisectable, so it is removed in the next patch, where it is
replaced by VM_BUG_ON.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH -v2 2/6] memcg: cleanup charge routines
From: Johannes Weiner @ 2014-02-04 16:40 UTC
To: Michal Hocko
Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

On Tue, Feb 04, 2014 at 05:12:30PM +0100, Michal Hocko wrote:
[...]
> As mentioned in the previous email, I wanted to have this condition
> removal bisectable, so it is removed in the next patch, where it is
> replaced by VM_BUG_ON.

I'm not suggesting to sneak the removal into *this* patch, just put
the simple stand-alone patches that remove stuff first in the series.

Seems pretty logical to me to first reduce the code base as much as
possible before reorganizing it. This does not change bisectability
but it sure makes the patches easier to read.

* Re: [PATCH -v2 2/6] memcg: cleanup charge routines
From: Michal Hocko @ 2014-02-04 19:11 UTC
To: Johannes Weiner
Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

On Tue 04-02-14 11:40:50, Johannes Weiner wrote:
[...]
> I'm not suggesting to sneak the removal into *this* patch,

OK

> just put the simple stand-alone patches that remove stuff first in
> the series.

In this particular case, though, the reduced condition is much easier
to review IMO. Just look at the jungle of different *ptr vs. mm
combinations described in this patch's changelog, which would all have
to be reviewed separately if I moved the removal before this patch.
The ptr part of the original condition went away naturally here, while
the reasoning why no code path implicitly relies on (!ptr && !mm)
resulting in a bypass would be harder to follow.

> Seems pretty logical to me to first reduce the code base as much as
> possible before reorganizing it. This does not change bisectability
> but it sure makes the patches easier to read.

Agreed.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH -v2 2/6] memcg: cleanup charge routines
From: Johannes Weiner @ 2014-02-04 19:36 UTC
To: Michal Hocko
Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

On Tue, Feb 04, 2014 at 08:11:25PM +0100, Michal Hocko wrote:
[...]
> In this particular case, though, the reduced condition is much easier
> to review IMO. Just look at the jungle of different *ptr vs. mm
> combinations described in this patch's changelog, which would all have
> to be reviewed separately if I moved the removal before this patch.

You just have to check ptr=NULL callsites...? "Callsites that don't
provide their own memcg pass a valid mm pointer for lookup."

Anyway, I'm just telling you what threw me off during review.

* [PATCH -v2 3/6] memcg: mm == NULL is not allowed for mem_cgroup_try_charge_mm
From: Michal Hocko @ 2014-02-04 13:28 UTC
To: Johannes Weiner
Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

An ancient comment tries to explain that a given mm might be NULL when
a task is migrated. It has been introduced by 8a9f3ccd (Memory
controller: memory accounting) along with other bigger changes, so it
is not much more specific about the conditions.

Anyway, even if the task is migrated to another memcg there is no way
we can see a NULL mm_struct. So either this was not correct from the
very beginning or it is not true anymore.
The only remaining case would be seeing charges after exit_mm, but
that would be a bug on its own as the task doesn't have an address
space anymore.

This patch replaces the check by VM_BUG_ON to make it obvious that we
really expect a non-NULL mm_struct.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 72fbe0fb3320..2fcdee529ad3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2782,14 +2782,7 @@ static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
 	struct mem_cgroup *memcg;
 	int ret;
 
-	/*
-	 * We always charge the cgroup the mm_struct belongs to.
-	 * The mm_struct's mem_cgroup changes on task migration if the
-	 * thread group leader migrates. It's possible that mm is not
-	 * set, if so charge the root memcg (happens for pagecache usage).
-	 */
-	if (!mm)
-		goto bypass;
+	VM_BUG_ON(!mm);
 	memcg = try_get_mem_cgroup_from_mm(mm);
 	if (!memcg)
 		goto bypass;
-- 
1.9.rc1

* Re: [PATCH -v2 3/6] memcg: mm == NULL is not allowed for mem_cgroup_try_charge_mm
From: Johannes Weiner @ 2014-02-04 16:05 UTC
To: Michal Hocko
Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

On Tue, Feb 04, 2014 at 02:28:57PM +0100, Michal Hocko wrote:
> An ancient comment tries to explain that a given mm might be NULL when
> a task is migrated. It has been introduced by 8a9f3ccd (Memory
> controller: memory accounting) along with other bigger changes, so it
> is not much more specific about the conditions.
[...]
> This patch replaces the check by VM_BUG_ON to make it obvious that we
> really expect a non-NULL mm_struct.
>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

* [PATCH -v2 4/6] memcg: make sure that memcg is not offline when charging 2014-02-04 13:28 [PATCH -v2 0/6] memcg: some charge path cleanups + css offline vs. charge race fix Michal Hocko ` (2 preceding siblings ...) 2014-02-04 13:28 ` [PATCH -v2 3/6] memcg: mm == NULL is not allowed for mem_cgroup_try_charge_mm Michal Hocko @ 2014-02-04 13:28 ` Michal Hocko 2014-02-04 16:29 ` Johannes Weiner 2014-02-04 13:28 ` [PATCH -v2 5/6] memcg, kmem: clean up memcg parameter handling Michal Hocko 2014-02-04 13:29 ` [PATCH -v2 6/6] Revert "mm: memcg: fix race condition between memcg teardown and swapin" Michal Hocko 5 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2014-02-04 13:28 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm The current charge path might race with memcg offlining because holding css reference doesn't neither prevent from task move to a different group nor stop css offline. When a charging task is the last one in the group and it is moved to a different group in the middle of the charge the old memcg might get offlined. As a result res counter might be charged after mem_cgroup_reparent_charges (called from memcg css_offline callback) and so the charge would never be freed. This has been worked around by 96f1c58d8534 (mm: memcg: fix race condition between memcg teardown and swapin) which tries to catch such a leaked charges later during css_free. It is more optimal to heal this race in the long term though. In order to make this raceless we have to check that the memcg is online and res_counter_charge in the same RCU read section. The online check can be done simply by calling css_tryget & css_put which are now wrapped into mem_cgroup_is_online helper. Callers are then updated to retry with a new memcg which is associated with the current mm. There always has to be a valid memcg encountered sooner or later because task had to be moved to a valid and online cgroup. The only exception is mem_cgroup_do_precharge which should never see this race because it is called from cgroup {can_}attach callbacks and so the whole cgroup cannot go away. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/memcontrol.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 66 insertions(+), 4 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2fcdee529ad3..d06743a9a765 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2624,8 +2624,31 @@ enum { CHARGE_RETRY, /* need to retry but retry is not bad */ CHARGE_NOMEM, /* we can't do more. return -ENOMEM */ CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */ + CHARGE_OFFLINE, /* memcg is offline already so no further + * charges are allowed + */ }; +/* + * Checks whether given memcg is still online (css_offline hasn't + * been called yet). + * + * Caller has to hold rcu read lock. + */ +static bool mem_cgroup_is_online(struct mem_cgroup *memcg) +{ + bool online; + + rcu_read_lock_held(); + + online = css_tryget(&memcg->css); + if (online) + css_put(&memcg->css); + + return online; + +} + static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned int nr_pages, unsigned int min_pages, bool invoke_oom) @@ -2636,20 +2659,37 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned long flags = 0; int ret; + /* + * Although the holder keeps a css reference for memcg this doesn't + * prevent it from being offlined in the meantime. 
We have to make sure + * that the res counter is charged before css_offline reparents its pages, + * otherwise the charge might leak. Therefore the css_tryget check and + * the res_counter charge have to happen in the same rcu read section. + */ + rcu_read_lock(); + if (unlikely(!mem_cgroup_is_online(memcg))) { + rcu_read_unlock(); + return CHARGE_OFFLINE; + } ret = res_counter_charge(&memcg->res, csize, &fail_res); if (likely(!ret)) { - if (!do_swap_account) + if (!do_swap_account) { + rcu_read_unlock(); return CHARGE_OK; + } ret = res_counter_charge(&memcg->memsw, csize, &fail_res); - if (likely(!ret)) + if (likely(!ret)) { + rcu_read_unlock(); return CHARGE_OK; + } res_counter_uncharge(&memcg->res, csize); mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw); flags |= MEM_CGROUP_RECLAIM_NOSWAP; } else mem_over_limit = mem_cgroup_from_res_counter(fail_res, res); + rcu_read_unlock(); + /* * Never reclaim on behalf of optional batching, retry with a * single page instead. @@ -2756,6 +2796,12 @@ static int mem_cgroup_try_charge_memcg(gfp_t gfp_mask, goto nomem; nr_oom_retries--; break; + /* + * memcg went offline, the caller should fall back to + * a different group. + */ + case CHARGE_OFFLINE: + return -EAGAIN; } } while (ret != CHARGE_OK); @@ -2783,6 +2829,7 @@ static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm, int ret; VM_BUG_ON(!mm); +again: memcg = try_get_mem_cgroup_from_mm(mm); if (!memcg) goto bypass; @@ -2793,6 +2840,8 @@ static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm, goto bypass; else if (ret == -ENOMEM) memcg = NULL; + else if (ret == -EAGAIN) + goto again; return memcg; bypass: return root_mem_cgroup; @@ -3617,6 +3666,7 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order) if (!current->mm || current->memcg_kmem_skip_account) return true; +again: memcg = try_get_mem_cgroup_from_mm(current->mm); /* @@ -3633,10 +3683,12 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order) } ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order); + css_put(&memcg->css); if (!ret) *_memcg = memcg; + else if (ret == -EAGAIN) + goto again; - css_put(&memcg->css); return (ret == 0); } @@ -3959,6 +4011,8 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, if (ret == -EINTR) { *memcgp = root_mem_cgroup; ret = 0; + } else if (ret == -EAGAIN) { + goto charge_cur_mm; } return ret; charge_cur_mm: @@ -6657,6 +6711,14 @@ one_by_one: cond_resched(); } ret = mem_cgroup_try_charge_memcg(GFP_KERNEL, 1, memcg, false); + + /* + * The target memcg cannot go offline because we are in the + * move path and cgroup core doesn't allow offlining such + * groups. + */ + BUG_ON(ret == -EAGAIN); + if (ret) /* mem_cgroup_clear_mc() will do uncharge later */ return ret; -- 1.9.rc1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
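Compressed into a sketch, the synchronization the patch aims for looks as follows (helper names are taken from the patch above; reclaim, batching and OOM handling are omitted, and note that the discussion below questions whether the RCU pairing is actually guaranteed by cgroup core):

static int do_charge_sketch(struct mem_cgroup *memcg, unsigned long csize)
{
	struct res_counter *fail_res;
	int ret;

	rcu_read_lock();
	/* css_tryget() based check; it fails once css_offline() has started */
	if (unlikely(!mem_cgroup_is_online(memcg))) {
		rcu_read_unlock();
		return -EAGAIN;	/* callers re-resolve the memcg from the mm and retry */
	}
	/* the res counter is charged in the same RCU read section as the check */
	ret = res_counter_charge(&memcg->res, csize, &fail_res);
	rcu_read_unlock();

	return ret;
}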
* Re: [PATCH -v2 4/6] memcg: make sure that memcg is not offline when charging 2014-02-04 13:28 ` [PATCH -v2 4/6] memcg: make sure that memcg is not offline when charging Michal Hocko @ 2014-02-04 16:29 ` Johannes Weiner 2014-02-05 13:38 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: Johannes Weiner @ 2014-02-04 16:29 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm On Tue, Feb 04, 2014 at 02:28:58PM +0100, Michal Hocko wrote: > The current charge path might race with memcg offlining because holding > a css reference neither prevents the task from moving to a different > group nor stops css offline. When a charging task is the last one in the > group and it is moved to a different group in the middle of the charge, > the old memcg might get offlined. As a result the res counter might be > charged after mem_cgroup_reparent_charges (called from the memcg > css_offline callback) and the charge would never be freed. This has been > worked around by 96f1c58d8534 (mm: memcg: fix race condition between > memcg teardown and swapin) which tries to catch such leaked charges > later during css_free. It is better to close this race for good in the > long term, though. > > In order to make this raceless we have to check that the memcg is online > and charge the res counter in the same RCU read section. The online check > can be done simply by calling css_tryget & css_put, which are now wrapped > into the mem_cgroup_is_online helper. > > Callers are then updated to retry with a new memcg which is associated > with the current mm. A valid memcg always has to be encountered sooner > or later because the task had to be moved to a valid and online cgroup. > > The only exception is mem_cgroup_do_precharge which should never see > this race because it is called from the cgroup {can_}attach callbacks and > so the whole cgroup cannot go away. > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- > mm/memcontrol.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++--- > 1 file changed, 65 insertions(+), 3 deletions(-) Maybe we should remove the XXX if it makes you think we should change the current situation by any means necessary. This patch is not an improvement. I put the XXX there so that we one day maybe refactor the code in a clean fashion where try_get_mem_cgroup_from_whatever() is in the same rcu section as the first charge attempt. On failure, reclaim, and do the lookup again. Also, this problem only exists on swapin, where the memcg is looked up from an auxiliary data structure and not the current task, so maybe that would be an angle to look for a clean solution. Either way, the problem is currently fixed with a *oneliner*. Unless the alternative solution is inherent in a clean rework of the code to match cgroup core lifetime management, I don't see any reason to move away from the status quo. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH -v2 4/6] memcg: make sure that memcg is not offline when charging 2014-02-04 16:29 ` Johannes Weiner @ 2014-02-05 13:38 ` Michal Hocko 2014-02-05 15:28 ` Johannes Weiner 0 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2014-02-05 13:38 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm On Tue 04-02-14 11:29:39, Johannes Weiner wrote: [...] > Maybe we should remove the XXX if it makes you think we should change > the current situation by any means necessary. This patch is not an > improvement. > > I put the XXX there so that we one day maybe refactor the code in a > clean fashion where try_get_mem_cgroup_from_whatever() is in the same > rcu section as the first charge attempt. On failure, reclaim, and do > the lookup again. I wouldn't be opposed to such a cleanup. It is not that simple, though. > Also, this problem only exists on swapin, where the memcg is looked up > from an auxiliary data structure and not the current task, so maybe > that would be an angle to look for a clean solution. I am not so sure about that. Task could have been moved to a different group basically anytime it was outside of rcu_read_lock section (which means most of the time). And so the group might get removed and we are in the very same situation. > Either way, the problem is currently fixed OK, my understanding (and my ack was based on that) was that we needed a simple and safe fix for the stable trees and we would have something more appropriate later on. Preventing from the race sounds like a more appropriate and a better technical solution to me. So I would rather ask why to keep a workaround in place. Does it add any risk? Especially when we basically abuse the 2 stage cgroup removal. All the charges should be cleared out after css_offline. > with a *oneliner*. That is really not important because _that_ oneliner abuses the function which should be in fact called from a different context. > Unless the alternative solution is inherent in a clean rework of the > code to match cgroup core lifetime management, I don't see any reason > to move away from the status quo. To be honest this sounds like a weak reasoning to refuse a real fix which replaces a workaround. This is a second attempt to fix the actual race that you are dismissing which is really surprising to me. Especially when the workaround is an ugly hack. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH -v2 4/6] memcg: make sure that memcg is not offline when charging 2014-02-05 13:38 ` Michal Hocko @ 2014-02-05 15:28 ` Johannes Weiner 2014-02-05 15:42 ` Tejun Heo 2014-02-05 16:19 ` Michal Hocko 0 siblings, 2 replies; 27+ messages in thread From: Johannes Weiner @ 2014-02-05 15:28 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Tejun Heo, LKML, linux-mm On Wed, Feb 05, 2014 at 02:38:34PM +0100, Michal Hocko wrote: > On Tue 04-02-14 11:29:39, Johannes Weiner wrote: > [...] > > Maybe we should remove the XXX if it makes you think we should change > > the current situation by any means necessary. This patch is not an > > improvement. > > > > I put the XXX there so that we one day maybe refactor the code in a > > clean fashion where try_get_mem_cgroup_from_whatever() is in the same > > rcu section as the first charge attempt. On failure, reclaim, and do > > the lookup again. > > I wouldn't be opposed to such a cleanup. It is not that simple, though. > > > Also, this problem only exists on swapin, where the memcg is looked up > > from an auxilliary data structure and not the current task, so maybe > > that would be an angle to look for a clean solution. > > I am not so sure about that. Task could have been moved to a different > group basically anytime it was outside of rcu_read_lock section (which > means most of the time). And so the group might get removed and we are > in the very same situation. > > > Either way, the problem is currently fixed > > OK, my understanding (and my ack was based on that) was that we needed > a simple and safe fix for the stable trees and we would have something > more appropriate later on. Preventing from the race sounds like a more > appropriate and a better technical solution to me. So I would rather ask > why to keep a workaround in place. Does it add any risk? > Especially when we basically abuse the 2 stage cgroup removal. All the > charges should be cleared out after css_offline. I thought more about this and talked to Tejun as well. He told me that the rcu grace period between disabling tryget and calling css_offline() is currently an implementation detail of the refcounter that css uses, but it's not a guarantee. So my initial idea of reworking memcg to do css_tryget() and res_counter_charge() in the same rcu section is no longer enough to synchronize against offlining. We can forget about that. On the other hand, memcg holds a css reference only while an actual controller reference is being established (res_counter_charge), then drops it. This means that once css_tryget() is disabled, we only need to wait for the css refcounter to hit 0 to know for sure that no new charges can show up and reparent_charges() is safe to run, right? Well, css_free() is the callback invoked when the ref counter hits 0, and that is a guarantee. From a memcg perspective, it's the right place to do reparenting, not css_offline(). Here is the only exception to the above: swapout records maintain permanent css references, so they prevent css_free() from running. For that reason alone we should run one optimistic reparenting in css_offline() to make sure one swap record does not pin gigabytes of pages in an offlined cgroup, which is unreachable for reclaim. But the reparenting for *correctness* is in css_free(), not css_offline(). We should be changing the comments. The code is already correct. 
> > Unless the alternative solution is inherent in a clean rework of the > > code to match cgroup core lifetime management, I don't see any reason > > to move away from the status quo. > > To be honest this sounds like a weak reasoning to refuse a real fix > which replaces a workaround. > > This is a second attempt to fix the actual race that you are dismissing > which is really surprising to me. Especially when the workaround is an > ugly hack. IMO it was always functionally correct, just something that could have been done cleaner from a design POV. That's why I refused every alternative solution that made the code worse instead of better. But looks like it also makes perfect sense from a design POV, so it's all moot now. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
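The lifetime rules Johannes lays out map onto the two callbacks roughly as in the sketch below. This is illustrative only; it mirrors what the tree already does with the workaround from 96f1c58d8534 in place (see patch 6/6), rather than proposing new code:

static void css_offline_sketch(struct cgroup_subsys_state *css)
{
	/*
	 * Optimistic pass: css_tryget() is disabled from here on, so
	 * reclaim cannot empty the group anymore. Move the bulk of the
	 * pages to the parent early in case long-lived swapout records
	 * pin the css and defer css_free() for a long time.
	 */
	mem_cgroup_reparent_charges(mem_cgroup_from_css(css));
}

static void css_free_sketch(struct cgroup_subsys_state *css)
{
	/*
	 * Correctness pass: the refcount has hit zero, so no charger
	 * can still be in flight; catch anything that raced with the
	 * optimistic pass above.
	 */
	mem_cgroup_reparent_charges(mem_cgroup_from_css(css));
}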
* Re: [PATCH -v2 4/6] memcg: make sure that memcg is not offline when charging 2014-02-05 15:28 ` Johannes Weiner @ 2014-02-05 15:42 ` Tejun Heo 2014-02-05 16:19 ` Michal Hocko 1 sibling, 0 replies; 27+ messages in thread From: Tejun Heo @ 2014-02-05 15:42 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm Hello, guys. On Wed, Feb 05, 2014 at 10:28:21AM -0500, Johannes Weiner wrote: > I thought more about this and talked to Tejun as well. He told me > that the rcu grace period between disabling tryget and calling > css_offline() is currently an implementation detail of the refcounter > that css uses, but it's not a guarantee. So my initial idea of Yeah, that's an implementation detail coming from how percpu_ref is implemented at the moment. Also, it's a sched_rcu grace period, not a normal one. The only RCU-related guarantee that cgroup core gives is that there will be a full RCU grace period between css's ref reaching zero and invocation of ->css_free() so that it's safe to do css_tryget() inside RCU critical sections. In short, offlining is *not* protected by RCU. Freeing is. > Well, css_free() is the callback invoked when the ref counter hits 0, > and that is a guarantee. From a memcg perspective, it's the right > place to do reparenting, not css_offline(). So, css_offline() is cgroup telling controllers two things. * The destruction of the css, which will commence when css ref reaches zero, has initiated. If you're holding any long term css refs for caching and stuff, put them so that destruction can actually happen. * Any css_tryget() attempts which haven't finished yet are guaranteed to fail. (there's no implied RCU protection here) Maybe offline is a bit of a misnomer. It's really just telling the controllers to get prepared to be destroyed. > Here is the only exception to the above: swapout records maintain > permanent css references, so they prevent css_free() from running. > For that reason alone we should run one optimistic reparenting in > css_offline() to make sure one swap record does not pin gigabytes of > pages in an offlined cgroup, which is unreachable for reclaim. But > the reparenting for *correctness* is in css_free(), not css_offline(). A more canonical use case can be found in blkcg. blkcg holds "cache" css refs for optimization in the indexing data structure. On offline, blkcg purges those refs so that those stale cache refs don't put off actual destruction for too long. But yeah the above sounds like a plausible use case too. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
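The blkcg pattern Tejun mentions can be sketched as below. The ref_cache structure and the cache_from_css() lookup are hypothetical stand-ins introduced for illustration; only the css_tryget()/css_put() rules are real:

struct ref_cache {
	struct cgroup_subsys_state *cached_css;	/* long-term "cache" ref */
};

static void example_css_offline(struct cgroup_subsys_state *css)
{
	struct ref_cache *cache = cache_from_css(css);	/* hypothetical lookup */

	/* new css_tryget() attempts on this css are guaranteed to fail now */
	if (cache->cached_css) {
		css_put(cache->cached_css);	/* let the refcount reach zero */
		cache->cached_css = NULL;
	}
}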
* Re: [PATCH -v2 4/6] memcg: make sure that memcg is not offline when charging 2014-02-05 15:28 ` Johannes Weiner 2014-02-05 15:42 ` Tejun Heo @ 2014-02-05 16:19 ` Michal Hocko 2014-02-05 16:29 ` Michal Hocko ` (2 more replies) 1 sibling, 3 replies; 27+ messages in thread From: Michal Hocko @ 2014-02-05 16:19 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Tejun Heo, LKML, linux-mm On Wed 05-02-14 10:28:21, Johannes Weiner wrote: > On Wed, Feb 05, 2014 at 02:38:34PM +0100, Michal Hocko wrote: > > On Tue 04-02-14 11:29:39, Johannes Weiner wrote: > > [...] > > > Maybe we should remove the XXX if it makes you think we should change > > > the current situation by any means necessary. This patch is not an > > > improvement. > > > > > > I put the XXX there so that we one day maybe refactor the code in a > > > clean fashion where try_get_mem_cgroup_from_whatever() is in the same > > > rcu section as the first charge attempt. On failure, reclaim, and do > > > the lookup again. > > > > I wouldn't be opposed to such a cleanup. It is not that simple, though. > > > > > Also, this problem only exists on swapin, where the memcg is looked up > > > from an auxilliary data structure and not the current task, so maybe > > > that would be an angle to look for a clean solution. > > > > I am not so sure about that. Task could have been moved to a different > > group basically anytime it was outside of rcu_read_lock section (which > > means most of the time). And so the group might get removed and we are > > in the very same situation. > > > > > Either way, the problem is currently fixed > > > > OK, my understanding (and my ack was based on that) was that we needed > > a simple and safe fix for the stable trees and we would have something > > more appropriate later on. Preventing from the race sounds like a more > > appropriate and a better technical solution to me. So I would rather ask > > why to keep a workaround in place. Does it add any risk? > > Especially when we basically abuse the 2 stage cgroup removal. All the > > charges should be cleared out after css_offline. > > I thought more about this and talked to Tejun as well. He told me > that the rcu grace period between disabling tryget and calling > css_offline() is currently an implementation detail of the refcounter > that css uses, but it's not a guarantee. So my initial idea of > reworking memcg to do css_tryget() and res_counter_charge() in the > same rcu section is no longer enough to synchronize against offlining. > We can forget about that. > > On the other hand, memcg holds a css reference only while an actual > controller reference is being established (res_counter_charge), then > drops it. This means that once css_tryget() is disabled, we only need > to wait for the css refcounter to hit 0 to know for sure that no new > charges can show up and reparent_charges() is safe to run, right? > > Well, css_free() is the callback invoked when the ref counter hits 0, > and that is a guarantee. From a memcg perspective, it's the right > place to do reparenting, not css_offline(). OK, it seems I've totally misunderstood what is the purpose of css_offline. My understanding was that any attempt to css_tryget will fail when css_offline starts. I will read through Tejun's email as well and think about it some more. > Here is the only exception to the above: swapout records maintain > permanent css references, so they prevent css_free() from running. 
> For that reason alone we should run one optimistic reparenting in > css_offline() to make sure one swap record does not pin gigabytes of > pages in an offlined cgroup, which is unreachable for reclaim. But How can reparenting help for swapped out pages? Or did you mean to at least get rid of swapcache pages? > the reparenting for *correctness* is in css_free(), not css_offline(). With the provided explanation of css_offline yes. > We should be changing the comments. The code is already correct. [...] -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH -v2 4/6] memcg: make sure that memcg is not offline when charging 2014-02-05 16:19 ` Michal Hocko @ 2014-02-05 16:29 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2014-02-05 16:29 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Tejun Heo, LKML, linux-mm On Wed 05-02-14 17:19:40, Michal Hocko wrote: > On Wed 05-02-14 10:28:21, Johannes Weiner wrote: [...] > > I thought more about this and talked to Tejun as well. He told me > > that the rcu grace period between disabling tryget and calling > > css_offline() is currently an implementation detail of the refcounter > > that css uses, but it's not a guarantee. So my initial idea of > > reworking memcg to do css_tryget() and res_counter_charge() in the > > same rcu section is no longer enough to synchronize against offlining. > > We can forget about that. > > > > On the other hand, memcg holds a css reference only while an actual > > controller reference is being established (res_counter_charge), then > > drops it. This means that once css_tryget() is disabled, we only need > > to wait for the css refcounter to hit 0 to know for sure that no new > > charges can show up and reparent_charges() is safe to run, right? > > > > Well, css_free() is the callback invoked when the ref counter hits 0, > > and that is a guarantee. From a memcg perspective, it's the right > > place to do reparenting, not css_offline(). > > OK, it seems I've totally misunderstood what is the purpose of > css_offline. My understanding was that any attempt to css_tryget will > fail when css_offline starts. I will read through Tejun's email as well > and think about it some more. OK, so css_tryget fails at the time of css_offline but there is no RCU guarantee we can rely on. This means that css_offline is of very limited use for us. Pages which are swapped out are not reachable for reparenting and so we still might have a lot of references to the css. Whether it makes much sense to call reparent only for the swapcache is questionable. We are still relying on some task to release that memory while it lives in another memcg. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH -v2 4/6] memcg: make sure that memcg is not offline when charging 2014-02-05 16:19 ` Michal Hocko 2014-02-05 16:29 ` Michal Hocko @ 2014-02-05 16:30 ` Tejun Heo 2014-02-05 16:45 ` Johannes Weiner 2 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2014-02-05 16:30 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm Hello, Michal. On Wed, Feb 05, 2014 at 05:19:40PM +0100, Michal Hocko wrote: > > Well, css_free() is the callback invoked when the ref counter hits 0, > > and that is a guarantee. From a memcg perspective, it's the right > > place to do reparenting, not css_offline(). > > OK, it seems I've totally misunderstood what is the purpose of > css_offline. My understanding was that any attempt to css_tryget will Heh, the semantics have changed significantly during the past year. It started as something pretty unusual (synchronous ref draining on rmdir) and took some iterations to reach the current design and we still don't have any proper documentation, so misunderstanding probably is inevitable, sorry. :) > fail when css_offline starts. I will read through Tejun's email as well > and think about it some more. Yes, css_tryget() is guaranteed to fail once css_offline() starts. This is to help ref draining so that controllers have a scalable way to reliably decide when to say no to new usages. Please note that css_get() is still allowed even after css_offline() (of course as long as the caller already has a ref). Thanks! -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
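Spelled out as a sketch, the rule is: once css_offline() has started, new references can no longer be taken opportunistically, but existing ones can still be duplicated. The function below only illustrates this contract and assumes the caller already holds a reference:

static void ref_rules_sketch(struct mem_cgroup *memcg)	/* caller holds a ref */
{
	/* fails once the css has been taken offline: */
	if (css_tryget(&memcg->css))
		css_put(&memcg->css);

	/* still legal after offline, because a ref is already held: */
	css_get(&memcg->css);
	css_put(&memcg->css);
}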
* Re: [PATCH -v2 4/6] memcg: make sure that memcg is not offline when charging 2014-02-05 16:19 ` Michal Hocko 2014-02-05 16:29 ` Michal Hocko 2014-02-05 16:30 ` Tejun Heo @ 2014-02-05 16:45 ` Johannes Weiner 2014-02-05 17:23 ` Michal Hocko 2 siblings, 1 reply; 27+ messages in thread From: Johannes Weiner @ 2014-02-05 16:45 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Tejun Heo, LKML, linux-mm On Wed, Feb 05, 2014 at 05:19:40PM +0100, Michal Hocko wrote: > On Wed 05-02-14 10:28:21, Johannes Weiner wrote: > > Here is the only exception to the above: swapout records maintain > > permanent css references, so they prevent css_free() from running. > > For that reason alone we should run one optimistic reparenting in > > css_offline() to make sure one swap record does not pin gigabytes of > > pages in an offlined cgroup, which is unreachable for reclaim. But > > How can reparenting help for swapped out pages? Or did you mean to at > least get rid of swapcache pages? I was thinking primarily of page cache. There could be a lot of it left in the group and once css_tryget() is disabled we can't reclaim it anymore. So we'd clean that out at offline time optimistically and at css_free() we catch any raced charges that showed up afterwards. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH -v2 4/6] memcg: make sure that memcg is not offline when charging 2014-02-05 16:45 ` Johannes Weiner @ 2014-02-05 17:23 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2014-02-05 17:23 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Tejun Heo, LKML, linux-mm On Wed 05-02-14 11:45:43, Johannes Weiner wrote: > On Wed, Feb 05, 2014 at 05:19:40PM +0100, Michal Hocko wrote: > > On Wed 05-02-14 10:28:21, Johannes Weiner wrote: > > > Here is the only exception to the above: swapout records maintain > > > permanent css references, so they prevent css_free() from running. > > > For that reason alone we should run one optimistic reparenting in > > > css_offline() to make sure one swap record does not pin gigabytes of > > > pages in an offlined cgroup, which is unreachable for reclaim. But > > > > How can reparenting help for swapped out pages? Or did you mean to at > > least get rid of swapcache pages? > > I was thinking primarily of page cache. There could be a lot of it > left in the group and once css_tryget() is disabled we can't reclaim > it anymore. Good point. > So we'd clean that out at offline time optimistically and > at css_free() we catch any raced charges that showed up afterwards. OK, care to send a patch which would clarify the reparenting usage in both css_offline and css_free? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH -v2 5/6] memcg, kmem: clean up memcg parameter handling 2014-02-04 13:28 [PATCH -v2 0/6] memcg: some charge path cleanups + css offline vs. charge race fix Michal Hocko ` (3 preceding siblings ...) 2014-02-04 13:28 ` [PATCH -v2 4/6] memcg: make sure that memcg is not offline when charging Michal Hocko @ 2014-02-04 13:28 ` Michal Hocko 2014-02-04 16:32 ` Johannes Weiner 2014-02-04 13:29 ` [PATCH -v2 6/6] Revert "mm: memcg: fix race condition between memcg teardown and swapin" Michal Hocko 5 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2014-02-04 13:28 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm memcg_kmem_newpage_charge doesn't always set the given memcg parameter. Some early escape paths skip setting *memcg while __memcg_kmem_newpage_charge down the call chain sets *memcg even if no memcg is charged due to other escape paths. The current code is correct because the memcg is initialized to NULL at the highest level in __alloc_pages_nodemask but this all is very confusing and error prone. Let's make the semantic clear and move the memcg parameter initialization to the highest level of kmem accounting (memcg_kmem_newpage_charge). Signed-off-by: Michal Hocko <mhocko@suse.cz> --- include/linux/memcontrol.h | 4 +++- mm/memcontrol.c | 2 -- mm/page_alloc.c | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index abd0113b6620..7bcb39668917 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -522,11 +522,13 @@ void kmem_cache_destroy_memcg_children(struct kmem_cache *s); * allocation. * * We return true automatically if this allocation is not to be accounted to - * any memcg. + * any memcg when *memcg is set to NULL. */ static inline bool memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order) { + *memcg = NULL; + if (!memcg_kmem_enabled()) return true; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d06743a9a765..46b9f461cedf 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3637,8 +3637,6 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order) struct mem_cgroup *memcg; int ret; - *_memcg = NULL; - /* * Disabling accounting is only relevant for some specific memcg * internal allocations. Therefore we would initially not have such diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e3758a09a009..6f6099d38772 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2692,7 +2692,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int migratetype = allocflags_to_migratetype(gfp_mask); unsigned int cpuset_mems_cookie; int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET; - struct mem_cgroup *memcg = NULL; + struct mem_cgroup *memcg; gfp_mask &= gfp_allowed_mask; -- 1.9.rc1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
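The caller contract after this patch, as a hedged sketch (the allocation call is a stand-in for what __alloc_pages_nodemask really does): *memcg no longer has to be pre-initialized because the memcg_kmem_newpage_charge() wrapper clears it up front on every path:

static struct page *kmem_alloc_sketch(gfp_t gfp_mask, int order)
{
	struct mem_cgroup *memcg;	/* note: no "= NULL" needed anymore */
	struct page *page = NULL;

	if (memcg_kmem_newpage_charge(gfp_mask, &memcg, order)) {
		page = alloc_pages(gfp_mask, order);
		/* the commit path reverts the charge if the allocation failed */
		memcg_kmem_commit_charge(page, memcg, order);
	}
	return page;
}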
* Re: [PATCH -v2 5/6] memcg, kmem: clean up memcg parameter handling 2014-02-04 13:28 ` [PATCH -v2 5/6] memcg, kmem: clean up memcg parameter handling Michal Hocko @ 2014-02-04 16:32 ` Johannes Weiner 2014-02-04 16:42 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: Johannes Weiner @ 2014-02-04 16:32 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm On Tue, Feb 04, 2014 at 02:28:59PM +0100, Michal Hocko wrote: > memcg_kmem_newpage_charge doesn't always set the given memcg parameter. lol, I really don't get your patch order... > Some early escape paths skip setting *memcg while > __memcg_kmem_newpage_charge down the call chain sets *memcg even if no > memcg is charged due to other escape paths. > > The current code is correct because the memcg is initialized to NULL > at the highest level in __alloc_pages_nodemask but this all is very > confusing and error prone. Let's make the semantic clear and move the > memcg parameter initialization to the highest level of kmem accounting > (memcg_kmem_newpage_charge). > > Signed-off-by: Michal Hocko <mhocko@suse.cz> Patch looks good, though. Acked-by: Johannes Weiner <hannes@cmpxchg.org> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH -v2 5/6] memcg, kmem: clean up memcg parameter handling 2014-02-04 16:32 ` Johannes Weiner @ 2014-02-04 16:42 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2014-02-04 16:42 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm On Tue 04-02-14 11:32:10, Johannes Weiner wrote: > On Tue, Feb 04, 2014 at 02:28:59PM +0100, Michal Hocko wrote: > > memcg_kmem_newpage_charge doesn't always set the given memcg parameter. > > lol, I really don't get your patch order... Ok, Ok, I've encountered this mess while double checking #2 and was too lazy to rebase again. I will move it to the front for the merge. > > Some early escape paths skip setting *memcg while > > __memcg_kmem_newpage_charge down the call chain sets *memcg even if no > > memcg is charged due to other escape paths. > > > > The current code is correct because the memcg is initialized to NULL > > at the highest level in __alloc_pages_nodemask but this all is very > > confusing and error prone. Let's make the semantic clear and move the > > memcg parameter initialization to the highest level of kmem accounting > > (memcg_kmem_newpage_charge). > > > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > > Patch looks good, though. > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> Thanks! -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH -v2 6/6] Revert "mm: memcg: fix race condition between memcg teardown and swapin" 2014-02-04 13:28 [PATCH -v2 0/6] memcg: some charge path cleanups + css offline vs. charge race fix Michal Hocko ` (4 preceding siblings ...) 2014-02-04 13:28 ` [PATCH -v2 5/6] memcg, kmem: clean up memcg parameter handling Michal Hocko @ 2014-02-04 13:29 ` Michal Hocko 5 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2014-02-04 13:29 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm This reverts commit 96f1c58d853497a757463e0b57fed140d6858f3a because it is no longer needed after "memcg: make sure that memcg is not offline when charging" which makes sure that no charges will be accepted after mem_cgroup_reparent_charges has started. Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/memcontrol.c | 36 ------------------------------------ 1 file changed, 36 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 46b9f461cedf..6226977d53d0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6623,42 +6623,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css) static void mem_cgroup_css_free(struct cgroup_subsys_state *css) { struct mem_cgroup *memcg = mem_cgroup_from_css(css); - /* - * XXX: css_offline() would be where we should reparent all - * memory to prepare the cgroup for destruction. However, - * memcg does not do css_tryget() and res_counter charging - * under the same RCU lock region, which means that charging - * could race with offlining. Offlining only happens to - * cgroups with no tasks in them but charges can show up - * without any tasks from the swapin path when the target - * memcg is looked up from the swapout record and not from the - * current task as it usually is. A race like this can leak - * charges and put pages with stale cgroup pointers into - * circulation: - * - * #0 #1 - * lookup_swap_cgroup_id() - * rcu_read_lock() - * mem_cgroup_lookup() - * css_tryget() - * rcu_read_unlock() - * disable css_tryget() - * call_rcu() - * offline_css() - * reparent_charges() - * res_counter_charge() - * css_put() - * css_free() - * pc->mem_cgroup = dead memcg - * add page to lru - * - * The bulk of the charges are still moved in offline_css() to - * avoid pinning a lot of pages in case a long-term reference - * like a swapout record is deferring the css_free() to long - * after offlining. But this makes sure we catch any charges - * made after offlining: - */ - mem_cgroup_reparent_charges(memcg); memcg_destroy_kmem(memcg); __mem_cgroup_free(memcg); -- 1.9.rc1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread