* [PATCHSET] memcg: fix and reimplement iterator
  From: Tejun Heo @ 2013-06-04  0:44 UTC
  To: mhocko, hannes, bsingharora; +Cc: cgroups, linux-mm, lizefan

mem_cgroup_iter() wraps around cgroup_next_descendant_pre() to provide a
pre-order walk of the memcg hierarchy.  In addition to the normal walk, it
also implements shared iteration keyed by zone, node and priority so that
multiple reclaimers don't end up hitting the same nodes.  Reclaimers
working on the same zone, node and priority push the same iterator
forward.

Unfortunately, the way this is implemented is disturbingly complicated.
It ends up implementing a fairly unique synchronization construct inside
memcg, which is never a good sign for any subsystem.  While the
implemented synchronization is overly elaborate and fragile, the intention
behind it is understandable: previously, the cgroup iterator required each
iteration to be contained inside a single RCU read-side critical section,
which ruled out a saner mechanism.  To work around the limitation, memcg
developed this Rube Goldberg machine to detect whether the cached last
cgroup is still alive, which of course was ever so subtly broken.

Now that cgroup iterations can survive being dropped out of an RCU
read-side critical section, this can be made a lot simpler.  This patchset
contains the following three patches.

 0001-memcg-fix-subtle-memory-barrier-bug-in-mem_cgroup_it.patch
 0002-memcg-restructure-mem_cgroup_iter.patch
 0003-memcg-simplify-mem_cgroup_reclaim_iter.patch

0001 is a fix for a subtle memory barrier bug in the current
implementation.  It should be applied to for-3.10-fixes and backported
through -stable.  In general, memory barriers are a bad idea.  Please
don't use them unless utterly necessary, and, if you do, please add ample
documentation explaining how they're paired and what they're achieving.
Documenting is often extremely helpful for the implementor too, because
one ends up looking at and thinking about things a lot more carefully.

0002 restructures mem_cgroup_iter() so that it's easier to follow and
change.

0003 reimplements the iterator sharing so that it's simpler and more
conventional.  It depends on the new cgroup iterator updates.

This patchset is on top of cgroup/for-3.11[1], which contains the iterator
updates this patchset depends upon, and is available in the following git
branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-memcg-simpler-iter

Lightly tested.  Proceed with caution.

 mm/memcontrol.c | 134 ++++++++++++++++++++++----------------------------------
 1 file changed, 54 insertions(+), 80 deletions(-)

Thanks.

-- 
tejun

[1] git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-3.11
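The pairing the cover letter asks to be documented is the generic
store/load pattern below -- a sketch for illustration only, not code from
this series; the variable names are made up:

	/* writer */
	shared_data = val;
	smp_wmb();	/* order the data before the flag; pairs with the
			 * smp_rmb() in the reader */
	data_ready = 1;

	/* reader */
	if (data_ready) {
		smp_rmb();	/* pairs with the smp_wmb() in the writer */
		val = shared_data;	/* guaranteed to see the new value */
	}

Each barrier is useless without its counterpart, which is why the
changelogs below insist on documenting them as a pair.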
* [PATCH 1/3] memcg: fix subtle memory barrier bug in mem_cgroup_iter()
  From: Tejun Heo @ 2013-06-04  0:44 UTC
  To: mhocko, hannes, bsingharora; +Cc: cgroups, linux-mm, lizefan, Tejun Heo

mem_cgroup_iter() plays a rather delicate game to allow sharing reclaim
iteration across multiple reclaimers.  It uses reclaim_iter->last_visited
and ->dead_count to remember the last visited cgroup and to verify whether
the cgroup is still safe to access.  For the mechanism to work properly,
updates to ->last_visited must be visible before ->dead_count; otherwise,
a stale ->last_visited may be considered to be associated with a more
recent ->dead_count and thus escape dead-condition detection, which may
lead to a use-after-free.

The function has smp_rmb() where the dead condition is checked and
smp_wmb() where the two variables are updated, but the smp_rmb() isn't
between the dereferences of the two variables, making the whole thing
pointless.  It's right after atomic_read(&root->dead_count), whose only
requirement is to belong to the same RCU critical section.

This patch moves the smp_rmb() between the ->last_visited and
->last_dead_count dereferences and adds a comment explaining how the
barriers are paired and for what.

Let's please not add memory barriers without explicitly explaining the
pairing and what they are achieving.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 mm/memcontrol.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cb1c9de..cb2f91c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1218,9 +1218,18 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			 * is alive.
 			 */
 			dead_count = atomic_read(&root->dead_count);
-			smp_rmb();
+
 			last_visited = iter->last_visited;
 			if (last_visited) {
+				/*
+				 * Paired with smp_wmb() below in this
+				 * function.  The pair guarantee that
+				 * last_visited is more current than
+				 * last_dead_count, which may lead to
+				 * spurious iteration resets but guarantees
+				 * reliable detection of dead condition.
+				 */
+				smp_rmb();
 				if ((dead_count != iter->last_dead_count) ||
 					!css_tryget(&last_visited->css)) {
 					last_visited = NULL;
@@ -1235,6 +1244,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 				css_put(&last_visited->css);
 
 			iter->last_visited = memcg;
+			/* paired with smp_rmb() above in this function */
 			smp_wmb();
 			iter->last_dead_count = dead_count;
-- 
1.8.2.1
* Re: [PATCH 1/3] memcg: fix subtle memory barrier bug in mem_cgroup_iter()
  From: Michal Hocko @ 2013-06-04 13:03 UTC
  To: Tejun Heo; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

On Mon 03-06-13 17:44:37, Tejun Heo wrote:
[...]
> @@ -1218,9 +1218,18 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  			 * is alive.
>  			 */
>  			dead_count = atomic_read(&root->dead_count);
> -			smp_rmb();
> +
>  			last_visited = iter->last_visited;
>  			if (last_visited) {
> +				/*
> +				 * Paired with smp_wmb() below in this
> +				 * function.  The pair guarantee that
> +				 * last_visited is more current than
> +				 * last_dead_count, which may lead to
> +				 * spurious iteration resets but guarantees
> +				 * reliable detection of dead condition.
> +				 */
> +				smp_rmb();
>  				if ((dead_count != iter->last_dead_count) ||
>  					!css_tryget(&last_visited->css)) {
>  					last_visited = NULL;

I originally had the barrier this way but Johannes pointed out it is not
correct: https://lkml.org/lkml/2013/2/11/411
"
!> +		/*
!> +		 * last_visited might be invalid if some of the group
!> +		 * downwards was removed. As we do not know which one
!> +		 * disappeared we have to start all over again from the
!> +		 * root.
!> +		 * css ref count then makes sure that css won't
!> +		 * disappear while we iterate to the next memcg
!> +		 */
!> +		last_visited = iter->last_visited;
!> +		dead_count = atomic_read(&root->dead_count);
!> +		smp_rmb();
!
!Confused about this barrier, see below.
!
!As per above, if you remove the iter lock, those lines are mixed up.
!You need to read the dead count first because the writer updates the
!dead count after it sets the new position. That way, if the dead
!count gives the go-ahead, you KNOW that the position cache is valid,
!because it has been updated first. If either the two reads or the two
!writes get reordered, you risk seeing a matching dead count while the
!position cache is stale.
"

I think that explanation makes sense but I will leave further explanation
to Mr "I do not like mutual exclusion" :P
(https://lkml.org/lkml/2013/2/11/501 "My bumper sticker reads "I don't
believe in mutual exclusion" (the kernel hacker's version of smile for the
red light camera)")

> @@ -1235,6 +1244,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  				css_put(&last_visited->css);
>  
>  			iter->last_visited = memcg;
> +			/* paired with smp_rmb() above in this function */
>  			smp_wmb();
>  			iter->last_dead_count = dead_count;
> 
> -- 
> 1.8.2.1
> 
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH 1/3] memcg: fix subtle memory barrier bug in mem_cgroup_iter()
  From: Johannes Weiner @ 2013-06-04 13:58 UTC
  To: Michal Hocko; +Cc: Tejun Heo, bsingharora, cgroups, linux-mm, lizefan

On Tue, Jun 04, 2013 at 03:03:36PM +0200, Michal Hocko wrote:
> On Mon 03-06-13 17:44:37, Tejun Heo wrote:
> [...]
> > @@ -1218,9 +1218,18 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> >  			 * is alive.
> >  			 */
> >  			dead_count = atomic_read(&root->dead_count);
> > -			smp_rmb();
> > +
> >  			last_visited = iter->last_visited;
> >  			if (last_visited) {
> > +				/*
> > +				 * Paired with smp_wmb() below in this
> > +				 * function.  The pair guarantee that
> > +				 * last_visited is more current than
> > +				 * last_dead_count, which may lead to
> > +				 * spurious iteration resets but guarantees
> > +				 * reliable detection of dead condition.
> > +				 */
> > +				smp_rmb();
> >  				if ((dead_count != iter->last_dead_count) ||
> >  					!css_tryget(&last_visited->css)) {
> >  					last_visited = NULL;
> 
> I originally had the barrier this way but Johannes pointed out it is not
> correct: https://lkml.org/lkml/2013/2/11/411
> "
> !> +		/*
> !> +		 * last_visited might be invalid if some of the group
> !> +		 * downwards was removed. As we do not know which one
> !> +		 * disappeared we have to start all over again from the
> !> +		 * root.
> !> +		 * css ref count then makes sure that css won't
> !> +		 * disappear while we iterate to the next memcg
> !> +		 */
> !> +		last_visited = iter->last_visited;
> !> +		dead_count = atomic_read(&root->dead_count);
> !> +		smp_rmb();
> !
> !Confused about this barrier, see below.
> !
> !As per above, if you remove the iter lock, those lines are mixed up.
> !You need to read the dead count first because the writer updates the
> !dead count after it sets the new position. That way, if the dead
> !count gives the go-ahead, you KNOW that the position cache is valid,
> !because it has been updated first. If either the two reads or the two
> !writes get reordered, you risk seeing a matching dead count while the
> !position cache is stale.
> "

The original prototype code I sent looked like this:

	mem_cgroup_iter:
	  rcu_read_lock()
	  if atomic_read(&root->dead_count) == iter->dead_count:
	    smp_rmb()
	    if tryget(iter->position):
	      position = iter->position
	  memcg = find_next(position)
	  css_put(position)
	  iter->position = memcg
	  smp_wmb() /* Write position cache BEFORE marking it uptodate */
	  iter->dead_count = atomic_read(&root->dead_count)
	  rcu_read_unlock()

iter->last_position is written, THEN iter->last_dead_count is written.

So, yes, you "need to read the dead count" first to be sure
iter->last_position is up to date.  But iter->last_dead_count, not
root->dead_count.  I should have caught this in the final submission of
your patch :(

Tejun's patch is not correct, either.  Something like this?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 010d6c1..92830fa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1199,7 +1199,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 
 			mz = mem_cgroup_zoneinfo(root, nid, zid);
 			iter = &mz->reclaim_iter[reclaim->priority];
-			last_visited = iter->last_visited;
 			if (prev && reclaim->generation != iter->generation) {
 				iter->last_visited = NULL;
 				goto out_unlock;
@@ -1217,14 +1216,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			 * css_tryget() verifies the cgroup pointed to
 			 * is alive.
 			 */
+			last_visited = NULL;
 			dead_count = atomic_read(&root->dead_count);
-			smp_rmb();
-			last_visited = iter->last_visited;
-			if (last_visited) {
-				if ((dead_count != iter->last_dead_count) ||
-				    !css_tryget(&last_visited->css)) {
+			if (dead_count == iter->last_dead_count) {
+				/*
+				 * The writer below sets the position
+				 * pointer, then the dead count.
+				 * Ensure we read the updated position
+				 * when the dead count matches.
+				 */
+				smp_rmb();
+				last_visited = iter->last_visited;
+				if (last_visited &&
+				    !css_tryget(&last_visited->css))
					last_visited = NULL;
-				}
 			}
 		}
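Put side by side, the protocol Johannes is describing boils down to the
following (a simplified sketch of the ordering only; the locking, tryget
and generation handling are omitted):

	/* writer (publishing a new cached position) */
	iter->last_visited = memcg;
	smp_wmb();			/* position before validity stamp */
	iter->last_dead_count = dead_count;

	/* reader (validating the cached position) */
	dead_count = atomic_read(&root->dead_count);
	if (dead_count == iter->last_dead_count) {	/* stamp read first... */
		smp_rmb();
		last_visited = iter->last_visited;	/* ...then the position */
	}

The reader must order the iter->last_dead_count load before the
iter->last_visited load -- the reverse of what the original code and the
first version of patch 1 did.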
* Re: [PATCH 1/3] memcg: fix subtle memory barrier bug in mem_cgroup_iter()
  From: Michal Hocko @ 2013-06-04 15:29 UTC
  To: Johannes Weiner; +Cc: Tejun Heo, bsingharora, cgroups, linux-mm, lizefan

On Tue 04-06-13 09:58:40, Johannes Weiner wrote:
> On Tue, Jun 04, 2013 at 03:03:36PM +0200, Michal Hocko wrote:
> > On Mon 03-06-13 17:44:37, Tejun Heo wrote:
> > [...]
> > > @@ -1218,9 +1218,18 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > >  			 * is alive.
> > >  			 */
> > >  			dead_count = atomic_read(&root->dead_count);
> > > -			smp_rmb();
> > > +
> > >  			last_visited = iter->last_visited;
> > >  			if (last_visited) {
> > > +				/*
> > > +				 * Paired with smp_wmb() below in this
> > > +				 * function.  The pair guarantee that
> > > +				 * last_visited is more current than
> > > +				 * last_dead_count, which may lead to
> > > +				 * spurious iteration resets but guarantees
> > > +				 * reliable detection of dead condition.
> > > +				 */
> > > +				smp_rmb();
> > >  				if ((dead_count != iter->last_dead_count) ||
> > >  					!css_tryget(&last_visited->css)) {
> > >  					last_visited = NULL;
> > 
> > I originally had the barrier this way but Johannes pointed out it is not
> > correct: https://lkml.org/lkml/2013/2/11/411
> > "
> > !> +		/*
> > !> +		 * last_visited might be invalid if some of the group
> > !> +		 * downwards was removed. As we do not know which one
> > !> +		 * disappeared we have to start all over again from the
> > !> +		 * root.
> > !> +		 * css ref count then makes sure that css won't
> > !> +		 * disappear while we iterate to the next memcg
> > !> +		 */
> > !> +		last_visited = iter->last_visited;
> > !> +		dead_count = atomic_read(&root->dead_count);
> > !> +		smp_rmb();
> > !
> > !Confused about this barrier, see below.
> > !
> > !As per above, if you remove the iter lock, those lines are mixed up.
> > !You need to read the dead count first because the writer updates the
> > !dead count after it sets the new position. That way, if the dead
> > !count gives the go-ahead, you KNOW that the position cache is valid,
> > !because it has been updated first. If either the two reads or the two
> > !writes get reordered, you risk seeing a matching dead count while the
> > !position cache is stale.
> > "
> 
> The original prototype code I sent looked like this:
> 
> 	mem_cgroup_iter:
> 	  rcu_read_lock()
> 	  if atomic_read(&root->dead_count) == iter->dead_count:
> 	    smp_rmb()
> 	    if tryget(iter->position):
> 	      position = iter->position
> 	  memcg = find_next(position)
> 	  css_put(position)
> 	  iter->position = memcg
> 	  smp_wmb() /* Write position cache BEFORE marking it uptodate */
> 	  iter->dead_count = atomic_read(&root->dead_count)
> 	  rcu_read_unlock()
> 
> iter->last_position is written, THEN iter->last_dead_count is written.
> 
> So, yes, you "need to read the dead count" first to be sure
> iter->last_position is up to date.  But iter->last_dead_count, not
> root->dead_count.  I should have caught this in the final submission
> of your patch :(

OK, right you are.  I managed to confuse myself by the three dependencies
here: dead_count -> last_visited -> last_dead_count.  The first one is
invalid because last_visited doesn't care about dead_count, and that makes
it much clearer now.

> Tejun's patch is not correct, either.  Something like this?

Yes, this looks saner and correct.  Care to send a full patch?

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 010d6c1..92830fa 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1199,7 +1199,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  
>  			mz = mem_cgroup_zoneinfo(root, nid, zid);
>  			iter = &mz->reclaim_iter[reclaim->priority];
> -			last_visited = iter->last_visited;
>  			if (prev && reclaim->generation != iter->generation) {
>  				iter->last_visited = NULL;
>  				goto out_unlock;
> @@ -1217,14 +1216,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  			 * css_tryget() verifies the cgroup pointed to
>  			 * is alive.
>  			 */
> +			last_visited = NULL;
>  			dead_count = atomic_read(&root->dead_count);
> -			smp_rmb();
> -			last_visited = iter->last_visited;
> -			if (last_visited) {
> -				if ((dead_count != iter->last_dead_count) ||
> -				    !css_tryget(&last_visited->css)) {
> +			if (dead_count == iter->last_dead_count) {
> +				/*
> +				 * The writer below sets the position
> +				 * pointer, then the dead count.
> +				 * Ensure we read the updated position
> +				 * when the dead count matches.
> +				 */
> +				smp_rmb();
> +				last_visited = iter->last_visited;
> +				if (last_visited &&
> +				    !css_tryget(&last_visited->css))
> 					last_visited = NULL;
> -				}
>  			}
>  		}

-- 
Michal Hocko
SUSE Labs
* [PATCH 2/3] memcg: restructure mem_cgroup_iter()
  From: Tejun Heo @ 2013-06-04  0:44 UTC
  To: mhocko, hannes, bsingharora; +Cc: cgroups, linux-mm, lizefan, Tejun Heo

mem_cgroup_iter() implements two iteration modes - plain and reclaim.
The former is a normal pre-order tree walk.  The latter tries to share
an iteration cursor per zone and priority pair among multiple reclaimers
so that they all contribute to scanning forward rather than banging on
the same cgroups simultaneously.

Implementing the two in the same function allows them to share code
paths, which is fine, but the current structure is unnecessarily
convoluted, with conditionals on @reclaim spread across the function
rather obscurely and with a somewhat strange control flow which checks
for conditions that can't occur and has duplicate tests for the same
conditions in different forms.

This patch restructures the function such that there's a single test on
@reclaim and the !reclaim path is contained in its own block, which
simplifies both the !reclaim and reclaim paths.  Also, the control flow
in the reclaim path is restructured and commented so that it's easier to
follow what's going on and why.

Note that after the patch reclaim->generation is synchronized to the
iter's on success whether @prev was specified or not.  This doesn't
cause any functional differences as the two generation numbers are
guaranteed to be the same at that point if @prev is specified, and it
makes the code slightly easier to follow.

This patch is pure restructuring and shouldn't introduce any functional
differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 mm/memcontrol.c | 131 ++++++++++++++++++++++++++++++--------------------------
 1 file changed, 71 insertions(+), 60 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cb2f91c..99e7357 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1170,8 +1170,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 				   struct mem_cgroup_reclaim_cookie *reclaim)
 {
 	struct mem_cgroup *memcg = NULL;
-	struct mem_cgroup *last_visited = NULL;
-	unsigned long uninitialized_var(dead_count);
+	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_reclaim_iter *iter;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -1179,9 +1179,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	if (!root)
 		root = root_mem_cgroup;
 
-	if (prev && !reclaim)
-		last_visited = prev;
-
 	if (!root->use_hierarchy && root != root_mem_cgroup) {
 		if (prev)
 			goto out_css_put;
@@ -1189,73 +1186,87 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	}
 
 	rcu_read_lock();
-	while (!memcg) {
-		struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
-
-		if (reclaim) {
-			int nid = zone_to_nid(reclaim->zone);
-			int zid = zone_idx(reclaim->zone);
-			struct mem_cgroup_per_zone *mz;
-
-			mz = mem_cgroup_zoneinfo(root, nid, zid);
-			iter = &mz->reclaim_iter[reclaim->priority];
-			last_visited = iter->last_visited;
-			if (prev && reclaim->generation != iter->generation) {
-				iter->last_visited = NULL;
-				goto out_unlock;
-			}
 
+	/* non reclaim case is simple - just iterate from @prev */
+	if (!reclaim) {
+		memcg = __mem_cgroup_iter_next(root, prev);
+		goto out_unlock;
+	}
+
+	/*
+	 * @reclaim specified - find and share the per-zone-priority
+	 * iterator.
+	 */
+	mz = mem_cgroup_zoneinfo(root, zone_to_nid(reclaim->zone),
+				 zone_idx(reclaim->zone));
+	iter = &mz->reclaim_iter[reclaim->priority];
+
+	while (true) {
+		struct mem_cgroup *last_visited;
+		unsigned long dead_count;
+
+		/*
+		 * If this caller already iterated through some and @iter
+		 * wrapped since, finish this the iteration.
+		 */
+		if (prev && reclaim->generation != iter->generation) {
+			iter->last_visited = NULL;
+			break;
+		}
+
+		/*
+		 * If the dead_count mismatches, a destruction has happened
+		 * or is happening concurrently.  If the dead_count
+		 * matches, a destruction might still happen concurrently,
+		 * but since we checked under RCU, that destruction won't
+		 * free the object until we release the RCU reader lock.
+		 * Thus, the dead_count check verifies the pointer is still
+		 * valid, css_tryget() verifies the cgroup pointed to is
+		 * alive.
+		 */
+		dead_count = atomic_read(&root->dead_count);
+
+		last_visited = iter->last_visited;
+		if (last_visited) {
 			/*
-			 * If the dead_count mismatches, a destruction
-			 * has happened or is happening concurrently.
-			 * If the dead_count matches, a destruction
-			 * might still happen concurrently, but since
-			 * we checked under RCU, that destruction
-			 * won't free the object until we release the
-			 * RCU reader lock.  Thus, the dead_count
-			 * check verifies the pointer is still valid,
-			 * css_tryget() verifies the cgroup pointed to
-			 * is alive.
+			 * Paired with smp_wmb() below in this function.
+			 * The pair guarantee that last_visited is more
+			 * current than last_dead_count, which may lead to
+			 * spurious iteration resets but guarantees
+			 * reliable detection of dead condition.
 			 */
-			dead_count = atomic_read(&root->dead_count);
-
-			last_visited = iter->last_visited;
-			if (last_visited) {
-				/*
-				 * Paired with smp_wmb() below in this
-				 * function.  The pair guarantee that
-				 * last_visited is more current than
-				 * last_dead_count, which may lead to
-				 * spurious iteration resets but guarantees
-				 * reliable detection of dead condition.
-				 */
-				smp_rmb();
-				if ((dead_count != iter->last_dead_count) ||
-					!css_tryget(&last_visited->css)) {
-					last_visited = NULL;
-				}
+			smp_rmb();
+			if ((dead_count != iter->last_dead_count) ||
+			    !css_tryget(&last_visited->css)) {
+				last_visited = NULL;
 			}
 		}
 
 		memcg = __mem_cgroup_iter_next(root, last_visited);
 
-		if (reclaim) {
-			if (last_visited)
-				css_put(&last_visited->css);
+		if (last_visited)
+			css_put(&last_visited->css);
 
-			iter->last_visited = memcg;
-			/* paired with smp_rmb() above in this function */
-			smp_wmb();
-			iter->last_dead_count = dead_count;
+		iter->last_visited = memcg;
+		/* paired with smp_rmb() above in this function */
+		smp_wmb();
+		iter->last_dead_count = dead_count;
 
-			if (!memcg)
-				iter->generation++;
-			else if (!prev && memcg)
-				reclaim->generation = iter->generation;
+		/* if successful, sync the generation number and return */
+		if (likely(memcg)) {
+			reclaim->generation = iter->generation;
+			break;
 		}
 
-		if (prev && !memcg)
-			goto out_unlock;
+		/*
+		 * The iterator reached the end.  If this reclaimer already
+		 * visited some cgroups, finish the iteration; otherwise,
+		 * start a new iteration from the beginning.
+		 */
+		if (prev)
+			break;
+
+		iter->generation++;
 	}
 out_unlock:
 	rcu_read_unlock();
-- 
1.8.2.1
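For readers new to this code, this is roughly how the two iteration modes
are driven by callers -- a simplified sketch modelled on the vmscan
shrink_zone() usage of the time; zone, priority and root are assumed to
come from the surrounding reclaim context:

	/* plain mode: a simple pre-order walk over the hierarchy */
	struct mem_cgroup *pos = NULL;

	while ((pos = mem_cgroup_iter(root, pos, NULL))) {
		/* visit pos; the iterator handles the css references */
	}

	/* reclaim mode: reclaimers at the same zone/priority share a cursor */
	struct mem_cgroup_reclaim_cookie reclaim = {
		.zone = zone,
		.priority = priority,
	};
	struct mem_cgroup *memcg = NULL;

	while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))) {
		/* shrink memcg's LRUs for this zone ... */
	}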
* Re: [PATCH 2/3] memcg: restructure mem_cgroup_iter()
  From: Michal Hocko @ 2013-06-04 13:21 UTC
  To: Tejun Heo; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

On Mon 03-06-13 17:44:38, Tejun Heo wrote:
> mem_cgroup_iter() implements two iteration modes - plain and reclaim.
> The former is a normal pre-order tree walk.  The latter tries to share
> an iteration cursor per zone and priority pair among multiple reclaimers
> so that they all contribute to scanning forward rather than banging on
> the same cgroups simultaneously.
> 
> Implementing the two in the same function allows them to share code
> paths, which is fine, but the current structure is unnecessarily
> convoluted, with conditionals on @reclaim spread across the function
> rather obscurely and with a somewhat strange control flow which checks
> for conditions that can't occur and has duplicate tests for the same
> conditions in different forms.
> 
> This patch restructures the function such that there's a single test on
> @reclaim and the !reclaim path is contained in its own block, which
> simplifies both the !reclaim and reclaim paths.  Also, the control flow
> in the reclaim path is restructured and commented so that it's easier to
> follow what's going on and why.
> 
> Note that after the patch reclaim->generation is synchronized to the
> iter's on success whether @prev was specified or not.  This doesn't
> cause any functional differences as the two generation numbers are
> guaranteed to be the same at that point if @prev is specified, and it
> makes the code slightly easier to follow.
> 
> This patch is pure restructuring and shouldn't introduce any functional
> differences.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  mm/memcontrol.c | 131 ++++++++++++++++++++++++++++++--------------------------
>  1 file changed, 71 insertions(+), 60 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cb2f91c..99e7357 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1170,8 +1170,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> 				   struct mem_cgroup_reclaim_cookie *reclaim)
>  {
>  	struct mem_cgroup *memcg = NULL;
> -	struct mem_cgroup *last_visited = NULL;
> -	unsigned long uninitialized_var(dead_count);
> +	struct mem_cgroup_per_zone *mz;
> +	struct mem_cgroup_reclaim_iter *iter;
>  
>  	if (mem_cgroup_disabled())
>  		return NULL;
> @@ -1179,9 +1179,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  	if (!root)
>  		root = root_mem_cgroup;
>  
> -	if (prev && !reclaim)
> -		last_visited = prev;
> -
>  	if (!root->use_hierarchy && root != root_mem_cgroup) {
>  		if (prev)
>  			goto out_css_put;
> @@ -1189,73 +1186,87 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  	}
>  
>  	rcu_read_lock();
> -	while (!memcg) {
> -		struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
> -
> -		if (reclaim) {
> -			int nid = zone_to_nid(reclaim->zone);
> -			int zid = zone_idx(reclaim->zone);
> -			struct mem_cgroup_per_zone *mz;
> -
> -			mz = mem_cgroup_zoneinfo(root, nid, zid);
> -			iter = &mz->reclaim_iter[reclaim->priority];
> -			last_visited = iter->last_visited;
> -			if (prev && reclaim->generation != iter->generation) {
> -				iter->last_visited = NULL;
> -				goto out_unlock;
> -			}
> 
> +	/* non reclaim case is simple - just iterate from @prev */
> +	if (!reclaim) {
> +		memcg = __mem_cgroup_iter_next(root, prev);
> +		goto out_unlock;
> +	}

I do not have objections to pulling the !reclaim case out like this, but
could you base this on top of the patch which adds predicates into the
operators, please?

[...]
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH 2/3] memcg: restructure mem_cgroup_iter()
  From: Tejun Heo @ 2013-06-04 20:51 UTC
  To: Michal Hocko; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

On Tue, Jun 04, 2013 at 03:21:20PM +0200, Michal Hocko wrote:
> > +	/* non reclaim case is simple - just iterate from @prev */
> > +	if (!reclaim) {
> > +		memcg = __mem_cgroup_iter_next(root, prev);
> > +		goto out_unlock;
> > +	}
> 
> I do not have objections to pulling the !reclaim case out like this, but
> could you base this on top of the patch which adds predicates into the
> operators, please?

I don't really mind either way but let's see how the other series goes.

Thanks.

-- 
tejun
* [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Tejun Heo @ 2013-06-04  0:44 UTC
  To: mhocko, hannes, bsingharora; +Cc: cgroups, linux-mm, lizefan, Tejun Heo

mem_cgroup_iter() shares mem_cgroup_reclaim_iters among multiple
reclaimers to prevent multiple reclaimers from banging on the same
cgroups.  To achieve this, mem_cgroup_reclaim_iter remembers the last
visited cgroup.

Before the recent changes, cgroup_next_descendant_pre() required that the
current cgroup is alive or that an RCU grace period hasn't passed since
its removal, as ->sibling.next couldn't be trusted otherwise.  As bumping
a cgroup_subsys_state reference doesn't prevent the cgroup from being
removed, instead of pinning the current cgroup, mem_cgroup_reclaim_iter
tracks the number of cgroup removal events in the subtree and resets the
iteration if any removal has happened since caching the current cgroup.
This involves an overly elaborate and hard-to-follow synchronization
scheme, as it needs to game the cgroup removal RCU grace period.

Now that cgroup_next_descendant_pre() can return the next sibling
reliably regardless of the state of the current cgroup, this can be
implemented in a much simpler and more conventional way.
mem_cgroup_reclaim_iter can pin the current cgroup and use
__mem_cgroup_iter_next() on it for the next iteration.  The whole thing
becomes normal RCU synchronization.  Updating the cursor to the next
position is slightly more involved as multiple tasks could be trying to
update it at the same time; however, it can easily be implemented using
xchg().

This replaces the overly elaborate synchronization scheme, along with the
->dead_count management, with more conventional RCU usage.  As an added
bonus, the new implementation doesn't reset the cursor every time a
cgroup is deleted in the subtree; it safely continues the iteration.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 mm/memcontrol.c | 89 ++++++++++++++------------------------------------------
 1 file changed, 21 insertions(+), 68 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 99e7357..4057730 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -155,12 +155,8 @@ struct mem_cgroup_stat_cpu {
 };
 
 struct mem_cgroup_reclaim_iter {
-	/*
-	 * last scanned hierarchy member. Valid only if last_dead_count
-	 * matches memcg->dead_count of the hierarchy root group.
-	 */
-	struct mem_cgroup *last_visited;
-	unsigned long last_dead_count;
+	/* last scanned hierarchy member, pinned */
+	struct mem_cgroup __rcu *last_visited;
 
 	/* scan generation, increased every round-trip */
 	unsigned int generation;
@@ -1172,6 +1168,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	struct mem_cgroup *memcg = NULL;
 	struct mem_cgroup_per_zone *mz;
 	struct mem_cgroup_reclaim_iter *iter;
+	struct mem_cgroup *last_visited;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -1195,63 +1192,25 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 
 	/*
 	 * @reclaim specified - find and share the per-zone-priority
-	 * iterator.
+	 * iterator.  Because @iter->last_visited holds the reference and
+	 * css's are RCU protected, it's guaranteed that last_visited will
+	 * remain accessible while we're holding RCU read lock.
 	 */
 	mz = mem_cgroup_zoneinfo(root, zone_to_nid(reclaim->zone),
				 zone_idx(reclaim->zone));
 	iter = &mz->reclaim_iter[reclaim->priority];
+	last_visited = rcu_dereference(iter->last_visited);
 
 	while (true) {
-		struct mem_cgroup *last_visited;
-		unsigned long dead_count;
-
 		/*
 		 * If this caller already iterated through some and @iter
 		 * wrapped since, finish this the iteration.
 		 */
-		if (prev && reclaim->generation != iter->generation) {
-			iter->last_visited = NULL;
+		if (prev && reclaim->generation != iter->generation)
 			break;
-		}
-
-		/*
-		 * If the dead_count mismatches, a destruction has happened
-		 * or is happening concurrently.  If the dead_count
-		 * matches, a destruction might still happen concurrently,
-		 * but since we checked under RCU, that destruction won't
-		 * free the object until we release the RCU reader lock.
-		 * Thus, the dead_count check verifies the pointer is still
-		 * valid, css_tryget() verifies the cgroup pointed to is
-		 * alive.
-		 */
-		dead_count = atomic_read(&root->dead_count);
-
-		last_visited = iter->last_visited;
-		if (last_visited) {
-			/*
-			 * Paired with smp_wmb() below in this function.
-			 * The pair guarantee that last_visited is more
-			 * current than last_dead_count, which may lead to
-			 * spurious iteration resets but guarantees
-			 * reliable detection of dead condition.
-			 */
-			smp_rmb();
-			if ((dead_count != iter->last_dead_count) ||
-			    !css_tryget(&last_visited->css)) {
-				last_visited = NULL;
-			}
-		}
 
 		memcg = __mem_cgroup_iter_next(root, last_visited);
 
-		if (last_visited)
-			css_put(&last_visited->css);
-
-		iter->last_visited = memcg;
-		/* paired with smp_rmb() above in this function */
-		smp_wmb();
-		iter->last_dead_count = dead_count;
-
 		/* if successful, sync the generation number and return */
 		if (likely(memcg)) {
 			reclaim->generation = iter->generation;
@@ -1267,7 +1226,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			break;
 
 		iter->generation++;
+		last_visited = NULL;
 	}
+
+	/*
+	 * Update @iter to the new position.  As multiple tasks could be
+	 * executing this path, atomically swap the new and old.  We want
+	 * RCU assignment here but there's no rcu_xchg() and the plain
+	 * xchg() has enough memory barrier semantics.
+	 */
+	if (memcg)
+		css_get(&memcg->css);
+	last_visited = xchg(&iter->last_visited, memcg);
+	if (last_visited)
+		css_put(&last_visited->css);
 out_unlock:
 	rcu_read_unlock();
 out_css_put:
@@ -6324,29 +6296,10 @@ mem_cgroup_css_online(struct cgroup *cont)
 	return error;
 }
 
-/*
- * Announce all parents that a group from their hierarchy is gone.
- */
-static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)
-{
-	struct mem_cgroup *parent = memcg;
-
-	while ((parent = parent_mem_cgroup(parent)))
-		atomic_inc(&parent->dead_count);
-
-	/*
-	 * if the root memcg is not hierarchical we have to check it
-	 * explicitely.
-	 */
-	if (!root_mem_cgroup->use_hierarchy)
-		atomic_inc(&root_mem_cgroup->dead_count);
-}
-
 static void mem_cgroup_css_offline(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 
-	mem_cgroup_invalidate_reclaim_iterators(memcg);
 	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
 }
-- 
1.8.2.1
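To make the concurrent-update argument in the changelog concrete, the
hand-off the patch open-codes can be factored into a helper like the one
below -- an illustrative sketch, not part of the patch itself:

	/*
	 * Publish @new as the shared cursor and drop the pin on whatever
	 * it replaces.  Safe against concurrent publishers because the
	 * xchg() hands each old pointer to exactly one caller, so every
	 * cached cgroup is put exactly once regardless of interleaving.
	 */
	static void reclaim_iter_publish(struct mem_cgroup_reclaim_iter *iter,
					 struct mem_cgroup *new)
	{
		struct mem_cgroup *old;

		if (new)
			css_get(&new->css);		/* pin the new cursor */
		old = xchg(&iter->last_visited, new);	/* full barrier */
		if (old)
			css_put(&old->css);		/* unpin the old one */
	}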
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Michal Hocko @ 2013-06-04 13:18 UTC
  To: Tejun Heo; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

On Mon 03-06-13 17:44:39, Tejun Heo wrote:
[...]
> @@ -1267,7 +1226,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  			break;
>  
>  		iter->generation++;
> +		last_visited = NULL;
>  	}
> +
> +	/*
> +	 * Update @iter to the new position.  As multiple tasks could be
> +	 * executing this path, atomically swap the new and old.  We want
> +	 * RCU assignment here but there's no rcu_xchg() and the plain
> +	 * xchg() has enough memory barrier semantics.
> +	 */
> +	if (memcg)
> +		css_get(&memcg->css);

This is all good and nice but it re-introduces the same problem which has
been fixed by 5f578161 ("memcg: relax memcg iter caching").  You are
pinning a memcg in memory for an unbounded amount of time because the css
reference will not let the object go away and be freed.

I understand your frustration about the complexity of the current
synchronization but we didn't come up with anything easier.  Originally I
thought that your tree walk updates which allow dropping RCU would help
here, but then I realized that not really, because the iterator
(resp. pos) has to be a valid pointer and there is only one possibility
to do that AFAICS here, and that is css pinning.  And that is a no-go.

> +	last_visited = xchg(&iter->last_visited, memcg);
> +	if (last_visited)
> +		css_put(&last_visited->css);
>  out_unlock:
>  	rcu_read_unlock();
>  out_css_put:
[...]
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Tejun Heo @ 2013-06-04 20:50 UTC
  To: Michal Hocko; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

Hello, Michal.

On Tue, Jun 04, 2013 at 03:18:43PM +0200, Michal Hocko wrote:
> > +	if (memcg)
> > +		css_get(&memcg->css);
> 
> This is all good and nice but it re-introduces the same problem which has
> been fixed by 5f578161 ("memcg: relax memcg iter caching").  You are
> pinning a memcg in memory for an unbounded amount of time because the css
> reference will not let the object go away and be freed.

I don't get why that is a problem.  Can you please elaborate?  css's now
explicitly allow holding onto them.  We now have clear separation of
"destruction" and "release" and blkcg also depends on it.  If memcg still
doesn't distinguish the two properly, that's where the problem should be
fixed.

> I understand your frustration about the complexity of the current
> synchronization but we didn't come up with anything easier.  Originally I
> thought that your tree walk updates which allow dropping RCU would help
> here, but then I realized that not really, because the iterator
> (resp. pos) has to be a valid pointer and there is only one possibility
> to do that AFAICS here, and that is css pinning.  And that is a no-go.

I find the above really weird.  If a css can't be pinned for position
caching, isn't it natural to ask why it can't be and then fix that?
Because that's what the whole refcnt thing is about and a usage which
cgroup explicitly allows (e.g. blkcg also does it).  Why do you go from
there to "this batshit crazy barrier dancing is the only solution"?

Can you please explain why memcg css's can't be pinned?

Thanks.

-- 
tejun
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Michal Hocko @ 2013-06-04 21:28 UTC
  To: Tejun Heo; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

On Tue 04-06-13 13:50:25, Tejun Heo wrote:
> Hello, Michal.
> 
> On Tue, Jun 04, 2013 at 03:18:43PM +0200, Michal Hocko wrote:
> > > +	if (memcg)
> > > +		css_get(&memcg->css);
> > 
> > This is all good and nice but it re-introduces the same problem which has
> > been fixed by 5f578161 ("memcg: relax memcg iter caching").  You are
> > pinning a memcg in memory for an unbounded amount of time because the css
> > reference will not let the object go away and be freed.
> 
> I don't get why that is a problem.  Can you please elaborate?  css's now
> explicitly allow holding onto them.  We now have clear separation of
> "destruction" and "release" and blkcg also depends on it.  If memcg still
> doesn't distinguish the two properly, that's where the problem should be
> fixed.
> 
> > I understand your frustration about the complexity of the current
> > synchronization but we didn't come up with anything easier.  Originally I
> > thought that your tree walk updates which allow dropping RCU would help
> > here, but then I realized that not really, because the iterator
> > (resp. pos) has to be a valid pointer and there is only one possibility
> > to do that AFAICS here, and that is css pinning.  And that is a no-go.
> 
> I find the above really weird.  If a css can't be pinned for position
> caching, isn't it natural to ask why it can't be and then fix that?

Well, I do not mind pinning when I know that somebody releases the
reference in a predictable future (ideally almost immediately).  But the
cached iter represents time-unbounded pinning, because nobody can
guarantee that priority 3 at zone Normal at node 3 will ever be scanned
again, and the pointer in the last_visited node will be stuck there for
eternity.  Can we free a memcg with only the css elevated and safely check
that the cached pointer can be used without similar dances to the ones we
have now?  I am open to any suggestions.

> Because that's what the whole refcnt thing is about and a usage which
> cgroup explicitly allows (e.g. blkcg also does it).  Why do you go from
> there to "this batshit crazy barrier dancing is the only solution"?
> 
> Can you please explain why memcg css's can't be pinned?

Because it pins the memcg as well AFAIU, and we just do not want to keep
those around for eternity.

-- 
Michal Hocko
SUSE Labs
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Tejun Heo @ 2013-06-04 21:55 UTC
  To: Michal Hocko; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

Hello, Michal.

On Tue, Jun 04, 2013 at 11:28:08PM +0200, Michal Hocko wrote:
> Well, I do not mind pinning when I know that somebody releases the
> reference in a predictable future (ideally almost immediately).  But the
> cached iter represents time-unbounded pinning, because nobody can
> guarantee that priority 3 at zone Normal at node 3 will ever be scanned
> again, and the pointer in the last_visited node will be stuck there for

I don't really get that.  As long as the amount is bound and the overhead
negligible / acceptable, why does it matter how long the pinning persists?
We aren't talking about something gigantic or something that can leak
continuously.  It will only matter iff cgroups are continuously created
and destroyed and each live memcg will be able to pin one memcg (BTW, I
think I forgot to unpin on memcg destruction).

> eternity.  Can we free a memcg with only the css elevated and safely check
> that the cached pointer can be used without similar dances to the ones we
> have now?  I am open to any suggestions.

I really think this is worrying too much about something which doesn't
really matter and then coming up with an over-engineered solution for the
imagined problem.  This isn't a real problem.  No solution is necessary.

On the off chance that this is a real problem, which I strongly doubt, as
I wrote to Johannes, we can implement an extremely dumb cleanup routine
rather than this weak reference beast.

Thanks.

-- 
tejun
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Michal Hocko @ 2013-06-05  7:30 UTC
  To: Tejun Heo; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

On Tue 04-06-13 14:55:35, Tejun Heo wrote:
> Hello, Michal.
> 
> On Tue, Jun 04, 2013 at 11:28:08PM +0200, Michal Hocko wrote:
> > Well, I do not mind pinning when I know that somebody releases the
> > reference in a predictable future (ideally almost immediately).  But the
> > cached iter represents time-unbounded pinning, because nobody can
> > guarantee that priority 3 at zone Normal at node 3 will ever be scanned
> > again, and the pointer in the last_visited node will be stuck there for
> 
> I don't really get that.  As long as the amount is bound and the
> overhead negligible / acceptable, why does it matter how long the
> pinning persists?

Because the amount is not bound either.  Just create a hierarchy and
trigger the hard limit, and if you are careful enough you can always keep
some of the children in the cached pointer (with a css reference, if you
will) and then release the hierarchy.  You can do that repeatedly and leak
a considerable amount of memory.

> We aren't talking about something gigantic or can

mem_cgroup is 888B now (depending on configuration), so I wouldn't call it
negligible.

> leak continuously.  It will only matter iff cgroups are continuously
> created and destroyed and each live memcg will be able to pin one
> memcg (BTW, I think I forgot to unpin on memcg destruction).
> 
> > eternity.  Can we free a memcg with only the css elevated and safely check
> > that the cached pointer can be used without similar dances to the ones we
> > have now?  I am open to any suggestions.
> 
> I really think this is worrying too much about something which doesn't
> really matter and then coming up with an over-engineered solution for
> the imagined problem.  This isn't a real problem.  No solution is
> necessary.
> 
> On the off chance that this is a real problem, which I strongly doubt,
> as I wrote to Johannes, we can implement an extremely dumb cleanup
> routine rather than this weak reference beast.

That was my first version (https://lkml.org/lkml/2013/1/3/298) and
Johannes didn't like it.  To be honest I do not care _much_ which way we
go, but we definitely cannot pin those objects for ever.

-- 
Michal Hocko
SUSE Labs
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Tejun Heo @ 2013-06-05  8:20 UTC
  To: Michal Hocko; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

Hello, Michal.

On Wed, Jun 05, 2013 at 09:30:23AM +0200, Michal Hocko wrote:
> > I don't really get that.  As long as the amount is bound and the
> > overhead negligible / acceptable, why does it matter how long the
> > pinning persists?
> 
> Because the amount is not bound either.  Just create a hierarchy and
> trigger the hard limit, and if you are careful enough you can always keep
> some of the children in the cached pointer (with a css reference, if you
> will) and then release the hierarchy.  You can do that repeatedly and leak
> a considerable amount of memory.

It's still bound, no?  Each live memcg can only keep a limited number of
cgroups cached, right?

> > We aren't talking about something gigantic or can
> 
> mem_cgroup is 888B now (depending on configuration), so I wouldn't call it
> negligible.

Do you think that the number can actually grow to be harmful?  Would you
be kind enough to share some calculations with me?

> > On the off chance that this is a real problem, which I strongly doubt,
> > as I wrote to Johannes, we can implement an extremely dumb cleanup
> > routine rather than this weak reference beast.
> 
> That was my first version (https://lkml.org/lkml/2013/1/3/298) and
> Johannes didn't like it.  To be honest I do not care _much_ which way we
> go, but we definitely cannot pin those objects for ever.

I'll get to the barrier thread but really complex barrier dancing like
that is only justifiable in extremely hot paths a lot of people pay
attention to.  It doesn't belong inside memcg proper.  If the cached
amount is an actual concern, let's please implement a simple clean up
thing.  All we need is a single delayed_work which scans the tree
periodically.

Johannes, what do you think?

Thanks.

-- 
tejun
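A rough sketch of the kind of periodic delayed_work reaper being floated
here -- illustrative only, not code from this series; the 5*HZ interval
and the unconditional clearing are placeholders (the actual suggestion
below only clears entries whose timestamp is old enough):

	static void memcg_iter_reap_fn(struct work_struct *work);
	static DECLARE_DELAYED_WORK(memcg_iter_reap_work, memcg_iter_reap_fn);

	static void memcg_iter_reap_fn(struct work_struct *work)
	{
		struct mem_cgroup *memcg = NULL;

		/* walk every memcg and drop its cached iterator positions */
		while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))) {
			int nid, zid, prio;

			for_each_node(nid) {
				for (zid = 0; zid < MAX_NR_ZONES; zid++) {
					struct mem_cgroup_per_zone *mz;

					mz = mem_cgroup_zoneinfo(memcg, nid, zid);
					for (prio = 0; prio <= DEF_PRIORITY; prio++) {
						struct mem_cgroup *cached;

						cached = xchg(&mz->reclaim_iter[prio].last_visited,
							      NULL);
						if (cached)
							css_put(&cached->css);
					}
				}
			}
		}
		schedule_delayed_work(&memcg_iter_reap_work, 5 * HZ);
	}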
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Michal Hocko @ 2013-06-05  8:36 UTC
  To: Tejun Heo; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

On Wed 05-06-13 01:20:23, Tejun Heo wrote:
> Hello, Michal.
> 
> On Wed, Jun 05, 2013 at 09:30:23AM +0200, Michal Hocko wrote:
> > > I don't really get that.  As long as the amount is bound and the
> > > overhead negligible / acceptable, why does it matter how long the
> > > pinning persists?
> > 
> > Because the amount is not bound either.  Just create a hierarchy and
> > trigger the hard limit, and if you are careful enough you can always keep
> > some of the children in the cached pointer (with a css reference, if you
> > will) and then release the hierarchy.  You can do that repeatedly and leak
> > a considerable amount of memory.
> 
> It's still bound, no?  Each live memcg can only keep a limited number of
> cgroups cached, right?

Assuming that they are cleaned up when the memcg is offlined, then yes.

> > > We aren't talking about something gigantic or can
> > 
> > mem_cgroup is 888B now (depending on configuration), so I wouldn't call it
> > negligible.
> 
> Do you think that the number can actually grow to be harmful?  Would you
> be kind enough to share some calculations with me?

Well, each intermediate node might pin up to NR_NODES * NR_ZONES *
NR_PRIORITY groups.  You would need a big hierarchy to have a chance to
cache different groups so that it starts to matter.

The problem is the clean-up though.  It might be a simple object, but
that does not help when it never gets freed.  So there _must_ be
something that releases the css reference to free the associated
resources.  As I said, this can be done either during css_offline or in
the lazy fashion that we have currently.  I really do not care much which
way it is done.

> > > On the off chance that this is a real problem, which I strongly doubt,
> > > as I wrote to Johannes, we can implement an extremely dumb cleanup
> > > routine rather than this weak reference beast.
> > 
> > That was my first version (https://lkml.org/lkml/2013/1/3/298) and
> > Johannes didn't like it.  To be honest I do not care _much_ which way we
> > go, but we definitely cannot pin those objects for ever.
> 
> I'll get to the barrier thread but really complex barrier dancing like
> that is only justifiable in extremely hot paths a lot of people pay
> attention to.  It doesn't belong inside memcg proper.  If the cached
> amount is an actual concern, let's please implement a simple clean up
> thing.  All we need is a single delayed_work which scans the tree
> periodically.

And do what?  css_tryget to find out whether the cached memcg is still
alive?  Sorry, I do not like it at all.  I find it much better to clean
up when the group is removed, because doing things asynchronously just
makes it more obscure.  There is no reason to do such a thing in the
background when we know _when_ to do the cleanup, and that is definitely
_not a hot path_.

> Johannes, what do you think?
> 
> Thanks.
> 
> -- 
> tejun

-- 
Michal Hocko
SUSE Labs
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Tejun Heo @ 2013-06-05  8:44 UTC
  To: Michal Hocko; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

Hey,

On Wed, Jun 05, 2013 at 10:36:28AM +0200, Michal Hocko wrote:
> > It's still bound, no?  Each live memcg can only keep a limited number of
> > cgroups cached, right?
> 
> Assuming that they are cleaned up when the memcg is offlined, then yes.

Oh yeah, that's just me being forgetful.  We definitely need to clean it
up on offlining.

> > Do you think that the number can actually grow to be harmful?  Would you
> > be kind enough to share some calculations with me?
> 
> Well, each intermediate node might pin up to NR_NODES * NR_ZONES *
> NR_PRIORITY groups.  You would need a big hierarchy to have a chance to
> cache different groups so that it starts to matter.

Yeah, NR_NODES can be pretty big.  I'm still not sure whether this would
be a problem in practice but yeah it can grow pretty big.

> And do what?  css_tryget to find out whether the cached memcg is still

Hmmm?  It can just look at the timestamp and if too old do

	cached = xchg(&iter->hint, NULL);
	if (cached)
		css_put(cached);

> alive?  Sorry, I do not like it at all.  I find it much better to clean
> up when the group is removed, because doing things asynchronously just
> makes it more obscure.  There is no reason to do such a thing in the
> background when we know _when_ to do the cleanup, and that is definitely
> _not a hot path_.

Yeah, that's true.  I just wanna avoid the barrier dancing.  Only one of
the ancestors can cache a memcg, right?  Walking up the tree scanning for
cached ones and putting them should work?  Is that what you were
suggesting?

Thanks.

-- 
tejun
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 8:44 ` Tejun Heo @ 2013-06-05 8:55 ` Michal Hocko 2013-06-05 9:03 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Michal Hocko @ 2013-06-05 8:55 UTC (permalink / raw) To: Tejun Heo; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan On Wed 05-06-13 01:44:56, Tejun Heo wrote: [...] > > alive. Sorry, I do not like it at all. I find it much better to clean up > > when the group is removed. Because doing things asynchronously just > > makes it more obscure. There is no reason to do such a thing on the > > background when we know _when_ to do the cleanup and that is definitely > > _not a hot path_. > > Yeah, that's true. I just wanna avoid the barrier dancing. Only one > of the ancestors can cache a memcg, right? No. All of them on the way up hierarchy. Basically each parent which ever triggered the reclaim caches reclaimers. > Walking up the tree scanning for cached ones and putting them should > work? Is that what you were suggesting? That was my first version of the patch I linked in the previous email. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 8:55 ` Michal Hocko @ 2013-06-05 9:03 ` Tejun Heo 0 siblings, 0 replies; 42+ messages in thread From: Tejun Heo @ 2013-06-05 9:03 UTC (permalink / raw) To: Michal Hocko; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan Hey, On Wed, Jun 05, 2013 at 10:55:31AM +0200, Michal Hocko wrote: > > Yeah, that's true. I just wanna avoid the barrier dancing. Only one > > of the ancestors can cache a memcg, right? > > No. All of them on the way up hierarchy. Basically each parent which > ever triggered the reclaim caches reclaimers. Oh, I meant only the ancestors can cache a memcg, so yeap. > > Walking up the tree scanning for cached ones and putting them should > > work? Is that what you were suggesting? > > That was my first version of the patch I linked in the previous email. Yeah, indeed. Johannes, what do you think? Between the recent cgroup iterator update and xchg(), we don't need the weak referencing and it's just wrong to have that level of complexity in memcg proper. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 8:20 ` Tejun Heo 2013-06-05 8:36 ` Michal Hocko @ 2013-06-05 14:39 ` Johannes Weiner 2013-06-05 14:50 ` Johannes Weiner ` (2 more replies) 1 sibling, 3 replies; 42+ messages in thread From: Johannes Weiner @ 2013-06-05 14:39 UTC (permalink / raw) To: Tejun Heo; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan On Wed, Jun 05, 2013 at 01:20:23AM -0700, Tejun Heo wrote: > Hello, Michal. > > On Wed, Jun 05, 2013 at 09:30:23AM +0200, Michal Hocko wrote: > > > I don't really get that. As long as the amount is bound and the > > > overhead negligible / acceptable, why does it matter how long the > > > pinning persists? > > > > Because the amount is not bound either. Just create a hierarchy and > > trigger the hard limit and if you are careful enough you can always keep > > some of the children in the cached pointer (with css reference, if you > > will) and then release the hierarchy. You can do that repeatedly and > > leak considerable amount of memory. > > It's still bound, no? Each live memcg can only keep limited number of > cgroups cached, right? It is bounded by the number of memcgs. Each one can have 12 (DEF_PRIORITY) references. > > > We aren't talking about something gigantic or can > > > > mem_cgroup is 888B now (depending on configuration). So I wouldn't call > > it negligible. > > Do you think that the number can actually grow harmful? Would you be > kind enough to share some calculations with me? 5k cgroups * say 10 priority levels * 1k struct mem_cgroup may pin 51M of dead struct mem_cgroup, plus whatever else the css pins. > > > In the off chance that this is a real problem, which I strongly doubt, > > > as I wrote to Johannes, we can implement extremely dumb cleanup > > > routine rather than this weak reference beast. > > > > That was my first version (https://lkml.org/lkml/2013/1/3/298) and > > Johannes didn't like. To be honest I do not care _much_ which way we go > > but we definitely cannot pin those objects for ever. > > I'll get to the barrier thread but really complex barrier dancing like > that is only justifiable in extremely hot paths a lot of people pay > attention to. It doesn't belong inside memcg proper. If the cached > amount is an actual concern, let's please implement a simple clean up > thing. All we need is a single delayed_work which scans the tree > periodically. > > Johannes, what do you think? While I see your concerns about complexity (and this certainly is not the most straight-forward code), I really can't get too excited about asynchroneous garbage collection, even worse when it's time-based. It would probably start out with less code but two releases later we would have added all this stuff that's required to get the interaction right and fix unpredictable reclaim disruption that hits when the reaper coincides just right with heavy reclaim once a week etc. I just don't think that asynchroneous models are simpler than state machines. Harder to reason about, harder to debug. Now, there are separate things that add complexity to our current code: the weak pointers, the lockless iterator, and the fact that all of it is jam-packed into one monolithic iterator function. I can see why you are not happy. But that does not mean we have to get rid of everything wholesale. You hate the barriers, so let's add a lock to access the iterator. That path is not too hot in most cases. 
On the other hand, the weak pointer is not too esoteric of a pattern and we can neatly abstract it into two functions: one that takes an iterator and returns a verified css reference or NULL, and one to invalidate pointers when called from the memcg destruction code. These two things should greatly simplify mem_cgroup_iter() while not completely abandoning all our optimizations. What do you think? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 14:39 ` Johannes Weiner @ 2013-06-05 14:50 ` Johannes Weiner 2013-06-05 14:56 ` Michal Hocko 2013-06-05 17:22 ` Tejun Heo 2 siblings, 0 replies; 42+ messages in thread From: Johannes Weiner @ 2013-06-05 14:50 UTC (permalink / raw) To: Tejun Heo; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan On Wed, Jun 05, 2013 at 10:39:49AM -0400, Johannes Weiner wrote: > On Wed, Jun 05, 2013 at 01:20:23AM -0700, Tejun Heo wrote: > > Hello, Michal. > > > > On Wed, Jun 05, 2013 at 09:30:23AM +0200, Michal Hocko wrote: > > > > We aren't talking about something gigantic or can > > > > > > mem_cgroup is 888B now (depending on configuration). So I wouldn't call > > > it negligible. > > > > Do you think that the number can actually grow harmful? Would you be > > kind enough to share some calculations with me? > > 5k cgroups * say 10 priority levels * 1k struct mem_cgroup may pin 51M > of dead struct mem_cgroup, plus whatever else the css pins. Bleh, ... * nr_node_ids * MAX_NR_ZONES. So it is a couple hundred MB in that case. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
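Spelled out as a back-of-the-envelope worst case (the node and zone counts here are illustrative, not from the thread): 5000 memcgs * 1 node * 4 zones * 10 priority levels = 200,000 iterator slots, each able to pin a distinct dead struct mem_cgroup of roughly 1KB, i.e. on the order of 200MB; every additional node multiplies that figure.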
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 14:39 ` Johannes Weiner 2013-06-05 14:50 ` Johannes Weiner @ 2013-06-05 14:56 ` Michal Hocko 2013-06-05 17:22 ` Tejun Heo 2 siblings, 0 replies; 42+ messages in thread From: Michal Hocko @ 2013-06-05 14:56 UTC (permalink / raw) To: Johannes Weiner; +Cc: Tejun Heo, bsingharora, cgroups, linux-mm, lizefan On Wed 05-06-13 10:39:49, Johannes Weiner wrote: [...] > You hate the barriers, so let's add a lock to access the iterator. > That path is not too hot in most cases. yes, basically only reclaimers use that path which makes it a slow path by definition. > On the other hand, the weak pointer is not too esoteric of a pattern > and we can neatly abstract it into two functions: one that takes an > iterator and returns a verified css reference or NULL, and one to > invalidate pointers when called from the memcg destruction code. would be nice. > These two things should greatly simplify mem_cgroup_iter() while not > completely abandoning all our optimizations. > > What do you think? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 14:39 ` Johannes Weiner 2013-06-05 14:50 ` Johannes Weiner 2013-06-05 14:56 ` Michal Hocko @ 2013-06-05 17:22 ` Tejun Heo 2013-06-05 19:45 ` Johannes Weiner 2 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-05 17:22 UTC (permalink / raw) To: Johannes Weiner; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan Hey, Johannes. On Wed, Jun 05, 2013 at 10:39:49AM -0400, Johannes Weiner wrote: > 5k cgroups * say 10 priority levels * 1k struct mem_cgroup may pin 51M > of dead struct mem_cgroup, plus whatever else the css pins. Yeah, it seems like it can grow quite a bit. > > I'll get to the barrier thread but really complex barrier dancing like > > that is only justifiable in extremely hot paths a lot of people pay > > attention to. It doesn't belong inside memcg proper. If the cached > > amount is an actual concern, let's please implement a simple clean up > > thing. All we need is a single delayed_work which scans the tree > > periodically. > > > > Johannes, what do you think? > > While I see your concerns about complexity (and this certainly is not > the most straight-forward code), I really can't get too excited about > asynchroneous garbage collection, even worse when it's time-based. It > would probably start out with less code but two releases later we > would have added all this stuff that's required to get the interaction > right and fix unpredictable reclaim disruption that hits when the > reaper coincides just right with heavy reclaim once a week etc. I > just don't think that asynchroneous models are simpler than state > machines. Harder to reason about, harder to debug. Agreed, but we can do the cleanup from ->css_offline() as Michal suggested. Naively implemented, this will lose the nice property of keeping the iteration point even when the cursor cgroup is removed, which can be an issue if we're actually worrying about cases with 5k cgroups continuously being created and destroyed. Maybe we can make it point to the next cgroup to visit rather than the last visited one and update it from ->css_offline(). > Now, there are separate things that add complexity to our current > code: the weak pointers, the lockless iterator, and the fact that all > of it is jam-packed into one monolithic iterator function. I can see > why you are not happy. But that does not mean we have to get rid of > everything wholesale. > > You hate the barriers, so let's add a lock to access the iterator. > That path is not too hot in most cases. > > On the other hand, the weak pointer is not too esoteric of a pattern > and we can neatly abstract it into two functions: one that takes an > iterator and returns a verified css reference or NULL, and one to > invalidate pointers when called from the memcg destruction code. > > These two things should greatly simplify mem_cgroup_iter() while not > completely abandoning all our optimizations. > > What do you think? I really think the weak pointers should go especially as we can achieve about the same thing with normal RCU dereference. Also, I'm a bit confused about what you're suggesting. If we have invalidation from offline, why do we need weak pointers? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
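A rough sketch of the synchronous variant Tejun describes here, with cursors pointing at the next cgroup to visit. The next_to_visit field and the function name are made up for illustration, the !use_hierarchy special case and synchronization against concurrent iterator updates are omitted, and this is not what was eventually merged:

static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *dead)
{
	struct mem_cgroup *parent = dead;

	/* only ancestors of @dead can have it cached */
	while ((parent = parent_mem_cgroup(parent))) {
		int node, zid, prio;

		for_each_node(node) {
			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
				struct mem_cgroup_per_zone *mz;

				mz = mem_cgroup_zoneinfo(parent, node, zid);
				for (prio = 0; prio <= DEF_PRIORITY; prio++) {
					struct mem_cgroup_reclaim_iter *iter;

					iter = &mz->reclaim_iter[prio];
					/* advance past the dying cgroup */
					if (iter->next_to_visit == dead)
						iter->next_to_visit =
							__mem_cgroup_iter_next(parent, dead);
				}
			}
		}
	}
}

Called from ->css_offline(), this keeps shared reclaim from losing its place when the cursor cgroup goes away, at the cost of an O(ancestors * nodes * zones * priorities) walk per destruction - the trade-off weighed in the rest of the thread.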
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 17:22 ` Tejun Heo @ 2013-06-05 19:45 ` Johannes Weiner 2013-06-05 20:06 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Johannes Weiner @ 2013-06-05 19:45 UTC (permalink / raw) To: Tejun Heo; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan On Wed, Jun 05, 2013 at 10:22:12AM -0700, Tejun Heo wrote: > Hey, Johannes. > > On Wed, Jun 05, 2013 at 10:39:49AM -0400, Johannes Weiner wrote: > > 5k cgroups * say 10 priority levels * 1k struct mem_cgroup may pin 51M > > of dead struct mem_cgroup, plus whatever else the css pins. > > Yeah, it seems like it can grow quite a bit. > > > > I'll get to the barrier thread but really complex barrier dancing like > > > that is only justifiable in extremely hot paths a lot of people pay > > > attention to. It doesn't belong inside memcg proper. If the cached > > > amount is an actual concern, let's please implement a simple clean up > > > thing. All we need is a single delayed_work which scans the tree > > > periodically. > > > > > > Johannes, what do you think? > > > > While I see your concerns about complexity (and this certainly is not > > the most straight-forward code), I really can't get too excited about > > asynchroneous garbage collection, even worse when it's time-based. It > > would probably start out with less code but two releases later we > > would have added all this stuff that's required to get the interaction > > right and fix unpredictable reclaim disruption that hits when the > > reaper coincides just right with heavy reclaim once a week etc. I > > just don't think that asynchroneous models are simpler than state > > machines. Harder to reason about, harder to debug. > > Agreed, but we can do the cleanup from ->css_offline() as Michal > suggested. Naively implemented, this will lose the nice property of > keeping the iteration point even when the cursor cgroup is removed, > which can be an issue if we're actually worrying about cases with 5k > cgroups continuously being created and destroyed. Maybe we can make > it point to the next cgroup to visit rather than the last visited one > and update it from ->css_offline(). I'm not sure what you are suggesting. Synchroneously invalidate every individual iterator upwards the hierarchy every time a cgroup is destroyed? > > Now, there are separate things that add complexity to our current > > code: the weak pointers, the lockless iterator, and the fact that all > > of it is jam-packed into one monolithic iterator function. I can see > > why you are not happy. But that does not mean we have to get rid of > > everything wholesale. > > > > You hate the barriers, so let's add a lock to access the iterator. > > That path is not too hot in most cases. > > > > On the other hand, the weak pointer is not too esoteric of a pattern > > and we can neatly abstract it into two functions: one that takes an > > iterator and returns a verified css reference or NULL, and one to > > invalidate pointers when called from the memcg destruction code. > > > > These two things should greatly simplify mem_cgroup_iter() while not > > completely abandoning all our optimizations. > > > > What do you think? > > I really think the weak pointers should go especially as we can > achieve about the same thing with normal RCU dereference. Also, I'm a > bit confused about what you're suggesting. If we have invalidation > from offline, why do we need weak pointers? The invalidation I am talking about is what we do by increasing the dead counts. 
This lazily invalidates all the weak pointers in the iterators of the hierarchy root. Of course if you do a synchroneous invalidation of individual iterators, we don't need weak pointers anymore and RCU is enough, but that would mean nr_levels * nr_nodes * nr_zones * nr_priority_levels invalidation operations per destruction, whereas the weak pointers are invalidated with one atomic_inc() per nesting level. As I said, the weak pointers are only a few lines of code that can be neatly self-contained (see the invalidate, load, store functions below). Please convince me that your alternative solution will save complexity to such an extent that either the memory waste of indefinite css pinning, or the computational overhead of non-lazy iterator cleanup, is justifiable. diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 010d6c1..e872554 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1148,6 +1148,57 @@ skip_node: return NULL; } +static void mem_cgroup_iter_invalidate(struct mem_cgroup *root) +{ + /* + * When a group in the hierarchy below root is destroyed, the + * hierarchy iterator can no longer be trusted since it might + * have pointed to the destroyed group. Invalidate it. + */ + atomic_inc(&root->dead_count); +} + +static struct mem_cgroup *mem_cgroup_iter_load(struct mem_cgroup_reclaim_iter *iter, + struct mem_cgroup *root, + int *sequence) +{ + struct mem_cgroup *position = NULL; + /* + * A cgroup destruction happens in two stages: offlining and + * release. They are separated by a RCU grace period. + * + * If the iterator is valid, we may still race with an + * offlining. The RCU lock ensures the object won't be + * released, tryget will fail if we lost the race. + */ + *sequence = atomic_read(&root->dead_count); + if (iter->last_dead_count == *sequence) { + smp_rmb(); + position = iter->last_visited; + if (position && !css_tryget(&position->css)) + position = NULL; + } + return position; +} + +static void mem_cgroup_iter_update(struct mem_cgroup_reclaim_iter *iter, + struct mem_cgroup *last_visited, + struct mem_cgroup *new_position, + int sequence) +{ + if (last_visited) + css_put(&last_visited->css); + /* + * We store the sequence count from the time @last_visited was + * loaded successfully instead of rereading it here so that we + * don't lose destruction events in between. We could have + * raced with the destruction of @new_position after all. + */ + iter->last_visited = new_position; + smp_wmb(); + iter->last_dead_count = sequence; +} + /** * mem_cgroup_iter - iterate over memory cgroup hierarchy * @root: hierarchy root @@ -1171,7 +1222,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, { struct mem_cgroup *memcg = NULL; struct mem_cgroup *last_visited = NULL; - unsigned long uninitialized_var(dead_count); if (mem_cgroup_disabled()) return NULL; @@ -1191,6 +1241,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, rcu_read_lock(); while (!memcg) { struct mem_cgroup_reclaim_iter *uninitialized_var(iter); + int sequence; if (reclaim) { int nid = zone_to_nid(reclaim->zone); @@ -1205,38 +1256,13 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, goto out_unlock; } - /* - * If the dead_count mismatches, a destruction - * has happened or is happening concurrently. - * If the dead_count matches, a destruction - * might still happen concurrently, but since - * we checked under RCU, that destruction - * won't free the object until we release the - * RCU reader lock. 
Thus, the dead_count - * check verifies the pointer is still valid, - * css_tryget() verifies the cgroup pointed to - * is alive. - */ - dead_count = atomic_read(&root->dead_count); - smp_rmb(); - last_visited = iter->last_visited; - if (last_visited) { - if ((dead_count != iter->last_dead_count) || - !css_tryget(&last_visited->css)) { - last_visited = NULL; - } - } + last_visited = mem_cgroup_iter_load(iter, root, &sequence); } memcg = __mem_cgroup_iter_next(root, last_visited); if (reclaim) { - if (last_visited) - css_put(&last_visited->css); - - iter->last_visited = memcg; - smp_wmb(); - iter->last_dead_count = dead_count; + mem_cgroup_iter_update(iter, last_visited, memcg, sequence); if (!memcg) iter->generation++; @@ -6321,14 +6347,14 @@ static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg) struct mem_cgroup *parent = memcg; while ((parent = parent_mem_cgroup(parent))) - atomic_inc(&parent->dead_count); + mem_cgroup_iter_invalidate(parent); /* * if the root memcg is not hierarchical we have to check it * explicitely. */ if (!root_mem_cgroup->use_hierarchy) - atomic_inc(&root_mem_cgroup->dead_count); + mem_cgroup_iter_invalidate(parent); } static void mem_cgroup_css_offline(struct cgroup *cont) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 19:45 ` Johannes Weiner @ 2013-06-05 20:06 ` Tejun Heo 2013-06-05 21:17 ` Johannes Weiner 0 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-05 20:06 UTC (permalink / raw) To: Johannes Weiner; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan Hello, On Wed, Jun 05, 2013 at 03:45:52PM -0400, Johannes Weiner wrote: > I'm not sure what you are suggesting. Synchroneously invalidate every > individual iterator upwards the hierarchy every time a cgroup is > destroyed? Yeap. > The invalidation I am talking about is what we do by increasing the > dead counts. This lazily invalidates all the weak pointers in the > iterators of the hierarchy root. > > Of course if you do a synchroneous invalidation of individual > iterators, we don't need weak pointers anymore and RCU is enough, but > that would mean nr_levels * nr_nodes * nr_zones * nr_priority_levels > invalidation operations per destruction, whereas the weak pointers are > invalidated with one atomic_inc() per nesting level. While it does have to traverse the arrays, it's still bound by the depth of nesting and cgroup destruction is a pretty cold path. I don't think it'd matter that much. > As I said, the weak pointers are only a few lines of code that can be > neatly self-contained (see the invalidate, load, store functions > below). Please convince me that your alternative solution will save > complexity to such an extent that either the memory waste of > indefinite css pinning, or the computational overhead of non-lazy > iterator cleanup, is justifiable. The biggest issue I see with the weak pointer is that it's special and tricky. If this is something which is absolutely necessary, it should be somewhere more generic. Also, if we can use the usual RCU deref with O(depth) cleanup in the cold path, I don't see how this deviation is justifiable. For people who've been looking at it for long enough, it probably isn't that different from using plain RCU but that's just because that person has spent the time to build that pattern into his/her brain. We now have a lot of people accustomed to plain RCU usages which in itself is tricky already and introducing new constructs is actively deterimental to maintainability. We sure can do that when there's no alternative but I don't think avoiding synchronous cleanup on cgroup destruction path is a good enough reason. It feels like an over-engineering to me. Another thing is that this matters the most when there are continuous creation and destruction of cgroups and the weak pointer implementation would keep resetting the iteration to the beginning. Depending on timing, it'd be able to live-lock reclaim cursor to the beginning of iteration even with fairly low rate of destruction, right? It can be pretty bad high up the tree. With synchronous cleanup, depending on how it's implemented, it can be made to keep the iteration position. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 20:06 ` Tejun Heo @ 2013-06-05 21:17 ` Johannes Weiner 2013-06-05 22:20 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Johannes Weiner @ 2013-06-05 21:17 UTC (permalink / raw) To: Tejun Heo; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan Hi Tejun, On Wed, Jun 05, 2013 at 01:06:12PM -0700, Tejun Heo wrote: > On Wed, Jun 05, 2013 at 03:45:52PM -0400, Johannes Weiner wrote: > > I'm not sure what you are suggesting. Synchroneously invalidate every > > individual iterator upwards the hierarchy every time a cgroup is > > destroyed? > > Yeap. > > As I said, the weak pointers are only a few lines of code that can be > > neatly self-contained (see the invalidate, load, store functions > > below). Please convince me that your alternative solution will save > > complexity to such an extent that either the memory waste of > > indefinite css pinning, or the computational overhead of non-lazy > > iterator cleanup, is justifiable. > > The biggest issue I see with the weak pointer is that it's special and > tricky. If this is something which is absolutely necessary, it should > be somewhere more generic. Also, if we can use the usual RCU deref > with O(depth) cleanup in the cold path, I don't see how this deviation > is justifiable. > > For people who've been looking at it for long enough, it probably > isn't that different from using plain RCU but that's just because that > person has spent the time to build that pattern into his/her brain. > We now have a lot of people accustomed to plain RCU usages which in > itself is tricky already and introducing new constructs is actively > deterimental to maintainability. We sure can do that when there's no > alternative but I don't think avoiding synchronous cleanup on cgroup > destruction path is a good enough reason. It feels like an > over-engineering to me. > > Another thing is that this matters the most when there are continuous > creation and destruction of cgroups and the weak pointer > implementation would keep resetting the iteration to the beginning. > Depending on timing, it'd be able to live-lock reclaim cursor to the > beginning of iteration even with fairly low rate of destruction, > right? It can be pretty bad high up the tree. With synchronous > cleanup, depending on how it's implemented, it can be made to keep the > iteration position. That could be an advantage, yes. But keep in mind that every destruction has to perform this invalidation operation against the global root_mem_cgroup's nr_node * nr_zone * nr_priority_levels iterators, so you can't muck around forever, while possibly holding a lock at this level. It's not a hot path, but you don't want to turn it into one, either. The upshot for me is this: whether you do long-term pinning or greedy iterator invalidation, the cost of cgroup destruction increases. Either in terms of memory usage or in terms of compute time. I would have loved to see something as simple as the long-term pinning work out in practice, because it truly would have been simpler. But at this point, I don't really care much because the projected margins of reduction in complexity and increase of cost from your proposal are too small for me to feel strongly about one solution or the other, or go ahead and write the code. I'll look at your patches, though ;-) Either way, I'll prepare the patch set that includes the barrier fix and a small cleanup to make the weak pointer management more palatable. 
I'm still open to code proposals, so don't let it distract you, but we might as well make it a bit more readable in the meantime. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 21:17 ` Johannes Weiner @ 2013-06-05 22:20 ` Tejun Heo 2013-06-05 22:27 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-05 22:20 UTC (permalink / raw) To: Johannes Weiner; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan Yo, On Wed, Jun 05, 2013 at 05:17:04PM -0400, Johannes Weiner wrote: > That could be an advantage, yes. But keep in mind that every > destruction has to perform this invalidation operation against the > global root_mem_cgroup's nr_node * nr_zone * nr_priority_levels > iterators, so you can't muck around forever, while possibly holding a > lock at this level. It's not a hot path, but you don't want to turn > it into one, either. nr_node tends to be pretty low in most cases, so it shouldn't be a problem there but yeah with high enough nodes and high enough rate of cgroup destruction, I guess it could be an issue in extreme cases. > The upshot for me is this: whether you do long-term pinning or greedy > iterator invalidation, the cost of cgroup destruction increases. > Either in terms of memory usage or in terms of compute time. I would > have loved to see something as simple as the long-term pinning work > out in practice, because it truly would have been simpler. But at > this point, I don't really care much because the projected margins of > reduction in complexity and increase of cost from your proposal are > too small for me to feel strongly about one solution or the other, or > go ahead and write the code. I'll look at your patches, though ;-) I don't know. I've developed this deep-seated distrust of any code which makes creative use of barriers and object lifetimes. We get them wrong too often, it makes other devs a lot more reluctant to review and dive into the code, and it's hellish to track down when something actually goes wrong. I'd happily pay a bit of computation or memory overhead for more conventional construct. In extremely hot paths, sure, we just bite and do it but I don't think this reaches that level. > Either way, I'll prepare the patch set that includes the barrier fix > and a small cleanup to make the weak pointer management more > palatable. I'm still open to code proposals, so don't let it distract > you, but we might as well make it a bit more readable in the meantime. Sure thing. We need to get it fixed for -stable anyway. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 22:20 ` Tejun Heo @ 2013-06-05 22:27 ` Tejun Heo 2013-06-06 11:50 ` Michal Hocko 0 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-05 22:27 UTC (permalink / raw) To: Johannes Weiner; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan On Wed, Jun 05, 2013 at 03:20:21PM -0700, Tejun Heo wrote: > Yo, > > On Wed, Jun 05, 2013 at 05:17:04PM -0400, Johannes Weiner wrote: > > That could be an advantage, yes. But keep in mind that every > > destruction has to perform this invalidation operation against the > > global root_mem_cgroup's nr_node * nr_zone * nr_priority_levels > > iterators, so you can't muck around forever, while possibly holding a > > lock at this level. It's not a hot path, but you don't want to turn > > it into one, either. > > nr_node tends to be pretty low in most cases, so it shouldn't be a > problem there but yeah with high enough nodes and high enough rate of Also, do we need to hold a lock? It doesn't have to be completely strict, so we might as well get away with something like, for_each_cached_pos() { if (hint == me) { /* simple clearing implementation, we prolly wanna push it forward */ cached = xchg(hint, NULL); if (cached) css_put(cached); } } It still scans the memory but wouldn't create any contention. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 22:27 ` Tejun Heo @ 2013-06-06 11:50 ` Michal Hocko 2013-06-07 0:52 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Michal Hocko @ 2013-06-06 11:50 UTC (permalink / raw) To: Tejun Heo; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan On Wed 05-06-13 15:27:09, Tejun Heo wrote: > On Wed, Jun 05, 2013 at 03:20:21PM -0700, Tejun Heo wrote: > > Yo, > > > > On Wed, Jun 05, 2013 at 05:17:04PM -0400, Johannes Weiner wrote: > > > That could be an advantage, yes. But keep in mind that every > > > destruction has to perform this invalidation operation against the > > > global root_mem_cgroup's nr_node * nr_zone * nr_priority_levels > > > iterators, so you can't muck around forever, while possibly holding a > > > lock at this level. It's not a hot path, but you don't want to turn > > > it into one, either. > > > > nr_node tends to be pretty low in most cases, so it shouldn't be a > > problem there but yeah with high enough nodes and high enough rate of > > Also, do we need to hold a lock? It doesn't have to be completely > strict, so we might as well get away with something like, > > for_each_cached_pos() { > if (hint == me) { > /* simple clearing implementation, we prolly wanna push it forward */ > cached = xchg(hint, NULL); > if (cached) > css_put(cached); > } > } This would be racy: mem_cgroup_iter rcu_read_lock __mem_cgroup_iter_next cgroup_destroy_locked css_tryget(memcg) atomic_add(CSS_DEACT_BIAS) offline_css(memcg) xchg(memcg, NULL) mem_cgroup_iter_update iter->last_visited = memcg rcy_read_unlock But if it was called from call_rcu the we should be safe AFAICS. > It still scans the memory but wouldn't create any contention. > > Thanks. > > -- > tejun -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-06 11:50 ` Michal Hocko @ 2013-06-07 0:52 ` Tejun Heo 2013-06-07 7:37 ` Michal Hocko 0 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-07 0:52 UTC (permalink / raw) To: Michal Hocko; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan Hello, On Thu, Jun 06, 2013 at 01:50:31PM +0200, Michal Hocko wrote: > > Also, do we need to hold a lock? It doesn't have to be completely > > strict, so we might as well get away with something like, > > > > for_each_cached_pos() { > > if (hint == me) { > > /* simple clearing implementation, we prolly wanna push it forward */ > > cached = xchg(hint, NULL); > > if (cached) > > css_put(cached); > > } > > } > > This would be racy: > mem_cgroup_iter > rcu_read_lock > __mem_cgroup_iter_next cgroup_destroy_locked > css_tryget(memcg) > atomic_add(CSS_DEACT_BIAS) > offline_css(memcg) > xchg(memcg, NULL) > mem_cgroup_iter_update > iter->last_visited = memcg > rcy_read_unlock > > But if it was called from call_rcu the we should be safe AFAICS. Oh yeah, it is racy. That's what I meant by "not having to be completely strict". The race window is small enough and it's not like we're messing up refcnt or may end up with use-after-free. Doing it from RCU would make the race go away but I'm not sure whether the extra RCU bouncing is worthwhile. I don't know. Maybe. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-07 0:52 ` Tejun Heo @ 2013-06-07 7:37 ` Michal Hocko 2013-06-07 23:25 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Michal Hocko @ 2013-06-07 7:37 UTC (permalink / raw) To: Tejun Heo; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan On Thu 06-06-13 17:52:42, Tejun Heo wrote: > Hello, > > On Thu, Jun 06, 2013 at 01:50:31PM +0200, Michal Hocko wrote: > > > Also, do we need to hold a lock? It doesn't have to be completely > > > strict, so we might as well get away with something like, > > > > > > for_each_cached_pos() { > > > if (hint == me) { > > > /* simple clearing implementation, we prolly wanna push it forward */ > > > cached = xchg(hint, NULL); > > > if (cached) > > > css_put(cached); > > > } > > > } > > > > This would be racy: > > mem_cgroup_iter > > rcu_read_lock > > __mem_cgroup_iter_next cgroup_destroy_locked > > css_tryget(memcg) > > atomic_add(CSS_DEACT_BIAS) > > offline_css(memcg) > > xchg(memcg, NULL) > > mem_cgroup_iter_update > > iter->last_visited = memcg > > rcy_read_unlock > > > > But if it was called from call_rcu the we should be safe AFAICS. > > Oh yeah, it is racy. That's what I meant by "not having to be > completely strict". The race window is small enough and it's not like > we're messing up refcnt or may end up with use-after-free. But it would potentially pin (aka leak) the memcg for ever. > Doing it from RCU would make the race go away but I'm not sure whether > the extra RCU bouncing is worthwhile. I don't know. Maybe. > > Thanks. > > -- > tejun -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-07 7:37 ` Michal Hocko @ 2013-06-07 23:25 ` Tejun Heo 2013-06-10 8:02 ` Michal Hocko 0 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-07 23:25 UTC (permalink / raw) To: Michal Hocko; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan Hello, Michal. On Fri, Jun 07, 2013 at 09:37:54AM +0200, Michal Hocko wrote: > > Oh yeah, it is racy. That's what I meant by "not having to be > > completely strict". The race window is small enough and it's not like > > we're messing up refcnt or may end up with use-after-free. > > But it would potentially pin (aka leak) the memcg for ever. It wouldn't be anything systematic tho - the race condition's likelihood is low and increases with the frequency of reclaim iteration, which at the same time means that it's likely to remedy itself pretty soon. I'm doubtful it'd matter. If it's still bothering, we sure can do it from RCU callback. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-07 23:25 ` Tejun Heo @ 2013-06-10 8:02 ` Michal Hocko 2013-06-10 19:54 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Michal Hocko @ 2013-06-10 8:02 UTC (permalink / raw) To: Tejun Heo; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan On Fri 07-06-13 16:25:57, Tejun Heo wrote: > Hello, Michal. > > On Fri, Jun 07, 2013 at 09:37:54AM +0200, Michal Hocko wrote: > > > Oh yeah, it is racy. That's what I meant by "not having to be > > > completely strict". The race window is small enough and it's not like > > > we're messing up refcnt or may end up with use-after-free. > > > > But it would potentially pin (aka leak) the memcg for ever. > > It wouldn't be anything systemetic tho - race condition's likliness is > low and increases with the frequency of reclaim iteration, which at > the same time means that it's likely to remedy itself pretty soon. Sure a next visit on the same root subtree (same node, zone and prio) would css_put it but what if that root goes away itself. Still fixable, if every group checks its own cached iters and css_put everybody but that is even uglier. So doing the up-the-hierarchy cleanup in RCU callback is much easier. > I'm doubtful it'd matter. If it's still bothering, we sure can do it > from RCU callback. Yes, I would definitely prefer correctness over likeliness here. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
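What the call_rcu() deferral could look like - again only a sketch with made-up names; it assumes the rcu_head and the memcg itself stay valid until the callback runs (offline and release are separated by a grace period) and that css_put() is safe from callback context:

static void mem_cgroup_reclaim_iter_reap(struct rcu_head *head)
{
	struct mem_cgroup *dead = container_of(head, struct mem_cgroup, iter_reap_rcu);

	/*
	 * A grace period has passed, so no reclaimer can still be inside
	 * the rcu_read_lock() section that publishes @dead into
	 * iter->last_visited - the race traced above is closed.
	 */
	mem_cgroup_clear_reclaim_caches(dead);	/* ancestor walk doing xchg() + css_put() */
}

static void mem_cgroup_css_offline(struct cgroup *cont)
{
	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);

	/* ... existing offlining work ... */
	call_rcu(&memcg->iter_reap_rcu, mem_cgroup_reclaim_iter_reap);
}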
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-10 8:02 ` Michal Hocko @ 2013-06-10 19:54 ` Tejun Heo 2013-06-10 20:48 ` Michal Hocko 0 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-10 19:54 UTC (permalink / raw) To: Michal Hocko; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan Hello, Michal. On Mon, Jun 10, 2013 at 10:02:08AM +0200, Michal Hocko wrote: > Sure a next visit on the same root subtree (same node, zone and prio) > would css_put it but what if that root goes away itself. Still fixable, > if every group checks its own cached iters and css_put everybody but > that is even uglier. So doing the up-the-hierarchy cleanup in RCU > callback is much easier. Ooh, right, we don't need cleanup of the cached cursors on destruction if we get this correct - especially if we make cursors point to the next cgroup to visit as self is always the first one to visit. Yeah, if we can do away with that, doing that way is definitely better. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-10 19:54 ` Tejun Heo @ 2013-06-10 20:48 ` Michal Hocko 2013-06-10 23:13 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Michal Hocko @ 2013-06-10 20:48 UTC (permalink / raw) To: Tejun Heo; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan On Mon 10-06-13 12:54:26, Tejun Heo wrote: > Hello, Michal. > > On Mon, Jun 10, 2013 at 10:02:08AM +0200, Michal Hocko wrote: > > Sure a next visit on the same root subtree (same node, zone and prio) > > would css_put it but what if that root goes away itself. Still fixable, > > if every group checks its own cached iters and css_put everybody but > > that is even uglier. So doing the up-the-hierarchy cleanup in RCU > > callback is much easier. > > Ooh, right, we don't need cleanup of the cached cursors on destruction > if we get this correct - especially if we make cursors point to the > next cgroup to visit as self is always the first one to visit. You would need to pin the next-to-visit memcg as well, so you need a cleanup on the removal. > Yeah, if we can do away with that, doing that way is definitely > better. The only advantage I can see from next-to-visit caching is that the destruction path can reuse __mem_cgroup_iter_next unlike last_visited which would need to develop a code to get the previous member. Maybe it is worth a try. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-10 20:48 ` Michal Hocko @ 2013-06-10 23:13 ` Tejun Heo 2013-06-11 7:27 ` Michal Hocko 0 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-10 23:13 UTC (permalink / raw) To: Michal Hocko; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan Hey, On Mon, Jun 10, 2013 at 10:48:01PM +0200, Michal Hocko wrote: > > Ooh, right, we don't need cleanup of the cached cursors on destruction > > if we get this correct - especially if we make cursors point to the > > next cgroup to visit as self is always the first one to visit. > > You would need to pin the next-to-visit memcg as well, so you need a > cleanup on the removal. But that'd be one of the descendants of the said cgroup and there can be no descendant left when the cgroup is being removed. What am I missing? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-10 23:13 ` Tejun Heo @ 2013-06-11 7:27 ` Michal Hocko 2013-06-11 7:44 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Michal Hocko @ 2013-06-11 7:27 UTC (permalink / raw) To: Tejun Heo; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan On Mon 10-06-13 16:13:58, Tejun Heo wrote: > Hey, > > On Mon, Jun 10, 2013 at 10:48:01PM +0200, Michal Hocko wrote: > > > Ooh, right, we don't need cleanup of the cached cursors on destruction > > > if we get this correct - especially if we make cursors point to the > > > next cgroup to visit as self is always the first one to visit. > > > > You would need to pin the next-to-visit memcg as well, so you need a > > cleanup on the removal. > > But that'd be one of the descendants of the said cgroup and there can > be no descendant left when the cgroup is being removed. What am I > missing?
        .
        .
        .
        A (cached=E)
       /|\____________
      / | \
     B  D (cached=E)   F<
       / | \
      C<  E  G
          ^
          removed
* D level cache - nobody left for either approach
* A level is
  - F for next-to-visit
  - C for last_visited
You have to go up the hierarchy and handle the root cgroup as a special case for !root->use_hierarchy. Once you have a non-NULL new cache it can be propagated without a new search (which I hadn't realized when working on this approach the last time - not that it would save some code in the end). Makes sense? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-11 7:27 ` Michal Hocko @ 2013-06-11 7:44 ` Tejun Heo 2013-06-11 7:55 ` Michal Hocko 0 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-11 7:44 UTC (permalink / raw) To: Michal Hocko; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan On Tue, Jun 11, 2013 at 09:27:43AM +0200, Michal Hocko wrote: > . > . > . > A (cached=E) > /|\____________ > / | \ > B D (cached=E) F< > / | \ > C< E G > ^ > removed > > * D level cache - nobody left for either approach approach > * A level is > - F for next-to-visit > - C for last_visited > > You have to get up the hierarchy and handle root cgroup as a special > case for !root->use_hierarchy. Once you have non-NULL new cache the it > can be propagated without a new search (which I haven't realized when > working on this approach the last time - not that it would safe some > code in the end). > > Makes sense? I don't think we're talking about the same thing. I wasn't talking about skipping walking up the hierarchy (differently depending on use_hierarchy of course) when E is removed. I was talking about skipped cleaning E's cache when removing E as it's guaranteed to be empty by then. The difference between caching the last and next one is that if we put the last one in the cache, E's cache could be pointing to itself and needs to be scanned. Not a big difference either way but if you combine that with the need for special rewinding which will basically come down to traversing the sibling list again, pointing to the next entry is just easier. Anyways, I think we're getting too deep into details but one more thing, what do you mean by "non-NULL new cache"? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-11 7:44 ` Tejun Heo @ 2013-06-11 7:55 ` Michal Hocko 2013-06-11 8:00 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Michal Hocko @ 2013-06-11 7:55 UTC (permalink / raw) To: Tejun Heo; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan On Tue 11-06-13 00:44:04, Tejun Heo wrote: > On Tue, Jun 11, 2013 at 09:27:43AM +0200, Michal Hocko wrote: > > . > > . > > . > > A (cached=E) > > /|\____________ > > / | \ > > B D (cached=E) F< > > / | \ > > C< E G > > ^ > > removed > > > > * D level cache - nobody left for either approach approach > > * A level is > > - F for next-to-visit > > - C for last_visited > > > > You have to get up the hierarchy and handle root cgroup as a special > > case for !root->use_hierarchy. Once you have non-NULL new cache the it > > can be propagated without a new search (which I haven't realized when > > working on this approach the last time - not that it would safe some > > code in the end). > > > > Makes sense? > > I don't think we're talking about the same thing. I wasn't talking > about skipping walking up the hierarchy (differently depending on > use_hierarchy of course) when E is removed. I was talking about > skipped cleaning E's cache when removing E as it's guaranteed to be > empty by then. Ahh, sorry I have misread your reply then. You are right that caching the next-to-visit would keep its own cache clean. > The difference between caching the last and next one is that if we put > the last one in the cache, E's cache could be pointing to itself and > needs to be scanned. Right. > Not a big difference either way but if you combine that with the need > for special rewinding which will basically come down to traversing the > sibling list again, pointing to the next entry is just easier. > > Anyways, I think we're getting too deep into details but one more > thing, what do you mean by "non-NULL new cache"? If you replace cached memcg by a new (non-NULL) one then all the parents up the hierarchy can reuse the same replacement and do not have to search again. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-11 7:55 ` Michal Hocko @ 2013-06-11 8:00 ` Tejun Heo 0 siblings, 0 replies; 42+ messages in thread From: Tejun Heo @ 2013-06-11 8:00 UTC (permalink / raw) To: Michal Hocko; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan Hello, On Tue, Jun 11, 2013 at 09:55:40AM +0200, Michal Hocko wrote: > > Anyways, I think we're getting too deep into details but one more > > thing, what do you mean by "non-NULL new cache"? > > If you replace cached memcg by a new (non-NULL) one then all the parents > up the hierarchy can reuse the same replacement and do not have to > search again. As finding the next one to visit is pretty cheap, it isn't likely to be a big difference but yeah we can definitely re-use the first non-NULL next for all further ancestors. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
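Folding the reuse agreed on above into the destruction-time walk could look roughly like this - a refinement of the earlier sketch, with for_each_reclaim_iter() and next_to_visit still made up and reference handling omitted:

static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *dead)
{
	struct mem_cgroup *parent = dead;
	struct mem_cgroup *repl = NULL;
	struct mem_cgroup_reclaim_iter *iter;

	while ((parent = parent_mem_cgroup(parent))) {
		/*
		 * A pre-order successor found against a lower root is also
		 * the successor against every higher root, so the search
		 * only needs repeating until it comes back non-NULL.
		 */
		if (!repl)
			repl = __mem_cgroup_iter_next(parent, dead);

		for_each_reclaim_iter(parent, iter)	/* made-up helper */
			if (iter->next_to_visit == dead)
				iter->next_to_visit = repl;	/* NULL simply restarts the walk */
	}
}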
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-04 20:50 ` Tejun Heo 2013-06-04 21:28 ` Michal Hocko @ 2013-06-04 21:40 ` Johannes Weiner 2013-06-04 21:49 ` Tejun Heo 1 sibling, 1 reply; 42+ messages in thread From: Johannes Weiner @ 2013-06-04 21:40 UTC (permalink / raw) To: Tejun Heo; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan On Tue, Jun 04, 2013 at 01:50:25PM -0700, Tejun Heo wrote: > Hello, Michal. > > On Tue, Jun 04, 2013 at 03:18:43PM +0200, Michal Hocko wrote: > > > + if (memcg) > > > + css_get(&memcg->css); > > > > This is all good and nice but it re-introduces the same problem which > > has been fixed by (5f578161: memcg: relax memcg iter caching). You are > > pinning memcg in memory for unbounded amount of time because css > > reference will not let object to leave and rest. > > I don't get why that is a problem. Can you please elaborate? css's > now explicitly allow holding onto them. We now have clear separation > of "destruction" and "release" and blkcg also depends on it. If memcg > still doesn't distinguish the two properly, that's where the problem > should be fixed. > > > I understand your frustration about the complexity of the current > > synchronization but we didn't come up with anything easier. > > Originally I though that your tree walk updates which allow dropping rcu > > would help here but then I realized that not really because the iterator > > (resp. pos) has to be a valid pointer and there is only one possibility > > to do that AFAICS here and that is css pinning. And is no-go. > > I find the above really weird. If css can't be pinned for position > caching, isn't it natural to ask why it can't be and then fix it? > Because that's what the whole refcnt thing is about and a usage which > cgroup explicitly allows (e.g. blkcg also does it). Why do you go > from there to "this batshit crazy barrier dancing is the only > solution"? > > Can you please explain why memcg css's can't be pinned? We might pin them indefinitely. In a hierarchy with hundreds of groups that is short by 10M of memory, we only reclaim from a couple of groups before we stop and leave the iterator pointing somewhere in the hierarchy. Until the next reclaimer comes along, which might be a split second later or three days later. There is a reclaim iterator for every memcg (since every memcg represents a hierarchy), so we could pin a lot of csss for an indefinite amount of time. If you say that the delta between destruction and release is small enough, I'd be happy to get rid of the weak referencing. We had weak referencing with css_id before and didn't want to lose predictability and efficiency of our resource usage when switching away from it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-04 21:40 ` Johannes Weiner @ 2013-06-04 21:49 ` Tejun Heo 0 siblings, 0 replies; 42+ messages in thread From: Tejun Heo @ 2013-06-04 21:49 UTC (permalink / raw) To: Johannes Weiner; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan Hello, Johannes. On Tue, Jun 04, 2013 at 05:40:50PM -0400, Johannes Weiner wrote: > We might pin them indefinitely. In a hierarchy with hundreds of > groups that is short by 10M of memory, we only reclaim from a couple > of groups before we stop and leave the iterator pointing somewhere in > the hierarchy. Until the next reclaimer comes along, which might be a > split second later or three days later. > > There is a reclaim iterator for every memcg (since every memcg > represents a hierarchy), so we could pin a lot of csss for an > indefinite amount of time. As long as it's bound by the actual number of memcgs in the system and dead cgroups don't pin any other resources, I don't think pinning css and thus memcg struct itself is something we need to worry about. Especially not at the cost of this weak referencing thing. If the large number of unused but pinned css's actually is a problem (which I seriously doubt), we can implement a trivial timer based cache expiration which can be extremely coarse - ie. each iterator just keeps the last time stamp it was used and cleanup runs every ten mins or whatever. It'll be like twenty lines of completely obvious code. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
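The "twenty lines of completely obvious code" could look something like this - entirely hypothetical: the last_used timestamp, the helper macro and the interval are invented, and it assumes the long-term pinning from patch 3/3 is in place:

#define MEMCG_ITER_EXPIRE	(600 * HZ)	/* "every ten mins or whatever" */

static void memcg_iter_reap_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(memcg_iter_reap_work, memcg_iter_reap_fn);

static void memcg_iter_reap_fn(struct work_struct *work)
{
	struct mem_cgroup *memcg;

	/* walk every memcg and drop cached positions that went stale */
	for (memcg = mem_cgroup_iter(NULL, NULL, NULL); memcg;
	     memcg = mem_cgroup_iter(NULL, memcg, NULL)) {
		struct mem_cgroup_reclaim_iter *iter;

		for_each_reclaim_iter(memcg, iter) {	/* made-up helper */
			struct mem_cgroup *cached;

			if (time_before(jiffies, iter->last_used + MEMCG_ITER_EXPIRE))
				continue;
			cached = xchg(&iter->last_visited, NULL);
			if (cached)
				css_put(&cached->css);
		}
	}
	schedule_delayed_work(&memcg_iter_reap_work, MEMCG_ITER_EXPIRE);
}

Like the lockless clearing discussed elsewhere in the thread, this is racy against a concurrent iterator update; it only bounds how long a dead css can stay pinned rather than guaranteeing prompt release.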