* [PATCHSET] memcg: fix and reimplement iterator
  From: Tejun Heo @ 2013-06-04  0:44 UTC
  To: mhocko, hannes, bsingharora; +Cc: cgroups, linux-mm, lizefan

mem_cgroup_iter() wraps around cgroup_next_descendant_pre() to provide a
pre-order walk of the memcg hierarchy.  In addition to the normal walk, it
also implements shared iteration keyed by zone, node and priority so that
multiple reclaimers don't end up hitting the same nodes.  Reclaimers
working on the same zone, node and priority push the same iterator
forward.

Unfortunately, the way this is implemented is disturbingly complicated.
It ends up implementing a fairly unique synchronization construct inside
memcg, which is never a good sign for any subsystem.  While the
implemented synchronization is overly elaborate and fragile, the intention
behind it is understandable: previously, the cgroup iterator required each
iteration to be contained inside a single RCU read-side critical section,
which ruled out a saner mechanism.  To work around the limitation, memcg
developed this Rube Goldberg machine to detect whether the cached last
cgroup is still alive, which of course was ever so subtly broken.

Now that cgroup iterations can survive being dropped out of an RCU
read-side critical section, this can be made a lot simpler.  This patchset
contains the following three patches.

 0001-memcg-fix-subtle-memory-barrier-bug-in-mem_cgroup_it.patch
 0002-memcg-restructure-mem_cgroup_iter.patch
 0003-memcg-simplify-mem_cgroup_reclaim_iter.patch

0001 is a fix for a subtle memory barrier bug in the current
implementation.  It should be applied to for-3.10-fixes and backported
through -stable.  In general, memory barriers are a bad idea.  Please
don't use them unless utterly necessary, and, if you do, please add ample
documentation explaining how they're paired and what they're achieving.
Documenting is often extremely helpful for the implementor too, because
one ends up looking at and thinking about things a lot more carefully.

0002 restructures mem_cgroup_iter() so that it's easier to follow and
change.

0003 reimplements the iterator sharing so that it's simpler and more
conventional.  It depends on the new cgroup iterator updates.

This patchset is on top of cgroup/for-3.11[1], which contains the iterator
updates this patchset depends upon, and is available in the following git
branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-memcg-simpler-iter

Lightly tested.  Proceed with caution.

 mm/memcontrol.c | 134 ++++++++++++++++++++++----------------------------------
 1 file changed, 54 insertions(+), 80 deletions(-)

Thanks.

-- 
tejun

[1] git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-3.11
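The pairing the cover letter asks to be documented is the generic
store/load pattern below -- a sketch for illustration only, not code from
this series; the variable names are made up:

	/* writer */
	shared_data = val;
	smp_wmb();	/* order the data before the flag; pairs with the
			 * smp_rmb() in the reader */
	data_ready = 1;

	/* reader */
	if (data_ready) {
		smp_rmb();	/* pairs with the smp_wmb() in the writer */
		val = shared_data;	/* guaranteed to see the new value */
	}

Each barrier is useless without its counterpart, which is why the
changelogs below insist on documenting them as a pair.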
* [PATCH 1/3] memcg: fix subtle memory barrier bug in mem_cgroup_iter()
  From: Tejun Heo @ 2013-06-04  0:44 UTC
  To: mhocko, hannes, bsingharora; +Cc: cgroups, linux-mm, lizefan, Tejun Heo

mem_cgroup_iter() plays a rather delicate game to allow sharing reclaim
iteration across multiple reclaimers.  It uses reclaim_iter->last_visited
and ->dead_count to remember the last visited cgroup and to verify whether
the cgroup is still safe to access.  For the mechanism to work properly,
updates to ->last_visited must be visible before ->dead_count; otherwise,
a stale ->last_visited may be considered to be associated with a more
recent ->dead_count and thus escape dead-condition detection, which may
lead to a use-after-free.

The function has smp_rmb() where the dead condition is checked and
smp_wmb() where the two variables are updated, but the smp_rmb() isn't
between the dereferences of the two variables, making the whole thing
pointless.  It's right after atomic_read(&root->dead_count), whose only
requirement is to belong to the same RCU critical section.

This patch moves the smp_rmb() between the ->last_visited and
->last_dead_count dereferences and adds a comment explaining how the
barriers are paired and for what.

Let's please not add memory barriers without explicitly explaining the
pairing and what they are achieving.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 mm/memcontrol.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cb1c9de..cb2f91c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1218,9 +1218,18 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			 * is alive.
 			 */
 			dead_count = atomic_read(&root->dead_count);
-			smp_rmb();
+
 			last_visited = iter->last_visited;
 			if (last_visited) {
+				/*
+				 * Paired with smp_wmb() below in this
+				 * function.  The pair guarantee that
+				 * last_visited is more current than
+				 * last_dead_count, which may lead to
+				 * spurious iteration resets but guarantees
+				 * reliable detection of dead condition.
+				 */
+				smp_rmb();
 				if ((dead_count != iter->last_dead_count) ||
 					!css_tryget(&last_visited->css)) {
 					last_visited = NULL;
@@ -1235,6 +1244,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 				css_put(&last_visited->css);
 
 			iter->last_visited = memcg;
+			/* paired with smp_rmb() above in this function */
 			smp_wmb();
 			iter->last_dead_count = dead_count;
-- 
1.8.2.1
* Re: [PATCH 1/3] memcg: fix subtle memory barrier bug in mem_cgroup_iter()
  From: Michal Hocko @ 2013-06-04 13:03 UTC
  To: Tejun Heo; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

On Mon 03-06-13 17:44:37, Tejun Heo wrote:
[...]
> @@ -1218,9 +1218,18 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  			 * is alive.
>  			 */
>  			dead_count = atomic_read(&root->dead_count);
> -			smp_rmb();
> +
>  			last_visited = iter->last_visited;
>  			if (last_visited) {
> +				/*
> +				 * Paired with smp_wmb() below in this
> +				 * function.  The pair guarantee that
> +				 * last_visited is more current than
> +				 * last_dead_count, which may lead to
> +				 * spurious iteration resets but guarantees
> +				 * reliable detection of dead condition.
> +				 */
> +				smp_rmb();
>  				if ((dead_count != iter->last_dead_count) ||
>  					!css_tryget(&last_visited->css)) {
>  					last_visited = NULL;

I originally had the barrier this way but Johannes pointed out it is not
correct: https://lkml.org/lkml/2013/2/11/411
"
!> +		/*
!> +		 * last_visited might be invalid if some of the group
!> +		 * downwards was removed. As we do not know which one
!> +		 * disappeared we have to start all over again from the
!> +		 * root.
!> +		 * css ref count then makes sure that css won't
!> +		 * disappear while we iterate to the next memcg
!> +		 */
!> +		last_visited = iter->last_visited;
!> +		dead_count = atomic_read(&root->dead_count);
!> +		smp_rmb();
!
!Confused about this barrier, see below.
!
!As per above, if you remove the iter lock, those lines are mixed up.
!You need to read the dead count first because the writer updates the
!dead count after it sets the new position. That way, if the dead
!count gives the go-ahead, you KNOW that the position cache is valid,
!because it has been updated first. If either the two reads or the two
!writes get reordered, you risk seeing a matching dead count while the
!position cache is stale.
"

I think that explanation makes sense but I will leave further explanation
to Mr "I do not like mutual exclusion" :P
(https://lkml.org/lkml/2013/2/11/501 "My bumper sticker reads "I don't
believe in mutual exclusion" (the kernel hacker's version of smile for the
red light camera)")

> @@ -1235,6 +1244,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  				css_put(&last_visited->css);
>  
>  			iter->last_visited = memcg;
> +			/* paired with smp_rmb() above in this function */
>  			smp_wmb();
>  			iter->last_dead_count = dead_count;
> 
> -- 
> 1.8.2.1
> 
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH 1/3] memcg: fix subtle memory barrier bug in mem_cgroup_iter()
  From: Johannes Weiner @ 2013-06-04 13:58 UTC
  To: Michal Hocko; +Cc: Tejun Heo, bsingharora, cgroups, linux-mm, lizefan

On Tue, Jun 04, 2013 at 03:03:36PM +0200, Michal Hocko wrote:
> On Mon 03-06-13 17:44:37, Tejun Heo wrote:
> [...]
> > @@ -1218,9 +1218,18 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> >  			 * is alive.
> >  			 */
> >  			dead_count = atomic_read(&root->dead_count);
> > -			smp_rmb();
> > +
> >  			last_visited = iter->last_visited;
> >  			if (last_visited) {
> > +				/*
> > +				 * Paired with smp_wmb() below in this
> > +				 * function.  The pair guarantee that
> > +				 * last_visited is more current than
> > +				 * last_dead_count, which may lead to
> > +				 * spurious iteration resets but guarantees
> > +				 * reliable detection of dead condition.
> > +				 */
> > +				smp_rmb();
> >  				if ((dead_count != iter->last_dead_count) ||
> >  					!css_tryget(&last_visited->css)) {
> >  					last_visited = NULL;
> 
> I originally had the barrier this way but Johannes pointed out it is not
> correct: https://lkml.org/lkml/2013/2/11/411
> "
> !> +		/*
> !> +		 * last_visited might be invalid if some of the group
> !> +		 * downwards was removed. As we do not know which one
> !> +		 * disappeared we have to start all over again from the
> !> +		 * root.
> !> +		 * css ref count then makes sure that css won't
> !> +		 * disappear while we iterate to the next memcg
> !> +		 */
> !> +		last_visited = iter->last_visited;
> !> +		dead_count = atomic_read(&root->dead_count);
> !> +		smp_rmb();
> !
> !Confused about this barrier, see below.
> !
> !As per above, if you remove the iter lock, those lines are mixed up.
> !You need to read the dead count first because the writer updates the
> !dead count after it sets the new position. That way, if the dead
> !count gives the go-ahead, you KNOW that the position cache is valid,
> !because it has been updated first. If either the two reads or the two
> !writes get reordered, you risk seeing a matching dead count while the
> !position cache is stale.
> "

The original prototype code I sent looked like this:

	mem_cgroup_iter:
	  rcu_read_lock()
	  if atomic_read(&root->dead_count) == iter->dead_count:
	    smp_rmb()
	    if tryget(iter->position):
	      position = iter->position
	  memcg = find_next(position)
	  css_put(position)
	  iter->position = memcg
	  smp_wmb() /* Write position cache BEFORE marking it uptodate */
	  iter->dead_count = atomic_read(&root->dead_count)
	  rcu_read_unlock()

iter->last_position is written, THEN iter->last_dead_count is written.

So, yes, you "need to read the dead count" first to be sure
iter->last_position is up to date.  But iter->last_dead_count, not
root->dead_count.  I should have caught this in the final submission of
your patch :(

Tejun's patch is not correct, either.  Something like this?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 010d6c1..92830fa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1199,7 +1199,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 
 			mz = mem_cgroup_zoneinfo(root, nid, zid);
 			iter = &mz->reclaim_iter[reclaim->priority];
-			last_visited = iter->last_visited;
 			if (prev && reclaim->generation != iter->generation) {
 				iter->last_visited = NULL;
 				goto out_unlock;
@@ -1217,14 +1216,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			 * css_tryget() verifies the cgroup pointed to
 			 * is alive.
 			 */
+			last_visited = NULL;
 			dead_count = atomic_read(&root->dead_count);
-			smp_rmb();
-			last_visited = iter->last_visited;
-			if (last_visited) {
-				if ((dead_count != iter->last_dead_count) ||
-				    !css_tryget(&last_visited->css)) {
+			if (dead_count == iter->last_dead_count) {
+				/*
+				 * The writer below sets the position
+				 * pointer, then the dead count.
+				 * Ensure we read the updated position
+				 * when the dead count matches.
+				 */
+				smp_rmb();
+				last_visited = iter->last_visited;
+				if (last_visited &&
+				    !css_tryget(&last_visited->css))
					last_visited = NULL;
-				}
 			}
 		}
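Put side by side, the protocol Johannes is describing boils down to the
following (a simplified sketch of the ordering only; the locking, tryget
and generation handling are omitted):

	/* writer (publishing a new cached position) */
	iter->last_visited = memcg;
	smp_wmb();			/* position before validity stamp */
	iter->last_dead_count = dead_count;

	/* reader (validating the cached position) */
	dead_count = atomic_read(&root->dead_count);
	if (dead_count == iter->last_dead_count) {	/* stamp read first... */
		smp_rmb();
		last_visited = iter->last_visited;	/* ...then the position */
	}

The reader must order the iter->last_dead_count load before the
iter->last_visited load -- the reverse of what the original code and the
first version of patch 1 did.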
* Re: [PATCH 1/3] memcg: fix subtle memory barrier bug in mem_cgroup_iter()
  From: Michal Hocko @ 2013-06-04 15:29 UTC
  To: Johannes Weiner; +Cc: Tejun Heo, bsingharora, cgroups, linux-mm, lizefan

On Tue 04-06-13 09:58:40, Johannes Weiner wrote:
> On Tue, Jun 04, 2013 at 03:03:36PM +0200, Michal Hocko wrote:
> > On Mon 03-06-13 17:44:37, Tejun Heo wrote:
> > [...]
> > > @@ -1218,9 +1218,18 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > >  			 * is alive.
> > >  			 */
> > >  			dead_count = atomic_read(&root->dead_count);
> > > -			smp_rmb();
> > > +
> > >  			last_visited = iter->last_visited;
> > >  			if (last_visited) {
> > > +				/*
> > > +				 * Paired with smp_wmb() below in this
> > > +				 * function.  The pair guarantee that
> > > +				 * last_visited is more current than
> > > +				 * last_dead_count, which may lead to
> > > +				 * spurious iteration resets but guarantees
> > > +				 * reliable detection of dead condition.
> > > +				 */
> > > +				smp_rmb();
> > >  				if ((dead_count != iter->last_dead_count) ||
> > >  					!css_tryget(&last_visited->css)) {
> > >  					last_visited = NULL;
> > 
> > I originally had the barrier this way but Johannes pointed out it is not
> > correct: https://lkml.org/lkml/2013/2/11/411
> > "
> > !> +		/*
> > !> +		 * last_visited might be invalid if some of the group
> > !> +		 * downwards was removed. As we do not know which one
> > !> +		 * disappeared we have to start all over again from the
> > !> +		 * root.
> > !> +		 * css ref count then makes sure that css won't
> > !> +		 * disappear while we iterate to the next memcg
> > !> +		 */
> > !> +		last_visited = iter->last_visited;
> > !> +		dead_count = atomic_read(&root->dead_count);
> > !> +		smp_rmb();
> > !
> > !Confused about this barrier, see below.
> > !
> > !As per above, if you remove the iter lock, those lines are mixed up.
> > !You need to read the dead count first because the writer updates the
> > !dead count after it sets the new position. That way, if the dead
> > !count gives the go-ahead, you KNOW that the position cache is valid,
> > !because it has been updated first. If either the two reads or the two
> > !writes get reordered, you risk seeing a matching dead count while the
> > !position cache is stale.
> > "
> 
> The original prototype code I sent looked like this:
> 
> 	mem_cgroup_iter:
> 	  rcu_read_lock()
> 	  if atomic_read(&root->dead_count) == iter->dead_count:
> 	    smp_rmb()
> 	    if tryget(iter->position):
> 	      position = iter->position
> 	  memcg = find_next(position)
> 	  css_put(position)
> 	  iter->position = memcg
> 	  smp_wmb() /* Write position cache BEFORE marking it uptodate */
> 	  iter->dead_count = atomic_read(&root->dead_count)
> 	  rcu_read_unlock()
> 
> iter->last_position is written, THEN iter->last_dead_count is written.
> 
> So, yes, you "need to read the dead count" first to be sure
> iter->last_position is up to date.  But iter->last_dead_count, not
> root->dead_count.  I should have caught this in the final submission
> of your patch :(

OK, right you are.  I managed to confuse myself by the three dependencies
here: dead_count -> last_visited -> last_dead_count.  The first one is
invalid because last_visited doesn't care about dead_count, and that makes
it much clearer now.

> Tejun's patch is not correct, either.  Something like this?

Yes, this looks saner and correct.  Care to send a full patch?

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 010d6c1..92830fa 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1199,7 +1199,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  
>  			mz = mem_cgroup_zoneinfo(root, nid, zid);
>  			iter = &mz->reclaim_iter[reclaim->priority];
> -			last_visited = iter->last_visited;
>  			if (prev && reclaim->generation != iter->generation) {
>  				iter->last_visited = NULL;
>  				goto out_unlock;
> @@ -1217,14 +1216,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  			 * css_tryget() verifies the cgroup pointed to
>  			 * is alive.
>  			 */
> +			last_visited = NULL;
>  			dead_count = atomic_read(&root->dead_count);
> -			smp_rmb();
> -			last_visited = iter->last_visited;
> -			if (last_visited) {
> -				if ((dead_count != iter->last_dead_count) ||
> -				    !css_tryget(&last_visited->css)) {
> +			if (dead_count == iter->last_dead_count) {
> +				/*
> +				 * The writer below sets the position
> +				 * pointer, then the dead count.
> +				 * Ensure we read the updated position
> +				 * when the dead count matches.
> +				 */
> +				smp_rmb();
> +				last_visited = iter->last_visited;
> +				if (last_visited &&
> +				    !css_tryget(&last_visited->css))
> 					last_visited = NULL;
> -				}
>  			}
>  		}

-- 
Michal Hocko
SUSE Labs
* [PATCH 2/3] memcg: restructure mem_cgroup_iter()
  From: Tejun Heo @ 2013-06-04  0:44 UTC
  To: mhocko, hannes, bsingharora; +Cc: cgroups, linux-mm, lizefan, Tejun Heo

mem_cgroup_iter() implements two iteration modes - plain and reclaim.
The former is a normal pre-order tree walk.  The latter tries to share
an iteration cursor per zone and priority pair among multiple reclaimers
so that they all contribute to scanning forward rather than banging on
the same cgroups simultaneously.

Implementing the two in the same function allows them to share code
paths, which is fine, but the current structure is unnecessarily
convoluted, with conditionals on @reclaim spread across the function
rather obscurely and with a somewhat strange control flow which checks
for conditions that can't occur and has duplicate tests for the same
conditions in different forms.

This patch restructures the function such that there's a single test on
@reclaim and the !reclaim path is contained in its own block, which
simplifies both the !reclaim and reclaim paths.  Also, the control flow
in the reclaim path is restructured and commented so that it's easier to
follow what's going on and why.

Note that after the patch reclaim->generation is synchronized to the
iter's on success whether @prev was specified or not.  This doesn't
cause any functional differences as the two generation numbers are
guaranteed to be the same at that point if @prev is specified, and it
makes the code slightly easier to follow.

This patch is pure restructuring and shouldn't introduce any functional
differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 mm/memcontrol.c | 131 ++++++++++++++++++++++++++++++--------------------------
 1 file changed, 71 insertions(+), 60 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cb2f91c..99e7357 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1170,8 +1170,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 				   struct mem_cgroup_reclaim_cookie *reclaim)
 {
 	struct mem_cgroup *memcg = NULL;
-	struct mem_cgroup *last_visited = NULL;
-	unsigned long uninitialized_var(dead_count);
+	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_reclaim_iter *iter;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -1179,9 +1179,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	if (!root)
 		root = root_mem_cgroup;
 
-	if (prev && !reclaim)
-		last_visited = prev;
-
 	if (!root->use_hierarchy && root != root_mem_cgroup) {
 		if (prev)
 			goto out_css_put;
@@ -1189,73 +1186,87 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	}
 
 	rcu_read_lock();
-	while (!memcg) {
-		struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
-
-		if (reclaim) {
-			int nid = zone_to_nid(reclaim->zone);
-			int zid = zone_idx(reclaim->zone);
-			struct mem_cgroup_per_zone *mz;
-
-			mz = mem_cgroup_zoneinfo(root, nid, zid);
-			iter = &mz->reclaim_iter[reclaim->priority];
-			last_visited = iter->last_visited;
-			if (prev && reclaim->generation != iter->generation) {
-				iter->last_visited = NULL;
-				goto out_unlock;
-			}
 
+	/* non reclaim case is simple - just iterate from @prev */
+	if (!reclaim) {
+		memcg = __mem_cgroup_iter_next(root, prev);
+		goto out_unlock;
+	}
+
+	/*
+	 * @reclaim specified - find and share the per-zone-priority
+	 * iterator.
+	 */
+	mz = mem_cgroup_zoneinfo(root, zone_to_nid(reclaim->zone),
+				 zone_idx(reclaim->zone));
+	iter = &mz->reclaim_iter[reclaim->priority];
+
+	while (true) {
+		struct mem_cgroup *last_visited;
+		unsigned long dead_count;
+
+		/*
+		 * If this caller already iterated through some and @iter
+		 * wrapped since, finish this the iteration.
+		 */
+		if (prev && reclaim->generation != iter->generation) {
+			iter->last_visited = NULL;
+			break;
+		}
+
+		/*
+		 * If the dead_count mismatches, a destruction has happened
+		 * or is happening concurrently.  If the dead_count
+		 * matches, a destruction might still happen concurrently,
+		 * but since we checked under RCU, that destruction won't
+		 * free the object until we release the RCU reader lock.
+		 * Thus, the dead_count check verifies the pointer is still
+		 * valid, css_tryget() verifies the cgroup pointed to is
+		 * alive.
+		 */
+		dead_count = atomic_read(&root->dead_count);
+
+		last_visited = iter->last_visited;
+		if (last_visited) {
 			/*
-			 * If the dead_count mismatches, a destruction
-			 * has happened or is happening concurrently.
-			 * If the dead_count matches, a destruction
-			 * might still happen concurrently, but since
-			 * we checked under RCU, that destruction
-			 * won't free the object until we release the
-			 * RCU reader lock.  Thus, the dead_count
-			 * check verifies the pointer is still valid,
-			 * css_tryget() verifies the cgroup pointed to
-			 * is alive.
+			 * Paired with smp_wmb() below in this function.
+			 * The pair guarantee that last_visited is more
+			 * current than last_dead_count, which may lead to
+			 * spurious iteration resets but guarantees
+			 * reliable detection of dead condition.
 			 */
-			dead_count = atomic_read(&root->dead_count);
-
-			last_visited = iter->last_visited;
-			if (last_visited) {
-				/*
-				 * Paired with smp_wmb() below in this
-				 * function.  The pair guarantee that
-				 * last_visited is more current than
-				 * last_dead_count, which may lead to
-				 * spurious iteration resets but guarantees
-				 * reliable detection of dead condition.
-				 */
-				smp_rmb();
-				if ((dead_count != iter->last_dead_count) ||
-					!css_tryget(&last_visited->css)) {
-					last_visited = NULL;
-				}
+			smp_rmb();
+			if ((dead_count != iter->last_dead_count) ||
+			    !css_tryget(&last_visited->css)) {
+				last_visited = NULL;
 			}
 		}
 
 		memcg = __mem_cgroup_iter_next(root, last_visited);
 
-		if (reclaim) {
-			if (last_visited)
-				css_put(&last_visited->css);
+		if (last_visited)
+			css_put(&last_visited->css);
 
-			iter->last_visited = memcg;
-			/* paired with smp_rmb() above in this function */
-			smp_wmb();
-			iter->last_dead_count = dead_count;
+		iter->last_visited = memcg;
+		/* paired with smp_rmb() above in this function */
+		smp_wmb();
+		iter->last_dead_count = dead_count;
 
-			if (!memcg)
-				iter->generation++;
-			else if (!prev && memcg)
-				reclaim->generation = iter->generation;
+		/* if successful, sync the generation number and return */
+		if (likely(memcg)) {
+			reclaim->generation = iter->generation;
+			break;
 		}
 
-		if (prev && !memcg)
-			goto out_unlock;
+		/*
+		 * The iterator reached the end.  If this reclaimer already
+		 * visited some cgroups, finish the iteration; otherwise,
+		 * start a new iteration from the beginning.
+		 */
+		if (prev)
+			break;
+
+		iter->generation++;
 	}
 out_unlock:
 	rcu_read_unlock();
-- 
1.8.2.1
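For readers new to this code, this is roughly how the two iteration modes
are driven by callers -- a simplified sketch modelled on the vmscan
shrink_zone() usage of the time; zone, priority and root are assumed to
come from the surrounding reclaim context:

	/* plain mode: a simple pre-order walk over the hierarchy */
	struct mem_cgroup *pos = NULL;

	while ((pos = mem_cgroup_iter(root, pos, NULL))) {
		/* visit pos; the iterator handles the css references */
	}

	/* reclaim mode: reclaimers at the same zone/priority share a cursor */
	struct mem_cgroup_reclaim_cookie reclaim = {
		.zone = zone,
		.priority = priority,
	};
	struct mem_cgroup *memcg = NULL;

	while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))) {
		/* shrink memcg's LRUs for this zone ... */
	}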
* Re: [PATCH 2/3] memcg: restructure mem_cgroup_iter()
  From: Michal Hocko @ 2013-06-04 13:21 UTC
  To: Tejun Heo; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

On Mon 03-06-13 17:44:38, Tejun Heo wrote:
> mem_cgroup_iter() implements two iteration modes - plain and reclaim.
> The former is a normal pre-order tree walk.  The latter tries to share
> an iteration cursor per zone and priority pair among multiple reclaimers
> so that they all contribute to scanning forward rather than banging on
> the same cgroups simultaneously.
> 
> Implementing the two in the same function allows them to share code
> paths, which is fine, but the current structure is unnecessarily
> convoluted, with conditionals on @reclaim spread across the function
> rather obscurely and with a somewhat strange control flow which checks
> for conditions that can't occur and has duplicate tests for the same
> conditions in different forms.
> 
> This patch restructures the function such that there's a single test on
> @reclaim and the !reclaim path is contained in its own block, which
> simplifies both the !reclaim and reclaim paths.  Also, the control flow
> in the reclaim path is restructured and commented so that it's easier to
> follow what's going on and why.
> 
> Note that after the patch reclaim->generation is synchronized to the
> iter's on success whether @prev was specified or not.  This doesn't
> cause any functional differences as the two generation numbers are
> guaranteed to be the same at that point if @prev is specified, and it
> makes the code slightly easier to follow.
> 
> This patch is pure restructuring and shouldn't introduce any functional
> differences.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  mm/memcontrol.c | 131 ++++++++++++++++++++++++++++++--------------------------
>  1 file changed, 71 insertions(+), 60 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cb2f91c..99e7357 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1170,8 +1170,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> 				   struct mem_cgroup_reclaim_cookie *reclaim)
>  {
>  	struct mem_cgroup *memcg = NULL;
> -	struct mem_cgroup *last_visited = NULL;
> -	unsigned long uninitialized_var(dead_count);
> +	struct mem_cgroup_per_zone *mz;
> +	struct mem_cgroup_reclaim_iter *iter;
>  
>  	if (mem_cgroup_disabled())
>  		return NULL;
> @@ -1179,9 +1179,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  	if (!root)
>  		root = root_mem_cgroup;
>  
> -	if (prev && !reclaim)
> -		last_visited = prev;
> -
>  	if (!root->use_hierarchy && root != root_mem_cgroup) {
>  		if (prev)
>  			goto out_css_put;
> @@ -1189,73 +1186,87 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  	}
>  
>  	rcu_read_lock();
> -	while (!memcg) {
> -		struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
> -
> -		if (reclaim) {
> -			int nid = zone_to_nid(reclaim->zone);
> -			int zid = zone_idx(reclaim->zone);
> -			struct mem_cgroup_per_zone *mz;
> -
> -			mz = mem_cgroup_zoneinfo(root, nid, zid);
> -			iter = &mz->reclaim_iter[reclaim->priority];
> -			last_visited = iter->last_visited;
> -			if (prev && reclaim->generation != iter->generation) {
> -				iter->last_visited = NULL;
> -				goto out_unlock;
> -			}
> 
> +	/* non reclaim case is simple - just iterate from @prev */
> +	if (!reclaim) {
> +		memcg = __mem_cgroup_iter_next(root, prev);
> +		goto out_unlock;
> +	}

I do not have objections to pulling the !reclaim case out like this, but
could you base this on top of the patch which adds predicates into the
operators, please?

[...]
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH 2/3] memcg: restructure mem_cgroup_iter()
  From: Tejun Heo @ 2013-06-04 20:51 UTC
  To: Michal Hocko; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

On Tue, Jun 04, 2013 at 03:21:20PM +0200, Michal Hocko wrote:
> > +	/* non reclaim case is simple - just iterate from @prev */
> > +	if (!reclaim) {
> > +		memcg = __mem_cgroup_iter_next(root, prev);
> > +		goto out_unlock;
> > +	}
> 
> I do not have objections to pulling the !reclaim case out like this, but
> could you base this on top of the patch which adds predicates into the
> operators, please?

I don't really mind either way but let's see how the other series goes.

Thanks.

-- 
tejun
* [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Tejun Heo @ 2013-06-04  0:44 UTC
  To: mhocko, hannes, bsingharora; +Cc: cgroups, linux-mm, lizefan, Tejun Heo

mem_cgroup_iter() shares mem_cgroup_reclaim_iters among multiple
reclaimers to prevent multiple reclaimers from banging on the same
cgroups.  To achieve this, mem_cgroup_reclaim_iter remembers the last
visited cgroup.

Before the recent changes, cgroup_next_descendant_pre() required that the
current cgroup is alive or that an RCU grace period hasn't passed since
its removal, as ->sibling.next couldn't be trusted otherwise.  As bumping
a cgroup_subsys_state reference doesn't prevent the cgroup from being
removed, instead of pinning the current cgroup, mem_cgroup_reclaim_iter
tracks the number of cgroup removal events in the subtree and resets the
iteration if any removal has happened since caching the current cgroup.
This involves an overly elaborate and hard-to-follow synchronization
scheme, as it needs to game the cgroup removal RCU grace period.

Now that cgroup_next_descendant_pre() can return the next sibling
reliably regardless of the state of the current cgroup, this can be
implemented in a much simpler and more conventional way.
mem_cgroup_reclaim_iter can pin the current cgroup and use
__mem_cgroup_iter_next() on it for the next iteration.  The whole thing
becomes normal RCU synchronization.  Updating the cursor to the next
position is slightly more involved as multiple tasks could be trying to
update it at the same time; however, it can easily be implemented using
xchg().

This replaces the overly elaborate synchronization scheme, along with the
->dead_count management, with more conventional RCU usage.  As an added
bonus, the new implementation doesn't reset the cursor every time a
cgroup is deleted in the subtree; it safely continues the iteration.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 mm/memcontrol.c | 89 ++++++++++++++------------------------------------------
 1 file changed, 21 insertions(+), 68 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 99e7357..4057730 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -155,12 +155,8 @@ struct mem_cgroup_stat_cpu {
 };
 
 struct mem_cgroup_reclaim_iter {
-	/*
-	 * last scanned hierarchy member. Valid only if last_dead_count
-	 * matches memcg->dead_count of the hierarchy root group.
-	 */
-	struct mem_cgroup *last_visited;
-	unsigned long last_dead_count;
+	/* last scanned hierarchy member, pinned */
+	struct mem_cgroup __rcu *last_visited;
 
 	/* scan generation, increased every round-trip */
 	unsigned int generation;
@@ -1172,6 +1168,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	struct mem_cgroup *memcg = NULL;
 	struct mem_cgroup_per_zone *mz;
 	struct mem_cgroup_reclaim_iter *iter;
+	struct mem_cgroup *last_visited;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -1195,63 +1192,25 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 
 	/*
 	 * @reclaim specified - find and share the per-zone-priority
-	 * iterator.
+	 * iterator.  Because @iter->last_visited holds the reference and
+	 * css's are RCU protected, it's guaranteed that last_visited will
+	 * remain accessible while we're holding RCU read lock.
 	 */
 	mz = mem_cgroup_zoneinfo(root, zone_to_nid(reclaim->zone),
				 zone_idx(reclaim->zone));
 	iter = &mz->reclaim_iter[reclaim->priority];
+	last_visited = rcu_dereference(iter->last_visited);
 
 	while (true) {
-		struct mem_cgroup *last_visited;
-		unsigned long dead_count;
-
 		/*
 		 * If this caller already iterated through some and @iter
 		 * wrapped since, finish this the iteration.
 		 */
-		if (prev && reclaim->generation != iter->generation) {
-			iter->last_visited = NULL;
+		if (prev && reclaim->generation != iter->generation)
 			break;
-		}
-
-		/*
-		 * If the dead_count mismatches, a destruction has happened
-		 * or is happening concurrently.  If the dead_count
-		 * matches, a destruction might still happen concurrently,
-		 * but since we checked under RCU, that destruction won't
-		 * free the object until we release the RCU reader lock.
-		 * Thus, the dead_count check verifies the pointer is still
-		 * valid, css_tryget() verifies the cgroup pointed to is
-		 * alive.
-		 */
-		dead_count = atomic_read(&root->dead_count);
-
-		last_visited = iter->last_visited;
-		if (last_visited) {
-			/*
-			 * Paired with smp_wmb() below in this function.
-			 * The pair guarantee that last_visited is more
-			 * current than last_dead_count, which may lead to
-			 * spurious iteration resets but guarantees
-			 * reliable detection of dead condition.
-			 */
-			smp_rmb();
-			if ((dead_count != iter->last_dead_count) ||
-			    !css_tryget(&last_visited->css)) {
-				last_visited = NULL;
-			}
-		}
 
 		memcg = __mem_cgroup_iter_next(root, last_visited);
 
-		if (last_visited)
-			css_put(&last_visited->css);
-
-		iter->last_visited = memcg;
-		/* paired with smp_rmb() above in this function */
-		smp_wmb();
-		iter->last_dead_count = dead_count;
-
 		/* if successful, sync the generation number and return */
 		if (likely(memcg)) {
 			reclaim->generation = iter->generation;
@@ -1267,7 +1226,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			break;
 
 		iter->generation++;
+		last_visited = NULL;
 	}
+
+	/*
+	 * Update @iter to the new position.  As multiple tasks could be
+	 * executing this path, atomically swap the new and old.  We want
+	 * RCU assignment here but there's no rcu_xchg() and the plain
+	 * xchg() has enough memory barrier semantics.
+	 */
+	if (memcg)
+		css_get(&memcg->css);
+	last_visited = xchg(&iter->last_visited, memcg);
+	if (last_visited)
+		css_put(&last_visited->css);
 out_unlock:
 	rcu_read_unlock();
 out_css_put:
@@ -6324,29 +6296,10 @@ mem_cgroup_css_online(struct cgroup *cont)
 	return error;
 }
 
-/*
- * Announce all parents that a group from their hierarchy is gone.
- */
-static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)
-{
-	struct mem_cgroup *parent = memcg;
-
-	while ((parent = parent_mem_cgroup(parent)))
-		atomic_inc(&parent->dead_count);
-
-	/*
-	 * if the root memcg is not hierarchical we have to check it
-	 * explicitely.
-	 */
-	if (!root_mem_cgroup->use_hierarchy)
-		atomic_inc(&root_mem_cgroup->dead_count);
-}
-
 static void mem_cgroup_css_offline(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 
-	mem_cgroup_invalidate_reclaim_iterators(memcg);
 	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
 }
-- 
1.8.2.1
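To make the concurrent-update argument in the changelog concrete, the
hand-off the patch open-codes can be factored into a helper like the one
below -- an illustrative sketch, not part of the patch itself:

	/*
	 * Publish @new as the shared cursor and drop the pin on whatever
	 * it replaces.  Safe against concurrent publishers because the
	 * xchg() hands each old pointer to exactly one caller, so every
	 * cached cgroup is put exactly once regardless of interleaving.
	 */
	static void reclaim_iter_publish(struct mem_cgroup_reclaim_iter *iter,
					 struct mem_cgroup *new)
	{
		struct mem_cgroup *old;

		if (new)
			css_get(&new->css);		/* pin the new cursor */
		old = xchg(&iter->last_visited, new);	/* full barrier */
		if (old)
			css_put(&old->css);		/* unpin the old one */
	}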
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Michal Hocko @ 2013-06-04 13:18 UTC
  To: Tejun Heo; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

On Mon 03-06-13 17:44:39, Tejun Heo wrote:
[...]
> @@ -1267,7 +1226,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  			break;
>  
>  		iter->generation++;
> +		last_visited = NULL;
>  	}
> +
> +	/*
> +	 * Update @iter to the new position.  As multiple tasks could be
> +	 * executing this path, atomically swap the new and old.  We want
> +	 * RCU assignment here but there's no rcu_xchg() and the plain
> +	 * xchg() has enough memory barrier semantics.
> +	 */
> +	if (memcg)
> +		css_get(&memcg->css);

This is all good and nice but it re-introduces the same problem which has
been fixed by 5f578161 ("memcg: relax memcg iter caching").  You are
pinning a memcg in memory for an unbounded amount of time because the css
reference will not let the object go away and be freed.

I understand your frustration about the complexity of the current
synchronization but we didn't come up with anything easier.  Originally I
thought that your tree walk updates which allow dropping RCU would help
here, but then I realized that not really, because the iterator
(resp. pos) has to be a valid pointer and there is only one possibility
to do that AFAICS here, and that is css pinning.  And that is a no-go.

> +	last_visited = xchg(&iter->last_visited, memcg);
> +	if (last_visited)
> +		css_put(&last_visited->css);
>  out_unlock:
>  	rcu_read_unlock();
>  out_css_put:
[...]
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Tejun Heo @ 2013-06-04 20:50 UTC
  To: Michal Hocko; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

Hello, Michal.

On Tue, Jun 04, 2013 at 03:18:43PM +0200, Michal Hocko wrote:
> > +	if (memcg)
> > +		css_get(&memcg->css);
> 
> This is all good and nice but it re-introduces the same problem which has
> been fixed by 5f578161 ("memcg: relax memcg iter caching").  You are
> pinning a memcg in memory for an unbounded amount of time because the css
> reference will not let the object go away and be freed.

I don't get why that is a problem.  Can you please elaborate?  css's now
explicitly allow holding onto them.  We now have clear separation of
"destruction" and "release" and blkcg also depends on it.  If memcg still
doesn't distinguish the two properly, that's where the problem should be
fixed.

> I understand your frustration about the complexity of the current
> synchronization but we didn't come up with anything easier.  Originally I
> thought that your tree walk updates which allow dropping RCU would help
> here, but then I realized that not really, because the iterator
> (resp. pos) has to be a valid pointer and there is only one possibility
> to do that AFAICS here, and that is css pinning.  And that is a no-go.

I find the above really weird.  If a css can't be pinned for position
caching, isn't it natural to ask why it can't be and then fix that?
Because that's what the whole refcnt thing is about and a usage which
cgroup explicitly allows (e.g. blkcg also does it).  Why do you go from
there to "this batshit crazy barrier dancing is the only solution"?

Can you please explain why memcg css's can't be pinned?

Thanks.

-- 
tejun
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Michal Hocko @ 2013-06-04 21:28 UTC
  To: Tejun Heo; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

On Tue 04-06-13 13:50:25, Tejun Heo wrote:
> Hello, Michal.
> 
> On Tue, Jun 04, 2013 at 03:18:43PM +0200, Michal Hocko wrote:
> > > +	if (memcg)
> > > +		css_get(&memcg->css);
> > 
> > This is all good and nice but it re-introduces the same problem which has
> > been fixed by 5f578161 ("memcg: relax memcg iter caching").  You are
> > pinning a memcg in memory for an unbounded amount of time because the css
> > reference will not let the object go away and be freed.
> 
> I don't get why that is a problem.  Can you please elaborate?  css's now
> explicitly allow holding onto them.  We now have clear separation of
> "destruction" and "release" and blkcg also depends on it.  If memcg still
> doesn't distinguish the two properly, that's where the problem should be
> fixed.
> 
> > I understand your frustration about the complexity of the current
> > synchronization but we didn't come up with anything easier.  Originally I
> > thought that your tree walk updates which allow dropping RCU would help
> > here, but then I realized that not really, because the iterator
> > (resp. pos) has to be a valid pointer and there is only one possibility
> > to do that AFAICS here, and that is css pinning.  And that is a no-go.
> 
> I find the above really weird.  If a css can't be pinned for position
> caching, isn't it natural to ask why it can't be and then fix that?

Well, I do not mind pinning when I know that somebody releases the
reference in a predictable future (ideally almost immediately).  But the
cached iter represents time-unbounded pinning, because nobody can
guarantee that priority 3 at zone Normal at node 3 will ever be scanned
again, and the pointer in the last_visited node will be stuck there for
eternity.  Can we free a memcg with only the css elevated and safely check
that the cached pointer can be used without similar dances to the ones we
have now?  I am open to any suggestions.

> Because that's what the whole refcnt thing is about and a usage which
> cgroup explicitly allows (e.g. blkcg also does it).  Why do you go from
> there to "this batshit crazy barrier dancing is the only solution"?
> 
> Can you please explain why memcg css's can't be pinned?

Because it pins the memcg as well AFAIU, and we just do not want to keep
those around for eternity.

-- 
Michal Hocko
SUSE Labs
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Tejun Heo @ 2013-06-04 21:55 UTC
  To: Michal Hocko; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

Hello, Michal.

On Tue, Jun 04, 2013 at 11:28:08PM +0200, Michal Hocko wrote:
> Well, I do not mind pinning when I know that somebody releases the
> reference in a predictable future (ideally almost immediately).  But the
> cached iter represents time-unbounded pinning, because nobody can
> guarantee that priority 3 at zone Normal at node 3 will ever be scanned
> again, and the pointer in the last_visited node will be stuck there for

I don't really get that.  As long as the amount is bound and the overhead
negligible / acceptable, why does it matter how long the pinning persists?
We aren't talking about something gigantic or something that can leak
continuously.  It will only matter iff cgroups are continuously created
and destroyed and each live memcg will be able to pin one memcg (BTW, I
think I forgot to unpin on memcg destruction).

> eternity.  Can we free a memcg with only the css elevated and safely check
> that the cached pointer can be used without similar dances to the ones we
> have now?  I am open to any suggestions.

I really think this is worrying too much about something which doesn't
really matter and then coming up with an over-engineered solution for the
imagined problem.  This isn't a real problem.  No solution is necessary.

On the off chance that this is a real problem, which I strongly doubt, as
I wrote to Johannes, we can implement an extremely dumb cleanup routine
rather than this weak reference beast.

Thanks.

-- 
tejun
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Michal Hocko @ 2013-06-05  7:30 UTC
  To: Tejun Heo; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

On Tue 04-06-13 14:55:35, Tejun Heo wrote:
> Hello, Michal.
> 
> On Tue, Jun 04, 2013 at 11:28:08PM +0200, Michal Hocko wrote:
> > Well, I do not mind pinning when I know that somebody releases the
> > reference in a predictable future (ideally almost immediately).  But the
> > cached iter represents time-unbounded pinning, because nobody can
> > guarantee that priority 3 at zone Normal at node 3 will ever be scanned
> > again, and the pointer in the last_visited node will be stuck there for
> 
> I don't really get that.  As long as the amount is bound and the
> overhead negligible / acceptable, why does it matter how long the
> pinning persists?

Because the amount is not bound either.  Just create a hierarchy and
trigger the hard limit, and if you are careful enough you can always keep
some of the children in the cached pointer (with a css reference, if you
will) and then release the hierarchy.  You can do that repeatedly and leak
a considerable amount of memory.

> We aren't talking about something gigantic or can

mem_cgroup is 888B now (depending on configuration), so I wouldn't call it
negligible.

> leak continuously.  It will only matter iff cgroups are continuously
> created and destroyed and each live memcg will be able to pin one
> memcg (BTW, I think I forgot to unpin on memcg destruction).
> 
> > eternity.  Can we free a memcg with only the css elevated and safely check
> > that the cached pointer can be used without similar dances to the ones we
> > have now?  I am open to any suggestions.
> 
> I really think this is worrying too much about something which doesn't
> really matter and then coming up with an over-engineered solution for
> the imagined problem.  This isn't a real problem.  No solution is
> necessary.
> 
> On the off chance that this is a real problem, which I strongly doubt,
> as I wrote to Johannes, we can implement an extremely dumb cleanup
> routine rather than this weak reference beast.

That was my first version (https://lkml.org/lkml/2013/1/3/298) and
Johannes didn't like it.  To be honest I do not care _much_ which way we
go, but we definitely cannot pin those objects for ever.

-- 
Michal Hocko
SUSE Labs
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Tejun Heo @ 2013-06-05  8:20 UTC
  To: Michal Hocko; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

Hello, Michal.

On Wed, Jun 05, 2013 at 09:30:23AM +0200, Michal Hocko wrote:
> > I don't really get that.  As long as the amount is bound and the
> > overhead negligible / acceptable, why does it matter how long the
> > pinning persists?
> 
> Because the amount is not bound either.  Just create a hierarchy and
> trigger the hard limit, and if you are careful enough you can always keep
> some of the children in the cached pointer (with a css reference, if you
> will) and then release the hierarchy.  You can do that repeatedly and leak
> a considerable amount of memory.

It's still bound, no?  Each live memcg can only keep a limited number of
cgroups cached, right?

> > We aren't talking about something gigantic or can
> 
> mem_cgroup is 888B now (depending on configuration), so I wouldn't call it
> negligible.

Do you think that the number can actually grow to be harmful?  Would you
be kind enough to share some calculations with me?

> > On the off chance that this is a real problem, which I strongly doubt,
> > as I wrote to Johannes, we can implement an extremely dumb cleanup
> > routine rather than this weak reference beast.
> 
> That was my first version (https://lkml.org/lkml/2013/1/3/298) and
> Johannes didn't like it.  To be honest I do not care _much_ which way we
> go, but we definitely cannot pin those objects for ever.

I'll get to the barrier thread but really complex barrier dancing like
that is only justifiable in extremely hot paths a lot of people pay
attention to.  It doesn't belong inside memcg proper.  If the cached
amount is an actual concern, let's please implement a simple clean up
thing.  All we need is a single delayed_work which scans the tree
periodically.

Johannes, what do you think?

Thanks.

-- 
tejun
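A rough sketch of the kind of periodic delayed_work reaper being floated
here -- illustrative only, not code from this series; the 5*HZ interval
and the unconditional clearing are placeholders (the actual suggestion
below only clears entries whose timestamp is old enough):

	static void memcg_iter_reap_fn(struct work_struct *work);
	static DECLARE_DELAYED_WORK(memcg_iter_reap_work, memcg_iter_reap_fn);

	static void memcg_iter_reap_fn(struct work_struct *work)
	{
		struct mem_cgroup *memcg = NULL;

		/* walk every memcg and drop its cached iterator positions */
		while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))) {
			int nid, zid, prio;

			for_each_node(nid) {
				for (zid = 0; zid < MAX_NR_ZONES; zid++) {
					struct mem_cgroup_per_zone *mz;

					mz = mem_cgroup_zoneinfo(memcg, nid, zid);
					for (prio = 0; prio <= DEF_PRIORITY; prio++) {
						struct mem_cgroup *cached;

						cached = xchg(&mz->reclaim_iter[prio].last_visited,
							      NULL);
						if (cached)
							css_put(&cached->css);
					}
				}
			}
		}
		schedule_delayed_work(&memcg_iter_reap_work, 5 * HZ);
	}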
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Michal Hocko @ 2013-06-05  8:36 UTC
  To: Tejun Heo; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

On Wed 05-06-13 01:20:23, Tejun Heo wrote:
> Hello, Michal.
> 
> On Wed, Jun 05, 2013 at 09:30:23AM +0200, Michal Hocko wrote:
> > > I don't really get that.  As long as the amount is bound and the
> > > overhead negligible / acceptable, why does it matter how long the
> > > pinning persists?
> > 
> > Because the amount is not bound either.  Just create a hierarchy and
> > trigger the hard limit, and if you are careful enough you can always keep
> > some of the children in the cached pointer (with a css reference, if you
> > will) and then release the hierarchy.  You can do that repeatedly and leak
> > a considerable amount of memory.
> 
> It's still bound, no?  Each live memcg can only keep a limited number of
> cgroups cached, right?

Assuming that they are cleaned up when the memcg is offlined, then yes.

> > > We aren't talking about something gigantic or can
> > 
> > mem_cgroup is 888B now (depending on configuration), so I wouldn't call it
> > negligible.
> 
> Do you think that the number can actually grow to be harmful?  Would you
> be kind enough to share some calculations with me?

Well, each intermediate node might pin up to NR_NODES * NR_ZONES *
NR_PRIORITY groups.  You would need a big hierarchy to have a chance to
cache different groups so that it starts to matter.

The problem is the clean-up though.  It might be a simple object, but
that does not help when it never gets freed.  So there _must_ be
something that releases the css reference to free the associated
resources.  As I said, this can be done either during css_offline or in
the lazy fashion that we have currently.  I really do not care much which
way it is done.

> > > On the off chance that this is a real problem, which I strongly doubt,
> > > as I wrote to Johannes, we can implement an extremely dumb cleanup
> > > routine rather than this weak reference beast.
> > 
> > That was my first version (https://lkml.org/lkml/2013/1/3/298) and
> > Johannes didn't like it.  To be honest I do not care _much_ which way we
> > go, but we definitely cannot pin those objects for ever.
> 
> I'll get to the barrier thread but really complex barrier dancing like
> that is only justifiable in extremely hot paths a lot of people pay
> attention to.  It doesn't belong inside memcg proper.  If the cached
> amount is an actual concern, let's please implement a simple clean up
> thing.  All we need is a single delayed_work which scans the tree
> periodically.

And do what?  css_tryget to find out whether the cached memcg is still
alive?  Sorry, I do not like it at all.  I find it much better to clean
up when the group is removed, because doing things asynchronously just
makes it more obscure.  There is no reason to do such a thing in the
background when we know _when_ to do the cleanup, and that is definitely
_not a hot path_.

> Johannes, what do you think?
> 
> Thanks.
> 
> -- 
> tejun

-- 
Michal Hocko
SUSE Labs
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter
  From: Tejun Heo @ 2013-06-05  8:44 UTC
  To: Michal Hocko; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan

Hey,

On Wed, Jun 05, 2013 at 10:36:28AM +0200, Michal Hocko wrote:
> > It's still bound, no?  Each live memcg can only keep a limited number of
> > cgroups cached, right?
> 
> Assuming that they are cleaned up when the memcg is offlined, then yes.

Oh yeah, that's just me being forgetful.  We definitely need to clean it
up on offlining.

> > Do you think that the number can actually grow to be harmful?  Would you
> > be kind enough to share some calculations with me?
> 
> Well, each intermediate node might pin up to NR_NODES * NR_ZONES *
> NR_PRIORITY groups.  You would need a big hierarchy to have a chance to
> cache different groups so that it starts to matter.

Yeah, NR_NODES can be pretty big.  I'm still not sure whether this would
be a problem in practice but yeah it can grow pretty big.

> And do what?  css_tryget to find out whether the cached memcg is still

Hmmm?  It can just look at the timestamp and if too old do

	cached = xchg(&iter->hint, NULL);
	if (cached)
		css_put(cached);

> alive?  Sorry, I do not like it at all.  I find it much better to clean
> up when the group is removed, because doing things asynchronously just
> makes it more obscure.  There is no reason to do such a thing in the
> background when we know _when_ to do the cleanup, and that is definitely
> _not a hot path_.

Yeah, that's true.  I just wanna avoid the barrier dancing.  Only one of
the ancestors can cache a memcg, right?  Walking up the tree scanning for
cached ones and putting them should work?  Is that what you were
suggesting?

Thanks.

-- 
tejun
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 8:44 ` Tejun Heo @ 2013-06-05 8:55 ` Michal Hocko 2013-06-05 9:03 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Michal Hocko @ 2013-06-05 8:55 UTC (permalink / raw) To: Tejun Heo; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan On Wed 05-06-13 01:44:56, Tejun Heo wrote: [...] > > alive. Sorry, I do not like it at all. I find it much better to clean up > > when the group is removed. Because doing things asynchronously just > > makes it more obscure. There is no reason to do such a thing on the > > background when we know _when_ to do the cleanup and that is definitely > > _not a hot path_. > > Yeah, that's true. I just wanna avoid the barrier dancing. Only one > of the ancestors can cache a memcg, right? No. All of them on the way up hierarchy. Basically each parent which ever triggered the reclaim caches reclaimers. > Walking up the tree scanning for cached ones and putting them should > work? Is that what you were suggesting? That was my first version of the patch I linked in the previous email. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 8:55 ` Michal Hocko @ 2013-06-05 9:03 ` Tejun Heo 0 siblings, 0 replies; 42+ messages in thread From: Tejun Heo @ 2013-06-05 9:03 UTC (permalink / raw) To: Michal Hocko; +Cc: hannes, bsingharora, cgroups, linux-mm, lizefan Hey, On Wed, Jun 05, 2013 at 10:55:31AM +0200, Michal Hocko wrote: > > Yeah, that's true. I just wanna avoid the barrier dancing. Only one > > of the ancestors can cache a memcg, right? > > No. All of them on the way up hierarchy. Basically each parent which > ever triggered the reclaim caches reclaimers. Oh, I meant only the ancestors can cache a memcg, so yeap. > > Walking up the tree scanning for cached ones and putting them should > > work? Is that what you were suggesting? > > That was my first version of the patch I linked in the previous email. Yeah, indeed. Johannes, what do you think? Between the recent cgroup iterator update and xchg(), we don't need the weak referencing and it's just wrong to have that level of complexity in memcg proper. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 8:20 ` Tejun Heo 2013-06-05 8:36 ` Michal Hocko @ 2013-06-05 14:39 ` Johannes Weiner 2013-06-05 14:50 ` Johannes Weiner ` (2 more replies) 1 sibling, 3 replies; 42+ messages in thread From: Johannes Weiner @ 2013-06-05 14:39 UTC (permalink / raw) To: Tejun Heo; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan On Wed, Jun 05, 2013 at 01:20:23AM -0700, Tejun Heo wrote: > Hello, Michal. > > On Wed, Jun 05, 2013 at 09:30:23AM +0200, Michal Hocko wrote: > > > I don't really get that. As long as the amount is bound and the > > > overhead negligible / acceptable, why does it matter how long the > > > pinning persists? > > > > Because the amount is not bound either. Just create a hierarchy and > > trigger the hard limit and if you are careful enough you can always keep > > some of the children in the cached pointer (with css reference, if you > > will) and then release the hierarchy. You can do that repeatedly and > > leak considerable amount of memory. > > It's still bound, no? Each live memcg can only keep limited number of > cgroups cached, right? It is bounded by the number of memcgs. Each one can have 12 (DEF_PRIORITY) references. > > > We aren't talking about something gigantic or can > > > > mem_cgroup is 888B now (depending on configuration). So I wouldn't call > > it negligible. > > Do you think that the number can actually grow harmful? Would you be > kind enough to share some calculations with me? 5k cgroups * say 10 priority levels * 1k struct mem_cgroup may pin 51M of dead struct mem_cgroup, plus whatever else the css pins. > > > In the off chance that this is a real problem, which I strongly doubt, > > > as I wrote to Johannes, we can implement extremely dumb cleanup > > > routine rather than this weak reference beast. > > > > That was my first version (https://lkml.org/lkml/2013/1/3/298) and > > Johannes didn't like. To be honest I do not care _much_ which way we go > > but we definitely cannot pin those objects for ever. > > I'll get to the barrier thread but really complex barrier dancing like > that is only justifiable in extremely hot paths a lot of people pay > attention to. It doesn't belong inside memcg proper. If the cached > amount is an actual concern, let's please implement a simple clean up > thing. All we need is a single delayed_work which scans the tree > periodically. > > Johannes, what do you think? While I see your concerns about complexity (and this certainly is not the most straight-forward code), I really can't get too excited about asynchroneous garbage collection, even worse when it's time-based. It would probably start out with less code but two releases later we would have added all this stuff that's required to get the interaction right and fix unpredictable reclaim disruption that hits when the reaper coincides just right with heavy reclaim once a week etc. I just don't think that asynchroneous models are simpler than state machines. Harder to reason about, harder to debug. Now, there are separate things that add complexity to our current code: the weak pointers, the lockless iterator, and the fact that all of it is jam-packed into one monolithic iterator function. I can see why you are not happy. But that does not mean we have to get rid of everything wholesale. You hate the barriers, so let's add a lock to access the iterator. That path is not too hot in most cases. 
On the other hand, the weak pointer is not too esoteric of a pattern and we can neatly abstract it into two functions: one that takes an iterator and returns a verified css reference or NULL, and one to invalidate pointers when called from the memcg destruction code. These two things should greatly simplify mem_cgroup_iter() while not completely abandoning all our optimizations. What do you think? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 14:39 ` Johannes Weiner @ 2013-06-05 14:50 ` Johannes Weiner 2013-06-05 14:56 ` Michal Hocko 2013-06-05 17:22 ` Tejun Heo 2 siblings, 0 replies; 42+ messages in thread From: Johannes Weiner @ 2013-06-05 14:50 UTC (permalink / raw) To: Tejun Heo; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan On Wed, Jun 05, 2013 at 10:39:49AM -0400, Johannes Weiner wrote: > On Wed, Jun 05, 2013 at 01:20:23AM -0700, Tejun Heo wrote: > > Hello, Michal. > > > > On Wed, Jun 05, 2013 at 09:30:23AM +0200, Michal Hocko wrote: > > > > We aren't talking about something gigantic or can > > > > > > mem_cgroup is 888B now (depending on configuration). So I wouldn't call > > > it negligible. > > > > Do you think that the number can actually grow harmful? Would you be > > kind enough to share some calculations with me? > > 5k cgroups * say 10 priority levels * 1k struct mem_cgroup may pin 51M > of dead struct mem_cgroup, plus whatever else the css pins. Bleh, ... * nr_node_ids * MAX_NR_ZONES. So it is a couple hundred MB in that case. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
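Spelled out as a back-of-the-envelope worst case (the node and zone counts here are illustrative, not from the thread): 5000 memcgs * 1 node * 4 zones * 10 priority levels = 200,000 iterator slots, each able to pin a distinct dead struct mem_cgroup of roughly 1KB, i.e. on the order of 200MB; every additional node multiplies that figure.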
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 14:39 ` Johannes Weiner 2013-06-05 14:50 ` Johannes Weiner @ 2013-06-05 14:56 ` Michal Hocko 2013-06-05 17:22 ` Tejun Heo 2 siblings, 0 replies; 42+ messages in thread From: Michal Hocko @ 2013-06-05 14:56 UTC (permalink / raw) To: Johannes Weiner; +Cc: Tejun Heo, bsingharora, cgroups, linux-mm, lizefan On Wed 05-06-13 10:39:49, Johannes Weiner wrote: [...] > You hate the barriers, so let's add a lock to access the iterator. > That path is not too hot in most cases. yes, basically only reclaimers use that path which makes it a slow path by definition. > On the other hand, the weak pointer is not too esoteric of a pattern > and we can neatly abstract it into two functions: one that takes an > iterator and returns a verified css reference or NULL, and one to > invalidate pointers when called from the memcg destruction code. would be nice. > These two things should greatly simplify mem_cgroup_iter() while not > completely abandoning all our optimizations. > > What do you think? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 14:39 ` Johannes Weiner 2013-06-05 14:50 ` Johannes Weiner 2013-06-05 14:56 ` Michal Hocko @ 2013-06-05 17:22 ` Tejun Heo 2013-06-05 19:45 ` Johannes Weiner 2 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-05 17:22 UTC (permalink / raw) To: Johannes Weiner; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan Hey, Johannes. On Wed, Jun 05, 2013 at 10:39:49AM -0400, Johannes Weiner wrote: > 5k cgroups * say 10 priority levels * 1k struct mem_cgroup may pin 51M > of dead struct mem_cgroup, plus whatever else the css pins. Yeah, it seems like it can grow quite a bit. > > I'll get to the barrier thread but really complex barrier dancing like > > that is only justifiable in extremely hot paths a lot of people pay > > attention to. It doesn't belong inside memcg proper. If the cached > > amount is an actual concern, let's please implement a simple clean up > > thing. All we need is a single delayed_work which scans the tree > > periodically. > > > > Johannes, what do you think? > > While I see your concerns about complexity (and this certainly is not > the most straight-forward code), I really can't get too excited about > asynchroneous garbage collection, even worse when it's time-based. It > would probably start out with less code but two releases later we > would have added all this stuff that's required to get the interaction > right and fix unpredictable reclaim disruption that hits when the > reaper coincides just right with heavy reclaim once a week etc. I > just don't think that asynchroneous models are simpler than state > machines. Harder to reason about, harder to debug. Agreed, but we can do the cleanup from ->css_offline() as Michal suggested. Naively implemented, this will lose the nice property of keeping the iteration point even when the cursor cgroup is removed, which can be an issue if we're actually worrying about cases with 5k cgroups continuously being created and destroyed. Maybe we can make it point to the next cgroup to visit rather than the last visited one and update it from ->css_offline(). > Now, there are separate things that add complexity to our current > code: the weak pointers, the lockless iterator, and the fact that all > of it is jam-packed into one monolithic iterator function. I can see > why you are not happy. But that does not mean we have to get rid of > everything wholesale. > > You hate the barriers, so let's add a lock to access the iterator. > That path is not too hot in most cases. > > On the other hand, the weak pointer is not too esoteric of a pattern > and we can neatly abstract it into two functions: one that takes an > iterator and returns a verified css reference or NULL, and one to > invalidate pointers when called from the memcg destruction code. > > These two things should greatly simplify mem_cgroup_iter() while not > completely abandoning all our optimizations. > > What do you think? I really think the weak pointers should go especially as we can achieve about the same thing with normal RCU dereference. Also, I'm a bit confused about what you're suggesting. If we have invalidation from offline, why do we need weak pointers? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
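A rough sketch of the synchronous variant Tejun describes here, with cursors pointing at the next cgroup to visit. The next_to_visit field and the function name are made up for illustration, the !use_hierarchy special case and synchronization against concurrent iterator updates are omitted, and this is not what was eventually merged:

static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *dead)
{
	struct mem_cgroup *parent = dead;

	/* only ancestors of @dead can have it cached */
	while ((parent = parent_mem_cgroup(parent))) {
		int node, zid, prio;

		for_each_node(node) {
			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
				struct mem_cgroup_per_zone *mz;

				mz = mem_cgroup_zoneinfo(parent, node, zid);
				for (prio = 0; prio <= DEF_PRIORITY; prio++) {
					struct mem_cgroup_reclaim_iter *iter;

					iter = &mz->reclaim_iter[prio];
					/* advance past the dying cgroup */
					if (iter->next_to_visit == dead)
						iter->next_to_visit =
							__mem_cgroup_iter_next(parent, dead);
				}
			}
		}
	}
}

Called from ->css_offline(), this keeps shared reclaim from losing its place when the cursor cgroup goes away, at the cost of an O(ancestors * nodes * zones * priorities) walk per destruction - the trade-off weighed in the rest of the thread.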
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 17:22 ` Tejun Heo @ 2013-06-05 19:45 ` Johannes Weiner 2013-06-05 20:06 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Johannes Weiner @ 2013-06-05 19:45 UTC (permalink / raw) To: Tejun Heo; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan On Wed, Jun 05, 2013 at 10:22:12AM -0700, Tejun Heo wrote: > Hey, Johannes. > > On Wed, Jun 05, 2013 at 10:39:49AM -0400, Johannes Weiner wrote: > > 5k cgroups * say 10 priority levels * 1k struct mem_cgroup may pin 51M > > of dead struct mem_cgroup, plus whatever else the css pins. > > Yeah, it seems like it can grow quite a bit. > > > > I'll get to the barrier thread but really complex barrier dancing like > > > that is only justifiable in extremely hot paths a lot of people pay > > > attention to. It doesn't belong inside memcg proper. If the cached > > > amount is an actual concern, let's please implement a simple clean up > > > thing. All we need is a single delayed_work which scans the tree > > > periodically. > > > > > > Johannes, what do you think? > > > > While I see your concerns about complexity (and this certainly is not > > the most straight-forward code), I really can't get too excited about > > asynchroneous garbage collection, even worse when it's time-based. It > > would probably start out with less code but two releases later we > > would have added all this stuff that's required to get the interaction > > right and fix unpredictable reclaim disruption that hits when the > > reaper coincides just right with heavy reclaim once a week etc. I > > just don't think that asynchroneous models are simpler than state > > machines. Harder to reason about, harder to debug. > > Agreed, but we can do the cleanup from ->css_offline() as Michal > suggested. Naively implemented, this will lose the nice property of > keeping the iteration point even when the cursor cgroup is removed, > which can be an issue if we're actually worrying about cases with 5k > cgroups continuously being created and destroyed. Maybe we can make > it point to the next cgroup to visit rather than the last visited one > and update it from ->css_offline(). I'm not sure what you are suggesting. Synchroneously invalidate every individual iterator upwards the hierarchy every time a cgroup is destroyed? > > Now, there are separate things that add complexity to our current > > code: the weak pointers, the lockless iterator, and the fact that all > > of it is jam-packed into one monolithic iterator function. I can see > > why you are not happy. But that does not mean we have to get rid of > > everything wholesale. > > > > You hate the barriers, so let's add a lock to access the iterator. > > That path is not too hot in most cases. > > > > On the other hand, the weak pointer is not too esoteric of a pattern > > and we can neatly abstract it into two functions: one that takes an > > iterator and returns a verified css reference or NULL, and one to > > invalidate pointers when called from the memcg destruction code. > > > > These two things should greatly simplify mem_cgroup_iter() while not > > completely abandoning all our optimizations. > > > > What do you think? > > I really think the weak pointers should go especially as we can > achieve about the same thing with normal RCU dereference. Also, I'm a > bit confused about what you're suggesting. If we have invalidation > from offline, why do we need weak pointers? The invalidation I am talking about is what we do by increasing the dead counts. 
This lazily invalidates all the weak pointers in the iterators of the hierarchy root. Of course if you do a synchroneous invalidation of individual iterators, we don't need weak pointers anymore and RCU is enough, but that would mean nr_levels * nr_nodes * nr_zones * nr_priority_levels invalidation operations per destruction, whereas the weak pointers are invalidated with one atomic_inc() per nesting level. As I said, the weak pointers are only a few lines of code that can be neatly self-contained (see the invalidate, load, store functions below). Please convince me that your alternative solution will save complexity to such an extent that either the memory waste of indefinite css pinning, or the computational overhead of non-lazy iterator cleanup, is justifiable. diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 010d6c1..e872554 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1148,6 +1148,57 @@ skip_node: return NULL; } +static void mem_cgroup_iter_invalidate(struct mem_cgroup *root) +{ + /* + * When a group in the hierarchy below root is destroyed, the + * hierarchy iterator can no longer be trusted since it might + * have pointed to the destroyed group. Invalidate it. + */ + atomic_inc(&root->dead_count); +} + +static struct mem_cgroup *mem_cgroup_iter_load(struct mem_cgroup_reclaim_iter *iter, + struct mem_cgroup *root, + int *sequence) +{ + struct mem_cgroup *position = NULL; + /* + * A cgroup destruction happens in two stages: offlining and + * release. They are separated by a RCU grace period. + * + * If the iterator is valid, we may still race with an + * offlining. The RCU lock ensures the object won't be + * released, tryget will fail if we lost the race. + */ + *sequence = atomic_read(&root->dead_count); + if (iter->last_dead_count == *sequence) { + smp_rmb(); + position = iter->last_visited; + if (position && !css_tryget(&position->css)) + position = NULL; + } + return position; +} + +static void mem_cgroup_iter_update(struct mem_cgroup_reclaim_iter *iter, + struct mem_cgroup *last_visited, + struct mem_cgroup *new_position, + int sequence) +{ + if (last_visited) + css_put(&last_visited->css); + /* + * We store the sequence count from the time @last_visited was + * loaded successfully instead of rereading it here so that we + * don't lose destruction events in between. We could have + * raced with the destruction of @new_position after all. + */ + iter->last_visited = new_position; + smp_wmb(); + iter->last_dead_count = sequence; +} + /** * mem_cgroup_iter - iterate over memory cgroup hierarchy * @root: hierarchy root @@ -1171,7 +1222,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, { struct mem_cgroup *memcg = NULL; struct mem_cgroup *last_visited = NULL; - unsigned long uninitialized_var(dead_count); if (mem_cgroup_disabled()) return NULL; @@ -1191,6 +1241,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, rcu_read_lock(); while (!memcg) { struct mem_cgroup_reclaim_iter *uninitialized_var(iter); + int sequence; if (reclaim) { int nid = zone_to_nid(reclaim->zone); @@ -1205,38 +1256,13 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, goto out_unlock; } - /* - * If the dead_count mismatches, a destruction - * has happened or is happening concurrently. - * If the dead_count matches, a destruction - * might still happen concurrently, but since - * we checked under RCU, that destruction - * won't free the object until we release the - * RCU reader lock. 
Thus, the dead_count - * check verifies the pointer is still valid, - * css_tryget() verifies the cgroup pointed to - * is alive. - */ - dead_count = atomic_read(&root->dead_count); - smp_rmb(); - last_visited = iter->last_visited; - if (last_visited) { - if ((dead_count != iter->last_dead_count) || - !css_tryget(&last_visited->css)) { - last_visited = NULL; - } - } + last_visited = mem_cgroup_iter_load(iter, root, &sequence); } memcg = __mem_cgroup_iter_next(root, last_visited); if (reclaim) { - if (last_visited) - css_put(&last_visited->css); - - iter->last_visited = memcg; - smp_wmb(); - iter->last_dead_count = dead_count; + mem_cgroup_iter_update(iter, last_visited, memcg, sequence); if (!memcg) iter->generation++; @@ -6321,14 +6347,14 @@ static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg) struct mem_cgroup *parent = memcg; while ((parent = parent_mem_cgroup(parent))) - atomic_inc(&parent->dead_count); + mem_cgroup_iter_invalidate(parent); /* * if the root memcg is not hierarchical we have to check it * explicitely. */ if (!root_mem_cgroup->use_hierarchy) - atomic_inc(&root_mem_cgroup->dead_count); + mem_cgroup_iter_invalidate(parent); } static void mem_cgroup_css_offline(struct cgroup *cont) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 19:45 ` Johannes Weiner @ 2013-06-05 20:06 ` Tejun Heo 2013-06-05 21:17 ` Johannes Weiner 0 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-05 20:06 UTC (permalink / raw) To: Johannes Weiner; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan Hello, On Wed, Jun 05, 2013 at 03:45:52PM -0400, Johannes Weiner wrote: > I'm not sure what you are suggesting. Synchroneously invalidate every > individual iterator upwards the hierarchy every time a cgroup is > destroyed? Yeap. > The invalidation I am talking about is what we do by increasing the > dead counts. This lazily invalidates all the weak pointers in the > iterators of the hierarchy root. > > Of course if you do a synchroneous invalidation of individual > iterators, we don't need weak pointers anymore and RCU is enough, but > that would mean nr_levels * nr_nodes * nr_zones * nr_priority_levels > invalidation operations per destruction, whereas the weak pointers are > invalidated with one atomic_inc() per nesting level. While it does have to traverse the arrays, it's still bound by the depth of nesting and cgroup destruction is a pretty cold path. I don't think it'd matter that much. > As I said, the weak pointers are only a few lines of code that can be > neatly self-contained (see the invalidate, load, store functions > below). Please convince me that your alternative solution will save > complexity to such an extent that either the memory waste of > indefinite css pinning, or the computational overhead of non-lazy > iterator cleanup, is justifiable. The biggest issue I see with the weak pointer is that it's special and tricky. If this is something which is absolutely necessary, it should be somewhere more generic. Also, if we can use the usual RCU deref with O(depth) cleanup in the cold path, I don't see how this deviation is justifiable. For people who've been looking at it for long enough, it probably isn't that different from using plain RCU but that's just because that person has spent the time to build that pattern into his/her brain. We now have a lot of people accustomed to plain RCU usages which in itself is tricky already and introducing new constructs is actively deterimental to maintainability. We sure can do that when there's no alternative but I don't think avoiding synchronous cleanup on cgroup destruction path is a good enough reason. It feels like an over-engineering to me. Another thing is that this matters the most when there are continuous creation and destruction of cgroups and the weak pointer implementation would keep resetting the iteration to the beginning. Depending on timing, it'd be able to live-lock reclaim cursor to the beginning of iteration even with fairly low rate of destruction, right? It can be pretty bad high up the tree. With synchronous cleanup, depending on how it's implemented, it can be made to keep the iteration position. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 20:06 ` Tejun Heo @ 2013-06-05 21:17 ` Johannes Weiner 2013-06-05 22:20 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Johannes Weiner @ 2013-06-05 21:17 UTC (permalink / raw) To: Tejun Heo; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan Hi Tejun, On Wed, Jun 05, 2013 at 01:06:12PM -0700, Tejun Heo wrote: > On Wed, Jun 05, 2013 at 03:45:52PM -0400, Johannes Weiner wrote: > > I'm not sure what you are suggesting. Synchroneously invalidate every > > individual iterator upwards the hierarchy every time a cgroup is > > destroyed? > > Yeap. > > As I said, the weak pointers are only a few lines of code that can be > > neatly self-contained (see the invalidate, load, store functions > > below). Please convince me that your alternative solution will save > > complexity to such an extent that either the memory waste of > > indefinite css pinning, or the computational overhead of non-lazy > > iterator cleanup, is justifiable. > > The biggest issue I see with the weak pointer is that it's special and > tricky. If this is something which is absolutely necessary, it should > be somewhere more generic. Also, if we can use the usual RCU deref > with O(depth) cleanup in the cold path, I don't see how this deviation > is justifiable. > > For people who've been looking at it for long enough, it probably > isn't that different from using plain RCU but that's just because that > person has spent the time to build that pattern into his/her brain. > We now have a lot of people accustomed to plain RCU usages which in > itself is tricky already and introducing new constructs is actively > deterimental to maintainability. We sure can do that when there's no > alternative but I don't think avoiding synchronous cleanup on cgroup > destruction path is a good enough reason. It feels like an > over-engineering to me. > > Another thing is that this matters the most when there are continuous > creation and destruction of cgroups and the weak pointer > implementation would keep resetting the iteration to the beginning. > Depending on timing, it'd be able to live-lock reclaim cursor to the > beginning of iteration even with fairly low rate of destruction, > right? It can be pretty bad high up the tree. With synchronous > cleanup, depending on how it's implemented, it can be made to keep the > iteration position. That could be an advantage, yes. But keep in mind that every destruction has to perform this invalidation operation against the global root_mem_cgroup's nr_node * nr_zone * nr_priority_levels iterators, so you can't muck around forever, while possibly holding a lock at this level. It's not a hot path, but you don't want to turn it into one, either. The upshot for me is this: whether you do long-term pinning or greedy iterator invalidation, the cost of cgroup destruction increases. Either in terms of memory usage or in terms of compute time. I would have loved to see something as simple as the long-term pinning work out in practice, because it truly would have been simpler. But at this point, I don't really care much because the projected margins of reduction in complexity and increase of cost from your proposal are too small for me to feel strongly about one solution or the other, or go ahead and write the code. I'll look at your patches, though ;-) Either way, I'll prepare the patch set that includes the barrier fix and a small cleanup to make the weak pointer management more palatable. 
I'm still open to code proposals, so don't let it distract you, but we might as well make it a bit more readable in the meantime. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 21:17 ` Johannes Weiner @ 2013-06-05 22:20 ` Tejun Heo 2013-06-05 22:27 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-05 22:20 UTC (permalink / raw) To: Johannes Weiner; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan Yo, On Wed, Jun 05, 2013 at 05:17:04PM -0400, Johannes Weiner wrote: > That could be an advantage, yes. But keep in mind that every > destruction has to perform this invalidation operation against the > global root_mem_cgroup's nr_node * nr_zone * nr_priority_levels > iterators, so you can't muck around forever, while possibly holding a > lock at this level. It's not a hot path, but you don't want to turn > it into one, either. nr_node tends to be pretty low in most cases, so it shouldn't be a problem there but yeah with high enough nodes and high enough rate of cgroup destruction, I guess it could be an issue in extreme cases. > The upshot for me is this: whether you do long-term pinning or greedy > iterator invalidation, the cost of cgroup destruction increases. > Either in terms of memory usage or in terms of compute time. I would > have loved to see something as simple as the long-term pinning work > out in practice, because it truly would have been simpler. But at > this point, I don't really care much because the projected margins of > reduction in complexity and increase of cost from your proposal are > too small for me to feel strongly about one solution or the other, or > go ahead and write the code. I'll look at your patches, though ;-) I don't know. I've developed this deep-seated distrust of any code which makes creative use of barriers and object lifetimes. We get them wrong too often, it makes other devs a lot more reluctant to review and dive into the code, and it's hellish to track down when something actually goes wrong. I'd happily pay a bit of computation or memory overhead for more conventional construct. In extremely hot paths, sure, we just bite and do it but I don't think this reaches that level. > Either way, I'll prepare the patch set that includes the barrier fix > and a small cleanup to make the weak pointer management more > palatable. I'm still open to code proposals, so don't let it distract > you, but we might as well make it a bit more readable in the meantime. Sure thing. We need to get it fixed for -stable anyway. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 22:20 ` Tejun Heo @ 2013-06-05 22:27 ` Tejun Heo 2013-06-06 11:50 ` Michal Hocko 0 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-05 22:27 UTC (permalink / raw) To: Johannes Weiner; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan On Wed, Jun 05, 2013 at 03:20:21PM -0700, Tejun Heo wrote: > Yo, > > On Wed, Jun 05, 2013 at 05:17:04PM -0400, Johannes Weiner wrote: > > That could be an advantage, yes. But keep in mind that every > > destruction has to perform this invalidation operation against the > > global root_mem_cgroup's nr_node * nr_zone * nr_priority_levels > > iterators, so you can't muck around forever, while possibly holding a > > lock at this level. It's not a hot path, but you don't want to turn > > it into one, either. > > nr_node tends to be pretty low in most cases, so it shouldn't be a > problem there but yeah with high enough nodes and high enough rate of Also, do we need to hold a lock? It doesn't have to be completely strict, so we might as well get away with something like, for_each_cached_pos() { if (hint == me) { /* simple clearing implementation, we prolly wanna push it forward */ cached = xchg(hint, NULL); if (cached) css_put(cached); } } It still scans the memory but wouldn't create any contention. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-05 22:27 ` Tejun Heo @ 2013-06-06 11:50 ` Michal Hocko 2013-06-07 0:52 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Michal Hocko @ 2013-06-06 11:50 UTC (permalink / raw) To: Tejun Heo; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan On Wed 05-06-13 15:27:09, Tejun Heo wrote: > On Wed, Jun 05, 2013 at 03:20:21PM -0700, Tejun Heo wrote: > > Yo, > > > > On Wed, Jun 05, 2013 at 05:17:04PM -0400, Johannes Weiner wrote: > > > That could be an advantage, yes. But keep in mind that every > > > destruction has to perform this invalidation operation against the > > > global root_mem_cgroup's nr_node * nr_zone * nr_priority_levels > > > iterators, so you can't muck around forever, while possibly holding a > > > lock at this level. It's not a hot path, but you don't want to turn > > > it into one, either. > > > > nr_node tends to be pretty low in most cases, so it shouldn't be a > > problem there but yeah with high enough nodes and high enough rate of > > Also, do we need to hold a lock? It doesn't have to be completely > strict, so we might as well get away with something like, > > for_each_cached_pos() { > if (hint == me) { > /* simple clearing implementation, we prolly wanna push it forward */ > cached = xchg(hint, NULL); > if (cached) > css_put(cached); > } > } This would be racy: mem_cgroup_iter rcu_read_lock __mem_cgroup_iter_next cgroup_destroy_locked css_tryget(memcg) atomic_add(CSS_DEACT_BIAS) offline_css(memcg) xchg(memcg, NULL) mem_cgroup_iter_update iter->last_visited = memcg rcy_read_unlock But if it was called from call_rcu the we should be safe AFAICS. > It still scans the memory but wouldn't create any contention. > > Thanks. > > -- > tejun -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-06 11:50 ` Michal Hocko @ 2013-06-07 0:52 ` Tejun Heo 2013-06-07 7:37 ` Michal Hocko 0 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-07 0:52 UTC (permalink / raw) To: Michal Hocko; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan Hello, On Thu, Jun 06, 2013 at 01:50:31PM +0200, Michal Hocko wrote: > > Also, do we need to hold a lock? It doesn't have to be completely > > strict, so we might as well get away with something like, > > > > for_each_cached_pos() { > > if (hint == me) { > > /* simple clearing implementation, we prolly wanna push it forward */ > > cached = xchg(hint, NULL); > > if (cached) > > css_put(cached); > > } > > } > > This would be racy: > mem_cgroup_iter > rcu_read_lock > __mem_cgroup_iter_next cgroup_destroy_locked > css_tryget(memcg) > atomic_add(CSS_DEACT_BIAS) > offline_css(memcg) > xchg(memcg, NULL) > mem_cgroup_iter_update > iter->last_visited = memcg > rcy_read_unlock > > But if it was called from call_rcu the we should be safe AFAICS. Oh yeah, it is racy. That's what I meant by "not having to be completely strict". The race window is small enough and it's not like we're messing up refcnt or may end up with use-after-free. Doing it from RCU would make the race go away but I'm not sure whether the extra RCU bouncing is worthwhile. I don't know. Maybe. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-07 0:52 ` Tejun Heo @ 2013-06-07 7:37 ` Michal Hocko 2013-06-07 23:25 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Michal Hocko @ 2013-06-07 7:37 UTC (permalink / raw) To: Tejun Heo; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan On Thu 06-06-13 17:52:42, Tejun Heo wrote: > Hello, > > On Thu, Jun 06, 2013 at 01:50:31PM +0200, Michal Hocko wrote: > > > Also, do we need to hold a lock? It doesn't have to be completely > > > strict, so we might as well get away with something like, > > > > > > for_each_cached_pos() { > > > if (hint == me) { > > > /* simple clearing implementation, we prolly wanna push it forward */ > > > cached = xchg(hint, NULL); > > > if (cached) > > > css_put(cached); > > > } > > > } > > > > This would be racy: > > mem_cgroup_iter > > rcu_read_lock > > __mem_cgroup_iter_next cgroup_destroy_locked > > css_tryget(memcg) > > atomic_add(CSS_DEACT_BIAS) > > offline_css(memcg) > > xchg(memcg, NULL) > > mem_cgroup_iter_update > > iter->last_visited = memcg > > rcy_read_unlock > > > > But if it was called from call_rcu the we should be safe AFAICS. > > Oh yeah, it is racy. That's what I meant by "not having to be > completely strict". The race window is small enough and it's not like > we're messing up refcnt or may end up with use-after-free. But it would potentially pin (aka leak) the memcg for ever. > Doing it from RCU would make the race go away but I'm not sure whether > the extra RCU bouncing is worthwhile. I don't know. Maybe. > > Thanks. > > -- > tejun -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-07 7:37 ` Michal Hocko @ 2013-06-07 23:25 ` Tejun Heo 2013-06-10 8:02 ` Michal Hocko 0 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-07 23:25 UTC (permalink / raw) To: Michal Hocko; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan Hello, Michal. On Fri, Jun 07, 2013 at 09:37:54AM +0200, Michal Hocko wrote: > > Oh yeah, it is racy. That's what I meant by "not having to be > > completely strict". The race window is small enough and it's not like > > we're messing up refcnt or may end up with use-after-free. > > But it would potentially pin (aka leak) the memcg for ever. It wouldn't be anything systematic tho - the race condition's likelihood is low and increases with the frequency of reclaim iteration, which at the same time means that it's likely to remedy itself pretty soon. I'm doubtful it'd matter. If it's still bothering, we sure can do it from RCU callback. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-07 23:25 ` Tejun Heo @ 2013-06-10 8:02 ` Michal Hocko 2013-06-10 19:54 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Michal Hocko @ 2013-06-10 8:02 UTC (permalink / raw) To: Tejun Heo; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan On Fri 07-06-13 16:25:57, Tejun Heo wrote: > Hello, Michal. > > On Fri, Jun 07, 2013 at 09:37:54AM +0200, Michal Hocko wrote: > > > Oh yeah, it is racy. That's what I meant by "not having to be > > > completely strict". The race window is small enough and it's not like > > > we're messing up refcnt or may end up with use-after-free. > > > > But it would potentially pin (aka leak) the memcg for ever. > > It wouldn't be anything systemetic tho - race condition's likliness is > low and increases with the frequency of reclaim iteration, which at > the same time means that it's likely to remedy itself pretty soon. Sure a next visit on the same root subtree (same node, zone and prio) would css_put it but what if that root goes away itself. Still fixable, if every group checks its own cached iters and css_put everybody but that is even uglier. So doing the up-the-hierarchy cleanup in RCU callback is much easier. > I'm doubtful it'd matter. If it's still bothering, we sure can do it > from RCU callback. Yes, I would definitely prefer correctness over likeliness here. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
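What the call_rcu() deferral could look like - again only a sketch with made-up names; it assumes the rcu_head and the memcg itself stay valid until the callback runs (offline and release are separated by a grace period) and that css_put() is safe from callback context:

static void mem_cgroup_reclaim_iter_reap(struct rcu_head *head)
{
	struct mem_cgroup *dead = container_of(head, struct mem_cgroup, iter_reap_rcu);

	/*
	 * A grace period has passed, so no reclaimer can still be inside
	 * the rcu_read_lock() section that publishes @dead into
	 * iter->last_visited - the race traced above is closed.
	 */
	mem_cgroup_clear_reclaim_caches(dead);	/* ancestor walk doing xchg() + css_put() */
}

static void mem_cgroup_css_offline(struct cgroup *cont)
{
	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);

	/* ... existing offlining work ... */
	call_rcu(&memcg->iter_reap_rcu, mem_cgroup_reclaim_iter_reap);
}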
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-10 8:02 ` Michal Hocko @ 2013-06-10 19:54 ` Tejun Heo 2013-06-10 20:48 ` Michal Hocko 0 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-10 19:54 UTC (permalink / raw) To: Michal Hocko; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan Hello, Michal. On Mon, Jun 10, 2013 at 10:02:08AM +0200, Michal Hocko wrote: > Sure a next visit on the same root subtree (same node, zone and prio) > would css_put it but what if that root goes away itself. Still fixable, > if every group checks its own cached iters and css_put everybody but > that is even uglier. So doing the up-the-hierarchy cleanup in RCU > callback is much easier. Ooh, right, we don't need cleanup of the cached cursors on destruction if we get this correct - especially if we make cursors point to the next cgroup to visit as self is always the first one to visit. Yeah, if we can do away with that, doing that way is definitely better. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-10 19:54 ` Tejun Heo @ 2013-06-10 20:48 ` Michal Hocko 2013-06-10 23:13 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Michal Hocko @ 2013-06-10 20:48 UTC (permalink / raw) To: Tejun Heo; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan On Mon 10-06-13 12:54:26, Tejun Heo wrote: > Hello, Michal. > > On Mon, Jun 10, 2013 at 10:02:08AM +0200, Michal Hocko wrote: > > Sure a next visit on the same root subtree (same node, zone and prio) > > would css_put it but what if that root goes away itself. Still fixable, > > if every group checks its own cached iters and css_put everybody but > > that is even uglier. So doing the up-the-hierarchy cleanup in RCU > > callback is much easier. > > Ooh, right, we don't need cleanup of the cached cursors on destruction > if we get this correct - especially if we make cursors point to the > next cgroup to visit as self is always the first one to visit. You would need to pin the next-to-visit memcg as well, so you need a cleanup on the removal. > Yeah, if we can do away with that, doing that way is definitely > better. The only advantage I can see from next-to-visit caching is that the destruction path can reuse __mem_cgroup_iter_next unlike last_visited which would need to develop a code to get the previous member. Maybe it is worth a try. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-10 20:48 ` Michal Hocko @ 2013-06-10 23:13 ` Tejun Heo 2013-06-11 7:27 ` Michal Hocko 0 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-10 23:13 UTC (permalink / raw) To: Michal Hocko; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan Hey, On Mon, Jun 10, 2013 at 10:48:01PM +0200, Michal Hocko wrote: > > Ooh, right, we don't need cleanup of the cached cursors on destruction > > if we get this correct - especially if we make cursors point to the > > next cgroup to visit as self is always the first one to visit. > > You would need to pin the next-to-visit memcg as well, so you need a > cleanup on the removal. But that'd be one of the descendants of the said cgroup and there can be no descendant left when the cgroup is being removed. What am I missing? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-10 23:13 ` Tejun Heo @ 2013-06-11 7:27 ` Michal Hocko 2013-06-11 7:44 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Michal Hocko @ 2013-06-11 7:27 UTC (permalink / raw) To: Tejun Heo; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan On Mon 10-06-13 16:13:58, Tejun Heo wrote: > Hey, > > On Mon, Jun 10, 2013 at 10:48:01PM +0200, Michal Hocko wrote: > > > Ooh, right, we don't need cleanup of the cached cursors on destruction > > > if we get this correct - especially if we make cursors point to the > > > next cgroup to visit as self is always the first one to visit. > > > > You would need to pin the next-to-visit memcg as well, so you need a > > cleanup on the removal. > > But that'd be one of the descendants of the said cgroup and there can > be no descendant left when the cgroup is being removed. What am I > missing?
        .
        .
        .
        A (cached=E)
       /|\____________
      / | \
     B  D (cached=E)   F<
       / | \
      C<  E  G
          ^
          removed
* D level cache - nobody left for either approach
* A level is
  - F for next-to-visit
  - C for last_visited
You have to go up the hierarchy and handle the root cgroup as a special case for !root->use_hierarchy. Once you have a non-NULL new cache it can be propagated without a new search (which I hadn't realized when working on this approach the last time - not that it would save some code in the end). Makes sense? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-11 7:27 ` Michal Hocko @ 2013-06-11 7:44 ` Tejun Heo 2013-06-11 7:55 ` Michal Hocko 0 siblings, 1 reply; 42+ messages in thread From: Tejun Heo @ 2013-06-11 7:44 UTC (permalink / raw) To: Michal Hocko; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan On Tue, Jun 11, 2013 at 09:27:43AM +0200, Michal Hocko wrote: > . > . > . > A (cached=E) > /|\____________ > / | \ > B D (cached=E) F< > / | \ > C< E G > ^ > removed > > * D level cache - nobody left for either approach approach > * A level is > - F for next-to-visit > - C for last_visited > > You have to get up the hierarchy and handle root cgroup as a special > case for !root->use_hierarchy. Once you have non-NULL new cache the it > can be propagated without a new search (which I haven't realized when > working on this approach the last time - not that it would safe some > code in the end). > > Makes sense? I don't think we're talking about the same thing. I wasn't talking about skipping walking up the hierarchy (differently depending on use_hierarchy of course) when E is removed. I was talking about skipped cleaning E's cache when removing E as it's guaranteed to be empty by then. The difference between caching the last and next one is that if we put the last one in the cache, E's cache could be pointing to itself and needs to be scanned. Not a big difference either way but if you combine that with the need for special rewinding which will basically come down to traversing the sibling list again, pointing to the next entry is just easier. Anyways, I think we're getting too deep into details but one more thing, what do you mean by "non-NULL new cache"? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-11 7:44 ` Tejun Heo @ 2013-06-11 7:55 ` Michal Hocko 2013-06-11 8:00 ` Tejun Heo 0 siblings, 1 reply; 42+ messages in thread From: Michal Hocko @ 2013-06-11 7:55 UTC (permalink / raw) To: Tejun Heo; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan On Tue 11-06-13 00:44:04, Tejun Heo wrote: > On Tue, Jun 11, 2013 at 09:27:43AM +0200, Michal Hocko wrote: > > . > > . > > . > > A (cached=E) > > /|\____________ > > / | \ > > B D (cached=E) F< > > / | \ > > C< E G > > ^ > > removed > > > > * D level cache - nobody left for either approach approach > > * A level is > > - F for next-to-visit > > - C for last_visited > > > > You have to get up the hierarchy and handle root cgroup as a special > > case for !root->use_hierarchy. Once you have non-NULL new cache the it > > can be propagated without a new search (which I haven't realized when > > working on this approach the last time - not that it would safe some > > code in the end). > > > > Makes sense? > > I don't think we're talking about the same thing. I wasn't talking > about skipping walking up the hierarchy (differently depending on > use_hierarchy of course) when E is removed. I was talking about > skipped cleaning E's cache when removing E as it's guaranteed to be > empty by then. Ahh, sorry I have misread your reply then. You are right that caching the next-to-visit would keep its own cache clean. > The difference between caching the last and next one is that if we put > the last one in the cache, E's cache could be pointing to itself and > needs to be scanned. Right. > Not a big difference either way but if you combine that with the need > for special rewinding which will basically come down to traversing the > sibling list again, pointing to the next entry is just easier. > > Anyways, I think we're getting too deep into details but one more > thing, what do you mean by "non-NULL new cache"? If you replace cached memcg by a new (non-NULL) one then all the parents up the hierarchy can reuse the same replacement and do not have to search again. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-11 7:55 ` Michal Hocko @ 2013-06-11 8:00 ` Tejun Heo 0 siblings, 0 replies; 42+ messages in thread From: Tejun Heo @ 2013-06-11 8:00 UTC (permalink / raw) To: Michal Hocko; +Cc: Johannes Weiner, bsingharora, cgroups, linux-mm, lizefan Hello, On Tue, Jun 11, 2013 at 09:55:40AM +0200, Michal Hocko wrote: > > Anyways, I think we're getting too deep into details but one more > > thing, what do you mean by "non-NULL new cache"? > > If you replace cached memcg by a new (non-NULL) one then all the parents > up the hierarchy can reuse the same replacement and do not have to > search again. As finding the next one to visit is pretty cheap, it isn't likely to be a big difference but yeah we can definitely re-use the first non-NULL next for all further ancestors. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
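Folding the reuse agreed on above into the destruction-time walk could look roughly like this - a refinement of the earlier sketch, with for_each_reclaim_iter() and next_to_visit still made up and reference handling omitted:

static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *dead)
{
	struct mem_cgroup *parent = dead;
	struct mem_cgroup *repl = NULL;
	struct mem_cgroup_reclaim_iter *iter;

	while ((parent = parent_mem_cgroup(parent))) {
		/*
		 * A pre-order successor found against a lower root is also
		 * the successor against every higher root, so the search
		 * only needs repeating until it comes back non-NULL.
		 */
		if (!repl)
			repl = __mem_cgroup_iter_next(parent, dead);

		for_each_reclaim_iter(parent, iter)	/* made-up helper */
			if (iter->next_to_visit == dead)
				iter->next_to_visit = repl;	/* NULL simply restarts the walk */
	}
}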
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-04 20:50 ` Tejun Heo 2013-06-04 21:28 ` Michal Hocko @ 2013-06-04 21:40 ` Johannes Weiner 2013-06-04 21:49 ` Tejun Heo 1 sibling, 1 reply; 42+ messages in thread From: Johannes Weiner @ 2013-06-04 21:40 UTC (permalink / raw) To: Tejun Heo; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan On Tue, Jun 04, 2013 at 01:50:25PM -0700, Tejun Heo wrote: > Hello, Michal. > > On Tue, Jun 04, 2013 at 03:18:43PM +0200, Michal Hocko wrote: > > > + if (memcg) > > > + css_get(&memcg->css); > > > > This is all good and nice but it re-introduces the same problem which > > has been fixed by (5f578161: memcg: relax memcg iter caching). You are > > pinning memcg in memory for unbounded amount of time because css > > reference will not let object to leave and rest. > > I don't get why that is a problem. Can you please elaborate? css's > now explicitly allow holding onto them. We now have clear separation > of "destruction" and "release" and blkcg also depends on it. If memcg > still doesn't distinguish the two properly, that's where the problem > should be fixed. > > > I understand your frustration about the complexity of the current > > synchronization but we didn't come up with anything easier. > > Originally I though that your tree walk updates which allow dropping rcu > > would help here but then I realized that not really because the iterator > > (resp. pos) has to be a valid pointer and there is only one possibility > > to do that AFAICS here and that is css pinning. And is no-go. > > I find the above really weird. If css can't be pinned for position > caching, isn't it natural to ask why it can't be and then fix it? > Because that's what the whole refcnt thing is about and a usage which > cgroup explicitly allows (e.g. blkcg also does it). Why do you go > from there to "this batshit crazy barrier dancing is the only > solution"? > > Can you please explain why memcg css's can't be pinned? We might pin them indefinitely. In a hierarchy with hundreds of groups that is short by 10M of memory, we only reclaim from a couple of groups before we stop and leave the iterator pointing somewhere in the hierarchy. Until the next reclaimer comes along, which might be a split second later or three days later. There is a reclaim iterator for every memcg (since every memcg represents a hierarchy), so we could pin a lot of csss for an indefinite amount of time. If you say that the delta between destruction and release is small enough, I'd be happy to get rid of the weak referencing. We had weak referencing with css_id before and didn't want to lose predictability and efficiency of our resource usage when switching away from it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/3] memcg: simplify mem_cgroup_reclaim_iter 2013-06-04 21:40 ` Johannes Weiner @ 2013-06-04 21:49 ` Tejun Heo 0 siblings, 0 replies; 42+ messages in thread From: Tejun Heo @ 2013-06-04 21:49 UTC (permalink / raw) To: Johannes Weiner; +Cc: Michal Hocko, bsingharora, cgroups, linux-mm, lizefan Hello, Johannes. On Tue, Jun 04, 2013 at 05:40:50PM -0400, Johannes Weiner wrote: > We might pin them indefinitely. In a hierarchy with hundreds of > groups that is short by 10M of memory, we only reclaim from a couple > of groups before we stop and leave the iterator pointing somewhere in > the hierarchy. Until the next reclaimer comes along, which might be a > split second later or three days later. > > There is a reclaim iterator for every memcg (since every memcg > represents a hierarchy), so we could pin a lot of csss for an > indefinite amount of time. As long as it's bound by the actual number of memcgs in the system and dead cgroups don't pin any other resources, I don't think pinning css and thus memcg struct itself is something we need to worry about. Especially not at the cost of this weak referencing thing. If the large number of unused but pinned css's actually is a problem (which I seriously doubt), we can implement a trivial timer based cache expiration which can be extremely coarse - ie. each iterator just keeps the last time stamp it was used and cleanup runs every ten mins or whatever. It'll be like twenty lines of completely obvious code. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
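The "twenty lines of completely obvious code" could look something like this - entirely hypothetical: the last_used timestamp, the helper macro and the interval are invented, and it assumes the long-term pinning from patch 3/3 is in place:

#define MEMCG_ITER_EXPIRE	(600 * HZ)	/* "every ten mins or whatever" */

static void memcg_iter_reap_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(memcg_iter_reap_work, memcg_iter_reap_fn);

static void memcg_iter_reap_fn(struct work_struct *work)
{
	struct mem_cgroup *memcg;

	/* walk every memcg and drop cached positions that went stale */
	for (memcg = mem_cgroup_iter(NULL, NULL, NULL); memcg;
	     memcg = mem_cgroup_iter(NULL, memcg, NULL)) {
		struct mem_cgroup_reclaim_iter *iter;

		for_each_reclaim_iter(memcg, iter) {	/* made-up helper */
			struct mem_cgroup *cached;

			if (time_before(jiffies, iter->last_used + MEMCG_ITER_EXPIRE))
				continue;
			cached = xchg(&iter->last_visited, NULL);
			if (cached)
				css_put(&cached->css);
		}
	}
	schedule_delayed_work(&memcg_iter_reap_work, MEMCG_ITER_EXPIRE);
}

Like the lockless clearing discussed elsewhere in the thread, this is racy against a concurrent iterator update; it only bounds how long a dead css can stay pinned rather than guaranteeing prompt release.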