From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx179.postini.com [74.125.245.179]) by kanga.kvack.org (Postfix) with SMTP id 5B75F6B0069 for ; Thu, 3 Jan 2013 12:54:39 -0500 (EST)
From: Michal Hocko
Subject: [PATCH v3 0/7] rework mem_cgroup iterator
Date: Thu, 3 Jan 2013 18:54:14 +0100
Message-Id: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han, Tejun Heo, Glauber Costa, Li Zefan

Hi all,
this is the third version of the patchset previously posted here:
https://lkml.org/lkml/2012/11/26/616

The patch set tries to make mem_cgroup_iter saner in the way it walks hierarchies. css->id based traversal is far from ideal because it is not deterministic: it depends on the creation ordering.

The diffstat doesn't look as promising as in the previous versions anymore, but I think the resulting outcome (and the sanity ;)) is worth it.

The first patch fixes a potential misbehavior which I haven't seen in practice, but the fix is needed for the later patches anyway. We could take it alone as well, but I do not have any bug report to base the fix on. The second one is also preparatory and it is new to the series.

The third patch is the core of the patchset. It replaces css_get_next based on css_id by the generic cgroup pre-order iterator, which means that css_id is no longer used by memcg. This brings some challenges for the last visited group caching during the reclaim (mem_cgroup_per_zone::reclaim_iter). We have to use memcg pointers directly now, which means that we have to keep a reference to those groups' css to keep them alive.

The next patch fixes up an unbounded cgroup removal holdoff caused by the elevated css refcount and does the cleanup on group removal. Thanks to Ying who spotted this during testing of the previous version of the patchset.
I could have folded it into the previous patch, but I felt the result would be too big to review. If people feel it would be better that way, I have no problem squashing them together.

The fifth and sixth patches are an attempt to simplify mem_cgroup_iter. css juggling is removed and the iteration logic is moved to a helper so that the reference counting and iteration are separated.

The last patch just removes css_get_next as there is no user for it any longer.

I am also thinking that leaf-to-root iteration makes more sense, but that patch is not included in the series yet because I have to think some more about the justification.

Same as with the previous version, I have tested with a quite simple hierarchy:

        A (limit = 280M, use_hierarchy=true)
      / | \
     B  C  D (all have 100M limit)

and a separate kernel build in each leaf group. This triggers both children-only and hierarchical reclaim, which run in parallel, so the reclaim_iter caching is active a lot. I will hammer it some more but the series should be in quite good shape already.

Michal Hocko (7):
      memcg: synchronize per-zone iterator access by a spinlock
      memcg: keep prev's css alive for the whole mem_cgroup_iter
      memcg: rework mem_cgroup_iter to use cgroup iterators
      memcg: remove memcg from the reclaim iterators
      memcg: simplify mem_cgroup_iter
      memcg: further simplify mem_cgroup_iter
      cgroup: remove css_get_next

And the diffstat says:
 include/linux/cgroup.h |    7 --
 kernel/cgroup.c        |   49 ------------
 mm/memcontrol.c        |  199 ++++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 175 insertions(+), 80 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx125.postini.com [74.125.245.125]) by kanga.kvack.org (Postfix) with SMTP id E376F6B006C for ; Thu, 3 Jan 2013 12:54:43 -0500 (EST)
From: Michal Hocko
Subject: [PATCH v3 1/7] memcg: synchronize per-zone iterator access by a spinlock
Date: Thu, 3 Jan 2013 18:54:15 +0100
Message-Id: <1357235661-29564-2-git-send-email-mhocko@suse.cz>
In-Reply-To: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han, Tejun Heo, Glauber Costa, Li Zefan

The per-zone per-priority iterator is aimed at coordinating concurrent reclaimers on the same hierarchy (or the global reclaim when all groups are reclaimed) so that all groups get reclaimed as evenly as possible. iter->position holds the last css->id visited and iter->generation signals a completed tree walk (it is incremented when the walk finishes). Concurrent reclaimers are supposed to provide a reclaim cookie which holds the reclaim priority and the last generation they saw. If the cookie's generation doesn't match the iterator's view then another concurrent reclaimer has already done the job and the tree walk is done for that priority.

This scheme works nicely in most cases but it is not race free. Two racing reclaimers can see the same iter->position and so bang on the same group. The iter->generation increment is not serialized either, so a reclaimer can see an updated iter->position with an old generation and the iteration might be restarted from the root of the hierarchy.

The simplest way to fix this issue is to synchronise access to the iterator by a lock.
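As a rough userspace model of the cookie scheme described above (this is not the kernel code: struct shared_iter, struct cookie and iter_next() are made-up names, and the groups are reduced to the ids 1..ngroups), the protocol could be sketched like this. The comment marks the region that the patch puts under iter->iter_lock.

```c
#include <assert.h>

/* Shared per-zone per-priority state (hypothetical userspace model). */
struct shared_iter {
	int position;			/* id of the last group handed out */
	unsigned int generation;	/* bumped after every full walk */
};

/* Per-reclaimer cookie remembering the generation its walk started in. */
struct cookie {
	unsigned int generation;
};

/*
 * Hand out the next group id in 1..ngroups, or 0 when the walk is over
 * for this reclaimer. have_prev is 0 on a reclaimer's first call.
 * In the patch the whole body runs under iter->iter_lock.
 */
static int iter_next(struct shared_iter *it, struct cookie *ck,
		     int have_prev, int ngroups)
{
	assert(ngroups > 0);

	for (;;) {
		int id;

		/* Somebody else finished the round we were part of. */
		if (have_prev && ck->generation != it->generation)
			return 0;

		id = it->position + 1;
		if (id > ngroups) {
			/* End of the walk: reset and start a new generation. */
			it->position = 0;
			it->generation++;
			if (have_prev)
				return 0;
			continue;	/* first call: restart from the beginning */
		}

		it->position = id;
		if (!have_prev)
			ck->generation = it->generation;
		return id;
	}
}
```

With three groups and two reclaimers taking turns, the first gets 1, the second gets 2, the first gets 3, and both then see the walk end. The read of position and the later update are only safe as a unit, which is what the lock provides.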
This implementation uses a per-zone per-priority spinlock which linearizes only directly racing reclaimers which use reclaim cookies, so the effect of the new locking should be really minimal.

I have to note that I haven't seen this as a real issue so far. The primary motivation for the change is different. A later patch in the series will change the way the iterator is implemented: css->id iteration will be replaced by cgroup generic iteration, which requires storing a mem_cgroup pointer in the iterator. That in turn requires reference counting, and so concurrent access will be a problem.

Signed-off-by: Michal Hocko
Acked-by: KAMEZAWA Hiroyuki
---
 mm/memcontrol.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1ea8951..e71cfde 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -148,6 +148,8 @@ struct mem_cgroup_reclaim_iter { int position; /* scan generation, increased every round-trip */ unsigned int generation; + /* lock to protect the position and generation */ + spinlock_t iter_lock; }; /* @@ -1161,8 +1163,11 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, mz = mem_cgroup_zoneinfo(root, nid, zid); iter = &mz->reclaim_iter[reclaim->priority]; - if (prev && reclaim->generation != iter->generation) + spin_lock(&iter->iter_lock); + if (prev && reclaim->generation != iter->generation) { + spin_unlock(&iter->iter_lock); return NULL; + } id = iter->position; } @@ -1181,6 +1186,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, iter->generation++; else if (!prev && memcg) reclaim->generation = iter->generation; + spin_unlock(&iter->iter_lock); } if (prev && !css) @@ -6051,8 +6057,12 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node) return 1; for (zone = 0; zone < MAX_NR_ZONES; zone++) { + int prio; + mz = &pn->zoneinfo[zone]; lruvec_init(&mz->lruvec); + for (prio = 0; prio < DEF_PRIORITY + 1; prio++) + spin_lock_init(&mz->reclaim_iter[prio].iter_lock);
mz->usage_in_excess = 0; mz->on_tree = false; mz->memcg = memcg; -- 1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx158.postini.com [74.125.245.158]) by kanga.kvack.org (Postfix) with SMTP id 510066B0071 for ; Thu, 3 Jan 2013 12:54:45 -0500 (EST)
From: Michal Hocko
Subject: [PATCH v3 2/7] memcg: keep prev's css alive for the whole mem_cgroup_iter
Date: Thu, 3 Jan 2013 18:54:16 +0100
Message-Id: <1357235661-29564-3-git-send-email-mhocko@suse.cz>
In-Reply-To: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han, Tejun Heo, Glauber Costa, Li Zefan

css reference counting keeps the cgroup alive even after it has been removed. mem_cgroup_iter relies on this fact and takes a reference to the returned group. The reference is then released on the next iteration or in mem_cgroup_iter_break.

mem_cgroup_iter currently releases the reference right after it gets the last css_id. This is correct because neither prev's memcg nor its cgroup is accessed after that point. This will change in the next patch, so we need to keep the group alive a bit longer. Let's move the css_put to the end of the function.
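The shape of the change can be sketched in plain userspace C. This is only an illustration under simplifying assumptions: struct obj, obj_put() and next_id() are hypothetical names, and the "reference" is a bare counter. The point is that every exit path funnels through one label, so the reference on prev is dropped exactly once and only after prev was last used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for prev's css. */
struct obj {
	int refs;
	int id;
};

static void obj_put(struct obj *o)
{
	assert(o->refs > 0);
	o->refs--;
}

/*
 * Return the id following prev (or the first id, 1, when prev is NULL),
 * or -1 when the walk is over. The caller's reference on prev is
 * consumed on every path.
 */
static int next_id(struct obj *prev, int max_id)
{
	int ret = -1;
	int id = 0;

	if (prev)
		id = prev->id;

	if (id >= max_id)
		goto out_put;	/* early exit still drops the reference */

	/* prev may still be dereferenced here: the put has not happened. */
	ret = id + 1;

out_put:
	if (prev)
		obj_put(prev);
	return ret;
}
```

A plain return in the early-exit branch would leak the reference, which is why the patch turns the early returns into jumps to a common label.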
Signed-off-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e71cfde..90a3b1d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1143,12 +1143,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, if (prev && !reclaim) id = css_id(&prev->css); - if (prev && prev != root) - css_put(&prev->css); - if (!root->use_hierarchy && root != root_mem_cgroup) { if (prev) - return NULL; + goto out_css_put; return root; } @@ -1166,7 +1163,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, spin_lock(&iter->iter_lock); if (prev && reclaim->generation != iter->generation) { spin_unlock(&iter->iter_lock); - return NULL; + goto out_css_put; } id = iter->position; } @@ -1190,8 +1187,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, } if (prev && !css) - return NULL; + goto out_css_put; } +out_css_put: + if (prev && prev != root) + css_put(&prev->css); + return memcg; } -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx162.postini.com [74.125.245.162]) by kanga.kvack.org (Postfix) with SMTP id 9C75B6B0070 for ; Thu, 3 Jan 2013 12:54:46 -0500 (EST)
From: Michal Hocko
Subject: [PATCH v3 3/7] memcg: rework mem_cgroup_iter to use cgroup iterators
Date: Thu, 3 Jan 2013 18:54:17 +0100
Message-Id: <1357235661-29564-4-git-send-email-mhocko@suse.cz>
In-Reply-To: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han, Tejun Heo, Glauber Costa, Li Zefan

mem_cgroup_iter currently relies on css->id when walking down a group hierarchy tree. This is really awkward because the tree walk depends on the order in which the groups were created. The only guarantee is that a parent node is visited before its children.

Example
1) mkdir -p a a/d a/b/c
2) mkdir -p a/b/c a/d

Both create the same tree but the tree walks will be different:
1) a, d, b, c
2) a, b, c, d

574bd9f7 (cgroup: implement generic child / descendant walk macros) has introduced generic cgroup tree walkers which provide either a pre-order or a post-order tree walk. This patch converts the css->id based iteration to a pre-order tree walk to keep the semantics of the original iterator, where a parent is always visited before its subtree.

cgroup_for_each_descendant_pre suggests using post_create and pre_destroy for proper synchronization with group addition and removal, respectively. This implementation doesn't use those because a new memory cgroup is fully initialized in mem_cgroup_create, and css reference counting either guarantees that the group is alive (for both the last seen cgroup and the found one) or signals that the group is dead and should be skipped.
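To make the pre-order walk concrete, here is a small userspace sketch. It is not the kernel's cgroup_next_descendant_pre: struct node, node_add_child(), next_descendant_pre() and walk_pre() are hypothetical names, and RCU is left out entirely. It only assumes what the text relies on, namely that a NULL position starts the walk and that the root itself is not returned by the iterator and so needs an explicit visit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup; not the kernel's struct cgroup. */
struct node {
	const char *name;
	struct node *parent;
	struct node *first_child;
	struct node *next_sibling;
};

/* Append child to the tail of parent's child list (creation order). */
static void node_add_child(struct node *parent, struct node *child)
{
	struct node **link;

	assert(parent && child);
	link = &parent->first_child;
	while (*link)
		link = &(*link)->next_sibling;
	child->parent = parent;
	child->next_sibling = NULL;
	*link = child;
}

/*
 * Return the descendant of root that follows pos in a pre-order walk,
 * or NULL when the subtree is exhausted. pos == NULL starts the walk.
 * root itself is never returned.
 */
static struct node *next_descendant_pre(struct node *pos, struct node *root)
{
	if (!pos)
		return root->first_child;

	/* Descend first ... */
	if (pos->first_child)
		return pos->first_child;

	/* ... otherwise take the next sibling, climbing up as needed. */
	while (pos != root) {
		if (pos->next_sibling)
			return pos->next_sibling;
		pos = pos->parent;
	}
	return NULL;
}

/* Visit root explicitly, then its descendants; returns the visit count. */
static size_t walk_pre(struct node *root, const char **out, size_t max)
{
	struct node *pos;
	size_t n = 0;

	if (n < max)
		out[n++] = root->name;
	for (pos = next_descendant_pre(NULL, root); pos && n < max;
	     pos = next_descendant_pre(pos, root))
		out[n++] = pos->name;
	return n;
}
```

For example, creating a/b, then a/d, then a/b/c and walking from a yields a, b, c, d in this model: c follows its parent b directly even though d was created before it, whereas a walk in plain creation order would give a, b, d, c.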
If the reclaim cookie is used we need to store the last visited group in the iterator, so we have to be careful that it doesn't disappear in the meantime. An elevated reference count on the css keeps it alive even though the group has been removed (parked waiting for the last dput so that it can be freed).

V2
- use css_{get,put} for iter->last_visited rather than mem_cgroup_{get,put} because it is stronger wrt. cgroup life cycle
- cgroup_next_descendant_pre expects NULL pos for the first iteration, otherwise it might loop endlessly for an intermediate node without any children.

Signed-off-by: Michal Hocko
Acked-by: KAMEZAWA Hiroyuki
---
 mm/memcontrol.c | 74 ++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 57 insertions(+), 17 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 90a3b1d..e9f5c47 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -144,8 +144,8 @@ struct mem_cgroup_stat_cpu { }; struct mem_cgroup_reclaim_iter { - /* css_id of the last scanned hierarchy member */ - int position; + /* last scanned hierarchy member with elevated css ref count */ + struct mem_cgroup *last_visited; /* scan generation, increased every round-trip */ unsigned int generation; /* lock to protect the position and generation */ @@ -1132,7 +1132,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup_reclaim_cookie *reclaim) { struct mem_cgroup *memcg = NULL; - int id = 0; + struct mem_cgroup *last_visited = NULL; if (mem_cgroup_disabled()) return NULL; @@ -1141,7 +1141,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, root = root_mem_cgroup; if (prev && !reclaim) - id = css_id(&prev->css); + last_visited = prev; if (!root->use_hierarchy && root != root_mem_cgroup) { if (prev) @@ -1149,9 +1149,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, return root; } + rcu_read_lock(); while (!memcg) { struct mem_cgroup_reclaim_iter *uninitialized_var(iter); - struct cgroup_subsys_state *css; +
struct cgroup_subsys_state *css = NULL; if (reclaim) { int nid = zone_to_nid(reclaim->zone); @@ -1161,34 +1162,73 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, mz = mem_cgroup_zoneinfo(root, nid, zid); iter = &mz->reclaim_iter[reclaim->priority]; spin_lock(&iter->iter_lock); + last_visited = iter->last_visited; if (prev && reclaim->generation != iter->generation) { + if (last_visited) { + css_put(&last_visited->css); + iter->last_visited = NULL; + } spin_unlock(&iter->iter_lock); - goto out_css_put; + goto out_unlock; } - id = iter->position; } - rcu_read_lock(); - css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id); - if (css) { - if (css == &root->css || css_tryget(css)) - memcg = mem_cgroup_from_css(css); - } else - id = 0; - rcu_read_unlock(); + /* + * Root is not visited by cgroup iterators so it needs an + * explicit visit. + */ + if (!last_visited) { + css = &root->css; + } else { + struct cgroup *prev_cgroup, *next_cgroup; + + prev_cgroup = (last_visited == root) ? NULL + : last_visited->css.cgroup; + next_cgroup = cgroup_next_descendant_pre(prev_cgroup, + root->css.cgroup); + if (next_cgroup) + css = cgroup_subsys_state(next_cgroup, + mem_cgroup_subsys_id); + } + + /* + * Even if we found a group we have to make sure it is alive. + * css && !memcg means that the groups should be skipped and + * we should continue the tree walk. + * last_visited css is safe to use because it is protected by + * css_get and the tree walk is rcu safe. 
+ */ + if (css == &root->css || (css && css_tryget(css))) + memcg = mem_cgroup_from_css(css); if (reclaim) { - iter->position = id; + struct mem_cgroup *curr = memcg; + + if (last_visited) + css_put(&last_visited->css); + + if (css && !memcg) + curr = mem_cgroup_from_css(css); + + /* make sure that the cached memcg is not removed */ + if (curr) + css_get(&curr->css); + iter->last_visited = curr; + if (!css) iter->generation++; else if (!prev && memcg) reclaim->generation = iter->generation; spin_unlock(&iter->iter_lock); + } else if (css && !memcg) { + last_visited = mem_cgroup_from_css(css); } if (prev && !css) - goto out_css_put; + goto out_unlock; } +out_unlock: + rcu_read_unlock(); out_css_put: if (prev && prev != root) css_put(&prev->css); -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx188.postini.com [74.125.245.188]) by kanga.kvack.org (Postfix) with SMTP id D1C9E6B0075 for ; Thu, 3 Jan 2013 12:54:48 -0500 (EST) From: Michal Hocko Subject: [PATCH v3 5/7] memcg: simplify mem_cgroup_iter Date: Thu, 3 Jan 2013 18:54:19 +0100 Message-Id: <1357235661-29564-6-git-send-email-mhocko@suse.cz> In-Reply-To: <1357235661-29564-1-git-send-email-mhocko@suse.cz> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Current implementation of mem_cgroup_iter has to consider both css and memcg to find out whether no group has been found (css==NULL - aka the loop is completed) and that no memcg is associated with the found node (!memcg - aka css_tryget failed because the group is no longer alive). 
This leads to awkward tweaks like tests for css && !memcg to skip the current node.

It will be much easier if we get rid of the css variable altogether and rely only on memcg. In order to do that the iteration part has to skip dead nodes. This sounds natural to me, and as a nice side effect we get a simple invariant: memcg is always alive when non-NULL, and all nodes have been visited otherwise.

We could get rid of the surrounding while loop, but keep it in for now to make review easier. It will go away in the following patch.

Signed-off-by: Michal Hocko
Acked-by: KAMEZAWA Hiroyuki
---
 mm/memcontrol.c | 56 +++++++++++++++++++++++++++----------------------------
 1 file changed, 27 insertions(+), 29 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4f81abd..d8c6e5e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1152,7 +1152,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, rcu_read_lock(); while (!memcg) { struct mem_cgroup_reclaim_iter *uninitialized_var(iter); - struct cgroup_subsys_state *css = NULL; if (reclaim) { int nid = zone_to_nid(reclaim->zone); @@ -1178,53 +1177,52 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, * explicit visit. */ if (!last_visited) { - css = &root->css; + memcg = root; } else { struct cgroup *prev_cgroup, *next_cgroup; prev_cgroup = (last_visited == root) ? NULL : last_visited->css.cgroup; - next_cgroup = cgroup_next_descendant_pre(prev_cgroup, - root->css.cgroup); - if (next_cgroup) - css = cgroup_subsys_state(next_cgroup, - mem_cgroup_subsys_id); - } +skip_node: + next_cgroup = cgroup_next_descendant_pre( + prev_cgroup, root->css.cgroup); - /* - * Even if we found a group we have to make sure it is alive. - * css && !memcg means that the groups should be skipped and - * we should continue the tree walk. - * last_visited css is safe to use because it is protected by - * css_get and the tree walk is rcu safe.
- */ - if (css == &root->css || (css && css_tryget(css))) - memcg = mem_cgroup_from_css(css); + /* + * Even if we found a group we have to make sure it is + * alive. css && !memcg means that the groups should be + * skipped and we should continue the tree walk. + * last_visited css is safe to use because it is + * protected by css_get and the tree walk is rcu safe. + */ + if (next_cgroup) { + struct mem_cgroup *mem = mem_cgroup_from_cont( + next_cgroup); + if (css_tryget(&mem->css)) + memcg = mem; + else { + prev_cgroup = next_cgroup; + goto skip_node; + } + } + } if (reclaim) { - struct mem_cgroup *curr = memcg; - if (last_visited) css_put(&last_visited->css); - if (css && !memcg) - curr = mem_cgroup_from_css(css); - /* make sure that the cached memcg is not removed */ - if (curr) - css_get(&curr->css); - iter->last_visited = curr; + if (memcg) + css_get(&memcg->css); + iter->last_visited = memcg; - if (!css) + if (!memcg) iter->generation++; else if (!prev && memcg) reclaim->generation = iter->generation; spin_unlock(&iter->iter_lock); - } else if (css && !memcg) { - last_visited = mem_cgroup_from_css(css); } - if (prev && !css) + if (prev && !memcg) goto out_unlock; } out_unlock: -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx141.postini.com [74.125.245.141]) by kanga.kvack.org (Postfix) with SMTP id B88C86B0071 for ; Thu, 3 Jan 2013 12:54:47 -0500 (EST)
From: Michal Hocko
Subject: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
Date: Thu, 3 Jan 2013 18:54:18 +0100
Message-Id: <1357235661-29564-5-git-send-email-mhocko@suse.cz>
In-Reply-To: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han, Tejun Heo, Glauber Costa, Li Zefan

Now that the per-node-zone-priority iterator caches memory cgroups rather than their css ids, we have to be careful and remove them from the iterator when they are on the way out. Otherwise they might hang around for an unbounded amount of time (until the global/targeted reclaim hits the zone at that priority, finds out the group is dead and lets it find its final rest).

This is solved by hooking into mem_cgroup_css_offline and checking all per-node-zone-priority iterators on the way up to the root cgroup. If the current memcg is found in the respective iter->last_visited then it is replaced by the previous one in the same sub-hierarchy.

This guarantees that no group gets more reclaiming than necessary and that the next iteration will continue without noticing that the removed group has disappeared.

Spotted-by: Ying Han
Signed-off-by: Michal Hocko
---
 mm/memcontrol.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e9f5c47..4f81abd 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6375,10 +6375,99 @@ free_out: return ERR_PTR(error); } +/* + * Helper to find memcg's previous group under the given root + * hierarchy.
+ */ +struct mem_cgroup *__find_prev_memcg(struct mem_cgroup *root, + struct mem_cgroup *memcg) +{ + struct cgroup *memcg_cgroup = memcg->css.cgroup; + struct cgroup *root_cgroup = root->css.cgroup; + struct cgroup *prev_cgroup = NULL; + struct cgroup *iter; + + cgroup_for_each_descendant_pre(iter, root_cgroup) { + if (iter == memcg_cgroup) + break; + prev_cgroup = iter; + } + + return (prev_cgroup) ? mem_cgroup_from_cont(prev_cgroup) : NULL; +} + +/* + * Remove the given memcg under given root from all per-node per-zone + * per-priority chached iterators. + */ +static void mem_cgroup_uncache_reclaim_iters(struct mem_cgroup *root, + struct mem_cgroup *memcg) +{ + int node; + + for_each_node(node) { + struct mem_cgroup_per_node *pn = root->info.nodeinfo[node]; + int zone; + + for (zone = 0; zone < MAX_NR_ZONES; zone++) { + struct mem_cgroup_per_zone *mz; + int prio; + + mz = &pn->zoneinfo[zone]; + for (prio = 0; prio < DEF_PRIORITY + 1; prio++) { + struct mem_cgroup_reclaim_iter *iter; + + /* + * Just drop the reference on the removed memcg + * cached last_visited. No need to lock iter as + * the memcg is on the way out and cannot be + * reclaimed. + */ + iter = &mz->reclaim_iter[prio]; + if (root == memcg) { + if (iter->last_visited) + css_put(&iter->last_visited->css); + continue; + } + + rcu_read_lock(); + spin_lock(&iter->iter_lock); + if (iter->last_visited == memcg) { + iter->last_visited = __find_prev_memcg( + root, memcg); + css_put(&memcg->css); + } + spin_unlock(&iter->iter_lock); + rcu_read_unlock(); + } + } + } +} + +/* + * Remove the given memcg from all cached reclaim iterators. + */ +static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *memcg) +{ + struct mem_cgroup *parent = memcg; + + do { + mem_cgroup_uncache_reclaim_iters(parent, memcg); + } while ((parent = parent_mem_cgroup(parent))); + + /* + * if the root memcg is not hierarchical we have to check it + * explicitely. 
+ */ + if (!root_mem_cgroup->use_hierarchy) + mem_cgroup_uncache_reclaim_iters(root_mem_cgroup, memcg); +} + static void mem_cgroup_css_offline(struct cgroup *cont) { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); + mem_cgroup_uncache_from_reclaim(memcg); mem_cgroup_reparent_charges(memcg); mem_cgroup_destroy_all_caches(memcg); } -- 1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx148.postini.com [74.125.245.148]) by kanga.kvack.org (Postfix) with SMTP id 128666B0070 for ; Thu, 3 Jan 2013 12:54:50 -0500 (EST)
From: Michal Hocko
Subject: [PATCH v3 6/7] memcg: further simplify mem_cgroup_iter
Date: Thu, 3 Jan 2013 18:54:20 +0100
Message-Id: <1357235661-29564-7-git-send-email-mhocko@suse.cz>
In-Reply-To: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han, Tejun Heo, Glauber Costa, Li Zefan

mem_cgroup_iter basically does two things currently. It takes care of the housekeeping (reference counting, reclaim cookie) and it iterates through a hierarchy tree (by using the cgroup generic tree walk). The code would be much easier to follow if we moved the iteration out of the function (to __mem_cgroup_iter_next) so that the distinction is clearer. This patch doesn't introduce any functional changes.
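The split between iteration and housekeeping, together with the dead-node skipping from the previous patch, can be modelled in a few lines of userspace C. This is a sketch under loud assumptions: the hierarchy is flattened into an array that is already in pre-order, refs == 0 stands for a group whose css_tryget would fail, and struct group, group_tryget(), group_put(), iter_next() and iter() are all hypothetical names. Unlike the kernel code, the model also takes a reference on the root, purely to keep it short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg; refs == 0 models a dead group. */
struct group {
	const char *name;
	int refs;
};

/* Model of css_tryget(): take a reference only if the group is alive. */
static bool group_tryget(struct group *g)
{
	if (g->refs == 0)
		return false;
	g->refs++;
	return true;
}

static void group_put(struct group *g)
{
	assert(g->refs > 0);
	g->refs--;
}

/*
 * Iteration only. groups[] is in pre-order with the root at index 0.
 * Returns the index of the next alive group after last (-1 starts the
 * walk) with a reference taken, or -1 when everything was visited.
 */
static int iter_next(struct group *groups, int n, int last)
{
	int i;

	for (i = last + 1; i < n; i++)
		if (group_tryget(&groups[i]))
			return i;	/* dead groups are silently skipped */
	return -1;
}

/* Housekeeping only: delegate the walk, then drop prev's reference. */
static int iter(struct group *groups, int n, int prev)
{
	int next = iter_next(groups, n, prev);

	if (prev >= 0)
		group_put(&groups[prev]);
	return next;
}
```

The invariant from the previous patch shows up directly: a non-negative return value always refers to a live group that the caller holds a reference on, and -1 means the whole subtree has been visited.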
Signed-off-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 79 ++++++++++++++++++++++++++++++++----------------------- 1 file changed, 46 insertions(+), 33 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d8c6e5e..d80fcff 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1110,6 +1110,51 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm) return memcg; } +/* + * Returns a next (in a pre-order walk) alive memcg (with elevated css + * ref. count) or NULL if the whole root's subtree has been visited. + * + * helper function to be used by mem_cgroup_iter + */ +static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root, + struct mem_cgroup *last_visited) +{ + struct cgroup *prev_cgroup, *next_cgroup; + + /* + * Root is not visited by cgroup iterators so it needs an + * explicit visit. + */ + if (!last_visited) + return root; + + prev_cgroup = (last_visited == root) ? NULL + : last_visited->css.cgroup; +skip_node: + next_cgroup = cgroup_next_descendant_pre( + prev_cgroup, root->css.cgroup); + + /* + * Even if we found a group we have to make sure it is + * alive. css && !memcg means that the groups should be + * skipped and we should continue the tree walk. + * last_visited css is safe to use because it is + * protected by css_get and the tree walk is rcu safe. + */ + if (next_cgroup) { + struct mem_cgroup *mem = mem_cgroup_from_cont( + next_cgroup); + if (css_tryget(&mem->css)) + return mem; + else { + prev_cgroup = next_cgroup; + goto skip_node; + } + } + + return NULL; +} + /** * mem_cgroup_iter - iterate over memory cgroup hierarchy * @root: hierarchy root @@ -1172,39 +1217,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, } } - /* - * Root is not visited by cgroup iterators so it needs an - * explicit visit. - */ - if (!last_visited) { - memcg = root; - } else { - struct cgroup *prev_cgroup, *next_cgroup; - - prev_cgroup = (last_visited == root) ? 
NULL - : last_visited->css.cgroup; -skip_node: - next_cgroup = cgroup_next_descendant_pre( - prev_cgroup, root->css.cgroup); - - /* - * Even if we found a group we have to make sure it is - * alive. css && !memcg means that the groups should be - * skipped and we should continue the tree walk. - * last_visited css is safe to use because it is - * protected by css_get and the tree walk is rcu safe. - */ - if (next_cgroup) { - struct mem_cgroup *mem = mem_cgroup_from_cont( - next_cgroup); - if (css_tryget(&mem->css)) - memcg = mem; - else { - prev_cgroup = next_cgroup; - goto skip_node; - } - } - } + memcg = __mem_cgroup_iter_next(root, last_visited); if (reclaim) { if (last_visited) -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx156.postini.com [74.125.245.156]) by kanga.kvack.org (Postfix) with SMTP id 524086B0078 for ; Thu, 3 Jan 2013 12:54:51 -0500 (EST) From: Michal Hocko Subject: [PATCH v3 7/7] cgroup: remove css_get_next Date: Thu, 3 Jan 2013 18:54:21 +0100 Message-Id: <1357235661-29564-8-git-send-email-mhocko@suse.cz> In-Reply-To: <1357235661-29564-1-git-send-email-mhocko@suse.cz> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Now that we have generic and well ordered cgroup tree walkers there is no need to keep css_get_next in the place. 
Signed-off-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki --- include/linux/cgroup.h | 7 ------- kernel/cgroup.c | 49 ------------------------------------------------ 2 files changed, 56 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 7d73905..a4d86b0 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -685,13 +685,6 @@ void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css); struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id); -/* - * Get a cgroup whose id is greater than or equal to id under tree of root. - * Returning a cgroup_subsys_state or NULL. - */ -struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id, - struct cgroup_subsys_state *root, int *foundid); - /* Returns true if root is ancestor of cg */ bool css_is_ancestor(struct cgroup_subsys_state *cg, const struct cgroup_subsys_state *root); diff --git a/kernel/cgroup.c b/kernel/cgroup.c index f34c41b..3013ec4 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -5384,55 +5384,6 @@ struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id) } EXPORT_SYMBOL_GPL(css_lookup); -/** - * css_get_next - lookup next cgroup under specified hierarchy. - * @ss: pointer to subsystem - * @id: current position of iteration. - * @root: pointer to css. search tree under this. - * @foundid: position of found object. - * - * Search next css under the specified hierarchy of rootid. Calling under - * rcu_read_lock() is necessary. Returns NULL if it reaches the end. 
- */ -struct cgroup_subsys_state * -css_get_next(struct cgroup_subsys *ss, int id, - struct cgroup_subsys_state *root, int *foundid) -{ - struct cgroup_subsys_state *ret = NULL; - struct css_id *tmp; - int tmpid; - int rootid = css_id(root); - int depth = css_depth(root); - - if (!rootid) - return NULL; - - BUG_ON(!ss->use_id); - WARN_ON_ONCE(!rcu_read_lock_held()); - - /* fill start point for scan */ - tmpid = id; - while (1) { - /* - * scan next entry from bitmap(tree), tmpid is updated after - * idr_get_next(). - */ - tmp = idr_get_next(&ss->idr, &tmpid); - if (!tmp) - break; - if (tmp->depth >= depth && tmp->stack[depth] == rootid) { - ret = rcu_dereference(tmp->css); - if (ret) { - *foundid = tmpid; - break; - } - } - /* continue to scan from next id */ - tmpid = tmpid + 1; - } - return ret; -} - /* * get corresponding css from file open on cgroupfs directory */ -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx106.postini.com [74.125.245.106]) by kanga.kvack.org (Postfix) with SMTP id 0A3CF6B005D for ; Thu, 3 Jan 2013 22:43:26 -0500 (EST)
Message-ID: <50E64FB0.9050803@huawei.com>
Date: Fri, 4 Jan 2013 11:42:40 +0800
From: Li Zefan
MIME-Version: 1.0
Subject: Re: [PATCH v3 7/7] cgroup: remove css_get_next
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-8-git-send-email-mhocko@suse.cz>
In-Reply-To: <1357235661-29564-8-git-send-email-mhocko@suse.cz>
Content-Type: text/plain; charset="GB2312"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Michal Hocko
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han, Tejun Heo, Glauber Costa

On 2013/1/4 1:54, Michal Hocko wrote:
> Now that we have generic and well ordered cgroup tree walkers there is
> no need to keep css_get_next in the place.
>
> Signed-off-by: Michal Hocko
> Acked-by: KAMEZAWA Hiroyuki

Acked-by: Li Zefan

> ---
>  include/linux/cgroup.h |    7 -------
>  kernel/cgroup.c        |   49 ------------------------------------------------
>  2 files changed, 56 deletions(-)
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx198.postini.com [74.125.245.198]) by kanga.kvack.org (Postfix) with SMTP id 45D4C6B005D for ; Mon, 7 Jan 2013 01:18:38 -0500 (EST) Received: from m1.gw.fujitsu.co.jp (unknown [10.0.50.71]) by fgwmail6.fujitsu.co.jp (Postfix) with ESMTP id 31FB13EE0B6 for ; Mon, 7 Jan 2013 15:18:36 +0900 (JST) Received: from smail (m1 [127.0.0.1]) by outgoing.m1.gw.fujitsu.co.jp (Postfix) with ESMTP id 13FDA45DE63 for ; Mon, 7 Jan 2013 15:18:36 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (s1.gw.fujitsu.co.jp [10.0.50.91]) by m1.gw.fujitsu.co.jp (Postfix) with ESMTP id EE74545DE5D for ; Mon, 7 Jan 2013 15:18:35 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id DE5C81DB8052 for ; Mon, 7 Jan 2013 15:18:35 +0900 (JST) Received: from m1001.s.css.fujitsu.com (m1001.s.css.fujitsu.com [10.240.81.139]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id 9486C1DB804E for ; Mon, 7 Jan 2013 15:18:35 +0900 (JST) Message-ID: <50EA689B.7060308@jp.fujitsu.com> Date: Mon, 07 Jan 2013 15:18:03 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> In-Reply-To: <1357235661-29564-5-git-send-email-mhocko@suse.cz> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa , Li Zefan (2013/01/04 2:54), Michal Hocko wrote: > Now that per-node-zone-priority iterator caches memory cgroups rather > than their css ids we have to be careful and remove them from the > iterator when they are on the way out otherwise they might hang for > unbounded amount of 
time (until the global/targeted reclaim triggers the > zone under priority to find out the group is dead and let it to find the > final rest). > > This is solved by hooking into mem_cgroup_css_offline and checking all > per-node-zone-priority iterators up the way to the root cgroup. If the > current memcg is found in the respective iter->last_visited then it is > replaced by the previous one in the same sub-hierarchy. > > This guarantees that no group gets more reclaiming than necessary and > the next iteration will continue without noticing that the removed group > has disappeared. > > Spotted-by: Ying Han > Signed-off-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx197.postini.com [74.125.245.197]) by kanga.kvack.org (Postfix) with SMTP id CFEF86B0008 for ; Wed, 23 Jan 2013 07:52:07 -0500 (EST) Date: Wed, 23 Jan 2013 13:52:02 +0100 From: Michal Hocko Subject: Re: [PATCH v3 0/7] rework mem_cgroup iterator Message-ID: <20130123125202.GA13319@dhcp22.suse.cz> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1357235661-29564-1-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Are there any comments? Ying, Johannes? I would be happy if this could go into 3.9. On Thu 03-01-13 18:54:14, Michal Hocko wrote: > Hi all, > this is a third version of the patchset previously posted here: > https://lkml.org/lkml/2012/11/26/616 > > The patch set tries to make mem_cgroup_iter saner in the way how it > walks hierarchies. 
css->id based traversal is far from being ideal as it > is not deterministic because it depends on the creation ordering. > > Diffstat doesn't look that promising as in previous versions anymore but > I think it is worth the resulting outcome (and the sanity ;)). > > The first patch fixes a potential misbehaving which I haven't seen but > the fix is needed for the later patches anyway. We could take it alone > as well but I do not have any bug report to base the fix on. The second > one is also preparatory and it is new to the series. > > The third patch is the core of the patchset and it replaces css_get_next > based on css_id by the generic cgroup pre-order iterator which > means that css_id is no longer used by memcg. This brings some > chalanges for the last visited group caching during the reclaim > (mem_cgroup_per_zone::reclaim_iter). We have to use memcg pointers > directly now which means that we have to keep a reference to those > groups' css to keep them alive. > > The next patch fixups an unbounded cgroup removal holdoff caused by > the elevated css refcount and does the clean up on the group removal. > Thanks to Ying who spotted this during testing of the previous version > of the patchset. > I could have folded it into the previous patch but I felt it would be > too big to review but if people feel it would be better that way, I have > no problems to squash them together. > > The fourth and fifth patches are an attempt for simplification of the > mem_cgroup_iter. css juggling is removed and the iteration logic is > moved to a helper so that the reference counting and iteration are > separated. > > The last patch just removes css_get_next as there is no user for it any > longer. > > I am also thinking that leaf-to-root iteration makes more sense but this > patch is not included in the series yet because I have to think some > more about the justification. 
> > Same as with the previous version I have tested with a quite simple > hierarchy: > A (limit = 280M, use_hierarchy=true) > / | \ > B C D (all have 100M limit) > > And a separate kernel build in the each leaf group. This triggers > both children only and hierarchical reclaim which is parallel so the > iter_reclaim caching is active a lot. I will hammer it some more but the > series should be in quite a good shape already. > > Michal Hocko (7): > memcg: synchronize per-zone iterator access by a spinlock > memcg: keep prev's css alive for the whole mem_cgroup_iter > memcg: rework mem_cgroup_iter to use cgroup iterators > memcg: remove memcg from the reclaim iterators > memcg: simplify mem_cgroup_iter > memcg: further simplify mem_cgroup_iter > cgroup: remove css_get_next > > And the diffstat says: > include/linux/cgroup.h | 7 -- > kernel/cgroup.c | 49 ------------ > mm/memcontrol.c | 199 ++++++++++++++++++++++++++++++++++++++++++------ > 3 files changed, 175 insertions(+), 80 deletions(-) > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx187.postini.com [74.125.245.187]) by kanga.kvack.org (Postfix) with SMTP id 7C9596B0005 for ; Fri, 8 Feb 2013 14:33:33 -0500 (EST) Date: Fri, 8 Feb 2013 14:33:18 -0500 From: Johannes Weiner Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130208193318.GA15951@cmpxchg.org> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1357235661-29564-5-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan On Thu, Jan 03, 2013 at 06:54:18PM +0100, Michal Hocko wrote: > Now that per-node-zone-priority iterator caches memory cgroups rather > than their css ids we have to be careful and remove them from the > iterator when they are on the way out otherwise they might hang for > unbounded amount of time (until the global/targeted reclaim triggers the > zone under priority to find out the group is dead and let it to find the > final rest). > > This is solved by hooking into mem_cgroup_css_offline and checking all > per-node-zone-priority iterators up the way to the root cgroup. If the > current memcg is found in the respective iter->last_visited then it is > replaced by the previous one in the same sub-hierarchy. > > This guarantees that no group gets more reclaiming than necessary and > the next iteration will continue without noticing that the removed group > has disappeared. 
> > Spotted-by: Ying Han > Signed-off-by: Michal Hocko > --- > mm/memcontrol.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 89 insertions(+) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index e9f5c47..4f81abd 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -6375,10 +6375,99 @@ free_out: > return ERR_PTR(error); > } > > +/* > + * Helper to find memcg's previous group under the given root > + * hierarchy. > + */ > +struct mem_cgroup *__find_prev_memcg(struct mem_cgroup *root, > + struct mem_cgroup *memcg) > +{ > + struct cgroup *memcg_cgroup = memcg->css.cgroup; > + struct cgroup *root_cgroup = root->css.cgroup; > + struct cgroup *prev_cgroup = NULL; > + struct cgroup *iter; > + > + cgroup_for_each_descendant_pre(iter, root_cgroup) { > + if (iter == memcg_cgroup) > + break; > + prev_cgroup = iter; > + } > + > + return (prev_cgroup) ? mem_cgroup_from_cont(prev_cgroup) : NULL; > +} > + > +/* > + * Remove the given memcg under given root from all per-node per-zone > + * per-priority chached iterators. > + */ > +static void mem_cgroup_uncache_reclaim_iters(struct mem_cgroup *root, > + struct mem_cgroup *memcg) > +{ > + int node; > + > + for_each_node(node) { > + struct mem_cgroup_per_node *pn = root->info.nodeinfo[node]; > + int zone; > + > + for (zone = 0; zone < MAX_NR_ZONES; zone++) { > + struct mem_cgroup_per_zone *mz; > + int prio; > + > + mz = &pn->zoneinfo[zone]; > + for (prio = 0; prio < DEF_PRIORITY + 1; prio++) { > + struct mem_cgroup_reclaim_iter *iter; > + > + /* > + * Just drop the reference on the removed memcg > + * cached last_visited. No need to lock iter as > + * the memcg is on the way out and cannot be > + * reclaimed. 
> +				 */
> +				iter = &mz->reclaim_iter[prio];
> +				if (root == memcg) {
> +					if (iter->last_visited)
> +						css_put(&iter->last_visited->css);
> +					continue;
> +				}
> +
> +				rcu_read_lock();
> +				spin_lock(&iter->iter_lock);
> +				if (iter->last_visited == memcg) {
> +					iter->last_visited = __find_prev_memcg(
> +							root, memcg);
> +					css_put(&memcg->css);
> +				}
> +				spin_unlock(&iter->iter_lock);
> +				rcu_read_unlock();
> +			}
> +		}
> +	}
> +}
> +
> +/*
> + * Remove the given memcg from all cached reclaim iterators.
> + */
> +static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *memcg)
> +{
> +	struct mem_cgroup *parent = memcg;
> +
> +	do {
> +		mem_cgroup_uncache_reclaim_iters(parent, memcg);
> +	} while ((parent = parent_mem_cgroup(parent)));
> +
> +	/*
> +	 * if the root memcg is not hierarchical we have to check it
> +	 * explicitely.
> +	 */
> +	if (!root_mem_cgroup->use_hierarchy)
> +		mem_cgroup_uncache_reclaim_iters(root_mem_cgroup, memcg);
> +}

for each in hierarchy:
  for each node:
    for each zone:
      for each reclaim priority:

every time a cgroup is destroyed.  I don't think such a hammer is
justified in general, let alone for consolidating code a little.

Can we invalidate the position cache lazily?  Have a global "cgroup
destruction" counter and store a snapshot of that counter whenever we
put a cgroup pointer in the position cache.  We only use the cached
pointer if that counter has not changed in the meantime, so we know
that the cgroup still exists.

It is pretty pretty imprecise and we invalidate the whole cache every
time a cgroup is destroyed, but I think that should be okay.  If not,
better ideas are welcome.
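
As an illustration of the lazy invalidation idea above (not code from this
thread), here is a minimal single-threaded user-space sketch in C. All
names (struct group, struct pos_cache, destroy_count, cache_store,
cache_load, group_destroy) are hypothetical, and the sketch deliberately
ignores RCU, css_tryget and concurrent reclaimers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup; not the kernel's structure. */
struct group {
	int id;
	int alive;
};

/* Global "cgroup destruction" counter, bumped on every removal. */
static unsigned long destroy_count;

/* Cached position plus the counter snapshot taken when it was stored. */
struct pos_cache {
	struct group *position;
	unsigned long snapshot;
};

/* Store a position and remember how many destructions happened so far. */
static void cache_store(struct pos_cache *c, struct group *g)
{
	c->position = g;
	c->snapshot = destroy_count;
}

/* Use the cached position only if nothing was destroyed since the store. */
static struct group *cache_load(const struct pos_cache *c)
{
	if (c->snapshot != destroy_count)
		return NULL;	/* stale: the caller restarts from the root */
	return c->position;
}

/* Destroying any group invalidates every cache at once (imprecise). */
static void group_destroy(struct group *g)
{
	g->alive = 0;
	destroy_count++;
}
```

Note that destroying an unrelated group also invalidates the cache here,
which is exactly the imprecision the proposal accepts.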
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx115.postini.com [74.125.245.115]) by kanga.kvack.org (Postfix) with SMTP id A68426B0002 for ; Mon, 11 Feb 2013 10:16:55 -0500 (EST)
Date: Mon, 11 Feb 2013 16:16:49 +0100
From: Michal Hocko
Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
Message-ID: <20130211151649.GD19922@dhcp22.suse.cz>
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130208193318.GA15951@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Johannes Weiner
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo, Glauber Costa, Li Zefan

On Fri 08-02-13 14:33:18, Johannes Weiner wrote:
[...]
> for each in hierarchy:
>   for each node:
>     for each zone:
>       for each reclaim priority:
>
> every time a cgroup is destroyed.  I don't think such a hammer is
> justified in general, let alone for consolidating code a little.
>
> Can we invalidate the position cache lazily?  Have a global "cgroup
> destruction" counter and store a snapshot of that counter whenever we
> put a cgroup pointer in the position cache.  We only use the cached
> pointer if that counter has not changed in the meantime, so we know
> that the cgroup still exists.

Currently we have:
rcu_read_lock()	// keeps cgroup links safe
	iter->iter_lock	// keeps selection exclusive for a specific iterator
	1) global_counter == iter_counter
	2) css_tryget(cached_memcg)	// check it is still alive
rcu_read_unlock()

What would protect us from races when css would disappear between 1 and
2?

css is invalidated from worker context scheduled from __css_put and it
is using dentry locking which we surely do not want to pull here.
We could hook into css_offline which is called with cgroup_mutex but we
cannot use this one here because it is no longer exported and Tejun
would kill us for that.
So we can add a new global memcg internal lock to do this atomically.
Ohh, this is getting uglier...

> It is pretty pretty imprecise and we invalidate the whole cache every
> time a cgroup is destroyed, but I think that should be okay.

I am not sure this is OK because this gives an indirect way of
influencing reclaim in one hierarchy by another one which opens a door
for regressions (or malicious over-reclaim in the extreme case).
So I do not like this very much.

> If not, better ideas are welcome.

Maybe we could keep the counter per memcg but that would mean that we
would need to go up the hierarchy as well. We wouldn't have to go over
node-zone-priority cleanup so it would be much more lightweight.

I am not sure this is necessarily better than explicit cleanup because
it brings yet another kind of generation number to the game but I guess
I can live with it if people really think the relaxed way is much
better.
What do you think about the patch below (untested yet)?
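
For illustration only (this is not the patch referred to above), the
per-memcg counter idea can be sketched in user-space C as follows. The
names struct group, struct iter_cache, group_removed, iter_store and
iter_load are hypothetical, and locking, RCU and the non-hierarchical
root special case are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg in a hierarchy. */
struct group {
	struct group *parent;
	unsigned long dead_count;	/* removals seen below this group */
};

/* Cached iterator position, valid only while the root's count matches. */
struct iter_cache {
	struct group *last_visited;
	unsigned long last_dead_count;
};

/* On removal, walk up and tell every ancestor that a descendant is gone. */
static void group_removed(struct group *g)
{
	struct group *p;

	for (p = g->parent; p; p = p->parent)
		p->dead_count++;
}

/* Cache a position together with the walker root's current dead_count. */
static void iter_store(struct iter_cache *it, const struct group *root,
		       struct group *pos)
{
	it->last_visited = pos;
	it->last_dead_count = root->dead_count;
}

/* Return the cached position, or NULL if a group under root was removed. */
static struct group *iter_load(const struct iter_cache *it,
			       const struct group *root)
{
	if (it->last_dead_count != root->dead_count)
		return NULL;
	return it->last_visited;
}
```

Unlike a single global counter, a removal in one hierarchy leaves the
cached positions of unrelated hierarchies untouched, at the cost of one
walk up the ancestors per removal.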
---
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx175.postini.com [74.125.245.175]) by kanga.kvack.org (Postfix) with SMTP id F14836B0008 for ; Mon, 11 Feb 2013 12:56:31 -0500 (EST)
Date: Mon, 11 Feb 2013 12:56:19 -0500
From: Johannes Weiner
Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
Message-ID: <20130211175619.GC13218@cmpxchg.org>
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130211151649.GD19922@dhcp22.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Michal Hocko
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo, Glauber Costa, Li Zefan

On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote:
> On Fri 08-02-13 14:33:18, Johannes Weiner wrote:
> [...]
> > for each in hierarchy:
> >   for each node:
> >     for each zone:
> >       for each reclaim priority:
> >
> > every time a cgroup is destroyed.  I don't think such a hammer is
> > justified in general, let alone for consolidating code a little.
> >
> > Can we invalidate the position cache lazily?  Have a global "cgroup
> > destruction" counter and store a snapshot of that counter whenever we
> > put a cgroup pointer in the position cache.  We only use the cached
> > pointer if that counter has not changed in the meantime, so we know
> > that the cgroup still exists.
>
> Currently we have:
> rcu_read_lock()	// keeps cgroup links safe
> 	iter->iter_lock	// keeps selection exclusive for a specific iterator
> 	1) global_counter == iter_counter
> 	2) css_tryget(cached_memcg)	// check it is still alive
> rcu_read_unlock()
>
> What would protect us from races when css would disappear between 1 and
> 2?
rcu

> css is invalidated from worker context scheduled from __css_put and it
> is using dentry locking which we surely do not want to pull here. We
> could hook into css_offline which is called with cgroup_mutex but we
> cannot use this one here because it is no longer exported and Tejun
> would kill us for that.
> So we can add a new global memcg internal lock to do this atomically.
> Ohh, this is getting uglier...

A racing final css_put() means that the tryget fails, but our RCU read
lock keeps the CSS allocated.  If the dead_count is uptodate, it means
that the rcu read lock was acquired before the synchronize_rcu()
before the css is freed.

> > It is pretty pretty imprecise and we invalidate the whole cache every
> > time a cgroup is destroyed, but I think that should be okay.
>
> I am not sure this is OK because this gives an indirect way of
> influencing reclaim in one hierarchy by another one which opens a door
> for regressions (or malicious over-reclaim in the extreme case).
> So I do not like this very much.
>
> > If not, better ideas are welcome.
>
> Maybe we could keep the counter per memcg but that would mean that we
> would need to go up the hierarchy as well. We wouldn't have to go over
> node-zone-priority cleanup so it would be much more lightweight.
>
> I am not sure this is necessarily better than explicit cleanup because
> it brings yet another kind of generation number to the game but I guess
> I can live with it if people really think the relaxed way is much
> better.
> What do you think about the patch below (untested yet)?
Better, but I think you can get rid of both locks:

mem_cgroup_iter:
rcu_read_lock()
if atomic_read(&root->dead_count) == iter->dead_count:
	smp_rmb()
	if tryget(iter->position):
		position = iter->position
memcg = find_next(position)
css_put(position)
iter->position = memcg
smp_wmb() /* Write position cache BEFORE marking it uptodate */
iter->dead_count = atomic_read(&root->dead_count)
rcu_read_unlock()

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx115.postini.com [74.125.245.115]) by kanga.kvack.org (Postfix) with SMTP id 4B5FA6B0005 for ; Mon, 11 Feb 2013 14:29:35 -0500 (EST)
Received: by mail-ee0-f49.google.com with SMTP id d4so3396715eek.8 for ; Mon, 11 Feb 2013 11:29:33 -0800 (PST)
Date: Mon, 11 Feb 2013 20:29:29 +0100
From: Michal Hocko
Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
Message-ID: <20130211192929.GB29000@dhcp22.suse.cz>
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130211175619.GC13218@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Johannes Weiner
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo, Glauber Costa, Li Zefan

On Mon 11-02-13 12:56:19, Johannes Weiner wrote:
> On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote:
> > On Fri 08-02-13 14:33:18, Johannes Weiner wrote:
> > [...]
> > > for each in hierarchy:
> > >   for each node:
> > >     for each zone:
> > >       for each reclaim priority:
> > >
> > > every time a cgroup is destroyed.
> > > I don't think such a hammer is
> > > justified in general, let alone for consolidating code a little.
> > >
> > > Can we invalidate the position cache lazily?  Have a global "cgroup
> > > destruction" counter and store a snapshot of that counter whenever we
> > > put a cgroup pointer in the position cache.  We only use the cached
> > > pointer if that counter has not changed in the meantime, so we know
> > > that the cgroup still exists.
> >
> > Currently we have:
> > rcu_read_lock()	// keeps cgroup links safe
> > 	iter->iter_lock	// keeps selection exclusive for a specific iterator
> > 	1) global_counter == iter_counter
> > 	2) css_tryget(cached_memcg)	// check it is still alive
> > rcu_read_unlock()
> >
> > What would protect us from races when css would disappear between 1 and
> > 2?
>
> rcu

That was my first attempt but then I convinced myself it might not be
sufficient. But now that I think about it more I guess you are right.

> > css is invalidated from worker context scheduled from __css_put and it
> > is using dentry locking which we surely do not want to pull here. We
> > could hook into css_offline which is called with cgroup_mutex but we
> > cannot use this one here because it is no longer exported and Tejun
> > would kill us for that.
> > So we can add a new global memcg internal lock to do this atomically.
> > Ohh, this is getting uglier...
>
> A racing final css_put() means that the tryget fails, but our RCU read
> lock keeps the CSS allocated.  If the dead_count is uptodate, it means
> that the rcu read lock was acquired before the synchronize_rcu()
> before the css is freed.

yes.

> > > It is pretty pretty imprecise and we invalidate the whole cache every
> > > time a cgroup is destroyed, but I think that should be okay.
> >
> > I am not sure this is OK because this gives an indirect way of
> > influencing reclaim in one hierarchy by another one which opens a door
> > for regressions (or malicious over-reclaim in the extreme case).
> > So I do not like this very much.
> >
> > > If not, better ideas are welcome.
> >
> > Maybe we could keep the counter per memcg but that would mean that we
> > would need to go up the hierarchy as well. We wouldn't have to go over
> > node-zone-priority cleanup so it would be much more lightweight.
> >
> > I am not sure this is necessarily better than explicit cleanup because
> > it brings yet another kind of generation number to the game but I guess
> > I can live with it if people really think the relaxed way is much
> > better.
> > What do you think about the patch below (untested yet)?
>
> Better, but I think you can get rid of both locks:

What is the other lock you have in mind?

> mem_cgroup_iter:
> rcu_read_lock()
> if atomic_read(&root->dead_count) == iter->dead_count:
> 	smp_rmb()
> 	if tryget(iter->position):
> 		position = iter->position
> memcg = find_next(position)
> css_put(position)
> iter->position = memcg
> smp_wmb() /* Write position cache BEFORE marking it uptodate */
> iter->dead_count = atomic_read(&root->dead_count)
> rcu_read_unlock()

Updated patch below:
---
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx139.postini.com [74.125.245.139]) by kanga.kvack.org (Postfix) with SMTP id BC9AE6B000A for ; Mon, 11 Feb 2013 14:58:38 -0500 (EST)
Date: Mon, 11 Feb 2013 14:58:24 -0500
From: Johannes Weiner
Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
Message-ID: <20130211195824.GB15951@cmpxchg.org>
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130211192929.GB29000@dhcp22.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Michal Hocko
Cc: linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo, Glauber Costa, Li Zefan

On Mon, Feb 11, 2013 at 08:29:29PM +0100, Michal Hocko wrote:
> On Mon 11-02-13 12:56:19, Johannes Weiner wrote:
> > On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote:
> > > Maybe we could keep the counter per memcg but that would mean that we
> > > would need to go up the hierarchy as well. We wouldn't have to go over
> > > node-zone-priority cleanup so it would be much more lightweight.
> > >
> > > I am not sure this is necessarily better than explicit cleanup because
> > > it brings yet another kind of generation number to the game but I guess
> > > I can live with it if people really think the relaxed way is much
> > > better.
> > > What do you think about the patch below (untested yet)?
> >
> > Better, but I think you can get rid of both locks:
>
> What is the other lock you have in mind?

The iter lock itself.  I mean, multiple reclaimers can still race but
there won't be any corruption (if you make iter->dead_count a long,
setting it happens atomically, we only need the memcg->dead_count to be
an atomic because of the inc) and the worst that could happen is that
a reclaim starts at the wrong point in hierarchy, right?

But as you said in the changelog that introduced the lock, it's never
actually been a practical problem.

You just need to put the wmb back in place, so that we never see the
dead_count give the green light while the cached position is stale, or
we'll tryget random memory.

> > mem_cgroup_iter:
> > rcu_read_lock()
> > if atomic_read(&root->dead_count) == iter->dead_count:
> > 	smp_rmb()
> > 	if tryget(iter->position):
> > 		position = iter->position
> > memcg = find_next(position)
> > css_put(position)
> > iter->position = memcg
> > smp_wmb() /* Write position cache BEFORE marking it uptodate */
> > iter->dead_count = atomic_read(&root->dead_count)
> > rcu_read_unlock()
>
> Updated patch below:

Cool, thanks.
I hope you don't find it too ugly anymore :-) > >From 756c4f0091d250bc5ff816f8e9d11840e8522b3a Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Mon, 11 Feb 2013 20:23:51 +0100 > Subject: [PATCH] memcg: relax memcg iter caching > > Now that per-node-zone-priority iterator caches memory cgroups rather > than their css ids we have to be careful and remove them from the > iterator when they are on the way out otherwise they might hang for > unbounded amount of time (until the global/targeted reclaim triggers the > zone under priority to find out the group is dead and let it to find the > final rest). > > We can fix this issue by relaxing rules for the last_visited memcg as > well. > Instead of taking reference to css before it is stored into > iter->last_visited we can just store its pointer and track the number of > removed groups for each memcg. This number would be stored into iterator > everytime when a memcg is cached. If the iter count doesn't match the > curent walker root's one we will start over from the root again. The > group counter is incremented upwards the hierarchy every time a group is > removed. > > Locking rules are a bit complicated but we primarily rely on rcu which > protects css from disappearing while it is proved to be still valid. The > validity is checked in two steps. First the iter->last_dead_count has > to match root->dead_count and second css_tryget has to confirm the > that the group is still alive and it pins it until we get a next memcg. 
> > Spotted-by: Ying Han > Original-idea-by: Johannes Weiner > Signed-off-by: Michal Hocko > --- > mm/memcontrol.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++-------- > 1 file changed, 57 insertions(+), 9 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index e9f5c47..f9b5719 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu { > }; > > struct mem_cgroup_reclaim_iter { > - /* last scanned hierarchy member with elevated css ref count */ > + /* > + * last scanned hierarchy member. Valid only if last_dead_count > + * matches memcg->dead_count of the hierarchy root group. > + */ > struct mem_cgroup *last_visited; > + unsigned int last_dead_count; > + > /* scan generation, increased every round-trip */ > unsigned int generation; > /* lock to protect the position and generation */ > @@ -357,6 +362,7 @@ struct mem_cgroup { > struct mem_cgroup_stat_cpu nocpu_base; > spinlock_t pcp_counter_lock; > > + atomic_t dead_count; > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET) > struct tcp_memcontrol tcp_mem; > #endif > @@ -1158,19 +1164,33 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > int nid = zone_to_nid(reclaim->zone); > int zid = zone_idx(reclaim->zone); > struct mem_cgroup_per_zone *mz; > + unsigned int dead_count; > > mz = mem_cgroup_zoneinfo(root, nid, zid); > iter = &mz->reclaim_iter[reclaim->priority]; > spin_lock(&iter->iter_lock); > - last_visited = iter->last_visited; > if (prev && reclaim->generation != iter->generation) { > - if (last_visited) { > - css_put(&last_visited->css); > - iter->last_visited = NULL; > - } > + iter->last_visited = NULL; > spin_unlock(&iter->iter_lock); > goto out_unlock; > } > + > + /* > + * last_visited might be invalid if some of the group > + * downwards was removed. As we do not know which one > + * disappeared we have to start all over again from the > + * root. 
> +			 * css ref count then makes sure that css won't
> +			 * disappear while we iterate to the next memcg
> +			 */
> +			last_visited = iter->last_visited;
> +			dead_count = atomic_read(&root->dead_count);
> +			smp_rmb();

Confused about this barrier, see below.

As per above, if you remove the iter lock, those lines are mixed up.
You need to read the dead count first because the writer updates the
dead count after it sets the new position.  That way, if the dead
count gives the go-ahead, you KNOW that the position cache is valid,
because it has been updated first.  If either the two reads or the two
writes get reordered, you risk seeing a matching dead count while the
position cache is stale.

> +			if (last_visited &&
> +					((dead_count != iter->last_dead_count) ||
> +					 !css_tryget(&last_visited->css))) {
> +				last_visited = NULL;
> +			}
>  		}
>
>  		/*
> @@ -1210,10 +1230,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  			if (css && !memcg)
>  				curr = mem_cgroup_from_css(css);
>
> -			/* make sure that the cached memcg is not removed */
> -			if (curr)
> -				css_get(&curr->css);
> +			/*
> +			 * No memory barrier is needed here because we are
> +			 * protected by iter_lock
> +			 */
>  			iter->last_visited = curr;
> +			iter->last_dead_count = atomic_read(&root->dead_count);
>
>  			if (!css)
>  				iter->generation++;
> @@ -6375,10 +6397,36 @@ free_out:
>  	return ERR_PTR(error);
>  }
>
> +/*
> + * Announce all parents that a group from their hierarchy is gone.
> + */
> +static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *memcg)

How about

static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)

?

> +{
> +	struct mem_cgroup *parent = memcg;
> +
> +	while ((parent = parent_mem_cgroup(parent)))
> +		atomic_inc(&parent->dead_count);
> +
> +	/*
> +	 * if the root memcg is not hierarchical we have to check it
> +	 * explicitely.
> +	 */
> +	if (!root_mem_cgroup->use_hierarchy)
> +		atomic_inc(&parent->dead_count);

Increase root_mem_cgroup->dead_count instead?
> + /* > + * Make sure that dead_count updates are visible before other > + * cleanup from css_offline. > + * Pairs with smp_rmb in mem_cgroup_iter > + */ > + smp_wmb(); That's unexpected. What other cleanups? A race between this and mem_cgroup_iter should be fine because of the RCU synchronization. Thanks! Johannes -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx106.postini.com [74.125.245.106]) by kanga.kvack.org (Postfix) with SMTP id B65466B0005 for ; Mon, 11 Feb 2013 16:28:03 -0500 (EST) Received: by mail-ea0-f171.google.com with SMTP id c13so2807352eaa.30 for ; Mon, 11 Feb 2013 13:28:01 -0800 (PST) Date: Mon, 11 Feb 2013 22:27:56 +0100 From: Michal Hocko Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130211212756.GC29000@dhcp22.suse.cz> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130211195824.GB15951@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan On Mon 11-02-13 14:58:24, Johannes Weiner wrote: > On Mon, Feb 11, 2013 at 08:29:29PM +0100, Michal Hocko wrote: > > On Mon 11-02-13 12:56:19, Johannes Weiner wrote: > > > On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote: > > > > Maybe we could keep the counter per memcg but that would mean that we > > > > would need to go up the 
hierarchy as well. We wouldn't have to go over > > > > node-zone-priority cleanup so it would be much more lightweight. > > > > > > > > I am not sure this is necessarily better than explicit cleanup because > > > > it brings yet another kind of generation number to the game but I guess > > > > I can live with it if people really thing the relaxed way is much > > > > better. > > > > What do you think about the patch below (untested yet)? > > > > > > Better, but I think you can get rid of both locks: > > > > What is the other lock you have in mind. > > The iter lock itself. I mean, multiple reclaimers can still race but > there won't be any corruption (if you make iter->dead_count a long, > setting it happens atomically, we nly need the memcg->dead_count to be > an atomic because of the inc) and the worst that could happen is that > a reclaim starts at the wrong point in hierarchy, right? The lack of synchronization basically means that 2 parallel reclaimers can reclaim every group exactly once (ideally) or up to each group twice in the worst case. So the exclusion was quite comfortable. > But as you said in the changelog that introduced the lock, it's never > actually been a practical problem. That is true but those bugs would be subtle though so I wouldn't be opposed to prevent from them before we get burnt. But if you think that we should keep the previous semantic I can drop that patch. > You just need to put the wmb back in place, so that we never see the > dead_count give the green light while the cached position is stale, or > we'll tryget random memory. 
>
> > > mem_cgroup_iter:
> > > rcu_read_lock()
> > > if atomic_read(&root->dead_count) == iter->dead_count:
> > >   smp_rmb()
> > >   if tryget(iter->position):
> > >     position = iter->position
> > > memcg = find_next(position)
> > > css_put(position)
> > > iter->position = memcg
> > > smp_wmb() /* Write position cache BEFORE marking it uptodate */
> > > iter->dead_count = atomic_read(&root->dead_count)
> > > rcu_read_unlock()
> >
> > Updated patch below:
>
> Cool, thanks. I hope you don't find it too ugly anymore :-)

It's getting tricky and you know how people love when you have to play
and rely on atomics with memory barriers...

[...]
> > +
> > +			/*
> > +			 * last_visited might be invalid if some of the group
> > +			 * downwards was removed. As we do not know which one
> > +			 * disappeared we have to start all over again from the
> > +			 * root.
> > +			 * css ref count then makes sure that css won't
> > +			 * disappear while we iterate to the next memcg
> > +			 */
> > +			last_visited = iter->last_visited;
> > +			dead_count = atomic_read(&root->dead_count);
> > +			smp_rmb();
>
> Confused about this barrier, see below.
>
> As per above, if you remove the iter lock, those lines are mixed up.
> You need to read the dead count first because the writer updates the
> dead count after it sets the new position.

You are right, we need
+			dead_count = atomic_read(&root->dead_count);
+			smp_rmb();
+			last_visited = iter->last_visited;

> That way, if the dead count gives the go-ahead, you KNOW that the
> position cache is valid, because it has been updated first.

OK, you are right. We can live without css_tryget because dead_count is
either OK which means that css would be alive at least this rcu period
(and RCU walk would be safe as well) or it is incremented which means
that we have started css_offline already and then css is dead already.
So css_tryget can be dropped.
> If either the two reads or the two writes get reordered, you risk > seeing a matching dead count while the position cache is stale. > > > + if (last_visited && > > + ((dead_count != iter->last_dead_count) || > > + !css_tryget(&last_visited->css))) { > > + last_visited = NULL; > > + } > > } > > > > /* > > @@ -1210,10 +1230,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > > if (css && !memcg) > > curr = mem_cgroup_from_css(css); > > > > - /* make sure that the cached memcg is not removed */ > > - if (curr) > > - css_get(&curr->css); > > + /* > > + * No memory barrier is needed here because we are > > + * protected by iter_lock > > + */ > > iter->last_visited = curr; + smp_wmb(); > > + iter->last_dead_count = atomic_read(&root->dead_count); > > > > if (!css) > > iter->generation++; > > @@ -6375,10 +6397,36 @@ free_out: > > return ERR_PTR(error); > > } > > > > +/* > > + * Announce all parents that a group from their hierarchy is gone. > > + */ > > +static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *memcg) > > How about > > static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg) OK > ? > > > +{ > > + struct mem_cgroup *parent = memcg; > > + > > + while ((parent = parent_mem_cgroup(parent))) > > + atomic_inc(&parent->dead_count); > > + > > + /* > > + * if the root memcg is not hierarchical we have to check it > > + * explicitely. > > + */ > > + if (!root_mem_cgroup->use_hierarchy) > > + atomic_inc(&parent->dead_count); > > Increase root_mem_cgroup->dead_count instead? Sure. C&P > > + /* > > + * Make sure that dead_count updates are visible before other > > + * cleanup from css_offline. > > + * Pairs with smp_rmb in mem_cgroup_iter > > + */ > > + smp_wmb(); > > That's unexpected. What other cleanups? A race between this and > mem_cgroup_iter should be fine because of the RCU synchronization. OK, I was too careful, probably (memory barriers are always head scratchers). 
I was worried about all dead_count should be committed before we do other steps in the clean up like reparenting charges etc. But as you say it will not do any changes. I will get back to this tomorrow. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx158.postini.com [74.125.245.158]) by kanga.kvack.org (Postfix) with SMTP id 510C36B0007 for ; Mon, 11 Feb 2013 17:07:47 -0500 (EST) Received: by mail-ea0-f180.google.com with SMTP id c1so2679828eaa.25 for ; Mon, 11 Feb 2013 14:07:45 -0800 (PST) Date: Mon, 11 Feb 2013 23:07:42 +0100 From: Michal Hocko Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130211220742.GD29000@dhcp22.suse.cz> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130211212756.GC29000@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan On Mon 11-02-13 22:27:56, Michal Hocko wrote: [...] > I will get back to this tomorrow. 
Maybe not a great idea as it is getting late here and brain turns into cabbage but there we go: --- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx201.postini.com [74.125.245.201]) by kanga.kvack.org (Postfix) with SMTP id 9784D6B0005 for ; Mon, 11 Feb 2013 17:39:58 -0500 (EST) Date: Mon, 11 Feb 2013 17:39:43 -0500 From: Johannes Weiner Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130211223943.GC15951@cmpxchg.org> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130211212756.GC29000@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote: > On Mon 11-02-13 14:58:24, Johannes Weiner wrote: > > On Mon, Feb 11, 2013 at 08:29:29PM +0100, Michal Hocko wrote: > > > On Mon 11-02-13 12:56:19, Johannes Weiner wrote: > > > > On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote: > > > > > Maybe we could keep the counter per memcg but that would mean that we > > > > > would need to go up the hierarchy as well. We wouldn't have to go over > > > > > node-zone-priority cleanup so it would be much more lightweight. > > > > > > > > > > I am not sure this is necessarily better than explicit cleanup because > > > > > it brings yet another kind of generation number to the game but I guess > > > > > I can live with it if people really thing the relaxed way is much > > > > > better. 
> > > > > What do you think about the patch below (untested yet)? > > > > > > > > Better, but I think you can get rid of both locks: > > > > > > What is the other lock you have in mind. > > > > The iter lock itself. I mean, multiple reclaimers can still race but > > there won't be any corruption (if you make iter->dead_count a long, > > setting it happens atomically, we nly need the memcg->dead_count to be > > an atomic because of the inc) and the worst that could happen is that > > a reclaim starts at the wrong point in hierarchy, right? > > The lack of synchronization basically means that 2 parallel reclaimers > can reclaim every group exactly once (ideally) or up to each group > twice in the worst case. > So the exclusion was quite comfortable. It's quite unlikely, though. Don't forget that they actually reclaim in between, I just can't see them line up perfectly and race to the iterator at the same time repeatedly. It's more likely to happen at the higher priority levels where less reclaim happens, and then it's not a big deal anyway. With lower priority levels, when the glitches would be more problematic, they also become even less likely. > > But as you said in the changelog that introduced the lock, it's never > > actually been a practical problem. > > That is true but those bugs would be subtle though so I wouldn't be > opposed to prevent from them before we get burnt. But if you think that > we should keep the previous semantic I can drop that patch. I just think that the problem is unlikely and not that big of a deal. > > You just need to put the wmb back in place, so that we never see the > > dead_count give the green light while the cached position is stale, or > > we'll tryget random memory. 
> > > > > > mem_cgroup_iter: > > > > rcu_read_lock() > > > > if atomic_read(&root->dead_count) == iter->dead_count: > > > > smp_rmb() > > > > if tryget(iter->position): > > > > position = iter->position > > > > memcg = find_next(postion) > > > > css_put(position) > > > > iter->position = memcg > > > > smp_wmb() /* Write position cache BEFORE marking it uptodate */ > > > > iter->dead_count = atomic_read(&root->dead_count) > > > > rcu_read_unlock() > > > > > > Updated patch bellow: > > > > Cool, thanks. I hope you don't find it too ugly anymore :-) > > It's getting trick and you know how people love when you have to play > and rely on atomics with memory barriers... My bumper sticker reads "I don't believe in mutual exclusion" (the kernel hacker's version of smile for the red light camera). I mean, you were the one complaining about the lock... > > That way, if the dead count gives the go-ahead, you KNOW that the > > position cache is valid, because it has been updated first. > > OK, you are right. We can live without css_tryget because dead_count is > either OK which means that css would be alive at least this rcu period > (and RCU walk would be safe as well) or it is incremented which means > that we have started css_offline already and then css is dead already. > So css_tryget can be dropped. Not quite :) The dead_count check is for completed destructions, but the try_get is needed to detect a race with an ongoing destruction. Basically, the dead_count verifies the iterator pointer is valid (and the rcu reader lock keeps it that way), the try_get verifies that the object pointed to is still alive. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx133.postini.com [74.125.245.133]) by kanga.kvack.org (Postfix) with SMTP id 2D1C16B0005 for ; Tue, 12 Feb 2013 04:54:27 -0500 (EST) Date: Tue, 12 Feb 2013 10:54:19 +0100 From: Michal Hocko Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212095419.GB4863@dhcp22.suse.cz> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130211223943.GC15951@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan On Mon 11-02-13 17:39:43, Johannes Weiner wrote: > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote: > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote: > > > On Mon, Feb 11, 2013 at 08:29:29PM +0100, Michal Hocko wrote: > > > > On Mon 11-02-13 12:56:19, Johannes Weiner wrote: > > > > > On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote: > > > > > > Maybe we could keep the counter per memcg but that would mean that we > > > > > > would need to go up the hierarchy as well. We wouldn't have to go over > > > > > > node-zone-priority cleanup so it would be much more lightweight. 
> > > > > > > > > > > > I am not sure this is necessarily better than explicit cleanup because > > > > > > it brings yet another kind of generation number to the game but I guess > > > > > > I can live with it if people really thing the relaxed way is much > > > > > > better. > > > > > > What do you think about the patch below (untested yet)? > > > > > > > > > > Better, but I think you can get rid of both locks: > > > > > > > > What is the other lock you have in mind. > > > > > > The iter lock itself. I mean, multiple reclaimers can still race but > > > there won't be any corruption (if you make iter->dead_count a long, > > > setting it happens atomically, we nly need the memcg->dead_count to be > > > an atomic because of the inc) and the worst that could happen is that > > > a reclaim starts at the wrong point in hierarchy, right? > > > > The lack of synchronization basically means that 2 parallel reclaimers > > can reclaim every group exactly once (ideally) or up to each group > > twice in the worst case. > > So the exclusion was quite comfortable. > > It's quite unlikely, though. Don't forget that they actually reclaim > in between, I just can't see them line up perfectly and race to the > iterator at the same time repeatedly. It's more likely to happen at > the higher priority levels where less reclaim happens, and then it's > not a big deal anyway. With lower priority levels, when the glitches > would be more problematic, they also become even less likely. Fair enough, I will drop that patch in the next version. > > > But as you said in the changelog that introduced the lock, it's never > > > actually been a practical problem. > > > > That is true but those bugs would be subtle though so I wouldn't be > > opposed to prevent from them before we get burnt. But if you think that > > we should keep the previous semantic I can drop that patch. > > I just think that the problem is unlikely and not that big of a deal. 
> > > > You just need to put the wmb back in place, so that we never see the > > > dead_count give the green light while the cached position is stale, or > > > we'll tryget random memory. > > > > > > > > mem_cgroup_iter: > > > > > rcu_read_lock() > > > > > if atomic_read(&root->dead_count) == iter->dead_count: > > > > > smp_rmb() > > > > > if tryget(iter->position): > > > > > position = iter->position > > > > > memcg = find_next(postion) > > > > > css_put(position) > > > > > iter->position = memcg > > > > > smp_wmb() /* Write position cache BEFORE marking it uptodate */ > > > > > iter->dead_count = atomic_read(&root->dead_count) > > > > > rcu_read_unlock() > > > > > > > > Updated patch bellow: > > > > > > Cool, thanks. I hope you don't find it too ugly anymore :-) > > > > It's getting trick and you know how people love when you have to play > > and rely on atomics with memory barriers... > > My bumper sticker reads "I don't believe in mutual exclusion" (the > kernel hacker's version of smile for the red light camera). Ohh, those easy riders. > I mean, you were the one complaining about the lock... > > > > That way, if the dead count gives the go-ahead, you KNOW that the > > > position cache is valid, because it has been updated first. > > > > OK, you are right. We can live without css_tryget because dead_count is > > either OK which means that css would be alive at least this rcu period > > (and RCU walk would be safe as well) or it is incremented which means > > that we have started css_offline already and then css is dead already. > > So css_tryget can be dropped. > > Not quite :) > > The dead_count check is for completed destructions, Not quite :P. dead_count is incremented in css_offline callback which is called before the cgroup core releases its last reference and unlinks the group from the siblinks. css_tryget would already fail at this stage because CSS_DEACT_BIAS is in place at that time but this doesn't break RCU walk. 
So I think we are safe even without css_get. Or am I missing something? [...] -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx166.postini.com [74.125.245.166]) by kanga.kvack.org (Postfix) with SMTP id 71A656B0005 for ; Tue, 12 Feb 2013 10:10:24 -0500 (EST) Date: Tue, 12 Feb 2013 10:10:02 -0500 From: Johannes Weiner Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212151002.GD15951@cmpxchg.org> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212095419.GB4863@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote: > On Mon 11-02-13 17:39:43, Johannes Weiner wrote: > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote: > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote: > > > > That way, if the dead count gives the go-ahead, you KNOW that the > > > > position cache is valid, because it has been updated first. > > > > > > OK, you are right. 
We can live without css_tryget because dead_count is
> > > either OK which means that css would be alive at least this rcu period
> > > (and RCU walk would be safe as well) or it is incremented which means
> > > that we have started css_offline already and then css is dead already.
> > > So css_tryget can be dropped.
> >
> > Not quite :)
> >
> > The dead_count check is for completed destructions,
>
> Not quite :P. dead_count is incremented in css_offline callback which is
> called before the cgroup core releases its last reference and unlinks
> the group from the siblings. css_tryget would already fail at this stage
> because CSS_DEACT_BIAS is in place at that time but this doesn't break
> RCU walk. So I think we are safe even without css_get.

But you drop the RCU lock before you return.

dead_count IS incremented for every destruction, but it's not reliable
for concurrent ones, is what I meant. Again, if there is a dead_count
mismatch, your pointer might be dangling, easy case. However, even if
there is no mismatch, you could still race with a destruction that has
marked the object dead, and then frees it once you drop the RCU lock,
so you need try_get() to check if the object is dead, or you could
return a pointer to freed or soon to be freed memory.

/*
 * If the dead_count mismatches, a destruction has happened or is
 * happening concurrently. If the dead_count matches, a destruction
 * might still happen concurrently, but since we checked under RCU,
 * that destruction won't free the object until we release the RCU
 * reader lock. Thus, the dead_count check verifies the pointer is
 * still valid, css_tryget() verifies the cgroup pointed to is alive.
 */

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx165.postini.com [74.125.245.165]) by kanga.kvack.org (Postfix) with SMTP id 28BD96B0005 for ; Tue, 12 Feb 2013 10:43:34 -0500 (EST) Date: Tue, 12 Feb 2013 16:43:30 +0100 From: Michal Hocko Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212154330.GG4863@dhcp22.suse.cz> References: <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212151002.GD15951@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan On Tue 12-02-13 10:10:02, Johannes Weiner wrote: > On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote: > > On Mon 11-02-13 17:39:43, Johannes Weiner wrote: > > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote: > > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote: > > > > > That way, if the dead count gives the go-ahead, you KNOW that the > > > > > position cache is valid, because it has been updated first. > > > > > > > > OK, you are right. We can live without css_tryget because dead_count is > > > > either OK which means that css would be alive at least this rcu period > > > > (and RCU walk would be safe as well) or it is incremented which means > > > > that we have started css_offline already and then css is dead already. > > > > So css_tryget can be dropped. 
> > >
> > > Not quite :)
> > >
> > > The dead_count check is for completed destructions,
> >
> > Not quite :P. dead_count is incremented in css_offline callback which is
> > called before the cgroup core releases its last reference and unlinks
> > the group from the siblings. css_tryget would already fail at this stage
> > because CSS_DEACT_BIAS is in place at that time but this doesn't break
> > RCU walk. So I think we are safe even without css_get.
>
> But you drop the RCU lock before you return.
>
> dead_count IS incremented for every destruction, but it's not reliable
> for concurrent ones, is what I meant. Again, if there is a dead_count
> mismatch, your pointer might be dangling, easy case. However, even if
> there is no mismatch, you could still race with a destruction that has
> marked the object dead, and then frees it once you drop the RCU lock,
> so you need try_get() to check if the object is dead, or you could
> return a pointer to freed or soon to be freed memory.

Wait a moment. But what prevents the following race?

rcu_read_lock()
				mem_cgroup_css_offline(memcg)
				root->dead_count++
iter->last_dead_count = root->dead_count
iter->last_visited = memcg
				// final
				css_put(memcg);
// last_visited is still valid
rcu_read_unlock()
[...]
// next iteration
rcu_read_lock()
iter->last_dead_count == root->dead_count
// KABOOM

The race window between dead_count++ and css_put is quite big but that
is not important because that css_put can happen anytime before we start
the next iteration and take rcu_read_lock.
-- 
Michal Hocko
SUSE Labs

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx165.postini.com [74.125.245.165]) by kanga.kvack.org (Postfix) with SMTP id 055F86B0002 for ; Tue, 12 Feb 2013 11:13:35 -0500 (EST) Date: Tue, 12 Feb 2013 17:13:32 +0100 From: Michal Hocko Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212161332.GI4863@dhcp22.suse.cz> References: <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212154330.GG4863@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan On Tue 12-02-13 16:43:30, Michal Hocko wrote: [...] The example was not complete: > Wait a moment. But what prevents from the following race? > > rcu_read_lock() cgroup_next_descendant_pre css_tryget(css); memcg = mem_cgroup_from_css(css) atomic_add(CSS_DEACT_BIAS, &css->refcnt) > mem_cgroup_css_offline(memcg) We should be safe if we did synchronize_rcu() before root->dead_count++, no? Because then we would have a guarantee that if css_tryget(memcg) suceeded then we wouldn't race with dead_count++ it triggered. > root->dead_count++ > iter->last_dead_count = root->dead_count > iter->last_visited = memcg > // final > css_put(memcg); > // last_visited is still valid > rcu_read_unlock() > [...] 
> // next iteration > rcu_read_lock() > iter->last_dead_count == root->dead_count > // KABOOM > > The race window between dead_count++ and css_put is quite big but that > is not important because that css_put can happen anytime before we start > the next iteration and take rcu_read_lock. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx188.postini.com [74.125.245.188]) by kanga.kvack.org (Postfix) with SMTP id 5443C6B0002 for ; Tue, 12 Feb 2013 11:24:45 -0500 (EST) Date: Tue, 12 Feb 2013 17:24:42 +0100 From: Michal Hocko Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212162442.GJ4863@dhcp22.suse.cz> References: <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161332.GI4863@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212161332.GI4863@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan On Tue 12-02-13 17:13:32, Michal Hocko wrote: > On Tue 12-02-13 16:43:30, Michal Hocko wrote: > [...] > The example was not complete: > > > Wait a moment. But what prevents from the following race? 
> > > > rcu_read_lock() > > cgroup_next_descendant_pre > css_tryget(css); > memcg = mem_cgroup_from_css(css) atomic_add(CSS_DEACT_BIAS, &css->refcnt) > > > mem_cgroup_css_offline(memcg) > > We should be safe if we did synchronize_rcu() before root->dead_count++, > no? > Because then we would have a guarantee that if css_tryget(memcg) > suceeded then we wouldn't race with dead_count++ it triggered. > > > root->dead_count++ > > iter->last_dead_count = root->dead_count > > iter->last_visited = memcg > > // final > > css_put(memcg); > > // last_visited is still valid > > rcu_read_unlock() > > [...] > > // next iteration > > rcu_read_lock() > > iter->last_dead_count == root->dead_count > > // KABOOM Ohh I have missed that we took a reference on the current memcg which will be stored into last_visited. And then later, during the next iteration it will be still alive until we are done because previous patch moved css_put to the very end. So this race is not possible. I still need to think about parallel iteration and a race with removal. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx186.postini.com [74.125.245.186]) by kanga.kvack.org (Postfix) with SMTP id BFE4B6B0007 for ; Tue, 12 Feb 2013 11:33:58 -0500 (EST) In-Reply-To: <20130212154330.GG4863@dhcp22.suse.cz> References: <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=UTF-8 Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators From: Johannes Weiner Date: Tue, 12 Feb 2013 11:33:39 -0500 Message-ID: Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Michal Hocko wrote: >On Tue 12-02-13 10:10:02, Johannes Weiner wrote: >> On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote: >> > On Mon 11-02-13 17:39:43, Johannes Weiner wrote: >> > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote: >> > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote: >> > > > > That way, if the dead count gives the go-ahead, you KNOW that the >> > > > > position cache is valid, because it has been updated first. >> > > > >> > > > OK, you are right. We can live without css_tryget because dead_count is >> > > > either OK which means that css would be alive at least this rcu period >> > > > (and RCU walk would be safe as well) or it is incremented which means >> > > > that we have started css_offline already and then css is dead already. 
>> > > > So css_tryget can be dropped. >> > > >> > > Not quite :) >> > > >> > > The dead_count check is for completed destructions, >> > >> > Not quite :P. dead_count is incremented in css_offline callback which is >> > called before the cgroup core releases its last reference and unlinks >> > the group from the siblinks. css_tryget would already fail at this stage >> > because CSS_DEACT_BIAS is in place at that time but this doesn't break >> > RCU walk. So I think we are safe even without css_get. >> >> But you drop the RCU lock before you return. >> >> dead_count IS incremented for every destruction, but it's not reliable >> for concurrent ones, is what I meant. Again, if there is a dead_count >> mismatch, your pointer might be dangling, easy case. However, even if >> there is no mismatch, you could still race with a destruction that has >> marked the object dead, and then frees it once you drop the RCU lock, >> so you need try_get() to check if the object is dead, or you could >> return a pointer to freed or soon to be freed memory. > >Wait a moment. But what prevents from the following race? > >rcu_read_lock() > mem_cgroup_css_offline(memcg) > root->dead_count++ >iter->last_dead_count = root->dead_count use the dead count read the first time for comparison, i.e. only one atomic read in that function. you are right, we would miss to account for that concurrent destruction otherwise. >iter->last_visited = memcg > // final > css_put(memcg); >// last_visited is still valid >rcu_read_unlock() >[...] >// next iteration >rcu_read_lock() >iter->last_dead_count == root->dead_count >// KABOOM > >The race window between dead_count++ and css_put is quite big but that >is not important because that css_put can happen anytime before we start >the next iteration and take rcu_read_lock. -- Sent from my Android phone with K-9 Mail. Please excuse my brevity. 
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx199.postini.com [74.125.245.199]) by kanga.kvack.org (Postfix) with SMTP id B93DF6B0009 for ; Tue, 12 Feb 2013 11:37:58 -0500 (EST) Date: Tue, 12 Feb 2013 17:37:56 +0100 From: Michal Hocko Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212163756.GK4863@dhcp22.suse.cz> References: <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161332.GI4863@dhcp22.suse.cz> <20130212162442.GJ4863@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212162442.GJ4863@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan On Tue 12-02-13 17:24:42, Michal Hocko wrote: > On Tue 12-02-13 17:13:32, Michal Hocko wrote: > > On Tue 12-02-13 16:43:30, Michal Hocko wrote: > > [...] > > The example was not complete: > > > > > Wait a moment. But what prevents from the following race? > > > > > > rcu_read_lock() > > > > cgroup_next_descendant_pre > > css_tryget(css); > > memcg = mem_cgroup_from_css(css) atomic_add(CSS_DEACT_BIAS, &css->refcnt) > > > > > mem_cgroup_css_offline(memcg) > > > > We should be safe if we did synchronize_rcu() before root->dead_count++, > > no? > > Because then we would have a guarantee that if css_tryget(memcg) > > suceeded then we wouldn't race with dead_count++ it triggered. 
> > > > > root->dead_count++ > > > iter->last_dead_count = root->dead_count > > > iter->last_visited = memcg > > > // final > > > css_put(memcg); > > > // last_visited is still valid > > > rcu_read_unlock() > > > [...] > > > // next iteration > > > rcu_read_lock() > > > iter->last_dead_count == root->dead_count > > > // KABOOM > > Ohh I have missed that we took a reference on the current memcg which > will be stored into last_visited. And then later, during the next > iteration it will be still alive until we are done because previous > patch moved css_put to the very end. And that wouldn't help because: css_tryget(memcg) // OK CSS_DEACT_BIAS root->dead_count++ iter->last_visited = memcg iter->last_dead_count = root->dead_count prev = memcg css_put(memcg) memcg_iter_break css_put(memcg) // it will be released //new iteration iter->last_dead_count == root->dead_count //ok css_tryget() // KABOOM because css is already gone But I still might be missing something and need to get back to this with a clean head. Sorry about the spam. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx203.postini.com [74.125.245.203]) by kanga.kvack.org (Postfix) with SMTP id 7F4266B0007 for ; Tue, 12 Feb 2013 11:41:33 -0500 (EST) In-Reply-To: <20130212162442.GJ4863@dhcp22.suse.cz> References: <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161332.GI4863@dhcp22.suse.cz> <20130212162442.GJ4863@dhcp22.suse.cz> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=UTF-8 Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators From: Johannes Weiner Date: Tue, 12 Feb 2013 11:41:03 -0500 Message-ID: <63d3b5fa-dbc6-4bc9-8867-f9961e644305@email.android.com> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Michal Hocko wrote: >On Tue 12-02-13 17:13:32, Michal Hocko wrote: >> On Tue 12-02-13 16:43:30, Michal Hocko wrote: >> [...] >> The example was not complete: >> >> > Wait a moment. But what prevents from the following race? >> > >> > rcu_read_lock() >> >> cgroup_next_descendant_pre >> css_tryget(css); >> memcg = mem_cgroup_from_css(css) atomic_add(CSS_DEACT_BIAS, &css->refcnt) >> >> > mem_cgroup_css_offline(memcg) >> >> We should be safe if we did synchronize_rcu() before root->dead_count++, >> no? >> Because then we would have a guarantee that if css_tryget(memcg) >> suceeded then we wouldn't race with dead_count++ it triggered. 
>> >> > root->dead_count++ >> > iter->last_dead_count = root->dead_count >> > iter->last_visited = memcg >> > // final >> > css_put(memcg); >> > // last_visited is still valid >> > rcu_read_unlock() >> > [...] >> > // next iteration >> > rcu_read_lock() >> > iter->last_dead_count == root->dead_count >> > // KABOOM > >Ohh I have missed that we took a reference on the current memcg which >will be stored into last_visited. And then later, during the next >iteration it will be still alive until we are done because previous >patch moved css_put to the very end. >So this race is not possible. I still need to think about parallel >iteration and a race with removal. I thought the whole point was to not have a reference in last_visited because have the iterator might be unused indefinitely :-) We only store a pointer and validate it before use the next time around. So I think the race is still possible, but we can deal with it by not losing concurrent dead count changes, i.e. one atomic read in the iterator function. -- Sent from my Android phone with K-9 Mail. Please excuse my brevity. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx202.postini.com [74.125.245.202]) by kanga.kvack.org (Postfix) with SMTP id 1DC186B0002 for ; Tue, 12 Feb 2013 12:12:21 -0500 (EST) Date: Tue, 12 Feb 2013 18:12:16 +0100 From: Michal Hocko Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212171216.GA17663@dhcp22.suse.cz> References: <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161332.GI4863@dhcp22.suse.cz> <20130212162442.GJ4863@dhcp22.suse.cz> <63d3b5fa-dbc6-4bc9-8867-f9961e644305@email.android.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <63d3b5fa-dbc6-4bc9-8867-f9961e644305@email.android.com> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan On Tue 12-02-13 11:41:03, Johannes Weiner wrote: > > > Michal Hocko wrote: > > >On Tue 12-02-13 17:13:32, Michal Hocko wrote: > >> On Tue 12-02-13 16:43:30, Michal Hocko wrote: > >> [...] > >> The example was not complete: > >> > >> > Wait a moment. But what prevents from the following race? > >> > > >> > rcu_read_lock() > >> > >> cgroup_next_descendant_pre > >> css_tryget(css); > >> memcg = mem_cgroup_from_css(css) atomic_add(CSS_DEACT_BIAS, > >&css->refcnt) > >> > >> > mem_cgroup_css_offline(memcg) > >> > >> We should be safe if we did synchronize_rcu() before > >root->dead_count++, > >> no? > >> Because then we would have a guarantee that if css_tryget(memcg) > >> suceeded then we wouldn't race with dead_count++ it triggered. 
> >> > >> > root->dead_count++ > >> > iter->last_dead_count = root->dead_count > >> > iter->last_visited = memcg > >> > // final > >> > css_put(memcg); > >> > // last_visited is still valid > >> > rcu_read_unlock() > >> > [...] > >> > // next iteration > >> > rcu_read_lock() > >> > iter->last_dead_count == root->dead_count > >> > // KABOOM > > > >Ohh I have missed that we took a reference on the current memcg which > >will be stored into last_visited. And then later, during the next > >iteration it will be still alive until we are done because previous > >patch moved css_put to the very end. > >So this race is not possible. I still need to think about parallel > >iteration and a race with removal. > > I thought the whole point was to not have a reference in last_visited > because have the iterator might be unused indefinitely :-) OK, it seems that I managed to confuse ;) > We only store a pointer and validate it before use the next time > around. So I think the race is still possible, but we can deal with > it by not losing concurrent dead count changes, i.e. one atomic read > in the iterator function. All reads from root->dead_count are atomic already, so I am not sure what you mean here. 
Anyway, I hope I won't make this even more confusing if I post what I have right now: --- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx181.postini.com [74.125.245.181]) by kanga.kvack.org (Postfix) with SMTP id 8D4826B0002 for ; Tue, 12 Feb 2013 12:25:50 -0500 (EST) Date: Tue, 12 Feb 2013 12:25:26 -0500 From: Johannes Weiner Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212172526.GC25235@cmpxchg.org> References: <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161051.GQ2666@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212161051.GQ2666@linux.vnet.ibm.com> Sender: owner-linux-mm@kvack.org List-ID: To: "Paul E. McKenney" Cc: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan On Tue, Feb 12, 2013 at 08:10:51AM -0800, Paul E. McKenney wrote: > On Tue, Feb 12, 2013 at 04:43:30PM +0100, Michal Hocko wrote: > > On Tue 12-02-13 10:10:02, Johannes Weiner wrote: > > > On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote: > > > > On Mon 11-02-13 17:39:43, Johannes Weiner wrote: > > > > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote: > > > > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote: > > > > > > > That way, if the dead count gives the go-ahead, you KNOW that the > > > > > > > position cache is valid, because it has been updated first. > > > > > > > > > > > > OK, you are right. 
We can live without css_tryget because dead_count is > > > > > > either OK which means that css would be alive at least this rcu period > > > > > > (and RCU walk would be safe as well) or it is incremented which means > > > > > > that we have started css_offline already and then css is dead already. > > > > > > So css_tryget can be dropped. > > > > > > > > > > Not quite :) > > > > > > > > > > The dead_count check is for completed destructions, > > > > > > > > Not quite :P. dead_count is incremented in css_offline callback which is > > > > called before the cgroup core releases its last reference and unlinks > > > > the group from the siblinks. css_tryget would already fail at this stage > > > > because CSS_DEACT_BIAS is in place at that time but this doesn't break > > > > RCU walk. So I think we are safe even without css_get. > > > > > > But you drop the RCU lock before you return. > > > > > > dead_count IS incremented for every destruction, but it's not reliable > > > for concurrent ones, is what I meant. Again, if there is a dead_count > > > mismatch, your pointer might be dangling, easy case. However, even if > > > there is no mismatch, you could still race with a destruction that has > > > marked the object dead, and then frees it once you drop the RCU lock, > > > so you need try_get() to check if the object is dead, or you could > > > return a pointer to freed or soon to be freed memory. > > > > Wait a moment. But what prevents from the following race? > > > > rcu_read_lock() > > mem_cgroup_css_offline(memcg) > > root->dead_count++ > > iter->last_dead_count = root->dead_count > > iter->last_visited = memcg > > // final > > css_put(memcg); > > // last_visited is still valid > > rcu_read_unlock() > > [...] 
> > // next iteration > > rcu_read_lock() > > iter->last_dead_count == root->dead_count > > // KABOOM > > > > The race window between dead_count++ and css_put is quite big but that > > is not important because that css_put can happen anytime before we start > > the next iteration and take rcu_read_lock. > > The usual approach is to make sure that there is a grace period (either > synchronize_rcu() or call_rcu()) between the time that the data is > made inaccessible to readers (this would be mem_cgroup_css_offline()?) > and the time it is freed (css_put(), correct?). Absolutely! And there is a synchronize_rcu() in between those two operations. However, we want to keep a weak reference to the cgroup after we drop the rcu read-side lock, so rcu alone is not enough for us to guarantee object life time. We still have to carefully detect any concurrent offlinings in order to validate the weak reference next time around. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
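The weak-reference scheme discussed above leans on css_tryget() refusing to hand out a reference once CSS_DEACT_BIAS has been added to the refcount. A minimal user-space sketch of that try-get pattern, using C11 atomics, might look as follows. The names ref, ref_tryget(), ref_put(), ref_deactivate() and DEACT_BIAS are hypothetical stand-ins rather than the kernel API, and the model ignores RCU and object freeing entirely.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for CSS_DEACT_BIAS: a large negative bias. */
#define DEACT_BIAS INT_MIN

struct ref {
	atomic_int count;	/* > 0 while the object is live */
};

/* Take a reference only while the count is still positive. */
static bool ref_tryget(struct ref *r)
{
	int v = atomic_load(&r->count);

	while (v > 0) {
		/* On failure the CAS reloads v with the current value. */
		if (atomic_compare_exchange_weak(&r->count, &v, v + 1))
			return true;
	}
	return false;
}

/* Drop a reference previously obtained. */
static void ref_put(struct ref *r)
{
	atomic_fetch_sub(&r->count, 1);
}

/*
 * Mark the object as going away. Adding the bias pushes the count
 * negative, so every later ref_tryget() fails even though existing
 * holders have not dropped their references yet.
 */
static void ref_deactivate(struct ref *r)
{
	atomic_fetch_add(&r->count, DEACT_BIAS);
}
```

In this model a try-get that fails only says the object is on its way out. It says nothing about whether the memory behind the pointer is still there, which is why the thread insists on the dead_count comparison before the pointer is touched at all.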
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx116.postini.com [74.125.245.116]) by kanga.kvack.org (Postfix) with SMTP id 09F806B0007 for ; Tue, 12 Feb 2013 12:37:49 -0500 (EST) Date: Tue, 12 Feb 2013 12:37:41 -0500 From: Johannes Weiner Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212173741.GD25235@cmpxchg.org> References: <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161332.GI4863@dhcp22.suse.cz> <20130212162442.GJ4863@dhcp22.suse.cz> <63d3b5fa-dbc6-4bc9-8867-f9961e644305@email.android.com> <20130212171216.GA17663@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212171216.GA17663@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan On Tue, Feb 12, 2013 at 06:12:16PM +0100, Michal Hocko wrote: > On Tue 12-02-13 11:41:03, Johannes Weiner wrote: > > > > > > Michal Hocko wrote: > > > > >On Tue 12-02-13 17:13:32, Michal Hocko wrote: > > >> On Tue 12-02-13 16:43:30, Michal Hocko wrote: > > >> [...] > > >> The example was not complete: > > >> > > >> > Wait a moment. But what prevents from the following race? > > >> > > > >> > rcu_read_lock() > > >> > > >> cgroup_next_descendant_pre > > >> css_tryget(css); > > >> memcg = mem_cgroup_from_css(css) atomic_add(CSS_DEACT_BIAS, > > >&css->refcnt) > > >> > > >> > mem_cgroup_css_offline(memcg) > > >> > > >> We should be safe if we did synchronize_rcu() before > > >root->dead_count++, > > >> no? 
> > >> Because then we would have a guarantee that if css_tryget(memcg) > > >> suceeded then we wouldn't race with dead_count++ it triggered. > > >> > > >> > root->dead_count++ > > >> > iter->last_dead_count = root->dead_count > > >> > iter->last_visited = memcg > > >> > // final > > >> > css_put(memcg); > > >> > // last_visited is still valid > > >> > rcu_read_unlock() > > >> > [...] > > >> > // next iteration > > >> > rcu_read_lock() > > >> > iter->last_dead_count == root->dead_count > > >> > // KABOOM > > > > > >Ohh I have missed that we took a reference on the current memcg which > > >will be stored into last_visited. And then later, during the next > > >iteration it will be still alive until we are done because previous > > >patch moved css_put to the very end. > > >So this race is not possible. I still need to think about parallel > > >iteration and a race with removal. > > > > I thought the whole point was to not have a reference in last_visited > > because have the iterator might be unused indefinitely :-) > > OK, it seems that I managed to confuse ;) > > > We only store a pointer and validate it before use the next time > > around. So I think the race is still possible, but we can deal with > > it by not losing concurrent dead count changes, i.e. one atomic read > > in the iterator function. > > All reads from root->dead_count are atomic already, so I am not sure > what you mean here. Anyway, I hope I won't make this even more confusing > if I post what I have right now: Yes, but we are doing two reads. Can't the memcg that we'll store in last_visited be offlined during this and be freed after we drop the rcu read lock? If we had just one read, we would detect this properly. 
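The difference between one read and two reads of dead_count can be replayed in a small single-threaded model. This is only a sketch under loud assumptions: hier_root, iter_cache, cache_store_snapshot() and cache_store_reread() are made-up names, memory barriers and RCU are left out, and a concurrent offlining is simulated by calling group_offline() between the two reads.

```c
#include <stdatomic.h>
#include <stddef.h>

struct group {
	int id;				/* stand-in for a memcg */
};

struct hier_root {
	atomic_ulong dead_count;	/* bumped on every offlining */
};

struct iter_cache {
	struct group *last_visited;	/* weak reference, no refcount held */
	unsigned long last_dead_count;	/* dead_count the pointer was cached under */
};

/* The single read taken at the start of an iteration. */
static unsigned long snapshot_dead_count(struct hier_root *root)
{
	return atomic_load(&root->dead_count);
}

/* Trust the cached pointer only if no offlining was recorded since. */
static struct group *cache_lookup(const struct iter_cache *it,
				  unsigned long snap)
{
	if (it->last_visited && it->last_dead_count == snap)
		return it->last_visited;
	return NULL;
}

/* One-read variant: store the value read at the start of the iteration. */
static void cache_store_snapshot(struct iter_cache *it, struct group *curr,
				 unsigned long snap)
{
	it->last_visited = curr;
	it->last_dead_count = snap;
}

/* Two-read variant: re-read the counter when caching the position. */
static void cache_store_reread(struct iter_cache *it, struct group *curr,
			       struct hier_root *root)
{
	it->last_visited = curr;
	it->last_dead_count = atomic_load(&root->dead_count);
}

/* Simulated offlining: only the counter update matters for this model. */
static void group_offline(struct hier_root *root)
{
	atomic_fetch_add(&root->dead_count, 1);
}
```

If group_offline() runs after snapshot_dead_count() but before the store, the snapshot variant leaves a stale count in the cache, so the next cache_lookup() returns NULL and the walk restarts. The re-read variant absorbs the increment and the next lookup trusts a pointer whose target may already be gone.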
> --- > >From 52121928be61282dc19e32179056615ffdf128a9 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Tue, 12 Feb 2013 18:08:26 +0100 > Subject: [PATCH] memcg: relax memcg iter caching > > Now that per-node-zone-priority iterator caches memory cgroups rather > than their css ids we have to be careful and remove them from the > iterator when they are on the way out otherwise they might hang for > unbounded amount of time (until the global/targeted reclaim triggers the > zone under priority to find out the group is dead and let it to find the > final rest). > > We can fix this issue by relaxing rules for the last_visited memcg as > well. > Instead of taking reference to css before it is stored into > iter->last_visited we can just store its pointer and track the number of > removed groups for each memcg. This number would be stored into iterator > everytime when a memcg is cached. If the iter count doesn't match the > curent walker root's one we will start over from the root again. The > group counter is incremented upwards the hierarchy every time a group is > removed. > > Locking rules got a bit complicated. We primarily rely on rcu read > lock which makes sure that once we see an up-to-date dead_count then > iter->last_visited is valid for RCU walk. smp_rmb makes sure that > dead_count is read before last_visited and last_dead_count while smp_wmb > makes sure that last_visited is updated before last_dead_count so the > up-to-date last_dead_count cannot point to an outdated last_visited. > css_tryget then makes sure that the last_visited is still alive. 
> > Spotted-by: Ying Han > Original-idea-by: Johannes Weiner > Signed-off-by: Michal Hocko > --- > mm/memcontrol.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++-------- > 1 file changed, 60 insertions(+), 9 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 727ec39..31bb9b0 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu { > }; > > struct mem_cgroup_reclaim_iter { > - /* last scanned hierarchy member with elevated css ref count */ > + /* > + * last scanned hierarchy member. Valid only if last_dead_count > + * matches memcg->dead_count of the hierarchy root group. > + */ > struct mem_cgroup *last_visited; > + unsigned int last_dead_count; Since we read and write this without a lock, I would feel more comfortable if this were a full word, i.e. unsigned long. That guarantees we don't see any partial states. > @@ -1156,17 +1162,36 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > int nid = zone_to_nid(reclaim->zone); > int zid = zone_idx(reclaim->zone); > struct mem_cgroup_per_zone *mz; > + unsigned int dead_count; > > mz = mem_cgroup_zoneinfo(root, nid, zid); > iter = &mz->reclaim_iter[reclaim->priority]; > - last_visited = iter->last_visited; > if (prev && reclaim->generation != iter->generation) { > - if (last_visited) { > - css_put(&last_visited->css); > - iter->last_visited = NULL; > - } > + iter->last_visited = NULL; > goto out_unlock; > } > + > + /* > + * If the dead_count mismatches, a destruction > + * has happened or is happening concurrently. > + * If the dead_count matches, a destruction > + * might still happen concurrently, but since > + * we checked under RCU, that destruction > + * won't free the object until we release the > + * RCU reader lock. Thus, the dead_count > + * check verifies the pointer is still valid, > + * css_tryget() verifies the cgroup pointed to > + * is alive. 
> + */ > + dead_count = atomic_read(&root->dead_count); > + smp_rmb(); > + last_visited = iter->last_visited; > + if (last_visited) { > + if ((dead_count != iter->last_dead_count) || > + !css_tryget(&last_visited->css)) { > + last_visited = NULL; > + } > + } > } > > /* > @@ -1206,10 +1231,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > if (css && !memcg) > curr = mem_cgroup_from_css(css); > > - /* make sure that the cached memcg is not removed */ > - if (curr) > - css_get(&curr->css); > iter->last_visited = curr; > + smp_wmb(); > + iter->last_dead_count = atomic_read(&root->dead_count); iter->last_dead_count = dead_count This way, we detect if curr is offlined between the first reading and the second reading. Otherwise, it could get freed when the reference is dropped and then last_visited points to invalid memory while the dead_count is uptodate. > @@ -6366,10 +6390,37 @@ free_out: > return ERR_PTR(error); > } > > +/* > + * Announce all parents that a group from their hierarchy is gone. > + */ > +static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *parent = memcg; > + > + /* > + * Make sure we are not racing with mem_cgroup_iter when it stores > + * a new iter->last_visited. Wait until that RCU finishes so that > + * it cannot see already incremented dead_count with memcg which > + * would be already dead next time but dead_count wouldn't tell > + * us about that. > + */ > + synchronize_rcu(); Ah, you are stabilizing the counter between the two reads. It's cheaper to just do one read instead. Saves the atomic op and saves the synchronization point :-) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
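The patch description quoted above says the removed-group counter is incremented up the hierarchy, so that every ancestor that may act as a reclaim root notices the removal. A rough user-space sketch of that walk follows. It assumes each node simply keeps a parent pointer. struct node and invalidate_ancestors() are hypothetical names, and because the body of the quoted function is cut off here, touching only the ancestors (and not the dying group itself) is an assumption based on the "Announce all parents" comment.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	struct node *parent;		/* NULL at the top of the hierarchy */
	atomic_ulong dead_count;	/* removals seen somewhere below this node */
};

/*
 * Tell every ancestor of the dying node that its subtree changed.
 * Iterators rooted at any of those ancestors will then see a
 * dead_count mismatch and drop their cached position.
 */
static void invalidate_ancestors(struct node *gone)
{
	for (struct node *p = gone->parent; p != NULL; p = p->parent)
		atomic_fetch_add(&p->dead_count, 1);
}
```

Unrelated subtrees keep their counters unchanged in this model, so only iterators whose root is an ancestor of the removed node lose their cached position.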
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx107.postini.com [74.125.245.107]) by kanga.kvack.org (Postfix) with SMTP id B57206B0007 for ; Tue, 12 Feb 2013 12:50:06 -0500 (EST) Received: from /spool/local by e9.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Tue, 12 Feb 2013 12:50:05 -0500 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by d01dlp01.pok.ibm.com (Postfix) with ESMTP id 12D5A38CB16B for ; Tue, 12 Feb 2013 11:29:29 -0500 (EST) Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r1CGTP6h039264 for ; Tue, 12 Feb 2013 11:29:26 -0500 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r1CGTKob007290 for ; Tue, 12 Feb 2013 09:29:21 -0700 Date: Tue, 12 Feb 2013 08:10:51 -0800 From: "Paul E. 
McKenney" Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212161051.GQ2666@linux.vnet.ibm.com> Reply-To: paulmck@linux.vnet.ibm.com References: <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212154330.GG4863@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: Johannes Weiner , linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan On Tue, Feb 12, 2013 at 04:43:30PM +0100, Michal Hocko wrote: > On Tue 12-02-13 10:10:02, Johannes Weiner wrote: > > On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote: > > > On Mon 11-02-13 17:39:43, Johannes Weiner wrote: > > > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote: > > > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote: > > > > > > That way, if the dead count gives the go-ahead, you KNOW that the > > > > > > position cache is valid, because it has been updated first. > > > > > > > > > > OK, you are right. We can live without css_tryget because dead_count is > > > > > either OK which means that css would be alive at least this rcu period > > > > > (and RCU walk would be safe as well) or it is incremented which means > > > > > that we have started css_offline already and then css is dead already. > > > > > So css_tryget can be dropped. > > > > > > > > Not quite :) > > > > > > > > The dead_count check is for completed destructions, > > > > > > Not quite :P. 
dead_count is incremented in css_offline callback which is > > > called before the cgroup core releases its last reference and unlinks > > > the group from the siblinks. css_tryget would already fail at this stage > > > because CSS_DEACT_BIAS is in place at that time but this doesn't break > > > RCU walk. So I think we are safe even without css_get. > > > > But you drop the RCU lock before you return. > > > > dead_count IS incremented for every destruction, but it's not reliable > > for concurrent ones, is what I meant. Again, if there is a dead_count > > mismatch, your pointer might be dangling, easy case. However, even if > > there is no mismatch, you could still race with a destruction that has > > marked the object dead, and then frees it once you drop the RCU lock, > > so you need try_get() to check if the object is dead, or you could > > return a pointer to freed or soon to be freed memory. > > Wait a moment. But what prevents from the following race? > > rcu_read_lock() > mem_cgroup_css_offline(memcg) > root->dead_count++ > iter->last_dead_count = root->dead_count > iter->last_visited = memcg > // final > css_put(memcg); > // last_visited is still valid > rcu_read_unlock() > [...] > // next iteration > rcu_read_lock() > iter->last_dead_count == root->dead_count > // KABOOM > > The race window between dead_count++ and css_put is quite big but that > is not important because that css_put can happen anytime before we start > the next iteration and take rcu_read_lock. The usual approach is to make sure that there is a grace period (either synchronize_rcu() or call_rcu()) between the time that the data is made inaccessible to readers (this would be mem_cgroup_css_offline()?) and the time it is freed (css_put(), correct?). Thanx, Paul -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx191.postini.com [74.125.245.191]) by kanga.kvack.org (Postfix) with SMTP id E6CD26B0007 for ; Tue, 12 Feb 2013 12:56:57 -0500 (EST)
Date: Tue, 12 Feb 2013 18:56:51 +0100
From: Michal Hocko
Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
Message-ID: <20130212175651.GC17663@dhcp22.suse.cz>
References: <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161051.GQ2666@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130212161051.GQ2666@linux.vnet.ibm.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: "Paul E. McKenney"
Cc: Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo, Glauber Costa, Li Zefan

On Tue 12-02-13 08:10:51, Paul E. McKenney wrote:
> On Tue, Feb 12, 2013 at 04:43:30PM +0100, Michal Hocko wrote:
> > On Tue 12-02-13 10:10:02, Johannes Weiner wrote:
> > > On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote:
> > > > On Mon 11-02-13 17:39:43, Johannes Weiner wrote:
> > > > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote:
> > > > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote:
> > > > > > > That way, if the dead count gives the go-ahead, you KNOW that the
> > > > > > > position cache is valid, because it has been updated first.
> > > > > >
> > > > > > OK, you are right. We can live without css_tryget because dead_count is
> > > > > > either OK which means that css would be alive at least this rcu period
> > > > > > (and RCU walk would be safe as well) or it is incremented which means
> > > > > > that we have started css_offline already and then css is dead already.
> > > > > > So css_tryget can be dropped.
> > > > >
> > > > > Not quite :)
> > > > >
> > > > > The dead_count check is for completed destructions,
> > > >
> > > > Not quite :P. dead_count is incremented in css_offline callback which is
> > > > called before the cgroup core releases its last reference and unlinks
> > > > the group from the siblings. css_tryget would already fail at this stage
> > > > because CSS_DEACT_BIAS is in place at that time but this doesn't break
> > > > RCU walk. So I think we are safe even without css_get.
> > >
> > > But you drop the RCU lock before you return.
> > >
> > > dead_count IS incremented for every destruction, but it's not reliable
> > > for concurrent ones, is what I meant. Again, if there is a dead_count
> > > mismatch, your pointer might be dangling, easy case. However, even if
> > > there is no mismatch, you could still race with a destruction that has
> > > marked the object dead, and then frees it once you drop the RCU lock,
> > > so you need try_get() to check if the object is dead, or you could
> > > return a pointer to freed or soon to be freed memory.
> >
> > Wait a moment. But what prevents the following race?
> >
> > rcu_read_lock()
> >                                                 mem_cgroup_css_offline(memcg)
> >                                                 root->dead_count++
> > iter->last_dead_count = root->dead_count
> > iter->last_visited = memcg
> >                                                 // final
> >                                                 css_put(memcg);
> > // last_visited is still valid
> > rcu_read_unlock()
> > [...]
> > // next iteration
> > rcu_read_lock()
> > iter->last_dead_count == root->dead_count
> > // KABOOM
> >
> > The race window between dead_count++ and css_put is quite big but that
> > is not important because that css_put can happen anytime before we start
> > the next iteration and take rcu_read_lock.
>
> The usual approach is to make sure that there is a grace period (either
> synchronize_rcu() or call_rcu()) between the time that the data is
> made inaccessible to readers (this would be mem_cgroup_css_offline()?)
> and the time it is freed (css_put(), correct?).

Yes, that was my suggestion and I put it before dead_count is
incremented down the mem_cgroup_css_offline road.

Johannes still thinks we can do without it if we reduce the number of
atomic_read(dead_count) calls, which sounds like a way to go, but I
would rather think about it with a fresh head tomorrow.

Anyway, thanks for jumping in. Earth is always a bit shaky when all the
barriers and rcu mix together.
--
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx129.postini.com [74.125.245.129]) by kanga.kvack.org (Postfix) with SMTP id F2AB16B0002 for ; Tue, 12 Feb 2013 14:54:17 -0500 (EST)
Date: Tue, 12 Feb 2013 14:53:58 -0500
From: Johannes Weiner
Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
Message-ID: <20130212195358.GE25235@cmpxchg.org>
References: <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161051.GQ2666@linux.vnet.ibm.com> <20130212172526.GC25235@cmpxchg.org> <20130212183148.GW2666@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130212183148.GW2666@linux.vnet.ibm.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: "Paul E. McKenney"
Cc: Michal Hocko, linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo, Glauber Costa, Li Zefan

On Tue, Feb 12, 2013 at 10:31:48AM -0800, Paul E. McKenney wrote:
> On Tue, Feb 12, 2013 at 12:25:26PM -0500, Johannes Weiner wrote:
> > On Tue, Feb 12, 2013 at 08:10:51AM -0800, Paul E. McKenney wrote:
> > > On Tue, Feb 12, 2013 at 04:43:30PM +0100, Michal Hocko wrote:
> > > > On Tue 12-02-13 10:10:02, Johannes Weiner wrote:
> > > > > On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote:
> > > > > > On Mon 11-02-13 17:39:43, Johannes Weiner wrote:
> > > > > > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote:
> > > > > > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote:
> > > > > > > > > That way, if the dead count gives the go-ahead, you KNOW that the
> > > > > > > > > position cache is valid, because it has been updated first.
> > > > > > > >
> > > > > > > > OK, you are right. We can live without css_tryget because dead_count is
> > > > > > > > either OK which means that css would be alive at least this rcu period
> > > > > > > > (and RCU walk would be safe as well) or it is incremented which means
> > > > > > > > that we have started css_offline already and then css is dead already.
> > > > > > > > So css_tryget can be dropped.
> > > > > > >
> > > > > > > Not quite :)
> > > > > > >
> > > > > > > The dead_count check is for completed destructions,
> > > > > >
> > > > > > Not quite :P. dead_count is incremented in css_offline callback which is
> > > > > > called before the cgroup core releases its last reference and unlinks
> > > > > > the group from the siblings. css_tryget would already fail at this stage
> > > > > > because CSS_DEACT_BIAS is in place at that time but this doesn't break
> > > > > > RCU walk. So I think we are safe even without css_get.
> > > > >
> > > > > But you drop the RCU lock before you return.
> > > > >
> > > > > dead_count IS incremented for every destruction, but it's not reliable
> > > > > for concurrent ones, is what I meant. Again, if there is a dead_count
> > > > > mismatch, your pointer might be dangling, easy case. However, even if
> > > > > there is no mismatch, you could still race with a destruction that has
> > > > > marked the object dead, and then frees it once you drop the RCU lock,
> > > > > so you need try_get() to check if the object is dead, or you could
> > > > > return a pointer to freed or soon to be freed memory.
> > > >
> > > > Wait a moment. But what prevents the following race?
> > > >
> > > > rcu_read_lock()
> > > >                                                 mem_cgroup_css_offline(memcg)
> > > >                                                 root->dead_count++
> > > > iter->last_dead_count = root->dead_count
> > > > iter->last_visited = memcg
> > > >                                                 // final
> > > >                                                 css_put(memcg);
> > > > // last_visited is still valid
> > > > rcu_read_unlock()
> > > > [...]
> > > > // next iteration
> > > > rcu_read_lock()
> > > > iter->last_dead_count == root->dead_count
> > > > // KABOOM
> > > >
> > > > The race window between dead_count++ and css_put is quite big but that
> > > > is not important because that css_put can happen anytime before we start
> > > > the next iteration and take rcu_read_lock.
> > >
> > > The usual approach is to make sure that there is a grace period (either
> > > synchronize_rcu() or call_rcu()) between the time that the data is
> > > made inaccessible to readers (this would be mem_cgroup_css_offline()?)
> > > and the time it is freed (css_put(), correct?).
> >
> > Absolutely! And there is a synchronize_rcu() in between those two
> > operations.
> >
> > However, we want to keep a weak reference to the cgroup after we drop
> > the rcu read-side lock, so rcu alone is not enough for us to guarantee
> > object life time. We still have to carefully detect any concurrent
> > offlinings in order to validate the weak reference next time around.
>
> That would make things more interesting. ;-)
>
> Exactly who or what holds the weak reference? And the idea is that if
> you attempt to use the weak reference beforehand, the css_put() does not
> actually free it, but if you attempt to use it afterwards, you get some
> sort of failure indication?

Yes, exactly. We are using a seqlock-style cookie comparison to see
whether any object in the pool of objects that we may point to was
destroyed. We are having trouble agreeing on how to safely read the
counter :-)

Long version:

It's an iterator over a hierarchy of cgroups, but page reclaim may
stop iteration at will and might not come back for an indefinite
amount of time (until memory pressure triggers reclaim again). So we
want to allow cgroups to be destroyed while one of the iterators may
still be pointing at them (we have iterators per-node, per-zone, per
reclaim priority level, that's why it's not feasible to invalidate
them pro-actively upon cgroup destruction).

The idea is that we have a counter that counts cgroup destructions in
each cgroup hierarchy and we remember a snapshot of that counter at
the time we remember the iterator position. If any group in that
group's hierarchy gets killed before we come back to the iterator, the
counter mismatches. Easy. If any group is getting killed
concurrently, the counter might match our cookie, but the object could
be marked dead already, while rcu prevents it from being freed.

The remaining worry is/was that we have two reads of the destruction
counter: one when validating the weak reference, another one when
updating the iterator. If a destruction starts in between those two,
and modifies the counter, we would miss that destruction and the
object that is now weakly referenced could get freed while the
corresponding snapshot matches the latest value of the destruction
counter.

Michal's idea was to hold off the destruction counter inc between
those reads with synchronize_rcu(). My idea was to simply read the
counter only once and use that same value to both check and update the
iterator. That should catch this type of race condition and save
the atomic & the extra synchronize_rcu(). At least I fail to see the
downside of reading it only once:

iteration:
rcu_read_lock()
dead_count = atomic_read(&hierarchy->dead_count)
smp_rmb()
previous = iterator->position
if (iterator->dead_count != dead_count)
	/* A cgroup in our hierarchy was killed, pointer might be dangling */
	don't use iterator
if (!tryget(&previous))
	/* The cgroup is marked dead, don't use it */
	don't use iterator
next = find_next_and_tryget(hierarchy, &previous)
/* what happens if destruction of next starts NOW? */
css_put(previous)
iterator->position = next
smp_wmb()
iterator->dead_count = dead_count /* my suggestion, instead of a second atomic_read() */
rcu_read_unlock()
return next /* caller drops ref eventually, iterator->cgroup becomes weak */

destruction:
bias(cgroup->refcount) /* disables future tryget */
//synchronize_rcu() /* Michal's suggestion */
atomic_inc(&cgroup->hierarchy->dead_count)
synchronize_rcu()
free(cgroup)

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx121.postini.com [74.125.245.121]) by kanga.kvack.org (Postfix) with SMTP id 523EA6B0005 for ; Tue, 12 Feb 2013 17:23:40 -0500 (EST)
Received: from /spool/local by e7.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Tue, 12 Feb 2013 17:23:39 -0500
Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by d01dlp03.pok.ibm.com (Postfix) with ESMTP id 4C4203C6149F for ; Tue, 12 Feb 2013 13:39:21 -0500 (EST)
Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r1CIdJTZ245338 for ; Tue, 12 Feb 2013 13:39:19 -0500
Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r1CId59E001814 for ; Tue, 12 Feb 2013 11:39:06 -0700
Date: Tue, 12 Feb 2013 10:31:48 -0800
From: "Paul E.
 McKenney"
Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
Message-ID: <20130212183148.GW2666@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161051.GQ2666@linux.vnet.ibm.com> <20130212172526.GC25235@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130212172526.GC25235@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Johannes Weiner
Cc: Michal Hocko, linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo, Glauber Costa, Li Zefan

On Tue, Feb 12, 2013 at 12:25:26PM -0500, Johannes Weiner wrote:
> On Tue, Feb 12, 2013 at 08:10:51AM -0800, Paul E. McKenney wrote:
> > On Tue, Feb 12, 2013 at 04:43:30PM +0100, Michal Hocko wrote:
> > > On Tue 12-02-13 10:10:02, Johannes Weiner wrote:
> > > > On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote:
> > > > > On Mon 11-02-13 17:39:43, Johannes Weiner wrote:
> > > > > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote:
> > > > > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote:
> > > > > > > > That way, if the dead count gives the go-ahead, you KNOW that the
> > > > > > > > position cache is valid, because it has been updated first.
> > > > > > >
> > > > > > > OK, you are right. We can live without css_tryget because dead_count is
> > > > > > > either OK which means that css would be alive at least this rcu period
> > > > > > > (and RCU walk would be safe as well) or it is incremented which means
> > > > > > > that we have started css_offline already and then css is dead already.
> > > > > > > So css_tryget can be dropped.
> > > > > >
> > > > > > Not quite :)
> > > > > >
> > > > > > The dead_count check is for completed destructions,
> > > > >
> > > > > Not quite :P. dead_count is incremented in css_offline callback which is
> > > > > called before the cgroup core releases its last reference and unlinks
> > > > > the group from the siblings. css_tryget would already fail at this stage
> > > > > because CSS_DEACT_BIAS is in place at that time but this doesn't break
> > > > > RCU walk. So I think we are safe even without css_get.
> > > >
> > > > But you drop the RCU lock before you return.
> > > >
> > > > dead_count IS incremented for every destruction, but it's not reliable
> > > > for concurrent ones, is what I meant. Again, if there is a dead_count
> > > > mismatch, your pointer might be dangling, easy case. However, even if
> > > > there is no mismatch, you could still race with a destruction that has
> > > > marked the object dead, and then frees it once you drop the RCU lock,
> > > > so you need try_get() to check if the object is dead, or you could
> > > > return a pointer to freed or soon to be freed memory.
> > >
> > > Wait a moment. But what prevents the following race?
> > >
> > > rcu_read_lock()
> > >                                                 mem_cgroup_css_offline(memcg)
> > >                                                 root->dead_count++
> > > iter->last_dead_count = root->dead_count
> > > iter->last_visited = memcg
> > >                                                 // final
> > >                                                 css_put(memcg);
> > > // last_visited is still valid
> > > rcu_read_unlock()
> > > [...]
> > > // next iteration
> > > rcu_read_lock()
> > > iter->last_dead_count == root->dead_count
> > > // KABOOM
> > >
> > > The race window between dead_count++ and css_put is quite big but that
> > > is not important because that css_put can happen anytime before we start
> > > the next iteration and take rcu_read_lock.
> >
> > The usual approach is to make sure that there is a grace period (either
> > synchronize_rcu() or call_rcu()) between the time that the data is
> > made inaccessible to readers (this would be mem_cgroup_css_offline()?)
> > and the time it is freed (css_put(), correct?).
>
> Absolutely! And there is a synchronize_rcu() in between those two
> operations.
>
> However, we want to keep a weak reference to the cgroup after we drop
> the rcu read-side lock, so rcu alone is not enough for us to guarantee
> object life time. We still have to carefully detect any concurrent
> offlinings in order to validate the weak reference next time around.

That would make things more interesting. ;-)

Exactly who or what holds the weak reference? And the idea is that if
you attempt to use the weak reference beforehand, the css_put() does not
actually free it, but if you attempt to use it afterwards, you get some
sort of failure indication?

							Thanx, Paul
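The weak-reference scheme discussed in this thread (a destruction counter whose snapshot is stored next to the cached pointer, read only once per iteration, plus a tryget on the cached object) can be sketched in userspace C11. Everything here is a hypothetical stand-in, not code from mm/memcontrol.c: `struct group`, `struct hierarchy`, `struct iter`, `DEACT_BIAS` and all function names are assumptions made for illustration. The sketch shows the validation logic only. It leaves out the RCU read-side section and the memory barriers that make dereferencing the cached pointer safe in the kernel, so it must only be used with objects that are never freed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names exist in the kernel. */
struct group {
	atomic_int refcount;		/* <= 0 once destruction biased it */
};

struct hierarchy {
	atomic_ulong dead_count;	/* bumped once per destroyed group */
};

struct iter {
	struct group *position;		/* weak reference, may be stale */
	unsigned long dead_count;	/* snapshot stored with position */
};

#define DEACT_BIAS (-1000000)		/* assumed bias, large and negative */

/* Take a reference unless destruction has already started. */
static bool group_tryget(struct group *g)
{
	int v = atomic_load(&g->refcount);

	assert(g != NULL);
	while (v > 0) {
		/* On failure v is reloaded and the check is repeated. */
		if (atomic_compare_exchange_weak(&g->refcount, &v, v + 1))
			return true;
	}
	return false;
}

static void group_put(struct group *g)
{
	atomic_fetch_sub(&g->refcount, 1);
}

/* Destruction, step 1: make every future tryget fail. */
static void group_bias(struct group *g)
{
	atomic_fetch_add(&g->refcount, DEACT_BIAS);
}

/* Destruction, step 2: tell all iterators that something died. */
static void hierarchy_note_death(struct hierarchy *h)
{
	atomic_fetch_add(&h->dead_count, 1);
}

/*
 * Validate the cached position. Returns it with a reference held, or
 * NULL if the cache must not be trusted. *seq receives the single read
 * of the counter, to be reused by iter_store().
 */
static struct group *iter_load(struct hierarchy *h, struct iter *it,
			       unsigned long *seq)
{
	*seq = atomic_load(&h->dead_count);
	if (!it->position || it->dead_count != *seq)
		return NULL;	/* a group died: pointer may dangle */
	if (!group_tryget(it->position))
		return NULL;	/* this group is being destroyed */
	return it->position;
}

/* Remember the new position together with the value read earlier. */
static void iter_store(struct iter *it, struct group *next,
		       unsigned long seq)
{
	it->position = next;
	it->dead_count = seq;	/* no second read of the counter */
}
```

Because `iter_store()` reuses the value read in `iter_load()`, a destruction that bumps the counter between the two calls leaves a stale snapshot behind, and the next `iter_load()` rejects the cached pointer. Re-reading the counter in `iter_store()` would hide exactly that destruction.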

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx184.postini.com [74.125.245.184]) by kanga.kvack.org (Postfix) with SMTP id DCCE06B0005 for ; Wed, 13 Feb 2013 03:11:45 -0500 (EST)
Message-ID: <511B4ACF.90209@parallels.com>
Date: Wed, 13 Feb 2013 12:11:59 +0400
From: Glauber Costa
MIME-Version: 1.0
Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
References: <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161332.GI4863@dhcp22.suse.cz> <20130212162442.GJ4863@dhcp22.suse.cz> <63d3b5fa-dbc6-4bc9-8867-f9961e644305@email.android.com> <20130212171216.GA17663@dhcp22.suse.cz> <20130212173741.GD25235@cmpxchg.org>
In-Reply-To: <20130212173741.GD25235@cmpxchg.org>
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Johannes Weiner
Cc: Michal Hocko, linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo, Li Zefan

On 02/12/2013 09:37 PM, Johannes Weiner wrote:
>> > All reads from root->dead_count are atomic already, so I am not sure
>> > what you mean here. Anyway, I hope I won't make this even more confusing
>> > if I post what I have right now:
> Yes, but we are doing two reads. Can't the memcg that we'll store in
> last_visited be offlined during this and be freed after we drop the
> rcu read lock? If we had just one read, we would detect this
> properly.
>

I don't want to add any more confusion to an already fun discussion, but
IIUC, you are trying to avoid triggering a second round of reclaim in an
already dead memcg, right?

Can't you generalize the mechanism I use for kmemcg, where a very
similar problem exists? This is what it looks like:

	/* this atomically sets a bit in the memcg. It does so
	 * unconditionally, and it is (so far) okay if it is set
	 * twice
	 */
	memcg_kmem_mark_dead(memcg);

	/*
	 * Then if kmem charges are not zero, we don't actually destroy the
	 * memcg. The function where it lives will always be called when usage
	 * reaches 0, so we guarantee that we will never miss the chance to
	 * call the destruction function at least once.
	 *
	 * I suspect you could use a mechanism like this, or extend
	 * this very same, to prevent the second reclaim from even being called
	 */
	if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
		return;

	/*
	 * this is how we guarantee that the destruction function is called at
	 * most once. The second caller would see the bit unset.
	 */
	if (memcg_kmem_test_and_clear_dead(memcg))
		mem_cgroup_put(memcg);

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx188.postini.com [74.125.245.188]) by kanga.kvack.org (Postfix) with SMTP id 55B786B0005 for ; Wed, 13 Feb 2013 04:51:17 -0500 (EST)
Date: Wed, 13 Feb 2013 10:51:13 +0100
From: Michal Hocko
Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
Message-ID: <20130213095113.GA23562@dhcp22.suse.cz>
References: <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161051.GQ2666@linux.vnet.ibm.com> <20130212172526.GC25235@cmpxchg.org> <20130212183148.GW2666@linux.vnet.ibm.com> <20130212195358.GE25235@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130212195358.GE25235@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Johannes Weiner
Cc: "Paul E.
 McKenney", linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo, Glauber Costa, Li Zefan

On Tue 12-02-13 14:53:58, Johannes Weiner wrote:
[...]
> iteration:
> rcu_read_lock()
> dead_count = atomic_read(&hierarchy->dead_count)
> smp_rmb()
> previous = iterator->position
> if (iterator->dead_count != dead_count)
> 	/* A cgroup in our hierarchy was killed, pointer might be dangling */
> 	don't use iterator
> if (!tryget(&previous))
> 	/* The cgroup is marked dead, don't use it */
> 	don't use iterator
> next = find_next_and_tryget(hierarchy, &previous)
> /* what happens if destruction of next starts NOW? */

OK, I thought that this depends on the ordering of CSS_DEACT_BIAS and
dead_count writes - because there is no memory ordering enforced between
those two. But it shouldn't matter because we are checking both. If the
increment is seen sooner then we do not care about css_tryget and if css
is deactivated before dead_count++ then the css_tryget would shout.

A more interesting ordering, however, is dead_count++ vs. css_put from
cgroup core. Say we have the following:

CPU0                            CPU1                    CPU2

iter->position = A;
iter->dead_count = dead_count;
rcu_read_unlock()
return A

mem_cgroup_iter_break
  css_put(A)                                            bias(A)
                                                        css_offline()
                                                        css_put(A) // in cgroup_destroy_locked
                                                                   // last ref and A will be freed
                                rcu_read_lock()
                                read parent->dead_count
                                                        parent->dead_count++ // got reordered from css_offline
                                css_tryget(A) // kaboom

The reordering window is really huge and I think it is impossible to
trigger in real life. And mem_cgroup_reparent_charges calls
mem_cgroup_start_move unconditionally which in turn calls
synchronize_rcu() which is a full barrier AFAIU so dead_count++ cannot
be reordered ATM.
But should we rely on that? Shouldn't we add smp_wmb after dead_count++
as I had in an earlier version of the patch?

> css_put(previous)
> iterator->position = next
> smp_wmb()
> iterator->dead_count = dead_count /* my suggestion, instead of a second atomic_read() */
> rcu_read_unlock()
> return next /* caller drops ref eventually, iterator->cgroup becomes weak */
>
> destruction:
> bias(cgroup->refcount) /* disables future tryget */
> //synchronize_rcu() /* Michal's suggestion */
> atomic_inc(&cgroup->hierarchy->dead_count)
> synchronize_rcu()
> free(cgroup)

Other than that this should work. I will update the patch accordingly.

Thanks!
--
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx192.postini.com [74.125.245.192]) by kanga.kvack.org (Postfix) with SMTP id B670C6B0005 for ; Wed, 13 Feb 2013 05:35:02 -0500 (EST)
Date: Wed, 13 Feb 2013 11:34:59 +0100
From: Michal Hocko
Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
Message-ID: <20130213103459.GB23562@dhcp22.suse.cz>
References: <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161332.GI4863@dhcp22.suse.cz> <20130212162442.GJ4863@dhcp22.suse.cz> <63d3b5fa-dbc6-4bc9-8867-f9961e644305@email.android.com> <20130212171216.GA17663@dhcp22.suse.cz> <20130212173741.GD25235@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130212173741.GD25235@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Johannes Weiner
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo, Glauber Costa, Li Zefan

On Tue 12-02-13 12:37:41, Johannes Weiner wrote:
> On Tue, Feb 12, 2013 at 06:12:16PM +0100, Michal Hocko wrote:
[...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 727ec39..31bb9b0 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu {
> >  };
> >
> >  struct mem_cgroup_reclaim_iter {
> > -	/* last scanned hierarchy member with elevated css ref count */
> > +	/*
> > +	 * last scanned hierarchy member. Valid only if last_dead_count
> > +	 * matches memcg->dead_count of the hierarchy root group.
> > +	 */
> >  	struct mem_cgroup *last_visited;
> > +	unsigned int last_dead_count;
>
> Since we read and write this without a lock, I would feel more
> comfortable if this were a full word, i.e. unsigned long. That
> guarantees we don't see any partial states.

OK. Changed. Although I thought that int is read/modified atomically as
well if it is aligned to its size.
--
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx105.postini.com [74.125.245.105]) by kanga.kvack.org (Postfix) with SMTP id C2B8D6B0005 for ; Wed, 13 Feb 2013 05:38:14 -0500 (EST)
Date: Wed, 13 Feb 2013 11:38:11 +0100
From: Michal Hocko
Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
Message-ID: <20130213103811.GC23562@dhcp22.suse.cz>
References: <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161332.GI4863@dhcp22.suse.cz> <20130212162442.GJ4863@dhcp22.suse.cz> <63d3b5fa-dbc6-4bc9-8867-f9961e644305@email.android.com> <20130212171216.GA17663@dhcp22.suse.cz> <20130212173741.GD25235@cmpxchg.org> <511B4ACF.90209@parallels.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <511B4ACF.90209@parallels.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Glauber Costa
Cc: Johannes Weiner, linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo, Li Zefan

On Wed 13-02-13 12:11:59, Glauber Costa wrote:
> On 02/12/2013 09:37 PM, Johannes Weiner wrote:
> >> > All reads from root->dead_count are atomic already, so I am not sure
> >> > what you mean here. Anyway, I hope I won't make this even more confusing
> >> > if I post what I have right now:
> > Yes, but we are doing two reads. Can't the memcg that we'll store in
> > last_visited be offlined during this and be freed after we drop the
> > rcu read lock? If we had just one read, we would detect this
> > properly.
> >
>
> I don't want to add any more confusion to an already fun discussion, but
> IIUC, you are trying to avoid triggering a second round of reclaim in an
> already dead memcg, right?

No, this is not about the second round of the reclaim but rather about
iteration racing with removal. And we want to make it as lightweight as
possible. We cannot work with memcg directly because it might have
disappeared in the meantime and we do not want to hold a reference on it
because there would be no guarantee somebody will release it later on.
So mark_dead && test_and_clear_dead would not work in this context.

[...]
--
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx112.postini.com [74.125.245.112]) by kanga.kvack.org (Postfix) with SMTP id F1A4A6B0005 for ; Wed, 13 Feb 2013 07:56:22 -0500 (EST)
Date: Wed, 13 Feb 2013 13:56:17 +0100
From: Michal Hocko
Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
Message-ID: <20130213125617.GD23562@dhcp22.suse.cz>
References: <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161332.GI4863@dhcp22.suse.cz> <20130212162442.GJ4863@dhcp22.suse.cz> <63d3b5fa-dbc6-4bc9-8867-f9961e644305@email.android.com> <20130212171216.GA17663@dhcp22.suse.cz> <20130212173741.GD25235@cmpxchg.org> <20130213103459.GB23562@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130213103459.GB23562@dhcp22.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Johannes Weiner
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo, Glauber Costa, Li Zefan

On Wed 13-02-13 11:34:59, Michal Hocko wrote:
> On Tue 12-02-13 12:37:41, Johannes Weiner wrote:
> > On Tue, Feb 12, 2013 at 06:12:16PM +0100, Michal Hocko wrote:
> [...]
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 727ec39..31bb9b0 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu {
> > >  };
> > >
> > >  struct mem_cgroup_reclaim_iter {
> > > -	/* last scanned hierarchy member with elevated css ref count */
> > > +	/*
> > > +	 * last scanned hierarchy member. Valid only if last_dead_count
> > > +	 * matches memcg->dead_count of the hierarchy root group.
> > > +	 */
> > >  	struct mem_cgroup *last_visited;
> > > +	unsigned int last_dead_count;
> >
> > Since we read and write this without a lock, I would feel more
> > comfortable if this were a full word, i.e. unsigned long. That
> > guarantees we don't see any partial states.
>
> OK. Changed. Although I thought that int is read/modified atomically as
> well if it is aligned to its size.

Ohh, I guess I see what your concern was. If last_dead_count was an int
then it would fit into the same full word slot with generation and so
the parallel read-modify-write cycle could be an issue.
--
Michal Hocko
SUSE Labs
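The layout concern raised above can be checked with `offsetof`. The sketch below is an assumption-laden illustration rather than the kernel structure: it presumes that `generation` is an `unsigned int` placed directly after `last_dead_count`, and `struct iter_int`, `struct iter_long` and `WORD_OF` are hypothetical names. It only reports which long-sized slot each field lands in. Whether two fields in one word can actually disturb each other depends on the architecture and is not demonstrated here.

```c
#include <assert.h>
#include <stddef.h>

struct group;	/* opaque stand-in for struct mem_cgroup */

/* Variant from the posted diff: a 32-bit snapshot. */
struct iter_int {
	struct group *last_visited;
	unsigned int last_dead_count;
	unsigned int generation;	/* assumed to follow directly */
};

/* Variant requested in the review: a full-word snapshot. */
struct iter_long {
	struct group *last_visited;
	unsigned long last_dead_count;
	unsigned int generation;	/* assumed to follow directly */
};

/* Index of the long-sized slot that contains the start of a field. */
#define WORD_OF(type, field) (offsetof(type, field) / sizeof(long))

/* Returns 1 if the int snapshot shares its word with generation. */
static int shares_word_int(void)
{
	return WORD_OF(struct iter_int, last_dead_count) ==
	       WORD_OF(struct iter_int, generation);
}

/* Returns 1 if the long snapshot shares its word with generation. */
static int shares_word_long(void)
{
	return WORD_OF(struct iter_long, last_dead_count) ==
	       WORD_OF(struct iter_long, generation);
}
```

On a typical LP64 target the `unsigned int` variant places both fields in the same 8-byte slot, while the `unsigned long` variant gives the snapshot a word of its own.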
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753766Ab3ACRyo (ORCPT ); Thu, 3 Jan 2013 12:54:44 -0500
Received: from cantor2.suse.de ([195.135.220.15]:38359 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753725Ab3ACRyn (ORCPT ); Thu, 3 Jan 2013 12:54:43 -0500
From: Michal Hocko
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, Tejun Heo, Glauber Costa, Li Zefan
Subject: [PATCH v3 1/7] memcg: synchronize per-zone iterator access by a spinlock
Date: Thu, 3 Jan 2013 18:54:15 +0100
Message-Id: <1357235661-29564-2-git-send-email-mhocko@suse.cz>
X-Mailer: git-send-email 1.7.10.4
In-Reply-To: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

per-zone per-priority iterator is aimed at coordinating concurrent
reclaimers on the same hierarchy (or the global reclaim when all
groups are reclaimed) so that all groups get reclaimed evenly as
much as possible. iter->position holds the last css->id visited
and iter->generation signals the completed tree walk (when it is
incremented).
Concurrent reclaimers are supposed to provide a reclaim cookie which
holds the reclaim priority and the last generation they saw.
If the cookie's generation doesn't match the iterator's view then
another concurrent reclaimer already did the job and the tree walk is
done for that priority.

This scheme works nicely in most cases but it is not race-free. Two
racing reclaimers can see the same iter->position and so bang on the
same group. The iter->generation increment is not serialized either, so
a reclaimer can see an updated iter->position with an old generation and
the iteration might be restarted from the root of the hierarchy.

The simplest way to fix this issue is to synchronise access to the
iterator by a lock. This implementation uses a per-zone per-priority
spinlock which linearizes only directly racing reclaimers which use
reclaim cookies, so the effect of the new locking should be really
minimal.

I have to note that I haven't seen this as a real issue so far. The
primary motivation for the change is different. The following patch
will change how the iterator is implemented: css->id iteration will be
replaced by generic cgroup iteration, which requires storing a
mem_cgroup pointer in the iterator. That requires reference counting,
and so concurrent access will be a problem.
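The cookie and generation scheme described above can be sketched in user space as follows. This is a simplified model, not the kernel code: struct reclaim_iter, struct reclaim_cookie, iter_next() and the lock helpers are hypothetical, a C11 atomic_flag stands in for the spinlock, and groups are plain ids 1..nr_groups. Unlike mem_cgroup_iter, the model does not restart the walk for a fresh caller that arrives exactly at the end of a round.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of one shared per-zone, per-priority iterator. */
struct reclaim_iter {
	int position;            /* last id handed out, 0 = nothing yet */
	unsigned int generation; /* bumped each time a full walk completes */
	atomic_flag lock;        /* stand-in for iter_lock */
};

/* State each reclaimer carries between calls. */
struct reclaim_cookie {
	unsigned int generation;
};

static void iter_lock(struct reclaim_iter *iter)
{
	while (atomic_flag_test_and_set_explicit(&iter->lock,
						 memory_order_acquire))
		; /* spin until the holder clears the flag */
}

static void iter_unlock(struct reclaim_iter *iter)
{
	atomic_flag_clear_explicit(&iter->lock, memory_order_release);
}

/*
 * Hand out the next id in 1..nr_groups, or 0 once the shared walk is over
 * for this caller. first_call plays the role of prev == NULL.
 */
static int iter_next(struct reclaim_iter *iter, struct reclaim_cookie *cookie,
		     int first_call, int nr_groups)
{
	int id = 0;

	iter_lock(iter);
	if (!first_call && cookie->generation != iter->generation)
		goto out;	/* somebody else already finished this round */

	if (iter->position < nr_groups) {
		id = ++iter->position;
		if (first_call)
			cookie->generation = iter->generation;
	} else {
		/* walk complete: reset and signal it via the generation */
		iter->position = 0;
		iter->generation++;
	}
out:
	iter_unlock(iter);
	return id;
}
```

Because position and generation are only read and updated under the lock, two racing callers can no longer be handed the same id, and a caller can no longer pair a new position with a stale generation.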
Signed-off-by: Michal Hocko
Acked-by: KAMEZAWA Hiroyuki
---
 mm/memcontrol.c |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ea8951..e71cfde 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -148,6 +148,8 @@ struct mem_cgroup_reclaim_iter {
 	int position;
 	/* scan generation, increased every round-trip */
 	unsigned int generation;
+	/* lock to protect the position and generation */
+	spinlock_t iter_lock;
 };
 
 /*
@@ -1161,8 +1163,11 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 
 			mz = mem_cgroup_zoneinfo(root, nid, zid);
 			iter = &mz->reclaim_iter[reclaim->priority];
-			if (prev && reclaim->generation != iter->generation)
+			spin_lock(&iter->iter_lock);
+			if (prev && reclaim->generation != iter->generation) {
+				spin_unlock(&iter->iter_lock);
 				return NULL;
+			}
 			id = iter->position;
 		}
 
@@ -1181,6 +1186,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 				iter->generation++;
 			else if (!prev && memcg)
 				reclaim->generation = iter->generation;
+			spin_unlock(&iter->iter_lock);
 		}
 
 		if (prev && !css)
@@ -6051,8 +6057,12 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 		return 1;
 
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+		int prio;
+
 		mz = &pn->zoneinfo[zone];
 		lruvec_init(&mz->lruvec);
+		for (prio = 0; prio < DEF_PRIORITY + 1; prio++)
+			spin_lock_init(&mz->reclaim_iter[prio].iter_lock);
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->memcg = memcg;
-- 
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753833Ab3ACRyy (ORCPT ); Thu, 3 Jan 2013 12:54:54 -0500
Received: from cantor2.suse.de ([195.135.220.15]:38380 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753728Ab3ACRyq (ORCPT ); Thu, 3 Jan 2013 12:54:46 -0500
From: Michal Hocko
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, Tejun Heo, Glauber Costa, Li Zefan
Subject: [PATCH v3 3/7] memcg: rework mem_cgroup_iter to use cgroup iterators
Date: Thu, 3 Jan 2013 18:54:17 +0100
Message-Id: <1357235661-29564-4-git-send-email-mhocko@suse.cz>
X-Mailer: git-send-email 1.7.10.4
In-Reply-To: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

mem_cgroup_iter currently relies on css->id when walking down a group
hierarchy tree. This is really awkward because the tree walk depends on
the group creation ordering. The only guarantee is that a parent node
is visited before its children.
Example
 1) mkdir -p a a/d a/b/c
 2) mkdir -p a/b/c a/d
Will create the same trees but the tree walks will be different:
 1) a, d, b, c
 2) a, b, c, d

574bd9f7 (cgroup: implement generic child / descendant walk macros) has
introduced generic cgroup tree walkers which provide either pre-order
or post-order tree walk. This patch converts css->id based iteration to
pre-order tree walk to keep the semantics of the original iterator
where a parent is always visited before its subtree.

cgroup_for_each_descendant_pre suggests using post_create and
pre_destroy for proper synchronization with group addition and removal,
respectively. This implementation doesn't use those because a new
memory cgroup is fully initialized in mem_cgroup_create and css
reference counting enforces that the group is alive for both the last
seen cgroup and the found one, or it signals that the group is dead and
should be skipped.

If the reclaim cookie is used we need to store the last visited group
into the iterator so we have to be careful that it doesn't disappear in
the mean time. Elevated reference count on the css keeps it alive even
though the group has been removed (parked waiting for the last dput so
that it can be freed).
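A pre-order "next descendant" step of the kind this patch relies on can be sketched on a plain tree. The names struct node, add_child() and next_descendant_pre() are hypothetical and only model the behaviour described in this thread: a NULL position starts the walk, the root itself is never returned (so the caller visits it explicitly), and NULL is returned once the whole subtree has been seen. It is not the kernel's cgroup_next_descendant_pre implementation and it ignores RCU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tree node with parent, first-child and sibling links. */
struct node {
	const char *name;
	struct node *parent;
	struct node *first_child;
	struct node *next_sibling;
};

/* Append child as the last child of parent. */
static void add_child(struct node *parent, struct node *child)
{
	struct node **link = &parent->first_child;

	child->parent = parent;
	child->next_sibling = NULL;
	while (*link)
		link = &(*link)->next_sibling;
	*link = child;
}

/*
 * Return the pre-order successor of pos inside root's subtree.
 * pos == NULL starts the walk; root itself is never returned.
 */
static struct node *next_descendant_pre(struct node *pos, struct node *root)
{
	if (!pos)
		return root->first_child;

	/* descend first: a parent is always followed by its first child */
	if (pos->first_child)
		return pos->first_child;

	/* otherwise climb until some ancestor below root has a next sibling */
	while (pos != root) {
		if (pos->next_sibling)
			return pos->next_sibling;
		pos = pos->parent;
	}
	return NULL;
}
```

For a tree a/{b/c, d} this yields b, c, d after the explicit visit of a, so every parent is seen before its subtree regardless of which ids the groups happened to get.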
V2
- use css_{get,put} for iter->last_visited rather than
  mem_cgroup_{get,put} because it is stronger wrt. cgroup life cycle
- cgroup_next_descendant_pre expects NULL pos for the first iteration
  otherwise it might loop endlessly for intermediate node without any
  children.

Signed-off-by: Michal Hocko
Acked-by: KAMEZAWA Hiroyuki
---
 mm/memcontrol.c |   74 ++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 57 insertions(+), 17 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 90a3b1d..e9f5c47 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -144,8 +144,8 @@ struct mem_cgroup_stat_cpu {
 };
 
 struct mem_cgroup_reclaim_iter {
-	/* css_id of the last scanned hierarchy member */
-	int position;
+	/* last scanned hierarchy member with elevated css ref count */
+	struct mem_cgroup *last_visited;
 	/* scan generation, increased every round-trip */
 	unsigned int generation;
 	/* lock to protect the position and generation */
@@ -1132,7 +1132,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 				   struct mem_cgroup_reclaim_cookie *reclaim)
 {
 	struct mem_cgroup *memcg = NULL;
-	int id = 0;
+	struct mem_cgroup *last_visited = NULL;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -1141,7 +1141,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		root = root_mem_cgroup;
 
 	if (prev && !reclaim)
-		id = css_id(&prev->css);
+		last_visited = prev;
 
 	if (!root->use_hierarchy && root != root_mem_cgroup) {
 		if (prev)
@@ -1149,9 +1149,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		return root;
 	}
 
+	rcu_read_lock();
 	while (!memcg) {
 		struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
-		struct cgroup_subsys_state *css;
+		struct cgroup_subsys_state *css = NULL;
 
 		if (reclaim) {
 			int nid = zone_to_nid(reclaim->zone);
@@ -1161,34 +1162,73 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			mz = mem_cgroup_zoneinfo(root, nid, zid);
 			iter = &mz->reclaim_iter[reclaim->priority];
 			spin_lock(&iter->iter_lock);
+			last_visited = iter->last_visited;
 			if (prev && reclaim->generation != iter->generation) {
+				if (last_visited) {
+					css_put(&last_visited->css);
+					iter->last_visited = NULL;
+				}
 				spin_unlock(&iter->iter_lock);
-				goto out_css_put;
+				goto out_unlock;
 			}
-			id = iter->position;
 		}
 
-		rcu_read_lock();
-		css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
-		if (css) {
-			if (css == &root->css || css_tryget(css))
-				memcg = mem_cgroup_from_css(css);
-		} else
-			id = 0;
-		rcu_read_unlock();
+		/*
+		 * Root is not visited by cgroup iterators so it needs an
+		 * explicit visit.
+		 */
+		if (!last_visited) {
+			css = &root->css;
+		} else {
+			struct cgroup *prev_cgroup, *next_cgroup;
+
+			prev_cgroup = (last_visited == root) ? NULL
+				: last_visited->css.cgroup;
+			next_cgroup = cgroup_next_descendant_pre(prev_cgroup,
+					root->css.cgroup);
+			if (next_cgroup)
+				css = cgroup_subsys_state(next_cgroup,
+						mem_cgroup_subsys_id);
+		}
+
+		/*
+		 * Even if we found a group we have to make sure it is alive.
+		 * css && !memcg means that the groups should be skipped and
+		 * we should continue the tree walk.
+		 * last_visited css is safe to use because it is protected by
+		 * css_get and the tree walk is rcu safe.
+		 */
+		if (css == &root->css || (css && css_tryget(css)))
+			memcg = mem_cgroup_from_css(css);
 
 		if (reclaim) {
-			iter->position = id;
+			struct mem_cgroup *curr = memcg;
+
+			if (last_visited)
+				css_put(&last_visited->css);
+
+			if (css && !memcg)
+				curr = mem_cgroup_from_css(css);
+
+			/* make sure that the cached memcg is not removed */
+			if (curr)
+				css_get(&curr->css);
+			iter->last_visited = curr;
+
 			if (!css)
 				iter->generation++;
 			else if (!prev && memcg)
 				reclaim->generation = iter->generation;
 			spin_unlock(&iter->iter_lock);
+		} else if (css && !memcg) {
+			last_visited = mem_cgroup_from_css(css);
 		}
 
 		if (prev && !css)
-			goto out_css_put;
+			goto out_unlock;
 	}
+out_unlock:
+	rcu_read_unlock();
 out_css_put:
 	if (prev && prev != root)
 		css_put(&prev->css);
-- 
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753816Ab3ACRyv (ORCPT ); Thu, 3 Jan 2013 12:54:51 -0500
Received: from cantor2.suse.de ([195.135.220.15]:38387 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753773Ab3ACRyq (ORCPT ); Thu, 3 Jan 2013 12:54:46 -0500
From: Michal Hocko
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, Tejun Heo, Glauber Costa, Li Zefan
Subject: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
Date: Thu, 3 Jan 2013 18:54:18 +0100
Message-Id: <1357235661-29564-5-git-send-email-mhocko@suse.cz>
X-Mailer: git-send-email 1.7.10.4
In-Reply-To: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Now that the per-node-zone-priority iterator caches memory cgroups
rather than their css ids we have to be careful and remove them from
the iterator when they are on the way out, otherwise they might hang
around for an unbounded amount of time (until the global/targeted
reclaim triggers the zone under priority, finds out the group is dead
and lets it find its final rest).

This is solved by hooking into mem_cgroup_css_offline and checking all
per-node-zone-priority iterators on the way up to the root cgroup. If
the current memcg is found in the respective iter->last_visited then it
is replaced by the previous one in the same sub-hierarchy.

This guarantees that no group gets more reclaiming than necessary and
the next iteration will continue without noticing that the removed
group has disappeared.

Spotted-by: Ying Han
Signed-off-by: Michal Hocko
---
 mm/memcontrol.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e9f5c47..4f81abd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6375,10 +6375,99 @@ free_out:
 	return ERR_PTR(error);
 }
 
+/*
+ * Helper to find memcg's previous group under the given root
+ * hierarchy.
+ */
+struct mem_cgroup *__find_prev_memcg(struct mem_cgroup *root,
+		struct mem_cgroup *memcg)
+{
+	struct cgroup *memcg_cgroup = memcg->css.cgroup;
+	struct cgroup *root_cgroup = root->css.cgroup;
+	struct cgroup *prev_cgroup = NULL;
+	struct cgroup *iter;
+
+	cgroup_for_each_descendant_pre(iter, root_cgroup) {
+		if (iter == memcg_cgroup)
+			break;
+		prev_cgroup = iter;
+	}
+
+	return (prev_cgroup) ? mem_cgroup_from_cont(prev_cgroup) : NULL;
+}
+
+/*
+ * Remove the given memcg under given root from all per-node per-zone
+ * per-priority chached iterators.
+ */
+static void mem_cgroup_uncache_reclaim_iters(struct mem_cgroup *root,
+		struct mem_cgroup *memcg)
+{
+	int node;
+
+	for_each_node(node) {
+		struct mem_cgroup_per_node *pn = root->info.nodeinfo[node];
+		int zone;
+
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			struct mem_cgroup_per_zone *mz;
+			int prio;
+
+			mz = &pn->zoneinfo[zone];
+			for (prio = 0; prio < DEF_PRIORITY + 1; prio++) {
+				struct mem_cgroup_reclaim_iter *iter;
+
+				/*
+				 * Just drop the reference on the removed memcg
+				 * cached last_visited. No need to lock iter as
+				 * the memcg is on the way out and cannot be
+				 * reclaimed.
+				 */
+				iter = &mz->reclaim_iter[prio];
+				if (root == memcg) {
+					if (iter->last_visited)
+						css_put(&iter->last_visited->css);
+					continue;
+				}
+
+				rcu_read_lock();
+				spin_lock(&iter->iter_lock);
+				if (iter->last_visited == memcg) {
+					iter->last_visited = __find_prev_memcg(
+							root, memcg);
+					css_put(&memcg->css);
+				}
+				spin_unlock(&iter->iter_lock);
+				rcu_read_unlock();
+			}
+		}
+	}
+}
+
+/*
+ * Remove the given memcg from all cached reclaim iterators.
+ */
+static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent = memcg;
+
+	do {
+		mem_cgroup_uncache_reclaim_iters(parent, memcg);
+	} while ((parent = parent_mem_cgroup(parent)));
+
+	/*
+	 * if the root memcg is not hierarchical we have to check it
+	 * explicitely.
+	 */
+	if (!root_mem_cgroup->use_hierarchy)
+		mem_cgroup_uncache_reclaim_iters(root_mem_cgroup, memcg);
+}
+
 static void mem_cgroup_css_offline(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 
+	mem_cgroup_uncache_from_reclaim(memcg);
 	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
 }
-- 
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753848Ab3ACRy4 (ORCPT ); Thu, 3 Jan 2013 12:54:56 -0500
Received: from cantor2.suse.de ([195.135.220.15]:38396 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753780Ab3ACRys (ORCPT ); Thu, 3 Jan 2013 12:54:48 -0500
From: Michal Hocko
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, Tejun Heo, Glauber Costa, Li Zefan
Subject: [PATCH v3 5/7] memcg: simplify mem_cgroup_iter
Date: Thu, 3 Jan 2013 18:54:19 +0100
Message-Id: <1357235661-29564-6-git-send-email-mhocko@suse.cz>
X-Mailer: git-send-email 1.7.10.4
In-Reply-To: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Current implementation of mem_cgroup_iter has to consider both css and
memcg to find out whether no group has been found (css==NULL - aka the
loop is completed) and that no memcg is associated with the found node
(!memcg - aka css_tryget failed because the group is no longer alive).
This leads to awkward tweaks like tests for css && !memcg to skip the
current node.

It would be much easier if we got rid of the css variable altogether
and relied only on memcg. In order to do that the iteration part has to
skip dead nodes. This sounds natural to me and as a nice side effect we
will get a simple invariant that memcg is always alive when non-NULL
and all nodes have been visited otherwise.
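The "skip dead nodes" invariant can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: struct group, group_tryget() and next_alive() are hypothetical names, a count of zero is taken to mean "dead", and the walk order is a plain linked list rather than a cgroup tree. The kernel's css_tryget differs in its details.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical group: refcnt == 0 means the group is already dead. */
struct group {
	atomic_int refcnt;
	struct group *next;	/* next group in walk order */
};

/* Take a reference only if the group is still alive. */
static bool group_tryget(struct group *g)
{
	int old = atomic_load(&g->refcnt);

	while (old > 0) {
		/* on failure, old is reloaded with the current value */
		if (atomic_compare_exchange_weak(&g->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Return the first alive group at or after pos with a reference held,
 * or NULL once every remaining group has been looked at.
 */
static struct group *next_alive(struct group *pos)
{
	for (; pos; pos = pos->next)
		if (group_tryget(pos))
			return pos;
	return NULL;
}
```

The caller then only has to test the returned pointer: non-NULL means an alive, referenced group, and NULL means the walk is complete, which is the invariant described above.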

We could get rid of the surrounding while loop but keep it in for now to
make review easier. It will go away in the following patch.

Signed-off-by: Michal Hocko
Acked-by: KAMEZAWA Hiroyuki
---
 mm/memcontrol.c |   56 +++++++++++++++++++++++++++----------------------------
 1 file changed, 27 insertions(+), 29 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4f81abd..d8c6e5e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1152,7 +1152,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	rcu_read_lock();
 	while (!memcg) {
 		struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
-		struct cgroup_subsys_state *css = NULL;
 
 		if (reclaim) {
 			int nid = zone_to_nid(reclaim->zone);
@@ -1178,53 +1177,52 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		 * explicit visit.
 		 */
 		if (!last_visited) {
-			css = &root->css;
+			memcg = root;
 		} else {
 			struct cgroup *prev_cgroup, *next_cgroup;
 
 			prev_cgroup = (last_visited == root) ? NULL
 				: last_visited->css.cgroup;
-			next_cgroup = cgroup_next_descendant_pre(prev_cgroup,
-					root->css.cgroup);
-			if (next_cgroup)
-				css = cgroup_subsys_state(next_cgroup,
-						mem_cgroup_subsys_id);
-		}
+skip_node:
+			next_cgroup = cgroup_next_descendant_pre(
+					prev_cgroup, root->css.cgroup);
 
-		/*
-		 * Even if we found a group we have to make sure it is alive.
-		 * css && !memcg means that the groups should be skipped and
-		 * we should continue the tree walk.
-		 * last_visited css is safe to use because it is protected by
-		 * css_get and the tree walk is rcu safe.
-		 */
-		if (css == &root->css || (css && css_tryget(css)))
-			memcg = mem_cgroup_from_css(css);
+			/*
+			 * Even if we found a group we have to make sure it is
+			 * alive. css && !memcg means that the groups should be
+			 * skipped and we should continue the tree walk.
+			 * last_visited css is safe to use because it is
+			 * protected by css_get and the tree walk is rcu safe.
+			 */
+			if (next_cgroup) {
+				struct mem_cgroup *mem = mem_cgroup_from_cont(
+						next_cgroup);
+				if (css_tryget(&mem->css))
+					memcg = mem;
+				else {
+					prev_cgroup = next_cgroup;
+					goto skip_node;
+				}
+			}
+		}
 
 		if (reclaim) {
-			struct mem_cgroup *curr = memcg;
-
 			if (last_visited)
 				css_put(&last_visited->css);
 
-			if (css && !memcg)
-				curr = mem_cgroup_from_css(css);
-
 			/* make sure that the cached memcg is not removed */
-			if (curr)
-				css_get(&curr->css);
-			iter->last_visited = curr;
+			if (memcg)
+				css_get(&memcg->css);
+			iter->last_visited = memcg;
 
-			if (!css)
+			if (!memcg)
 				iter->generation++;
 			else if (!prev && memcg)
 				reclaim->generation = iter->generation;
 			spin_unlock(&iter->iter_lock);
-		} else if (css && !memcg) {
-			last_visited = mem_cgroup_from_css(css);
 		}
 
-		if (prev && !css)
+		if (prev && !memcg)
 			goto out_unlock;
 	}
 out_unlock:
-- 
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753831Ab3ACRzc (ORCPT ); Thu, 3 Jan 2013 12:55:32 -0500
Received: from cantor2.suse.de ([195.135.220.15]:38408 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753810Ab3ACRyu (ORCPT ); Thu, 3 Jan 2013 12:54:50 -0500
From: Michal Hocko
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, Tejun Heo, Glauber Costa, Li Zefan
Subject: [PATCH v3 7/7] cgroup: remove css_get_next
Date: Thu, 3 Jan 2013 18:54:21 +0100
Message-Id: <1357235661-29564-8-git-send-email-mhocko@suse.cz>
X-Mailer: git-send-email 1.7.10.4
In-Reply-To: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Now that we have generic and well ordered cgroup tree walkers there is
no need to keep css_get_next around.

Signed-off-by: Michal Hocko
Acked-by: KAMEZAWA Hiroyuki
---
 include/linux/cgroup.h |    7 -------
 kernel/cgroup.c        |   49 ------------------------------------------------
 2 files changed, 56 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7d73905..a4d86b0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -685,13 +685,6 @@ void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);
 
 struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);
 
-/*
- * Get a cgroup whose id is greater than or equal to id under tree of root.
- * Returning a cgroup_subsys_state or NULL.
- */
-struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id,
-		struct cgroup_subsys_state *root, int *foundid);
-
 /* Returns true if root is ancestor of cg */
 bool css_is_ancestor(struct cgroup_subsys_state *cg,
 			   const struct cgroup_subsys_state *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f34c41b..3013ec4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5384,55 +5384,6 @@ struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
 }
 EXPORT_SYMBOL_GPL(css_lookup);
 
-/**
- * css_get_next - lookup next cgroup under specified hierarchy.
- * @ss: pointer to subsystem
- * @id: current position of iteration.
- * @root: pointer to css. search tree under this.
- * @foundid: position of found object.
- *
- * Search next css under the specified hierarchy of rootid. Calling under
- * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
- */
-struct cgroup_subsys_state *
-css_get_next(struct cgroup_subsys *ss, int id,
-	     struct cgroup_subsys_state *root, int *foundid)
-{
-	struct cgroup_subsys_state *ret = NULL;
-	struct css_id *tmp;
-	int tmpid;
-	int rootid = css_id(root);
-	int depth = css_depth(root);
-
-	if (!rootid)
-		return NULL;
-
-	BUG_ON(!ss->use_id);
-	WARN_ON_ONCE(!rcu_read_lock_held());
-
-	/* fill start point for scan */
-	tmpid = id;
-	while (1) {
-		/*
-		 * scan next entry from bitmap(tree), tmpid is updated after
-		 * idr_get_next().
-		 */
-		tmp = idr_get_next(&ss->idr, &tmpid);
-		if (!tmp)
-			break;
-		if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
-			ret = rcu_dereference(tmp->css);
-			if (ret) {
-				*foundid = tmpid;
-				break;
-			}
-		}
-		/* continue to scan from next id */
-		tmpid = tmpid + 1;
-	}
-	return ret;
-}
-
 /*
  * get corresponding css from file open on cgroupfs directory
  */
-- 
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753855Ab3ACR4G (ORCPT ); Thu, 3 Jan 2013 12:56:06 -0500
Received: from cantor2.suse.de ([195.135.220.15]:38404 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753796Ab3ACRyt (ORCPT ); Thu, 3 Jan 2013 12:54:49 -0500
From: Michal Hocko
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, Tejun Heo, Glauber Costa, Li Zefan
Subject: [PATCH v3 6/7] memcg: further simplify mem_cgroup_iter
Date: Thu, 3 Jan 2013 18:54:20 +0100
Message-Id: <1357235661-29564-7-git-send-email-mhocko@suse.cz>
X-Mailer: git-send-email 1.7.10.4
In-Reply-To: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

mem_cgroup_iter basically does two things currently.
It takes care of the housekeeping (reference counting, reclaim cookie)
and it iterates through a hierarchy tree (by using the generic cgroup
tree walk).
The code would be much easier to follow if we moved the iteration out
of the function (to __mem_cgroup_iter_next) so that the distinction is
clearer.
This patch doesn't introduce any functional changes.

Signed-off-by: Michal Hocko
Acked-by: KAMEZAWA Hiroyuki
---
 mm/memcontrol.c |   79 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 46 insertions(+), 33 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d8c6e5e..d80fcff 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1110,6 +1110,51 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
 	return memcg;
 }
 
+/*
+ * Returns a next (in a pre-order walk) alive memcg (with elevated css
+ * ref. count) or NULL if the whole root's subtree has been visited.
+ *
+ * helper function to be used by mem_cgroup_iter
+ */
+static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root,
+		struct mem_cgroup *last_visited)
+{
+	struct cgroup *prev_cgroup, *next_cgroup;
+
+	/*
+	 * Root is not visited by cgroup iterators so it needs an
+	 * explicit visit.
+	 */
+	if (!last_visited)
+		return root;
+
+	prev_cgroup = (last_visited == root) ? NULL
+		: last_visited->css.cgroup;
+skip_node:
+	next_cgroup = cgroup_next_descendant_pre(
+			prev_cgroup, root->css.cgroup);
+
+	/*
+	 * Even if we found a group we have to make sure it is
+	 * alive. css && !memcg means that the groups should be
+	 * skipped and we should continue the tree walk.
+	 * last_visited css is safe to use because it is
+	 * protected by css_get and the tree walk is rcu safe.
+	 */
+	if (next_cgroup) {
+		struct mem_cgroup *mem = mem_cgroup_from_cont(
+				next_cgroup);
+		if (css_tryget(&mem->css))
+			return mem;
+		else {
+			prev_cgroup = next_cgroup;
+			goto skip_node;
+		}
+	}
+
+	return NULL;
+}
+
 /**
  * mem_cgroup_iter - iterate over memory cgroup hierarchy
  * @root: hierarchy root
@@ -1172,39 +1217,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			}
 		}
 
-		/*
-		 * Root is not visited by cgroup iterators so it needs an
-		 * explicit visit.
-		 */
-		if (!last_visited) {
-			memcg = root;
-		} else {
-			struct cgroup *prev_cgroup, *next_cgroup;
-
-			prev_cgroup = (last_visited == root) ? NULL
-				: last_visited->css.cgroup;
-skip_node:
-			next_cgroup = cgroup_next_descendant_pre(
-					prev_cgroup, root->css.cgroup);
-
-			/*
-			 * Even if we found a group we have to make sure it is
-			 * alive. css && !memcg means that the groups should be
-			 * skipped and we should continue the tree walk.
-			 * last_visited css is safe to use because it is
-			 * protected by css_get and the tree walk is rcu safe.
-			 */
-			if (next_cgroup) {
-				struct mem_cgroup *mem = mem_cgroup_from_cont(
-						next_cgroup);
-				if (css_tryget(&mem->css))
-					memcg = mem;
-				else {
-					prev_cgroup = next_cgroup;
-					goto skip_node;
-				}
-			}
-		}
+		memcg = __mem_cgroup_iter_next(root, last_visited);
 
 		if (reclaim) {
 			if (last_visited)
-- 
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753793Ab3ACRys (ORCPT ); Thu, 3 Jan 2013 12:54:48 -0500
Received: from cantor2.suse.de ([195.135.220.15]:38367 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753733Ab3ACRyo (ORCPT ); Thu, 3 Jan 2013 12:54:44 -0500
From: Michal Hocko
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, Tejun Heo, Glauber Costa, Li Zefan
Subject: [PATCH v3 2/7] memcg: keep prev's css alive for the whole mem_cgroup_iter
Date: Thu, 3 Jan 2013 18:54:16 +0100
Message-Id: <1357235661-29564-3-git-send-email-mhocko@suse.cz>
X-Mailer: git-send-email 1.7.10.4
In-Reply-To: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

css reference counting keeps the cgroup alive even though it has
already been removed. mem_cgroup_iter relies on this fact and takes a
reference to the returned group. The reference is then released on the
next iteration or mem_cgroup_iter_break.
mem_cgroup_iter currently releases the reference right after it gets
the last css_id.
This is correct because neither prev's memcg nor its cgroup is accessed
after that. This will change in the next patch so we need to hold the
group alive a bit longer, so let's move the css_put to the end of the
function.
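The effect of moving the put to a single exit label can be sketched like this. Everything here is hypothetical and simplified for illustration: struct group is a plain counted object, the "hierarchy" is an array, and iter_step() only mimics the shape of the control flow, not mem_cgroup_iter itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counted object standing in for a memcg and its css. */
struct group {
	int refs;
};

static void group_get(struct group *g)
{
	g->refs++;
}

static void group_put(struct group *g)
{
	g->refs--;
}

/*
 * Return the element after prev (or the first one when prev is NULL)
 * with a reference taken. The reference held on prev is dropped only
 * at the single exit label, so prev stays valid on every path above it.
 */
static struct group *iter_step(struct group *groups, size_t n,
			       struct group *prev)
{
	struct group *next = NULL;
	size_t idx = prev ? (size_t)(prev - groups) + 1 : 0;

	if (idx >= n)
		goto out_put;	/* end of walk: still drop prev below */

	next = &groups[idx];
	group_get(next);
out_put:
	if (prev)
		group_put(prev);
	return next;
}
```

Every early return becomes a jump to out_put, so later code that needs to look at prev can be added anywhere before the label without risking a use after the final put.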

Signed-off-by: Michal Hocko
Acked-by: KAMEZAWA Hiroyuki
---
 mm/memcontrol.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e71cfde..90a3b1d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1143,12 +1143,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	if (prev && !reclaim)
 		id = css_id(&prev->css);
 
-	if (prev && prev != root)
-		css_put(&prev->css);
-
 	if (!root->use_hierarchy && root != root_mem_cgroup) {
 		if (prev)
-			return NULL;
+			goto out_css_put;
 		return root;
 	}
 
@@ -1166,7 +1163,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			spin_lock(&iter->iter_lock);
 			if (prev && reclaim->generation != iter->generation) {
 				spin_unlock(&iter->iter_lock);
-				return NULL;
+				goto out_css_put;
 			}
 			id = iter->position;
 		}
@@ -1190,8 +1187,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		}
 
 		if (prev && !css)
-			return NULL;
+			goto out_css_put;
 	}
+out_css_put:
+	if (prev && prev != root)
+		css_put(&prev->css);
+
 	return memcg;
 }
 
-- 
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754174Ab3ADDng (ORCPT ); Thu, 3 Jan 2013 22:43:36 -0500
Received: from szxga02-in.huawei.com ([119.145.14.65]:8031 "EHLO
	szxga02-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753896Ab3ADDne (ORCPT ); Thu, 3 Jan 2013 22:43:34 -0500
Message-ID: <50E64FB0.9050803@huawei.com>
Date: Fri, 4 Jan 2013 11:42:40 +0800
From: Li Zefan
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/17.0 Thunderbird/17.0
MIME-Version: 1.0
To: Michal Hocko
CC: KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han, Tejun Heo, Glauber Costa
Subject: Re: [PATCH v3 7/7] cgroup: remove css_get_next
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
	<1357235661-29564-8-git-send-email-mhocko@suse.cz>
In-Reply-To: <1357235661-29564-8-git-send-email-mhocko@suse.cz>
Content-Type: text/plain; charset="GB2312"
Content-Transfer-Encoding: 7bit
X-Originating-IP: [10.135.68.215]
X-CFilter-Loop: Reflected
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 2013/1/4 1:54, Michal Hocko wrote:
> Now that we have generic and well ordered cgroup tree walkers there is
> no need to keep css_get_next in the place.
>
> Signed-off-by: Michal Hocko
> Acked-by: KAMEZAWA Hiroyuki

Acked-by: Li Zefan

> ---
>  include/linux/cgroup.h |    7 -------
>  kernel/cgroup.c        |   49 ------------------------------------------------
>  2 files changed, 56 deletions(-)

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751354Ab3AGGSk (ORCPT ); Mon, 7 Jan 2013 01:18:40 -0500
Received: from fgwmail6.fujitsu.co.jp ([192.51.44.36]:49061 "EHLO
	fgwmail6.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750922Ab3AGGSh (ORCPT ); Mon, 7 Jan 2013 01:18:37 -0500
X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4
Message-ID: <50EA689B.7060308@jp.fujitsu.com>
Date: Mon, 07 Jan 2013 15:18:03 +0900
From: Kamezawa Hiroyuki
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/17.0 Thunderbird/17.0
MIME-Version: 1.0
To: Michal Hocko
CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner,
	Ying Han, Tejun Heo, Glauber Costa, Li Zefan
Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
	<1357235661-29564-5-git-send-email-mhocko@suse.cz>
In-Reply-To: <1357235661-29564-5-git-send-email-mhocko@suse.cz>
Content-Type: text/plain; charset=ISO-2022-JP
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

(2013/01/04 2:54), Michal Hocko wrote:
> Now that per-node-zone-priority iterator caches memory cgroups rather
> than their css ids we have to be careful and remove them from the
> iterator when they are on the way out otherwise they might hang for
> unbounded amount of time (until the global/targeted reclaim triggers the
> zone under priority to find out the group is dead and let it to find the
> final rest).
>
> This is solved by hooking into mem_cgroup_css_offline and checking all
> per-node-zone-priority iterators up the way to the root cgroup. If the
> current memcg is found in the respective iter->last_visited then it is
> replaced by the previous one in the same sub-hierarchy.
>
> This guarantees that no group gets more reclaiming than necessary and
> the next iteration will continue without noticing that the removed group
> has disappeared.
>
> Spotted-by: Ying Han
> Signed-off-by: Michal Hocko

Acked-by: KAMEZAWA Hiroyuki

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754707Ab3AWMwJ (ORCPT ); Wed, 23 Jan 2013 07:52:09 -0500
Received: from cantor2.suse.de ([195.135.220.15]:42921 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1754872Ab3AWMwH (ORCPT ); Wed, 23 Jan 2013 07:52:07 -0500
Date: Wed, 23 Jan 2013 13:52:02 +0100
From: Michal Hocko
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, Tejun Heo, Glauber Costa, Li Zefan
Subject: Re: [PATCH v3 0/7] rework mem_cgroup iterator
Message-ID: <20130123125202.GA13319@dhcp22.suse.cz>
References: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1357235661-29564-1-git-send-email-mhocko@suse.cz>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Are there any comments? Ying, Johannes? I would be happy if this could
go into 3.9.
On Thu 03-01-13 18:54:14, Michal Hocko wrote: > Hi all, > this is a third version of the patchset previously posted here: > https://lkml.org/lkml/2012/11/26/616 > > The patch set tries to make mem_cgroup_iter saner in the way how it > walks hierarchies. css->id based traversal is far from being ideal as it > is not deterministic because it depends on the creation ordering. > > Diffstat doesn't look that promising as in previous versions anymore but > I think it is worth the resulting outcome (and the sanity ;)). > > The first patch fixes a potential misbehaving which I haven't seen but > the fix is needed for the later patches anyway. We could take it alone > as well but I do not have any bug report to base the fix on. The second > one is also preparatory and it is new to the series. > > The third patch is the core of the patchset and it replaces css_get_next > based on css_id by the generic cgroup pre-order iterator which > means that css_id is no longer used by memcg. This brings some > chalanges for the last visited group caching during the reclaim > (mem_cgroup_per_zone::reclaim_iter). We have to use memcg pointers > directly now which means that we have to keep a reference to those > groups' css to keep them alive. > > The next patch fixups an unbounded cgroup removal holdoff caused by > the elevated css refcount and does the clean up on the group removal. > Thanks to Ying who spotted this during testing of the previous version > of the patchset. > I could have folded it into the previous patch but I felt it would be > too big to review but if people feel it would be better that way, I have > no problems to squash them together. > > The fourth and fifth patches are an attempt for simplification of the > mem_cgroup_iter. css juggling is removed and the iteration logic is > moved to a helper so that the reference counting and iteration are > separated. > > The last patch just removes css_get_next as there is no user for it any > longer. 
> > I am also thinking that leaf-to-root iteration makes more sense but this > patch is not included in the series yet because I have to think some > more about the justification. > > Same as with the previous version I have tested with a quite simple > hierarchy: > A (limit = 280M, use_hierarchy=true) > / | \ > B C D (all have 100M limit) > > And a separate kernel build in the each leaf group. This triggers > both children only and hierarchical reclaim which is parallel so the > iter_reclaim caching is active a lot. I will hammer it some more but the > series should be in quite a good shape already. > > Michal Hocko (7): > memcg: synchronize per-zone iterator access by a spinlock > memcg: keep prev's css alive for the whole mem_cgroup_iter > memcg: rework mem_cgroup_iter to use cgroup iterators > memcg: remove memcg from the reclaim iterators > memcg: simplify mem_cgroup_iter > memcg: further simplify mem_cgroup_iter > cgroup: remove css_get_next > > And the diffstat says: > include/linux/cgroup.h | 7 -- > kernel/cgroup.c | 49 ------------ > mm/memcontrol.c | 199 ++++++++++++++++++++++++++++++++++++++++++------ > 3 files changed, 175 insertions(+), 80 deletions(-) > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email@kvack.org -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1946995Ab3BHTdi (ORCPT ); Fri, 8 Feb 2013 14:33:38 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:39423 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1946860Ab3BHTdf (ORCPT ); Fri, 8 Feb 2013 14:33:35 -0500 Date: Fri, 8 Feb 2013 14:33:18 -0500 From: Johannes Weiner To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130208193318.GA15951@cmpxchg.org> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1357235661-29564-5-git-send-email-mhocko@suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jan 03, 2013 at 06:54:18PM +0100, Michal Hocko wrote: > Now that per-node-zone-priority iterator caches memory cgroups rather > than their css ids we have to be careful and remove them from the > iterator when they are on the way out otherwise they might hang for > unbounded amount of time (until the global/targeted reclaim triggers the > zone under priority to find out the group is dead and let it to find the > final rest). > > This is solved by hooking into mem_cgroup_css_offline and checking all > per-node-zone-priority iterators up the way to the root cgroup. If the > current memcg is found in the respective iter->last_visited then it is > replaced by the previous one in the same sub-hierarchy. 
> > This guarantees that no group gets more reclaiming than necessary and > the next iteration will continue without noticing that the removed group > has disappeared. > > Spotted-by: Ying Han > Signed-off-by: Michal Hocko > --- > mm/memcontrol.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 89 insertions(+) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index e9f5c47..4f81abd 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -6375,10 +6375,99 @@ free_out: > return ERR_PTR(error); > } > > +/* > + * Helper to find memcg's previous group under the given root > + * hierarchy. > + */ > +struct mem_cgroup *__find_prev_memcg(struct mem_cgroup *root, > + struct mem_cgroup *memcg) > +{ > + struct cgroup *memcg_cgroup = memcg->css.cgroup; > + struct cgroup *root_cgroup = root->css.cgroup; > + struct cgroup *prev_cgroup = NULL; > + struct cgroup *iter; > + > + cgroup_for_each_descendant_pre(iter, root_cgroup) { > + if (iter == memcg_cgroup) > + break; > + prev_cgroup = iter; > + } > + > + return (prev_cgroup) ? mem_cgroup_from_cont(prev_cgroup) : NULL; > +} > + > +/* > + * Remove the given memcg under given root from all per-node per-zone > + * per-priority chached iterators. > + */ > +static void mem_cgroup_uncache_reclaim_iters(struct mem_cgroup *root, > + struct mem_cgroup *memcg) > +{ > + int node; > + > + for_each_node(node) { > + struct mem_cgroup_per_node *pn = root->info.nodeinfo[node]; > + int zone; > + > + for (zone = 0; zone < MAX_NR_ZONES; zone++) { > + struct mem_cgroup_per_zone *mz; > + int prio; > + > + mz = &pn->zoneinfo[zone]; > + for (prio = 0; prio < DEF_PRIORITY + 1; prio++) { > + struct mem_cgroup_reclaim_iter *iter; > + > + /* > + * Just drop the reference on the removed memcg > + * cached last_visited. No need to lock iter as > + * the memcg is on the way out and cannot be > + * reclaimed. 
> + */ > + iter = &mz->reclaim_iter[prio]; > + if (root == memcg) { > + if (iter->last_visited) > + css_put(&iter->last_visited->css); > + continue; > + } > + > + rcu_read_lock(); > + spin_lock(&iter->iter_lock); > + if (iter->last_visited == memcg) { > + iter->last_visited = __find_prev_memcg( > + root, memcg); > + css_put(&memcg->css); > + } > + spin_unlock(&iter->iter_lock); > + rcu_read_unlock(); > + } > + } > + } > +} > + > +/* > + * Remove the given memcg from all cached reclaim iterators. > + */ > +static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *parent = memcg; > + > + do { > + mem_cgroup_uncache_reclaim_iters(parent, memcg); > + } while ((parent = parent_mem_cgroup(parent))); > + > + /* > + * if the root memcg is not hierarchical we have to check it > + * explicitely. > + */ > + if (!root_mem_cgroup->use_hierarchy) > + mem_cgroup_uncache_reclaim_iters(root_mem_cgroup, memcg); > +} for each in hierarchy: for each node: for each zone: for each reclaim priority: every time a cgroup is destroyed. I don't think such a hammer is justified in general, let alone for consolidating code a little. Can we invalidate the position cache lazily? Have a global "cgroup destruction" counter and store a snapshot of that counter whenever we put a cgroup pointer in the position cache. We only use the cached pointer if that counter has not changed in the meantime, so we know that the cgroup still exists. It is pretty pretty imprecise and we invalidate the whole cache every time a cgroup is destroyed, but I think that should be okay. If not, better ideas are welcome. 
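A minimal userspace sketch of the lazy invalidation proposed above, assuming a single global destruction counter. All names (struct group, struct pos_cache, destroy_count, cache_store, cache_load, group_destroyed, demo) are hypothetical and not kernel symbols. The model is single-threaded and deliberately leaves out RCU, css reference counting and the per-zone iterator lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup; not a kernel type. */
struct group {
	int id;
};

/* Assumed global counter, bumped once for every destroyed group. */
static atomic_uint destroy_count;

/* Cached iterator position; note that it holds no reference. */
struct pos_cache {
	struct group *position;	/* last visited group */
	unsigned int snapshot;	/* destroy_count when position was stored */
};

/* Remember a position together with the current counter value. */
static void cache_store(struct pos_cache *c, struct group *g)
{
	c->position = g;
	c->snapshot = atomic_load(&destroy_count);
}

/* Return the cached position, or NULL if any group died since the store. */
static struct group *cache_load(const struct pos_cache *c)
{
	if (c->snapshot != atomic_load(&destroy_count))
		return NULL;
	return c->position;
}

/* Called on group destruction; invalidates every cache at once. */
static void group_destroyed(void)
{
	atomic_fetch_add(&destroy_count, 1);
}

/* Walk-through: a single destruction is enough to drop the position. */
static void demo(void)
{
	struct group a = { 1 };
	struct pos_cache cache = { NULL, 0 };

	cache_store(&cache, &a);
	assert(cache_load(&cache) == &a);

	group_destroyed();
	assert(cache_load(&cache) == NULL);	/* walker restarts from the root */
}
```

The imprecision mentioned above shows up directly in this model: group_destroyed() does not say which group went away, so every cached position is dropped, whether or not it pointed at the destroyed group.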
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757571Ab3BKPQ4 (ORCPT ); Mon, 11 Feb 2013 10:16:56 -0500 Received: from cantor2.suse.de ([195.135.220.15]:35604 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757255Ab3BKPQz (ORCPT ); Mon, 11 Feb 2013 10:16:55 -0500 Date: Mon, 11 Feb 2013 16:16:49 +0100 From: Michal Hocko To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130211151649.GD19922@dhcp22.suse.cz> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130208193318.GA15951@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 08-02-13 14:33:18, Johannes Weiner wrote: [...] > for each in hierarchy: > for each node: > for each zone: > for each reclaim priority: > > every time a cgroup is destroyed. I don't think such a hammer is > justified in general, let alone for consolidating code a little. > > Can we invalidate the position cache lazily? Have a global "cgroup > destruction" counter and store a snapshot of that counter whenever we > put a cgroup pointer in the position cache. We only use the cached > pointer if that counter has not changed in the meantime, so we know > that the cgroup still exists. 
Currently we have:
rcu_read_lock() // keeps cgroup links safe
iter->iter_lock // keeps selection exclusive for a specific iterator
1) global_counter == iter_counter
2) css_tryget(cached_memcg) // check it is still alive
rcu_read_unlock()

What would protect us from races if the css disappeared between 1 and
2?

The css is invalidated from a worker context scheduled from __css_put,
and that uses dentry locking, which we surely do not want to pull in
here. We could hook into css_offline, which is called with cgroup_mutex
held, but we cannot use that mutex here because it is no longer exported
and Tejun would kill us for that.
So we could add a new global memcg-internal lock to do this atomically.
Ohh, this is getting uglier...

> It is pretty pretty imprecise and we invalidate the whole cache every
> time a cgroup is destroyed, but I think that should be okay.

I am not sure this is OK because it gives an indirect way of
influencing reclaim in one hierarchy from another one, which opens the
door to regressions (or malicious over-reclaim in the extreme case).
So I do not like this very much.

> If not, better ideas are welcome.

Maybe we could keep the counter per memcg, but that would mean that we
would need to go up the hierarchy as well. We wouldn't have to go over
the node-zone-priority cleanup, so it would be much more lightweight.

I am not sure this is necessarily better than explicit cleanup because
it brings yet another kind of generation number into the game, but I
guess I can live with it if people really think the relaxed way is much
better.
What do you think about the patch below (untested yet)?
--- >>From 8169aa49649753822661b8fbbfba0852dcfedba6 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Mon, 11 Feb 2013 16:13:48 +0100 Subject: [PATCH] memcg: relax memcg iter caching Now that the per-node-zone-priority iterator caches memory cgroups rather than their css ids, we have to be careful and remove them from the iterator when they are on the way out; otherwise they might hang around for an unbounded amount of time (until global/targeted reclaim visits the zone at that priority, finds out the group is dead and lets it find its final rest). We can fix this issue by relaxing the rules for the last_visited memcg as well. Instead of taking a reference to the css we can just use its pointer and track the number of removed groups for each memcg. This number is stored in the iterator every time a memcg is cached. If the iterator's count doesn't match that of the current walker's root, we start over from the root again. The group counter is incremented up the hierarchy every time a group is removed. dead_count_lock makes sure that we do not race with memcg removal. Spotted-by: Ying Han Original-idea-by: Johannes Weiner Signed-off-by: Michal Hocko --- mm/memcontrol.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 57 insertions(+), 11 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e9f5c47..65bf2cb 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu { }; struct mem_cgroup_reclaim_iter { - /* last scanned hierarchy member with elevated css ref count */ + /* + * last scanned hierarchy member. Valid only if last_dead_count + * matches memcg->dead_count of the hierarchy root group.
+ */ struct mem_cgroup *last_visited; + unsigned int last_dead_count; + /* scan generation, increased every round-trip */ unsigned int generation; /* lock to protect the position and generation */ @@ -357,6 +362,8 @@ struct mem_cgroup { struct mem_cgroup_stat_cpu nocpu_base; spinlock_t pcp_counter_lock; + spinlock_t dead_count_lock; + unsigned int dead_count; #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET) struct tcp_memcontrol tcp_mem; #endif @@ -1162,15 +1169,24 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, mz = mem_cgroup_zoneinfo(root, nid, zid); iter = &mz->reclaim_iter[reclaim->priority]; spin_lock(&iter->iter_lock); - last_visited = iter->last_visited; if (prev && reclaim->generation != iter->generation) { - if (last_visited) { - css_put(&last_visited->css); - iter->last_visited = NULL; - } + iter->last_visited = NULL; spin_unlock(&iter->iter_lock); goto out_unlock; } + + /* + * last_visited might be invalid if some of the group + * downwards was removed. As we do not know which one + * disappeared we have to start all over again from the + * root. 
+ */ + spin_lock(&root->dead_count_lock); + last_visited = iter->last_visited; + if (last_visited && (root->dead_count != + iter->last_dead_count)) { + last_visited = NULL; + } } /* @@ -1204,16 +1220,21 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, if (reclaim) { struct mem_cgroup *curr = memcg; - if (last_visited) - css_put(&last_visited->css); + /* + * last_visited is not longer used so we can let + * other thread to run and update dead_count + * because the current memcg would be valid + * regardless other memcg was removed + */ + spin_unlock(&root->dead_count_lock); if (css && !memcg) curr = mem_cgroup_from_css(css); - /* make sure that the cached memcg is not removed */ - if (curr) - css_get(&curr->css); iter->last_visited = curr; + spin_lock(&root->dead_count_lock); + iter->last_dead_count = root->dead_count; + spin_unlock(&root->dead_count_lock); if (!css) iter->generation++; @@ -6375,10 +6396,35 @@ free_out: return ERR_PTR(error); } +/* + * Announce all parents that a group from their hierarchy is gone. + */ +static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *memcg) +{ + struct mem_cgroup *parent = memcg; + + while ((parent = parent_mem_cgroup(parent))) { + spin_lock(&parent->dead_count_lock); + parent->dead_count++; + spin_unlock(&parent->dead_count_lock); + } + + /* + * if the root memcg is not hierarchical we have to check it + * explicitely. 
+ */ + if (!root_mem_cgroup->use_hierarchy) { + spin_lock(&root_mem_cgroup->dead_count_lock); + root_mem_cgroup->dead_count++; + spin_unlock(&root_mem_cgroup->dead_count_lock); + } +} + static void mem_cgroup_css_offline(struct cgroup *cont) { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); + mem_cgroup_uncache_from_reclaim(memcg); mem_cgroup_reparent_charges(memcg); mem_cgroup_destroy_all_caches(memcg); } -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758576Ab3BKR4f (ORCPT ); Mon, 11 Feb 2013 12:56:35 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:39580 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758464Ab3BKR4e (ORCPT ); Mon, 11 Feb 2013 12:56:34 -0500 Date: Mon, 11 Feb 2013 12:56:19 -0500 From: Johannes Weiner To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130211175619.GC13218@cmpxchg.org> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130211151649.GD19922@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote: > On Fri 08-02-13 14:33:18, Johannes Weiner wrote: > [...] > > for each in hierarchy: > > for each node: > > for each zone: > > for each reclaim priority: > > > > every time a cgroup is destroyed. I don't think such a hammer is > > justified in general, let alone for consolidating code a little. > > > > Can we invalidate the position cache lazily?
Have a global "cgroup > > destruction" counter and store a snapshot of that counter whenever we > > put a cgroup pointer in the position cache. We only use the cached > > pointer if that counter has not changed in the meantime, so we know > > that the cgroup still exists. > > Currently we have: > rcu_read_lock() // keeps cgroup links safe > iter->iter_lock // keeps selection exclusive for a specific iterator > 1) global_counter == iter_counter > 2) css_tryget(cached_memcg) // check it is still alive > rcu_read_unlock() > > What would protect us from races when css would disappear between 1 and > 2? rcu > css is invalidated from worker context scheduled from __css_put and it > is using dentry locking which we surely do not want to pull here. We > could hook into css_offline which is called with cgroup_mutex but we > cannot use this one here because it is no longer exported and Tejun > would kill us for that. > So we can add a new global memcg internal lock to do this atomically. > Ohh, this is getting uglier... A racing final css_put() means that the tryget fails, but our RCU read lock keeps the CSS allocated. If the dead_count is uptodate, it means that the rcu read lock was acquired before the synchronize_rcu() before the css is freed. > > It is pretty pretty imprecise and we invalidate the whole cache every > > time a cgroup is destroyed, but I think that should be okay. > > I am not sure this is OK because this gives an indirect way of > influencing reclaim in one hierarchy by another one which opens a door > for regressions (or malicious over-reclaim in the extreme case). > So I do not like this very much. > > > If not, better ideas are welcome. > > Maybe we could keep the counter per memcg but that would mean that we > would need to go up the hierarchy as well. We wouldn't have to go over > node-zone-priority cleanup so it would be much more lightweight. 
> > I am not sure this is necessarily better than explicit cleanup because > it brings yet another kind of generation number to the game but I guess > I can live with it if people really thing the relaxed way is much > better. > What do you think about the patch below (untested yet)? Better, but I think you can get rid of both locks: mem_cgroup_iter: rcu_read_lock() if atomic_read(&root->dead_count) == iter->dead_count: smp_rmb() if tryget(iter->position): position = iter->position memcg = find_next(postion) css_put(position) iter->position = memcg smp_wmb() /* Write position cache BEFORE marking it uptodate */ iter->dead_count = atomic_read(&root->dead_count) rcu_read_unlock() From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759633Ab3BKT3h (ORCPT ); Mon, 11 Feb 2013 14:29:37 -0500 Received: from mail-ee0-f50.google.com ([74.125.83.50]:64588 "EHLO mail-ee0-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759331Ab3BKT3e (ORCPT ); Mon, 11 Feb 2013 14:29:34 -0500 Date: Mon, 11 Feb 2013 20:29:29 +0100 From: Michal Hocko To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130211192929.GB29000@dhcp22.suse.cz> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130211175619.GC13218@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 11-02-13 12:56:19, Johannes Weiner wrote: > On Mon, Feb 11, 2013 at 04:16:49PM 
+0100, Michal Hocko wrote: > > On Fri 08-02-13 14:33:18, Johannes Weiner wrote: > > [...] > > > for each in hierarchy: > > > for each node: > > > for each zone: > > > for each reclaim priority: > > > > > > every time a cgroup is destroyed. I don't think such a hammer is > > > justified in general, let alone for consolidating code a little. > > > > > > Can we invalidate the position cache lazily? Have a global "cgroup > > > destruction" counter and store a snapshot of that counter whenever we > > > put a cgroup pointer in the position cache. We only use the cached > > > pointer if that counter has not changed in the meantime, so we know > > > that the cgroup still exists. > > > > Currently we have: > > rcu_read_lock() // keeps cgroup links safe > > iter->iter_lock // keeps selection exclusive for a specific iterator > > 1) global_counter == iter_counter > > 2) css_tryget(cached_memcg) // check it is still alive > > rcu_read_unlock() > > > > What would protect us from races when css would disappear between 1 and > > 2? > > rcu That was my first attempt but then I convinced myself it might not be sufficient. But now that I think about it more I guess you are right. > > css is invalidated from worker context scheduled from __css_put and it > > is using dentry locking which we surely do not want to pull here. We > > could hook into css_offline which is called with cgroup_mutex but we > > cannot use this one here because it is no longer exported and Tejun > > would kill us for that. > > So we can add a new global memcg internal lock to do this atomically. > > Ohh, this is getting uglier... > > A racing final css_put() means that the tryget fails, but our RCU read > lock keeps the CSS allocated. If the dead_count is uptodate, it means > that the rcu read lock was acquired before the synchronize_rcu() > before the css is freed. yes. 
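The two-step validity check agreed on above can be modeled in userspace as follows. This is only a sketch under stated assumptions: struct node, node_tryget, node_put, struct iter_cache, cache_resume and demo_resume are hypothetical names, the reference count is a plain C11 atomic rather than the kernel's css refcount, and nothing here stands in for RCU, so the model simply assumes that the cached object's memory stays valid.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object; a count of 0 means the final put happened. */
struct node {
	atomic_int refcnt;
};

/* Take a reference only if the object is still live. */
static bool node_tryget(struct node *n)
{
	int old = atomic_load(&n->refcnt);

	while (old > 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&n->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

static void node_put(struct node *n)
{
	atomic_fetch_sub(&n->refcnt, 1);
}

/* Cached position plus the removal count seen when it was stored. */
struct iter_cache {
	struct node *last_visited;
	unsigned int last_dead_count;
};

/*
 * Step 1: the removal count must be unchanged since the position was cached.
 * Step 2: tryget must succeed, which also pins the object for the caller.
 * Returns NULL when the walk has to restart from the root.
 */
static struct node *cache_resume(const struct iter_cache *it,
				 atomic_uint *dead_count)
{
	struct node *pos = it->last_visited;

	if (!pos || atomic_load(dead_count) != it->last_dead_count)
		return NULL;
	if (!node_tryget(pos))
		return NULL;
	return pos;
}

static void demo_resume(void)
{
	atomic_uint dead_count;
	struct node n;
	struct iter_cache it = { &n, 0 };

	atomic_init(&dead_count, 0);
	atomic_init(&n.refcnt, 1);

	/* Both checks pass; the caller now owns an extra reference. */
	assert(cache_resume(&it, &dead_count) == &n);
	node_put(&n);

	/* A removal somewhere below the root makes step 1 fail. */
	atomic_fetch_add(&dead_count, 1);
	assert(cache_resume(&it, &dead_count) == NULL);

	/* Counts match again, but the final put makes step 2 fail. */
	it.last_dead_count = atomic_load(&dead_count);
	node_put(&n);
	assert(cache_resume(&it, &dead_count) == NULL);
}
```

In the kernel the cached pointer may only be dereferenced for the tryget because the RCU read lock keeps the css allocated, as explained above; the sketch has no equivalent and relies on the object living on the stack for the whole test.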
> > > > It is pretty pretty imprecise and we invalidate the whole cache every > > > time a cgroup is destroyed, but I think that should be okay. > > > > I am not sure this is OK because this gives an indirect way of > > influencing reclaim in one hierarchy by another one which opens a door > > for regressions (or malicious over-reclaim in the extreme case). > > So I do not like this very much. > > > > > If not, better ideas are welcome. > > > > Maybe we could keep the counter per memcg but that would mean that we > > would need to go up the hierarchy as well. We wouldn't have to go over > > node-zone-priority cleanup so it would be much more lightweight. > > > > I am not sure this is necessarily better than explicit cleanup because > > it brings yet another kind of generation number to the game but I guess > > I can live with it if people really thing the relaxed way is much > > better. > > What do you think about the patch below (untested yet)? > > Better, but I think you can get rid of both locks: What is the other lock you have in mind? > mem_cgroup_iter: > rcu_read_lock() > if atomic_read(&root->dead_count) == iter->dead_count: > smp_rmb() > if tryget(iter->position): > position = iter->position > memcg = find_next(postion) > css_put(position) > iter->position = memcg > smp_wmb() /* Write position cache BEFORE marking it uptodate */ > iter->dead_count = atomic_read(&root->dead_count) > rcu_read_unlock() Updated patch below: --- >>From 756c4f0091d250bc5ff816f8e9d11840e8522b3a Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Mon, 11 Feb 2013 20:23:51 +0100 Subject: [PATCH] memcg: relax memcg iter caching Now that the per-node-zone-priority iterator caches memory cgroups rather than their css ids, we have to be careful and remove them from the iterator when they are on the way out; otherwise they might hang around for an unbounded amount of time (until global/targeted reclaim visits the zone at that priority, finds out the group is dead and lets it find its final rest).
We can fix this issue by relaxing the rules for the last_visited memcg as well. Instead of taking a reference to the css before it is stored into iter->last_visited, we can just store its pointer and track the number of removed groups for each memcg. This number is stored in the iterator every time a memcg is cached. If the iterator's count doesn't match that of the current walker's root, we start over from the root again. The group counter is incremented up the hierarchy every time a group is removed. The locking rules are a bit complicated, but we primarily rely on RCU, which protects the css from disappearing while it is being proven still valid. The validity is checked in two steps. First, iter->last_dead_count has to match root->dead_count; second, css_tryget has to confirm that the group is still alive, and it pins the group until we get the next memcg. Spotted-by: Ying Han Original-idea-by: Johannes Weiner Signed-off-by: Michal Hocko --- mm/memcontrol.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 57 insertions(+), 9 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e9f5c47..f9b5719 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu { }; struct mem_cgroup_reclaim_iter { - /* last scanned hierarchy member with elevated css ref count */ + /* + * last scanned hierarchy member. Valid only if last_dead_count + * matches memcg->dead_count of the hierarchy root group.
+ */ struct mem_cgroup *last_visited; + unsigned int last_dead_count; + /* scan generation, increased every round-trip */ unsigned int generation; /* lock to protect the position and generation */ @@ -357,6 +362,7 @@ struct mem_cgroup { struct mem_cgroup_stat_cpu nocpu_base; spinlock_t pcp_counter_lock; + atomic_t dead_count; #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET) struct tcp_memcontrol tcp_mem; #endif @@ -1158,19 +1164,33 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, int nid = zone_to_nid(reclaim->zone); int zid = zone_idx(reclaim->zone); struct mem_cgroup_per_zone *mz; + unsigned int dead_count; mz = mem_cgroup_zoneinfo(root, nid, zid); iter = &mz->reclaim_iter[reclaim->priority]; spin_lock(&iter->iter_lock); - last_visited = iter->last_visited; if (prev && reclaim->generation != iter->generation) { - if (last_visited) { - css_put(&last_visited->css); - iter->last_visited = NULL; - } + iter->last_visited = NULL; spin_unlock(&iter->iter_lock); goto out_unlock; } + + /* + * last_visited might be invalid if some of the group + * downwards was removed. As we do not know which one + * disappeared we have to start all over again from the + * root. 
+ * css ref count then makes sure that css won't + * disappear while we iterate to the next memcg + */ + last_visited = iter->last_visited; + dead_count = atomic_read(&root->dead_count); + smp_rmb(); + if (last_visited && + ((dead_count != iter->last_dead_count) || + !css_tryget(&last_visited->css))) { + last_visited = NULL; + } } /* @@ -1210,10 +1230,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, if (css && !memcg) curr = mem_cgroup_from_css(css); - /* make sure that the cached memcg is not removed */ - if (curr) - css_get(&curr->css); + /* + * No memory barrier is needed here because we are + * protected by iter_lock + */ iter->last_visited = curr; + iter->last_dead_count = atomic_read(&root->dead_count); if (!css) iter->generation++; @@ -6375,10 +6397,36 @@ free_out: return ERR_PTR(error); } +/* + * Announce all parents that a group from their hierarchy is gone. + */ +static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *memcg) +{ + struct mem_cgroup *parent = memcg; + + while ((parent = parent_mem_cgroup(parent))) + atomic_inc(&parent->dead_count); + + /* + * if the root memcg is not hierarchical we have to check it + * explicitely. + */ + if (!root_mem_cgroup->use_hierarchy) + atomic_inc(&parent->dead_count); + + /* + * Make sure that dead_count updates are visible before other + * cleanup from css_offline. 
+ * Pairs with smp_rmb in mem_cgroup_iter + */ + smp_wmb(); +} + static void mem_cgroup_css_offline(struct cgroup *cont) { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); + mem_cgroup_uncache_from_reclaim(memcg); mem_cgroup_reparent_charges(memcg); mem_cgroup_destroy_all_caches(memcg); } -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759861Ab3BKT6m (ORCPT ); Mon, 11 Feb 2013 14:58:42 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:39592 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759516Ab3BKT6l (ORCPT ); Mon, 11 Feb 2013 14:58:41 -0500 Date: Mon, 11 Feb 2013 14:58:24 -0500 From: Johannes Weiner To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130211195824.GB15951@cmpxchg.org> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130211192929.GB29000@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Feb 11, 2013 at 08:29:29PM +0100, Michal Hocko wrote: > On Mon 11-02-13 12:56:19, Johannes Weiner wrote: > > On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote: > > > Maybe we could keep the counter per memcg but that would mean that we > > > would need to go up the hierarchy as well. We wouldn't have to go over > > > node-zone-priority cleanup so it would be much more lightweight. 
> > >
> > > I am not sure this is necessarily better than explicit cleanup because
> > > it brings yet another kind of generation number to the game but I guess
> > > I can live with it if people really think the relaxed way is much
> > > better.
> > > What do you think about the patch below (untested yet)?
> >
> > Better, but I think you can get rid of both locks:
>
> What is the other lock you have in mind?

The iter lock itself. I mean, multiple reclaimers can still race but
there won't be any corruption (if you make iter->dead_count a long,
setting it happens atomically, we only need the memcg->dead_count to be
an atomic because of the inc) and the worst that could happen is that
a reclaim starts at the wrong point in the hierarchy, right?

But as you said in the changelog that introduced the lock, it's never
actually been a practical problem.

You just need to put the wmb back in place, so that we never see the
dead_count give the green light while the cached position is stale, or
we'll tryget random memory.

> > mem_cgroup_iter:
> > rcu_read_lock()
> > if atomic_read(&root->dead_count) == iter->dead_count:
> >   smp_rmb()
> >   if tryget(iter->position):
> >     position = iter->position
> > memcg = find_next(position)
> > css_put(position)
> > iter->position = memcg
> > smp_wmb() /* Write position cache BEFORE marking it uptodate */
> > iter->dead_count = atomic_read(&root->dead_count)
> > rcu_read_unlock()
>
> Updated patch below:

Cool, thanks.
I hope you don't find it too ugly anymore :-) > >From 756c4f0091d250bc5ff816f8e9d11840e8522b3a Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Mon, 11 Feb 2013 20:23:51 +0100 > Subject: [PATCH] memcg: relax memcg iter caching > > Now that per-node-zone-priority iterator caches memory cgroups rather > than their css ids we have to be careful and remove them from the > iterator when they are on the way out otherwise they might hang for > unbounded amount of time (until the global/targeted reclaim triggers the > zone under priority to find out the group is dead and let it to find the > final rest). > > We can fix this issue by relaxing rules for the last_visited memcg as > well. > Instead of taking reference to css before it is stored into > iter->last_visited we can just store its pointer and track the number of > removed groups for each memcg. This number would be stored into iterator > everytime when a memcg is cached. If the iter count doesn't match the > curent walker root's one we will start over from the root again. The > group counter is incremented upwards the hierarchy every time a group is > removed. > > Locking rules are a bit complicated but we primarily rely on rcu which > protects css from disappearing while it is proved to be still valid. The > validity is checked in two steps. First the iter->last_dead_count has > to match root->dead_count and second css_tryget has to confirm the > that the group is still alive and it pins it until we get a next memcg. 
> > Spotted-by: Ying Han > Original-idea-by: Johannes Weiner > Signed-off-by: Michal Hocko > --- > mm/memcontrol.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++-------- > 1 file changed, 57 insertions(+), 9 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index e9f5c47..f9b5719 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu { > }; > > struct mem_cgroup_reclaim_iter { > - /* last scanned hierarchy member with elevated css ref count */ > + /* > + * last scanned hierarchy member. Valid only if last_dead_count > + * matches memcg->dead_count of the hierarchy root group. > + */ > struct mem_cgroup *last_visited; > + unsigned int last_dead_count; > + > /* scan generation, increased every round-trip */ > unsigned int generation; > /* lock to protect the position and generation */ > @@ -357,6 +362,7 @@ struct mem_cgroup { > struct mem_cgroup_stat_cpu nocpu_base; > spinlock_t pcp_counter_lock; > > + atomic_t dead_count; > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET) > struct tcp_memcontrol tcp_mem; > #endif > @@ -1158,19 +1164,33 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > int nid = zone_to_nid(reclaim->zone); > int zid = zone_idx(reclaim->zone); > struct mem_cgroup_per_zone *mz; > + unsigned int dead_count; > > mz = mem_cgroup_zoneinfo(root, nid, zid); > iter = &mz->reclaim_iter[reclaim->priority]; > spin_lock(&iter->iter_lock); > - last_visited = iter->last_visited; > if (prev && reclaim->generation != iter->generation) { > - if (last_visited) { > - css_put(&last_visited->css); > - iter->last_visited = NULL; > - } > + iter->last_visited = NULL; > spin_unlock(&iter->iter_lock); > goto out_unlock; > } > + > + /* > + * last_visited might be invalid if some of the group > + * downwards was removed. As we do not know which one > + * disappeared we have to start all over again from the > + * root. 
> + * css ref count then makes sure that css won't > + * disappear while we iterate to the next memcg > + */ > + last_visited = iter->last_visited; > + dead_count = atomic_read(&root->dead_count); > + smp_rmb(); Confused about this barrier, see below. As per above, if you remove the iter lock, those lines are mixed up. You need to read the dead count first because the writer updates the dead count after it sets the new position. That way, if the dead count gives the go-ahead, you KNOW that the position cache is valid, because it has been updated first. If either the two reads or the two writes get reordered, you risk seeing a matching dead count while the position cache is stale. > + if (last_visited && > + ((dead_count != iter->last_dead_count) || > + !css_tryget(&last_visited->css))) { > + last_visited = NULL; > + } > } > > /* > @@ -1210,10 +1230,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > if (css && !memcg) > curr = mem_cgroup_from_css(css); > > - /* make sure that the cached memcg is not removed */ > - if (curr) > - css_get(&curr->css); > + /* > + * No memory barrier is needed here because we are > + * protected by iter_lock > + */ > iter->last_visited = curr; > + iter->last_dead_count = atomic_read(&root->dead_count); > > if (!css) > iter->generation++; > @@ -6375,10 +6397,36 @@ free_out: > return ERR_PTR(error); > } > > +/* > + * Announce all parents that a group from their hierarchy is gone. > + */ > +static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *memcg) How about static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg) ? > +{ > + struct mem_cgroup *parent = memcg; > + > + while ((parent = parent_mem_cgroup(parent))) > + atomic_inc(&parent->dead_count); > + > + /* > + * if the root memcg is not hierarchical we have to check it > + * explicitely. > + */ > + if (!root_mem_cgroup->use_hierarchy) > + atomic_inc(&parent->dead_count); Increase root_mem_cgroup->dead_count instead? 
> + /* > + * Make sure that dead_count updates are visible before other > + * cleanup from css_offline. > + * Pairs with smp_rmb in mem_cgroup_iter > + */ > + smp_wmb(); That's unexpected. What other cleanups? A race between this and mem_cgroup_iter should be fine because of the RCU synchronization. Thanks! Johannes From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932663Ab3BKV2F (ORCPT ); Mon, 11 Feb 2013 16:28:05 -0500 Received: from mail-ee0-f48.google.com ([74.125.83.48]:36988 "EHLO mail-ee0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932605Ab3BKV2D (ORCPT ); Mon, 11 Feb 2013 16:28:03 -0500 Date: Mon, 11 Feb 2013 22:27:56 +0100 From: Michal Hocko To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130211212756.GC29000@dhcp22.suse.cz> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130211195824.GB15951@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 11-02-13 14:58:24, Johannes Weiner wrote: > On Mon, Feb 11, 2013 at 08:29:29PM +0100, Michal Hocko wrote: > > On Mon 11-02-13 12:56:19, Johannes Weiner wrote: > > > On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote: > > > > Maybe we could keep the counter per memcg but that would mean that we > > > > would need to go up the hierarchy as well. 
We wouldn't have to go over > > > > node-zone-priority cleanup so it would be much more lightweight. > > > > > > > > I am not sure this is necessarily better than explicit cleanup because > > > > it brings yet another kind of generation number to the game but I guess > > > > I can live with it if people really thing the relaxed way is much > > > > better. > > > > What do you think about the patch below (untested yet)? > > > > > > Better, but I think you can get rid of both locks: > > > > What is the other lock you have in mind. > > The iter lock itself. I mean, multiple reclaimers can still race but > there won't be any corruption (if you make iter->dead_count a long, > setting it happens atomically, we nly need the memcg->dead_count to be > an atomic because of the inc) and the worst that could happen is that > a reclaim starts at the wrong point in hierarchy, right? The lack of synchronization basically means that 2 parallel reclaimers can reclaim every group exactly once (ideally) or up to each group twice in the worst case. So the exclusion was quite comfortable. > But as you said in the changelog that introduced the lock, it's never > actually been a practical problem. That is true but those bugs would be subtle though so I wouldn't be opposed to prevent from them before we get burnt. But if you think that we should keep the previous semantic I can drop that patch. > You just need to put the wmb back in place, so that we never see the > dead_count give the green light while the cached position is stale, or > we'll tryget random memory. 
>
> > > mem_cgroup_iter:
> > > rcu_read_lock()
> > > if atomic_read(&root->dead_count) == iter->dead_count:
> > >   smp_rmb()
> > >   if tryget(iter->position):
> > >     position = iter->position
> > > memcg = find_next(position)
> > > css_put(position)
> > > iter->position = memcg
> > > smp_wmb() /* Write position cache BEFORE marking it uptodate */
> > > iter->dead_count = atomic_read(&root->dead_count)
> > > rcu_read_unlock()
> >
> > Updated patch below:
>
> Cool, thanks. I hope you don't find it too ugly anymore :-)

It's getting tricky and you know how people love when you have to play
and rely on atomics with memory barriers...

[...]
> > +
> > +		/*
> > +		 * last_visited might be invalid if some of the group
> > +		 * downwards was removed. As we do not know which one
> > +		 * disappeared we have to start all over again from the
> > +		 * root.
> > +		 * css ref count then makes sure that css won't
> > +		 * disappear while we iterate to the next memcg
> > +		 */
> > +		last_visited = iter->last_visited;
> > +		dead_count = atomic_read(&root->dead_count);
> > +		smp_rmb();
>
> Confused about this barrier, see below.
>
> As per above, if you remove the iter lock, those lines are mixed up.
> You need to read the dead count first because the writer updates the
> dead count after it sets the new position.

You are right, we need
+		dead_count = atomic_read(&root->dead_count);
+		smp_rmb();
+		last_visited = iter->last_visited;

> That way, if the dead count gives the go-ahead, you KNOW that the
> position cache is valid, because it has been updated first.

OK, you are right. We can live without css_tryget because dead_count is
either OK, which means that the css will stay alive for at least this rcu
period (and the RCU walk would be safe as well), or it has been
incremented, which means that css_offline has already started and the css
is already dead. So css_tryget can be dropped.
> If either the two reads or the two writes get reordered, you risk > seeing a matching dead count while the position cache is stale. > > > + if (last_visited && > > + ((dead_count != iter->last_dead_count) || > > + !css_tryget(&last_visited->css))) { > > + last_visited = NULL; > > + } > > } > > > > /* > > @@ -1210,10 +1230,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > > if (css && !memcg) > > curr = mem_cgroup_from_css(css); > > > > - /* make sure that the cached memcg is not removed */ > > - if (curr) > > - css_get(&curr->css); > > + /* > > + * No memory barrier is needed here because we are > > + * protected by iter_lock > > + */ > > iter->last_visited = curr; + smp_wmb(); > > + iter->last_dead_count = atomic_read(&root->dead_count); > > > > if (!css) > > iter->generation++; > > @@ -6375,10 +6397,36 @@ free_out: > > return ERR_PTR(error); > > } > > > > +/* > > + * Announce all parents that a group from their hierarchy is gone. > > + */ > > +static void mem_cgroup_uncache_from_reclaim(struct mem_cgroup *memcg) > > How about > > static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg) OK > ? > > > +{ > > + struct mem_cgroup *parent = memcg; > > + > > + while ((parent = parent_mem_cgroup(parent))) > > + atomic_inc(&parent->dead_count); > > + > > + /* > > + * if the root memcg is not hierarchical we have to check it > > + * explicitely. > > + */ > > + if (!root_mem_cgroup->use_hierarchy) > > + atomic_inc(&parent->dead_count); > > Increase root_mem_cgroup->dead_count instead? Sure. C&P > > + /* > > + * Make sure that dead_count updates are visible before other > > + * cleanup from css_offline. > > + * Pairs with smp_rmb in mem_cgroup_iter > > + */ > > + smp_wmb(); > > That's unexpected. What other cleanups? A race between this and > mem_cgroup_iter should be fine because of the RCU synchronization. OK, I was too careful, probably (memory barriers are always head scratchers). 
I was worried about all dead_count should be committed before we do other steps in the clean up like reparenting charges etc. But as you say it will not do any changes. I will get back to this tomorrow. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932741Ab3BKWHt (ORCPT ); Mon, 11 Feb 2013 17:07:49 -0500 Received: from mail-ea0-f171.google.com ([209.85.215.171]:33550 "EHLO mail-ea0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932253Ab3BKWHr (ORCPT ); Mon, 11 Feb 2013 17:07:47 -0500 Date: Mon, 11 Feb 2013 23:07:42 +0100 From: Michal Hocko To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130211220742.GD29000@dhcp22.suse.cz> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130211212756.GC29000@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 11-02-13 22:27:56, Michal Hocko wrote: [...] > I will get back to this tomorrow. 
Maybe not a great idea as it is getting late here and my brain is turning
into cabbage, but here we go:
---
From f927358fe620837081d7a7ec6bf27af378deb35d Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Mon, 11 Feb 2013 23:02:00 +0100
Subject: [PATCH] memcg: relax memcg iter caching

Now that the per-node-zone-priority iterator caches memory cgroups rather
than their css ids, we have to be careful and remove them from the
iterator when they are on the way out, otherwise they might hang around
for an unbounded amount of time (until the global/targeted reclaim
triggers the zone under priority, finds out the group is dead and lets it
find its final rest).

We can fix this issue by relaxing the rules for the last_visited memcg as
well.
Instead of taking a reference to the css before it is stored into
iter->last_visited we can just store its pointer and track the number of
removed groups for each memcg. This number is stored into the iterator
every time a memcg is cached. If the iterator's count doesn't match the
current walker root's one we start over from the root again. The
group counter is incremented up the hierarchy every time a group is
removed.

Locking rules got a bit complicated. We primarily rely on the rcu read
lock, which makes sure that once we see an up-to-date dead_count,
iter->last_visited is valid for an RCU walk. smp_rmb makes sure that
dead_count is read before last_visited and last_dead_count, while smp_wmb
makes sure that last_visited is updated before last_dead_count, so an
up-to-date last_dead_count cannot point to an outdated last_visited.
This also means that css reference counting is no longer needed because
RCU will keep last_visited alive.
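
For illustration only (userspace C11, not kernel code and not part of the
patch): a minimal sketch of the counter-based cache validation discussed
in this thread. The names struct group, struct iter_cache,
invalidate_iterators, cache_store and cache_load are hypothetical
stand-ins for the memcg structures and helpers. The release and acquire
fences are assumed to play the roles of smp_wmb() and smp_rmb(). The
reader follows the order from Johannes's pseudocode (compare the
counters, barrier, then read the position). RCU and css reference
counting are not modelled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup. */
struct group {
	struct group *parent;
	atomic_uint dead_count;	/* number of groups removed below this one */
};

/* Hypothetical stand-in for struct mem_cgroup_reclaim_iter. */
struct iter_cache {
	_Atomic(struct group *) last_visited;
	atomic_uint last_dead_count;
};

/* Removal side: tell every ancestor that a group below it is gone. */
void invalidate_iterators(struct group *g)
{
	assert(g != NULL);
	for (struct group *p = g->parent; p != NULL; p = p->parent)
		atomic_fetch_add_explicit(&p->dead_count, 1,
					  memory_order_relaxed);
}

/* Writer: store the position BEFORE marking the cache up to date. */
void cache_store(struct iter_cache *it, struct group *root,
		 struct group *pos)
{
	atomic_store_explicit(&it->last_visited, pos, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* role of smp_wmb() */
	atomic_store_explicit(&it->last_dead_count,
			      atomic_load_explicit(&root->dead_count,
						   memory_order_relaxed),
			      memory_order_relaxed);
}

/*
 * Reader: trust the cached position only if the counters match.
 * Returns NULL when the walk has to restart from the root.
 */
struct group *cache_load(struct iter_cache *it, struct group *root)
{
	unsigned int now = atomic_load_explicit(&root->dead_count,
						memory_order_relaxed);

	if (now != atomic_load_explicit(&it->last_dead_count,
					memory_order_relaxed))
		return NULL;
	atomic_thread_fence(memory_order_acquire);	/* role of smp_rmb() */
	return atomic_load_explicit(&it->last_visited, memory_order_relaxed);
}
```

With a root, a child and a leaf, caching the leaf and then calling
invalidate_iterators() on it bumps dead_count in the child and in the
root, so the next cache_load() against the root returns NULL and the
walk starts over. Whether the real code additionally needs css_tryget()
on the cached group is exactly what the rest of this thread argues
about, and the sketch takes no position on that.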
Spotted-by: Ying Han Original-idea-by: Johannes Weiner Signed-off-by: Michal Hocko --- mm/memcontrol.c | 53 ++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 44 insertions(+), 9 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e9f5c47..42f9d94 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu { }; struct mem_cgroup_reclaim_iter { - /* last scanned hierarchy member with elevated css ref count */ + /* + * last scanned hierarchy member. Valid only if last_dead_count + * matches memcg->dead_count of the hierarchy root group. + */ struct mem_cgroup *last_visited; + unsigned int last_dead_count; + /* scan generation, increased every round-trip */ unsigned int generation; /* lock to protect the position and generation */ @@ -357,6 +362,7 @@ struct mem_cgroup { struct mem_cgroup_stat_cpu nocpu_base; spinlock_t pcp_counter_lock; + atomic_t dead_count; #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET) struct tcp_memcontrol tcp_mem; #endif @@ -1158,19 +1164,30 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, int nid = zone_to_nid(reclaim->zone); int zid = zone_idx(reclaim->zone); struct mem_cgroup_per_zone *mz; + unsigned int dead_count; mz = mem_cgroup_zoneinfo(root, nid, zid); iter = &mz->reclaim_iter[reclaim->priority]; spin_lock(&iter->iter_lock); - last_visited = iter->last_visited; if (prev && reclaim->generation != iter->generation) { - if (last_visited) { - css_put(&last_visited->css); - iter->last_visited = NULL; - } + iter->last_visited = NULL; spin_unlock(&iter->iter_lock); goto out_unlock; } + + /* + * last_visited might be invalid if some of the group + * downwards was removed. As we do not know which one + * disappeared we have to start all over again from the + * root. 
+ */ + dead_count = atomic_read(&root->dead_count); + smp_rmb(); + last_visited = iter->last_visited; + if (last_visited && + ((dead_count != iter->last_dead_count))) { + last_visited = NULL; + } } /* @@ -1210,10 +1227,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, if (css && !memcg) curr = mem_cgroup_from_css(css); - /* make sure that the cached memcg is not removed */ - if (curr) - css_get(&curr->css); iter->last_visited = curr; + smp_wmb(); + iter->last_dead_count = atomic_read(&root->dead_count); if (!css) iter->generation++; @@ -6375,10 +6391,29 @@ free_out: return ERR_PTR(error); } +/* + * Announce all parents that a group from their hierarchy is gone. + */ +static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg) +{ + struct mem_cgroup *parent = memcg; + + while ((parent = parent_mem_cgroup(parent))) + atomic_inc(&parent->dead_count); + + /* + * if the root memcg is not hierarchical we have to check it + * explicitely. + */ + if (!root_mem_cgroup->use_hierarchy) + atomic_inc(&root_mem_cgroup->dead_count); +} + static void mem_cgroup_css_offline(struct cgroup *cont) { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); + mem_cgroup_invalidate_reclaim_iterators(memcg); mem_cgroup_reparent_charges(memcg); mem_cgroup_destroy_all_caches(memcg); } -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932842Ab3BKWkD (ORCPT ); Mon, 11 Feb 2013 17:40:03 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:39606 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932641Ab3BKWkA (ORCPT ); Mon, 11 Feb 2013 17:40:00 -0500 Date: Mon, 11 Feb 2013 17:39:43 -0500 From: Johannes Weiner To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim 
iterators Message-ID: <20130211223943.GC15951@cmpxchg.org> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130211212756.GC29000@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote: > On Mon 11-02-13 14:58:24, Johannes Weiner wrote: > > On Mon, Feb 11, 2013 at 08:29:29PM +0100, Michal Hocko wrote: > > > On Mon 11-02-13 12:56:19, Johannes Weiner wrote: > > > > On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote: > > > > > Maybe we could keep the counter per memcg but that would mean that we > > > > > would need to go up the hierarchy as well. We wouldn't have to go over > > > > > node-zone-priority cleanup so it would be much more lightweight. > > > > > > > > > > I am not sure this is necessarily better than explicit cleanup because > > > > > it brings yet another kind of generation number to the game but I guess > > > > > I can live with it if people really thing the relaxed way is much > > > > > better. > > > > > What do you think about the patch below (untested yet)? > > > > > > > > Better, but I think you can get rid of both locks: > > > > > > What is the other lock you have in mind. > > > > The iter lock itself. I mean, multiple reclaimers can still race but > > there won't be any corruption (if you make iter->dead_count a long, > > setting it happens atomically, we nly need the memcg->dead_count to be > > an atomic because of the inc) and the worst that could happen is that > > a reclaim starts at the wrong point in hierarchy, right? 
>
> The lack of synchronization basically means that 2 parallel reclaimers
> can reclaim every group exactly once (ideally) or each group up to
> twice in the worst case.
> So the exclusion was quite comfortable.

It's quite unlikely, though. Don't forget that they actually reclaim
in between, I just can't see them line up perfectly and race to the
iterator at the same time repeatedly. It's more likely to happen at
the higher priority levels where less reclaim happens, and then it's
not a big deal anyway. With lower priority levels, when the glitches
would be more problematic, they also become even less likely.

> > But as you said in the changelog that introduced the lock, it's never
> > actually been a practical problem.
>
> That is true but those bugs would be subtle so I wouldn't be
> opposed to preventing them before we get burnt. But if you think that
> we should keep the previous semantics I can drop that patch.

I just think that the problem is unlikely and not that big of a deal.

> > You just need to put the wmb back in place, so that we never see the
> > dead_count give the green light while the cached position is stale, or
> > we'll tryget random memory.
> >
> > > > mem_cgroup_iter:
> > > > rcu_read_lock()
> > > > if atomic_read(&root->dead_count) == iter->dead_count:
> > > >   smp_rmb()
> > > >   if tryget(iter->position):
> > > >     position = iter->position
> > > > memcg = find_next(position)
> > > > css_put(position)
> > > > iter->position = memcg
> > > > smp_wmb() /* Write position cache BEFORE marking it uptodate */
> > > > iter->dead_count = atomic_read(&root->dead_count)
> > > > rcu_read_unlock()
> > >
> > > Updated patch below:
> >
> > Cool, thanks. I hope you don't find it too ugly anymore :-)
>
> It's getting tricky and you know how people love when you have to play
> and rely on atomics with memory barriers...

My bumper sticker reads "I don't believe in mutual exclusion" (the
kernel hacker's version of smile for the red light camera).
I mean, you were the one complaining about the lock... > > That way, if the dead count gives the go-ahead, you KNOW that the > > position cache is valid, because it has been updated first. > > OK, you are right. We can live without css_tryget because dead_count is > either OK which means that css would be alive at least this rcu period > (and RCU walk would be safe as well) or it is incremented which means > that we have started css_offline already and then css is dead already. > So css_tryget can be dropped. Not quite :) The dead_count check is for completed destructions, but the try_get is needed to detect a race with an ongoing destruction. Basically, the dead_count verifies the iterator pointer is valid (and the rcu reader lock keeps it that way), the try_get verifies that the object pointed to is still alive. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756006Ab3BLJy3 (ORCPT ); Tue, 12 Feb 2013 04:54:29 -0500 Received: from cantor2.suse.de ([195.135.220.15]:41713 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751027Ab3BLJy0 (ORCPT ); Tue, 12 Feb 2013 04:54:26 -0500 Date: Tue, 12 Feb 2013 10:54:19 +0100 From: Michal Hocko To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212095419.GB4863@dhcp22.suse.cz> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline 
In-Reply-To: <20130211223943.GC15951@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 11-02-13 17:39:43, Johannes Weiner wrote: > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote: > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote: > > > On Mon, Feb 11, 2013 at 08:29:29PM +0100, Michal Hocko wrote: > > > > On Mon 11-02-13 12:56:19, Johannes Weiner wrote: > > > > > On Mon, Feb 11, 2013 at 04:16:49PM +0100, Michal Hocko wrote: > > > > > > Maybe we could keep the counter per memcg but that would mean that we > > > > > > would need to go up the hierarchy as well. We wouldn't have to go over > > > > > > node-zone-priority cleanup so it would be much more lightweight. > > > > > > > > > > > > I am not sure this is necessarily better than explicit cleanup because > > > > > > it brings yet another kind of generation number to the game but I guess > > > > > > I can live with it if people really thing the relaxed way is much > > > > > > better. > > > > > > What do you think about the patch below (untested yet)? > > > > > > > > > > Better, but I think you can get rid of both locks: > > > > > > > > What is the other lock you have in mind. > > > > > > The iter lock itself. I mean, multiple reclaimers can still race but > > > there won't be any corruption (if you make iter->dead_count a long, > > > setting it happens atomically, we nly need the memcg->dead_count to be > > > an atomic because of the inc) and the worst that could happen is that > > > a reclaim starts at the wrong point in hierarchy, right? > > > > The lack of synchronization basically means that 2 parallel reclaimers > > can reclaim every group exactly once (ideally) or up to each group > > twice in the worst case. > > So the exclusion was quite comfortable. > > It's quite unlikely, though. 
Don't forget that they actually reclaim > in between, I just can't see them line up perfectly and race to the > iterator at the same time repeatedly. It's more likely to happen at > the higher priority levels where less reclaim happens, and then it's > not a big deal anyway. With lower priority levels, when the glitches > would be more problematic, they also become even less likely. Fair enough, I will drop that patch in the next version. > > > But as you said in the changelog that introduced the lock, it's never > > > actually been a practical problem. > > > > That is true but those bugs would be subtle though so I wouldn't be > > opposed to prevent from them before we get burnt. But if you think that > > we should keep the previous semantic I can drop that patch. > > I just think that the problem is unlikely and not that big of a deal. > > > > You just need to put the wmb back in place, so that we never see the > > > dead_count give the green light while the cached position is stale, or > > > we'll tryget random memory. > > > > > > > > mem_cgroup_iter: > > > > > rcu_read_lock() > > > > > if atomic_read(&root->dead_count) == iter->dead_count: > > > > > smp_rmb() > > > > > if tryget(iter->position): > > > > > position = iter->position > > > > > memcg = find_next(postion) > > > > > css_put(position) > > > > > iter->position = memcg > > > > > smp_wmb() /* Write position cache BEFORE marking it uptodate */ > > > > > iter->dead_count = atomic_read(&root->dead_count) > > > > > rcu_read_unlock() > > > > > > > > Updated patch bellow: > > > > > > Cool, thanks. I hope you don't find it too ugly anymore :-) > > > > It's getting trick and you know how people love when you have to play > > and rely on atomics with memory barriers... > > My bumper sticker reads "I don't believe in mutual exclusion" (the > kernel hacker's version of smile for the red light camera). Ohh, those easy riders. > I mean, you were the one complaining about the lock... 
> > > > That way, if the dead count gives the go-ahead, you KNOW that the > > > position cache is valid, because it has been updated first. > > > > OK, you are right. We can live without css_tryget because dead_count is > > either OK which means that css would be alive at least this rcu period > > (and RCU walk would be safe as well) or it is incremented which means > > that we have started css_offline already and then css is dead already. > > So css_tryget can be dropped. > > Not quite :) > > The dead_count check is for completed destructions, Not quite :P. dead_count is incremented in css_offline callback which is called before the cgroup core releases its last reference and unlinks the group from the siblings. css_tryget would already fail at this stage because CSS_DEACT_BIAS is in place at that time but this doesn't break RCU walk. So I think we are safe even without css_get. Or am I missing something? [...] -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1761259Ab3BLPK1 (ORCPT ); Tue, 12 Feb 2013 10:10:27 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:39654 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1760612Ab3BLPK0 (ORCPT ); Tue, 12 Feb 2013 10:10:26 -0500 Date: Tue, 12 Feb 2013 10:10:02 -0500 From: Johannes Weiner To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212151002.GD15951@cmpxchg.org> References: <1357235661-29564-1-git-send-email-mhocko@suse.cz> <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> 
<20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212095419.GB4863@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote: > On Mon 11-02-13 17:39:43, Johannes Weiner wrote: > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote: > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote: > > > > That way, if the dead count gives the go-ahead, you KNOW that the > > > > position cache is valid, because it has been updated first. > > > > > > OK, you are right. We can live without css_tryget because dead_count is > > > either OK which means that css would be alive at least this rcu period > > > (and RCU walk would be safe as well) or it is incremented which means > > > that we have started css_offline already and then css is dead already. > > > So css_tryget can be dropped. > > > > Not quite :) > > > > The dead_count check is for completed destructions, > > Not quite :P. dead_count is incremented in css_offline callback which is > called before the cgroup core releases its last reference and unlinks > the group from the siblinks. css_tryget would already fail at this stage > because CSS_DEACT_BIAS is in place at that time but this doesn't break > RCU walk. So I think we are safe even without css_get. But you drop the RCU lock before you return. dead_count IS incremented for every destruction, but it's not reliable for concurrent ones, is what I meant. Again, if there is a dead_count mismatch, your pointer might be dangling, easy case. 
However, even if there is no mismatch, you could still race with a destruction that has marked the object dead, and then frees it once you drop the RCU lock, so you need try_get() to check if the object is dead, or you could return a pointer to freed or soon to be freed memory. /* * If the dead_count mismatches, a destruction has happened or is * happening concurrently. If the dead_count matches, a destruction * might still happen concurrently, but since we checked under RCU, * that destruction won't free the object until we release the RCU * reader lock. Thus, the dead_count check verifies the pointer is * still valid, css_tryget() verifies the cgroup pointed to is alive. */ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933200Ab3BLPnf (ORCPT ); Tue, 12 Feb 2013 10:43:35 -0500 Received: from cantor2.suse.de ([195.135.220.15]:57578 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758691Ab3BLPnd (ORCPT ); Tue, 12 Feb 2013 10:43:33 -0500 Date: Tue, 12 Feb 2013 16:43:30 +0100 From: Michal Hocko To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212154330.GG4863@dhcp22.suse.cz> References: <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212151002.GD15951@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 12-02-13 10:10:02, Johannes Weiner wrote: > On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote: > > On Mon 11-02-13 17:39:43, Johannes Weiner wrote: > > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote: > > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote: > > > > > That way, if the dead count gives the go-ahead, you KNOW that the > > > > > position cache is valid, because it has been updated first. > > > > > > > > OK, you are right. We can live without css_tryget because dead_count is > > > > either OK which means that css would be alive at least this rcu period > > > > (and RCU walk would be safe as well) or it is incremented which means > > > > that we have started css_offline already and then css is dead already. > > > > So css_tryget can be dropped. > > > > > > Not quite :) > > > > > > The dead_count check is for completed destructions, > > > > Not quite :P. dead_count is incremented in css_offline callback which is > > called before the cgroup core releases its last reference and unlinks > > the group from the siblinks. css_tryget would already fail at this stage > > because CSS_DEACT_BIAS is in place at that time but this doesn't break > > RCU walk. So I think we are safe even without css_get. > > But you drop the RCU lock before you return. > > dead_count IS incremented for every destruction, but it's not reliable > for concurrent ones, is what I meant. Again, if there is a dead_count > mismatch, your pointer might be dangling, easy case. However, even if > there is no mismatch, you could still race with a destruction that has > marked the object dead, and then frees it once you drop the RCU lock, > so you need try_get() to check if the object is dead, or you could > return a pointer to freed or soon to be freed memory. Wait a moment. But what prevents from the following race? 
rcu_read_lock()
                                mem_cgroup_css_offline(memcg)
                                root->dead_count++
iter->last_dead_count = root->dead_count
iter->last_visited = memcg
                                // final
                                css_put(memcg);
// last_visited is still valid
rcu_read_unlock()
[...]
// next iteration
rcu_read_lock()
iter->last_dead_count == root->dead_count
// KABOOM

The race window between dead_count++ and css_put is quite big but that is not important because that css_put can happen anytime before we start the next iteration and take rcu_read_lock.
-- Michal Hocko SUSE Labs
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933321Ab3BLQNg (ORCPT ); Tue, 12 Feb 2013 11:13:36 -0500 Received: from cantor2.suse.de ([195.135.220.15]:59079 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932173Ab3BLQNf (ORCPT ); Tue, 12 Feb 2013 11:13:35 -0500 Date: Tue, 12 Feb 2013 17:13:32 +0100 From: Michal Hocko To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212161332.GI4863@dhcp22.suse.cz> References: <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212154330.GG4863@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org
On Tue 12-02-13 16:43:30, Michal Hocko wrote: [...] The example was not complete: > Wait a moment.
But what prevents from the following race? > > rcu_read_lock() cgroup_next_descendant_pre css_tryget(css); memcg = mem_cgroup_from_css(css) atomic_add(CSS_DEACT_BIAS, &css->refcnt) > mem_cgroup_css_offline(memcg) We should be safe if we did synchronize_rcu() before root->dead_count++, no? Because then we would have a guarantee that if css_tryget(memcg) succeeded then we wouldn't race with the dead_count++ it triggered. > root->dead_count++ > iter->last_dead_count = root->dead_count > iter->last_visited = memcg > // final > css_put(memcg); > // last_visited is still valid > rcu_read_unlock() > [...] > // next iteration > rcu_read_lock() > iter->last_dead_count == root->dead_count > // KABOOM > > The race window between dead_count++ and css_put is quite big but that > is not important because that css_put can happen anytime before we start > the next iteration and take rcu_read_lock. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758922Ab3BLQYp (ORCPT ); Tue, 12 Feb 2013 11:24:45 -0500 Received: from cantor2.suse.de ([195.135.220.15]:59531 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758297Ab3BLQYo (ORCPT ); Tue, 12 Feb 2013 11:24:44 -0500 Date: Tue, 12 Feb 2013 17:24:42 +0100 From: Michal Hocko To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212162442.GJ4863@dhcp22.suse.cz> References: <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> 
<20130212161332.GI4863@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212161332.GI4863@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 12-02-13 17:13:32, Michal Hocko wrote: > On Tue 12-02-13 16:43:30, Michal Hocko wrote: > [...] > The example was not complete: > > > Wait a moment. But what prevents from the following race? > > > > rcu_read_lock() > > cgroup_next_descendant_pre > css_tryget(css); > memcg = mem_cgroup_from_css(css) atomic_add(CSS_DEACT_BIAS, &css->refcnt) > > > mem_cgroup_css_offline(memcg) > > We should be safe if we did synchronize_rcu() before root->dead_count++, > no? > Because then we would have a guarantee that if css_tryget(memcg) > suceeded then we wouldn't race with dead_count++ it triggered. > > > root->dead_count++ > > iter->last_dead_count = root->dead_count > > iter->last_visited = memcg > > // final > > css_put(memcg); > > // last_visited is still valid > > rcu_read_unlock() > > [...] > > // next iteration > > rcu_read_lock() > > iter->last_dead_count == root->dead_count > > // KABOOM Ohh I have missed that we took a reference on the current memcg which will be stored into last_visited. And then later, during the next iteration it will be still alive until we are done because previous patch moved css_put to the very end. So this race is not possible. I still need to think about parallel iteration and a race with removal. 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933403Ab3BLQeD (ORCPT ); Tue, 12 Feb 2013 11:34:03 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:39664 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932330Ab3BLQeB convert rfc822-to-8bit (ORCPT ); Tue, 12 Feb 2013 11:34:01 -0500 User-Agent: K-9 Mail for Android In-Reply-To: <20130212154330.GG4863@dhcp22.suse.cz> References: <1357235661-29564-5-git-send-email-mhocko@suse.cz> <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> MIME-Version: 1.0 Content-Transfer-Encoding: 8BIT Content-Type: text/plain; charset=UTF-8 Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators From: Johannes Weiner Date: Tue, 12 Feb 2013 11:33:39 -0500 To: Michal Hocko CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Message-ID: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Michal Hocko wrote: >On Tue 12-02-13 10:10:02, Johannes Weiner wrote: >> On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote: >> > On Mon 11-02-13 17:39:43, Johannes Weiner wrote: >> > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote: >> > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote: >> > > > > That way, if the dead count gives the go-ahead, you KNOW that >the >> > > > > position cache is valid, because it has been updated first. >> > > > >> > > > OK, you are right. 
We can live without css_tryget because >dead_count is >> > > > either OK which means that css would be alive at least this rcu >period >> > > > (and RCU walk would be safe as well) or it is incremented which >means >> > > > that we have started css_offline already and then css is dead >already. >> > > > So css_tryget can be dropped. >> > > >> > > Not quite :) >> > > >> > > The dead_count check is for completed destructions, >> > >> > Not quite :P. dead_count is incremented in css_offline callback >which is >> > called before the cgroup core releases its last reference and >unlinks >> > the group from the siblinks. css_tryget would already fail at this >stage >> > because CSS_DEACT_BIAS is in place at that time but this doesn't >break >> > RCU walk. So I think we are safe even without css_get. >> >> But you drop the RCU lock before you return. >> >> dead_count IS incremented for every destruction, but it's not >reliable >> for concurrent ones, is what I meant. Again, if there is a >dead_count >> mismatch, your pointer might be dangling, easy case. However, even >if >> there is no mismatch, you could still race with a destruction that >has >> marked the object dead, and then frees it once you drop the RCU lock, >> so you need try_get() to check if the object is dead, or you could >> return a pointer to freed or soon to be freed memory. > >Wait a moment. But what prevents from the following race? > >rcu_read_lock() > mem_cgroup_css_offline(memcg) > root->dead_count++ >iter->last_dead_count = root->dead_count Use the dead count that was read the first time for the comparison, i.e. do only one atomic read in that function. You are right, we would otherwise fail to account for that concurrent destruction. >iter->last_visited = memcg > // final > css_put(memcg); >// last_visited is still valid >rcu_read_unlock() >[...] 
>// next iteration >rcu_read_lock() >iter->last_dead_count == root->dead_count >// KABOOM > >The race window between dead_count++ and css_put is quite big but that >is not important because that css_put can happen anytime before we >start >the next iteration and take rcu_read_lock. -- Sent from my Android phone with K-9 Mail. Please excuse my brevity. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758903Ab3BLQh7 (ORCPT ); Tue, 12 Feb 2013 11:37:59 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60201 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758698Ab3BLQh6 (ORCPT ); Tue, 12 Feb 2013 11:37:58 -0500 Date: Tue, 12 Feb 2013 17:37:56 +0100 From: Michal Hocko To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212163756.GK4863@dhcp22.suse.cz> References: <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161332.GI4863@dhcp22.suse.cz> <20130212162442.GJ4863@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212162442.GJ4863@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 12-02-13 17:24:42, Michal Hocko wrote: > On Tue 12-02-13 17:13:32, Michal Hocko wrote: > > On Tue 12-02-13 16:43:30, Michal Hocko wrote: > > [...] > > The example was not complete: > > > > > Wait a moment. But what prevents from the following race? 
> > > > > > rcu_read_lock() > > > > cgroup_next_descendant_pre > > css_tryget(css); > > memcg = mem_cgroup_from_css(css) atomic_add(CSS_DEACT_BIAS, &css->refcnt) > > > > > mem_cgroup_css_offline(memcg) > > > > We should be safe if we did synchronize_rcu() before root->dead_count++, > > no? > > Because then we would have a guarantee that if css_tryget(memcg) > > suceeded then we wouldn't race with dead_count++ it triggered. > > > > > root->dead_count++ > > > iter->last_dead_count = root->dead_count > > > iter->last_visited = memcg > > > // final > > > css_put(memcg); > > > // last_visited is still valid > > > rcu_read_unlock() > > > [...] > > > // next iteration > > > rcu_read_lock() > > > iter->last_dead_count == root->dead_count > > > // KABOOM > > Ohh I have missed that we took a reference on the current memcg which > will be stored into last_visited. And then later, during the next > iteration it will be still alive until we are done because previous > patch moved css_put to the very end. And that wouldn't help because: css_tryget(memcg) // OK CSS_DEACT_BIAS root->dead_count++ iter->last_visited = memcg iter->last_dead_count = root->dead_count prev = memcg css_put(memcg) memcg_iter_break css_put(memcg) // it will be released //new iteration iter->last_dead_count == root->dead_count //ok css_tryget() // KABOOM because css is already gone But I still might be missing something and need to get back to this with a clean head. 
Sorry about the spam -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933419Ab3BLQlh (ORCPT ); Tue, 12 Feb 2013 11:41:37 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:39672 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758698Ab3BLQlg convert rfc822-to-8bit (ORCPT ); Tue, 12 Feb 2013 11:41:36 -0500 User-Agent: K-9 Mail for Android In-Reply-To: <20130212162442.GJ4863@dhcp22.suse.cz> References: <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161332.GI4863@dhcp22.suse.cz> <20130212162442.GJ4863@dhcp22.suse.cz> MIME-Version: 1.0 Content-Transfer-Encoding: 8BIT Content-Type: text/plain; charset=UTF-8 Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators From: Johannes Weiner Date: Tue, 12 Feb 2013 11:41:03 -0500 To: Michal Hocko CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Message-ID: <63d3b5fa-dbc6-4bc9-8867-f9961e644305@email.android.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Michal Hocko wrote: >On Tue 12-02-13 17:13:32, Michal Hocko wrote: >> On Tue 12-02-13 16:43:30, Michal Hocko wrote: >> [...] >> The example was not complete: >> >> > Wait a moment. But what prevents from the following race? 
>> > >> > rcu_read_lock() >> >> cgroup_next_descendant_pre >> css_tryget(css); >> memcg = mem_cgroup_from_css(css) atomic_add(CSS_DEACT_BIAS, >&css->refcnt) >> >> > mem_cgroup_css_offline(memcg) >> >> We should be safe if we did synchronize_rcu() before >root->dead_count++, >> no? >> Because then we would have a guarantee that if css_tryget(memcg) >> suceeded then we wouldn't race with dead_count++ it triggered. >> >> > root->dead_count++ >> > iter->last_dead_count = root->dead_count >> > iter->last_visited = memcg >> > // final >> > css_put(memcg); >> > // last_visited is still valid >> > rcu_read_unlock() >> > [...] >> > // next iteration >> > rcu_read_lock() >> > iter->last_dead_count == root->dead_count >> > // KABOOM > >Ohh I have missed that we took a reference on the current memcg which >will be stored into last_visited. And then later, during the next >iteration it will be still alive until we are done because previous >patch moved css_put to the very end. >So this race is not possible. I still need to think about parallel >iteration and a race with removal. I thought the whole point was to not have a reference in last_visited because the iterator might be unused indefinitely :-) We only store a pointer and validate it before use the next time around. So I think the race is still possible, but we can deal with it by not losing concurrent dead count changes, i.e. one atomic read in the iterator function. -- Sent from my Android phone with K-9 Mail. Please excuse my brevity. 
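The "one atomic read in the iterator function" point can be illustrated with a small userspace sketch. It is only a model under stated assumptions, not the memcg code: struct group, struct iter, group_offline(), iter_store(), iter_pointer_valid() and next_step_trusts_cache() are hypothetical names, RCU and the memory barriers are not modeled (the interleaving is replayed sequentially in a single thread), and css_tryget() is left out because only the dead_count bookkeeping is of interest here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg; only a liveness flag is modeled. */
struct group {
	bool alive;
};

/* Hypothetical cached position; it holds no reference on the group. */
struct iter {
	struct group *last_visited;
	unsigned int last_dead_count;
};

/* Stand-in for root->dead_count. */
static atomic_uint root_dead_count;

/* Stand-in for the css_offline path: mark dead, announce the removal. */
static void group_offline(struct group *g)
{
	g->alive = false;
	atomic_fetch_add(&root_dead_count, 1);
}

/* Cache a position together with the counter value it was checked against. */
static void iter_store(struct iter *it, struct group *g, unsigned int dead_count)
{
	it->last_visited = g;
	it->last_dead_count = dead_count;
}

/* dead_count check only: may the cached pointer still be trusted? */
static bool iter_pointer_valid(const struct iter *it, unsigned int dead_count)
{
	return it->last_visited && it->last_dead_count == dead_count;
}

/*
 * Replays the interleaving from the thread: a removal slips in between the
 * start of an iteration step and the moment the position is cached.
 * Returns true if the NEXT step still trusts the cached pointer.
 */
static bool next_step_trusts_cache(bool single_read)
{
	struct group g = { .alive = true };
	struct iter it = { 0 };
	unsigned int snap;

	atomic_store(&root_dead_count, 0);

	snap = atomic_load(&root_dead_count);	/* the one read of this step */
	group_offline(&g);			/* concurrent removal */
	if (!single_read)
		snap = atomic_load(&root_dead_count);	/* second read hides it */
	iter_store(&it, &g, snap);

	/* next iteration step */
	return iter_pointer_valid(&it, atomic_load(&root_dead_count));
}
```

In this model, storing the value from the single read leaves a stale count in the cache, so the next step rejects the pointer; re-reading the counter at store time loses the removal and the stale pointer passes the dead_count check.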
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933456Ab3BLRMX (ORCPT ); Tue, 12 Feb 2013 12:12:23 -0500 Received: from cantor2.suse.de ([195.135.220.15]:33958 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933369Ab3BLRMU (ORCPT ); Tue, 12 Feb 2013 12:12:20 -0500 Date: Tue, 12 Feb 2013 18:12:16 +0100 From: Michal Hocko To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212171216.GA17663@dhcp22.suse.cz> References: <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161332.GI4863@dhcp22.suse.cz> <20130212162442.GJ4863@dhcp22.suse.cz> <63d3b5fa-dbc6-4bc9-8867-f9961e644305@email.android.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <63d3b5fa-dbc6-4bc9-8867-f9961e644305@email.android.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 12-02-13 11:41:03, Johannes Weiner wrote: > > > Michal Hocko wrote: > > >On Tue 12-02-13 17:13:32, Michal Hocko wrote: > >> On Tue 12-02-13 16:43:30, Michal Hocko wrote: > >> [...] > >> The example was not complete: > >> > >> > Wait a moment. But what prevents from the following race? 
> >> > > >> > rcu_read_lock() > >> > >> cgroup_next_descendant_pre > >> css_tryget(css); > >> memcg = mem_cgroup_from_css(css) atomic_add(CSS_DEACT_BIAS, > >&css->refcnt) > >> > >> > mem_cgroup_css_offline(memcg) > >> > >> We should be safe if we did synchronize_rcu() before > >root->dead_count++, > >> no? > >> Because then we would have a guarantee that if css_tryget(memcg) > >> suceeded then we wouldn't race with dead_count++ it triggered. > >> > >> > root->dead_count++ > >> > iter->last_dead_count = root->dead_count > >> > iter->last_visited = memcg > >> > // final > >> > css_put(memcg); > >> > // last_visited is still valid > >> > rcu_read_unlock() > >> > [...] > >> > // next iteration > >> > rcu_read_lock() > >> > iter->last_dead_count == root->dead_count > >> > // KABOOM > > > >Ohh I have missed that we took a reference on the current memcg which > >will be stored into last_visited. And then later, during the next > >iteration it will be still alive until we are done because previous > >patch moved css_put to the very end. > >So this race is not possible. I still need to think about parallel > >iteration and a race with removal. > > I thought the whole point was to not have a reference in last_visited > because have the iterator might be unused indefinitely :-) OK, it seems that I managed to confuse ;) > We only store a pointer and validate it before use the next time > around. So I think the race is still possible, but we can deal with > it by not losing concurrent dead count changes, i.e. one atomic read > in the iterator function. All reads from root->dead_count are atomic already, so I am not sure what you mean here. 
Anyway, I hope I won't make this even more confusing if I post what I have right now: --- >>From 52121928be61282dc19e32179056615ffdf128a9 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 12 Feb 2013 18:08:26 +0100 Subject: [PATCH] memcg: relax memcg iter caching Now that the per-node-zone-priority iterator caches memory cgroups rather than their css ids, we have to be careful and remove them from the iterator when they are on the way out; otherwise they might hang around for an unbounded amount of time (until the global/targeted reclaim triggers the zone under priority to find out the group is dead and let it find its final rest). We can fix this issue by relaxing the rules for the last_visited memcg as well. Instead of taking a reference to the css before it is stored into iter->last_visited, we can just store its pointer and track the number of removed groups for each memcg. This number is stored into the iterator every time a memcg is cached. If the iter count doesn't match the current walker root's one, we start over from the root again. The group counter is incremented up the hierarchy every time a group is removed. The locking rules got a bit complicated. We primarily rely on the RCU read lock, which makes sure that once we see an up-to-date dead_count, iter->last_visited is valid for an RCU walk. smp_rmb makes sure that dead_count is read before last_visited and last_dead_count, while smp_wmb makes sure that last_visited is updated before last_dead_count, so an up-to-date last_dead_count cannot point to an outdated last_visited. css_tryget then makes sure that the last_visited is still alive. 
Spotted-by: Ying Han
Original-idea-by: Johannes Weiner
Signed-off-by: Michal Hocko
---
 mm/memcontrol.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 60 insertions(+), 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 727ec39..31bb9b0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu {
 };
 
 struct mem_cgroup_reclaim_iter {
-	/* last scanned hierarchy member with elevated css ref count */
+	/*
+	 * last scanned hierarchy member. Valid only if last_dead_count
+	 * matches memcg->dead_count of the hierarchy root group.
+	 */
 	struct mem_cgroup *last_visited;
+	unsigned int last_dead_count;
+
 	/* scan generation, increased every round-trip */
 	unsigned int generation;
 };
@@ -355,6 +360,7 @@ struct mem_cgroup {
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
 
+	atomic_t dead_count;
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
 	struct tcp_memcontrol tcp_mem;
 #endif
@@ -1156,17 +1162,36 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			int nid = zone_to_nid(reclaim->zone);
 			int zid = zone_idx(reclaim->zone);
 			struct mem_cgroup_per_zone *mz;
+			unsigned int dead_count;
 
 			mz = mem_cgroup_zoneinfo(root, nid, zid);
 			iter = &mz->reclaim_iter[reclaim->priority];
-			last_visited = iter->last_visited;
 			if (prev && reclaim->generation != iter->generation) {
-				if (last_visited) {
-					css_put(&last_visited->css);
-					iter->last_visited = NULL;
-				}
+				iter->last_visited = NULL;
 				goto out_unlock;
 			}
+
+			/*
+			 * If the dead_count mismatches, a destruction
+			 * has happened or is happening concurrently.
+			 * If the dead_count matches, a destruction
+			 * might still happen concurrently, but since
+			 * we checked under RCU, that destruction
+			 * won't free the object until we release the
+			 * RCU reader lock. Thus, the dead_count
+			 * check verifies the pointer is still valid,
+			 * css_tryget() verifies the cgroup pointed to
+			 * is alive.
+			 */
+			dead_count = atomic_read(&root->dead_count);
+			smp_rmb();
+			last_visited = iter->last_visited;
+			if (last_visited) {
+				if ((dead_count != iter->last_dead_count) ||
+					!css_tryget(&last_visited->css)) {
+					last_visited = NULL;
+				}
+			}
 		}
 
 		/*
@@ -1206,10 +1231,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			if (css && !memcg)
 				curr = mem_cgroup_from_css(css);
 
-			/* make sure that the cached memcg is not removed */
-			if (curr)
-				css_get(&curr->css);
 			iter->last_visited = curr;
+			smp_wmb();
+			iter->last_dead_count = atomic_read(&root->dead_count);
 
 			if (!css)
 				iter->generation++;
@@ -6366,10 +6390,37 @@ free_out:
 	return ERR_PTR(error);
 }
 
+/*
+ * Announce all parents that a group from their hierarchy is gone.
+ */
+static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent = memcg;
+
+	/*
+	 * Make sure we are not racing with mem_cgroup_iter when it stores
+	 * a new iter->last_visited. Wait until that RCU finishes so that
+	 * it cannot see already incremented dead_count with memcg which
+	 * would be already dead next time but dead_count wouldn't tell
+	 * us about that.
+	 */
+	synchronize_rcu();
+	while ((parent = parent_mem_cgroup(parent)))
+		atomic_inc(&parent->dead_count);
+
+	/*
+	 * if the root memcg is not hierarchical we have to check it
+	 * explicitely.
+	 */
+	if (!root_mem_cgroup->use_hierarchy)
+		atomic_inc(&root_mem_cgroup->dead_count);
+}
+
 static void mem_cgroup_css_offline(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 
+	mem_cgroup_invalidate_reclaim_iterators(memcg);
 	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
 }
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933507Ab3BLRZx (ORCPT ); Tue, 12 Feb 2013 12:25:53 -0500
Received: from zene.cmpxchg.org ([85.214.230.12]:39683 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932336Ab3BLRZw (ORCPT ); Tue, 12 Feb 2013 12:25:52 -0500
Date: Tue, 12 Feb 2013 12:25:26 -0500
From: Johannes Weiner
To: "Paul E. McKenney"
Cc: Michal Hocko, linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki, Ying Han, Tejun Heo, Glauber Costa, Li Zefan
Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators
Message-ID: <20130212172526.GC25235@cmpxchg.org>
References: <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161051.GQ2666@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130212161051.GQ2666@linux.vnet.ibm.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Feb 12, 2013 at 08:10:51AM -0800, Paul E.
McKenney wrote: > On Tue, Feb 12, 2013 at 04:43:30PM +0100, Michal Hocko wrote: > > On Tue 12-02-13 10:10:02, Johannes Weiner wrote: > > > On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote: > > > > On Mon 11-02-13 17:39:43, Johannes Weiner wrote: > > > > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote: > > > > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote: > > > > > > > That way, if the dead count gives the go-ahead, you KNOW that the > > > > > > > position cache is valid, because it has been updated first. > > > > > > > > > > > > OK, you are right. We can live without css_tryget because dead_count is > > > > > > either OK which means that css would be alive at least this rcu period > > > > > > (and RCU walk would be safe as well) or it is incremented which means > > > > > > that we have started css_offline already and then css is dead already. > > > > > > So css_tryget can be dropped. > > > > > > > > > > Not quite :) > > > > > > > > > > The dead_count check is for completed destructions, > > > > > > > > Not quite :P. dead_count is incremented in css_offline callback which is > > > > called before the cgroup core releases its last reference and unlinks > > > > the group from the siblinks. css_tryget would already fail at this stage > > > > because CSS_DEACT_BIAS is in place at that time but this doesn't break > > > > RCU walk. So I think we are safe even without css_get. > > > > > > But you drop the RCU lock before you return. > > > > > > dead_count IS incremented for every destruction, but it's not reliable > > > for concurrent ones, is what I meant. Again, if there is a dead_count > > > mismatch, your pointer might be dangling, easy case. 
However, even if > > > there is no mismatch, you could still race with a destruction that has > > > marked the object dead, and then frees it once you drop the RCU lock, > > > so you need try_get() to check if the object is dead, or you could > > > return a pointer to freed or soon to be freed memory. > > > > Wait a moment. But what prevents from the following race? > > > > rcu_read_lock() > > mem_cgroup_css_offline(memcg) > > root->dead_count++ > > iter->last_dead_count = root->dead_count > > iter->last_visited = memcg > > // final > > css_put(memcg); > > // last_visited is still valid > > rcu_read_unlock() > > [...] > > // next iteration > > rcu_read_lock() > > iter->last_dead_count == root->dead_count > > // KABOOM > > > > The race window between dead_count++ and css_put is quite big but that > > is not important because that css_put can happen anytime before we start > > the next iteration and take rcu_read_lock. > > The usual approach is to make sure that there is a grace period (either > synchronize_rcu() or call_rcu()) between the time that the data is > made inaccessible to readers (this would be mem_cgroup_css_offline()?) > and the time it is freed (css_put(), correct?). Absolutely! And there is a synchronize_rcu() in between those two operations. However, we want to keep a weak reference to the cgroup after we drop the rcu read-side lock, so rcu alone is not enough for us to guarantee object life time. We still have to carefully detect any concurrent offlinings in order to validate the weak reference next time around. 
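
For illustration only, the interleaving quoted above can be replayed in
plain userspace C. This is a rough sketch under loud assumptions: struct
group, struct root_state, struct iter_cache, cache_looks_valid() and
replay_race() are hypothetical stand-ins, not kernel code; there is no
RCU and no real freeing (a "freed" flag marks where the final css_put()
would release the object); and the steps run in program order rather
than concurrently. It only shows why a counter match alone says nothing
once the snapshot is taken after the increment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are the kernel's types. */
struct group {
	bool freed;			/* set where the final css_put() would free */
};

struct root_state {
	unsigned long dead_count;	/* destructions seen below this root */
};

struct iter_cache {
	struct group *last_visited;	/* weak reference, no refcount held */
	unsigned long last_dead_count;	/* snapshot stored with the pointer */
};

/* Counter-only validation, i.e. the check without css_tryget(). */
static bool cache_looks_valid(const struct root_state *root,
			      const struct iter_cache *iter)
{
	return iter->last_visited &&
	       iter->last_dead_count == root->dead_count;
}

/*
 * Replay the quoted interleaving in program order. Returns true if a
 * group that has already been "freed" still passes the counter check.
 */
static bool replay_race(void)
{
	struct root_state root = { .dead_count = 0 };
	struct group memcg = { .freed = false };
	struct iter_cache iter = { .last_visited = NULL, .last_dead_count = 0 };

	root.dead_count++;			/* css_offline: dead_count++ */
	assert(!memcg.freed);			/* reader still holds its reference */
	iter.last_dead_count = root.dead_count;	/* sampled after the increment */
	iter.last_visited = &memcg;
	memcg.freed = true;			/* final css_put() */

	/* next iteration: the counts match although the group is gone */
	return cache_looks_valid(&root, &iter) && iter.last_visited->freed;
}
```

In this replay the snapshot and the live counter agree, so the stale
pointer would be trusted; that is the "KABOOM" case the thread is trying
to rule out.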
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933598Ab3BLRhy (ORCPT ); Tue, 12 Feb 2013 12:37:54 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:39691 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932404Ab3BLRhw (ORCPT ); Tue, 12 Feb 2013 12:37:52 -0500 Date: Tue, 12 Feb 2013 12:37:41 -0500 From: Johannes Weiner To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212173741.GD25235@cmpxchg.org> References: <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161332.GI4863@dhcp22.suse.cz> <20130212162442.GJ4863@dhcp22.suse.cz> <63d3b5fa-dbc6-4bc9-8867-f9961e644305@email.android.com> <20130212171216.GA17663@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212171216.GA17663@dhcp22.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 12, 2013 at 06:12:16PM +0100, Michal Hocko wrote: > On Tue 12-02-13 11:41:03, Johannes Weiner wrote: > > > > > > Michal Hocko wrote: > > > > >On Tue 12-02-13 17:13:32, Michal Hocko wrote: > > >> On Tue 12-02-13 16:43:30, Michal Hocko wrote: > > >> [...] > > >> The example was not complete: > > >> > > >> > Wait a moment. But what prevents from the following race? 
> > >> > > > >> > rcu_read_lock() > > >> > > >> cgroup_next_descendant_pre > > >> css_tryget(css); > > >> memcg = mem_cgroup_from_css(css) atomic_add(CSS_DEACT_BIAS, > > >&css->refcnt) > > >> > > >> > mem_cgroup_css_offline(memcg) > > >> > > >> We should be safe if we did synchronize_rcu() before > > >root->dead_count++, > > >> no? > > >> Because then we would have a guarantee that if css_tryget(memcg) > > >> suceeded then we wouldn't race with dead_count++ it triggered. > > >> > > >> > root->dead_count++ > > >> > iter->last_dead_count = root->dead_count > > >> > iter->last_visited = memcg > > >> > // final > > >> > css_put(memcg); > > >> > // last_visited is still valid > > >> > rcu_read_unlock() > > >> > [...] > > >> > // next iteration > > >> > rcu_read_lock() > > >> > iter->last_dead_count == root->dead_count > > >> > // KABOOM > > > > > >Ohh I have missed that we took a reference on the current memcg which > > >will be stored into last_visited. And then later, during the next > > >iteration it will be still alive until we are done because previous > > >patch moved css_put to the very end. > > >So this race is not possible. I still need to think about parallel > > >iteration and a race with removal. > > > > I thought the whole point was to not have a reference in last_visited > > because have the iterator might be unused indefinitely :-) > > OK, it seems that I managed to confuse ;) > > > We only store a pointer and validate it before use the next time > > around. So I think the race is still possible, but we can deal with > > it by not losing concurrent dead count changes, i.e. one atomic read > > in the iterator function. > > All reads from root->dead_count are atomic already, so I am not sure > what you mean here. Anyway, I hope I won't make this even more confusing > if I post what I have right now: Yes, but we are doing two reads. Can't the memcg that we'll store in last_visited be offlined during this and be freed after we drop the rcu read lock? 
If we had just one read, we would detect this properly. > --- > >From 52121928be61282dc19e32179056615ffdf128a9 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Tue, 12 Feb 2013 18:08:26 +0100 > Subject: [PATCH] memcg: relax memcg iter caching > > Now that per-node-zone-priority iterator caches memory cgroups rather > than their css ids we have to be careful and remove them from the > iterator when they are on the way out otherwise they might hang for > unbounded amount of time (until the global/targeted reclaim triggers the > zone under priority to find out the group is dead and let it to find the > final rest). > > We can fix this issue by relaxing rules for the last_visited memcg as > well. > Instead of taking reference to css before it is stored into > iter->last_visited we can just store its pointer and track the number of > removed groups for each memcg. This number would be stored into iterator > everytime when a memcg is cached. If the iter count doesn't match the > curent walker root's one we will start over from the root again. The > group counter is incremented upwards the hierarchy every time a group is > removed. > > Locking rules got a bit complicated. We primarily rely on rcu read > lock which makes sure that once we see an up-to-date dead_count then > iter->last_visited is valid for RCU walk. smp_rmb makes sure that > dead_count is read before last_visited and last_dead_count while smp_wmb > makes sure that last_visited is updated before last_dead_count so the > up-to-date last_dead_count cannot point to an outdated last_visited. > css_tryget then makes sure that the last_visited is still alive. 
> > Spotted-by: Ying Han > Original-idea-by: Johannes Weiner > Signed-off-by: Michal Hocko > --- > mm/memcontrol.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++-------- > 1 file changed, 60 insertions(+), 9 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 727ec39..31bb9b0 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu { > }; > > struct mem_cgroup_reclaim_iter { > - /* last scanned hierarchy member with elevated css ref count */ > + /* > + * last scanned hierarchy member. Valid only if last_dead_count > + * matches memcg->dead_count of the hierarchy root group. > + */ > struct mem_cgroup *last_visited; > + unsigned int last_dead_count; Since we read and write this without a lock, I would feel more comfortable if this were a full word, i.e. unsigned long. That guarantees we don't see any partial states. > @@ -1156,17 +1162,36 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > int nid = zone_to_nid(reclaim->zone); > int zid = zone_idx(reclaim->zone); > struct mem_cgroup_per_zone *mz; > + unsigned int dead_count; > > mz = mem_cgroup_zoneinfo(root, nid, zid); > iter = &mz->reclaim_iter[reclaim->priority]; > - last_visited = iter->last_visited; > if (prev && reclaim->generation != iter->generation) { > - if (last_visited) { > - css_put(&last_visited->css); > - iter->last_visited = NULL; > - } > + iter->last_visited = NULL; > goto out_unlock; > } > + > + /* > + * If the dead_count mismatches, a destruction > + * has happened or is happening concurrently. > + * If the dead_count matches, a destruction > + * might still happen concurrently, but since > + * we checked under RCU, that destruction > + * won't free the object until we release the > + * RCU reader lock. Thus, the dead_count > + * check verifies the pointer is still valid, > + * css_tryget() verifies the cgroup pointed to > + * is alive. 
> + */ > + dead_count = atomic_read(&root->dead_count); > + smp_rmb(); > + last_visited = iter->last_visited; > + if (last_visited) { > + if ((dead_count != iter->last_dead_count) || > + !css_tryget(&last_visited->css)) { > + last_visited = NULL; > + } > + } > } > > /* > @@ -1206,10 +1231,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > if (css && !memcg) > curr = mem_cgroup_from_css(css); > > - /* make sure that the cached memcg is not removed */ > - if (curr) > - css_get(&curr->css); > iter->last_visited = curr; > + smp_wmb(); > + iter->last_dead_count = atomic_read(&root->dead_count); iter->last_dead_count = dead_count This way, we detect if curr is offlined between the first reading and the second reading. Otherwise, it could get freed when the reference is dropped and then last_visited points to invalid memory while the dead_count is uptodate. > @@ -6366,10 +6390,37 @@ free_out: > return ERR_PTR(error); > } > > +/* > + * Announce all parents that a group from their hierarchy is gone. > + */ > +static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *parent = memcg; > + > + /* > + * Make sure we are not racing with mem_cgroup_iter when it stores > + * a new iter->last_visited. Wait until that RCU finishes so that > + * it cannot see already incremented dead_count with memcg which > + * would be already dead next time but dead_count wouldn't tell > + * us about that. > + */ > + synchronize_rcu(); Ah, you are stabilizing the counter between the two reads. It's cheaper to just do one read instead. 
Saves the atomic op and saves the synchronization point :-) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933611Ab3BLR4w (ORCPT ); Tue, 12 Feb 2013 12:56:52 -0500 Received: from e9.ny.us.ibm.com ([32.97.182.139]:38216 "EHLO e9.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932342Ab3BLR4v (ORCPT ); Tue, 12 Feb 2013 12:56:51 -0500 Date: Tue, 12 Feb 2013 08:10:51 -0800 From: "Paul E. McKenney" To: Michal Hocko Cc: Johannes Weiner , linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212161051.GQ2666@linux.vnet.ibm.com> Reply-To: paulmck@linux.vnet.ibm.com References: <20130208193318.GA15951@cmpxchg.org> <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212154330.GG4863@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13021217-7182-0000-0000-000005279644 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 12, 2013 at 04:43:30PM +0100, Michal Hocko wrote: > On Tue 12-02-13 10:10:02, Johannes Weiner wrote: > > On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote: > > > On Mon 11-02-13 17:39:43, Johannes Weiner wrote: > > > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote: > > > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote: > > > > > > That way, if the dead count gives the 
go-ahead, you KNOW that the > > > > > > position cache is valid, because it has been updated first. > > > > > > > > > > OK, you are right. We can live without css_tryget because dead_count is > > > > > either OK which means that css would be alive at least this rcu period > > > > > (and RCU walk would be safe as well) or it is incremented which means > > > > > that we have started css_offline already and then css is dead already. > > > > > So css_tryget can be dropped. > > > > > > > > Not quite :) > > > > > > > > The dead_count check is for completed destructions, > > > > > > Not quite :P. dead_count is incremented in css_offline callback which is > > > called before the cgroup core releases its last reference and unlinks > > > the group from the siblinks. css_tryget would already fail at this stage > > > because CSS_DEACT_BIAS is in place at that time but this doesn't break > > > RCU walk. So I think we are safe even without css_get. > > > > But you drop the RCU lock before you return. > > > > dead_count IS incremented for every destruction, but it's not reliable > > for concurrent ones, is what I meant. Again, if there is a dead_count > > mismatch, your pointer might be dangling, easy case. However, even if > > there is no mismatch, you could still race with a destruction that has > > marked the object dead, and then frees it once you drop the RCU lock, > > so you need try_get() to check if the object is dead, or you could > > return a pointer to freed or soon to be freed memory. > > Wait a moment. But what prevents from the following race? > > rcu_read_lock() > mem_cgroup_css_offline(memcg) > root->dead_count++ > iter->last_dead_count = root->dead_count > iter->last_visited = memcg > // final > css_put(memcg); > // last_visited is still valid > rcu_read_unlock() > [...] 
> // next iteration > rcu_read_lock() > iter->last_dead_count == root->dead_count > // KABOOM > > The race window between dead_count++ and css_put is quite big but that > is not important because that css_put can happen anytime before we start > the next iteration and take rcu_read_lock. The usual approach is to make sure that there is a grace period (either synchronize_rcu() or call_rcu()) between the time that the data is made inaccessible to readers (this would be mem_cgroup_css_offline()?) and the time it is freed (css_put(), correct?). Thanx, Paul From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933626Ab3BLR5A (ORCPT ); Tue, 12 Feb 2013 12:57:00 -0500 Received: from cantor2.suse.de ([195.135.220.15]:35379 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933614Ab3BLR45 (ORCPT ); Tue, 12 Feb 2013 12:56:57 -0500 Date: Tue, 12 Feb 2013 18:56:51 +0100 From: Michal Hocko To: "Paul E. McKenney" Cc: Johannes Weiner , linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212175651.GC17663@dhcp22.suse.cz> References: <20130211151649.GD19922@dhcp22.suse.cz> <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161051.GQ2666@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212161051.GQ2666@linux.vnet.ibm.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 12-02-13 08:10:51, Paul E. 
McKenney wrote: > On Tue, Feb 12, 2013 at 04:43:30PM +0100, Michal Hocko wrote: > > On Tue 12-02-13 10:10:02, Johannes Weiner wrote: > > > On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote: > > > > On Mon 11-02-13 17:39:43, Johannes Weiner wrote: > > > > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote: > > > > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote: > > > > > > > That way, if the dead count gives the go-ahead, you KNOW that the > > > > > > > position cache is valid, because it has been updated first. > > > > > > > > > > > > OK, you are right. We can live without css_tryget because dead_count is > > > > > > either OK which means that css would be alive at least this rcu period > > > > > > (and RCU walk would be safe as well) or it is incremented which means > > > > > > that we have started css_offline already and then css is dead already. > > > > > > So css_tryget can be dropped. > > > > > > > > > > Not quite :) > > > > > > > > > > The dead_count check is for completed destructions, > > > > > > > > Not quite :P. dead_count is incremented in css_offline callback which is > > > > called before the cgroup core releases its last reference and unlinks > > > > the group from the siblinks. css_tryget would already fail at this stage > > > > because CSS_DEACT_BIAS is in place at that time but this doesn't break > > > > RCU walk. So I think we are safe even without css_get. > > > > > > But you drop the RCU lock before you return. > > > > > > dead_count IS incremented for every destruction, but it's not reliable > > > for concurrent ones, is what I meant. Again, if there is a dead_count > > > mismatch, your pointer might be dangling, easy case. 
However, even if > > > there is no mismatch, you could still race with a destruction that has > > > marked the object dead, and then frees it once you drop the RCU lock, > > > so you need try_get() to check if the object is dead, or you could > > > return a pointer to freed or soon to be freed memory. > > > > Wait a moment. But what prevents from the following race? > > > > rcu_read_lock() > > mem_cgroup_css_offline(memcg) > > root->dead_count++ > > iter->last_dead_count = root->dead_count > > iter->last_visited = memcg > > // final > > css_put(memcg); > > // last_visited is still valid > > rcu_read_unlock() > > [...] > > // next iteration > > rcu_read_lock() > > iter->last_dead_count == root->dead_count > > // KABOOM > > > > The race window between dead_count++ and css_put is quite big but that > > is not important because that css_put can happen anytime before we start > > the next iteration and take rcu_read_lock. > > The usual approach is to make sure that there is a grace period (either > synchronize_rcu() or call_rcu()) between the time that the data is > made inaccessible to readers (this would be mem_cgroup_css_offline()?) > and the time it is freed (css_put(), correct?). Yes, that was my suggestion and I put it before dead_count is incremented down the mem_cgroup_css_offline road. Johannes still thinks we can do without it if we reduce the number of atomic_read(dead_count) which sounds like a way to go but I will rather think about it with a fresh head tomorrow. Anyway, thanks for jumping in. Earth is always a bit shaky when all the barriers and rcu mix together. 
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933538Ab3BLTl1 (ORCPT ); Tue, 12 Feb 2013 14:41:27 -0500 Received: from e9.ny.us.ibm.com ([32.97.182.139]:49272 "EHLO e9.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933404Ab3BLTl0 (ORCPT ); Tue, 12 Feb 2013 14:41:26 -0500 Date: Tue, 12 Feb 2013 10:31:48 -0800 From: "Paul E. McKenney" To: Johannes Weiner Cc: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212183148.GW2666@linux.vnet.ibm.com> Reply-To: paulmck@linux.vnet.ibm.com References: <20130211175619.GC13218@cmpxchg.org> <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161051.GQ2666@linux.vnet.ibm.com> <20130212172526.GC25235@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212172526.GC25235@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13021219-7182-0000-0000-000005280904 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 12, 2013 at 12:25:26PM -0500, Johannes Weiner wrote: > On Tue, Feb 12, 2013 at 08:10:51AM -0800, Paul E. 
McKenney wrote: > > On Tue, Feb 12, 2013 at 04:43:30PM +0100, Michal Hocko wrote: > > > On Tue 12-02-13 10:10:02, Johannes Weiner wrote: > > > > On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote: > > > > > On Mon 11-02-13 17:39:43, Johannes Weiner wrote: > > > > > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote: > > > > > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote: > > > > > > > > That way, if the dead count gives the go-ahead, you KNOW that the > > > > > > > > position cache is valid, because it has been updated first. > > > > > > > > > > > > > > OK, you are right. We can live without css_tryget because dead_count is > > > > > > > either OK which means that css would be alive at least this rcu period > > > > > > > (and RCU walk would be safe as well) or it is incremented which means > > > > > > > that we have started css_offline already and then css is dead already. > > > > > > > So css_tryget can be dropped. > > > > > > > > > > > > Not quite :) > > > > > > > > > > > > The dead_count check is for completed destructions, > > > > > > > > > > Not quite :P. dead_count is incremented in css_offline callback which is > > > > > called before the cgroup core releases its last reference and unlinks > > > > > the group from the siblinks. css_tryget would already fail at this stage > > > > > because CSS_DEACT_BIAS is in place at that time but this doesn't break > > > > > RCU walk. So I think we are safe even without css_get. > > > > > > > > But you drop the RCU lock before you return. > > > > > > > > dead_count IS incremented for every destruction, but it's not reliable > > > > for concurrent ones, is what I meant. Again, if there is a dead_count > > > > mismatch, your pointer might be dangling, easy case. 
However, even if > > > > there is no mismatch, you could still race with a destruction that has > > > > marked the object dead, and then frees it once you drop the RCU lock, > > > > so you need try_get() to check if the object is dead, or you could > > > > return a pointer to freed or soon to be freed memory. > > > > > > Wait a moment. But what prevents from the following race? > > > > > > rcu_read_lock() > > > mem_cgroup_css_offline(memcg) > > > root->dead_count++ > > > iter->last_dead_count = root->dead_count > > > iter->last_visited = memcg > > > // final > > > css_put(memcg); > > > // last_visited is still valid > > > rcu_read_unlock() > > > [...] > > > // next iteration > > > rcu_read_lock() > > > iter->last_dead_count == root->dead_count > > > // KABOOM > > > > > > The race window between dead_count++ and css_put is quite big but that > > > is not important because that css_put can happen anytime before we start > > > the next iteration and take rcu_read_lock. > > > > The usual approach is to make sure that there is a grace period (either > > synchronize_rcu() or call_rcu()) between the time that the data is > > made inaccessible to readers (this would be mem_cgroup_css_offline()?) > > and the time it is freed (css_put(), correct?). > > Absolutely! And there is a synchronize_rcu() in between those two > operations. > > However, we want to keep a weak reference to the cgroup after we drop > the rcu read-side lock, so rcu alone is not enough for us to guarantee > object life time. We still have to carefully detect any concurrent > offlinings in order to validate the weak reference next time around. That would make things more interesting. ;-) Exactly who or what holds the weak reference? And the idea is that if you attempt to use the weak reference beforehand, the css_put() does not actually free it, but if you attempt to use it afterwards, you get some sort of failure indication? 
Thanx, Paul From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933654Ab3BLTyZ (ORCPT ); Tue, 12 Feb 2013 14:54:25 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:39704 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933153Ab3BLTyU (ORCPT ); Tue, 12 Feb 2013 14:54:20 -0500 Date: Tue, 12 Feb 2013 14:53:58 -0500 From: Johannes Weiner To: "Paul E. McKenney" Cc: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130212195358.GE25235@cmpxchg.org> References: <20130211192929.GB29000@dhcp22.suse.cz> <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161051.GQ2666@linux.vnet.ibm.com> <20130212172526.GC25235@cmpxchg.org> <20130212183148.GW2666@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212183148.GW2666@linux.vnet.ibm.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 12, 2013 at 10:31:48AM -0800, Paul E. McKenney wrote: > On Tue, Feb 12, 2013 at 12:25:26PM -0500, Johannes Weiner wrote: > > On Tue, Feb 12, 2013 at 08:10:51AM -0800, Paul E. 
McKenney wrote: > > > On Tue, Feb 12, 2013 at 04:43:30PM +0100, Michal Hocko wrote: > > > > On Tue 12-02-13 10:10:02, Johannes Weiner wrote: > > > > > On Tue, Feb 12, 2013 at 10:54:19AM +0100, Michal Hocko wrote: > > > > > > On Mon 11-02-13 17:39:43, Johannes Weiner wrote: > > > > > > > On Mon, Feb 11, 2013 at 10:27:56PM +0100, Michal Hocko wrote: > > > > > > > > On Mon 11-02-13 14:58:24, Johannes Weiner wrote: > > > > > > > > > That way, if the dead count gives the go-ahead, you KNOW that the > > > > > > > > > position cache is valid, because it has been updated first. > > > > > > > > > > > > > > > > OK, you are right. We can live without css_tryget because dead_count is > > > > > > > > either OK which means that css would be alive at least this rcu period > > > > > > > > (and RCU walk would be safe as well) or it is incremented which means > > > > > > > > that we have started css_offline already and then css is dead already. > > > > > > > > So css_tryget can be dropped. > > > > > > > > > > > > > > Not quite :) > > > > > > > > > > > > > > The dead_count check is for completed destructions, > > > > > > > > > > > > Not quite :P. dead_count is incremented in css_offline callback which is > > > > > > called before the cgroup core releases its last reference and unlinks > > > > > > the group from the siblinks. css_tryget would already fail at this stage > > > > > > because CSS_DEACT_BIAS is in place at that time but this doesn't break > > > > > > RCU walk. So I think we are safe even without css_get. > > > > > > > > > > But you drop the RCU lock before you return. > > > > > > > > > > dead_count IS incremented for every destruction, but it's not reliable > > > > > for concurrent ones, is what I meant. Again, if there is a dead_count > > > > > mismatch, your pointer might be dangling, easy case. 
However, even if > > > > > there is no mismatch, you could still race with a destruction that has > > > > > marked the object dead, and then frees it once you drop the RCU lock, > > > > > so you need try_get() to check if the object is dead, or you could > > > > > return a pointer to freed or soon to be freed memory. > > > > > > > > Wait a moment. But what prevents from the following race? > > > > > > > > rcu_read_lock() > > > > mem_cgroup_css_offline(memcg) > > > > root->dead_count++ > > > > iter->last_dead_count = root->dead_count > > > > iter->last_visited = memcg > > > > // final > > > > css_put(memcg); > > > > // last_visited is still valid > > > > rcu_read_unlock() > > > > [...] > > > > // next iteration > > > > rcu_read_lock() > > > > iter->last_dead_count == root->dead_count > > > > // KABOOM > > > > > > > > The race window between dead_count++ and css_put is quite big but that > > > > is not important because that css_put can happen anytime before we start > > > > the next iteration and take rcu_read_lock. > > > > > > The usual approach is to make sure that there is a grace period (either > > > synchronize_rcu() or call_rcu()) between the time that the data is > > > made inaccessible to readers (this would be mem_cgroup_css_offline()?) > > > and the time it is freed (css_put(), correct?). > > > > Absolutely! And there is a synchronize_rcu() in between those two > > operations. > > > > However, we want to keep a weak reference to the cgroup after we drop > > the rcu read-side lock, so rcu alone is not enough for us to guarantee > > object life time. We still have to carefully detect any concurrent > > offlinings in order to validate the weak reference next time around. > > That would make things more interesting. ;-) > > Exactly who or what holds the weak reference? 
And the idea is that if > you attempt to use the weak reference beforehand, the css_put() does not > actually free it, but if you attempt to use it afterwards, you get some > sort of failure indication?

Yes, exactly. We are using a seqlock-style cookie comparison to see if any object in the pool of objects that we may point to was destroyed. We are having trouble agreeing on how to safely read the counter :-)

Long version:

It's an iterator over a hierarchy of cgroups, but page reclaim may stop iteration at will and might not come back for an indefinite amount of time (until memory pressure triggers reclaim again). So we want to allow cgroups to be destroyed while one of the iterators may still be pointing at it (we have iterators per-node, per-zone, per reclaim priority level, that's why it's not feasible to invalidate them proactively upon cgroup destruction).

The idea is that we have a counter that counts cgroup destructions in each cgroup hierarchy and we remember a snapshot of that counter at the time we remember the iterator position. If any group in that group's hierarchy gets killed before we come back to the iterator, the counter mismatches. Easy. If any group is getting killed concurrently, the counter might match our cookie, but the object could be marked dead already, while rcu prevents it from being freed.

The remaining worry is/was that we have two reads of the destruction counter: one when validating the weak reference, another one when updating the iterator. If a destruction starts in between those two, and modifies the counter, we would miss that destruction and the object that is now weakly referenced could get freed while the corresponding snapshot matches the latest value of the destruction counter.

Michal's idea was to hold off the destruction counter inc between those reads with synchronize_rcu(). My idea was to simply read the counter only once and use that same value both to check and to update the iterator.
That should catch this type of race condition and save the atomic & the extra synchronize_rcu(). At least I fail to see the downside of reading it only once:

iteration:
rcu_read_lock()
dead_count = atomic_read(&hierarchy->dead_count)
smp_rmb()
previous = iterator->position
if (iterator->dead_count != dead_count)
   /* A cgroup in our hierarchy was killed, pointer might be dangling */
   don't use iterator
if (!tryget(&previous))
   /* The cgroup is marked dead, don't use it */
   don't use iterator
next = find_next_and_tryget(hierarchy, &previous)
/* what happens if destruction of next starts NOW? */
css_put(previous)
iterator->position = next
smp_wmb()
iterator->dead_count = dead_count /* my suggestion, instead of a second atomic_read() */
rcu_read_unlock()
return next /* caller drops ref eventually, iterator->cgroup becomes weak */

destruction:
bias(cgroup->refcount) /* disables future tryget */
//synchronize_rcu() /* Michal's suggestion */
atomic_inc(&cgroup->hierarchy->dead_count)
synchronize_rcu()
free(cgroup)

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758778Ab3BMILq (ORCPT ); Wed, 13 Feb 2013 03:11:46 -0500 Received: from mx0.parallels.com ([199.115.104.20]:55385 "EHLO mx0.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751981Ab3BMILp (ORCPT ); Wed, 13 Feb 2013 03:11:45 -0500 Message-ID: <511B4ACF.90209@parallels.com> Date: Wed, 13 Feb 2013 12:11:59 +0400 From: Glauber Costa User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2 MIME-Version: 1.0 To: Johannes Weiner CC: Michal Hocko , , , KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators References: <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org>
<20130212154330.GG4863@dhcp22.suse.cz> <20130212161332.GI4863@dhcp22.suse.cz> <20130212162442.GJ4863@dhcp22.suse.cz> <63d3b5fa-dbc6-4bc9-8867-f9961e644305@email.android.com> <20130212171216.GA17663@dhcp22.suse.cz> <20130212173741.GD25235@cmpxchg.org> In-Reply-To: <20130212173741.GD25235@cmpxchg.org> Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On 02/12/2013 09:37 PM, Johannes Weiner wrote:
>> > All reads from root->dead_count are atomic already, so I am not sure
>> > what you mean here. Anyway, I hope I won't make this even more confusing
>> > if I post what I have right now:
> Yes, but we are doing two reads. Can't the memcg that we'll store in
> last_visited be offlined during this and be freed after we drop the
> rcu read lock? If we had just one read, we would detect this
> properly.
>

I don't want to add any more confusion to an already fun discussion, but IIUC, you are trying to avoid triggering a second round of reclaim in an already dead memcg, right?

Can't you generalize the mechanism I use for kmemcg, where a very similar problem exists? This is what it looks like:

	/* this atomically sets a bit in the memcg. It does so
	 * unconditionally, and it is (so far) okay if it is set
	 * twice */
	memcg_kmem_mark_dead(memcg);

	/*
	 * Then if kmem charges are not zero, we don't actually destroy the
	 * memcg. The function where it lives will always be called when usage
	 * reaches 0, so we guarantee that we will never miss the chance to
	 * call the destruction function at least once.
	 *
	 * I suspect you could use a mechanism like this, or extend
	 * this very same, to prevent the second reclaim from even being called
	 */
	if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
		return;

	/*
	 * this is how we guarantee that the destruction function is called at
	 * most once. The second caller would see the bit unset.
	 */
	if (memcg_kmem_test_and_clear_dead(memcg))
		mem_cgroup_put(memcg);

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932757Ab3BMJvU (ORCPT ); Wed, 13 Feb 2013 04:51:20 -0500 Received: from cantor2.suse.de ([195.135.220.15]:34131 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932451Ab3BMJvQ (ORCPT ); Wed, 13 Feb 2013 04:51:16 -0500 Date: Wed, 13 Feb 2013 10:51:13 +0100 From: Michal Hocko To: Johannes Weiner Cc: "Paul E. McKenney" , linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130213095113.GA23562@dhcp22.suse.cz> References: <20130211195824.GB15951@cmpxchg.org> <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161051.GQ2666@linux.vnet.ibm.com> <20130212172526.GC25235@cmpxchg.org> <20130212183148.GW2666@linux.vnet.ibm.com> <20130212195358.GE25235@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212195358.GE25235@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 12-02-13 14:53:58, Johannes Weiner wrote:
[...]
> iteration:
> rcu_read_lock()
> dead_count = atomic_read(&hierarchy->dead_count)
> smp_rmb()
> previous = iterator->position
> if (iterator->dead_count != dead_count)
>    /* A cgroup in our hierarchy was killed, pointer might be dangling */
>    don't use iterator
> if (!tryget(&previous))
>    /* The cgroup is marked dead, don't use it */
>    don't use iterator
> next = find_next_and_tryget(hierarchy, &previous)
> /* what happens if destruction of next starts NOW?
*/

OK, I thought that this depends on the ordering of CSS_DEACT_BIAS and dead_count writes - because there is no memory ordering enforced between those two. But it shouldn't matter because we are checking both. If the increment is seen sooner then we do not care about css_tryget and if css is deactivated before dead_count++ then the css_tryget would shout.

More interesting ordering, however, is dead_count++ vs. css_put from cgroup core. Say we have the following:

CPU0                            CPU1                        CPU2

iter->position = A;
iter->dead_count = dead_count;
rcu_read_unlock()
return A

mem_cgroup_iter_break
  css_put(A)                                                bias(A)
                                                            css_offline()
                                                            css_put(A) // in cgroup_destroy_locked
                                                                       // last ref and A will be freed
                                rcu_read_lock()
                                read parent->dead_count
                                                            parent->dead_count++ // got reordered from css_offline
                                css_tryget(A) // kaboom

The reordering window is really huge and I think it is impossible to trigger in real life. And mem_cgroup_reparent_charges calls mem_cgroup_start_move unconditionally which in turn calls synchronize_rcu() which is a full barrier AFAIU so dead_count++ cannot be reordered ATM. But should we rely on that? Shouldn't we add smp_wmb after dead_count++ as I had in an earlier version of the patch?

> css_put(previous)
> iterator->position = next
> smp_wmb()
> iterator->dead_count = dead_count /* my suggestion, instead of a second atomic_read() */
> rcu_read_unlock()
> return next /* caller drops ref eventually, iterator->cgroup becomes weak */
>
> destruction:
> bias(cgroup->refcount) /* disables future tryget */
> //synchronize_rcu() /* Michal's suggestion */
> atomic_inc(&cgroup->hierarchy->dead_count)
> synchronize_rcu()
> free(cgroup)

Other than that this should work. I will update the patch accordingly.

Thanks!
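
For reference, the single-read scheme quoted above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: the names (`struct group`, `struct reclaim_iter`, `iter_load`, `iter_store`, `DEAD_BIAS`) are hypothetical and are not the memcg code, C11 fences stand in for smp_rmb()/smp_wmb(), and neither RCU nor the actual freeing of the object is modelled. It only shows that one read of the hierarchy counter serves both the validation and the later update.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for CSS_DEACT_BIAS: makes the refcount negative. */
#define DEAD_BIAS (-(1L << 30))

struct group {
	atomic_long refcnt;		/* > 0 while the group is alive */
};

struct hierarchy {
	atomic_ulong dead_count;	/* bumped once per destruction */
};

struct reclaim_iter {
	struct group *_Atomic position;	/* weak reference, may be stale */
	atomic_ulong dead_count;	/* cookie taken when position was stored */
};

/* Take a reference unless the group has already been marked dead. */
static bool group_tryget(struct group *g)
{
	long v = atomic_load(&g->refcnt);

	while (v > 0) {
		if (atomic_compare_exchange_weak(&g->refcnt, &v, v + 1))
			return true;
	}
	return false;
}

static void group_put(struct group *g)
{
	atomic_fetch_sub(&g->refcnt, 1);
}

/* Destruction side: disable tryget first, then publish the new count. */
static void group_kill(struct hierarchy *h, struct group *g)
{
	atomic_fetch_add(&g->refcnt, DEAD_BIAS);
	atomic_fetch_add(&h->dead_count, 1);
}

/*
 * Read the hierarchy counter exactly once, hand it back in *cookie, and
 * return the cached position only if the cookie matches and tryget works.
 */
static struct group *iter_load(struct reclaim_iter *it, struct hierarchy *h,
			       unsigned long *cookie)
{
	struct group *pos;

	*cookie = atomic_load_explicit(&h->dead_count, memory_order_relaxed);
	if (atomic_load_explicit(&it->dead_count, memory_order_relaxed) != *cookie)
		return NULL;
	atomic_thread_fence(memory_order_acquire);	/* models smp_rmb() */
	pos = atomic_load_explicit(&it->position, memory_order_relaxed);
	if (pos && !group_tryget(pos))
		return NULL;
	return pos;
}

/* Store the position first, then the cookie obtained by iter_load(). */
static void iter_store(struct reclaim_iter *it, struct group *next,
		       unsigned long cookie)
{
	atomic_store_explicit(&it->position, next, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* models smp_wmb() */
	atomic_store_explicit(&it->dead_count, cookie, memory_order_relaxed);
}

/* Single-threaded walk through the cases discussed in the thread. */
static void iter_demo(void)
{
	struct hierarchy h;
	struct group a;
	struct reclaim_iter it;
	unsigned long cookie;
	struct group *p;

	atomic_init(&h.dead_count, 0);
	atomic_init(&a.refcnt, 1);
	atomic_init(&it.position, NULL);
	atomic_init(&it.dead_count, 0);

	p = iter_load(&it, &h, &cookie);	/* nothing cached yet */
	assert(p == NULL && cookie == 0);
	iter_store(&it, &a, cookie);
	p = iter_load(&it, &h, &cookie);	/* cookie matches, tryget works */
	assert(p == &a);
	group_put(p);
	group_kill(&h, &a);			/* counter moves on */
	assert(iter_load(&it, &h, &cookie) == NULL);
	iter_store(&it, &a, cookie);		/* cookie matches a dead group */
	assert(iter_load(&it, &h, &cookie) == NULL);
}
```

The last case in iter_demo() is the reason the tryget stays: the cookie can match while the group is already marked dead. Note that the sketch checks the iterator cookie before the fence and reads the position after it, which is one possible pairing with the store side and not necessarily the ordering the final patch will use.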
-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759027Ab3BMKfE (ORCPT ); Wed, 13 Feb 2013 05:35:04 -0500 Received: from cantor2.suse.de ([195.135.220.15]:36127 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752833Ab3BMKfC (ORCPT ); Wed, 13 Feb 2013 05:35:02 -0500 Date: Wed, 13 Feb 2013 11:34:59 +0100 From: Michal Hocko To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130213103459.GB23562@dhcp22.suse.cz> References: <20130211212756.GC29000@dhcp22.suse.cz> <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161332.GI4863@dhcp22.suse.cz> <20130212162442.GJ4863@dhcp22.suse.cz> <63d3b5fa-dbc6-4bc9-8867-f9961e644305@email.android.com> <20130212171216.GA17663@dhcp22.suse.cz> <20130212173741.GD25235@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130212173741.GD25235@cmpxchg.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 12-02-13 12:37:41, Johannes Weiner wrote:
> On Tue, Feb 12, 2013 at 06:12:16PM +0100, Michal Hocko wrote:
[...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 727ec39..31bb9b0 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu {
> >  };
> >
> >  struct mem_cgroup_reclaim_iter {
> > -	/* last scanned hierarchy member with elevated css ref count */
> > +	/*
> > +	 * last scanned hierarchy member. Valid only if last_dead_count
> > +	 * matches memcg->dead_count of the hierarchy root group.
> > +	 */
> >  	struct mem_cgroup *last_visited;
> > +	unsigned int last_dead_count;
>
> Since we read and write this without a lock, I would feel more
> comfortable if this were a full word, i.e. unsigned long. That
> guarantees we don't see any partial states.

OK. Changed. Although I thought that int is read/modified atomically as well if it is aligned to its size.
-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759054Ab3BMKiO (ORCPT ); Wed, 13 Feb 2013 05:38:14 -0500 Received: from cantor2.suse.de ([195.135.220.15]:36219 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753365Ab3BMKiN (ORCPT ); Wed, 13 Feb 2013 05:38:13 -0500 Date: Wed, 13 Feb 2013 11:38:11 +0100 From: Michal Hocko To: Glauber Costa Cc: Johannes Weiner , linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130213103811.GC23562@dhcp22.suse.cz> References: <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161332.GI4863@dhcp22.suse.cz> <20130212162442.GJ4863@dhcp22.suse.cz> <63d3b5fa-dbc6-4bc9-8867-f9961e644305@email.android.com> <20130212171216.GA17663@dhcp22.suse.cz> <20130212173741.GD25235@cmpxchg.org> <511B4ACF.90209@parallels.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <511B4ACF.90209@parallels.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Wed 13-02-13 12:11:59, Glauber Costa wrote:
> On 02/12/2013 09:37 PM, Johannes Weiner wrote:
> >> > All reads from root->dead_count are atomic already, so I am not sure
> >> > what you mean here.
Anyway, I hope I won't make this even more confusing
> >> > if I post what I have right now:
> > Yes, but we are doing two reads. Can't the memcg that we'll store in
> > last_visited be offlined during this and be freed after we drop the
> > rcu read lock? If we had just one read, we would detect this
> > properly.
> >
>
> I don't want to add any more confusion to an already fun discussion, but
> IIUC, you are trying to avoid triggering a second round of reclaim in an
> already dead memcg, right?

No, this is not about the second round of the reclaim but rather iteration racing with removal. And we want to make it as lightweight as possible. We cannot work with memcg directly because it might have disappeared in the meantime and we do not want to hold a reference on it because there would be no guarantee somebody will release it later on. So mark_dead && test_and_clear_dead would not work in this context.

[...]
-- 
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934001Ab3BMM4Y (ORCPT ); Wed, 13 Feb 2013 07:56:24 -0500 Received: from cantor2.suse.de ([195.135.220.15]:41539 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933921Ab3BMM4W (ORCPT ); Wed, 13 Feb 2013 07:56:22 -0500 Date: Wed, 13 Feb 2013 13:56:17 +0100 From: Michal Hocko To: Johannes Weiner Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Ying Han , Tejun Heo , Glauber Costa , Li Zefan Subject: Re: [PATCH v3 4/7] memcg: remove memcg from the reclaim iterators Message-ID: <20130213125617.GD23562@dhcp22.suse.cz> References: <20130211223943.GC15951@cmpxchg.org> <20130212095419.GB4863@dhcp22.suse.cz> <20130212151002.GD15951@cmpxchg.org> <20130212154330.GG4863@dhcp22.suse.cz> <20130212161332.GI4863@dhcp22.suse.cz> <20130212162442.GJ4863@dhcp22.suse.cz> <63d3b5fa-dbc6-4bc9-8867-f9961e644305@email.android.com> <20130212171216.GA17663@dhcp22.suse.cz> <20130212173741.GD25235@cmpxchg.org> <20130213103459.GB23562@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130213103459.GB23562@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Wed 13-02-13 11:34:59, Michal Hocko wrote:
> On Tue 12-02-13 12:37:41, Johannes Weiner wrote:
> > On Tue, Feb 12, 2013 at 06:12:16PM +0100, Michal Hocko wrote:
> [...]
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 727ec39..31bb9b0 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -144,8 +144,13 @@ struct mem_cgroup_stat_cpu {
> > >  };
> > >
> > >  struct mem_cgroup_reclaim_iter {
> > > -	/* last scanned hierarchy member with elevated css ref count */
> > > +	/*
> > > +	 * last scanned hierarchy member. Valid only if last_dead_count
> > > +	 * matches memcg->dead_count of the hierarchy root group.
> > > +	 */
> > >  	struct mem_cgroup *last_visited;
> > > +	unsigned int last_dead_count;
> >
> > Since we read and write this without a lock, I would feel more
> > comfortable if this were a full word, i.e. unsigned long. That
> > guarantees we don't see any partial states.
>
> OK. Changed. Although I thought that int is read/modified atomically as
> well if it is aligned to its size.

Ohh, I think I see what your concern was. If last_dead_count were an int then it would fit into the same full word slot with generation and so the parallel read-modify-update cycle could be an issue.
-- 
Michal Hocko
SUSE Labs
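
The layout point in the last message can be checked with a small standalone program. This is a sketch under assumptions, not the kernel structure: the two structs below are hypothetical copies that place `generation` directly after `last_dead_count` and assume it is an unsigned int, and `same_word()` is a helper invented here. It only reports whether the two fields land in the same long-sized slot. Whether such sharing is actually a problem on a given architecture is a separate question that it does not answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct mem_cgroup;	/* opaque here, only used through a pointer */

/* Assumed layout with the counter as unsigned int. */
struct iter_int {
	struct mem_cgroup *last_visited;
	unsigned int last_dead_count;
	unsigned int generation;
};

/* Assumed layout after widening the counter to unsigned long. */
struct iter_long {
	struct mem_cgroup *last_visited;
	unsigned long last_dead_count;
	unsigned int generation;
};

/*
 * True if two field offsets fall into the same long-sized slot, assuming
 * the struct itself starts on a long-aligned address.
 */
static bool same_word(size_t a, size_t b)
{
	return a / sizeof(unsigned long) == b / sizeof(unsigned long);
}

/* Print both layouts and check the property the widening is meant to give. */
static void show_layout(void)
{
	size_t ic = offsetof(struct iter_int, last_dead_count);
	size_t ig = offsetof(struct iter_int, generation);
	size_t lc = offsetof(struct iter_long, last_dead_count);
	size_t lg = offsetof(struct iter_long, generation);

	printf("unsigned int:  last_dead_count@%zu generation@%zu shared=%d\n",
	       ic, ig, same_word(ic, ig));
	printf("unsigned long: last_dead_count@%zu generation@%zu shared=%d\n",
	       lc, lg, same_word(lc, lg));

	/* A full-word counter never shares its slot with its neighbour. */
	assert(!same_word(lc, lg));
}
```

Calling show_layout() from a test program on the target ABI shows whether the unsigned int variant packs both fields into one slot. That is expected where unsigned long is twice the size of unsigned int, but the sketch leaves it to the printed offsets and does not assume it.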