From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx142.postini.com [74.125.245.142]) by kanga.kvack.org (Postfix) with SMTP id F07B26B0092 for ; Tue, 13 Nov 2012 10:31:18 -0500 (EST) From: Michal Hocko Subject: [RFC] rework mem_cgroup iterator Date: Tue, 13 Nov 2012 16:30:34 +0100 Message-Id: <1352820639-13521-1-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Hi all, this patch set tries to make mem_cgroup_iter saner in the way how it walks hierarchies. css->id based traversal is far from being ideal as it is not deterministic because it depends on the creation ordering. Diffstat looks promising but it is fair the say that the biggest cleanup is just css_get_next removal. The memcg code has grown a bit but I think it is worth the resulting outcome (the sanity ;)). The first patch fixes a potential misbehaving which I haven't seen but the fix is needed for the later patches anyway. We could take it alone as well but I do not have any bug report to base the fix on. The second patch replaces css_get_next by cgroup iterators which are scheduled for 3.8 in Tejun's tree and I depend on the following two patches: fe1e904c cgroup: implement generic child / descendant walk macros 7e187c6c cgroup: use rculist ops for cgroup->children The third patch is an attempt for simplification of the mem_cgroup_iter. It basically removes all css usages to make the code easier. The next patch removes the big while(!memcg) loop around the iterating logic. It could have been folded into #3 but I rather have the rework separate from the code moving noise. The last patch just removes css_get_next as there is no user for it any longer. I am also thinking that leaf-to-root iteration makes more sense but this patch is not included in the series yet because I have to think some more about the justification. So far I didn't get to testing but I am posting this early if everybody is OK with this change. Any thoughts? Cumulative diffstat: include/linux/cgroup.h | 7 --- kernel/cgroup.c | 49 --------------------- mm/memcontrol.c | 110 +++++++++++++++++++++++++++++++++--------------- 3 files changed, 75 insertions(+), 91 deletions(-) Michal Hocko (5): memcg: synchronize per-zone iterator access by a spinlock memcg: rework mem_cgroup_iter to use cgroup iterators memcg: simplify mem_cgroup_iter memcg: clean up mem_cgroup_iter cgroup: remove css_get_next -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx108.postini.com [74.125.245.108]) by kanga.kvack.org (Postfix) with SMTP id 5D10F6B0095 for ; Tue, 13 Nov 2012 10:31:19 -0500 (EST) From: Michal Hocko Subject: [RFC 1/5] memcg: synchronize per-zone iterator access by a spinlock Date: Tue, 13 Nov 2012 16:30:35 +0100 Message-Id: <1352820639-13521-2-git-send-email-mhocko@suse.cz> In-Reply-To: <1352820639-13521-1-git-send-email-mhocko@suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa per-zone per-priority iterator is aimed at coordinating concurrent reclaimers on the same hierarchy (or the global reclaim when all groups are reclaimed) so that all groups get reclaimed evenly as much as possible. iter->position holds the last css->id visited and iter->generation signals the completed tree walk (when it is incremented). Concurrent reclaimers are supposed to provide a reclaim cookie which holds the reclaim priority and the last generation they saw. If cookie's generation doesn't match the iterator's view then other concurrent reclaimer already did the job and the tree walk is done for that priority. This scheme works nicely in most cases but it is not raceless. Two racing reclaimers can see the same iter->position and so bang on the same group. iter->generation increment is not serialized as well so a reclaimer can see an updated iter->position with and old generation so the iteration might be restarted from the root of the hierarchy. The simplest way to fix this issue is to synchronise access to the iterator by a lock. This implementation uses per-zone per-priority spinlock which linearizes only directly racing reclaimers which use reclaim cookies so the effect of the new locking should be really minimal. I have to note that I haven't seen this as a real issue so far. The primary motivation for the change is different. The following patch will change the way how the iterator is implemented and css->id iteration will be replaced cgroup generic iteration which requires storing mem_cgroup pointer into iterator and that requires reference counting and so concurrent access will be a problem. Signed-off-by: Michal Hocko --- mm/memcontrol.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6136fec..0fe5177 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -146,6 +146,8 @@ struct mem_cgroup_reclaim_iter { int position; /* scan generation, increased every round-trip */ unsigned int generation; + /* lock to protect the position and generation */ + spinlock_t iter_lock; }; /* @@ -1093,8 +1095,11 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, mz = mem_cgroup_zoneinfo(root, nid, zid); iter = &mz->reclaim_iter[reclaim->priority]; - if (prev && reclaim->generation != iter->generation) + spin_lock(&iter->iter_lock); + if (prev && reclaim->generation != iter->generation) { + spin_unlock(&iter->iter_lock); return NULL; + } id = iter->position; } @@ -1113,6 +1118,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, iter->generation++; else if (!prev && memcg) reclaim->generation = iter->generation; + spin_unlock(&iter->iter_lock); } if (prev && !css) @@ -5871,8 +5877,12 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node) return 1; for (zone = 0; zone < MAX_NR_ZONES; zone++) { + int prio; + mz = &pn->zoneinfo[zone]; lruvec_init(&mz->lruvec, &NODE_DATA(node)->node_zones[zone]); + for (prio = 0; prio < DEF_PRIORITY + 1; prio++) + spin_lock_init(&mz->reclaim_iter[prio].iter_lock); mz->usage_in_excess = 0; mz->on_tree = false; mz->memcg = memcg; -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx121.postini.com [74.125.245.121]) by kanga.kvack.org (Postfix) with SMTP id 83F1E6B0095 for ; Tue, 13 Nov 2012 10:31:21 -0500 (EST) From: Michal Hocko Subject: [RFC 3/5] memcg: simplify mem_cgroup_iter Date: Tue, 13 Nov 2012 16:30:37 +0100 Message-Id: <1352820639-13521-4-git-send-email-mhocko@suse.cz> In-Reply-To: <1352820639-13521-1-git-send-email-mhocko@suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Current implementation of mem_cgroup_iter has to consider both css and memcg to find out whether no group has been found (css==NULL - aka the loop is completed) and that no memcg is associated with the found node (!memcg - aka css_tryget failed because the group is no longer alive). This leads to awkward tweaks like tests for css && !memcg to skip the current node. It will be much easier if we got rid off css variable altogether and only rely on memcg. In order to do that the iteration part has to skip dead nodes. This sounds natural to me and as a nice side effect we will get a simple invariant that memcg is always alive when non-NULL and all nodes have been visited otherwise. We could get rid of the surrounding while loop but keep it for now for an easier review. It will go away in the next patch. Signed-off-by: Michal Hocko --- mm/memcontrol.c | 52 ++++++++++++++++++++++++---------------------------- 1 file changed, 24 insertions(+), 28 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5da1e58..dd84094 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1086,7 +1086,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, while (!memcg) { struct mem_cgroup_reclaim_iter *uninitialized_var(iter); - struct cgroup_subsys_state *css = NULL; if (reclaim) { int nid = zone_to_nid(reclaim->zone); @@ -1113,49 +1112,46 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, * treatment. */ if (!last_visited) { - css = &root->css; + memcg = root; } else { - struct cgroup *next_cgroup; - + struct cgroup *next_cgroup, + *pos = last_visited->css.cgroup; +skip_node: next_cgroup = cgroup_next_descendant_pre( - last_visited->css.cgroup, + pos, root->css.cgroup); - if (next_cgroup) - css = cgroup_subsys_state(next_cgroup, - mem_cgroup_subsys_id); + /* + * Even if we find a group we have to make sure it is + * alive. If not we, should skip the node. + */ + if (next_cgroup) { + struct mem_cgroup *mem = mem_cgroup_from_cont( + next_cgroup); + if (css_tryget(&mem->css)) + memcg = mem; + else { + pos = next_cgroup; + goto skip_node; + } + } } - /* - * Even if we find a group we have to make sure it is alive. - * css && !memcg means that the groups should be skipped and - * we should continue the tree walk. - */ - if (css == &root->css || (css && css_tryget(css))) - memcg = mem_cgroup_from_css(css); - if (reclaim) { - struct mem_cgroup *curr = memcg; - if (last_visited) mem_cgroup_put(last_visited); + if (memcg) + mem_cgroup_get(memcg); + iter->last_visited = memcg; - if (css && !memcg) - curr = mem_cgroup_from_css(css); - if (curr) - mem_cgroup_get(curr); - iter->last_visited = curr; - - if (!css) + if (!memcg) iter->generation++; else if (!prev && memcg) reclaim->generation = iter->generation; spin_unlock(&iter->iter_lock); - } else if (css && !memcg) { - last_visited = mem_cgroup_from_css(css); } rcu_read_unlock(); - if (prev && !css) + if (prev && !memcg) return NULL; } return memcg; -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx147.postini.com [74.125.245.147]) by kanga.kvack.org (Postfix) with SMTP id 782F46B0098 for ; Tue, 13 Nov 2012 10:31:20 -0500 (EST) From: Michal Hocko Subject: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Date: Tue, 13 Nov 2012 16:30:36 +0100 Message-Id: <1352820639-13521-3-git-send-email-mhocko@suse.cz> In-Reply-To: <1352820639-13521-1-git-send-email-mhocko@suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa mem_cgroup_iter curently relies on css->id when walking down a group hierarchy tree. This is really awkward because the tree walk depends on the groups creation ordering. The only guarantee is that a parent node is visited before its children. Example 1) mkdir -p a a/d a/b/c 2) mkdir -a a/b/c a/d Will create the same trees but the tree walks will be different: 1) a, d, b, c 2) a, b, c, d 574bd9f7 (cgroup: implement generic child / descendant walk macros) has introduced generic cgroup tree walkers which provide either pre-order or post-order tree walk. This patch converts css->id based iteration to pre-order tree walk to keep the semantic with the original iterator where parent is always visited before its subtree. cgroup_for_each_descendant_pre suggests using post_create and pre_destroy for proper synchronization with groups addidition resp. removal. This implementation doesn't use those because a new memory cgroup is fully initialized in mem_cgroup_create and css_tryget makes sure that the group is alive when we encounter it by iterator. If the reclaim cookie is used we need to store the last visited group into the iterator so we have to be careful that it doesn't disappear in the mean time. Elevated reference count on the memcg guarantees that the group will not vanish even though it has been already removed from the tree. In such a case css_tryget will fail and the iteration is retried (groups are linked with RCU safe lists so the forward progress is still possible). iter_lock will make sure that only one reclaimer will see the last_visited group and the reference count game around it. Signed-off-by: Michal Hocko --- mm/memcontrol.c | 64 ++++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 49 insertions(+), 15 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0fe5177..5da1e58 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -142,8 +142,8 @@ struct mem_cgroup_stat_cpu { }; struct mem_cgroup_reclaim_iter { - /* css_id of the last scanned hierarchy member */ - int position; + /* last scanned hierarchy member with elevated ref count */ + struct mem_cgroup *last_visited; /* scan generation, increased every round-trip */ unsigned int generation; /* lock to protect the position and generation */ @@ -1063,8 +1063,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, struct mem_cgroup_reclaim_cookie *reclaim) { - struct mem_cgroup *memcg = NULL; - int id = 0; + struct mem_cgroup *memcg = NULL, + *last_visited = NULL; if (mem_cgroup_disabled()) return NULL; @@ -1073,7 +1073,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, root = root_mem_cgroup; if (prev && !reclaim) - id = css_id(&prev->css); + last_visited = prev; if (prev && prev != root) css_put(&prev->css); @@ -1086,7 +1086,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, while (!memcg) { struct mem_cgroup_reclaim_iter *uninitialized_var(iter); - struct cgroup_subsys_state *css; + struct cgroup_subsys_state *css = NULL; if (reclaim) { int nid = zone_to_nid(reclaim->zone); @@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, mz = mem_cgroup_zoneinfo(root, nid, zid); iter = &mz->reclaim_iter[reclaim->priority]; spin_lock(&iter->iter_lock); + last_visited = iter->last_visited; if (prev && reclaim->generation != iter->generation) { + if (last_visited) { + mem_cgroup_put(last_visited); + iter->last_visited = NULL; + } spin_unlock(&iter->iter_lock); return NULL; } - id = iter->position; } rcu_read_lock(); - css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id); - if (css) { - if (css == &root->css || css_tryget(css)) - memcg = mem_cgroup_from_css(css); - } else - id = 0; - rcu_read_unlock(); + /* + * Root is not visited by cgroup iterators so it needs a special + * treatment. + */ + if (!last_visited) { + css = &root->css; + } else { + struct cgroup *next_cgroup; + + next_cgroup = cgroup_next_descendant_pre( + last_visited->css.cgroup, + root->css.cgroup); + if (next_cgroup) + css = cgroup_subsys_state(next_cgroup, + mem_cgroup_subsys_id); + } + + /* + * Even if we find a group we have to make sure it is alive. + * css && !memcg means that the groups should be skipped and + * we should continue the tree walk. + */ + if (css == &root->css || (css && css_tryget(css))) + memcg = mem_cgroup_from_css(css); if (reclaim) { - iter->position = id; + struct mem_cgroup *curr = memcg; + + if (last_visited) + mem_cgroup_put(last_visited); + + if (css && !memcg) + curr = mem_cgroup_from_css(css); + if (curr) + mem_cgroup_get(curr); + iter->last_visited = curr; + if (!css) iter->generation++; else if (!prev && memcg) reclaim->generation = iter->generation; spin_unlock(&iter->iter_lock); + } else if (css && !memcg) { + last_visited = mem_cgroup_from_css(css); } + rcu_read_unlock(); if (prev && !css) return NULL; -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx194.postini.com [74.125.245.194]) by kanga.kvack.org (Postfix) with SMTP id A2B696B009A for ; Tue, 13 Nov 2012 10:31:22 -0500 (EST) From: Michal Hocko Subject: [RFC 4/5] memcg: clean up mem_cgroup_iter Date: Tue, 13 Nov 2012 16:30:38 +0100 Message-Id: <1352820639-13521-5-git-send-email-mhocko@suse.cz> In-Reply-To: <1352820639-13521-1-git-send-email-mhocko@suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Get rid of while(!memcg) loop as it is no longer needed because there will always be at least one group that should be visited (root). This patch doesn't add any change to the implementation but it is separate to make a review easier. Signed-off-by: Michal Hocko --- mm/memcontrol.c | 120 +++++++++++++++++++++++++++---------------------------- 1 file changed, 60 insertions(+), 60 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index dd84094..b924f27 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1063,6 +1063,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, struct mem_cgroup_reclaim_cookie *reclaim) { + struct mem_cgroup_reclaim_iter *uninitialized_var(iter); struct mem_cgroup *memcg = NULL, *last_visited = NULL; @@ -1084,76 +1085,75 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, return root; } - while (!memcg) { - struct mem_cgroup_reclaim_iter *uninitialized_var(iter); - - if (reclaim) { - int nid = zone_to_nid(reclaim->zone); - int zid = zone_idx(reclaim->zone); - struct mem_cgroup_per_zone *mz; - - mz = mem_cgroup_zoneinfo(root, nid, zid); - iter = &mz->reclaim_iter[reclaim->priority]; - spin_lock(&iter->iter_lock); - last_visited = iter->last_visited; - if (prev && reclaim->generation != iter->generation) { - if (last_visited) { - mem_cgroup_put(last_visited); - iter->last_visited = NULL; - } - spin_unlock(&iter->iter_lock); - return NULL; + if (reclaim) { + int nid = zone_to_nid(reclaim->zone); + int zid = zone_idx(reclaim->zone); + struct mem_cgroup_per_zone *mz; + + mz = mem_cgroup_zoneinfo(root, nid, zid); + iter = &mz->reclaim_iter[reclaim->priority]; + spin_lock(&iter->iter_lock); + last_visited = iter->last_visited; + if (prev && reclaim->generation != iter->generation) { + if (last_visited) { + mem_cgroup_put(last_visited); + iter->last_visited = NULL; } + spin_unlock(&iter->iter_lock); + return NULL; } + } - rcu_read_lock(); + rcu_read_lock(); + /* + * Root is not visited by cgroup iterators so it needs a special + * treatment. + */ + if (!last_visited) { + memcg = root; + } else { + struct cgroup *next_cgroup, + *pos = last_visited->css.cgroup; +skip_node: + next_cgroup = cgroup_next_descendant_pre( + pos, + root->css.cgroup); /* - * Root is not visited by cgroup iterators so it needs a special - * treatment. + * Even if we find a group we have to make sure it is + * alive. If not we, should skip the node. */ - if (!last_visited) { - memcg = root; - } else { - struct cgroup *next_cgroup, - *pos = last_visited->css.cgroup; -skip_node: - next_cgroup = cgroup_next_descendant_pre( - pos, - root->css.cgroup); - /* - * Even if we find a group we have to make sure it is - * alive. If not we, should skip the node. - */ - if (next_cgroup) { - struct mem_cgroup *mem = mem_cgroup_from_cont( - next_cgroup); - if (css_tryget(&mem->css)) - memcg = mem; - else { - pos = next_cgroup; - goto skip_node; - } + if (next_cgroup) { + struct mem_cgroup *mem = mem_cgroup_from_cont( + next_cgroup); + if (css_tryget(&mem->css)) + memcg = mem; + else { + pos = next_cgroup; + goto skip_node; } } + } - if (reclaim) { - if (last_visited) - mem_cgroup_put(last_visited); - if (memcg) - mem_cgroup_get(memcg); - iter->last_visited = memcg; - - if (!memcg) - iter->generation++; - else if (!prev && memcg) - reclaim->generation = iter->generation; - spin_unlock(&iter->iter_lock); - } - rcu_read_unlock(); + if (reclaim) { + if (last_visited) + mem_cgroup_put(last_visited); + if (memcg) + mem_cgroup_get(memcg); + iter->last_visited = memcg; - if (prev && !memcg) - return NULL; + if (!memcg) + iter->generation++; + else if (!prev && memcg) + reclaim->generation = iter->generation; + spin_unlock(&iter->iter_lock); } + rcu_read_unlock(); + + /* + * At least root has to be visited + */ + VM_BUG_ON(!prev && !memcg); + return memcg; } -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx183.postini.com [74.125.245.183]) by kanga.kvack.org (Postfix) with SMTP id 8D3816B0095 for ; Tue, 13 Nov 2012 10:31:23 -0500 (EST) From: Michal Hocko Subject: [RFC 5/5] cgroup: remove css_get_next Date: Tue, 13 Nov 2012 16:30:39 +0100 Message-Id: <1352820639-13521-6-git-send-email-mhocko@suse.cz> In-Reply-To: <1352820639-13521-1-git-send-email-mhocko@suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Now that we have generic and well ordered cgroup tree walkers there is no need to keep css_get_next in the place. Signed-off-by: Michal Hocko --- include/linux/cgroup.h | 7 ------- kernel/cgroup.c | 49 ------------------------------------------------ 2 files changed, 56 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 329eb46..ba46041 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -676,13 +676,6 @@ void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css); struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id); -/* - * Get a cgroup whose id is greater than or equal to id under tree of root. - * Returning a cgroup_subsys_state or NULL. - */ -struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id, - struct cgroup_subsys_state *root, int *foundid); - /* Returns true if root is ancestor of cg */ bool css_is_ancestor(struct cgroup_subsys_state *cg, const struct cgroup_subsys_state *root); diff --git a/kernel/cgroup.c b/kernel/cgroup.c index d51958a..4d874b2 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -5230,55 +5230,6 @@ struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id) } EXPORT_SYMBOL_GPL(css_lookup); -/** - * css_get_next - lookup next cgroup under specified hierarchy. - * @ss: pointer to subsystem - * @id: current position of iteration. - * @root: pointer to css. search tree under this. - * @foundid: position of found object. - * - * Search next css under the specified hierarchy of rootid. Calling under - * rcu_read_lock() is necessary. Returns NULL if it reaches the end. - */ -struct cgroup_subsys_state * -css_get_next(struct cgroup_subsys *ss, int id, - struct cgroup_subsys_state *root, int *foundid) -{ - struct cgroup_subsys_state *ret = NULL; - struct css_id *tmp; - int tmpid; - int rootid = css_id(root); - int depth = css_depth(root); - - if (!rootid) - return NULL; - - BUG_ON(!ss->use_id); - WARN_ON_ONCE(!rcu_read_lock_held()); - - /* fill start point for scan */ - tmpid = id; - while (1) { - /* - * scan next entry from bitmap(tree), tmpid is updated after - * idr_get_next(). - */ - tmp = idr_get_next(&ss->idr, &tmpid); - if (!tmp) - break; - if (tmp->depth >= depth && tmp->stack[depth] == rootid) { - ret = rcu_dereference(tmp->css); - if (ret) { - *foundid = tmpid; - break; - } - } - /* continue to scan from next id */ - tmpid = tmpid + 1; - } - return ret; -} - /* * get corresponding css from file open on cgroupfs directory */ -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx145.postini.com [74.125.245.145]) by kanga.kvack.org (Postfix) with SMTP id D0A006B0075 for ; Tue, 13 Nov 2012 11:14:47 -0500 (EST) Received: by mail-da0-f41.google.com with SMTP id i14so3560394dad.14 for ; Tue, 13 Nov 2012 08:14:47 -0800 (PST) Date: Tue, 13 Nov 2012 08:14:42 -0800 From: Tejun Heo Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121113161442.GA18227@mtj.dyndns.org> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1352820639-13521-3-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa On Tue, Nov 13, 2012 at 04:30:36PM +0100, Michal Hocko wrote: > @@ -1063,8 +1063,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > struct mem_cgroup *prev, > struct mem_cgroup_reclaim_cookie *reclaim) > { > - struct mem_cgroup *memcg = NULL; > - int id = 0; > + struct mem_cgroup *memcg = NULL, > + *last_visited = NULL; Nitpick but please don't do this. > + /* > + * Root is not visited by cgroup iterators so it needs a special > + * treatment. > + */ > + if (!last_visited) { > + css = &root->css; > + } else { > + struct cgroup *next_cgroup; > + > + next_cgroup = cgroup_next_descendant_pre( > + last_visited->css.cgroup, > + root->css.cgroup); > + if (next_cgroup) > + css = cgroup_subsys_state(next_cgroup, > + mem_cgroup_subsys_id); Hmmm... wouldn't it be better to move the reclaim logic into a function and do the following? reclaim(root); for_each_descendent_pre() reclaim(descendant); If this is a problem, I'd be happy to add a iterator which includes the top node. I'd prefer controllers not using the next functions directly. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx160.postini.com [74.125.245.160]) by kanga.kvack.org (Postfix) with SMTP id A659C6B0070 for ; Tue, 13 Nov 2012 19:03:57 -0500 (EST) Received: from m4.gw.fujitsu.co.jp (unknown [10.0.50.74]) by fgwmail5.fujitsu.co.jp (Postfix) with ESMTP id A74053EE0AE for ; Wed, 14 Nov 2012 09:03:55 +0900 (JST) Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 8AE2C45DE52 for ; Wed, 14 Nov 2012 09:03:55 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 71D6145DE4D for ; Wed, 14 Nov 2012 09:03:55 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 632AE1DB803E for ; Wed, 14 Nov 2012 09:03:55 +0900 (JST) Received: from m1000.s.css.fujitsu.com (m1000.s.css.fujitsu.com [10.240.81.136]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 1A5291DB802F for ; Wed, 14 Nov 2012 09:03:55 +0900 (JST) Message-ID: <50A2DFDC.90402@jp.fujitsu.com> Date: Wed, 14 Nov 2012 09:03:40 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: [RFC 1/5] memcg: synchronize per-zone iterator access by a spinlock References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-2-git-send-email-mhocko@suse.cz> In-Reply-To: <1352820639-13521-2-git-send-email-mhocko@suse.cz> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa (2012/11/14 0:30), Michal Hocko wrote: > per-zone per-priority iterator is aimed at coordinating concurrent > reclaimers on the same hierarchy (or the global reclaim when all > groups are reclaimed) so that all groups get reclaimed evenly as > much as possible. iter->position holds the last css->id visited > and iter->generation signals the completed tree walk (when it is > incremented). > Concurrent reclaimers are supposed to provide a reclaim cookie which > holds the reclaim priority and the last generation they saw. If cookie's > generation doesn't match the iterator's view then other concurrent > reclaimer already did the job and the tree walk is done for that > priority. > > This scheme works nicely in most cases but it is not raceless. Two > racing reclaimers can see the same iter->position and so bang on the > same group. iter->generation increment is not serialized as well so a > reclaimer can see an updated iter->position with and old generation so > the iteration might be restarted from the root of the hierarchy. > > The simplest way to fix this issue is to synchronise access to the > iterator by a lock. This implementation uses per-zone per-priority > spinlock which linearizes only directly racing reclaimers which use > reclaim cookies so the effect of the new locking should be really > minimal. > > I have to note that I haven't seen this as a real issue so far. The > primary motivation for the change is different. The following patch > will change the way how the iterator is implemented and css->id > iteration will be replaced cgroup generic iteration which requires > storing mem_cgroup pointer into iterator and that requires reference > counting and so concurrent access will be a problem. > > Signed-off-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx113.postini.com [74.125.245.113]) by kanga.kvack.org (Postfix) with SMTP id AF9056B0072 for ; Tue, 13 Nov 2012 19:13:46 -0500 (EST) Received: from m3.gw.fujitsu.co.jp (unknown [10.0.50.73]) by fgwmail5.fujitsu.co.jp (Postfix) with ESMTP id 4A8953EE0C1 for ; Wed, 14 Nov 2012 09:13:45 +0900 (JST) Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 282C945DEBC for ; Wed, 14 Nov 2012 09:13:45 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 09B3645DEB5 for ; Wed, 14 Nov 2012 09:13:45 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id C67F3E08003 for ; Wed, 14 Nov 2012 09:13:44 +0900 (JST) Received: from m1001.s.css.fujitsu.com (m1001.s.css.fujitsu.com [10.240.81.139]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 75DED1DB803F for ; Wed, 14 Nov 2012 09:13:44 +0900 (JST) Message-ID: <50A2E229.3050809@jp.fujitsu.com> Date: Wed, 14 Nov 2012 09:13:29 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: [RFC] rework mem_cgroup iterator References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> In-Reply-To: <1352820639-13521-1-git-send-email-mhocko@suse.cz> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa (2012/11/14 0:30), Michal Hocko wrote: > Hi all, > this patch set tries to make mem_cgroup_iter saner in the way how it > walks hierarchies. css->id based traversal is far from being ideal as it > is not deterministic because it depends on the creation ordering. > > Diffstat looks promising but it is fair the say that the biggest cleanup is > just css_get_next removal. The memcg code has grown a bit but I think it is > worth the resulting outcome (the sanity ;)). > > The first patch fixes a potential misbehaving which I haven't seen but the > fix is needed for the later patches anyway. We could take it alone as well > but I do not have any bug report to base the fix on. > > The second patch replaces css_get_next by cgroup iterators which are > scheduled for 3.8 in Tejun's tree and I depend on the following two patches: > fe1e904c cgroup: implement generic child / descendant walk macros > 7e187c6c cgroup: use rculist ops for cgroup->children > > The third patch is an attempt for simplification of the mem_cgroup_iter. It > basically removes all css usages to make the code easier. The next patch > removes the big while(!memcg) loop around the iterating logic. It could have > been folded into #3 but I rather have the rework separate from the code > moving noise. > > The last patch just removes css_get_next as there is no user for it any > longer. > > I am also thinking that leaf-to-root iteration makes more sense but this > patch is not included in the series yet because I have to think some > more about the justification. > > So far I didn't get to testing but I am posting this early if everybody is > OK with this change. > > Any thoughts? > I'm O.K. Maybe I have some points I'm not understanding...I'll make a reply to patches. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx205.postini.com [74.125.245.205]) by kanga.kvack.org (Postfix) with SMTP id 1DF3F6B0072 for ; Tue, 13 Nov 2012 19:20:23 -0500 (EST) Received: from m1.gw.fujitsu.co.jp (unknown [10.0.50.71]) by fgwmail5.fujitsu.co.jp (Postfix) with ESMTP id 2C98C3EE0BC for ; Wed, 14 Nov 2012 09:20:21 +0900 (JST) Received: from smail (m1 [127.0.0.1]) by outgoing.m1.gw.fujitsu.co.jp (Postfix) with ESMTP id 0E91E45DE5C for ; Wed, 14 Nov 2012 09:20:21 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (s1.gw.fujitsu.co.jp [10.0.50.91]) by m1.gw.fujitsu.co.jp (Postfix) with ESMTP id EAE6245DE55 for ; Wed, 14 Nov 2012 09:20:20 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id DFA481DB8056 for ; Wed, 14 Nov 2012 09:20:20 +0900 (JST) Received: from ml14.s.css.fujitsu.com (ml14.s.css.fujitsu.com [10.240.81.134]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id 914651DB804B for ; Wed, 14 Nov 2012 09:20:20 +0900 (JST) Message-ID: <50A2E3B3.6080007@jp.fujitsu.com> Date: Wed, 14 Nov 2012 09:20:03 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> In-Reply-To: <1352820639-13521-3-git-send-email-mhocko@suse.cz> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa (2012/11/14 0:30), Michal Hocko wrote: > mem_cgroup_iter curently relies on css->id when walking down a group > hierarchy tree. This is really awkward because the tree walk depends on > the groups creation ordering. The only guarantee is that a parent node > is visited before its children. > Example > 1) mkdir -p a a/d a/b/c > 2) mkdir -a a/b/c a/d > Will create the same trees but the tree walks will be different: > 1) a, d, b, c > 2) a, b, c, d > > 574bd9f7 (cgroup: implement generic child / descendant walk macros) has > introduced generic cgroup tree walkers which provide either pre-order > or post-order tree walk. This patch converts css->id based iteration > to pre-order tree walk to keep the semantic with the original iterator > where parent is always visited before its subtree. > > cgroup_for_each_descendant_pre suggests using post_create and > pre_destroy for proper synchronization with groups addidition resp. > removal. This implementation doesn't use those because a new memory > cgroup is fully initialized in mem_cgroup_create and css_tryget makes > sure that the group is alive when we encounter it by iterator. > > If the reclaim cookie is used we need to store the last visited group > into the iterator so we have to be careful that it doesn't disappear in > the mean time. Elevated reference count on the memcg guarantees that > the group will not vanish even though it has been already removed from > the tree. In such a case css_tryget will fail and the iteration is > retried (groups are linked with RCU safe lists so the forward progress > is still possible). iter_lock will make sure that only one reclaimer > will see the last_visited group and the reference count game around it. > > Signed-off-by: Michal Hocko > --- > mm/memcontrol.c | 64 ++++++++++++++++++++++++++++++++++++++++++------------- > 1 file changed, 49 insertions(+), 15 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 0fe5177..5da1e58 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -142,8 +142,8 @@ struct mem_cgroup_stat_cpu { > }; > > struct mem_cgroup_reclaim_iter { > - /* css_id of the last scanned hierarchy member */ > - int position; > + /* last scanned hierarchy member with elevated ref count */ > + struct mem_cgroup *last_visited; > /* scan generation, increased every round-trip */ > unsigned int generation; > /* lock to protect the position and generation */ > @@ -1063,8 +1063,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > struct mem_cgroup *prev, > struct mem_cgroup_reclaim_cookie *reclaim) > { > - struct mem_cgroup *memcg = NULL; > - int id = 0; > + struct mem_cgroup *memcg = NULL, > + *last_visited = NULL; > > if (mem_cgroup_disabled()) > return NULL; > @@ -1073,7 +1073,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > root = root_mem_cgroup; > > if (prev && !reclaim) > - id = css_id(&prev->css); > + last_visited = prev; > > if (prev && prev != root) > css_put(&prev->css); > @@ -1086,7 +1086,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > > while (!memcg) { > struct mem_cgroup_reclaim_iter *uninitialized_var(iter); > - struct cgroup_subsys_state *css; > + struct cgroup_subsys_state *css = NULL; > > if (reclaim) { > int nid = zone_to_nid(reclaim->zone); > @@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > mz = mem_cgroup_zoneinfo(root, nid, zid); > iter = &mz->reclaim_iter[reclaim->priority]; > spin_lock(&iter->iter_lock); > + last_visited = iter->last_visited; > if (prev && reclaim->generation != iter->generation) { > + if (last_visited) { > + mem_cgroup_put(last_visited); > + iter->last_visited = NULL; > + } > spin_unlock(&iter->iter_lock); > return NULL; > } > - id = iter->position; > } > > rcu_read_lock(); > - css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id); > - if (css) { > - if (css == &root->css || css_tryget(css)) > - memcg = mem_cgroup_from_css(css); > - } else > - id = 0; > - rcu_read_unlock(); > + /* > + * Root is not visited by cgroup iterators so it needs a special > + * treatment. > + */ > + if (!last_visited) { > + css = &root->css; > + } else { > + struct cgroup *next_cgroup; > + > + next_cgroup = cgroup_next_descendant_pre( > + last_visited->css.cgroup, > + root->css.cgroup); Maybe I miss something but.... last_visited is holded by memcg's refcnt. The cgroup pointed by css.cgroup is by cgroup's refcnt which can be freed before memcg is freed and last_visited->css.cgroup is out of RCU cycle. Is this safe ? Thanks, -Kame > + if (next_cgroup) > + css = cgroup_subsys_state(next_cgroup, > + mem_cgroup_subsys_id); > + } > + > + /* > + * Even if we find a group we have to make sure it is alive. > + * css && !memcg means that the groups should be skipped and > + * we should continue the tree walk. > + */ > + if (css == &root->css || (css && css_tryget(css))) > + memcg = mem_cgroup_from_css(css); > > if (reclaim) { > - iter->position = id; > + struct mem_cgroup *curr = memcg; > + > + if (last_visited) > + mem_cgroup_put(last_visited); > + > + if (css && !memcg) > + curr = mem_cgroup_from_css(css); > + if (curr) > + mem_cgroup_get(curr); > + iter->last_visited = curr; > + > if (!css) > iter->generation++; > else if (!prev && memcg) > reclaim->generation = iter->generation; > spin_unlock(&iter->iter_lock); > + } else if (css && !memcg) { > + last_visited = mem_cgroup_from_css(css); > } > + rcu_read_unlock(); > > if (prev && !css) > return NULL; > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx153.postini.com [74.125.245.153]) by kanga.kvack.org (Postfix) with SMTP id 3F1516B0074 for ; Tue, 13 Nov 2012 20:55:17 -0500 (EST) Message-ID: <50A2F9FC.5050303@huawei.com> Date: Wed, 14 Nov 2012 09:55:08 +0800 From: Li Zefan MIME-Version: 1.0 Subject: Re: [RFC] rework mem_cgroup iterator References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> In-Reply-To: <1352820639-13521-1-git-send-email-mhocko@suse.cz> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa On 2012/11/13 23:30, Michal Hocko wrote: > Hi all, > this patch set tries to make mem_cgroup_iter saner in the way how it > walks hierarchies. css->id based traversal is far from being ideal as it > is not deterministic because it depends on the creation ordering. > > Diffstat looks promising but it is fair the say that the biggest cleanup is > just css_get_next removal. The memcg code has grown a bit but I think it is > worth the resulting outcome (the sanity ;)). > So memcg won't use css id at all, right? Then we can remove the whole css_id stuff, and that's quite a bunch of code. > The first patch fixes a potential misbehaving which I haven't seen but the > fix is needed for the later patches anyway. We could take it alone as well > but I do not have any bug report to base the fix on. > > The second patch replaces css_get_next by cgroup iterators which are > scheduled for 3.8 in Tejun's tree and I depend on the following two patches: > fe1e904c cgroup: implement generic child / descendant walk macros > 7e187c6c cgroup: use rculist ops for cgroup->children > > The third patch is an attempt for simplification of the mem_cgroup_iter. It > basically removes all css usages to make the code easier. The next patch > removes the big while(!memcg) loop around the iterating logic. It could have > been folded into #3 but I rather have the rework separate from the code > moving noise. > > The last patch just removes css_get_next as there is no user for it any > longer. > > I am also thinking that leaf-to-root iteration makes more sense but this > patch is not included in the series yet because I have to think some > more about the justification. > > So far I didn't get to testing but I am posting this early if everybody is > OK with this change. > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx158.postini.com [74.125.245.158]) by kanga.kvack.org (Postfix) with SMTP id 51DE36B005D for ; Wed, 14 Nov 2012 03:17:48 -0500 (EST) Message-ID: <50A3C42F.9020901@parallels.com> Date: Wed, 14 Nov 2012 17:17:51 +0100 From: Glauber Costa MIME-Version: 1.0 Subject: Re: [RFC] rework mem_cgroup iterator References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> In-Reply-To: <1352820639-13521-1-git-send-email-mhocko@suse.cz> Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo On 11/13/2012 04:30 PM, Michal Hocko wrote: > Hi all, > this patch set tries to make mem_cgroup_iter saner in the way how it > walks hierarchies. css->id based traversal is far from being ideal as it > is not deterministic because it depends on the creation ordering. > > Diffstat looks promising but it is fair the say that the biggest cleanup is > just css_get_next removal. The memcg code has grown a bit but I think it is > worth the resulting outcome (the sanity ;)). > > The first patch fixes a potential misbehaving which I haven't seen but the > fix is needed for the later patches anyway. We could take it alone as well > but I do not have any bug report to base the fix on. > > The second patch replaces css_get_next by cgroup iterators which are > scheduled for 3.8 in Tejun's tree and I depend on the following two patches: > fe1e904c cgroup: implement generic child / descendant walk macros > 7e187c6c cgroup: use rculist ops for cgroup->children > > The third patch is an attempt for simplification of the mem_cgroup_iter. It > basically removes all css usages to make the code easier. The next patch > removes the big while(!memcg) loop around the iterating logic. It could have > been folded into #3 but I rather have the rework separate from the code > moving noise. > > The last patch just removes css_get_next as there is no user for it any > longer. > > I am also thinking that leaf-to-root iteration makes more sense but this > patch is not included in the series yet because I have to think some > more about the justification. > > So far I didn't get to testing but I am posting this early if everybody is > OK with this change. > > Any thoughts? > Why can't we reuse the scheduler iterator and move it to kernel/cgroup.c ? It already exists, provide sane ordering, and only relies on parent information - which cgroup core already have - to do the walk. The only minor problem is that we'll have to handle the damn use_hierarchy case, so we may not be able to blindly rely on cgroup->parent. But maybe we can, if we don't guarantee any particular leaf-order. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx182.postini.com [74.125.245.182]) by kanga.kvack.org (Postfix) with SMTP id E5BFA6B006C for ; Wed, 14 Nov 2012 03:37:03 -0500 (EST) Date: Wed, 14 Nov 2012 09:36:53 +0100 From: Michal Hocko Subject: Re: [RFC] rework mem_cgroup iterator Message-ID: <20121114083653.GA17111@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <50A2F9FC.5050303@huawei.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50A2F9FC.5050303@huawei.com> Sender: owner-linux-mm@kvack.org List-ID: To: Li Zefan Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa On Wed 14-11-12 09:55:08, Li Zefan wrote: > On 2012/11/13 23:30, Michal Hocko wrote: > > Hi all, > > this patch set tries to make mem_cgroup_iter saner in the way how it > > walks hierarchies. css->id based traversal is far from being ideal as it > > is not deterministic because it depends on the creation ordering. > > > > Diffstat looks promising but it is fair the say that the biggest cleanup is > > just css_get_next removal. The memcg code has grown a bit but I think it is > > worth the resulting outcome (the sanity ;)). > > > > So memcg won't use css id at all, right? Unfortunately we still use it for the swap accounting but that one could be replaced by something else, probably. Have to think about it. > Then we can remove the whole css_id stuff, and that's quite a bunch of > code. Is memcg the only user of css_id? Quick grep shows that yes but I haven't checked all the callers of the exported functions. I would be happy if more code goes away. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx196.postini.com [74.125.245.196]) by kanga.kvack.org (Postfix) with SMTP id 6FA8C6B0070 for ; Wed, 14 Nov 2012 03:40:17 -0500 (EST) Date: Wed, 14 Nov 2012 09:40:14 +0100 From: Michal Hocko Subject: Re: [RFC] rework mem_cgroup iterator Message-ID: <20121114084014.GB17111@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <50A3C42F.9020901@parallels.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50A3C42F.9020901@parallels.com> Sender: owner-linux-mm@kvack.org List-ID: To: Glauber Costa Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo On Wed 14-11-12 17:17:51, Glauber Costa wrote: [...] > Why can't we reuse the scheduler iterator and move it to kernel/cgroup.c? I do not care much about the internal implementation of the core iterators. Those implemented by Tejun make sense to me. I just want to get rid of css->id based ones. Memcg iterator, however, still needs its own iterator on top because we have to handle the parallel reclaimers. [...] -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx152.postini.com [74.125.245.152]) by kanga.kvack.org (Postfix) with SMTP id 227086B0080 for ; Wed, 14 Nov 2012 03:51:33 -0500 (EST) Date: Wed, 14 Nov 2012 09:51:29 +0100 From: Michal Hocko Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121114085129.GC17111@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <20121113161442.GA18227@mtj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121113161442.GA18227@mtj.dyndns.org> Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa On Tue 13-11-12 08:14:42, Tejun Heo wrote: > On Tue, Nov 13, 2012 at 04:30:36PM +0100, Michal Hocko wrote: > > @@ -1063,8 +1063,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > > struct mem_cgroup *prev, > > struct mem_cgroup_reclaim_cookie *reclaim) > > { > > - struct mem_cgroup *memcg = NULL; > > - int id = 0; > > + struct mem_cgroup *memcg = NULL, > > + *last_visited = NULL; > > Nitpick but please don't do this. OK, will make it grep friendlier; > > + /* > > + * Root is not visited by cgroup iterators so it needs a special > > + * treatment. > > + */ > > + if (!last_visited) { > > + css = &root->css; > > + } else { > > + struct cgroup *next_cgroup; > > + > > + next_cgroup = cgroup_next_descendant_pre( > > + last_visited->css.cgroup, > > + root->css.cgroup); > > + if (next_cgroup) > > + css = cgroup_subsys_state(next_cgroup, > > + mem_cgroup_subsys_id); > > Hmmm... wouldn't it be better to move the reclaim logic into a > function and do the following? > > reclaim(root); > for_each_descendent_pre() > reclaim(descendant); We cannot do for_each_descendent_pre here because we do not iterate through the whole hierarchy all the time. Check shrink_zone. > If this is a problem, I'd be happy to add a iterator which includes > the top node. This would help with the above if-else but I do not think this is the worst thing in the function ;) > I'd prefer controllers not using the next functions directly. Well, we will need to use it directly because of the single group reclaim mentioned above. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx105.postini.com [74.125.245.105]) by kanga.kvack.org (Postfix) with SMTP id 618B16B00A3 for ; Wed, 14 Nov 2012 05:10:58 -0500 (EST) Date: Wed, 14 Nov 2012 11:10:52 +0100 From: Michal Hocko Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121114101052.GD17111@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <50A2E3B3.6080007@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50A2E3B3.6080007@jp.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa On Wed 14-11-12 09:20:03, KAMEZAWA Hiroyuki wrote: > (2012/11/14 0:30), Michal Hocko wrote: [...] > > @@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > > mz = mem_cgroup_zoneinfo(root, nid, zid); > > iter = &mz->reclaim_iter[reclaim->priority]; > > spin_lock(&iter->iter_lock); > > + last_visited = iter->last_visited; > > if (prev && reclaim->generation != iter->generation) { > > + if (last_visited) { > > + mem_cgroup_put(last_visited); > > + iter->last_visited = NULL; > > + } > > spin_unlock(&iter->iter_lock); > > return NULL; > > } > > - id = iter->position; > > } > > > > rcu_read_lock(); > > - css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id); > > - if (css) { > > - if (css == &root->css || css_tryget(css)) > > - memcg = mem_cgroup_from_css(css); > > - } else > > - id = 0; > > - rcu_read_unlock(); > > + /* > > + * Root is not visited by cgroup iterators so it needs a special > > + * treatment. > > + */ > > + if (!last_visited) { > > + css = &root->css; > > + } else { > > + struct cgroup *next_cgroup; > > + > > + next_cgroup = cgroup_next_descendant_pre( > > + last_visited->css.cgroup, > > + root->css.cgroup); > > Maybe I miss something but.... last_visited is holded by memcg's refcnt. > The cgroup pointed by css.cgroup is by cgroup's refcnt which can be freed > before memcg is freed and last_visited->css.cgroup is out of RCU cycle. > Is this safe ? Good spotted. You are right. What I need to do is to check that the last_visited is alive and restart from the root if not. Something like the bellow (incremental patch on top of this one) should help, right? diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 30efd7e..c0a91a3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1105,6 +1105,16 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, spin_unlock(&iter->iter_lock); return NULL; } + /* + * memcg is still valid because we hold a reference but + * its cgroup might have vanished in the meantime so + * we have to double check it is alive and restart the + * tree walk otherwise. + */ + if (last_visited && !css_tryget(&last_visited->css)) { + mem_cgroup_put(last_visited); + last_visited = NULL; + } } rcu_read_lock(); @@ -1136,8 +1146,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, if (reclaim) { struct mem_cgroup *curr = memcg; - if (last_visited) + if (last_visited) { + css_put(&last_visited->css); mem_cgroup_put(last_visited); + } if (css && !memcg) curr = mem_cgroup_from_css(css); -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx114.postini.com [74.125.245.114]) by kanga.kvack.org (Postfix) with SMTP id 3839E6B0070 for ; Wed, 14 Nov 2012 13:30:13 -0500 (EST) Received: by mail-da0-f41.google.com with SMTP id i14so327640dad.14 for ; Wed, 14 Nov 2012 10:30:12 -0800 (PST) Date: Wed, 14 Nov 2012 10:30:07 -0800 From: Tejun Heo Subject: Re: [RFC] rework mem_cgroup iterator Message-ID: <20121114183007.GC21185@mtj.dyndns.org> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <50A2F9FC.5050303@huawei.com> <20121114083653.GA17111@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121114083653.GA17111@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: Li Zefan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa Hello, Michal. On Wed, Nov 14, 2012 at 09:36:53AM +0100, Michal Hocko wrote: > > So memcg won't use css id at all, right? > > Unfortunately we still use it for the swap accounting but that one could > be replaced by something else, probably. Have to think about it. I have a patch to add cgrp->id pending. From what I can see, memcg should be able to use that for swap accounting. > > Then we can remove the whole css_id stuff, and that's quite a bunch of > > code. Yeap, that's the plan. > Is memcg the only user of css_id? Quick grep shows that yes but I > haven't checked all the callers of the exported functions. I would be > happy if more code goes away. Yeap, memcg is the only user and I really wanna remove it once memcg moves onto saner stuff. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx129.postini.com [74.125.245.129]) by kanga.kvack.org (Postfix) with SMTP id A917F6B007D for ; Wed, 14 Nov 2012 13:41:15 -0500 (EST) Received: by mail-pa0-f41.google.com with SMTP id fa10so535920pad.14 for ; Wed, 14 Nov 2012 10:41:14 -0800 (PST) Date: Wed, 14 Nov 2012 10:41:10 -0800 From: Tejun Heo Subject: Re: [RFC] rework mem_cgroup iterator Message-ID: <20121114184110.GD21185@mtj.dyndns.org> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <50A3C42F.9020901@parallels.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50A3C42F.9020901@parallels.com> Sender: owner-linux-mm@kvack.org List-ID: To: Glauber Costa Cc: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han Hello, Glauber. On Wed, Nov 14, 2012 at 05:17:51PM +0100, Glauber Costa wrote: > Why can't we reuse the scheduler iterator and move it to kernel/cgroup.c > ? It already exists, provide sane ordering, and only relies on parent > information - which cgroup core already have - to do the walk. Hmmm... we can but I personally much prefer for_each_*() iterators over callback based ones. It's just much easier to share states across an iteration and follow the logic. walk_tg_tree_from() does have the benefit of being able to combine pre and post visits in the same walk, which doesn't seem to have any user at the moment. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx114.postini.com [74.125.245.114]) by kanga.kvack.org (Postfix) with SMTP id 4A8A36B0083 for ; Wed, 14 Nov 2012 13:45:08 -0500 (EST) Message-ID: <50A45729.4000203@parallels.com> Date: Thu, 15 Nov 2012 06:44:57 +0400 From: Glauber Costa MIME-Version: 1.0 Subject: Re: [RFC] rework mem_cgroup iterator References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <50A3C42F.9020901@parallels.com> <20121114184110.GD21185@mtj.dyndns.org> In-Reply-To: <20121114184110.GD21185@mtj.dyndns.org> Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo Cc: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Peter Zijlstra , Paul Turner On 11/14/2012 10:41 PM, Tejun Heo wrote: > Hello, Glauber. > > On Wed, Nov 14, 2012 at 05:17:51PM +0100, Glauber Costa wrote: >> Why can't we reuse the scheduler iterator and move it to kernel/cgroup.c >> ? It already exists, provide sane ordering, and only relies on parent >> information - which cgroup core already have - to do the walk. > > Hmmm... we can but I personally much prefer for_each_*() iterators > over callback based ones. It's just much easier to share states > across an iteration and follow the logic. walk_tg_tree_from() does > have the benefit of being able to combine pre and post visits in the > same walk, which doesn't seem to have any user at the moment. > > Thanks. > Is there any particular reason why we can't do the other way around then, and use a for_each_*() for sched walks? Without even consider what I personally prefer, what I really don't like is to have two different cgroup walkers when it seems like we could very well have just one. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx109.postini.com [74.125.245.109]) by kanga.kvack.org (Postfix) with SMTP id 174496B0087 for ; Wed, 14 Nov 2012 13:46:31 -0500 (EST) Received: by mail-pb0-f41.google.com with SMTP id xa7so620870pbc.14 for ; Wed, 14 Nov 2012 10:46:30 -0800 (PST) Date: Wed, 14 Nov 2012 10:46:25 -0800 From: Tejun Heo Subject: Re: [RFC] rework mem_cgroup iterator Message-ID: <20121114184625.GE21185@mtj.dyndns.org> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <50A3C42F.9020901@parallels.com> <20121114184110.GD21185@mtj.dyndns.org> <50A45729.4000203@parallels.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50A45729.4000203@parallels.com> Sender: owner-linux-mm@kvack.org List-ID: To: Glauber Costa Cc: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Peter Zijlstra , Paul Turner Hello, On Thu, Nov 15, 2012 at 06:44:57AM +0400, Glauber Costa wrote: > Is there any particular reason why we can't do the other way around > then, and use a for_each_*() for sched walks? Without even consider what > I personally prefer, what I really don't like is to have two different > cgroup walkers when it seems like we could very well have just one. Ooh, sure thing, let's do that. Will work on that. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx169.postini.com [74.125.245.169]) by kanga.kvack.org (Postfix) with SMTP id ACAED6B0092 for ; Wed, 14 Nov 2012 13:52:50 -0500 (EST) Received: by mail-da0-f41.google.com with SMTP id i14so336144dad.14 for ; Wed, 14 Nov 2012 10:52:50 -0800 (PST) Date: Wed, 14 Nov 2012 10:52:45 -0800 From: Tejun Heo Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121114185245.GF21185@mtj.dyndns.org> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <20121113161442.GA18227@mtj.dyndns.org> <20121114085129.GC17111@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121114085129.GC17111@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa Hello, Michal. On Wed, Nov 14, 2012 at 09:51:29AM +0100, Michal Hocko wrote: > > reclaim(root); > > for_each_descendent_pre() > > reclaim(descendant); > > We cannot do for_each_descendent_pre here because we do not iterate > through the whole hierarchy all the time. Check shrink_zone. I'm a bit confused. Why would that make any difference? Shouldn't it be just able to test the condition and continue? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx187.postini.com [74.125.245.187]) by kanga.kvack.org (Postfix) with SMTP id 92D016B005A for ; Wed, 14 Nov 2012 21:13:17 -0500 (EST) Received: from m3.gw.fujitsu.co.jp (unknown [10.0.50.73]) by fgwmail6.fujitsu.co.jp (Postfix) with ESMTP id 904663EE0BC for ; Thu, 15 Nov 2012 11:13:15 +0900 (JST) Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 776B745DEB2 for ; Thu, 15 Nov 2012 11:13:15 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 4BAE345DEB6 for ; Thu, 15 Nov 2012 11:13:15 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 3EF841DB8041 for ; Thu, 15 Nov 2012 11:13:15 +0900 (JST) Received: from m1000.s.css.fujitsu.com (m1000.s.css.fujitsu.com [10.240.81.136]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id E8B0D1DB8040 for ; Thu, 15 Nov 2012 11:13:14 +0900 (JST) Message-ID: <50A44FA0.5010305@jp.fujitsu.com> Date: Thu, 15 Nov 2012 11:12:48 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: [RFC] rework mem_cgroup iterator References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <50A2F9FC.5050303@huawei.com> In-Reply-To: <50A2F9FC.5050303@huawei.com> Content-Type: text/plain; charset=GB2312 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Li Zefan Cc: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa (2012/11/14 10:55), Li Zefan wrote: > On 2012/11/13 23:30, Michal Hocko wrote: >> Hi all, >> this patch set tries to make mem_cgroup_iter saner in the way how it >> walks hierarchies. css->id based traversal is far from being ideal as it >> is not deterministic because it depends on the creation ordering. >> >> Diffstat looks promising but it is fair the say that the biggest cleanup is >> just css_get_next removal. The memcg code has grown a bit but I think it is >> worth the resulting outcome (the sanity ;)). >> > > So memcg won't use css id at all, right? Then we can remove the whole css_id > stuff, and that's quite a bunch of code. > It's used by swap information recording for saving spaces. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx205.postini.com [74.125.245.205]) by kanga.kvack.org (Postfix) with SMTP id CBEA96B0070 for ; Wed, 14 Nov 2012 23:13:10 -0500 (EST) Received: from m2.gw.fujitsu.co.jp (unknown [10.0.50.72]) by fgwmail5.fujitsu.co.jp (Postfix) with ESMTP id 02DF83EE0BD for ; Thu, 15 Nov 2012 13:13:09 +0900 (JST) Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id AA44D45DE50 for ; Thu, 15 Nov 2012 13:13:08 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 8969645DE4D for ; Thu, 15 Nov 2012 13:13:08 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 7CEB11DB8038 for ; Thu, 15 Nov 2012 13:13:08 +0900 (JST) Received: from ml14.s.css.fujitsu.com (ml14.s.css.fujitsu.com [10.240.81.134]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 27197E08003 for ; Thu, 15 Nov 2012 13:13:08 +0900 (JST) Message-ID: <50A46BB6.6070902@jp.fujitsu.com> Date: Thu, 15 Nov 2012 13:12:38 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <50A2E3B3.6080007@jp.fujitsu.com> <20121114101052.GD17111@dhcp22.suse.cz> In-Reply-To: <20121114101052.GD17111@dhcp22.suse.cz> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa (2012/11/14 19:10), Michal Hocko wrote: > On Wed 14-11-12 09:20:03, KAMEZAWA Hiroyuki wrote: >> (2012/11/14 0:30), Michal Hocko wrote: > [...] >>> @@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, >>> mz = mem_cgroup_zoneinfo(root, nid, zid); >>> iter = &mz->reclaim_iter[reclaim->priority]; >>> spin_lock(&iter->iter_lock); >>> + last_visited = iter->last_visited; >>> if (prev && reclaim->generation != iter->generation) { >>> + if (last_visited) { >>> + mem_cgroup_put(last_visited); >>> + iter->last_visited = NULL; >>> + } >>> spin_unlock(&iter->iter_lock); >>> return NULL; >>> } >>> - id = iter->position; >>> } >>> >>> rcu_read_lock(); >>> - css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id); >>> - if (css) { >>> - if (css == &root->css || css_tryget(css)) >>> - memcg = mem_cgroup_from_css(css); >>> - } else >>> - id = 0; >>> - rcu_read_unlock(); >>> + /* >>> + * Root is not visited by cgroup iterators so it needs a special >>> + * treatment. >>> + */ >>> + if (!last_visited) { >>> + css = &root->css; >>> + } else { >>> + struct cgroup *next_cgroup; >>> + >>> + next_cgroup = cgroup_next_descendant_pre( >>> + last_visited->css.cgroup, >>> + root->css.cgroup); >> >> Maybe I miss something but.... last_visited is holded by memcg's refcnt. >> The cgroup pointed by css.cgroup is by cgroup's refcnt which can be freed >> before memcg is freed and last_visited->css.cgroup is out of RCU cycle. >> Is this safe ? > > Good spotted. You are right. What I need to do is to check that the > last_visited is alive and restart from the root if not. Something like > the bellow (incremental patch on top of this one) should help, right? > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 30efd7e..c0a91a3 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1105,6 +1105,16 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > spin_unlock(&iter->iter_lock); > return NULL; > } > + /* > + * memcg is still valid because we hold a reference but > + * its cgroup might have vanished in the meantime so > + * we have to double check it is alive and restart the > + * tree walk otherwise. > + */ > + if (last_visited && !css_tryget(&last_visited->css)) { > + mem_cgroup_put(last_visited); > + last_visited = NULL; > + } > } > > rcu_read_lock(); > @@ -1136,8 +1146,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > if (reclaim) { > struct mem_cgroup *curr = memcg; > > - if (last_visited) > + if (last_visited) { > + css_put(&last_visited->css); > mem_cgroup_put(last_visited); > + } > > if (css && !memcg) > curr = mem_cgroup_from_css(css); > I think this will work. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx204.postini.com [74.125.245.204]) by kanga.kvack.org (Postfix) with SMTP id 350746B00A7 for ; Thu, 15 Nov 2012 04:51:08 -0500 (EST) Date: Thu, 15 Nov 2012 10:51:03 +0100 From: Michal Hocko Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121115095103.GB11990@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <20121113161442.GA18227@mtj.dyndns.org> <20121114085129.GC17111@dhcp22.suse.cz> <20121114185245.GF21185@mtj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121114185245.GF21185@mtj.dyndns.org> Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa On Wed 14-11-12 10:52:45, Tejun Heo wrote: > Hello, Michal. > > On Wed, Nov 14, 2012 at 09:51:29AM +0100, Michal Hocko wrote: > > > reclaim(root); > > > for_each_descendent_pre() > > > reclaim(descendant); > > > > We cannot do for_each_descendent_pre here because we do not iterate > > through the whole hierarchy all the time. Check shrink_zone. > > I'm a bit confused. Why would that make any difference? Shouldn't it > be just able to test the condition and continue? Ohh, I misunderstood your proposal. So what you are suggesting is to put all the logic we have in mem_cgroup_iter inside what you call reclaim here + mem_cgroup_iter_break inside the loop, right? I do not see how this would help us much. mem_cgroup_iter is not the nicest piece of code but it handles quite a complex requirements that we have currently (css reference count, multiple reclaimers racing). So I would rather keep it this way. Further simplifications are welcome of course. Is there any reason why you are not happy about direct using of cgroup_next_descendant_pre? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx180.postini.com [74.125.245.180]) by kanga.kvack.org (Postfix) with SMTP id DDDEC6B00AA for ; Thu, 15 Nov 2012 04:52:37 -0500 (EST) Date: Thu, 15 Nov 2012 10:52:35 +0100 From: Michal Hocko Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121115095235.GC11990@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <50A2E3B3.6080007@jp.fujitsu.com> <20121114101052.GD17111@dhcp22.suse.cz> <50A46BB6.6070902@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50A46BB6.6070902@jp.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa On Thu 15-11-12 13:12:38, KAMEZAWA Hiroyuki wrote: > (2012/11/14 19:10), Michal Hocko wrote: > >On Wed 14-11-12 09:20:03, KAMEZAWA Hiroyuki wrote: > >>(2012/11/14 0:30), Michal Hocko wrote: > >[...] > >>>@@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > >>> mz = mem_cgroup_zoneinfo(root, nid, zid); > >>> iter = &mz->reclaim_iter[reclaim->priority]; > >>> spin_lock(&iter->iter_lock); > >>>+ last_visited = iter->last_visited; > >>> if (prev && reclaim->generation != iter->generation) { > >>>+ if (last_visited) { > >>>+ mem_cgroup_put(last_visited); > >>>+ iter->last_visited = NULL; > >>>+ } > >>> spin_unlock(&iter->iter_lock); > >>> return NULL; > >>> } > >>>- id = iter->position; > >>> } > >>> > >>> rcu_read_lock(); > >>>- css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id); > >>>- if (css) { > >>>- if (css == &root->css || css_tryget(css)) > >>>- memcg = mem_cgroup_from_css(css); > >>>- } else > >>>- id = 0; > >>>- rcu_read_unlock(); > >>>+ /* > >>>+ * Root is not visited by cgroup iterators so it needs a special > >>>+ * treatment. > >>>+ */ > >>>+ if (!last_visited) { > >>>+ css = &root->css; > >>>+ } else { > >>>+ struct cgroup *next_cgroup; > >>>+ > >>>+ next_cgroup = cgroup_next_descendant_pre( > >>>+ last_visited->css.cgroup, > >>>+ root->css.cgroup); > >> > >>Maybe I miss something but.... last_visited is holded by memcg's refcnt. > >>The cgroup pointed by css.cgroup is by cgroup's refcnt which can be freed > >>before memcg is freed and last_visited->css.cgroup is out of RCU cycle. > >>Is this safe ? > > > >Good spotted. You are right. What I need to do is to check that the > >last_visited is alive and restart from the root if not. Something like > >the bellow (incremental patch on top of this one) should help, right? > > > >diff --git a/mm/memcontrol.c b/mm/memcontrol.c > >index 30efd7e..c0a91a3 100644 > >--- a/mm/memcontrol.c > >+++ b/mm/memcontrol.c > >@@ -1105,6 +1105,16 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > > spin_unlock(&iter->iter_lock); > > return NULL; > > } > >+ /* > >+ * memcg is still valid because we hold a reference but > >+ * its cgroup might have vanished in the meantime so > >+ * we have to double check it is alive and restart the > >+ * tree walk otherwise. > >+ */ > >+ if (last_visited && !css_tryget(&last_visited->css)) { > >+ mem_cgroup_put(last_visited); > >+ last_visited = NULL; > >+ } > > } > > > > rcu_read_lock(); > >@@ -1136,8 +1146,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > > if (reclaim) { > > struct mem_cgroup *curr = memcg; > > > >- if (last_visited) > >+ if (last_visited) { > >+ css_put(&last_visited->css); > > mem_cgroup_put(last_visited); > >+ } > > > > if (css && !memcg) > > curr = mem_cgroup_from_css(css); > > > > I think this will work. Thanks for double checking. The updated patch: --- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx124.postini.com [74.125.245.124]) by kanga.kvack.org (Postfix) with SMTP id 0A7746B002B for ; Thu, 15 Nov 2012 09:47:37 -0500 (EST) Received: by mail-pa0-f41.google.com with SMTP id fa10so1221335pad.14 for ; Thu, 15 Nov 2012 06:47:37 -0800 (PST) Date: Thu, 15 Nov 2012 06:47:32 -0800 From: Tejun Heo Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121115144732.GB7306@mtj.dyndns.org> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <20121113161442.GA18227@mtj.dyndns.org> <20121114085129.GC17111@dhcp22.suse.cz> <20121114185245.GF21185@mtj.dyndns.org> <20121115095103.GB11990@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121115095103.GB11990@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa Hello, Michal. On Thu, Nov 15, 2012 at 10:51:03AM +0100, Michal Hocko wrote: > > I'm a bit confused. Why would that make any difference? Shouldn't it > > be just able to test the condition and continue? > > Ohh, I misunderstood your proposal. So what you are suggesting is > to put all the logic we have in mem_cgroup_iter inside what you call > reclaim here + mem_cgroup_iter_break inside the loop, right? > > I do not see how this would help us much. mem_cgroup_iter is not the > nicest piece of code but it handles quite a complex requirements that we > have currently (css reference count, multiple reclaimers racing). So I > would rather keep it this way. Further simplifications are welcome of > course. > > Is there any reason why you are not happy about direct using of > cgroup_next_descendant_pre? Because I'd like to consider the next functions as implementation detail, and having interations structred as loops tend to read better and less error-prone. e.g. when you use next functions directly, it's way easier to circumvent locking requirements in a way which isn't very obvious. So, unless it messes up the code too much (and I can't see why it would), I'd much prefer if memcg used for_each_*() macros. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx159.postini.com [74.125.245.159]) by kanga.kvack.org (Postfix) with SMTP id 878096B006E for ; Thu, 15 Nov 2012 10:12:58 -0500 (EST) Date: Thu, 15 Nov 2012 16:12:55 +0100 From: Michal Hocko Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121115151255.GE11990@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <20121113161442.GA18227@mtj.dyndns.org> <20121114085129.GC17111@dhcp22.suse.cz> <20121114185245.GF21185@mtj.dyndns.org> <20121115095103.GB11990@dhcp22.suse.cz> <20121115144732.GB7306@mtj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121115144732.GB7306@mtj.dyndns.org> Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa On Thu 15-11-12 06:47:32, Tejun Heo wrote: > Hello, Michal. > > On Thu, Nov 15, 2012 at 10:51:03AM +0100, Michal Hocko wrote: > > > I'm a bit confused. Why would that make any difference? Shouldn't it > > > be just able to test the condition and continue? > > > > Ohh, I misunderstood your proposal. So what you are suggesting is > > to put all the logic we have in mem_cgroup_iter inside what you call > > reclaim here + mem_cgroup_iter_break inside the loop, right? > > > > I do not see how this would help us much. mem_cgroup_iter is not the > > nicest piece of code but it handles quite a complex requirements that we > > have currently (css reference count, multiple reclaimers racing). So I > > would rather keep it this way. Further simplifications are welcome of > > course. > > > > Is there any reason why you are not happy about direct using of > > cgroup_next_descendant_pre? > > Because I'd like to consider the next functions as implementation > detail, and having interations structred as loops tend to read better > and less error-prone. e.g. when you use next functions directly, it's > way easier to circumvent locking requirements in a way which isn't > very obvious. The whole point behind mem_cgroup_iter is to hide all the complexity behind memcg iteration. Memcg code either use for_each_mem_cgroup_tree for !reclaim case and mem_cgroup_iter otherwise. > So, unless it messes up the code too much (and I can't see why it > would), I'd much prefer if memcg used for_each_*() macros. As I said this would mean that the current mem_cgroup_iter code would have to be inverted which doesn't simplify the code much. I'd rather hide all the grossy details inside the memcg iterator. Or am I still missing your suggestion? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx198.postini.com [74.125.245.198]) by kanga.kvack.org (Postfix) with SMTP id BCFCB6B002B for ; Thu, 15 Nov 2012 10:31:29 -0500 (EST) Received: by mail-pa0-f41.google.com with SMTP id fa10so1251426pad.14 for ; Thu, 15 Nov 2012 07:31:29 -0800 (PST) Date: Thu, 15 Nov 2012 07:31:24 -0800 From: Tejun Heo Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121115153124.GD7306@mtj.dyndns.org> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <20121113161442.GA18227@mtj.dyndns.org> <20121114085129.GC17111@dhcp22.suse.cz> <20121114185245.GF21185@mtj.dyndns.org> <20121115095103.GB11990@dhcp22.suse.cz> <20121115144732.GB7306@mtj.dyndns.org> <20121115151255.GE11990@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121115151255.GE11990@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa Hello, Michal. On Thu, Nov 15, 2012 at 04:12:55PM +0100, Michal Hocko wrote: > > Because I'd like to consider the next functions as implementation > > detail, and having interations structred as loops tend to read better > > and less error-prone. e.g. when you use next functions directly, it's > > way easier to circumvent locking requirements in a way which isn't > > very obvious. > > The whole point behind mem_cgroup_iter is to hide all the complexity > behind memcg iteration. Memcg code either use for_each_mem_cgroup_tree > for !reclaim case and mem_cgroup_iter otherwise. > > > So, unless it messes up the code too much (and I can't see why it > > would), I'd much prefer if memcg used for_each_*() macros. > > As I said this would mean that the current mem_cgroup_iter code would > have to be inverted which doesn't simplify the code much. I'd rather > hide all the grossy details inside the memcg iterator. > Or am I still missing your suggestion? One way or the other, I don't think the code complexity would change much. Again, I'd much *prefer* if memcg used what other controllers would be using, but that's a preference and if necessary we can keep the next functions as exposed APIs. I think the issue I have is that I can't see much technical justification for that. If the code becomes much simpler by choosing one over the other, sure, but is that the case here? Isn't it mostly just about where to put the same things? If so, what would be the rationale for requiring a different interface? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx194.postini.com [74.125.245.194]) by kanga.kvack.org (Postfix) with SMTP id 747BC6B004D for ; Thu, 15 Nov 2012 11:15:08 -0500 (EST) Date: Thu, 15 Nov 2012 17:15:04 +0100 From: Michal Hocko Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121115161504.GF11990@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <20121113161442.GA18227@mtj.dyndns.org> <20121114085129.GC17111@dhcp22.suse.cz> <20121114185245.GF21185@mtj.dyndns.org> <20121115095103.GB11990@dhcp22.suse.cz> <20121115144732.GB7306@mtj.dyndns.org> <20121115151255.GE11990@dhcp22.suse.cz> <20121115153124.GD7306@mtj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121115153124.GD7306@mtj.dyndns.org> Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa On Thu 15-11-12 07:31:24, Tejun Heo wrote: > Hello, Michal. > > On Thu, Nov 15, 2012 at 04:12:55PM +0100, Michal Hocko wrote: > > > Because I'd like to consider the next functions as implementation > > > detail, and having interations structred as loops tend to read better > > > and less error-prone. e.g. when you use next functions directly, it's > > > way easier to circumvent locking requirements in a way which isn't > > > very obvious. > > > > The whole point behind mem_cgroup_iter is to hide all the complexity > > behind memcg iteration. Memcg code either use for_each_mem_cgroup_tree > > for !reclaim case and mem_cgroup_iter otherwise. > > > > > So, unless it messes up the code too much (and I can't see why it > > > would), I'd much prefer if memcg used for_each_*() macros. > > > > As I said this would mean that the current mem_cgroup_iter code would > > have to be inverted which doesn't simplify the code much. I'd rather > > hide all the grossy details inside the memcg iterator. > > Or am I still missing your suggestion? > > One way or the other, I don't think the code complexity would change > much. Again, I'd much *prefer* if memcg used what other controllers > would be using, but that's a preference and if necessary we can keep > the next functions as exposed APIs. Yes please. > I think the issue I have is that I can't see much technical > justification for that. If the code becomes much simpler by choosing > one over the other, sure, but is that the case here? Yes and I've tried to say that already. Memcg needs hierarchy, css ref counting and concurrent reclaim (per-zone per-priority) aware iteration. All of that is hidden in mem_cgroup_iter currently so the caller doesn't have to care about it at all. Which makes shrink_zone not care about memcg that much. cgroup_for_each_descendant_pre is not suitable at least because it doesn't provide a way to start a walk at a selected node (which is shared per-zone per-priority in memcg case). Even if cgroup_for_each_descendant_pre had start parameter there is still a lot of house keeping that callers would have to handle (css_tryget to start with, update of the cached possible not mentioning use_hierarchy thingy or mem_cgroup_disabled). We also try to not pollute mm/vmscan.c as much as possible so we definitely do not want to bring all this into shrink_zone. This all sounds like too much of a hassle if it is exposed so I would really like to stay with mem_cgroup_iter and slowly simplify it until it can go away (if that is possible at all). > Isn't it mostly just about where to put the same things? Unfortunately no. We wouldn't grow own iterator in such a case. > If so, what would be the rationale for requiring a different > interface? Does the above explain it? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx151.postini.com [74.125.245.151]) by kanga.kvack.org (Postfix) with SMTP id 848186B007B for ; Mon, 19 Nov 2012 09:05:07 -0500 (EST) Date: Mon, 19 Nov 2012 15:05:02 +0100 From: Michal Hocko Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121119140502.GA16803@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <50A2E3B3.6080007@jp.fujitsu.com> <20121114101052.GD17111@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121114101052.GD17111@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa On Wed 14-11-12 11:10:52, Michal Hocko wrote: > On Wed 14-11-12 09:20:03, KAMEZAWA Hiroyuki wrote: > > (2012/11/14 0:30), Michal Hocko wrote: > [...] > > > @@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > > > mz = mem_cgroup_zoneinfo(root, nid, zid); > > > iter = &mz->reclaim_iter[reclaim->priority]; > > > spin_lock(&iter->iter_lock); > > > + last_visited = iter->last_visited; > > > if (prev && reclaim->generation != iter->generation) { > > > + if (last_visited) { > > > + mem_cgroup_put(last_visited); > > > + iter->last_visited = NULL; > > > + } > > > spin_unlock(&iter->iter_lock); > > > return NULL; > > > } > > > - id = iter->position; > > > } > > > > > > rcu_read_lock(); > > > - css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id); > > > - if (css) { > > > - if (css == &root->css || css_tryget(css)) > > > - memcg = mem_cgroup_from_css(css); > > > - } else > > > - id = 0; > > > - rcu_read_unlock(); > > > + /* > > > + * Root is not visited by cgroup iterators so it needs a special > > > + * treatment. > > > + */ > > > + if (!last_visited) { > > > + css = &root->css; > > > + } else { > > > + struct cgroup *next_cgroup; > > > + > > > + next_cgroup = cgroup_next_descendant_pre( > > > + last_visited->css.cgroup, > > > + root->css.cgroup); > > > > Maybe I miss something but.... last_visited is holded by memcg's refcnt. > > The cgroup pointed by css.cgroup is by cgroup's refcnt which can be freed > > before memcg is freed and last_visited->css.cgroup is out of RCU cycle. > > Is this safe ? > > Good spotted. You are right. What I need to do is to check that the > last_visited is alive and restart from the root if not. Something like > the bellow (incremental patch on top of this one) should help, right? > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 30efd7e..c0a91a3 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1105,6 +1105,16 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > spin_unlock(&iter->iter_lock); > return NULL; > } > + /* > + * memcg is still valid because we hold a reference but > + * its cgroup might have vanished in the meantime so > + * we have to double check it is alive and restart the > + * tree walk otherwise. > + */ > + if (last_visited && !css_tryget(&last_visited->css)) { > + mem_cgroup_put(last_visited); > + last_visited = NULL; > + } > } > > rcu_read_lock(); > @@ -1136,8 +1146,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > if (reclaim) { > struct mem_cgroup *curr = memcg; > > - if (last_visited) > + if (last_visited) { > + css_put(&last_visited->css); > mem_cgroup_put(last_visited); > + } > > if (css && !memcg) > curr = mem_cgroup_from_css(css); Now that I think about it again it seems that this is more complicated than necessary. It should be sufficient to hold css' reference for the iter->last_visited because this makes sure that the cgroup won't go away same as mem_cgroup. Memcg reference counting + css_tryget just makes the situation more complicated because it forces us to retry the iteration on css_tryget failure as the cgroup is gone already and we have no point to continue other than start all over again. Which is, ehmm, _really_ ugly. I will repost the updated version sometime this week after it passes some testing. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx147.postini.com [74.125.245.147]) by kanga.kvack.org (Postfix) with SMTP id 489956B005D for ; Mon, 19 Nov 2012 10:11:36 -0500 (EST) Date: Mon, 19 Nov 2012 16:11:30 +0100 From: Michal Hocko Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121119151130.GB16803@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1352820639-13521-3-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa On Tue 13-11-12 16:30:36, Michal Hocko wrote: [...] > + /* > + * Root is not visited by cgroup iterators so it needs a special > + * treatment. > + */ > + if (!last_visited) { > + css = &root->css; > + } else { > + struct cgroup *next_cgroup; > + > + next_cgroup = cgroup_next_descendant_pre( > + last_visited->css.cgroup, > + root->css.cgroup); > + if (next_cgroup) > + css = cgroup_subsys_state(next_cgroup, > + mem_cgroup_subsys_id); This is not correct because cgroup_next_descendant_pre expects pos to be NULL for the first iteration but the way we do iterate (visit the root first) means that the second iteration will have last_visited != NULL and if root doesn't have any children the iteration would go unleashed to to the endless loop. We need something like: struct cgroup *prev_cgroup = (last_visited == root) ? NULL : last_visited->css.cgroup; next_cgroup = cgroup_next_descendant_pre(prev_cgroup, root->css.gtoup); -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754298Ab2KMPbY (ORCPT ); Tue, 13 Nov 2012 10:31:24 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60710 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753981Ab2KMPbW (ORCPT ); Tue, 13 Nov 2012 10:31:22 -0500 From: Michal Hocko To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Subject: [RFC 5/5] cgroup: remove css_get_next Date: Tue, 13 Nov 2012 16:30:39 +0100 Message-Id: <1352820639-13521-6-git-send-email-mhocko@suse.cz> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1352820639-13521-1-git-send-email-mhocko@suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Now that we have generic and well ordered cgroup tree walkers there is no need to keep css_get_next in the place. Signed-off-by: Michal Hocko --- include/linux/cgroup.h | 7 ------- kernel/cgroup.c | 49 ------------------------------------------------ 2 files changed, 56 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 329eb46..ba46041 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -676,13 +676,6 @@ void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css); struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id); -/* - * Get a cgroup whose id is greater than or equal to id under tree of root. - * Returning a cgroup_subsys_state or NULL. - */ -struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id, - struct cgroup_subsys_state *root, int *foundid); - /* Returns true if root is ancestor of cg */ bool css_is_ancestor(struct cgroup_subsys_state *cg, const struct cgroup_subsys_state *root); diff --git a/kernel/cgroup.c b/kernel/cgroup.c index d51958a..4d874b2 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -5230,55 +5230,6 @@ struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id) } EXPORT_SYMBOL_GPL(css_lookup); -/** - * css_get_next - lookup next cgroup under specified hierarchy. - * @ss: pointer to subsystem - * @id: current position of iteration. - * @root: pointer to css. search tree under this. - * @foundid: position of found object. - * - * Search next css under the specified hierarchy of rootid. Calling under - * rcu_read_lock() is necessary. Returns NULL if it reaches the end. - */ -struct cgroup_subsys_state * -css_get_next(struct cgroup_subsys *ss, int id, - struct cgroup_subsys_state *root, int *foundid) -{ - struct cgroup_subsys_state *ret = NULL; - struct css_id *tmp; - int tmpid; - int rootid = css_id(root); - int depth = css_depth(root); - - if (!rootid) - return NULL; - - BUG_ON(!ss->use_id); - WARN_ON_ONCE(!rcu_read_lock_held()); - - /* fill start point for scan */ - tmpid = id; - while (1) { - /* - * scan next entry from bitmap(tree), tmpid is updated after - * idr_get_next(). - */ - tmp = idr_get_next(&ss->idr, &tmpid); - if (!tmp) - break; - if (tmp->depth >= depth && tmp->stack[depth] == rootid) { - ret = rcu_dereference(tmp->css); - if (ret) { - *foundid = tmpid; - break; - } - } - /* continue to scan from next id */ - tmpid = tmpid + 1; - } - return ret; -} - /* * get corresponding css from file open on cgroupfs directory */ -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754118Ab2KMPbW (ORCPT ); Tue, 13 Nov 2012 10:31:22 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60689 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753873Ab2KMPbU (ORCPT ); Tue, 13 Nov 2012 10:31:20 -0500 From: Michal Hocko To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Subject: [RFC 3/5] memcg: simplify mem_cgroup_iter Date: Tue, 13 Nov 2012 16:30:37 +0100 Message-Id: <1352820639-13521-4-git-send-email-mhocko@suse.cz> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1352820639-13521-1-git-send-email-mhocko@suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Current implementation of mem_cgroup_iter has to consider both css and memcg to find out whether no group has been found (css==NULL - aka the loop is completed) and that no memcg is associated with the found node (!memcg - aka css_tryget failed because the group is no longer alive). This leads to awkward tweaks like tests for css && !memcg to skip the current node. It will be much easier if we got rid off css variable altogether and only rely on memcg. In order to do that the iteration part has to skip dead nodes. This sounds natural to me and as a nice side effect we will get a simple invariant that memcg is always alive when non-NULL and all nodes have been visited otherwise. We could get rid of the surrounding while loop but keep it for now for an easier review. It will go away in the next patch. Signed-off-by: Michal Hocko --- mm/memcontrol.c | 52 ++++++++++++++++++++++++---------------------------- 1 file changed, 24 insertions(+), 28 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5da1e58..dd84094 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1086,7 +1086,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, while (!memcg) { struct mem_cgroup_reclaim_iter *uninitialized_var(iter); - struct cgroup_subsys_state *css = NULL; if (reclaim) { int nid = zone_to_nid(reclaim->zone); @@ -1113,49 +1112,46 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, * treatment. */ if (!last_visited) { - css = &root->css; + memcg = root; } else { - struct cgroup *next_cgroup; - + struct cgroup *next_cgroup, + *pos = last_visited->css.cgroup; +skip_node: next_cgroup = cgroup_next_descendant_pre( - last_visited->css.cgroup, + pos, root->css.cgroup); - if (next_cgroup) - css = cgroup_subsys_state(next_cgroup, - mem_cgroup_subsys_id); + /* + * Even if we find a group we have to make sure it is + * alive. If not we, should skip the node. + */ + if (next_cgroup) { + struct mem_cgroup *mem = mem_cgroup_from_cont( + next_cgroup); + if (css_tryget(&mem->css)) + memcg = mem; + else { + pos = next_cgroup; + goto skip_node; + } + } } - /* - * Even if we find a group we have to make sure it is alive. - * css && !memcg means that the groups should be skipped and - * we should continue the tree walk. - */ - if (css == &root->css || (css && css_tryget(css))) - memcg = mem_cgroup_from_css(css); - if (reclaim) { - struct mem_cgroup *curr = memcg; - if (last_visited) mem_cgroup_put(last_visited); + if (memcg) + mem_cgroup_get(memcg); + iter->last_visited = memcg; - if (css && !memcg) - curr = mem_cgroup_from_css(css); - if (curr) - mem_cgroup_get(curr); - iter->last_visited = curr; - - if (!css) + if (!memcg) iter->generation++; else if (!prev && memcg) reclaim->generation = iter->generation; spin_unlock(&iter->iter_lock); - } else if (css && !memcg) { - last_visited = mem_cgroup_from_css(css); } rcu_read_unlock(); - if (prev && !css) + if (prev && !memcg) return NULL; } return memcg; -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753965Ab2KMPbU (ORCPT ); Tue, 13 Nov 2012 10:31:20 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60668 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751627Ab2KMPbS (ORCPT ); Tue, 13 Nov 2012 10:31:18 -0500 From: Michal Hocko To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Subject: [RFC 1/5] memcg: synchronize per-zone iterator access by a spinlock Date: Tue, 13 Nov 2012 16:30:35 +0100 Message-Id: <1352820639-13521-2-git-send-email-mhocko@suse.cz> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1352820639-13521-1-git-send-email-mhocko@suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org per-zone per-priority iterator is aimed at coordinating concurrent reclaimers on the same hierarchy (or the global reclaim when all groups are reclaimed) so that all groups get reclaimed evenly as much as possible. iter->position holds the last css->id visited and iter->generation signals the completed tree walk (when it is incremented). Concurrent reclaimers are supposed to provide a reclaim cookie which holds the reclaim priority and the last generation they saw. If cookie's generation doesn't match the iterator's view then other concurrent reclaimer already did the job and the tree walk is done for that priority. This scheme works nicely in most cases but it is not raceless. Two racing reclaimers can see the same iter->position and so bang on the same group. iter->generation increment is not serialized as well so a reclaimer can see an updated iter->position with and old generation so the iteration might be restarted from the root of the hierarchy. The simplest way to fix this issue is to synchronise access to the iterator by a lock. This implementation uses per-zone per-priority spinlock which linearizes only directly racing reclaimers which use reclaim cookies so the effect of the new locking should be really minimal. I have to note that I haven't seen this as a real issue so far. The primary motivation for the change is different. The following patch will change the way how the iterator is implemented and css->id iteration will be replaced cgroup generic iteration which requires storing mem_cgroup pointer into iterator and that requires reference counting and so concurrent access will be a problem. Signed-off-by: Michal Hocko --- mm/memcontrol.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6136fec..0fe5177 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -146,6 +146,8 @@ struct mem_cgroup_reclaim_iter { int position; /* scan generation, increased every round-trip */ unsigned int generation; + /* lock to protect the position and generation */ + spinlock_t iter_lock; }; /* @@ -1093,8 +1095,11 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, mz = mem_cgroup_zoneinfo(root, nid, zid); iter = &mz->reclaim_iter[reclaim->priority]; - if (prev && reclaim->generation != iter->generation) + spin_lock(&iter->iter_lock); + if (prev && reclaim->generation != iter->generation) { + spin_unlock(&iter->iter_lock); return NULL; + } id = iter->position; } @@ -1113,6 +1118,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, iter->generation++; else if (!prev && memcg) reclaim->generation = iter->generation; + spin_unlock(&iter->iter_lock); } if (prev && !css) @@ -5871,8 +5877,12 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node) return 1; for (zone = 0; zone < MAX_NR_ZONES; zone++) { + int prio; + mz = &pn->zoneinfo[zone]; lruvec_init(&mz->lruvec, &NODE_DATA(node)->node_zones[zone]); + for (prio = 0; prio < DEF_PRIORITY + 1; prio++) + spin_lock_init(&mz->reclaim_iter[prio].iter_lock); mz->usage_in_excess = 0; mz->on_tree = false; mz->memcg = memcg; -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754548Ab2KMPbq (ORCPT ); Tue, 13 Nov 2012 10:31:46 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60703 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751627Ab2KMPbV (ORCPT ); Tue, 13 Nov 2012 10:31:21 -0500 From: Michal Hocko To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Subject: [RFC 4/5] memcg: clean up mem_cgroup_iter Date: Tue, 13 Nov 2012 16:30:38 +0100 Message-Id: <1352820639-13521-5-git-send-email-mhocko@suse.cz> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1352820639-13521-1-git-send-email-mhocko@suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Get rid of while(!memcg) loop as it is no longer needed because there will always be at least one group that should be visited (root). This patch doesn't add any change to the implementation but it is separate to make a review easier. Signed-off-by: Michal Hocko --- mm/memcontrol.c | 120 +++++++++++++++++++++++++++---------------------------- 1 file changed, 60 insertions(+), 60 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index dd84094..b924f27 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1063,6 +1063,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, struct mem_cgroup_reclaim_cookie *reclaim) { + struct mem_cgroup_reclaim_iter *uninitialized_var(iter); struct mem_cgroup *memcg = NULL, *last_visited = NULL; @@ -1084,76 +1085,75 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, return root; } - while (!memcg) { - struct mem_cgroup_reclaim_iter *uninitialized_var(iter); - - if (reclaim) { - int nid = zone_to_nid(reclaim->zone); - int zid = zone_idx(reclaim->zone); - struct mem_cgroup_per_zone *mz; - - mz = mem_cgroup_zoneinfo(root, nid, zid); - iter = &mz->reclaim_iter[reclaim->priority]; - spin_lock(&iter->iter_lock); - last_visited = iter->last_visited; - if (prev && reclaim->generation != iter->generation) { - if (last_visited) { - mem_cgroup_put(last_visited); - iter->last_visited = NULL; - } - spin_unlock(&iter->iter_lock); - return NULL; + if (reclaim) { + int nid = zone_to_nid(reclaim->zone); + int zid = zone_idx(reclaim->zone); + struct mem_cgroup_per_zone *mz; + + mz = mem_cgroup_zoneinfo(root, nid, zid); + iter = &mz->reclaim_iter[reclaim->priority]; + spin_lock(&iter->iter_lock); + last_visited = iter->last_visited; + if (prev && reclaim->generation != iter->generation) { + if (last_visited) { + mem_cgroup_put(last_visited); + iter->last_visited = NULL; } + spin_unlock(&iter->iter_lock); + return NULL; } + } - rcu_read_lock(); + rcu_read_lock(); + /* + * Root is not visited by cgroup iterators so it needs a special + * treatment. + */ + if (!last_visited) { + memcg = root; + } else { + struct cgroup *next_cgroup, + *pos = last_visited->css.cgroup; +skip_node: + next_cgroup = cgroup_next_descendant_pre( + pos, + root->css.cgroup); /* - * Root is not visited by cgroup iterators so it needs a special - * treatment. + * Even if we find a group we have to make sure it is + * alive. If not we, should skip the node. */ - if (!last_visited) { - memcg = root; - } else { - struct cgroup *next_cgroup, - *pos = last_visited->css.cgroup; -skip_node: - next_cgroup = cgroup_next_descendant_pre( - pos, - root->css.cgroup); - /* - * Even if we find a group we have to make sure it is - * alive. If not we, should skip the node. - */ - if (next_cgroup) { - struct mem_cgroup *mem = mem_cgroup_from_cont( - next_cgroup); - if (css_tryget(&mem->css)) - memcg = mem; - else { - pos = next_cgroup; - goto skip_node; - } + if (next_cgroup) { + struct mem_cgroup *mem = mem_cgroup_from_cont( + next_cgroup); + if (css_tryget(&mem->css)) + memcg = mem; + else { + pos = next_cgroup; + goto skip_node; } } + } - if (reclaim) { - if (last_visited) - mem_cgroup_put(last_visited); - if (memcg) - mem_cgroup_get(memcg); - iter->last_visited = memcg; - - if (!memcg) - iter->generation++; - else if (!prev && memcg) - reclaim->generation = iter->generation; - spin_unlock(&iter->iter_lock); - } - rcu_read_unlock(); + if (reclaim) { + if (last_visited) + mem_cgroup_put(last_visited); + if (memcg) + mem_cgroup_get(memcg); + iter->last_visited = memcg; - if (prev && !memcg) - return NULL; + if (!memcg) + iter->generation++; + else if (!prev && memcg) + reclaim->generation = iter->generation; + spin_unlock(&iter->iter_lock); } + rcu_read_unlock(); + + /* + * At least root has to be visited + */ + VM_BUG_ON(!prev && !memcg); + return memcg; } -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754575Ab2KMPcI (ORCPT ); Tue, 13 Nov 2012 10:32:08 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60679 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753389Ab2KMPbT (ORCPT ); Tue, 13 Nov 2012 10:31:19 -0500 From: Michal Hocko To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Subject: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Date: Tue, 13 Nov 2012 16:30:36 +0100 Message-Id: <1352820639-13521-3-git-send-email-mhocko@suse.cz> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1352820639-13521-1-git-send-email-mhocko@suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org mem_cgroup_iter curently relies on css->id when walking down a group hierarchy tree. This is really awkward because the tree walk depends on the groups creation ordering. The only guarantee is that a parent node is visited before its children. Example 1) mkdir -p a a/d a/b/c 2) mkdir -a a/b/c a/d Will create the same trees but the tree walks will be different: 1) a, d, b, c 2) a, b, c, d 574bd9f7 (cgroup: implement generic child / descendant walk macros) has introduced generic cgroup tree walkers which provide either pre-order or post-order tree walk. This patch converts css->id based iteration to pre-order tree walk to keep the semantic with the original iterator where parent is always visited before its subtree. cgroup_for_each_descendant_pre suggests using post_create and pre_destroy for proper synchronization with groups addidition resp. removal. This implementation doesn't use those because a new memory cgroup is fully initialized in mem_cgroup_create and css_tryget makes sure that the group is alive when we encounter it by iterator. If the reclaim cookie is used we need to store the last visited group into the iterator so we have to be careful that it doesn't disappear in the mean time. Elevated reference count on the memcg guarantees that the group will not vanish even though it has been already removed from the tree. In such a case css_tryget will fail and the iteration is retried (groups are linked with RCU safe lists so the forward progress is still possible). iter_lock will make sure that only one reclaimer will see the last_visited group and the reference count game around it. Signed-off-by: Michal Hocko --- mm/memcontrol.c | 64 ++++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 49 insertions(+), 15 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0fe5177..5da1e58 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -142,8 +142,8 @@ struct mem_cgroup_stat_cpu { }; struct mem_cgroup_reclaim_iter { - /* css_id of the last scanned hierarchy member */ - int position; + /* last scanned hierarchy member with elevated ref count */ + struct mem_cgroup *last_visited; /* scan generation, increased every round-trip */ unsigned int generation; /* lock to protect the position and generation */ @@ -1063,8 +1063,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, struct mem_cgroup_reclaim_cookie *reclaim) { - struct mem_cgroup *memcg = NULL; - int id = 0; + struct mem_cgroup *memcg = NULL, + *last_visited = NULL; if (mem_cgroup_disabled()) return NULL; @@ -1073,7 +1073,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, root = root_mem_cgroup; if (prev && !reclaim) - id = css_id(&prev->css); + last_visited = prev; if (prev && prev != root) css_put(&prev->css); @@ -1086,7 +1086,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, while (!memcg) { struct mem_cgroup_reclaim_iter *uninitialized_var(iter); - struct cgroup_subsys_state *css; + struct cgroup_subsys_state *css = NULL; if (reclaim) { int nid = zone_to_nid(reclaim->zone); @@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, mz = mem_cgroup_zoneinfo(root, nid, zid); iter = &mz->reclaim_iter[reclaim->priority]; spin_lock(&iter->iter_lock); + last_visited = iter->last_visited; if (prev && reclaim->generation != iter->generation) { + if (last_visited) { + mem_cgroup_put(last_visited); + iter->last_visited = NULL; + } spin_unlock(&iter->iter_lock); return NULL; } - id = iter->position; } rcu_read_lock(); - css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id); - if (css) { - if (css == &root->css || css_tryget(css)) - memcg = mem_cgroup_from_css(css); - } else - id = 0; - rcu_read_unlock(); + /* + * Root is not visited by cgroup iterators so it needs a special + * treatment. + */ + if (!last_visited) { + css = &root->css; + } else { + struct cgroup *next_cgroup; + + next_cgroup = cgroup_next_descendant_pre( + last_visited->css.cgroup, + root->css.cgroup); + if (next_cgroup) + css = cgroup_subsys_state(next_cgroup, + mem_cgroup_subsys_id); + } + + /* + * Even if we find a group we have to make sure it is alive. + * css && !memcg means that the groups should be skipped and + * we should continue the tree walk. + */ + if (css == &root->css || (css && css_tryget(css))) + memcg = mem_cgroup_from_css(css); if (reclaim) { - iter->position = id; + struct mem_cgroup *curr = memcg; + + if (last_visited) + mem_cgroup_put(last_visited); + + if (css && !memcg) + curr = mem_cgroup_from_css(css); + if (curr) + mem_cgroup_get(curr); + iter->last_visited = curr; + if (!css) iter->generation++; else if (!prev && memcg) reclaim->generation = iter->generation; spin_unlock(&iter->iter_lock); + } else if (css && !memcg) { + last_visited = mem_cgroup_from_css(css); } + rcu_read_unlock(); if (prev && !css) return NULL; -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753712Ab2KMPbT (ORCPT ); Tue, 13 Nov 2012 10:31:19 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60663 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751576Ab2KMPbS (ORCPT ); Tue, 13 Nov 2012 10:31:18 -0500 From: Michal Hocko To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Subject: [RFC] rework mem_cgroup iterator Date: Tue, 13 Nov 2012 16:30:34 +0100 Message-Id: <1352820639-13521-1-git-send-email-mhocko@suse.cz> X-Mailer: git-send-email 1.7.10.4 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi all, this patch set tries to make mem_cgroup_iter saner in the way how it walks hierarchies. css->id based traversal is far from being ideal as it is not deterministic because it depends on the creation ordering. Diffstat looks promising but it is fair the say that the biggest cleanup is just css_get_next removal. The memcg code has grown a bit but I think it is worth the resulting outcome (the sanity ;)). The first patch fixes a potential misbehaving which I haven't seen but the fix is needed for the later patches anyway. We could take it alone as well but I do not have any bug report to base the fix on. The second patch replaces css_get_next by cgroup iterators which are scheduled for 3.8 in Tejun's tree and I depend on the following two patches: fe1e904c cgroup: implement generic child / descendant walk macros 7e187c6c cgroup: use rculist ops for cgroup->children The third patch is an attempt for simplification of the mem_cgroup_iter. It basically removes all css usages to make the code easier. The next patch removes the big while(!memcg) loop around the iterating logic. It could have been folded into #3 but I rather have the rework separate from the code moving noise. The last patch just removes css_get_next as there is no user for it any longer. I am also thinking that leaf-to-root iteration makes more sense but this patch is not included in the series yet because I have to think some more about the justification. So far I didn't get to testing but I am posting this early if everybody is OK with this change. Any thoughts? Cumulative diffstat: include/linux/cgroup.h | 7 --- kernel/cgroup.c | 49 --------------------- mm/memcontrol.c | 110 +++++++++++++++++++++++++++++++++--------------- 3 files changed, 75 insertions(+), 91 deletions(-) Michal Hocko (5): memcg: synchronize per-zone iterator access by a spinlock memcg: rework mem_cgroup_iter to use cgroup iterators memcg: simplify mem_cgroup_iter memcg: clean up mem_cgroup_iter cgroup: remove css_get_next From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755008Ab2KMQOs (ORCPT ); Tue, 13 Nov 2012 11:14:48 -0500 Received: from mail-pa0-f46.google.com ([209.85.220.46]:55787 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753733Ab2KMQOr (ORCPT ); Tue, 13 Nov 2012 11:14:47 -0500 Date: Tue, 13 Nov 2012 08:14:42 -0800 From: Tejun Heo To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121113161442.GA18227@mtj.dyndns.org> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1352820639-13521-3-git-send-email-mhocko@suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Nov 13, 2012 at 04:30:36PM +0100, Michal Hocko wrote: > @@ -1063,8 +1063,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > struct mem_cgroup *prev, > struct mem_cgroup_reclaim_cookie *reclaim) > { > - struct mem_cgroup *memcg = NULL; > - int id = 0; > + struct mem_cgroup *memcg = NULL, > + *last_visited = NULL; Nitpick but please don't do this. > + /* > + * Root is not visited by cgroup iterators so it needs a special > + * treatment. > + */ > + if (!last_visited) { > + css = &root->css; > + } else { > + struct cgroup *next_cgroup; > + > + next_cgroup = cgroup_next_descendant_pre( > + last_visited->css.cgroup, > + root->css.cgroup); > + if (next_cgroup) > + css = cgroup_subsys_state(next_cgroup, > + mem_cgroup_subsys_id); Hmmm... wouldn't it be better to move the reclaim logic into a function and do the following? reclaim(root); for_each_descendent_pre() reclaim(descendant); If this is a problem, I'd be happy to add a iterator which includes the top node. I'd prefer controllers not using the next functions directly. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756061Ab2KNAD6 (ORCPT ); Tue, 13 Nov 2012 19:03:58 -0500 Received: from fgwmail6.fujitsu.co.jp ([192.51.44.36]:47626 "EHLO fgwmail6.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756040Ab2KNAD5 (ORCPT ); Tue, 13 Nov 2012 19:03:57 -0500 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4 Message-ID: <50A2DFDC.90402@jp.fujitsu.com> Date: Wed, 14 Nov 2012 09:03:40 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20121026 Thunderbird/16.0.2 MIME-Version: 1.0 To: Michal Hocko CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Subject: Re: [RFC 1/5] memcg: synchronize per-zone iterator access by a spinlock References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-2-git-send-email-mhocko@suse.cz> In-Reply-To: <1352820639-13521-2-git-send-email-mhocko@suse.cz> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2012/11/14 0:30), Michal Hocko wrote: > per-zone per-priority iterator is aimed at coordinating concurrent > reclaimers on the same hierarchy (or the global reclaim when all > groups are reclaimed) so that all groups get reclaimed evenly as > much as possible. iter->position holds the last css->id visited > and iter->generation signals the completed tree walk (when it is > incremented). > Concurrent reclaimers are supposed to provide a reclaim cookie which > holds the reclaim priority and the last generation they saw. If cookie's > generation doesn't match the iterator's view then other concurrent > reclaimer already did the job and the tree walk is done for that > priority. > > This scheme works nicely in most cases but it is not raceless. Two > racing reclaimers can see the same iter->position and so bang on the > same group. iter->generation increment is not serialized as well so a > reclaimer can see an updated iter->position with and old generation so > the iteration might be restarted from the root of the hierarchy. > > The simplest way to fix this issue is to synchronise access to the > iterator by a lock. This implementation uses per-zone per-priority > spinlock which linearizes only directly racing reclaimers which use > reclaim cookies so the effect of the new locking should be really > minimal. > > I have to note that I haven't seen this as a real issue so far. The > primary motivation for the change is different. The following patch > will change the way how the iterator is implemented and css->id > iteration will be replaced cgroup generic iteration which requires > storing mem_cgroup pointer into iterator and that requires reference > counting and so concurrent access will be a problem. > > Signed-off-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756096Ab2KNANq (ORCPT ); Tue, 13 Nov 2012 19:13:46 -0500 Received: from fgwmail6.fujitsu.co.jp ([192.51.44.36]:56789 "EHLO fgwmail6.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756051Ab2KNANp (ORCPT ); Tue, 13 Nov 2012 19:13:45 -0500 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4 Message-ID: <50A2E229.3050809@jp.fujitsu.com> Date: Wed, 14 Nov 2012 09:13:29 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20121026 Thunderbird/16.0.2 MIME-Version: 1.0 To: Michal Hocko CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Subject: Re: [RFC] rework mem_cgroup iterator References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> In-Reply-To: <1352820639-13521-1-git-send-email-mhocko@suse.cz> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2012/11/14 0:30), Michal Hocko wrote: > Hi all, > this patch set tries to make mem_cgroup_iter saner in the way how it > walks hierarchies. css->id based traversal is far from being ideal as it > is not deterministic because it depends on the creation ordering. > > Diffstat looks promising but it is fair the say that the biggest cleanup is > just css_get_next removal. The memcg code has grown a bit but I think it is > worth the resulting outcome (the sanity ;)). > > The first patch fixes a potential misbehaving which I haven't seen but the > fix is needed for the later patches anyway. We could take it alone as well > but I do not have any bug report to base the fix on. > > The second patch replaces css_get_next by cgroup iterators which are > scheduled for 3.8 in Tejun's tree and I depend on the following two patches: > fe1e904c cgroup: implement generic child / descendant walk macros > 7e187c6c cgroup: use rculist ops for cgroup->children > > The third patch is an attempt for simplification of the mem_cgroup_iter. It > basically removes all css usages to make the code easier. The next patch > removes the big while(!memcg) loop around the iterating logic. It could have > been folded into #3 but I rather have the rework separate from the code > moving noise. > > The last patch just removes css_get_next as there is no user for it any > longer. > > I am also thinking that leaf-to-root iteration makes more sense but this > patch is not included in the series yet because I have to think some > more about the justification. > > So far I didn't get to testing but I am posting this early if everybody is > OK with this change. > > Any thoughts? > I'm O.K. Maybe I have some points I'm not understanding...I'll make a reply to patches. Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756108Ab2KNAUY (ORCPT ); Tue, 13 Nov 2012 19:20:24 -0500 Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:56708 "EHLO fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756071Ab2KNAUX (ORCPT ); Tue, 13 Nov 2012 19:20:23 -0500 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4 Message-ID: <50A2E3B3.6080007@jp.fujitsu.com> Date: Wed, 14 Nov 2012 09:20:03 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20121026 Thunderbird/16.0.2 MIME-Version: 1.0 To: Michal Hocko CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> In-Reply-To: <1352820639-13521-3-git-send-email-mhocko@suse.cz> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2012/11/14 0:30), Michal Hocko wrote: > mem_cgroup_iter curently relies on css->id when walking down a group > hierarchy tree. This is really awkward because the tree walk depends on > the groups creation ordering. The only guarantee is that a parent node > is visited before its children. > Example > 1) mkdir -p a a/d a/b/c > 2) mkdir -a a/b/c a/d > Will create the same trees but the tree walks will be different: > 1) a, d, b, c > 2) a, b, c, d > > 574bd9f7 (cgroup: implement generic child / descendant walk macros) has > introduced generic cgroup tree walkers which provide either pre-order > or post-order tree walk. This patch converts css->id based iteration > to pre-order tree walk to keep the semantic with the original iterator > where parent is always visited before its subtree. > > cgroup_for_each_descendant_pre suggests using post_create and > pre_destroy for proper synchronization with groups addidition resp. > removal. This implementation doesn't use those because a new memory > cgroup is fully initialized in mem_cgroup_create and css_tryget makes > sure that the group is alive when we encounter it by iterator. > > If the reclaim cookie is used we need to store the last visited group > into the iterator so we have to be careful that it doesn't disappear in > the mean time. Elevated reference count on the memcg guarantees that > the group will not vanish even though it has been already removed from > the tree. In such a case css_tryget will fail and the iteration is > retried (groups are linked with RCU safe lists so the forward progress > is still possible). iter_lock will make sure that only one reclaimer > will see the last_visited group and the reference count game around it. > > Signed-off-by: Michal Hocko > --- > mm/memcontrol.c | 64 ++++++++++++++++++++++++++++++++++++++++++------------- > 1 file changed, 49 insertions(+), 15 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 0fe5177..5da1e58 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -142,8 +142,8 @@ struct mem_cgroup_stat_cpu { > }; > > struct mem_cgroup_reclaim_iter { > - /* css_id of the last scanned hierarchy member */ > - int position; > + /* last scanned hierarchy member with elevated ref count */ > + struct mem_cgroup *last_visited; > /* scan generation, increased every round-trip */ > unsigned int generation; > /* lock to protect the position and generation */ > @@ -1063,8 +1063,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > struct mem_cgroup *prev, > struct mem_cgroup_reclaim_cookie *reclaim) > { > - struct mem_cgroup *memcg = NULL; > - int id = 0; > + struct mem_cgroup *memcg = NULL, > + *last_visited = NULL; > > if (mem_cgroup_disabled()) > return NULL; > @@ -1073,7 +1073,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > root = root_mem_cgroup; > > if (prev && !reclaim) > - id = css_id(&prev->css); > + last_visited = prev; > > if (prev && prev != root) > css_put(&prev->css); > @@ -1086,7 +1086,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > > while (!memcg) { > struct mem_cgroup_reclaim_iter *uninitialized_var(iter); > - struct cgroup_subsys_state *css; > + struct cgroup_subsys_state *css = NULL; > > if (reclaim) { > int nid = zone_to_nid(reclaim->zone); > @@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > mz = mem_cgroup_zoneinfo(root, nid, zid); > iter = &mz->reclaim_iter[reclaim->priority]; > spin_lock(&iter->iter_lock); > + last_visited = iter->last_visited; > if (prev && reclaim->generation != iter->generation) { > + if (last_visited) { > + mem_cgroup_put(last_visited); > + iter->last_visited = NULL; > + } > spin_unlock(&iter->iter_lock); > return NULL; > } > - id = iter->position; > } > > rcu_read_lock(); > - css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id); > - if (css) { > - if (css == &root->css || css_tryget(css)) > - memcg = mem_cgroup_from_css(css); > - } else > - id = 0; > - rcu_read_unlock(); > + /* > + * Root is not visited by cgroup iterators so it needs a special > + * treatment. > + */ > + if (!last_visited) { > + css = &root->css; > + } else { > + struct cgroup *next_cgroup; > + > + next_cgroup = cgroup_next_descendant_pre( > + last_visited->css.cgroup, > + root->css.cgroup); Maybe I miss something but.... last_visited is holded by memcg's refcnt. The cgroup pointed by css.cgroup is by cgroup's refcnt which can be freed before memcg is freed and last_visited->css.cgroup is out of RCU cycle. Is this safe ? Thanks, -Kame > + if (next_cgroup) > + css = cgroup_subsys_state(next_cgroup, > + mem_cgroup_subsys_id); > + } > + > + /* > + * Even if we find a group we have to make sure it is alive. > + * css && !memcg means that the groups should be skipped and > + * we should continue the tree walk. > + */ > + if (css == &root->css || (css && css_tryget(css))) > + memcg = mem_cgroup_from_css(css); > > if (reclaim) { > - iter->position = id; > + struct mem_cgroup *curr = memcg; > + > + if (last_visited) > + mem_cgroup_put(last_visited); > + > + if (css && !memcg) > + curr = mem_cgroup_from_css(css); > + if (curr) > + mem_cgroup_get(curr); > + iter->last_visited = curr; > + > if (!css) > iter->generation++; > else if (!prev && memcg) > reclaim->generation = iter->generation; > spin_unlock(&iter->iter_lock); > + } else if (css && !memcg) { > + last_visited = mem_cgroup_from_css(css); > } > + rcu_read_unlock(); > > if (prev && !css) > return NULL; > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756196Ab2KNBzl (ORCPT ); Tue, 13 Nov 2012 20:55:41 -0500 Received: from szxga02-in.huawei.com ([119.145.14.65]:39644 "EHLO szxga02-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756139Ab2KNBzk (ORCPT ); Tue, 13 Nov 2012 20:55:40 -0500 Message-ID: <50A2F9FC.5050303@huawei.com> Date: Wed, 14 Nov 2012 09:55:08 +0800 From: Li Zefan User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20121010 Thunderbird/16.0.1 MIME-Version: 1.0 To: Michal Hocko CC: , , KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Subject: Re: [RFC] rework mem_cgroup iterator References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> In-Reply-To: <1352820639-13521-1-git-send-email-mhocko@suse.cz> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.135.68.215] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 2012/11/13 23:30, Michal Hocko wrote: > Hi all, > this patch set tries to make mem_cgroup_iter saner in the way how it > walks hierarchies. css->id based traversal is far from being ideal as it > is not deterministic because it depends on the creation ordering. > > Diffstat looks promising but it is fair the say that the biggest cleanup is > just css_get_next removal. The memcg code has grown a bit but I think it is > worth the resulting outcome (the sanity ;)). > So memcg won't use css id at all, right? Then we can remove the whole css_id stuff, and that's quite a bunch of code. > The first patch fixes a potential misbehaving which I haven't seen but the > fix is needed for the later patches anyway. We could take it alone as well > but I do not have any bug report to base the fix on. > > The second patch replaces css_get_next by cgroup iterators which are > scheduled for 3.8 in Tejun's tree and I depend on the following two patches: > fe1e904c cgroup: implement generic child / descendant walk macros > 7e187c6c cgroup: use rculist ops for cgroup->children > > The third patch is an attempt for simplification of the mem_cgroup_iter. It > basically removes all css usages to make the code easier. The next patch > removes the big while(!memcg) loop around the iterating logic. It could have > been folded into #3 but I rather have the rework separate from the code > moving noise. > > The last patch just removes css_get_next as there is no user for it any > longer. > > I am also thinking that leaf-to-root iteration makes more sense but this > patch is not included in the series yet because I have to think some > more about the justification. > > So far I didn't get to testing but I am posting this early if everybody is > OK with this change. > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932691Ab2KNIRt (ORCPT ); Wed, 14 Nov 2012 03:17:49 -0500 Received: from mx2.parallels.com ([64.131.90.16]:37927 "EHLO mx2.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755450Ab2KNIRs (ORCPT ); Wed, 14 Nov 2012 03:17:48 -0500 Message-ID: <50A3C42F.9020901@parallels.com> Date: Wed, 14 Nov 2012 17:17:51 +0100 From: Glauber Costa User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:16.0) Gecko/20121016 Thunderbird/16.0.1 MIME-Version: 1.0 To: Michal Hocko CC: , , KAMEZAWA Hiroyuki , Johannes Weiner , "Ying Han" , Tejun Heo Subject: Re: [RFC] rework mem_cgroup iterator References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> In-Reply-To: <1352820639-13521-1-git-send-email-mhocko@suse.cz> Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/13/2012 04:30 PM, Michal Hocko wrote: > Hi all, > this patch set tries to make mem_cgroup_iter saner in the way how it > walks hierarchies. css->id based traversal is far from being ideal as it > is not deterministic because it depends on the creation ordering. > > Diffstat looks promising but it is fair the say that the biggest cleanup is > just css_get_next removal. The memcg code has grown a bit but I think it is > worth the resulting outcome (the sanity ;)). > > The first patch fixes a potential misbehaving which I haven't seen but the > fix is needed for the later patches anyway. We could take it alone as well > but I do not have any bug report to base the fix on. > > The second patch replaces css_get_next by cgroup iterators which are > scheduled for 3.8 in Tejun's tree and I depend on the following two patches: > fe1e904c cgroup: implement generic child / descendant walk macros > 7e187c6c cgroup: use rculist ops for cgroup->children > > The third patch is an attempt for simplification of the mem_cgroup_iter. It > basically removes all css usages to make the code easier. The next patch > removes the big while(!memcg) loop around the iterating logic. It could have > been folded into #3 but I rather have the rework separate from the code > moving noise. > > The last patch just removes css_get_next as there is no user for it any > longer. > > I am also thinking that leaf-to-root iteration makes more sense but this > patch is not included in the series yet because I have to think some > more about the justification. > > So far I didn't get to testing but I am posting this early if everybody is > OK with this change. > > Any thoughts? > Why can't we reuse the scheduler iterator and move it to kernel/cgroup.c ? It already exists, provide sane ordering, and only relies on parent information - which cgroup core already have - to do the walk. The only minor problem is that we'll have to handle the damn use_hierarchy case, so we may not be able to blindly rely on cgroup->parent. But maybe we can, if we don't guarantee any particular leaf-order. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932756Ab2KNIhF (ORCPT ); Wed, 14 Nov 2012 03:37:05 -0500 Received: from cantor2.suse.de ([195.135.220.15]:35524 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751578Ab2KNIhC (ORCPT ); Wed, 14 Nov 2012 03:37:02 -0500 Date: Wed, 14 Nov 2012 09:36:53 +0100 From: Michal Hocko To: Li Zefan Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Subject: Re: [RFC] rework mem_cgroup iterator Message-ID: <20121114083653.GA17111@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <50A2F9FC.5050303@huawei.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50A2F9FC.5050303@huawei.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 14-11-12 09:55:08, Li Zefan wrote: > On 2012/11/13 23:30, Michal Hocko wrote: > > Hi all, > > this patch set tries to make mem_cgroup_iter saner in the way how it > > walks hierarchies. css->id based traversal is far from being ideal as it > > is not deterministic because it depends on the creation ordering. > > > > Diffstat looks promising but it is fair the say that the biggest cleanup is > > just css_get_next removal. The memcg code has grown a bit but I think it is > > worth the resulting outcome (the sanity ;)). > > > > So memcg won't use css id at all, right? Unfortunately we still use it for the swap accounting but that one could be replaced by something else, probably. Have to think about it. > Then we can remove the whole css_id stuff, and that's quite a bunch of > code. Is memcg the only user of css_id? Quick grep shows that yes but I haven't checked all the callers of the exported functions. I would be happy if more code goes away. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756682Ab2KNIkS (ORCPT ); Wed, 14 Nov 2012 03:40:18 -0500 Received: from cantor2.suse.de ([195.135.220.15]:35576 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751578Ab2KNIkQ (ORCPT ); Wed, 14 Nov 2012 03:40:16 -0500 Date: Wed, 14 Nov 2012 09:40:14 +0100 From: Michal Hocko To: Glauber Costa Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo Subject: Re: [RFC] rework mem_cgroup iterator Message-ID: <20121114084014.GB17111@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <50A3C42F.9020901@parallels.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50A3C42F.9020901@parallels.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 14-11-12 17:17:51, Glauber Costa wrote: [...] > Why can't we reuse the scheduler iterator and move it to kernel/cgroup.c? I do not care much about the internal implementation of the core iterators. Those implemented by Tejun make sense to me. I just want to get rid of css->id based ones. Memcg iterator, however, still needs its own iterator on top because we have to handle the parallel reclaimers. [...] -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755015Ab2KNIvf (ORCPT ); Wed, 14 Nov 2012 03:51:35 -0500 Received: from cantor2.suse.de ([195.135.220.15]:36015 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753860Ab2KNIvc (ORCPT ); Wed, 14 Nov 2012 03:51:32 -0500 Date: Wed, 14 Nov 2012 09:51:29 +0100 From: Michal Hocko To: Tejun Heo Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121114085129.GC17111@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <20121113161442.GA18227@mtj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121113161442.GA18227@mtj.dyndns.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 13-11-12 08:14:42, Tejun Heo wrote: > On Tue, Nov 13, 2012 at 04:30:36PM +0100, Michal Hocko wrote: > > @@ -1063,8 +1063,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > > struct mem_cgroup *prev, > > struct mem_cgroup_reclaim_cookie *reclaim) > > { > > - struct mem_cgroup *memcg = NULL; > > - int id = 0; > > + struct mem_cgroup *memcg = NULL, > > + *last_visited = NULL; > > Nitpick but please don't do this. OK, will make it grep friendlier; > > + /* > > + * Root is not visited by cgroup iterators so it needs a special > > + * treatment. > > + */ > > + if (!last_visited) { > > + css = &root->css; > > + } else { > > + struct cgroup *next_cgroup; > > + > > + next_cgroup = cgroup_next_descendant_pre( > > + last_visited->css.cgroup, > > + root->css.cgroup); > > + if (next_cgroup) > > + css = cgroup_subsys_state(next_cgroup, > > + mem_cgroup_subsys_id); > > Hmmm... wouldn't it be better to move the reclaim logic into a > function and do the following? > > reclaim(root); > for_each_descendent_pre() > reclaim(descendant); We cannot do for_each_descendent_pre here because we do not iterate through the whole hierarchy all the time. Check shrink_zone. > If this is a problem, I'd be happy to add a iterator which includes > the top node. This would help with the above if-else but I do not think this is the worst thing in the function ;) > I'd prefer controllers not using the next functions directly. Well, we will need to use it directly because of the single group reclaim mentioned above. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1161091Ab2KNKK7 (ORCPT ); Wed, 14 Nov 2012 05:10:59 -0500 Received: from cantor2.suse.de ([195.135.220.15]:39124 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1161008Ab2KNKK5 (ORCPT ); Wed, 14 Nov 2012 05:10:57 -0500 Date: Wed, 14 Nov 2012 11:10:52 +0100 From: Michal Hocko To: Kamezawa Hiroyuki Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121114101052.GD17111@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <50A2E3B3.6080007@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50A2E3B3.6080007@jp.fujitsu.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 14-11-12 09:20:03, KAMEZAWA Hiroyuki wrote: > (2012/11/14 0:30), Michal Hocko wrote: [...] > > @@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > > mz = mem_cgroup_zoneinfo(root, nid, zid); > > iter = &mz->reclaim_iter[reclaim->priority]; > > spin_lock(&iter->iter_lock); > > + last_visited = iter->last_visited; > > if (prev && reclaim->generation != iter->generation) { > > + if (last_visited) { > > + mem_cgroup_put(last_visited); > > + iter->last_visited = NULL; > > + } > > spin_unlock(&iter->iter_lock); > > return NULL; > > } > > - id = iter->position; > > } > > > > rcu_read_lock(); > > - css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id); > > - if (css) { > > - if (css == &root->css || css_tryget(css)) > > - memcg = mem_cgroup_from_css(css); > > - } else > > - id = 0; > > - rcu_read_unlock(); > > + /* > > + * Root is not visited by cgroup iterators so it needs a special > > + * treatment. > > + */ > > + if (!last_visited) { > > + css = &root->css; > > + } else { > > + struct cgroup *next_cgroup; > > + > > + next_cgroup = cgroup_next_descendant_pre( > > + last_visited->css.cgroup, > > + root->css.cgroup); > > Maybe I miss something but.... last_visited is holded by memcg's refcnt. > The cgroup pointed by css.cgroup is by cgroup's refcnt which can be freed > before memcg is freed and last_visited->css.cgroup is out of RCU cycle. > Is this safe ? Good spotted. You are right. What I need to do is to check that the last_visited is alive and restart from the root if not. Something like the bellow (incremental patch on top of this one) should help, right? diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 30efd7e..c0a91a3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1105,6 +1105,16 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, spin_unlock(&iter->iter_lock); return NULL; } + /* + * memcg is still valid because we hold a reference but + * its cgroup might have vanished in the meantime so + * we have to double check it is alive and restart the + * tree walk otherwise. + */ + if (last_visited && !css_tryget(&last_visited->css)) { + mem_cgroup_put(last_visited); + last_visited = NULL; + } } rcu_read_lock(); @@ -1136,8 +1146,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, if (reclaim) { struct mem_cgroup *curr = memcg; - if (last_visited) + if (last_visited) { + css_put(&last_visited->css); mem_cgroup_put(last_visited); + } if (css && !memcg) curr = mem_cgroup_from_css(css); -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1423165Ab2KNSaO (ORCPT ); Wed, 14 Nov 2012 13:30:14 -0500 Received: from mail-pa0-f46.google.com ([209.85.220.46]:51305 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1423016Ab2KNSaN (ORCPT ); Wed, 14 Nov 2012 13:30:13 -0500 Date: Wed, 14 Nov 2012 10:30:07 -0800 From: Tejun Heo To: Michal Hocko Cc: Li Zefan , linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa Subject: Re: [RFC] rework mem_cgroup iterator Message-ID: <20121114183007.GC21185@mtj.dyndns.org> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <50A2F9FC.5050303@huawei.com> <20121114083653.GA17111@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121114083653.GA17111@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, Michal. On Wed, Nov 14, 2012 at 09:36:53AM +0100, Michal Hocko wrote: > > So memcg won't use css id at all, right? > > Unfortunately we still use it for the swap accounting but that one could > be replaced by something else, probably. Have to think about it. I have a patch to add cgrp->id pending. From what I can see, memcg should be able to use that for swap accounting. > > Then we can remove the whole css_id stuff, and that's quite a bunch of > > code. Yeap, that's the plan. > Is memcg the only user of css_id? Quick grep shows that yes but I > haven't checked all the callers of the exported functions. I would be > happy if more code goes away. Yeap, memcg is the only user and I really wanna remove it once memcg moves onto saner stuff. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1423223Ab2KNSlR (ORCPT ); Wed, 14 Nov 2012 13:41:17 -0500 Received: from mail-pb0-f46.google.com ([209.85.160.46]:40837 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1423178Ab2KNSlP (ORCPT ); Wed, 14 Nov 2012 13:41:15 -0500 Date: Wed, 14 Nov 2012 10:41:10 -0800 From: Tejun Heo To: Glauber Costa Cc: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han Subject: Re: [RFC] rework mem_cgroup iterator Message-ID: <20121114184110.GD21185@mtj.dyndns.org> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <50A3C42F.9020901@parallels.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50A3C42F.9020901@parallels.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, Glauber. On Wed, Nov 14, 2012 at 05:17:51PM +0100, Glauber Costa wrote: > Why can't we reuse the scheduler iterator and move it to kernel/cgroup.c > ? It already exists, provide sane ordering, and only relies on parent > information - which cgroup core already have - to do the walk. Hmmm... we can but I personally much prefer for_each_*() iterators over callback based ones. It's just much easier to share states across an iteration and follow the logic. walk_tg_tree_from() does have the benefit of being able to combine pre and post visits in the same walk, which doesn't seem to have any user at the moment. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1423208Ab2KNSpI (ORCPT ); Wed, 14 Nov 2012 13:45:08 -0500 Received: from mx2.parallels.com ([64.131.90.16]:35106 "EHLO mx2.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1423019Ab2KNSpH (ORCPT ); Wed, 14 Nov 2012 13:45:07 -0500 Message-ID: <50A45729.4000203@parallels.com> Date: Thu, 15 Nov 2012 06:44:57 +0400 From: Glauber Costa User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:16.0) Gecko/20121016 Thunderbird/16.0.1 MIME-Version: 1.0 To: Tejun Heo CC: Michal Hocko , , , KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Peter Zijlstra , Paul Turner Subject: Re: [RFC] rework mem_cgroup iterator References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <50A3C42F.9020901@parallels.com> <20121114184110.GD21185@mtj.dyndns.org> In-Reply-To: <20121114184110.GD21185@mtj.dyndns.org> Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: 7bit X-Originating-IP: [46.39.244.6] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/14/2012 10:41 PM, Tejun Heo wrote: > Hello, Glauber. > > On Wed, Nov 14, 2012 at 05:17:51PM +0100, Glauber Costa wrote: >> Why can't we reuse the scheduler iterator and move it to kernel/cgroup.c >> ? It already exists, provide sane ordering, and only relies on parent >> information - which cgroup core already have - to do the walk. > > Hmmm... we can but I personally much prefer for_each_*() iterators > over callback based ones. It's just much easier to share states > across an iteration and follow the logic. walk_tg_tree_from() does > have the benefit of being able to combine pre and post visits in the > same walk, which doesn't seem to have any user at the moment. > > Thanks. > Is there any particular reason why we can't do the other way around then, and use a for_each_*() for sched walks? Without even consider what I personally prefer, what I really don't like is to have two different cgroup walkers when it seems like we could very well have just one. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1423235Ab2KNSqc (ORCPT ); Wed, 14 Nov 2012 13:46:32 -0500 Received: from mail-pa0-f46.google.com ([209.85.220.46]:51914 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1423150Ab2KNSqa (ORCPT ); Wed, 14 Nov 2012 13:46:30 -0500 Date: Wed, 14 Nov 2012 10:46:25 -0800 From: Tejun Heo To: Glauber Costa Cc: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Peter Zijlstra , Paul Turner Subject: Re: [RFC] rework mem_cgroup iterator Message-ID: <20121114184625.GE21185@mtj.dyndns.org> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <50A3C42F.9020901@parallels.com> <20121114184110.GD21185@mtj.dyndns.org> <50A45729.4000203@parallels.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50A45729.4000203@parallels.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, On Thu, Nov 15, 2012 at 06:44:57AM +0400, Glauber Costa wrote: > Is there any particular reason why we can't do the other way around > then, and use a for_each_*() for sched walks? Without even consider what > I personally prefer, what I really don't like is to have two different > cgroup walkers when it seems like we could very well have just one. Ooh, sure thing, let's do that. Will work on that. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1423252Ab2KNSwv (ORCPT ); Wed, 14 Nov 2012 13:52:51 -0500 Received: from mail-pb0-f46.google.com ([209.85.160.46]:64514 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1423016Ab2KNSwu (ORCPT ); Wed, 14 Nov 2012 13:52:50 -0500 Date: Wed, 14 Nov 2012 10:52:45 -0800 From: Tejun Heo To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121114185245.GF21185@mtj.dyndns.org> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <20121113161442.GA18227@mtj.dyndns.org> <20121114085129.GC17111@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121114085129.GC17111@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, Michal. On Wed, Nov 14, 2012 at 09:51:29AM +0100, Michal Hocko wrote: > > reclaim(root); > > for_each_descendent_pre() > > reclaim(descendant); > > We cannot do for_each_descendent_pre here because we do not iterate > through the whole hierarchy all the time. Check shrink_zone. I'm a bit confused. Why would that make any difference? Shouldn't it be just able to test the condition and continue? Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1423549Ab2KOCNS (ORCPT ); Wed, 14 Nov 2012 21:13:18 -0500 Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:58095 "EHLO fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755412Ab2KOCNR (ORCPT ); Wed, 14 Nov 2012 21:13:17 -0500 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4 Message-ID: <50A44FA0.5010305@jp.fujitsu.com> Date: Thu, 15 Nov 2012 11:12:48 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20121026 Thunderbird/16.0.2 MIME-Version: 1.0 To: Li Zefan CC: Michal Hocko , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Subject: Re: [RFC] rework mem_cgroup iterator References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <50A2F9FC.5050303@huawei.com> In-Reply-To: <50A2F9FC.5050303@huawei.com> Content-Type: text/plain; charset=GB2312 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2012/11/14 10:55), Li Zefan wrote: > On 2012/11/13 23:30, Michal Hocko wrote: >> Hi all, >> this patch set tries to make mem_cgroup_iter saner in the way how it >> walks hierarchies. css->id based traversal is far from being ideal as it >> is not deterministic because it depends on the creation ordering. >> >> Diffstat looks promising but it is fair the say that the biggest cleanup is >> just css_get_next removal. The memcg code has grown a bit but I think it is >> worth the resulting outcome (the sanity ;)). >> > > So memcg won't use css id at all, right? Then we can remove the whole css_id > stuff, and that's quite a bunch of code. > It's used by swap information recording for saving spaces. Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2993429Ab2KOENR (ORCPT ); Wed, 14 Nov 2012 23:13:17 -0500 Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:56036 "EHLO fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2993396Ab2KOENL (ORCPT ); Wed, 14 Nov 2012 23:13:11 -0500 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4 Message-ID: <50A46BB6.6070902@jp.fujitsu.com> Date: Thu, 15 Nov 2012 13:12:38 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20121026 Thunderbird/16.0.2 MIME-Version: 1.0 To: Michal Hocko CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <50A2E3B3.6080007@jp.fujitsu.com> <20121114101052.GD17111@dhcp22.suse.cz> In-Reply-To: <20121114101052.GD17111@dhcp22.suse.cz> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2012/11/14 19:10), Michal Hocko wrote: > On Wed 14-11-12 09:20:03, KAMEZAWA Hiroyuki wrote: >> (2012/11/14 0:30), Michal Hocko wrote: > [...] >>> @@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, >>> mz = mem_cgroup_zoneinfo(root, nid, zid); >>> iter = &mz->reclaim_iter[reclaim->priority]; >>> spin_lock(&iter->iter_lock); >>> + last_visited = iter->last_visited; >>> if (prev && reclaim->generation != iter->generation) { >>> + if (last_visited) { >>> + mem_cgroup_put(last_visited); >>> + iter->last_visited = NULL; >>> + } >>> spin_unlock(&iter->iter_lock); >>> return NULL; >>> } >>> - id = iter->position; >>> } >>> >>> rcu_read_lock(); >>> - css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id); >>> - if (css) { >>> - if (css == &root->css || css_tryget(css)) >>> - memcg = mem_cgroup_from_css(css); >>> - } else >>> - id = 0; >>> - rcu_read_unlock(); >>> + /* >>> + * Root is not visited by cgroup iterators so it needs a special >>> + * treatment. >>> + */ >>> + if (!last_visited) { >>> + css = &root->css; >>> + } else { >>> + struct cgroup *next_cgroup; >>> + >>> + next_cgroup = cgroup_next_descendant_pre( >>> + last_visited->css.cgroup, >>> + root->css.cgroup); >> >> Maybe I miss something but.... last_visited is holded by memcg's refcnt. >> The cgroup pointed by css.cgroup is by cgroup's refcnt which can be freed >> before memcg is freed and last_visited->css.cgroup is out of RCU cycle. >> Is this safe ? > > Good spotted. You are right. What I need to do is to check that the > last_visited is alive and restart from the root if not. Something like > the bellow (incremental patch on top of this one) should help, right? > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 30efd7e..c0a91a3 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1105,6 +1105,16 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > spin_unlock(&iter->iter_lock); > return NULL; > } > + /* > + * memcg is still valid because we hold a reference but > + * its cgroup might have vanished in the meantime so > + * we have to double check it is alive and restart the > + * tree walk otherwise. > + */ > + if (last_visited && !css_tryget(&last_visited->css)) { > + mem_cgroup_put(last_visited); > + last_visited = NULL; > + } > } > > rcu_read_lock(); > @@ -1136,8 +1146,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > if (reclaim) { > struct mem_cgroup *curr = memcg; > > - if (last_visited) > + if (last_visited) { > + css_put(&last_visited->css); > mem_cgroup_put(last_visited); > + } > > if (css && !memcg) > curr = mem_cgroup_from_css(css); > I think this will work. Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756785Ab2KOJvK (ORCPT ); Thu, 15 Nov 2012 04:51:10 -0500 Received: from cantor2.suse.de ([195.135.220.15]:59896 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755533Ab2KOJvH (ORCPT ); Thu, 15 Nov 2012 04:51:07 -0500 Date: Thu, 15 Nov 2012 10:51:03 +0100 From: Michal Hocko To: Tejun Heo Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121115095103.GB11990@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <20121113161442.GA18227@mtj.dyndns.org> <20121114085129.GC17111@dhcp22.suse.cz> <20121114185245.GF21185@mtj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121114185245.GF21185@mtj.dyndns.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 14-11-12 10:52:45, Tejun Heo wrote: > Hello, Michal. > > On Wed, Nov 14, 2012 at 09:51:29AM +0100, Michal Hocko wrote: > > > reclaim(root); > > > for_each_descendent_pre() > > > reclaim(descendant); > > > > We cannot do for_each_descendent_pre here because we do not iterate > > through the whole hierarchy all the time. Check shrink_zone. > > I'm a bit confused. Why would that make any difference? Shouldn't it > be just able to test the condition and continue? Ohh, I misunderstood your proposal. So what you are suggesting is to put all the logic we have in mem_cgroup_iter inside what you call reclaim here + mem_cgroup_iter_break inside the loop, right? I do not see how this would help us much. mem_cgroup_iter is not the nicest piece of code but it handles quite a complex requirements that we have currently (css reference count, multiple reclaimers racing). So I would rather keep it this way. Further simplifications are welcome of course. Is there any reason why you are not happy about direct using of cgroup_next_descendant_pre? -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756795Ab2KOJwj (ORCPT ); Thu, 15 Nov 2012 04:52:39 -0500 Received: from cantor2.suse.de ([195.135.220.15]:59946 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756492Ab2KOJwh (ORCPT ); Thu, 15 Nov 2012 04:52:37 -0500 Date: Thu, 15 Nov 2012 10:52:35 +0100 From: Michal Hocko To: Kamezawa Hiroyuki Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121115095235.GC11990@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <50A2E3B3.6080007@jp.fujitsu.com> <20121114101052.GD17111@dhcp22.suse.cz> <50A46BB6.6070902@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50A46BB6.6070902@jp.fujitsu.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 15-11-12 13:12:38, KAMEZAWA Hiroyuki wrote: > (2012/11/14 19:10), Michal Hocko wrote: > >On Wed 14-11-12 09:20:03, KAMEZAWA Hiroyuki wrote: > >>(2012/11/14 0:30), Michal Hocko wrote: > >[...] > >>>@@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > >>> mz = mem_cgroup_zoneinfo(root, nid, zid); > >>> iter = &mz->reclaim_iter[reclaim->priority]; > >>> spin_lock(&iter->iter_lock); > >>>+ last_visited = iter->last_visited; > >>> if (prev && reclaim->generation != iter->generation) { > >>>+ if (last_visited) { > >>>+ mem_cgroup_put(last_visited); > >>>+ iter->last_visited = NULL; > >>>+ } > >>> spin_unlock(&iter->iter_lock); > >>> return NULL; > >>> } > >>>- id = iter->position; > >>> } > >>> > >>> rcu_read_lock(); > >>>- css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id); > >>>- if (css) { > >>>- if (css == &root->css || css_tryget(css)) > >>>- memcg = mem_cgroup_from_css(css); > >>>- } else > >>>- id = 0; > >>>- rcu_read_unlock(); > >>>+ /* > >>>+ * Root is not visited by cgroup iterators so it needs a special > >>>+ * treatment. > >>>+ */ > >>>+ if (!last_visited) { > >>>+ css = &root->css; > >>>+ } else { > >>>+ struct cgroup *next_cgroup; > >>>+ > >>>+ next_cgroup = cgroup_next_descendant_pre( > >>>+ last_visited->css.cgroup, > >>>+ root->css.cgroup); > >> > >>Maybe I miss something but.... last_visited is holded by memcg's refcnt. > >>The cgroup pointed by css.cgroup is by cgroup's refcnt which can be freed > >>before memcg is freed and last_visited->css.cgroup is out of RCU cycle. > >>Is this safe ? > > > >Good spotted. You are right. What I need to do is to check that the > >last_visited is alive and restart from the root if not. Something like > >the bellow (incremental patch on top of this one) should help, right? > > > >diff --git a/mm/memcontrol.c b/mm/memcontrol.c > >index 30efd7e..c0a91a3 100644 > >--- a/mm/memcontrol.c > >+++ b/mm/memcontrol.c > >@@ -1105,6 +1105,16 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > > spin_unlock(&iter->iter_lock); > > return NULL; > > } > >+ /* > >+ * memcg is still valid because we hold a reference but > >+ * its cgroup might have vanished in the meantime so > >+ * we have to double check it is alive and restart the > >+ * tree walk otherwise. > >+ */ > >+ if (last_visited && !css_tryget(&last_visited->css)) { > >+ mem_cgroup_put(last_visited); > >+ last_visited = NULL; > >+ } > > } > > > > rcu_read_lock(); > >@@ -1136,8 +1146,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > > if (reclaim) { > > struct mem_cgroup *curr = memcg; > > > >- if (last_visited) > >+ if (last_visited) { > >+ css_put(&last_visited->css); > > mem_cgroup_put(last_visited); > >+ } > > > > if (css && !memcg) > > curr = mem_cgroup_from_css(css); > > > > I think this will work. Thanks for double checking. The updated patch: --- >>From 3811de23a0306c229ce44dc655878431a4fdf449 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 13 Nov 2012 11:40:35 +0100 Subject: [PATCH] memcg: rework mem_cgroup_iter to use cgroup iterators mem_cgroup_iter curently relies on css->id when walking down a group hierarchy tree. This is really awkward because the tree walk depends on the groups creation ordering. The only guarantee is that a parent node is visited before its children. Example 1) mkdir -p a a/d a/b/c 2) mkdir -a a/b/c a/d Will create the same trees but the tree walks will be different: 1) a, d, b, c 2) a, b, c, d 574bd9f7 (cgroup: implement generic child / descendant walk macros) has introduced generic cgroup tree walkers which provide either pre-order or post-order tree walk. This patch converts css->id based iteration to pre-order tree walk to keep the semantic with the original iterator where parent is always visited before its subtree. cgroup_for_each_descendant_pre suggests using post_create and pre_destroy for proper synchronization with groups addidition resp. removal. This implementation doesn't use those because a new memory cgroup is fully initialized in mem_cgroup_create and css_tryget makes sure that the group is alive when we encounter it by iterator. If the reclaim cookie is used we need to store the last visited group into the iterator so we have to be careful that it doesn't disappear in the mean time. Elevated reference count on the memcg keeps it alive even though the group have vanished in the meantime. This is checked by css_tryget and if the group is gone the iteration is restarted from the beginning. V2 - iter->last_visited need to check whether the cgroup is still alive. Thanks to Kamezawa for pointing this out. Signed-off-by: Michal Hocko --- mm/memcontrol.c | 74 ++++++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 60 insertions(+), 14 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0fe5177..c0a91a3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -142,8 +142,8 @@ struct mem_cgroup_stat_cpu { }; struct mem_cgroup_reclaim_iter { - /* css_id of the last scanned hierarchy member */ - int position; + /* last scanned hierarchy member with elevated ref count */ + struct mem_cgroup *last_visited; /* scan generation, increased every round-trip */ unsigned int generation; /* lock to protect the position and generation */ @@ -1064,7 +1064,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup_reclaim_cookie *reclaim) { struct mem_cgroup *memcg = NULL; - int id = 0; + struct mem_cgroup *last_visited = NULL; if (mem_cgroup_disabled()) return NULL; @@ -1073,7 +1073,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, root = root_mem_cgroup; if (prev && !reclaim) - id = css_id(&prev->css); + last_visited = prev; if (prev && prev != root) css_put(&prev->css); @@ -1086,7 +1086,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, while (!memcg) { struct mem_cgroup_reclaim_iter *uninitialized_var(iter); - struct cgroup_subsys_state *css; + struct cgroup_subsys_state *css = NULL; if (reclaim) { int nid = zone_to_nid(reclaim->zone); @@ -1096,30 +1096,76 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, mz = mem_cgroup_zoneinfo(root, nid, zid); iter = &mz->reclaim_iter[reclaim->priority]; spin_lock(&iter->iter_lock); + last_visited = iter->last_visited; if (prev && reclaim->generation != iter->generation) { + if (last_visited) { + mem_cgroup_put(last_visited); + iter->last_visited = NULL; + } spin_unlock(&iter->iter_lock); return NULL; } - id = iter->position; + /* + * memcg is still valid because we hold a reference but + * its cgroup might have vanished in the meantime so + * we have to double check it is alive and restart the + * tree walk otherwise. + */ + if (last_visited && !css_tryget(&last_visited->css)) { + mem_cgroup_put(last_visited); + last_visited = NULL; + } } rcu_read_lock(); - css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id); - if (css) { - if (css == &root->css || css_tryget(css)) - memcg = mem_cgroup_from_css(css); - } else - id = 0; - rcu_read_unlock(); + /* + * Root is not visited by cgroup iterators so it needs a special + * treatment. + */ + if (!last_visited) { + css = &root->css; + } else { + struct cgroup *next_cgroup; + + next_cgroup = cgroup_next_descendant_pre( + last_visited->css.cgroup, + root->css.cgroup); + if (next_cgroup) + css = cgroup_subsys_state(next_cgroup, + mem_cgroup_subsys_id); + } + + /* + * Even if we find a group we have to make sure it is alive. + * css && !memcg means that the groups should be skipped and + * we should continue the tree walk. + */ + if (css == &root->css || (css && css_tryget(css))) + memcg = mem_cgroup_from_css(css); if (reclaim) { - iter->position = id; + struct mem_cgroup *curr = memcg; + + if (last_visited) { + css_put(&last_visited->css); + mem_cgroup_put(last_visited); + } + + if (css && !memcg) + curr = mem_cgroup_from_css(css); + if (curr) + mem_cgroup_get(curr); + iter->last_visited = curr; + if (!css) iter->generation++; else if (!prev && memcg) reclaim->generation = iter->generation; spin_unlock(&iter->iter_lock); + } else if (css && !memcg) { + last_visited = mem_cgroup_from_css(css); } + rcu_read_unlock(); if (prev && !css) return NULL; -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1768085Ab2KOOrj (ORCPT ); Thu, 15 Nov 2012 09:47:39 -0500 Received: from mail-pb0-f46.google.com ([209.85.160.46]:44936 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2992479Ab2KOOrh (ORCPT ); Thu, 15 Nov 2012 09:47:37 -0500 Date: Thu, 15 Nov 2012 06:47:32 -0800 From: Tejun Heo To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121115144732.GB7306@mtj.dyndns.org> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <20121113161442.GA18227@mtj.dyndns.org> <20121114085129.GC17111@dhcp22.suse.cz> <20121114185245.GF21185@mtj.dyndns.org> <20121115095103.GB11990@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121115095103.GB11990@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, Michal. On Thu, Nov 15, 2012 at 10:51:03AM +0100, Michal Hocko wrote: > > I'm a bit confused. Why would that make any difference? Shouldn't it > > be just able to test the condition and continue? > > Ohh, I misunderstood your proposal. So what you are suggesting is > to put all the logic we have in mem_cgroup_iter inside what you call > reclaim here + mem_cgroup_iter_break inside the loop, right? > > I do not see how this would help us much. mem_cgroup_iter is not the > nicest piece of code but it handles quite a complex requirements that we > have currently (css reference count, multiple reclaimers racing). So I > would rather keep it this way. Further simplifications are welcome of > course. > > Is there any reason why you are not happy about direct using of > cgroup_next_descendant_pre? Because I'd like to consider the next functions as implementation detail, and having interations structred as loops tend to read better and less error-prone. e.g. when you use next functions directly, it's way easier to circumvent locking requirements in a way which isn't very obvious. So, unless it messes up the code too much (and I can't see why it would), I'd much prefer if memcg used for_each_*() macros. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1768249Ab2KOPM7 (ORCPT ); Thu, 15 Nov 2012 10:12:59 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44962 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1768148Ab2KOPM6 (ORCPT ); Thu, 15 Nov 2012 10:12:58 -0500 Date: Thu, 15 Nov 2012 16:12:55 +0100 From: Michal Hocko To: Tejun Heo Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121115151255.GE11990@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <20121113161442.GA18227@mtj.dyndns.org> <20121114085129.GC17111@dhcp22.suse.cz> <20121114185245.GF21185@mtj.dyndns.org> <20121115095103.GB11990@dhcp22.suse.cz> <20121115144732.GB7306@mtj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121115144732.GB7306@mtj.dyndns.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 15-11-12 06:47:32, Tejun Heo wrote: > Hello, Michal. > > On Thu, Nov 15, 2012 at 10:51:03AM +0100, Michal Hocko wrote: > > > I'm a bit confused. Why would that make any difference? Shouldn't it > > > be just able to test the condition and continue? > > > > Ohh, I misunderstood your proposal. So what you are suggesting is > > to put all the logic we have in mem_cgroup_iter inside what you call > > reclaim here + mem_cgroup_iter_break inside the loop, right? > > > > I do not see how this would help us much. mem_cgroup_iter is not the > > nicest piece of code but it handles quite a complex requirements that we > > have currently (css reference count, multiple reclaimers racing). So I > > would rather keep it this way. Further simplifications are welcome of > > course. > > > > Is there any reason why you are not happy about direct using of > > cgroup_next_descendant_pre? > > Because I'd like to consider the next functions as implementation > detail, and having interations structred as loops tend to read better > and less error-prone. e.g. when you use next functions directly, it's > way easier to circumvent locking requirements in a way which isn't > very obvious. The whole point behind mem_cgroup_iter is to hide all the complexity behind memcg iteration. Memcg code either use for_each_mem_cgroup_tree for !reclaim case and mem_cgroup_iter otherwise. > So, unless it messes up the code too much (and I can't see why it > would), I'd much prefer if memcg used for_each_*() macros. As I said this would mean that the current mem_cgroup_iter code would have to be inverted which doesn't simplify the code much. I'd rather hide all the grossy details inside the memcg iterator. Or am I still missing your suggestion? -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1768033Ab2KOPba (ORCPT ); Thu, 15 Nov 2012 10:31:30 -0500 Received: from mail-pb0-f46.google.com ([209.85.160.46]:53808 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755745Ab2KOPb3 (ORCPT ); Thu, 15 Nov 2012 10:31:29 -0500 Date: Thu, 15 Nov 2012 07:31:24 -0800 From: Tejun Heo To: Michal Hocko Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121115153124.GD7306@mtj.dyndns.org> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <20121113161442.GA18227@mtj.dyndns.org> <20121114085129.GC17111@dhcp22.suse.cz> <20121114185245.GF21185@mtj.dyndns.org> <20121115095103.GB11990@dhcp22.suse.cz> <20121115144732.GB7306@mtj.dyndns.org> <20121115151255.GE11990@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121115151255.GE11990@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, Michal. On Thu, Nov 15, 2012 at 04:12:55PM +0100, Michal Hocko wrote: > > Because I'd like to consider the next functions as implementation > > detail, and having interations structred as loops tend to read better > > and less error-prone. e.g. when you use next functions directly, it's > > way easier to circumvent locking requirements in a way which isn't > > very obvious. > > The whole point behind mem_cgroup_iter is to hide all the complexity > behind memcg iteration. Memcg code either use for_each_mem_cgroup_tree > for !reclaim case and mem_cgroup_iter otherwise. > > > So, unless it messes up the code too much (and I can't see why it > > would), I'd much prefer if memcg used for_each_*() macros. > > As I said this would mean that the current mem_cgroup_iter code would > have to be inverted which doesn't simplify the code much. I'd rather > hide all the grossy details inside the memcg iterator. > Or am I still missing your suggestion? One way or the other, I don't think the code complexity would change much. Again, I'd much *prefer* if memcg used what other controllers would be using, but that's a preference and if necessary we can keep the next functions as exposed APIs. I think the issue I have is that I can't see much technical justification for that. If the code becomes much simpler by choosing one over the other, sure, but is that the case here? Isn't it mostly just about where to put the same things? If so, what would be the rationale for requiring a different interface? Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2993487Ab2KOQPJ (ORCPT ); Thu, 15 Nov 2012 11:15:09 -0500 Received: from cantor2.suse.de ([195.135.220.15]:48496 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2993458Ab2KOQPI (ORCPT ); Thu, 15 Nov 2012 11:15:08 -0500 Date: Thu, 15 Nov 2012 17:15:04 +0100 From: Michal Hocko To: Tejun Heo Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Glauber Costa Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121115161504.GF11990@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <20121113161442.GA18227@mtj.dyndns.org> <20121114085129.GC17111@dhcp22.suse.cz> <20121114185245.GF21185@mtj.dyndns.org> <20121115095103.GB11990@dhcp22.suse.cz> <20121115144732.GB7306@mtj.dyndns.org> <20121115151255.GE11990@dhcp22.suse.cz> <20121115153124.GD7306@mtj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121115153124.GD7306@mtj.dyndns.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 15-11-12 07:31:24, Tejun Heo wrote: > Hello, Michal. > > On Thu, Nov 15, 2012 at 04:12:55PM +0100, Michal Hocko wrote: > > > Because I'd like to consider the next functions as implementation > > > detail, and having interations structred as loops tend to read better > > > and less error-prone. e.g. when you use next functions directly, it's > > > way easier to circumvent locking requirements in a way which isn't > > > very obvious. > > > > The whole point behind mem_cgroup_iter is to hide all the complexity > > behind memcg iteration. Memcg code either use for_each_mem_cgroup_tree > > for !reclaim case and mem_cgroup_iter otherwise. > > > > > So, unless it messes up the code too much (and I can't see why it > > > would), I'd much prefer if memcg used for_each_*() macros. > > > > As I said this would mean that the current mem_cgroup_iter code would > > have to be inverted which doesn't simplify the code much. I'd rather > > hide all the grossy details inside the memcg iterator. > > Or am I still missing your suggestion? > > One way or the other, I don't think the code complexity would change > much. Again, I'd much *prefer* if memcg used what other controllers > would be using, but that's a preference and if necessary we can keep > the next functions as exposed APIs. Yes please. > I think the issue I have is that I can't see much technical > justification for that. If the code becomes much simpler by choosing > one over the other, sure, but is that the case here? Yes and I've tried to say that already. Memcg needs hierarchy, css ref counting and concurrent reclaim (per-zone per-priority) aware iteration. All of that is hidden in mem_cgroup_iter currently so the caller doesn't have to care about it at all. Which makes shrink_zone not care about memcg that much. cgroup_for_each_descendant_pre is not suitable at least because it doesn't provide a way to start a walk at a selected node (which is shared per-zone per-priority in memcg case). Even if cgroup_for_each_descendant_pre had start parameter there is still a lot of house keeping that callers would have to handle (css_tryget to start with, update of the cached possible not mentioning use_hierarchy thingy or mem_cgroup_disabled). We also try to not pollute mm/vmscan.c as much as possible so we definitely do not want to bring all this into shrink_zone. This all sounds like too much of a hassle if it is exposed so I would really like to stay with mem_cgroup_iter and slowly simplify it until it can go away (if that is possible at all). > Isn't it mostly just about where to put the same things? Unfortunately no. We wouldn't grow own iterator in such a case. > If so, what would be the rationale for requiring a different > interface? Does the above explain it? -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752621Ab2KSOFI (ORCPT ); Mon, 19 Nov 2012 09:05:08 -0500 Received: from cantor2.suse.de ([195.135.220.15]:49951 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752246Ab2KSOFH (ORCPT ); Mon, 19 Nov 2012 09:05:07 -0500 Date: Mon, 19 Nov 2012 15:05:02 +0100 From: Michal Hocko To: Kamezawa Hiroyuki Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121119140502.GA16803@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> <50A2E3B3.6080007@jp.fujitsu.com> <20121114101052.GD17111@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121114101052.GD17111@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 14-11-12 11:10:52, Michal Hocko wrote: > On Wed 14-11-12 09:20:03, KAMEZAWA Hiroyuki wrote: > > (2012/11/14 0:30), Michal Hocko wrote: > [...] > > > @@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > > > mz = mem_cgroup_zoneinfo(root, nid, zid); > > > iter = &mz->reclaim_iter[reclaim->priority]; > > > spin_lock(&iter->iter_lock); > > > + last_visited = iter->last_visited; > > > if (prev && reclaim->generation != iter->generation) { > > > + if (last_visited) { > > > + mem_cgroup_put(last_visited); > > > + iter->last_visited = NULL; > > > + } > > > spin_unlock(&iter->iter_lock); > > > return NULL; > > > } > > > - id = iter->position; > > > } > > > > > > rcu_read_lock(); > > > - css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id); > > > - if (css) { > > > - if (css == &root->css || css_tryget(css)) > > > - memcg = mem_cgroup_from_css(css); > > > - } else > > > - id = 0; > > > - rcu_read_unlock(); > > > + /* > > > + * Root is not visited by cgroup iterators so it needs a special > > > + * treatment. > > > + */ > > > + if (!last_visited) { > > > + css = &root->css; > > > + } else { > > > + struct cgroup *next_cgroup; > > > + > > > + next_cgroup = cgroup_next_descendant_pre( > > > + last_visited->css.cgroup, > > > + root->css.cgroup); > > > > Maybe I miss something but.... last_visited is holded by memcg's refcnt. > > The cgroup pointed by css.cgroup is by cgroup's refcnt which can be freed > > before memcg is freed and last_visited->css.cgroup is out of RCU cycle. > > Is this safe ? > > Good spotted. You are right. What I need to do is to check that the > last_visited is alive and restart from the root if not. Something like > the bellow (incremental patch on top of this one) should help, right? > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 30efd7e..c0a91a3 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1105,6 +1105,16 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > spin_unlock(&iter->iter_lock); > return NULL; > } > + /* > + * memcg is still valid because we hold a reference but > + * its cgroup might have vanished in the meantime so > + * we have to double check it is alive and restart the > + * tree walk otherwise. > + */ > + if (last_visited && !css_tryget(&last_visited->css)) { > + mem_cgroup_put(last_visited); > + last_visited = NULL; > + } > } > > rcu_read_lock(); > @@ -1136,8 +1146,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, > if (reclaim) { > struct mem_cgroup *curr = memcg; > > - if (last_visited) > + if (last_visited) { > + css_put(&last_visited->css); > mem_cgroup_put(last_visited); > + } > > if (css && !memcg) > curr = mem_cgroup_from_css(css); Now that I think about it again it seems that this is more complicated than necessary. It should be sufficient to hold css' reference for the iter->last_visited because this makes sure that the cgroup won't go away same as mem_cgroup. Memcg reference counting + css_tryget just makes the situation more complicated because it forces us to retry the iteration on css_tryget failure as the cgroup is gone already and we have no point to continue other than start all over again. Which is, ehmm, _really_ ugly. I will repost the updated version sometime this week after it passes some testing. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752943Ab2KSPLg (ORCPT ); Mon, 19 Nov 2012 10:11:36 -0500 Received: from cantor2.suse.de ([195.135.220.15]:53482 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752751Ab2KSPLf (ORCPT ); Mon, 19 Nov 2012 10:11:35 -0500 Date: Mon, 19 Nov 2012 16:11:30 +0100 From: Michal Hocko To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki , Johannes Weiner , Ying Han , Tejun Heo , Glauber Costa Subject: Re: [RFC 2/5] memcg: rework mem_cgroup_iter to use cgroup iterators Message-ID: <20121119151130.GB16803@dhcp22.suse.cz> References: <1352820639-13521-1-git-send-email-mhocko@suse.cz> <1352820639-13521-3-git-send-email-mhocko@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1352820639-13521-3-git-send-email-mhocko@suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 13-11-12 16:30:36, Michal Hocko wrote: [...] > + /* > + * Root is not visited by cgroup iterators so it needs a special > + * treatment. > + */ > + if (!last_visited) { > + css = &root->css; > + } else { > + struct cgroup *next_cgroup; > + > + next_cgroup = cgroup_next_descendant_pre( > + last_visited->css.cgroup, > + root->css.cgroup); > + if (next_cgroup) > + css = cgroup_subsys_state(next_cgroup, > + mem_cgroup_subsys_id); This is not correct because cgroup_next_descendant_pre expects pos to be NULL for the first iteration but the way we do iterate (visit the root first) means that the second iteration will have last_visited != NULL and if root doesn't have any children the iteration would go unleashed to to the endless loop. We need something like: struct cgroup *prev_cgroup = (last_visited == root) ? NULL : last_visited->css.cgroup; next_cgroup = cgroup_next_descendant_pre(prev_cgroup, root->css.gtoup); -- Michal Hocko SUSE Labs