From mboxrd@z Thu Jan 1 00:00:00 1970 From: Li Zefan Subject: [PATCH 0/12][V3] memcg: make memcg's life cycle the same as cgroup Date: Fri, 17 May 2013 15:02:16 +0800 Message-ID: <5195D5F8.7000609@huawei.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Andrew Morton Cc: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org Hi Andrew, All the patches in this patchset has been acked by Michal and Kamezawa-san, and it's ready to be merged into -mm. Changes since v2: - rebased to 3.10-rc1 - collected some acks - the two memcg bug fixes has been merged into mainline - the cgroup core patch has been merged into mainline Changes since v1: - wrote better changelog and added acked-by and reviewed-by tags - revised some comments as suggested by Michal - added a wmb() in kmem_cgroup_css_offline(), pointed out by Michal - fixed a bug which causes a css_put() never be called Now memcg has its own refcnt, so when a cgroup is destroyed, the memcg can still be alive. This patchset converts memcg to always use css_get/put, so memcg will have the same life cycle as its corresponding cgroup. The historical reason that memcg didn't use css_get in some cases, is that cgroup couldn't be removed if there're still css refs. The situation has changed so that rmdir a cgroup will succeed regardless css refs, but won't be freed until css refs goes down to 0. Since the introduction of kmemcg, the memcg refcnt handling grows even more complicated. This patchset greately simplifies memcg's life cycle management. Also, after those changes, we can convert memcg to use cgroup->id, and then we can kill css_id. The first 2 patches are bug fixes that can go into 3.10, and the rest are for 3.10. Li Zefan (7): memcg: use css_get() in sock_update_memcg() memcg: don't use mem_cgroup_get() when creating a kmemcg cache memcg: use css_get/put when charging/uncharging kmem memcg: use css_get/put for swap memcg memcg: don't need to get a reference to the parent memcg: kill memcg refcnt memcg: don't need to free memcg via RCU or workqueue Michal Hocko (2): Revert "memcg: avoid dangling reference count in creation failure." memcg, kmem: fix reference count handling on the error path --- mm/memcontrol.c | 204 ++++++++++++++++++++++-------------------------------------- 1 file changed, 73 insertions(+), 131 deletions(-) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Li Zefan Subject: [PATCH 3/9] memcg: use css_get() in sock_update_memcg() Date: Fri, 17 May 2013 15:03:38 +0800 Message-ID: <5195D64A.9090000@huawei.com> References: <5195D5F8.7000609@huawei.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <5195D5F8.7000609-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Andrew Morton Cc: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org Use css_get/css_put instead of mem_cgroup_get/put. Note, if at the same time someone is moving @current to a different cgroup and removing the old cgroup, css_tryget() may return false, and sock->sk_cgrp won't be initialized, which is fine. Signed-off-by: Li Zefan Acked-by: KAMEZAWA Hiroyuki Acked-by: Michal Hocko --- mm/memcontrol.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4d0458d..f1320d3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -561,15 +561,15 @@ void sock_update_memcg(struct sock *sk) */ if (sk->sk_cgrp) { BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg)); - mem_cgroup_get(sk->sk_cgrp->memcg); + css_get(&sk->sk_cgrp->memcg->css); return; } rcu_read_lock(); memcg = mem_cgroup_from_task(current); cg_proto = sk->sk_prot->proto_cgroup(memcg); - if (!mem_cgroup_is_root(memcg) && memcg_proto_active(cg_proto)) { - mem_cgroup_get(memcg); + if (!mem_cgroup_is_root(memcg) && + memcg_proto_active(cg_proto) && css_tryget(&memcg->css)) { sk->sk_cgrp = cg_proto; } rcu_read_unlock(); @@ -583,7 +583,7 @@ void sock_release_memcg(struct sock *sk) struct mem_cgroup *memcg; WARN_ON(!sk->sk_cgrp->memcg); memcg = sk->sk_cgrp->memcg; - mem_cgroup_put(memcg); + css_put(&sk->sk_cgrp->memcg->css); } } -- 1.8.0.2 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Li Zefan Subject: [PATCH 1/9] Revert "memcg: avoid dangling reference count in creation failure." Date: Fri, 17 May 2013 15:03:02 +0800 Message-ID: <5195D626.3010406@huawei.com> References: <5195D5F8.7000609@huawei.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <5195D5F8.7000609@huawei.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Andrew Morton Cc: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org This reverts commit e4715f01be697a3730c78f8ffffb595591d6a88c mem_cgroup_put is hierarchy aware so mem_cgroup_put(memcg) already drops an additional reference from all parents so the additional mem_cgrroup_put(parent) potentially causes use-after-free. Cc: # 3.9.x Signed-off-by: Michal Hocko Signed-off-by: Li Zefan Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cb1c9de..5918e90 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6297,8 +6297,6 @@ mem_cgroup_css_online(struct cgroup *cont) * call __mem_cgroup_free, so return directly */ mem_cgroup_put(memcg); - if (parent->use_hierarchy) - mem_cgroup_put(parent); } return error; } -- 1.8.0.2 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Li Zefan Subject: [PATCH 2/9] memcg, kmem: fix reference count handling on the error path Date: Fri, 17 May 2013 15:03:23 +0800 Message-ID: <5195D63B.4010202@huawei.com> References: <5195D5F8.7000609@huawei.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <5195D5F8.7000609@huawei.com> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Andrew Morton Cc: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org mem_cgroup_css_online calls mem_cgroup_put if memcg_init_kmem fails. This is not correct because only memcg_propagate_kmem takes an additional reference while mem_cgroup_sockets_init is allowed to fail as well (although no current implementation fails) but it doesn't take any reference. This all suggests that it should be memcg_propagate_kmem that should clean up after itself so this patch moves mem_cgroup_put over there. Unfortunately this is not that easy (as pointed out by Li Zefan) because memcg_kmem_mark_dead marks the group dead (KMEM_ACCOUNTED_DEAD) if it is marked active (KMEM_ACCOUNTED_ACTIVE) which is the case even if memcg_propagate_kmem fails so the additional reference is dropped in that case in kmem_cgroup_destroy which means that the reference would be dropped two times. The easiest way then would be to simply remove mem_cgrroup_put from mem_cgroup_css_online and rely on kmem_cgroup_destroy doing the right thing. Cc: # 3.8+ Signed-off-by: Michal Hocko Signed-off-by: Li Zefan Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 8 -------- 1 file changed, 8 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5918e90..4d0458d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6290,14 +6290,6 @@ mem_cgroup_css_online(struct cgroup *cont) error = memcg_init_kmem(memcg, &mem_cgroup_subsys); mutex_unlock(&memcg_create_mutex); - if (error) { - /* - * We call put now because our (and parent's) refcnts - * are already in place. mem_cgroup_put() will internally - * call __mem_cgroup_free, so return directly - */ - mem_cgroup_put(memcg); - } return error; } -- 1.8.0.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Li Zefan Subject: [PATCH 4/9] memcg: don't use mem_cgroup_get() when creating a kmemcg cache Date: Fri, 17 May 2013 15:03:52 +0800 Message-ID: <5195D658.5000301@huawei.com> References: <5195D5F8.7000609@huawei.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <5195D5F8.7000609@huawei.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Andrew Morton Cc: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org Use css_get()/css_put() instead of mem_cgroup_get()/mem_cgroup_put(). There are two things being done in the current code: First, we acquired a css_ref to make sure that the underlying cgroup would not go away. That is a short lived reference, and it is put as soon as the cache is created. At this point, we acquire a long-lived per-cache memcg reference count to guarantee that the memcg will still be alive. so it is: enqueue: css_get create : memcg_get, css_put destroy: memcg_put So we only need to get rid of the memcg_get, change the memcg_put to css_put, and get rid of the now extra css_put. (This changelog is mostly written by Glauber) Signed-off-by: Li Zefan Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f1320d3..63526f9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3227,7 +3227,7 @@ void memcg_release_cache(struct kmem_cache *s) list_del(&s->memcg_params->list); mutex_unlock(&memcg->slab_caches_mutex); - mem_cgroup_put(memcg); + css_put(&memcg->css); out: kfree(s->memcg_params); } @@ -3387,16 +3387,18 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg, mutex_lock(&memcg_cache_mutex); new_cachep = cachep->memcg_params->memcg_caches[idx]; - if (new_cachep) + if (new_cachep) { + css_put(&memcg->css); goto out; + } new_cachep = kmem_cache_dup(memcg, cachep); if (new_cachep == NULL) { new_cachep = cachep; + css_put(&memcg->css); goto out; } - mem_cgroup_get(memcg); atomic_set(&new_cachep->memcg_params->nr_pages , 0); cachep->memcg_params->memcg_caches[idx] = new_cachep; @@ -3484,8 +3486,6 @@ static void memcg_create_cache_work_func(struct work_struct *w) cw = container_of(w, struct create_work, work); memcg_create_kmem_cache(cw->memcg, cw->cachep); - /* Drop the reference gotten when we enqueued. */ - css_put(&cw->memcg->css); kfree(cw); } -- 1.8.0.2 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Li Zefan Subject: [PATCH 5/9] memcg: use css_get/put when charging/uncharging kmem Date: Fri, 17 May 2013 15:04:06 +0800 Message-ID: <5195D666.6030408@huawei.com> References: <5195D5F8.7000609@huawei.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <5195D5F8.7000609@huawei.com> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Andrew Morton Cc: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org Use css_get/put instead of mem_cgroup_get/put. We can't do a simple replacement, because here mem_cgroup_put() is called during mem_cgroup_css_free(), while mem_cgroup_css_free() won't be called until css refcnt goes down to 0. Instead we increment css refcnt in mem_cgroup_css_offline(), and then check if there's still kmem charges. If not, css refcnt will be decremented immediately, otherwise the refcnt won't be decremented when kmem charges goes down to 0. v2: - added wmb() in kmem_cgroup_css_offline(), pointed out by Michal - revised comments as suggested by Michal - fixed to check if kmem is activated in kmem_cgroup_css_offline() Signed-off-by: Li Zefan Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 66 +++++++++++++++++++++++++++++++++++---------------------- 1 file changed, 41 insertions(+), 25 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 63526f9..2f0d117 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3033,8 +3033,16 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size) if (res_counter_uncharge(&memcg->kmem, size)) return; + /* + * Releases a reference taken in kmem_cgroup_css_offline in case + * this last uncharge is racing with the offlining code or it is + * outliving the memcg existence. + * + * The memory barrier imposed by test&clear is paired with the + * explicit one in kmem_cgroup_css_offline. + */ if (memcg_kmem_test_and_clear_dead(memcg)) - mem_cgroup_put(memcg); + css_put(&memcg->css); } void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep) @@ -5130,14 +5138,6 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val) * starts accounting before all call sites are patched */ memcg_kmem_set_active(memcg); - - /* - * kmem charges can outlive the cgroup. In the case of slab - * pages, for instance, a page contain objects from various - * processes, so it is unfeasible to migrate them away. We - * need to reference count the memcg because of that. - */ - mem_cgroup_get(memcg); } else ret = res_counter_set_limit(&memcg->kmem, val); out: @@ -5170,12 +5170,10 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg) goto out; /* - * destroy(), called if we fail, will issue static_key_slow_inc() and - * mem_cgroup_put() if kmem is enabled. We have to either call them - * unconditionally, or clear the KMEM_ACTIVE flag. I personally find - * this more consistent, since it always leads to the same destroy path + * __mem_cgroup_free() will issue static_key_slow_dec() because this + * memcg is active already. If the later initialization fails then the + * cgroup core triggers the cleanup so we do not have to do it here. */ - mem_cgroup_get(memcg); static_key_slow_inc(&memcg_kmem_enabled_key); mutex_lock(&set_limit_mutex); @@ -5858,23 +5856,39 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) return mem_cgroup_sockets_init(memcg, ss); } -static void kmem_cgroup_destroy(struct mem_cgroup *memcg) +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg) { - mem_cgroup_sockets_destroy(memcg); + if (!memcg_kmem_is_active(memcg)) + return; + /* + * kmem charges can outlive the cgroup. In the case of slab + * pages, for instance, a page contain objects from various + * processes. As we prevent from taking a reference for every + * such allocation we have to be careful when doing uncharge + * (see memcg_uncharge_kmem) and here during offlining. + * + * The idea is that that only the _last_ uncharge which sees + * the dead memcg will drop the last reference. An additional + * reference is taken here before the group is marked dead + * which is then paired with css_put during uncharge resp. here. + * + * Although this might sound strange as this path is called when + * the reference has already dropped down to 0 and shouldn't be + * incremented anymore (css_tryget would fail) we do not have + * other options because of the kmem allocations lifetime. + */ + css_get(&memcg->css); + + /* see comment in memcg_uncharge_kmem() */ + wmb(); memcg_kmem_mark_dead(memcg); if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0) return; - /* - * Charges already down to 0, undo mem_cgroup_get() done in the charge - * path here, being careful not to race with memcg_uncharge_kmem: it is - * possible that the charges went down to 0 between mark_dead and the - * res_counter read, so in that case, we don't need the put - */ if (memcg_kmem_test_and_clear_dead(memcg)) - mem_cgroup_put(memcg); + css_put(&memcg->css); } #else static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) @@ -5882,7 +5896,7 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) return 0; } -static void kmem_cgroup_destroy(struct mem_cgroup *memcg) +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg) { } #endif @@ -6315,6 +6329,8 @@ static void mem_cgroup_css_offline(struct cgroup *cont) { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); + kmem_cgroup_css_offline(memcg); + mem_cgroup_invalidate_reclaim_iterators(memcg); mem_cgroup_reparent_charges(memcg); mem_cgroup_destroy_all_caches(memcg); @@ -6324,7 +6340,7 @@ static void mem_cgroup_css_free(struct cgroup *cont) { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); - kmem_cgroup_destroy(memcg); + mem_cgroup_sockets_destroy(memcg); mem_cgroup_put(memcg); } -- 1.8.0.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Li Zefan Subject: [PATCH 6/9] memcg: use css_get/put for swap memcg Date: Fri, 17 May 2013 15:04:25 +0800 Message-ID: <5195D679.5080100@huawei.com> References: <5195D5F8.7000609@huawei.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <5195D5F8.7000609@huawei.com> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Andrew Morton Cc: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org Use css_get/put instead of mem_cgroup_get/put. A simple replacement will do. The historical reason that memcg has its own refcnt instead of always using css_get/put, is that cgroup couldn't be removed if there're still css refs, so css refs can't be used as long-lived reference. The situation has changed so that rmdir a cgroup will succeed regardless css refs, but won't be freed until css refs goes down to 0. Signed-off-by: Li Zefan Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2f0d117..b11ea88 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4185,12 +4185,12 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype, unlock_page_cgroup(pc); /* * even after unlock, we have memcg->res.usage here and this memcg - * will never be freed. + * will never be freed, so it's safe to call css_get(). */ memcg_check_events(memcg, page); if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) { mem_cgroup_swap_statistics(memcg, true); - mem_cgroup_get(memcg); + css_get(&memcg->css); } /* * Migration does not charge the res_counter for the @@ -4290,7 +4290,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout) /* * record memcg information, if swapout && memcg != NULL, - * mem_cgroup_get() was called in uncharge(). + * css_get() was called in uncharge(). */ if (do_swap_account && swapout && memcg) swap_cgroup_record(ent, css_id(&memcg->css)); @@ -4321,7 +4321,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent) if (!mem_cgroup_is_root(memcg)) res_counter_uncharge(&memcg->memsw, PAGE_SIZE); mem_cgroup_swap_statistics(memcg, false); - mem_cgroup_put(memcg); + css_put(&memcg->css); } rcu_read_unlock(); } @@ -4355,11 +4355,14 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry, * This function is only called from task migration context now. * It postpones res_counter and refcount handling till the end * of task migration(mem_cgroup_clear_mc()) for performance - * improvement. But we cannot postpone mem_cgroup_get(to) - * because if the process that has been moved to @to does - * swap-in, the refcount of @to might be decreased to 0. + * improvement. But we cannot postpone css_get(to) because if + * the process that has been moved to @to does swap-in, the + * refcount of @to might be decreased to 0. + * + * We are in attach() phase, so the cgroup is guaranteed to be + * alive, so we can just call css_get(). */ - mem_cgroup_get(to); + css_get(&to->css); return 0; } return -EINVAL; @@ -6651,6 +6654,7 @@ static void __mem_cgroup_clear_mc(void) { struct mem_cgroup *from = mc.from; struct mem_cgroup *to = mc.to; + int i; /* we must uncharge all the leftover precharges from mc.to */ if (mc.precharge) { @@ -6671,7 +6675,9 @@ static void __mem_cgroup_clear_mc(void) if (!mem_cgroup_is_root(mc.from)) res_counter_uncharge(&mc.from->memsw, PAGE_SIZE * mc.moved_swap); - __mem_cgroup_put(mc.from, mc.moved_swap); + + for (i = 0; i < mc.moved_swap; i++) + css_put(&mc.from->css); if (!mem_cgroup_is_root(mc.to)) { /* @@ -6681,7 +6687,7 @@ static void __mem_cgroup_clear_mc(void) res_counter_uncharge(&mc.to->res, PAGE_SIZE * mc.moved_swap); } - /* we've already done mem_cgroup_get(mc.to) */ + /* we've already done css_get(mc.to) */ mc.moved_swap = 0; } memcg_oom_recover(from); -- 1.8.0.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Li Zefan Subject: [PATCH 7/9] memcg: don't need to get a reference to the parent Date: Fri, 17 May 2013 15:04:42 +0800 Message-ID: <5195D68A.8000205@huawei.com> References: <5195D5F8.7000609@huawei.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <5195D5F8.7000609@huawei.com> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Andrew Morton Cc: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org The cgroup core guarantees it's always safe to access the parent. v2: - added a comment in mem_cgroup_put() as suggested by Michal - removed mem_cgroup_get(), otherwise gcc will warn that it's not used Signed-off-by: Li Zefan Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 19 +++---------------- 1 file changed, 3 insertions(+), 16 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b11ea88..c6d267a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -508,7 +508,6 @@ enum res_type { */ static DEFINE_MUTEX(memcg_create_mutex); -static void mem_cgroup_get(struct mem_cgroup *memcg); static void mem_cgroup_put(struct mem_cgroup *memcg); static inline @@ -6171,19 +6170,10 @@ static void free_rcu(struct rcu_head *rcu_head) schedule_work(&memcg->work_freeing); } -static void mem_cgroup_get(struct mem_cgroup *memcg) -{ - atomic_inc(&memcg->refcnt); -} - static void __mem_cgroup_put(struct mem_cgroup *memcg, int count) { - if (atomic_sub_and_test(count, &memcg->refcnt)) { - struct mem_cgroup *parent = parent_mem_cgroup(memcg); + if (atomic_sub_and_test(count, &memcg->refcnt)) call_rcu(&memcg->rcu_freeing, free_rcu); - if (parent) - mem_cgroup_put(parent); - } } static void mem_cgroup_put(struct mem_cgroup *memcg) @@ -6286,12 +6276,9 @@ mem_cgroup_css_online(struct cgroup *cont) res_counter_init(&memcg->kmem, &parent->kmem); /* - * We increment refcnt of the parent to ensure that we can - * safely access it on res_counter_charge/uncharge. - * This refcnt will be decremented when freeing this - * mem_cgroup(see mem_cgroup_put). + * No need to take a reference to the parent because cgroup + * core guarantees its existence. */ - mem_cgroup_get(parent); } else { res_counter_init(&memcg->res, NULL); res_counter_init(&memcg->memsw, NULL); -- 1.8.0.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Li Zefan Subject: [PATCH 8/9] memcg: kill memcg refcnt Date: Fri, 17 May 2013 15:05:08 +0800 Message-ID: <5195D6A4.8060109@huawei.com> References: <5195D5F8.7000609@huawei.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <5195D5F8.7000609@huawei.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Andrew Morton Cc: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org Now memcg has the same life cycle as its corresponding cgroup. Kill the useless refcnt. Signed-off-by: Li Zefan Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 18 +----------------- 1 file changed, 1 insertion(+), 17 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c6d267a..348126a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -303,8 +303,6 @@ struct mem_cgroup { bool oom_lock; atomic_t under_oom; - atomic_t refcnt; - int swappiness; /* OOM-Killer disable */ int oom_kill_disable; @@ -508,8 +506,6 @@ enum res_type { */ static DEFINE_MUTEX(memcg_create_mutex); -static void mem_cgroup_put(struct mem_cgroup *memcg); - static inline struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s) { @@ -6170,17 +6166,6 @@ static void free_rcu(struct rcu_head *rcu_head) schedule_work(&memcg->work_freeing); } -static void __mem_cgroup_put(struct mem_cgroup *memcg, int count) -{ - if (atomic_sub_and_test(count, &memcg->refcnt)) - call_rcu(&memcg->rcu_freeing, free_rcu); -} - -static void mem_cgroup_put(struct mem_cgroup *memcg) -{ - __mem_cgroup_put(memcg, 1); -} - /* * Returns the parent mem_cgroup in memcgroup hierarchy with hierarchy enabled. */ @@ -6240,7 +6225,6 @@ mem_cgroup_css_alloc(struct cgroup *cont) memcg->last_scanned_node = MAX_NUMNODES; INIT_LIST_HEAD(&memcg->oom_notify); - atomic_set(&memcg->refcnt, 1); memcg->move_charge_at_immigrate = 0; mutex_init(&memcg->thresholds_lock); spin_lock_init(&memcg->move_lock); @@ -6332,7 +6316,7 @@ static void mem_cgroup_css_free(struct cgroup *cont) mem_cgroup_sockets_destroy(memcg); - mem_cgroup_put(memcg); + call_rcu(&memcg->rcu_freeing, free_rcu); } #ifdef CONFIG_MMU -- 1.8.0.2 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Li Zefan Subject: [PATCH 9/9] memcg: don't need to free memcg via RCU or workqueue Date: Fri, 17 May 2013 15:05:35 +0800 Message-ID: <5195D6BF.3020907@huawei.com> References: <5195D5F8.7000609@huawei.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <5195D5F8.7000609@huawei.com> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Andrew Morton Cc: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org Now memcg has the same life cycle with its corresponding cgroup, and a cgroup is freed via RCU and then mem_cgroup_css_free() will be called in a work function, so we can simply call __mem_cgroup_free() in mem_cgroup_css_free(). This actually reverts 59927fb984de1703c67bc640c3e522d8b5276c73 ("memcg: free mem_cgroup by RCU to fix oops"). Cc: Hugh Dickins Signed-off-by: Li Zefan Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 51 +++++---------------------------------------------- 1 file changed, 5 insertions(+), 46 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 348126a..eb27b58 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -267,28 +267,10 @@ struct mem_cgroup { /* vmpressure notifications */ struct vmpressure vmpressure; - union { - /* - * the counter to account for mem+swap usage. - */ - struct res_counter memsw; - - /* - * rcu_freeing is used only when freeing struct mem_cgroup, - * so put it into a union to avoid wasting more memory. - * It must be disjoint from the css field. It could be - * in a union with the res field, but res plays a much - * larger part in mem_cgroup life than memsw, and might - * be of interest, even at time of free, when debugging. - * So share rcu_head with the less interesting memsw. - */ - struct rcu_head rcu_freeing; - /* - * We also need some space for a worker in deferred freeing. - * By the time we call it, rcu_freeing is no longer in use. - */ - struct work_struct work_freeing; - }; + /* + * the counter to account for mem+swap usage. + */ + struct res_counter memsw; /* * the counter to account for kernel memory usage. @@ -6143,29 +6125,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) vfree(memcg); } - -/* - * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU, - * but in process context. The work_freeing structure is overlaid - * on the rcu_freeing structure, which itself is overlaid on memsw. - */ -static void free_work(struct work_struct *work) -{ - struct mem_cgroup *memcg; - - memcg = container_of(work, struct mem_cgroup, work_freeing); - __mem_cgroup_free(memcg); -} - -static void free_rcu(struct rcu_head *rcu_head) -{ - struct mem_cgroup *memcg; - - memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing); - INIT_WORK(&memcg->work_freeing, free_work); - schedule_work(&memcg->work_freeing); -} - /* * Returns the parent mem_cgroup in memcgroup hierarchy with hierarchy enabled. */ @@ -6316,7 +6275,7 @@ static void mem_cgroup_css_free(struct cgroup *cont) mem_cgroup_sockets_destroy(memcg); - call_rcu(&memcg->rcu_freeing, free_rcu); + __mem_cgroup_free(memcg); } #ifdef CONFIG_MMU -- 1.8.0.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Li Zefan Subject: Re: [PATCH 0/12][V3] memcg: make memcg's life cycle the same as cgroup Date: Fri, 17 May 2013 15:06:56 +0800 Message-ID: <5195D710.2040501@huawei.com> References: <5195D5F8.7000609@huawei.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <5195D5F8.7000609-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Andrew Morton Cc: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org The subject should be "[PATCH 0/9][v3] ..." On 2013/5/17 15:02, Li Zefan wrote: > Hi Andrew, > From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 5/9] memcg: use css_get/put when charging/uncharging kmem Date: Fri, 17 May 2013 11:08:22 -0700 Message-ID: <20130517180822.GC12632@mtj.dyndns.org> References: <5195D5F8.7000609@huawei.com> <5195D666.6030408@huawei.com> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:sender:date:from:to:cc:subject:message-id:references :mime-version:content-type:content-disposition:in-reply-to :user-agent; bh=GVuPDFNR0ALJuSST698bPnLZyt/s2V9YBrDZWiOfqHo=; b=DwsqXhVbbdepfCvcvHSEmmdNEfc15UWbKh6FvIuMhsYkgBk5cppdQ66sH7lLOLw9Il RWXdAMNPXpRD1yW8RPmN0M6GySEV5eg/kYlnQx6wGCQ8+cISzEKgMjhZDJlxZjVjCKfp Ek4CPV1PlK8Cly8vj/6kKpAjSWKXDHERa7xA+MZVEFlFdvtV54EaCFNUHSve3lKVzyI2 qwsaYIU3TBDHMQakBeR4H0qQvK4bIDbyedn+SkTwSfHIEZlRMkRZQjWSVHpJpvWJGHfi C3SMfKpIocdc+l6qD6j3zPhxZtn9TZ0vEprT0uloiVgaYE7l7BF3ivQn8pqrCFfOz15t FRpA== Content-Disposition: inline In-Reply-To: <5195D666.6030408-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Li Zefan Cc: Andrew Morton , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org Hey, On Fri, May 17, 2013 at 03:04:06PM +0800, Li Zefan wrote: > + /* > + * Releases a reference taken in kmem_cgroup_css_offline in case > + * this last uncharge is racing with the offlining code or it is > + * outliving the memcg existence. > + * > + * The memory barrier imposed by test&clear is paired with the > + * explicit one in kmem_cgroup_css_offline. Paired with the wmb to achieve what? > + */ > if (memcg_kmem_test_and_clear_dead(memcg)) > - mem_cgroup_put(memcg); > + css_put(&memcg->css); The other side is wmb, so there gotta be something which wants to read which were written before wmb here but the only thing after the barrier is css_put() which doesn't need such thing, so I'm lost on what the barrier pair is achieving here. In general, please be *very* explicit about what's going on whenever something is depending on barrier pairs. It'll make it easier for both the author and reviewers to actually understand what's going on and why it's necessary. ... > @@ -5858,23 +5856,39 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > return mem_cgroup_sockets_init(memcg, ss); > } > > -static void kmem_cgroup_destroy(struct mem_cgroup *memcg) > +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg) > { > - mem_cgroup_sockets_destroy(memcg); > + if (!memcg_kmem_is_active(memcg)) > + return; > > + /* > + * kmem charges can outlive the cgroup. In the case of slab > + * pages, for instance, a page contain objects from various > + * processes. As we prevent from taking a reference for every > + * such allocation we have to be careful when doing uncharge > + * (see memcg_uncharge_kmem) and here during offlining. > + * > + * The idea is that that only the _last_ uncharge which sees > + * the dead memcg will drop the last reference. An additional > + * reference is taken here before the group is marked dead > + * which is then paired with css_put during uncharge resp. here. > + * > + * Although this might sound strange as this path is called when > + * the reference has already dropped down to 0 and shouldn't be > + * incremented anymore (css_tryget would fail) we do not have Hmmm? offline is called on cgroup destruction regardless of css refcnt. The above comment seems a bit misleading. > + * other options because of the kmem allocations lifetime. > + */ > + css_get(&memcg->css); > + > + /* see comment in memcg_uncharge_kmem() */ > + wmb(); > memcg_kmem_mark_dead(memcg); Is the wmb() trying to prevent reordering between css_get() and memcg_kmem_mark_dead()? If so, it isn't necessary - the compiler isn't allowed to reorder two atomic ops (they're all asm volatiles) and the visibility order is guaranteed by the nature of the two operations going on here - both perform modify-and-test on one end of the operations. It could be argued that having memory barriers is better for completeness of mark/test interface but then those barriers should really moved into memcg_kmem_mark_dead() and its clearing counterpart. While it's all clever and dandy, my recommendation would be just using a lock for synchronization. It isn't a hot path. Why be clever? Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 From: Li Zefan Subject: Re: [PATCH 5/9] memcg: use css_get/put when charging/uncharging kmem Date: Wed, 22 May 2013 16:36:27 +0800 Message-ID: <519C838B.9060609@huawei.com> References: <5195D5F8.7000609@huawei.com> <5195D666.6030408@huawei.com> <20130517180822.GC12632@mtj.dyndns.org> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20130517180822.GC12632-9pTldWuhBndy/B6EtB590w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Tejun Heo Cc: Andrew Morton , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org On 2013/5/18 2:08, Tejun Heo wrote: > Hey, > > On Fri, May 17, 2013 at 03:04:06PM +0800, Li Zefan wrote: >> + /* >> + * Releases a reference taken in kmem_cgroup_css_offline in case >> + * this last uncharge is racing with the offlining code or it is >> + * outliving the memcg existence. >> + * >> + * The memory barrier imposed by test&clear is paired with the >> + * explicit one in kmem_cgroup_css_offline. > > Paired with the wmb to achieve what? > >> + */ >> if (memcg_kmem_test_and_clear_dead(memcg)) >> - mem_cgroup_put(memcg); >> + css_put(&memcg->css); > > The other side is wmb, so there gotta be something which wants to read > which were written before wmb here but the only thing after the > barrier is css_put() which doesn't need such thing, so I'm lost on > what the barrier pair is achieving here. > > In general, please be *very* explicit about what's going on whenever > something is depending on barrier pairs. It'll make it easier for > both the author and reviewers to actually understand what's going on > and why it's necessary. > > ... >> @@ -5858,23 +5856,39 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) >> return mem_cgroup_sockets_init(memcg, ss); >> } >> >> -static void kmem_cgroup_destroy(struct mem_cgroup *memcg) >> +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg) >> { >> - mem_cgroup_sockets_destroy(memcg); >> + if (!memcg_kmem_is_active(memcg)) >> + return; >> >> + /* >> + * kmem charges can outlive the cgroup. In the case of slab >> + * pages, for instance, a page contain objects from various >> + * processes. As we prevent from taking a reference for every >> + * such allocation we have to be careful when doing uncharge >> + * (see memcg_uncharge_kmem) and here during offlining. >> + * >> + * The idea is that that only the _last_ uncharge which sees >> + * the dead memcg will drop the last reference. An additional >> + * reference is taken here before the group is marked dead >> + * which is then paired with css_put during uncharge resp. here. >> + * >> + * Although this might sound strange as this path is called when >> + * the reference has already dropped down to 0 and shouldn't be >> + * incremented anymore (css_tryget would fail) we do not have > > Hmmm? offline is called on cgroup destruction regardless of css > refcnt. The above comment seems a bit misleading. > The comment is wrong. I'll fix it. >> + * other options because of the kmem allocations lifetime. >> + */ >> + css_get(&memcg->css); >> + >> + /* see comment in memcg_uncharge_kmem() */ >> + wmb(); >> memcg_kmem_mark_dead(memcg); > > Is the wmb() trying to prevent reordering between css_get() and > memcg_kmem_mark_dead()? If so, it isn't necessary - the compiler > isn't allowed to reorder two atomic ops (they're all asm volatiles) > and the visibility order is guaranteed by the nature of the two > operations going on here - both perform modify-and-test on one end of > the operations. > Yeah, I think you're right. > It could be argued that having memory barriers is better for > completeness of mark/test interface but then those barriers should > really moved into memcg_kmem_mark_dead() and its clearing counterpart. > > While it's all clever and dandy, my recommendation would be just using > a lock for synchronization. It isn't a hot path. Why be clever? > I don't quite like adding a lock not to protect data but just ensure code orders. Michal, what's your preference? I want to be sure that everyone is happy so the next version will hopefully be the last version. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 5/9] memcg: use css_get/put when charging/uncharging kmem Date: Fri, 24 May 2013 09:54:20 +0200 Message-ID: <20130524075420.GA24813@dhcp22.suse.cz> References: <5195D5F8.7000609@huawei.com> <5195D666.6030408@huawei.com> <20130517180822.GC12632@mtj.dyndns.org> <519C838B.9060609@huawei.com> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <519C838B.9060609-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Tejun Heo , Li Zefan Cc: Andrew Morton , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org Sorry, I have missed this. CCing would help. Anyway putting myself to CC now :P On Wed 22-05-13 16:36:27, Li Zefan wrote: > On 2013/5/18 2:08, Tejun Heo wrote: > > Hey, > > > > On Fri, May 17, 2013 at 03:04:06PM +0800, Li Zefan wrote: > >> + /* > >> + * Releases a reference taken in kmem_cgroup_css_offline in case > >> + * this last uncharge is racing with the offlining code or it is > >> + * outliving the memcg existence. > >> + * > >> + * The memory barrier imposed by test&clear is paired with the > >> + * explicit one in kmem_cgroup_css_offline. > > > > Paired with the wmb to achieve what? https://lkml.org/lkml/2013/4/4/190 " ! > + css_get(&memcg->css); ! I think that you need a write memory barrier here because css_get ! nor memcg_kmem_mark_dead implies it. memcg_uncharge_kmem uses ! memcg_kmem_test_and_clear_dead which imply a full memory barrier but it ! should see the elevated reference count. No? ! ! > + /* ! > + * We need to call css_get() first, because memcg_uncharge_kmem() ! > + * will call css_put() if it sees the memcg is dead. ! > + */ ! > memcg_kmem_mark_dead(memcg); " Does it make sense to you Tejun? > > > >> + */ > >> if (memcg_kmem_test_and_clear_dead(memcg)) > >> - mem_cgroup_put(memcg); > >> + css_put(&memcg->css); > > > > The other side is wmb, so there gotta be something which wants to read > > which were written before wmb here but the only thing after the > > barrier is css_put() which doesn't need such thing, so I'm lost on > > what the barrier pair is achieving here. > > > > In general, please be *very* explicit about what's going on whenever > > something is depending on barrier pairs. It'll make it easier for > > both the author and reviewers to actually understand what's going on > > and why it's necessary. > > > > ... > >> @@ -5858,23 +5856,39 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > >> return mem_cgroup_sockets_init(memcg, ss); > >> } > >> > >> -static void kmem_cgroup_destroy(struct mem_cgroup *memcg) > >> +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg) > >> { > >> - mem_cgroup_sockets_destroy(memcg); > >> + if (!memcg_kmem_is_active(memcg)) > >> + return; > >> > >> + /* > >> + * kmem charges can outlive the cgroup. In the case of slab > >> + * pages, for instance, a page contain objects from various > >> + * processes. As we prevent from taking a reference for every > >> + * such allocation we have to be careful when doing uncharge > >> + * (see memcg_uncharge_kmem) and here during offlining. > >> + * > >> + * The idea is that that only the _last_ uncharge which sees > >> + * the dead memcg will drop the last reference. An additional > >> + * reference is taken here before the group is marked dead > >> + * which is then paired with css_put during uncharge resp. here. > >> + * > >> + * Although this might sound strange as this path is called when > >> + * the reference has already dropped down to 0 and shouldn't be > >> + * incremented anymore (css_tryget would fail) we do not have > > > > Hmmm? offline is called on cgroup destruction regardless of css > > refcnt. The above comment seems a bit misleading. > > > > The comment is wrong. I'll fix it. Ohh, right. "Althouth this might sound strange as this path is called from css_offline when the reference might have dropped down to 0 and shouldn't ..." Sounds better? > >> + * other options because of the kmem allocations lifetime. > >> + */ > >> + css_get(&memcg->css); > >> + > >> + /* see comment in memcg_uncharge_kmem() */ > >> + wmb(); > >> memcg_kmem_mark_dead(memcg); > > > > Is the wmb() trying to prevent reordering between css_get() and > > memcg_kmem_mark_dead()? If so, it isn't necessary - the compiler > > isn't allowed to reorder two atomic ops (they're all asm volatiles) > > and the visibility order is guaranteed by the nature of the two > > operations going on here - both perform modify-and-test on one end of > > the operations. As I have copied my comment from the earlier thread above. css_get does atomic_add which doesn't imply any barrier AFAIK and memcg_kmem_mark_dead uses a simple set_bit which doesn't imply it either. What am I missing? > > > > Yeah, I think you're right. > > > It could be argued that having memory barriers is better for > > completeness of mark/test interface but then those barriers should > > really moved into memcg_kmem_mark_dead() and its clearing counterpart. > > > > While it's all clever and dandy, my recommendation would be just using > > a lock for synchronization. It isn't a hot path. Why be clever? > > > > I don't quite like adding a lock not to protect data but just ensure code > orders. Agreed. > Michal, what's your preference? I want to be sure that everyone is happy > so the next version will hopefully be the last version. I still do not see why the barrier is not needed and the lock seems too big hammer. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 5/9] memcg: use css_get/put when charging/uncharging kmem Date: Thu, 30 May 2013 14:48:52 +0900 Message-ID: <20130530054852.GA9305@mtj.dyndns.org> References: <5195D5F8.7000609@huawei.com> <5195D666.6030408@huawei.com> <20130517180822.GC12632@mtj.dyndns.org> <519C838B.9060609@huawei.com> <20130524075420.GA24813@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=54CaeHxsmdOSGtJRVw3bV5EJgz8Oh9EhUOmREt3Pxbw=; b=XEpI21rciz6fpMHfVV6Lr19gEu5qd6Z8dsN/QIbSTRhGek5u0IGcWnfeKhJFPeK6NT u+Ksa36pj+OKBj0F6r3H3s6pvdvk1UG2MbLsAs4c1Q5I00v7Jp6ogONAAWqt5smpdP0D HItn06hfxC8SQ1JrLsNmMNPSfWYGsqvB5McsXAfFdvjYcGi2TB8WbjfwSLZHUXi+WVtb 6LB+JOifO/NBS+3Lq0DB6x6x3RHBB2yFZVyxoiCrdKILsuWOpupblq9oGY4p3pMeQmsv LfeY73ZE6fAJpP9BWhUiAxdkaZ+mJm/8iiK85ek93DtMcBaKVRpyDp2Qw/FdcKcGCdCA xrTA== Content-Disposition: inline In-Reply-To: <20130524075420.GA24813@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: Li Zefan , Andrew Morton , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org Hello, Sorry about the delay. Have been and still am traveling. On Fri, May 24, 2013 at 09:54:20AM +0200, Michal Hocko wrote: > > > On Fri, May 17, 2013 at 03:04:06PM +0800, Li Zefan wrote: > > >> + /* > > >> + * Releases a reference taken in kmem_cgroup_css_offline in case > > >> + * this last uncharge is racing with the offlining code or it is > > >> + * outliving the memcg existence. > > >> + * > > >> + * The memory barrier imposed by test&clear is paired with the > > >> + * explicit one in kmem_cgroup_css_offline. > > > > > > Paired with the wmb to achieve what? > > https://lkml.org/lkml/2013/4/4/190 > " > ! > + css_get(&memcg->css); > ! I think that you need a write memory barrier here because css_get > ! nor memcg_kmem_mark_dead implies it. memcg_uncharge_kmem uses > ! memcg_kmem_test_and_clear_dead which imply a full memory barrier but it > ! should see the elevated reference count. No? > ! > ! > + /* > ! > + * We need to call css_get() first, because memcg_uncharge_kmem() > ! > + * will call css_put() if it sees the memcg is dead. > ! > + */ > ! > memcg_kmem_mark_dead(memcg); > " > > Does it make sense to you Tejun? Yeah, you're right. We need them. It's still a bummer that mark_dead has the appearance of proper encapsulation while not really taking care of synchronization. I think it'd make more sense for mark_dead to have the barrier (which BTW should probably be smp_wmb() instead of wmb()) inside it or for the function to be just open-coded. More on this topic later. > > The comment is wrong. I'll fix it. > > Ohh, right. "Althouth this might sound strange as this path is called from > css_offline when the reference might have dropped down to 0 and shouldn't ..." > > Sounds better? Yeap. > > I don't quite like adding a lock not to protect data but just ensure code > > orders. > > Agreed. > > > Michal, what's your preference? I want to be sure that everyone is happy > > so the next version will hopefully be the last version. > > I still do not see why the barrier is not needed and the lock seems too > big hammer. Yes, the barrier is necessary but I still think it's unnecessarily elaborate. Among the locking constructs, the barriesr are the worst - they don't enforce any structures, people often misunderstand / make mistakes about them, bugs from misusages are extremely difficult to trigger and reproduce especially on x86. It's a horrible construct and should only be used if no other options can meet the performance requirements required for the path. So, to me, "lock is too big a hammer" looks to be approaching the problem from the completely wrong direction when the code path clearly isn't hot enough to justify memory barrier tricks. We don't and shouldn't try to choose the mechanism with the lowest possible overhead for the given situation. We should be as simple and straight-forward as the situation allows. That's the only way to sustain long-term maintainability. So, let's please do something which is apparent. I don't really care what it is - a shared spinlock, test_and_lock bitops, whatever. It's not gonna show up in any profile anyway. There's absolutely no reason to mess with memory barriers. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 5/9] memcg: use css_get/put when charging/uncharging kmem Date: Thu, 30 May 2013 17:12:20 +0200 Message-ID: <20130530151220.GB18155@dhcp22.suse.cz> References: <5195D5F8.7000609@huawei.com> <5195D666.6030408@huawei.com> <20130517180822.GC12632@mtj.dyndns.org> <519C838B.9060609@huawei.com> <20130524075420.GA24813@dhcp22.suse.cz> <20130530054852.GA9305@mtj.dyndns.org> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=XYItylFmWQi7p277cTyHdDk1HUIngQSJDiwVKJlvuG0=; b=jBxAJ3TuN36vsTn+CmyvDLH+tDNC1SN4F01agaoepbIXKZ5UOGSRhuhqKIQRrSwwNd hnYpeFtmlFJvQFCJQnVioEQmhvGHPBLAFbWUKkrdE4+Gvmi4nrPzRWO3L3wmyLlglG+V VY3tuTBdN330j9mSgjt1UdZiNE42N290paL2PEQSaq1z8uy4GSzGKha4BjNbU1gzRERM XdrG8sxwmKcQKZcgHmu2mJsw+NiP+pbXrf7fjj1IGasvvEKQE/0N/bLbPE1S3G1QhmCH xImF6nYWpg9JyT7kLMYdjALaIZr3pB08mCcnBGKkf68IekeRNqxbyJX8FaL3zX42TAlP 9XZA== Content-Disposition: inline In-Reply-To: <20130530054852.GA9305-9pTldWuhBndy/B6EtB590w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Tejun Heo Cc: Li Zefan , Andrew Morton , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org On Thu 30-05-13 14:48:52, Tejun Heo wrote: > Hello, > > Sorry about the delay. Have been and still am traveling. > > On Fri, May 24, 2013 at 09:54:20AM +0200, Michal Hocko wrote: > > > > On Fri, May 17, 2013 at 03:04:06PM +0800, Li Zefan wrote: > > > >> + /* > > > >> + * Releases a reference taken in kmem_cgroup_css_offline in case > > > >> + * this last uncharge is racing with the offlining code or it is > > > >> + * outliving the memcg existence. > > > >> + * > > > >> + * The memory barrier imposed by test&clear is paired with the > > > >> + * explicit one in kmem_cgroup_css_offline. > > > > > > > > Paired with the wmb to achieve what? > > > > https://lkml.org/lkml/2013/4/4/190 > > " > > ! > + css_get(&memcg->css); > > ! I think that you need a write memory barrier here because css_get > > ! nor memcg_kmem_mark_dead implies it. memcg_uncharge_kmem uses > > ! memcg_kmem_test_and_clear_dead which imply a full memory barrier but it > > ! should see the elevated reference count. No? > > ! > > ! > + /* > > ! > + * We need to call css_get() first, because memcg_uncharge_kmem() > > ! > + * will call css_put() if it sees the memcg is dead. > > ! > + */ > > ! > memcg_kmem_mark_dead(memcg); > > " > > > > Does it make sense to you Tejun? > > Yeah, you're right. We need them. It's still a bummer that mark_dead > has the appearance of proper encapsulation while not really taking > care of synchronization. No objection to put barrier there. You are right it is more natural. > I think it'd make more sense for mark_dead to have the barrier (which > BTW should probably be smp_wmb() instead of wmb()) Yes, smp_wmb sounds like a better fit. > inside it or for the function to be just open-coded. More on this > topic later. > > > > The comment is wrong. I'll fix it. > > > > Ohh, right. "Althouth this might sound strange as this path is called from > > css_offline when the reference might have dropped down to 0 and shouldn't ..." > > > > Sounds better? > > Yeap. > > > > I don't quite like adding a lock not to protect data but just ensure code > > > orders. > > > > Agreed. > > > > > Michal, what's your preference? I want to be sure that everyone is happy > > > so the next version will hopefully be the last version. > > > > I still do not see why the barrier is not needed and the lock seems too > > big hammer. > > Yes, the barrier is necessary but I still think it's unnecessarily > elaborate. Among the locking constructs, the barriesr are the worst - > they don't enforce any structures, people often misunderstand / make > mistakes about them, bugs from misusages are extremely difficult to > trigger and reproduce especially on x86. It's a horrible construct > and should only be used if no other options can meet the performance > requirements required for the path. I am all for simplifying the code. I guess it deserves a separate patch though and it is a bit unrelated to the scope of the series. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx125.postini.com [74.125.245.125]) by kanga.kvack.org (Postfix) with SMTP id 3445F6B0033 for ; Fri, 17 May 2013 03:04:26 -0400 (EDT) Message-ID: <5195D64A.9090000@huawei.com> Date: Fri, 17 May 2013 15:03:38 +0800 From: Li Zefan MIME-Version: 1.0 Subject: [PATCH 3/9] memcg: use css_get() in sock_update_memcg() References: <5195D5F8.7000609@huawei.com> In-Reply-To: <5195D5F8.7000609@huawei.com> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org Use css_get/css_put instead of mem_cgroup_get/put. Note, if at the same time someone is moving @current to a different cgroup and removing the old cgroup, css_tryget() may return false, and sock->sk_cgrp won't be initialized, which is fine. Signed-off-by: Li Zefan Acked-by: KAMEZAWA Hiroyuki Acked-by: Michal Hocko --- mm/memcontrol.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4d0458d..f1320d3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -561,15 +561,15 @@ void sock_update_memcg(struct sock *sk) */ if (sk->sk_cgrp) { BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg)); - mem_cgroup_get(sk->sk_cgrp->memcg); + css_get(&sk->sk_cgrp->memcg->css); return; } rcu_read_lock(); memcg = mem_cgroup_from_task(current); cg_proto = sk->sk_prot->proto_cgroup(memcg); - if (!mem_cgroup_is_root(memcg) && memcg_proto_active(cg_proto)) { - mem_cgroup_get(memcg); + if (!mem_cgroup_is_root(memcg) && + memcg_proto_active(cg_proto) && css_tryget(&memcg->css)) { sk->sk_cgrp = cg_proto; } rcu_read_unlock(); @@ -583,7 +583,7 @@ void sock_release_memcg(struct sock *sk) struct mem_cgroup *memcg; WARN_ON(!sk->sk_cgrp->memcg); memcg = sk->sk_cgrp->memcg; - mem_cgroup_put(memcg); + css_put(&sk->sk_cgrp->memcg->css); } } -- 1.8.0.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx179.postini.com [74.125.245.179]) by kanga.kvack.org (Postfix) with SMTP id 08CAE6B0036 for ; Fri, 17 May 2013 03:04:31 -0400 (EDT) Message-ID: <5195D626.3010406@huawei.com> Date: Fri, 17 May 2013 15:03:02 +0800 From: Li Zefan MIME-Version: 1.0 Subject: [PATCH 1/9] Revert "memcg: avoid dangling reference count in creation failure." References: <5195D5F8.7000609@huawei.com> In-Reply-To: <5195D5F8.7000609@huawei.com> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org This reverts commit e4715f01be697a3730c78f8ffffb595591d6a88c mem_cgroup_put is hierarchy aware so mem_cgroup_put(memcg) already drops an additional reference from all parents so the additional mem_cgrroup_put(parent) potentially causes use-after-free. Cc: # 3.9.x Signed-off-by: Michal Hocko Signed-off-by: Li Zefan Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cb1c9de..5918e90 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6297,8 +6297,6 @@ mem_cgroup_css_online(struct cgroup *cont) * call __mem_cgroup_free, so return directly */ mem_cgroup_put(memcg); - if (parent->use_hierarchy) - mem_cgroup_put(parent); } return error; } -- 1.8.0.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx182.postini.com [74.125.245.182]) by kanga.kvack.org (Postfix) with SMTP id 79F456B0037 for ; Fri, 17 May 2013 03:04:41 -0400 (EDT) Message-ID: <5195D658.5000301@huawei.com> Date: Fri, 17 May 2013 15:03:52 +0800 From: Li Zefan MIME-Version: 1.0 Subject: [PATCH 4/9] memcg: don't use mem_cgroup_get() when creating a kmemcg cache References: <5195D5F8.7000609@huawei.com> In-Reply-To: <5195D5F8.7000609@huawei.com> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org Use css_get()/css_put() instead of mem_cgroup_get()/mem_cgroup_put(). There are two things being done in the current code: First, we acquired a css_ref to make sure that the underlying cgroup would not go away. That is a short lived reference, and it is put as soon as the cache is created. At this point, we acquire a long-lived per-cache memcg reference count to guarantee that the memcg will still be alive. so it is: enqueue: css_get create : memcg_get, css_put destroy: memcg_put So we only need to get rid of the memcg_get, change the memcg_put to css_put, and get rid of the now extra css_put. (This changelog is mostly written by Glauber) Signed-off-by: Li Zefan Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f1320d3..63526f9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3227,7 +3227,7 @@ void memcg_release_cache(struct kmem_cache *s) list_del(&s->memcg_params->list); mutex_unlock(&memcg->slab_caches_mutex); - mem_cgroup_put(memcg); + css_put(&memcg->css); out: kfree(s->memcg_params); } @@ -3387,16 +3387,18 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg, mutex_lock(&memcg_cache_mutex); new_cachep = cachep->memcg_params->memcg_caches[idx]; - if (new_cachep) + if (new_cachep) { + css_put(&memcg->css); goto out; + } new_cachep = kmem_cache_dup(memcg, cachep); if (new_cachep == NULL) { new_cachep = cachep; + css_put(&memcg->css); goto out; } - mem_cgroup_get(memcg); atomic_set(&new_cachep->memcg_params->nr_pages , 0); cachep->memcg_params->memcg_caches[idx] = new_cachep; @@ -3484,8 +3486,6 @@ static void memcg_create_cache_work_func(struct work_struct *w) cw = container_of(w, struct create_work, work); memcg_create_kmem_cache(cw->memcg, cw->cachep); - /* Drop the reference gotten when we enqueued. */ - css_put(&cw->memcg->css); kfree(cw); } -- 1.8.0.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx105.postini.com [74.125.245.105]) by kanga.kvack.org (Postfix) with SMTP id BBFAF6B0033 for ; Fri, 17 May 2013 03:06:00 -0400 (EDT) Message-ID: <5195D6A4.8060109@huawei.com> Date: Fri, 17 May 2013 15:05:08 +0800 From: Li Zefan MIME-Version: 1.0 Subject: [PATCH 8/9] memcg: kill memcg refcnt References: <5195D5F8.7000609@huawei.com> In-Reply-To: <5195D5F8.7000609@huawei.com> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org Now memcg has the same life cycle as its corresponding cgroup. Kill the useless refcnt. Signed-off-by: Li Zefan Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 18 +----------------- 1 file changed, 1 insertion(+), 17 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c6d267a..348126a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -303,8 +303,6 @@ struct mem_cgroup { bool oom_lock; atomic_t under_oom; - atomic_t refcnt; - int swappiness; /* OOM-Killer disable */ int oom_kill_disable; @@ -508,8 +506,6 @@ enum res_type { */ static DEFINE_MUTEX(memcg_create_mutex); -static void mem_cgroup_put(struct mem_cgroup *memcg); - static inline struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s) { @@ -6170,17 +6166,6 @@ static void free_rcu(struct rcu_head *rcu_head) schedule_work(&memcg->work_freeing); } -static void __mem_cgroup_put(struct mem_cgroup *memcg, int count) -{ - if (atomic_sub_and_test(count, &memcg->refcnt)) - call_rcu(&memcg->rcu_freeing, free_rcu); -} - -static void mem_cgroup_put(struct mem_cgroup *memcg) -{ - __mem_cgroup_put(memcg, 1); -} - /* * Returns the parent mem_cgroup in memcgroup hierarchy with hierarchy enabled. */ @@ -6240,7 +6225,6 @@ mem_cgroup_css_alloc(struct cgroup *cont) memcg->last_scanned_node = MAX_NUMNODES; INIT_LIST_HEAD(&memcg->oom_notify); - atomic_set(&memcg->refcnt, 1); memcg->move_charge_at_immigrate = 0; mutex_init(&memcg->thresholds_lock); spin_lock_init(&memcg->move_lock); @@ -6332,7 +6316,7 @@ static void mem_cgroup_css_free(struct cgroup *cont) mem_cgroup_sockets_destroy(memcg); - mem_cgroup_put(memcg); + call_rcu(&memcg->rcu_freeing, free_rcu); } #ifdef CONFIG_MMU -- 1.8.0.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx206.postini.com [74.125.245.206]) by kanga.kvack.org (Postfix) with SMTP id 851256B0036 for ; Fri, 17 May 2013 03:07:39 -0400 (EDT) Message-ID: <5195D710.2040501@huawei.com> Date: Fri, 17 May 2013 15:06:56 +0800 From: Li Zefan MIME-Version: 1.0 Subject: Re: [PATCH 0/12][V3] memcg: make memcg's life cycle the same as cgroup References: <5195D5F8.7000609@huawei.com> In-Reply-To: <5195D5F8.7000609@huawei.com> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org The subject should be "[PATCH 0/9][v3] ..." On 2013/5/17 15:02, Li Zefan wrote: > Hi Andrew, > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx136.postini.com [74.125.245.136]) by kanga.kvack.org (Postfix) with SMTP id B03AB6B0032 for ; Fri, 17 May 2013 14:08:28 -0400 (EDT) Received: by mail-qe0-f53.google.com with SMTP id cz11so2919790qeb.12 for ; Fri, 17 May 2013 11:08:27 -0700 (PDT) Date: Fri, 17 May 2013 11:08:22 -0700 From: Tejun Heo Subject: Re: [PATCH 5/9] memcg: use css_get/put when charging/uncharging kmem Message-ID: <20130517180822.GC12632@mtj.dyndns.org> References: <5195D5F8.7000609@huawei.com> <5195D666.6030408@huawei.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5195D666.6030408@huawei.com> Sender: owner-linux-mm@kvack.org List-ID: To: Li Zefan Cc: Andrew Morton , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org Hey, On Fri, May 17, 2013 at 03:04:06PM +0800, Li Zefan wrote: > + /* > + * Releases a reference taken in kmem_cgroup_css_offline in case > + * this last uncharge is racing with the offlining code or it is > + * outliving the memcg existence. > + * > + * The memory barrier imposed by test&clear is paired with the > + * explicit one in kmem_cgroup_css_offline. Paired with the wmb to achieve what? > + */ > if (memcg_kmem_test_and_clear_dead(memcg)) > - mem_cgroup_put(memcg); > + css_put(&memcg->css); The other side is wmb, so there gotta be something which wants to read which were written before wmb here but the only thing after the barrier is css_put() which doesn't need such thing, so I'm lost on what the barrier pair is achieving here. In general, please be *very* explicit about what's going on whenever something is depending on barrier pairs. It'll make it easier for both the author and reviewers to actually understand what's going on and why it's necessary. ... > @@ -5858,23 +5856,39 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > return mem_cgroup_sockets_init(memcg, ss); > } > > -static void kmem_cgroup_destroy(struct mem_cgroup *memcg) > +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg) > { > - mem_cgroup_sockets_destroy(memcg); > + if (!memcg_kmem_is_active(memcg)) > + return; > > + /* > + * kmem charges can outlive the cgroup. In the case of slab > + * pages, for instance, a page contain objects from various > + * processes. As we prevent from taking a reference for every > + * such allocation we have to be careful when doing uncharge > + * (see memcg_uncharge_kmem) and here during offlining. > + * > + * The idea is that that only the _last_ uncharge which sees > + * the dead memcg will drop the last reference. An additional > + * reference is taken here before the group is marked dead > + * which is then paired with css_put during uncharge resp. here. > + * > + * Although this might sound strange as this path is called when > + * the reference has already dropped down to 0 and shouldn't be > + * incremented anymore (css_tryget would fail) we do not have Hmmm? offline is called on cgroup destruction regardless of css refcnt. The above comment seems a bit misleading. > + * other options because of the kmem allocations lifetime. > + */ > + css_get(&memcg->css); > + > + /* see comment in memcg_uncharge_kmem() */ > + wmb(); > memcg_kmem_mark_dead(memcg); Is the wmb() trying to prevent reordering between css_get() and memcg_kmem_mark_dead()? If so, it isn't necessary - the compiler isn't allowed to reorder two atomic ops (they're all asm volatiles) and the visibility order is guaranteed by the nature of the two operations going on here - both perform modify-and-test on one end of the operations. It could be argued that having memory barriers is better for completeness of mark/test interface but then those barriers should really moved into memcg_kmem_mark_dead() and its clearing counterpart. While it's all clever and dandy, my recommendation would be just using a lock for synchronization. It isn't a hot path. Why be clever? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx118.postini.com [74.125.245.118]) by kanga.kvack.org (Postfix) with SMTP id C51096B0044 for ; Wed, 22 May 2013 04:38:10 -0400 (EDT) Message-ID: <519C838B.9060609@huawei.com> Date: Wed, 22 May 2013 16:36:27 +0800 From: Li Zefan MIME-Version: 1.0 Subject: Re: [PATCH 5/9] memcg: use css_get/put when charging/uncharging kmem References: <5195D5F8.7000609@huawei.com> <5195D666.6030408@huawei.com> <20130517180822.GC12632@mtj.dyndns.org> In-Reply-To: <20130517180822.GC12632@mtj.dyndns.org> Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo Cc: Andrew Morton , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org On 2013/5/18 2:08, Tejun Heo wrote: > Hey, > > On Fri, May 17, 2013 at 03:04:06PM +0800, Li Zefan wrote: >> + /* >> + * Releases a reference taken in kmem_cgroup_css_offline in case >> + * this last uncharge is racing with the offlining code or it is >> + * outliving the memcg existence. >> + * >> + * The memory barrier imposed by test&clear is paired with the >> + * explicit one in kmem_cgroup_css_offline. > > Paired with the wmb to achieve what? > >> + */ >> if (memcg_kmem_test_and_clear_dead(memcg)) >> - mem_cgroup_put(memcg); >> + css_put(&memcg->css); > > The other side is wmb, so there gotta be something which wants to read > which were written before wmb here but the only thing after the > barrier is css_put() which doesn't need such thing, so I'm lost on > what the barrier pair is achieving here. > > In general, please be *very* explicit about what's going on whenever > something is depending on barrier pairs. It'll make it easier for > both the author and reviewers to actually understand what's going on > and why it's necessary. > > ... >> @@ -5858,23 +5856,39 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) >> return mem_cgroup_sockets_init(memcg, ss); >> } >> >> -static void kmem_cgroup_destroy(struct mem_cgroup *memcg) >> +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg) >> { >> - mem_cgroup_sockets_destroy(memcg); >> + if (!memcg_kmem_is_active(memcg)) >> + return; >> >> + /* >> + * kmem charges can outlive the cgroup. In the case of slab >> + * pages, for instance, a page contain objects from various >> + * processes. As we prevent from taking a reference for every >> + * such allocation we have to be careful when doing uncharge >> + * (see memcg_uncharge_kmem) and here during offlining. >> + * >> + * The idea is that that only the _last_ uncharge which sees >> + * the dead memcg will drop the last reference. An additional >> + * reference is taken here before the group is marked dead >> + * which is then paired with css_put during uncharge resp. here. >> + * >> + * Although this might sound strange as this path is called when >> + * the reference has already dropped down to 0 and shouldn't be >> + * incremented anymore (css_tryget would fail) we do not have > > Hmmm? offline is called on cgroup destruction regardless of css > refcnt. The above comment seems a bit misleading. > The comment is wrong. I'll fix it. >> + * other options because of the kmem allocations lifetime. >> + */ >> + css_get(&memcg->css); >> + >> + /* see comment in memcg_uncharge_kmem() */ >> + wmb(); >> memcg_kmem_mark_dead(memcg); > > Is the wmb() trying to prevent reordering between css_get() and > memcg_kmem_mark_dead()? If so, it isn't necessary - the compiler > isn't allowed to reorder two atomic ops (they're all asm volatiles) > and the visibility order is guaranteed by the nature of the two > operations going on here - both perform modify-and-test on one end of > the operations. > Yeah, I think you're right. > It could be argued that having memory barriers is better for > completeness of mark/test interface but then those barriers should > really moved into memcg_kmem_mark_dead() and its clearing counterpart. > > While it's all clever and dandy, my recommendation would be just using > a lock for synchronization. It isn't a hot path. Why be clever? > I don't quite like adding a lock not to protect data but just ensure code orders. Michal, what's your preference? I want to be sure that everyone is happy so the next version will hopefully be the last version. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx150.postini.com [74.125.245.150]) by kanga.kvack.org (Postfix) with SMTP id 9F8F96B0034 for ; Fri, 24 May 2013 03:54:24 -0400 (EDT) Date: Fri, 24 May 2013 09:54:20 +0200 From: Michal Hocko Subject: Re: [PATCH 5/9] memcg: use css_get/put when charging/uncharging kmem Message-ID: <20130524075420.GA24813@dhcp22.suse.cz> References: <5195D5F8.7000609@huawei.com> <5195D666.6030408@huawei.com> <20130517180822.GC12632@mtj.dyndns.org> <519C838B.9060609@huawei.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <519C838B.9060609@huawei.com> Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo , Li Zefan Cc: Andrew Morton , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org Sorry, I have missed this. CCing would help. Anyway putting myself to CC now :P On Wed 22-05-13 16:36:27, Li Zefan wrote: > On 2013/5/18 2:08, Tejun Heo wrote: > > Hey, > > > > On Fri, May 17, 2013 at 03:04:06PM +0800, Li Zefan wrote: > >> + /* > >> + * Releases a reference taken in kmem_cgroup_css_offline in case > >> + * this last uncharge is racing with the offlining code or it is > >> + * outliving the memcg existence. > >> + * > >> + * The memory barrier imposed by test&clear is paired with the > >> + * explicit one in kmem_cgroup_css_offline. > > > > Paired with the wmb to achieve what? https://lkml.org/lkml/2013/4/4/190 " ! > + css_get(&memcg->css); ! I think that you need a write memory barrier here because css_get ! nor memcg_kmem_mark_dead implies it. memcg_uncharge_kmem uses ! memcg_kmem_test_and_clear_dead which imply a full memory barrier but it ! should see the elevated reference count. No? ! ! > + /* ! > + * We need to call css_get() first, because memcg_uncharge_kmem() ! > + * will call css_put() if it sees the memcg is dead. ! > + */ ! > memcg_kmem_mark_dead(memcg); " Does it make sense to you Tejun? > > > >> + */ > >> if (memcg_kmem_test_and_clear_dead(memcg)) > >> - mem_cgroup_put(memcg); > >> + css_put(&memcg->css); > > > > The other side is wmb, so there gotta be something which wants to read > > which were written before wmb here but the only thing after the > > barrier is css_put() which doesn't need such thing, so I'm lost on > > what the barrier pair is achieving here. > > > > In general, please be *very* explicit about what's going on whenever > > something is depending on barrier pairs. It'll make it easier for > > both the author and reviewers to actually understand what's going on > > and why it's necessary. > > > > ... > >> @@ -5858,23 +5856,39 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > >> return mem_cgroup_sockets_init(memcg, ss); > >> } > >> > >> -static void kmem_cgroup_destroy(struct mem_cgroup *memcg) > >> +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg) > >> { > >> - mem_cgroup_sockets_destroy(memcg); > >> + if (!memcg_kmem_is_active(memcg)) > >> + return; > >> > >> + /* > >> + * kmem charges can outlive the cgroup. In the case of slab > >> + * pages, for instance, a page contain objects from various > >> + * processes. As we prevent from taking a reference for every > >> + * such allocation we have to be careful when doing uncharge > >> + * (see memcg_uncharge_kmem) and here during offlining. > >> + * > >> + * The idea is that that only the _last_ uncharge which sees > >> + * the dead memcg will drop the last reference. An additional > >> + * reference is taken here before the group is marked dead > >> + * which is then paired with css_put during uncharge resp. here. > >> + * > >> + * Although this might sound strange as this path is called when > >> + * the reference has already dropped down to 0 and shouldn't be > >> + * incremented anymore (css_tryget would fail) we do not have > > > > Hmmm? offline is called on cgroup destruction regardless of css > > refcnt. The above comment seems a bit misleading. > > > > The comment is wrong. I'll fix it. Ohh, right. "Althouth this might sound strange as this path is called from css_offline when the reference might have dropped down to 0 and shouldn't ..." Sounds better? > >> + * other options because of the kmem allocations lifetime. > >> + */ > >> + css_get(&memcg->css); > >> + > >> + /* see comment in memcg_uncharge_kmem() */ > >> + wmb(); > >> memcg_kmem_mark_dead(memcg); > > > > Is the wmb() trying to prevent reordering between css_get() and > > memcg_kmem_mark_dead()? If so, it isn't necessary - the compiler > > isn't allowed to reorder two atomic ops (they're all asm volatiles) > > and the visibility order is guaranteed by the nature of the two > > operations going on here - both perform modify-and-test on one end of > > the operations. As I have copied my comment from the earlier thread above. css_get does atomic_add which doesn't imply any barrier AFAIK and memcg_kmem_mark_dead uses a simple set_bit which doesn't imply it either. What am I missing? > > > > Yeah, I think you're right. > > > It could be argued that having memory barriers is better for > > completeness of mark/test interface but then those barriers should > > really moved into memcg_kmem_mark_dead() and its clearing counterpart. > > > > While it's all clever and dandy, my recommendation would be just using > > a lock for synchronization. It isn't a hot path. Why be clever? > > > > I don't quite like adding a lock not to protect data but just ensure code > orders. Agreed. > Michal, what's your preference? I want to be sure that everyone is happy > so the next version will hopefully be the last version. I still do not see why the barrier is not needed and the lock seems too big hammer. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx205.postini.com [74.125.245.205]) by kanga.kvack.org (Postfix) with SMTP id 6F3306B0034 for ; Thu, 30 May 2013 11:12:25 -0400 (EDT) Received: by mail-wi0-f173.google.com with SMTP id hi5so4666852wib.12 for ; Thu, 30 May 2013 08:12:23 -0700 (PDT) Date: Thu, 30 May 2013 17:12:20 +0200 From: Michal Hocko Subject: Re: [PATCH 5/9] memcg: use css_get/put when charging/uncharging kmem Message-ID: <20130530151220.GB18155@dhcp22.suse.cz> References: <5195D5F8.7000609@huawei.com> <5195D666.6030408@huawei.com> <20130517180822.GC12632@mtj.dyndns.org> <519C838B.9060609@huawei.com> <20130524075420.GA24813@dhcp22.suse.cz> <20130530054852.GA9305@mtj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130530054852.GA9305@mtj.dyndns.org> Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo Cc: Li Zefan , Andrew Morton , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org On Thu 30-05-13 14:48:52, Tejun Heo wrote: > Hello, > > Sorry about the delay. Have been and still am traveling. > > On Fri, May 24, 2013 at 09:54:20AM +0200, Michal Hocko wrote: > > > > On Fri, May 17, 2013 at 03:04:06PM +0800, Li Zefan wrote: > > > >> + /* > > > >> + * Releases a reference taken in kmem_cgroup_css_offline in case > > > >> + * this last uncharge is racing with the offlining code or it is > > > >> + * outliving the memcg existence. > > > >> + * > > > >> + * The memory barrier imposed by test&clear is paired with the > > > >> + * explicit one in kmem_cgroup_css_offline. > > > > > > > > Paired with the wmb to achieve what? > > > > https://lkml.org/lkml/2013/4/4/190 > > " > > ! > + css_get(&memcg->css); > > ! I think that you need a write memory barrier here because css_get > > ! nor memcg_kmem_mark_dead implies it. memcg_uncharge_kmem uses > > ! memcg_kmem_test_and_clear_dead which imply a full memory barrier but it > > ! should see the elevated reference count. No? > > ! > > ! > + /* > > ! > + * We need to call css_get() first, because memcg_uncharge_kmem() > > ! > + * will call css_put() if it sees the memcg is dead. > > ! > + */ > > ! > memcg_kmem_mark_dead(memcg); > > " > > > > Does it make sense to you Tejun? > > Yeah, you're right. We need them. It's still a bummer that mark_dead > has the appearance of proper encapsulation while not really taking > care of synchronization. No objection to put barrier there. You are right it is more natural. > I think it'd make more sense for mark_dead to have the barrier (which > BTW should probably be smp_wmb() instead of wmb()) Yes, smp_wmb sounds like a better fit. > inside it or for the function to be just open-coded. More on this > topic later. > > > > The comment is wrong. I'll fix it. > > > > Ohh, right. "Althouth this might sound strange as this path is called from > > css_offline when the reference might have dropped down to 0 and shouldn't ..." > > > > Sounds better? > > Yeap. > > > > I don't quite like adding a lock not to protect data but just ensure code > > > orders. > > > > Agreed. > > > > > Michal, what's your preference? I want to be sure that everyone is happy > > > so the next version will hopefully be the last version. > > > > I still do not see why the barrier is not needed and the lock seems too > > big hammer. > > Yes, the barrier is necessary but I still think it's unnecessarily > elaborate. Among the locking constructs, the barriesr are the worst - > they don't enforce any structures, people often misunderstand / make > mistakes about them, bugs from misusages are extremely difficult to > trigger and reproduce especially on x86. It's a horrible construct > and should only be used if no other options can meet the performance > requirements required for the path. I am all for simplifying the code. I guess it deserves a separate patch though and it is a bit unrelated to the scope of the series. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755238Ab3EQHD2 (ORCPT ); Fri, 17 May 2013 03:03:28 -0400 Received: from szxga01-in.huawei.com ([119.145.14.64]:46318 "EHLO szxga01-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753669Ab3EQHD0 (ORCPT ); Fri, 17 May 2013 03:03:26 -0400 Message-ID: <5195D5F8.7000609@huawei.com> Date: Fri, 17 May 2013 15:02:16 +0800 From: Li Zefan User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130509 Thunderbird/17.0.6 MIME-Version: 1.0 To: Andrew Morton CC: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , Subject: [PATCH 0/12][V3] memcg: make memcg's life cycle the same as cgroup Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.135.68.215] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Andrew, All the patches in this patchset has been acked by Michal and Kamezawa-san, and it's ready to be merged into -mm. Changes since v2: - rebased to 3.10-rc1 - collected some acks - the two memcg bug fixes has been merged into mainline - the cgroup core patch has been merged into mainline Changes since v1: - wrote better changelog and added acked-by and reviewed-by tags - revised some comments as suggested by Michal - added a wmb() in kmem_cgroup_css_offline(), pointed out by Michal - fixed a bug which causes a css_put() never be called Now memcg has its own refcnt, so when a cgroup is destroyed, the memcg can still be alive. This patchset converts memcg to always use css_get/put, so memcg will have the same life cycle as its corresponding cgroup. The historical reason that memcg didn't use css_get in some cases, is that cgroup couldn't be removed if there're still css refs. The situation has changed so that rmdir a cgroup will succeed regardless css refs, but won't be freed until css refs goes down to 0. Since the introduction of kmemcg, the memcg refcnt handling grows even more complicated. This patchset greately simplifies memcg's life cycle management. Also, after those changes, we can convert memcg to use cgroup->id, and then we can kill css_id. The first 2 patches are bug fixes that can go into 3.10, and the rest are for 3.10. Li Zefan (7): memcg: use css_get() in sock_update_memcg() memcg: don't use mem_cgroup_get() when creating a kmemcg cache memcg: use css_get/put when charging/uncharging kmem memcg: use css_get/put for swap memcg memcg: don't need to get a reference to the parent memcg: kill memcg refcnt memcg: don't need to free memcg via RCU or workqueue Michal Hocko (2): Revert "memcg: avoid dangling reference count in creation failure." memcg, kmem: fix reference count handling on the error path --- mm/memcontrol.c | 204 ++++++++++++++++++++++-------------------------------------- 1 file changed, 73 insertions(+), 131 deletions(-) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755307Ab3EQHET (ORCPT ); Fri, 17 May 2013 03:04:19 -0400 Received: from szxga02-in.huawei.com ([119.145.14.65]:33452 "EHLO szxga02-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754589Ab3EQHES (ORCPT ); Fri, 17 May 2013 03:04:18 -0400 Message-ID: <5195D63B.4010202@huawei.com> Date: Fri, 17 May 2013 15:03:23 +0800 From: Li Zefan User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130509 Thunderbird/17.0.6 MIME-Version: 1.0 To: Andrew Morton CC: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , Subject: [PATCH 2/9] memcg, kmem: fix reference count handling on the error path References: <5195D5F8.7000609@huawei.com> In-Reply-To: <5195D5F8.7000609@huawei.com> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.135.68.215] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org mem_cgroup_css_online calls mem_cgroup_put if memcg_init_kmem fails. This is not correct because only memcg_propagate_kmem takes an additional reference while mem_cgroup_sockets_init is allowed to fail as well (although no current implementation fails) but it doesn't take any reference. This all suggests that it should be memcg_propagate_kmem that should clean up after itself so this patch moves mem_cgroup_put over there. Unfortunately this is not that easy (as pointed out by Li Zefan) because memcg_kmem_mark_dead marks the group dead (KMEM_ACCOUNTED_DEAD) if it is marked active (KMEM_ACCOUNTED_ACTIVE) which is the case even if memcg_propagate_kmem fails so the additional reference is dropped in that case in kmem_cgroup_destroy which means that the reference would be dropped two times. The easiest way then would be to simply remove mem_cgrroup_put from mem_cgroup_css_online and rely on kmem_cgroup_destroy doing the right thing. Cc: # 3.8+ Signed-off-by: Michal Hocko Signed-off-by: Li Zefan Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 8 -------- 1 file changed, 8 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5918e90..4d0458d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6290,14 +6290,6 @@ mem_cgroup_css_online(struct cgroup *cont) error = memcg_init_kmem(memcg, &mem_cgroup_subsys); mutex_unlock(&memcg_create_mutex); - if (error) { - /* - * We call put now because our (and parent's) refcnts - * are already in place. mem_cgroup_put() will internally - * call __mem_cgroup_free, so return directly - */ - mem_cgroup_put(memcg); - } return error; } -- 1.8.0.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755332Ab3EQHEV (ORCPT ); Fri, 17 May 2013 03:04:21 -0400 Received: from szxga02-in.huawei.com ([119.145.14.65]:33457 "EHLO szxga02-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753807Ab3EQHEU (ORCPT ); Fri, 17 May 2013 03:04:20 -0400 Message-ID: <5195D626.3010406@huawei.com> Date: Fri, 17 May 2013 15:03:02 +0800 From: Li Zefan User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130509 Thunderbird/17.0.6 MIME-Version: 1.0 To: Andrew Morton CC: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , Subject: [PATCH 1/9] Revert "memcg: avoid dangling reference count in creation failure." References: <5195D5F8.7000609@huawei.com> In-Reply-To: <5195D5F8.7000609@huawei.com> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.135.68.215] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This reverts commit e4715f01be697a3730c78f8ffffb595591d6a88c mem_cgroup_put is hierarchy aware so mem_cgroup_put(memcg) already drops an additional reference from all parents so the additional mem_cgrroup_put(parent) potentially causes use-after-free. Cc: # 3.9.x Signed-off-by: Michal Hocko Signed-off-by: Li Zefan Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cb1c9de..5918e90 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6297,8 +6297,6 @@ mem_cgroup_css_online(struct cgroup *cont) * call __mem_cgroup_free, so return directly */ mem_cgroup_put(memcg); - if (parent->use_hierarchy) - mem_cgroup_put(parent); } return error; } -- 1.8.0.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755363Ab3EQHEe (ORCPT ); Fri, 17 May 2013 03:04:34 -0400 Received: from szxga01-in.huawei.com ([119.145.14.64]:47041 "EHLO szxga01-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753807Ab3EQHEc (ORCPT ); Fri, 17 May 2013 03:04:32 -0400 Message-ID: <5195D658.5000301@huawei.com> Date: Fri, 17 May 2013 15:03:52 +0800 From: Li Zefan User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130509 Thunderbird/17.0.6 MIME-Version: 1.0 To: Andrew Morton CC: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , Subject: [PATCH 4/9] memcg: don't use mem_cgroup_get() when creating a kmemcg cache References: <5195D5F8.7000609@huawei.com> In-Reply-To: <5195D5F8.7000609@huawei.com> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.135.68.215] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Use css_get()/css_put() instead of mem_cgroup_get()/mem_cgroup_put(). There are two things being done in the current code: First, we acquired a css_ref to make sure that the underlying cgroup would not go away. That is a short lived reference, and it is put as soon as the cache is created. At this point, we acquire a long-lived per-cache memcg reference count to guarantee that the memcg will still be alive. so it is: enqueue: css_get create : memcg_get, css_put destroy: memcg_put So we only need to get rid of the memcg_get, change the memcg_put to css_put, and get rid of the now extra css_put. (This changelog is mostly written by Glauber) Signed-off-by: Li Zefan Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f1320d3..63526f9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3227,7 +3227,7 @@ void memcg_release_cache(struct kmem_cache *s) list_del(&s->memcg_params->list); mutex_unlock(&memcg->slab_caches_mutex); - mem_cgroup_put(memcg); + css_put(&memcg->css); out: kfree(s->memcg_params); } @@ -3387,16 +3387,18 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg, mutex_lock(&memcg_cache_mutex); new_cachep = cachep->memcg_params->memcg_caches[idx]; - if (new_cachep) + if (new_cachep) { + css_put(&memcg->css); goto out; + } new_cachep = kmem_cache_dup(memcg, cachep); if (new_cachep == NULL) { new_cachep = cachep; + css_put(&memcg->css); goto out; } - mem_cgroup_get(memcg); atomic_set(&new_cachep->memcg_params->nr_pages , 0); cachep->memcg_params->memcg_caches[idx] = new_cachep; @@ -3484,8 +3486,6 @@ static void memcg_create_cache_work_func(struct work_struct *w) cw = container_of(w, struct create_work, work); memcg_create_kmem_cache(cw->memcg, cw->cachep); - /* Drop the reference gotten when we enqueued. */ - css_put(&cw->memcg->css); kfree(cw); } -- 1.8.0.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755391Ab3EQHFI (ORCPT ); Fri, 17 May 2013 03:05:08 -0400 Received: from szxga02-in.huawei.com ([119.145.14.65]:34054 "EHLO szxga02-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754558Ab3EQHFC (ORCPT ); Fri, 17 May 2013 03:05:02 -0400 Message-ID: <5195D666.6030408@huawei.com> Date: Fri, 17 May 2013 15:04:06 +0800 From: Li Zefan User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130509 Thunderbird/17.0.6 MIME-Version: 1.0 To: Andrew Morton CC: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , Subject: [PATCH 5/9] memcg: use css_get/put when charging/uncharging kmem References: <5195D5F8.7000609@huawei.com> In-Reply-To: <5195D5F8.7000609@huawei.com> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.135.68.215] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Use css_get/put instead of mem_cgroup_get/put. We can't do a simple replacement, because here mem_cgroup_put() is called during mem_cgroup_css_free(), while mem_cgroup_css_free() won't be called until css refcnt goes down to 0. Instead we increment css refcnt in mem_cgroup_css_offline(), and then check if there's still kmem charges. If not, css refcnt will be decremented immediately, otherwise the refcnt won't be decremented when kmem charges goes down to 0. v2: - added wmb() in kmem_cgroup_css_offline(), pointed out by Michal - revised comments as suggested by Michal - fixed to check if kmem is activated in kmem_cgroup_css_offline() Signed-off-by: Li Zefan Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 66 +++++++++++++++++++++++++++++++++++---------------------- 1 file changed, 41 insertions(+), 25 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 63526f9..2f0d117 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3033,8 +3033,16 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size) if (res_counter_uncharge(&memcg->kmem, size)) return; + /* + * Releases a reference taken in kmem_cgroup_css_offline in case + * this last uncharge is racing with the offlining code or it is + * outliving the memcg existence. + * + * The memory barrier imposed by test&clear is paired with the + * explicit one in kmem_cgroup_css_offline. + */ if (memcg_kmem_test_and_clear_dead(memcg)) - mem_cgroup_put(memcg); + css_put(&memcg->css); } void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep) @@ -5130,14 +5138,6 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val) * starts accounting before all call sites are patched */ memcg_kmem_set_active(memcg); - - /* - * kmem charges can outlive the cgroup. In the case of slab - * pages, for instance, a page contain objects from various - * processes, so it is unfeasible to migrate them away. We - * need to reference count the memcg because of that. - */ - mem_cgroup_get(memcg); } else ret = res_counter_set_limit(&memcg->kmem, val); out: @@ -5170,12 +5170,10 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg) goto out; /* - * destroy(), called if we fail, will issue static_key_slow_inc() and - * mem_cgroup_put() if kmem is enabled. We have to either call them - * unconditionally, or clear the KMEM_ACTIVE flag. I personally find - * this more consistent, since it always leads to the same destroy path + * __mem_cgroup_free() will issue static_key_slow_dec() because this + * memcg is active already. If the later initialization fails then the + * cgroup core triggers the cleanup so we do not have to do it here. */ - mem_cgroup_get(memcg); static_key_slow_inc(&memcg_kmem_enabled_key); mutex_lock(&set_limit_mutex); @@ -5858,23 +5856,39 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) return mem_cgroup_sockets_init(memcg, ss); } -static void kmem_cgroup_destroy(struct mem_cgroup *memcg) +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg) { - mem_cgroup_sockets_destroy(memcg); + if (!memcg_kmem_is_active(memcg)) + return; + /* + * kmem charges can outlive the cgroup. In the case of slab + * pages, for instance, a page contain objects from various + * processes. As we prevent from taking a reference for every + * such allocation we have to be careful when doing uncharge + * (see memcg_uncharge_kmem) and here during offlining. + * + * The idea is that that only the _last_ uncharge which sees + * the dead memcg will drop the last reference. An additional + * reference is taken here before the group is marked dead + * which is then paired with css_put during uncharge resp. here. + * + * Although this might sound strange as this path is called when + * the reference has already dropped down to 0 and shouldn't be + * incremented anymore (css_tryget would fail) we do not have + * other options because of the kmem allocations lifetime. + */ + css_get(&memcg->css); + + /* see comment in memcg_uncharge_kmem() */ + wmb(); memcg_kmem_mark_dead(memcg); if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0) return; - /* - * Charges already down to 0, undo mem_cgroup_get() done in the charge - * path here, being careful not to race with memcg_uncharge_kmem: it is - * possible that the charges went down to 0 between mark_dead and the - * res_counter read, so in that case, we don't need the put - */ if (memcg_kmem_test_and_clear_dead(memcg)) - mem_cgroup_put(memcg); + css_put(&memcg->css); } #else static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) @@ -5882,7 +5896,7 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) return 0; } -static void kmem_cgroup_destroy(struct mem_cgroup *memcg) +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg) { } #endif @@ -6315,6 +6329,8 @@ static void mem_cgroup_css_offline(struct cgroup *cont) { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); + kmem_cgroup_css_offline(memcg); + mem_cgroup_invalidate_reclaim_iterators(memcg); mem_cgroup_reparent_charges(memcg); mem_cgroup_destroy_all_caches(memcg); @@ -6324,7 +6340,7 @@ static void mem_cgroup_css_free(struct cgroup *cont) { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); - kmem_cgroup_destroy(memcg); + mem_cgroup_sockets_destroy(memcg); mem_cgroup_put(memcg); } -- 1.8.0.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755397Ab3EQHF2 (ORCPT ); Fri, 17 May 2013 03:05:28 -0400 Received: from szxga01-in.huawei.com ([119.145.14.64]:47638 "EHLO szxga01-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754918Ab3EQHFZ (ORCPT ); Fri, 17 May 2013 03:05:25 -0400 Message-ID: <5195D68A.8000205@huawei.com> Date: Fri, 17 May 2013 15:04:42 +0800 From: Li Zefan User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130509 Thunderbird/17.0.6 MIME-Version: 1.0 To: Andrew Morton CC: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , Subject: [PATCH 7/9] memcg: don't need to get a reference to the parent References: <5195D5F8.7000609@huawei.com> In-Reply-To: <5195D5F8.7000609@huawei.com> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.135.68.215] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The cgroup core guarantees it's always safe to access the parent. v2: - added a comment in mem_cgroup_put() as suggested by Michal - removed mem_cgroup_get(), otherwise gcc will warn that it's not used Signed-off-by: Li Zefan Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 19 +++---------------- 1 file changed, 3 insertions(+), 16 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b11ea88..c6d267a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -508,7 +508,6 @@ enum res_type { */ static DEFINE_MUTEX(memcg_create_mutex); -static void mem_cgroup_get(struct mem_cgroup *memcg); static void mem_cgroup_put(struct mem_cgroup *memcg); static inline @@ -6171,19 +6170,10 @@ static void free_rcu(struct rcu_head *rcu_head) schedule_work(&memcg->work_freeing); } -static void mem_cgroup_get(struct mem_cgroup *memcg) -{ - atomic_inc(&memcg->refcnt); -} - static void __mem_cgroup_put(struct mem_cgroup *memcg, int count) { - if (atomic_sub_and_test(count, &memcg->refcnt)) { - struct mem_cgroup *parent = parent_mem_cgroup(memcg); + if (atomic_sub_and_test(count, &memcg->refcnt)) call_rcu(&memcg->rcu_freeing, free_rcu); - if (parent) - mem_cgroup_put(parent); - } } static void mem_cgroup_put(struct mem_cgroup *memcg) @@ -6286,12 +6276,9 @@ mem_cgroup_css_online(struct cgroup *cont) res_counter_init(&memcg->kmem, &parent->kmem); /* - * We increment refcnt of the parent to ensure that we can - * safely access it on res_counter_charge/uncharge. - * This refcnt will be decremented when freeing this - * mem_cgroup(see mem_cgroup_put). + * No need to take a reference to the parent because cgroup + * core guarantees its existence. */ - mem_cgroup_get(parent); } else { res_counter_init(&memcg->res, NULL); res_counter_init(&memcg->memsw, NULL); -- 1.8.0.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755202Ab3EQHER (ORCPT ); Fri, 17 May 2013 03:04:17 -0400 Received: from szxga01-in.huawei.com ([119.145.14.64]:46852 "EHLO szxga01-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753807Ab3EQHEQ (ORCPT ); Fri, 17 May 2013 03:04:16 -0400 Message-ID: <5195D64A.9090000@huawei.com> Date: Fri, 17 May 2013 15:03:38 +0800 From: Li Zefan User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130509 Thunderbird/17.0.6 MIME-Version: 1.0 To: Andrew Morton CC: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , Subject: [PATCH 3/9] memcg: use css_get() in sock_update_memcg() References: <5195D5F8.7000609@huawei.com> In-Reply-To: <5195D5F8.7000609@huawei.com> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.135.68.215] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Use css_get/css_put instead of mem_cgroup_get/put. Note, if at the same time someone is moving @current to a different cgroup and removing the old cgroup, css_tryget() may return false, and sock->sk_cgrp won't be initialized, which is fine. Signed-off-by: Li Zefan Acked-by: KAMEZAWA Hiroyuki Acked-by: Michal Hocko --- mm/memcontrol.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4d0458d..f1320d3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -561,15 +561,15 @@ void sock_update_memcg(struct sock *sk) */ if (sk->sk_cgrp) { BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg)); - mem_cgroup_get(sk->sk_cgrp->memcg); + css_get(&sk->sk_cgrp->memcg->css); return; } rcu_read_lock(); memcg = mem_cgroup_from_task(current); cg_proto = sk->sk_prot->proto_cgroup(memcg); - if (!mem_cgroup_is_root(memcg) && memcg_proto_active(cg_proto)) { - mem_cgroup_get(memcg); + if (!mem_cgroup_is_root(memcg) && + memcg_proto_active(cg_proto) && css_tryget(&memcg->css)) { sk->sk_cgrp = cg_proto; } rcu_read_unlock(); @@ -583,7 +583,7 @@ void sock_release_memcg(struct sock *sk) struct mem_cgroup *memcg; WARN_ON(!sk->sk_cgrp->memcg); memcg = sk->sk_cgrp->memcg; - mem_cgroup_put(memcg); + css_put(&sk->sk_cgrp->memcg->css); } } -- 1.8.0.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755361Ab3EQHFG (ORCPT ); Fri, 17 May 2013 03:05:06 -0400 Received: from szxga01-in.huawei.com ([119.145.14.64]:47481 "EHLO szxga01-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754629Ab3EQHFE (ORCPT ); Fri, 17 May 2013 03:05:04 -0400 Message-ID: <5195D679.5080100@huawei.com> Date: Fri, 17 May 2013 15:04:25 +0800 From: Li Zefan User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130509 Thunderbird/17.0.6 MIME-Version: 1.0 To: Andrew Morton CC: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , Subject: [PATCH 6/9] memcg: use css_get/put for swap memcg References: <5195D5F8.7000609@huawei.com> In-Reply-To: <5195D5F8.7000609@huawei.com> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.135.68.215] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Use css_get/put instead of mem_cgroup_get/put. A simple replacement will do. The historical reason that memcg has its own refcnt instead of always using css_get/put, is that cgroup couldn't be removed if there're still css refs, so css refs can't be used as long-lived reference. The situation has changed so that rmdir a cgroup will succeed regardless css refs, but won't be freed until css refs goes down to 0. Signed-off-by: Li Zefan Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2f0d117..b11ea88 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4185,12 +4185,12 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype, unlock_page_cgroup(pc); /* * even after unlock, we have memcg->res.usage here and this memcg - * will never be freed. + * will never be freed, so it's safe to call css_get(). */ memcg_check_events(memcg, page); if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) { mem_cgroup_swap_statistics(memcg, true); - mem_cgroup_get(memcg); + css_get(&memcg->css); } /* * Migration does not charge the res_counter for the @@ -4290,7 +4290,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout) /* * record memcg information, if swapout && memcg != NULL, - * mem_cgroup_get() was called in uncharge(). + * css_get() was called in uncharge(). */ if (do_swap_account && swapout && memcg) swap_cgroup_record(ent, css_id(&memcg->css)); @@ -4321,7 +4321,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent) if (!mem_cgroup_is_root(memcg)) res_counter_uncharge(&memcg->memsw, PAGE_SIZE); mem_cgroup_swap_statistics(memcg, false); - mem_cgroup_put(memcg); + css_put(&memcg->css); } rcu_read_unlock(); } @@ -4355,11 +4355,14 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry, * This function is only called from task migration context now. * It postpones res_counter and refcount handling till the end * of task migration(mem_cgroup_clear_mc()) for performance - * improvement. But we cannot postpone mem_cgroup_get(to) - * because if the process that has been moved to @to does - * swap-in, the refcount of @to might be decreased to 0. + * improvement. But we cannot postpone css_get(to) because if + * the process that has been moved to @to does swap-in, the + * refcount of @to might be decreased to 0. + * + * We are in attach() phase, so the cgroup is guaranteed to be + * alive, so we can just call css_get(). */ - mem_cgroup_get(to); + css_get(&to->css); return 0; } return -EINVAL; @@ -6651,6 +6654,7 @@ static void __mem_cgroup_clear_mc(void) { struct mem_cgroup *from = mc.from; struct mem_cgroup *to = mc.to; + int i; /* we must uncharge all the leftover precharges from mc.to */ if (mc.precharge) { @@ -6671,7 +6675,9 @@ static void __mem_cgroup_clear_mc(void) if (!mem_cgroup_is_root(mc.from)) res_counter_uncharge(&mc.from->memsw, PAGE_SIZE * mc.moved_swap); - __mem_cgroup_put(mc.from, mc.moved_swap); + + for (i = 0; i < mc.moved_swap; i++) + css_put(&mc.from->css); if (!mem_cgroup_is_root(mc.to)) { /* @@ -6681,7 +6687,7 @@ static void __mem_cgroup_clear_mc(void) res_counter_uncharge(&mc.to->res, PAGE_SIZE * mc.moved_swap); } - /* we've already done mem_cgroup_get(mc.to) */ + /* we've already done css_get(mc.to) */ mc.moved_swap = 0; } memcg_oom_recover(from); -- 1.8.0.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755322Ab3EQHHF (ORCPT ); Fri, 17 May 2013 03:07:05 -0400 Received: from szxga02-in.huawei.com ([119.145.14.65]:26606 "EHLO szxga02-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755259Ab3EQHHB (ORCPT ); Fri, 17 May 2013 03:07:01 -0400 Message-ID: <5195D6BF.3020907@huawei.com> Date: Fri, 17 May 2013 15:05:35 +0800 From: Li Zefan User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130509 Thunderbird/17.0.6 MIME-Version: 1.0 To: Andrew Morton CC: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , Subject: [PATCH 9/9] memcg: don't need to free memcg via RCU or workqueue References: <5195D5F8.7000609@huawei.com> In-Reply-To: <5195D5F8.7000609@huawei.com> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.135.68.215] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Now memcg has the same life cycle with its corresponding cgroup, and a cgroup is freed via RCU and then mem_cgroup_css_free() will be called in a work function, so we can simply call __mem_cgroup_free() in mem_cgroup_css_free(). This actually reverts 59927fb984de1703c67bc640c3e522d8b5276c73 ("memcg: free mem_cgroup by RCU to fix oops"). Cc: Hugh Dickins Signed-off-by: Li Zefan Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 51 +++++---------------------------------------------- 1 file changed, 5 insertions(+), 46 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 348126a..eb27b58 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -267,28 +267,10 @@ struct mem_cgroup { /* vmpressure notifications */ struct vmpressure vmpressure; - union { - /* - * the counter to account for mem+swap usage. - */ - struct res_counter memsw; - - /* - * rcu_freeing is used only when freeing struct mem_cgroup, - * so put it into a union to avoid wasting more memory. - * It must be disjoint from the css field. It could be - * in a union with the res field, but res plays a much - * larger part in mem_cgroup life than memsw, and might - * be of interest, even at time of free, when debugging. - * So share rcu_head with the less interesting memsw. - */ - struct rcu_head rcu_freeing; - /* - * We also need some space for a worker in deferred freeing. - * By the time we call it, rcu_freeing is no longer in use. - */ - struct work_struct work_freeing; - }; + /* + * the counter to account for mem+swap usage. + */ + struct res_counter memsw; /* * the counter to account for kernel memory usage. @@ -6143,29 +6125,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) vfree(memcg); } - -/* - * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU, - * but in process context. The work_freeing structure is overlaid - * on the rcu_freeing structure, which itself is overlaid on memsw. - */ -static void free_work(struct work_struct *work) -{ - struct mem_cgroup *memcg; - - memcg = container_of(work, struct mem_cgroup, work_freeing); - __mem_cgroup_free(memcg); -} - -static void free_rcu(struct rcu_head *rcu_head) -{ - struct mem_cgroup *memcg; - - memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing); - INIT_WORK(&memcg->work_freeing, free_work); - schedule_work(&memcg->work_freeing); -} - /* * Returns the parent mem_cgroup in memcgroup hierarchy with hierarchy enabled. */ @@ -6316,7 +6275,7 @@ static void mem_cgroup_css_free(struct cgroup *cont) mem_cgroup_sockets_destroy(memcg); - call_rcu(&memcg->rcu_freeing, free_rcu); + __mem_cgroup_free(memcg); } #ifdef CONFIG_MMU -- 1.8.0.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754918Ab3EQHFp (ORCPT ); Fri, 17 May 2013 03:05:45 -0400 Received: from szxga02-in.huawei.com ([119.145.14.65]:25776 "EHLO szxga02-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754734Ab3EQHFn (ORCPT ); Fri, 17 May 2013 03:05:43 -0400 Message-ID: <5195D6A4.8060109@huawei.com> Date: Fri, 17 May 2013 15:05:08 +0800 From: Li Zefan User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130509 Thunderbird/17.0.6 MIME-Version: 1.0 To: Andrew Morton CC: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , Subject: [PATCH 8/9] memcg: kill memcg refcnt References: <5195D5F8.7000609@huawei.com> In-Reply-To: <5195D5F8.7000609@huawei.com> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.135.68.215] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Now memcg has the same life cycle as its corresponding cgroup. Kill the useless refcnt. Signed-off-by: Li Zefan Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki --- mm/memcontrol.c | 18 +----------------- 1 file changed, 1 insertion(+), 17 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c6d267a..348126a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -303,8 +303,6 @@ struct mem_cgroup { bool oom_lock; atomic_t under_oom; - atomic_t refcnt; - int swappiness; /* OOM-Killer disable */ int oom_kill_disable; @@ -508,8 +506,6 @@ enum res_type { */ static DEFINE_MUTEX(memcg_create_mutex); -static void mem_cgroup_put(struct mem_cgroup *memcg); - static inline struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s) { @@ -6170,17 +6166,6 @@ static void free_rcu(struct rcu_head *rcu_head) schedule_work(&memcg->work_freeing); } -static void __mem_cgroup_put(struct mem_cgroup *memcg, int count) -{ - if (atomic_sub_and_test(count, &memcg->refcnt)) - call_rcu(&memcg->rcu_freeing, free_rcu); -} - -static void mem_cgroup_put(struct mem_cgroup *memcg) -{ - __mem_cgroup_put(memcg, 1); -} - /* * Returns the parent mem_cgroup in memcgroup hierarchy with hierarchy enabled. */ @@ -6240,7 +6225,6 @@ mem_cgroup_css_alloc(struct cgroup *cont) memcg->last_scanned_node = MAX_NUMNODES; INIT_LIST_HEAD(&memcg->oom_notify); - atomic_set(&memcg->refcnt, 1); memcg->move_charge_at_immigrate = 0; mutex_init(&memcg->thresholds_lock); spin_lock_init(&memcg->move_lock); @@ -6332,7 +6316,7 @@ static void mem_cgroup_css_free(struct cgroup *cont) mem_cgroup_sockets_destroy(memcg); - mem_cgroup_put(memcg); + call_rcu(&memcg->rcu_freeing, free_rcu); } #ifdef CONFIG_MMU -- 1.8.0.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755360Ab3EQHHc (ORCPT ); Fri, 17 May 2013 03:07:32 -0400 Received: from szxga01-in.huawei.com ([119.145.14.64]:49784 "EHLO szxga01-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755246Ab3EQHHa (ORCPT ); Fri, 17 May 2013 03:07:30 -0400 Message-ID: <5195D710.2040501@huawei.com> Date: Fri, 17 May 2013 15:06:56 +0800 From: Li Zefan User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130509 Thunderbird/17.0.6 MIME-Version: 1.0 To: Andrew Morton CC: Tejun Heo , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , Subject: Re: [PATCH 0/12][V3] memcg: make memcg's life cycle the same as cgroup References: <5195D5F8.7000609@huawei.com> In-Reply-To: <5195D5F8.7000609@huawei.com> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.135.68.215] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The subject should be "[PATCH 0/9][v3] ..." On 2013/5/17 15:02, Li Zefan wrote: > Hi Andrew, > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754333Ab3EQSIb (ORCPT ); Fri, 17 May 2013 14:08:31 -0400 Received: from mail-qe0-f47.google.com ([209.85.128.47]:55368 "EHLO mail-qe0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753391Ab3EQSI2 (ORCPT ); Fri, 17 May 2013 14:08:28 -0400 Date: Fri, 17 May 2013 11:08:22 -0700 From: Tejun Heo To: Li Zefan Cc: Andrew Morton , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org Subject: Re: [PATCH 5/9] memcg: use css_get/put when charging/uncharging kmem Message-ID: <20130517180822.GC12632@mtj.dyndns.org> References: <5195D5F8.7000609@huawei.com> <5195D666.6030408@huawei.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5195D666.6030408@huawei.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hey, On Fri, May 17, 2013 at 03:04:06PM +0800, Li Zefan wrote: > + /* > + * Releases a reference taken in kmem_cgroup_css_offline in case > + * this last uncharge is racing with the offlining code or it is > + * outliving the memcg existence. > + * > + * The memory barrier imposed by test&clear is paired with the > + * explicit one in kmem_cgroup_css_offline. Paired with the wmb to achieve what? > + */ > if (memcg_kmem_test_and_clear_dead(memcg)) > - mem_cgroup_put(memcg); > + css_put(&memcg->css); The other side is wmb, so there gotta be something which wants to read which were written before wmb here but the only thing after the barrier is css_put() which doesn't need such thing, so I'm lost on what the barrier pair is achieving here. In general, please be *very* explicit about what's going on whenever something is depending on barrier pairs. It'll make it easier for both the author and reviewers to actually understand what's going on and why it's necessary. ... > @@ -5858,23 +5856,39 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > return mem_cgroup_sockets_init(memcg, ss); > } > > -static void kmem_cgroup_destroy(struct mem_cgroup *memcg) > +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg) > { > - mem_cgroup_sockets_destroy(memcg); > + if (!memcg_kmem_is_active(memcg)) > + return; > > + /* > + * kmem charges can outlive the cgroup. In the case of slab > + * pages, for instance, a page contain objects from various > + * processes. As we prevent from taking a reference for every > + * such allocation we have to be careful when doing uncharge > + * (see memcg_uncharge_kmem) and here during offlining. > + * > + * The idea is that that only the _last_ uncharge which sees > + * the dead memcg will drop the last reference. An additional > + * reference is taken here before the group is marked dead > + * which is then paired with css_put during uncharge resp. here. > + * > + * Although this might sound strange as this path is called when > + * the reference has already dropped down to 0 and shouldn't be > + * incremented anymore (css_tryget would fail) we do not have Hmmm? offline is called on cgroup destruction regardless of css refcnt. The above comment seems a bit misleading. > + * other options because of the kmem allocations lifetime. > + */ > + css_get(&memcg->css); > + > + /* see comment in memcg_uncharge_kmem() */ > + wmb(); > memcg_kmem_mark_dead(memcg); Is the wmb() trying to prevent reordering between css_get() and memcg_kmem_mark_dead()? If so, it isn't necessary - the compiler isn't allowed to reorder two atomic ops (they're all asm volatiles) and the visibility order is guaranteed by the nature of the two operations going on here - both perform modify-and-test on one end of the operations. It could be argued that having memory barriers is better for completeness of mark/test interface but then those barriers should really moved into memcg_kmem_mark_dead() and its clearing counterpart. While it's all clever and dandy, my recommendation would be just using a lock for synchronization. It isn't a hot path. Why be clever? Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754961Ab3EVIiA (ORCPT ); Wed, 22 May 2013 04:38:00 -0400 Received: from szxga02-in.huawei.com ([119.145.14.65]:5179 "EHLO szxga02-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754865Ab3EVIhz (ORCPT ); Wed, 22 May 2013 04:37:55 -0400 Message-ID: <519C838B.9060609@huawei.com> Date: Wed, 22 May 2013 16:36:27 +0800 From: Li Zefan User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130509 Thunderbird/17.0.6 MIME-Version: 1.0 To: Tejun Heo CC: Andrew Morton , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , Subject: Re: [PATCH 5/9] memcg: use css_get/put when charging/uncharging kmem References: <5195D5F8.7000609@huawei.com> <5195D666.6030408@huawei.com> <20130517180822.GC12632@mtj.dyndns.org> In-Reply-To: <20130517180822.GC12632@mtj.dyndns.org> Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.135.68.215] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 2013/5/18 2:08, Tejun Heo wrote: > Hey, > > On Fri, May 17, 2013 at 03:04:06PM +0800, Li Zefan wrote: >> + /* >> + * Releases a reference taken in kmem_cgroup_css_offline in case >> + * this last uncharge is racing with the offlining code or it is >> + * outliving the memcg existence. >> + * >> + * The memory barrier imposed by test&clear is paired with the >> + * explicit one in kmem_cgroup_css_offline. > > Paired with the wmb to achieve what? > >> + */ >> if (memcg_kmem_test_and_clear_dead(memcg)) >> - mem_cgroup_put(memcg); >> + css_put(&memcg->css); > > The other side is wmb, so there gotta be something which wants to read > which were written before wmb here but the only thing after the > barrier is css_put() which doesn't need such thing, so I'm lost on > what the barrier pair is achieving here. > > In general, please be *very* explicit about what's going on whenever > something is depending on barrier pairs. It'll make it easier for > both the author and reviewers to actually understand what's going on > and why it's necessary. > > ... >> @@ -5858,23 +5856,39 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) >> return mem_cgroup_sockets_init(memcg, ss); >> } >> >> -static void kmem_cgroup_destroy(struct mem_cgroup *memcg) >> +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg) >> { >> - mem_cgroup_sockets_destroy(memcg); >> + if (!memcg_kmem_is_active(memcg)) >> + return; >> >> + /* >> + * kmem charges can outlive the cgroup. In the case of slab >> + * pages, for instance, a page contain objects from various >> + * processes. As we prevent from taking a reference for every >> + * such allocation we have to be careful when doing uncharge >> + * (see memcg_uncharge_kmem) and here during offlining. >> + * >> + * The idea is that that only the _last_ uncharge which sees >> + * the dead memcg will drop the last reference. An additional >> + * reference is taken here before the group is marked dead >> + * which is then paired with css_put during uncharge resp. here. >> + * >> + * Although this might sound strange as this path is called when >> + * the reference has already dropped down to 0 and shouldn't be >> + * incremented anymore (css_tryget would fail) we do not have > > Hmmm? offline is called on cgroup destruction regardless of css > refcnt. The above comment seems a bit misleading. > The comment is wrong. I'll fix it. >> + * other options because of the kmem allocations lifetime. >> + */ >> + css_get(&memcg->css); >> + >> + /* see comment in memcg_uncharge_kmem() */ >> + wmb(); >> memcg_kmem_mark_dead(memcg); > > Is the wmb() trying to prevent reordering between css_get() and > memcg_kmem_mark_dead()? If so, it isn't necessary - the compiler > isn't allowed to reorder two atomic ops (they're all asm volatiles) > and the visibility order is guaranteed by the nature of the two > operations going on here - both perform modify-and-test on one end of > the operations. > Yeah, I think you're right. > It could be argued that having memory barriers is better for > completeness of mark/test interface but then those barriers should > really moved into memcg_kmem_mark_dead() and its clearing counterpart. > > While it's all clever and dandy, my recommendation would be just using > a lock for synchronization. It isn't a hot path. Why be clever? > I don't quite like adding a lock not to protect data but just ensure code orders. Michal, what's your preference? I want to be sure that everyone is happy so the next version will hopefully be the last version. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1760071Ab3EXHy0 (ORCPT ); Fri, 24 May 2013 03:54:26 -0400 Received: from cantor2.suse.de ([195.135.220.15]:36833 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758598Ab3EXHyY (ORCPT ); Fri, 24 May 2013 03:54:24 -0400 Date: Fri, 24 May 2013 09:54:20 +0200 From: Michal Hocko To: Tejun Heo , Li Zefan Cc: Andrew Morton , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org Subject: Re: [PATCH 5/9] memcg: use css_get/put when charging/uncharging kmem Message-ID: <20130524075420.GA24813@dhcp22.suse.cz> References: <5195D5F8.7000609@huawei.com> <5195D666.6030408@huawei.com> <20130517180822.GC12632@mtj.dyndns.org> <519C838B.9060609@huawei.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <519C838B.9060609@huawei.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Sorry, I have missed this. CCing would help. Anyway putting myself to CC now :P On Wed 22-05-13 16:36:27, Li Zefan wrote: > On 2013/5/18 2:08, Tejun Heo wrote: > > Hey, > > > > On Fri, May 17, 2013 at 03:04:06PM +0800, Li Zefan wrote: > >> + /* > >> + * Releases a reference taken in kmem_cgroup_css_offline in case > >> + * this last uncharge is racing with the offlining code or it is > >> + * outliving the memcg existence. > >> + * > >> + * The memory barrier imposed by test&clear is paired with the > >> + * explicit one in kmem_cgroup_css_offline. > > > > Paired with the wmb to achieve what? https://lkml.org/lkml/2013/4/4/190 " ! > + css_get(&memcg->css); ! I think that you need a write memory barrier here because css_get ! nor memcg_kmem_mark_dead implies it. memcg_uncharge_kmem uses ! memcg_kmem_test_and_clear_dead which imply a full memory barrier but it ! should see the elevated reference count. No? ! ! > + /* ! > + * We need to call css_get() first, because memcg_uncharge_kmem() ! > + * will call css_put() if it sees the memcg is dead. ! > + */ ! > memcg_kmem_mark_dead(memcg); " Does it make sense to you Tejun? > > > >> + */ > >> if (memcg_kmem_test_and_clear_dead(memcg)) > >> - mem_cgroup_put(memcg); > >> + css_put(&memcg->css); > > > > The other side is wmb, so there gotta be something which wants to read > > which were written before wmb here but the only thing after the > > barrier is css_put() which doesn't need such thing, so I'm lost on > > what the barrier pair is achieving here. > > > > In general, please be *very* explicit about what's going on whenever > > something is depending on barrier pairs. It'll make it easier for > > both the author and reviewers to actually understand what's going on > > and why it's necessary. > > > > ... > >> @@ -5858,23 +5856,39 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > >> return mem_cgroup_sockets_init(memcg, ss); > >> } > >> > >> -static void kmem_cgroup_destroy(struct mem_cgroup *memcg) > >> +static void kmem_cgroup_css_offline(struct mem_cgroup *memcg) > >> { > >> - mem_cgroup_sockets_destroy(memcg); > >> + if (!memcg_kmem_is_active(memcg)) > >> + return; > >> > >> + /* > >> + * kmem charges can outlive the cgroup. In the case of slab > >> + * pages, for instance, a page contain objects from various > >> + * processes. As we prevent from taking a reference for every > >> + * such allocation we have to be careful when doing uncharge > >> + * (see memcg_uncharge_kmem) and here during offlining. > >> + * > >> + * The idea is that that only the _last_ uncharge which sees > >> + * the dead memcg will drop the last reference. An additional > >> + * reference is taken here before the group is marked dead > >> + * which is then paired with css_put during uncharge resp. here. > >> + * > >> + * Although this might sound strange as this path is called when > >> + * the reference has already dropped down to 0 and shouldn't be > >> + * incremented anymore (css_tryget would fail) we do not have > > > > Hmmm? offline is called on cgroup destruction regardless of css > > refcnt. The above comment seems a bit misleading. > > > > The comment is wrong. I'll fix it. Ohh, right. "Althouth this might sound strange as this path is called from css_offline when the reference might have dropped down to 0 and shouldn't ..." Sounds better? > >> + * other options because of the kmem allocations lifetime. > >> + */ > >> + css_get(&memcg->css); > >> + > >> + /* see comment in memcg_uncharge_kmem() */ > >> + wmb(); > >> memcg_kmem_mark_dead(memcg); > > > > Is the wmb() trying to prevent reordering between css_get() and > > memcg_kmem_mark_dead()? If so, it isn't necessary - the compiler > > isn't allowed to reorder two atomic ops (they're all asm volatiles) > > and the visibility order is guaranteed by the nature of the two > > operations going on here - both perform modify-and-test on one end of > > the operations. As I have copied my comment from the earlier thread above. css_get does atomic_add which doesn't imply any barrier AFAIK and memcg_kmem_mark_dead uses a simple set_bit which doesn't imply it either. What am I missing? > > > > Yeah, I think you're right. > > > It could be argued that having memory barriers is better for > > completeness of mark/test interface but then those barriers should > > really moved into memcg_kmem_mark_dead() and its clearing counterpart. > > > > While it's all clever and dandy, my recommendation would be just using > > a lock for synchronization. It isn't a hot path. Why be clever? > > > > I don't quite like adding a lock not to protect data but just ensure code > orders. Agreed. > Michal, what's your preference? I want to be sure that everyone is happy > so the next version will hopefully be the last version. I still do not see why the barrier is not needed and the lock seems too big hammer. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S967538Ab3E3FtG (ORCPT ); Thu, 30 May 2013 01:49:06 -0400 Received: from mail-pd0-f174.google.com ([209.85.192.174]:43000 "EHLO mail-pd0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S967488Ab3E3Fs5 (ORCPT ); Thu, 30 May 2013 01:48:57 -0400 Date: Thu, 30 May 2013 14:48:52 +0900 From: Tejun Heo To: Michal Hocko Cc: Li Zefan , Andrew Morton , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org Subject: Re: [PATCH 5/9] memcg: use css_get/put when charging/uncharging kmem Message-ID: <20130530054852.GA9305@mtj.dyndns.org> References: <5195D5F8.7000609@huawei.com> <5195D666.6030408@huawei.com> <20130517180822.GC12632@mtj.dyndns.org> <519C838B.9060609@huawei.com> <20130524075420.GA24813@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130524075420.GA24813@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, Sorry about the delay. Have been and still am traveling. On Fri, May 24, 2013 at 09:54:20AM +0200, Michal Hocko wrote: > > > On Fri, May 17, 2013 at 03:04:06PM +0800, Li Zefan wrote: > > >> + /* > > >> + * Releases a reference taken in kmem_cgroup_css_offline in case > > >> + * this last uncharge is racing with the offlining code or it is > > >> + * outliving the memcg existence. > > >> + * > > >> + * The memory barrier imposed by test&clear is paired with the > > >> + * explicit one in kmem_cgroup_css_offline. > > > > > > Paired with the wmb to achieve what? > > https://lkml.org/lkml/2013/4/4/190 > " > ! > + css_get(&memcg->css); > ! I think that you need a write memory barrier here because css_get > ! nor memcg_kmem_mark_dead implies it. memcg_uncharge_kmem uses > ! memcg_kmem_test_and_clear_dead which imply a full memory barrier but it > ! should see the elevated reference count. No? > ! > ! > + /* > ! > + * We need to call css_get() first, because memcg_uncharge_kmem() > ! > + * will call css_put() if it sees the memcg is dead. > ! > + */ > ! > memcg_kmem_mark_dead(memcg); > " > > Does it make sense to you Tejun? Yeah, you're right. We need them. It's still a bummer that mark_dead has the appearance of proper encapsulation while not really taking care of synchronization. I think it'd make more sense for mark_dead to have the barrier (which BTW should probably be smp_wmb() instead of wmb()) inside it or for the function to be just open-coded. More on this topic later. > > The comment is wrong. I'll fix it. > > Ohh, right. "Althouth this might sound strange as this path is called from > css_offline when the reference might have dropped down to 0 and shouldn't ..." > > Sounds better? Yeap. > > I don't quite like adding a lock not to protect data but just ensure code > > orders. > > Agreed. > > > Michal, what's your preference? I want to be sure that everyone is happy > > so the next version will hopefully be the last version. > > I still do not see why the barrier is not needed and the lock seems too > big hammer. Yes, the barrier is necessary but I still think it's unnecessarily elaborate. Among the locking constructs, the barriesr are the worst - they don't enforce any structures, people often misunderstand / make mistakes about them, bugs from misusages are extremely difficult to trigger and reproduce especially on x86. It's a horrible construct and should only be used if no other options can meet the performance requirements required for the path. So, to me, "lock is too big a hammer" looks to be approaching the problem from the completely wrong direction when the code path clearly isn't hot enough to justify memory barrier tricks. We don't and shouldn't try to choose the mechanism with the lowest possible overhead for the given situation. We should be as simple and straight-forward as the situation allows. That's the only way to sustain long-term maintainability. So, let's please do something which is apparent. I don't really care what it is - a shared spinlock, test_and_lock bitops, whatever. It's not gonna show up in any profile anyway. There's absolutely no reason to mess with memory barriers. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933573Ab3E3PMb (ORCPT ); Thu, 30 May 2013 11:12:31 -0400 Received: from mail-we0-f173.google.com ([74.125.82.173]:62932 "EHLO mail-we0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932376Ab3E3PMZ (ORCPT ); Thu, 30 May 2013 11:12:25 -0400 Date: Thu, 30 May 2013 17:12:20 +0200 From: Michal Hocko To: Tejun Heo Cc: Li Zefan , Andrew Morton , Glauber Costa , KAMEZAWA Hiroyuki , Johannes Weiner , LKML , Cgroups , linux-mm@kvack.org Subject: Re: [PATCH 5/9] memcg: use css_get/put when charging/uncharging kmem Message-ID: <20130530151220.GB18155@dhcp22.suse.cz> References: <5195D5F8.7000609@huawei.com> <5195D666.6030408@huawei.com> <20130517180822.GC12632@mtj.dyndns.org> <519C838B.9060609@huawei.com> <20130524075420.GA24813@dhcp22.suse.cz> <20130530054852.GA9305@mtj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130530054852.GA9305@mtj.dyndns.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 30-05-13 14:48:52, Tejun Heo wrote: > Hello, > > Sorry about the delay. Have been and still am traveling. > > On Fri, May 24, 2013 at 09:54:20AM +0200, Michal Hocko wrote: > > > > On Fri, May 17, 2013 at 03:04:06PM +0800, Li Zefan wrote: > > > >> + /* > > > >> + * Releases a reference taken in kmem_cgroup_css_offline in case > > > >> + * this last uncharge is racing with the offlining code or it is > > > >> + * outliving the memcg existence. > > > >> + * > > > >> + * The memory barrier imposed by test&clear is paired with the > > > >> + * explicit one in kmem_cgroup_css_offline. > > > > > > > > Paired with the wmb to achieve what? > > > > https://lkml.org/lkml/2013/4/4/190 > > " > > ! > + css_get(&memcg->css); > > ! I think that you need a write memory barrier here because css_get > > ! nor memcg_kmem_mark_dead implies it. memcg_uncharge_kmem uses > > ! memcg_kmem_test_and_clear_dead which imply a full memory barrier but it > > ! should see the elevated reference count. No? > > ! > > ! > + /* > > ! > + * We need to call css_get() first, because memcg_uncharge_kmem() > > ! > + * will call css_put() if it sees the memcg is dead. > > ! > + */ > > ! > memcg_kmem_mark_dead(memcg); > > " > > > > Does it make sense to you Tejun? > > Yeah, you're right. We need them. It's still a bummer that mark_dead > has the appearance of proper encapsulation while not really taking > care of synchronization. No objection to put barrier there. You are right it is more natural. > I think it'd make more sense for mark_dead to have the barrier (which > BTW should probably be smp_wmb() instead of wmb()) Yes, smp_wmb sounds like a better fit. > inside it or for the function to be just open-coded. More on this > topic later. > > > > The comment is wrong. I'll fix it. > > > > Ohh, right. "Althouth this might sound strange as this path is called from > > css_offline when the reference might have dropped down to 0 and shouldn't ..." > > > > Sounds better? > > Yeap. > > > > I don't quite like adding a lock not to protect data but just ensure code > > > orders. > > > > Agreed. > > > > > Michal, what's your preference? I want to be sure that everyone is happy > > > so the next version will hopefully be the last version. > > > > I still do not see why the barrier is not needed and the lock seems too > > big hammer. > > Yes, the barrier is necessary but I still think it's unnecessarily > elaborate. Among the locking constructs, the barriesr are the worst - > they don't enforce any structures, people often misunderstand / make > mistakes about them, bugs from misusages are extremely difficult to > trigger and reproduce especially on x86. It's a horrible construct > and should only be used if no other options can meet the performance > requirements required for the path. I am all for simplifying the code. I guess it deserves a separate patch though and it is a bit unrelated to the scope of the series. -- Michal Hocko SUSE Labs