From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: [RFC] memcg/cgroup: do not fail fail on pre_destroy callbacks
Date: Wed, 17 Oct 2012 15:30:42 +0200
Message-ID: <1350480648-10905-1-git-send-email-mhocko@suse.cz>
Return-path:
Sender: owner-linux-mm@kvack.org
List-ID:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo, Li Zefan, Johannes Weiner, KAMEZAWA Hiroyuki, Balbir Singh

Hi,
memcg is the only controller which might fail in its pre_destroy
callback which makes the cgroup core more complicated for no good
reason. This is an attempt to change this unfortunate state.

I am sending this as an RFC because I would like to hear back whether the
approach is correct. I thought that the changes would be more invasive
but it seems that the current code was mostly prepared for this and it
needs just some small tweaks (so I might be missing something important
here).

The first two patches are just cleanups. They could be merged even
without the rest.

The real change, although the code is not changed that much, is the 3rd
patch. It changes the way we handle mem_cgroup_move_parent failures.
We have to realize that all those failures are *temporary*, because we
are either racing with the page removal or the page is temporarily off
the LRU because of migration or global reclaim. As a result we do
not fail mem_cgroup_force_empty_list if the page cannot be moved to the
parent and rather retry until the LRU is empty.

The 4th patch is for cgroup core. I have moved cgroup_call_pre_destroy
inside the cgroup_lock which is not very nice because the callbacks
can take some time. Maybe we can move this call at the very end of the
function?
All I need for memcg is that cgroup_call_pre_destroy has been called and
that no new cgroups can be attached to the group.
The cgroup_lock is necessary for the latter condition but if we move
the call after the CGRP_REMOVED flag is set then we are safe as well.

The last two patches are trivial follow-ups for the cgroups core change
because now we know that nobody will interfere with us so we can drop
those empty && no child conditions.

Comments, thoughts?

Michal Hocko (6):
      memcg: split mem_cgroup_force_empty into reclaiming and reparenting parts
      memcg: root_cgroup cannot reach mem_cgroup_move_parent
      memcg: Simplify mem_cgroup_force_empty_list error handling
      cgroups: forbid pre_destroy callback to fail
      memcg: make mem_cgroup_reparent_charges non failing
      hugetlb: do not fail in hugetlb_cgroup_pre_destroy

Cumulative diffstat:
 kernel/cgroup.c     |   30 ++++---------
 mm/hugetlb_cgroup.c |   11 ++---
 mm/memcontrol.c     |  124 +++++++++++++++++++++++++++------------------------
 3 files changed, 78 insertions(+), 87 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: [PATCH 1/6] memcg: split mem_cgroup_force_empty into reclaiming and reparenting parts
Date: Wed, 17 Oct 2012 15:30:43 +0200
Message-ID: <1350480648-10905-2-git-send-email-mhocko@suse.cz>
References: <1350480648-10905-1-git-send-email-mhocko@suse.cz>
Return-path:
In-Reply-To: <1350480648-10905-1-git-send-email-mhocko@suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo, Li Zefan, Johannes Weiner, KAMEZAWA Hiroyuki, Balbir Singh

mem_cgroup_force_empty did two separate things depending on free_all
parameter from the very beginning. It either reclaimed as many pages as
possible and moved the rest to the parent or just moved charges to the
parent.
The first variant is used as memory.force_empty callback while
the latter is used from the mem_cgroup_pre_destroy.

The games around gotos are far from nice and there is no
reason to keep those two functions inside one. Let's split them and
also move the responsibility for css reference counting to their callers
to make the code easier to follow.

This patch doesn't have any functional changes.

Signed-off-by: Michal Hocko
---
 mm/memcontrol.c |   72 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 30 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e4e9b18..f25e9c0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3733,27 +3733,21 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 }
 
 /*
- * make mem_cgroup's charge to be 0 if there is no task.
+ * make mem_cgroup's charge to be 0 if there is no task by moving
+ * all the charges and pages to the parent.
  * This enables deleting this mem_cgroup.
+ *
+ * Caller is responsible for holding css reference on the memcg.
  */
-static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
+static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
 {
-	int ret;
-	int node, zid, shrink;
-	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct cgroup *cgrp = memcg->css.cgroup;
+	int node, zid;
+	int ret;
 
-	css_get(&memcg->css);
-
-	shrink = 0;
-	/* should free all ? */
-	if (free_all)
-		goto try_to_free;
-move_account:
 	do {
-		ret = -EBUSY;
 		if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
-			goto out;
+			return -EBUSY;
 		/* This is for making all *used* pages to be on LRU. */
 		lru_add_drain_all();
 		drain_all_stock_sync(memcg);
@@ -3777,27 +3771,34 @@ move_account:
 		cond_resched();
 	/* "ret" should also be checked to ensure all lists are empty. */
 	} while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
-out:
-	css_put(&memcg->css);
+
 	return ret;
+}
+
+/*
+ * Reclaims as many pages from the given memcg as possible and moves
+ * the rest to the parent.
+ *
+ * Caller is responsible for holding css reference for memcg.
+ */
+static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
+{
+	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+	struct cgroup *cgrp = memcg->css.cgroup;
 
-try_to_free:
 	/* returns EBUSY if there is a task or if we come here twice. */
-	if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children) || shrink) {
-		ret = -EBUSY;
-		goto out;
-	}
+	if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
+		return -EBUSY;
+
 	/* we call try-to-free pages for make this cgroup empty */
 	lru_add_drain_all();
 	/* try to free all pages in this cgroup */
-	shrink = 1;
 	while (nr_retries && res_counter_read_u64(&memcg->res, RES_USAGE) > 0) {
 		int progress;
 
-		if (signal_pending(current)) {
-			ret = -EINTR;
-			goto out;
-		}
+		if (signal_pending(current))
+			return -EINTR;
+
 		progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL,
 						false);
 		if (!progress) {
@@ -3808,13 +3809,19 @@ try_to_free:
 
 	}
 	lru_add_drain();
-	/* try move_account...there may be some *locked* pages. */
-	goto move_account;
+	return mem_cgroup_reparent_charges(memcg);
 }
 
 static int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
 {
-	return mem_cgroup_force_empty(mem_cgroup_from_cont(cont), true);
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+	int ret;
+
+	css_get(&memcg->css);
+	ret = mem_cgroup_force_empty(memcg);
+	css_put(&memcg->css);
+
+	return ret;
 }
 
 
@@ -5004,8 +5011,13 @@ free_out:
 static int mem_cgroup_pre_destroy(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+	int ret;
 
-	return mem_cgroup_force_empty(memcg, false);
+	css_get(&memcg->css);
+	ret = mem_cgroup_reparent_charges(memcg);
+	css_put(&memcg->css);
+
+	return ret;
 }
 
 static void mem_cgroup_destroy(struct cgroup *cont)
--
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: [PATCH 2/6] memcg: root_cgroup cannot reach mem_cgroup_move_parent
Date: Wed, 17 Oct 2012 15:30:44 +0200
Message-ID: <1350480648-10905-3-git-send-email-mhocko@suse.cz>
References: <1350480648-10905-1-git-send-email-mhocko@suse.cz>
Return-path:
In-Reply-To: <1350480648-10905-1-git-send-email-mhocko@suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo, Li Zefan, Johannes Weiner, KAMEZAWA Hiroyuki, Balbir Singh

The root cgroup cannot be destroyed so we never hit it down the
mem_cgroup_pre_destroy path and mem_cgroup_force_empty_write shouldn't
even try to do anything if called for the root.

This means that mem_cgroup_move_parent doesn't have to bother with the
root cgroup and it can assume it can always move charges upwards.

Signed-off-by: Michal Hocko
---
 mm/memcontrol.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f25e9c0..9ce24b7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2709,9 +2709,7 @@ static int mem_cgroup_move_parent(struct page *page,
 	unsigned long uninitialized_var(flags);
 	int ret;
 
-	/* Is ROOT ? */
-	if (mem_cgroup_is_root(child))
-		return -EINVAL;
+	VM_BUG_ON(mem_cgroup_is_root(child));
 
 	ret = -EBUSY;
 	if (!get_page_unless_zero(page))
@@ -3817,6 +3815,8 @@ static int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 	int ret;
 
+	if (mem_cgroup_is_root(memcg))
+		return -EINVAL;
 	css_get(&memcg->css);
 	ret = mem_cgroup_force_empty(memcg);
 	css_put(&memcg->css);
--
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: [PATCH 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling
Date: Wed, 17 Oct 2012 15:30:45 +0200
Message-ID: <1350480648-10905-4-git-send-email-mhocko@suse.cz>
References: <1350480648-10905-1-git-send-email-mhocko@suse.cz>
Return-path:
In-Reply-To: <1350480648-10905-1-git-send-email-mhocko@suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo, Li Zefan, Johannes Weiner, KAMEZAWA Hiroyuki, Balbir Singh

mem_cgroup_force_empty_list currently tries to remove all pages from
the given LRU. To protect against temporary failures (EBUSY returned by
mem_cgroup_move_parent) it uses a margin to the current LRU pages and
returns true if there are still some pages left on the list.

If we consider that mem_cgroup_move_parent fails only when we are racing
with somebody else removing the page (or uncharging it) or when the
page is migrated then it is obvious that all those failures are only
temporary and so we can safely retry later.

Let's get rid of the safety margin and make the loop really wait for the
empty LRU. The caller should still make sure that all charges have been
removed from the res_counter because mem_cgroup_replace_page_cache might
add a page to the LRU after the check (it doesn't touch res_counter
though).

This catches most of the cases except for shmem which might call
mem_cgroup_replace_page_cache with a page which is not charged and on
the LRU yet but this was the case also without this patch. In order to
fix this we need a guarantee that try_get_mem_cgroup_from_page falls
back to the current mm's cgroup so it needs css_tryget to fail. This
will be fixed up in a later patch because it needs help from cgroup
core.

Signed-off-by: Michal Hocko
---
 mm/memcontrol.c |   52 +++++++++++++++++++++++++++-------------------------
 1 file changed, 27 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9ce24b7..f57ba4c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2697,9 +2697,13 @@ out:
 }
 
 /*
- * move charges to its parent.
+ * move charges to its parent or the root cgroup if the group
+ * has no parent (aka use_hierarchy==0).
+ * Although this might fail the failure is always temporary and it
+ * signals a race with a page removal/uncharge or migration. In the
+ * first case the page will vanish from the LRU on the next attempt
+ * and the call should be retried later.
  */
-
 static int mem_cgroup_move_parent(struct page *page,
 				  struct page_cgroup *pc,
 				  struct mem_cgroup *child)
@@ -2726,8 +2730,10 @@ static int mem_cgroup_move_parent(struct page *page,
 	if (!parent)
 		parent = root_mem_cgroup;
 
-	if (nr_pages > 1)
+	if (nr_pages > 1) {
+		VM_BUG_ON(!PageTransHuge(page));
 		flags = compound_lock_irqsave(page);
+	}
 
 	ret = mem_cgroup_move_account(page, nr_pages,
 				pc, child, parent);
@@ -3679,15 +3685,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 
 /*
  * Traverse a specified page_cgroup list and try to drop them all. This doesn't
- * reclaim the pages page themselves - it just removes the page_cgroups.
- * Returns true if some page_cgroups were not freed, indicating that the caller
- * must retry this operation.
+ * reclaim the pages page themselves - pages are moved to the parent (or root)
+ * group.
  */
-static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
+static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 				int node, int zid, enum lru_list lru)
 {
 	struct mem_cgroup_per_zone *mz;
-	unsigned long flags, loop;
+	unsigned long flags;
 	struct list_head *list;
 	struct page *busy;
 	struct zone *zone;
@@ -3696,11 +3701,8 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 	mz = mem_cgroup_zoneinfo(memcg, node, zid);
 	list = &mz->lruvec.lists[lru];
 
-	loop = mz->lru_size[lru];
-	/* give some margin against EBUSY etc...*/
-	loop += 256;
 	busy = NULL;
-	while (loop--) {
+	do {
 		struct page_cgroup *pc;
 		struct page *page;
 
@@ -3726,8 +3728,7 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 			cond_resched();
 		} else
 			busy = NULL;
-	}
-	return !list_empty(list);
+	} while (!list_empty(list));
 }
 
 /*
@@ -3741,7 +3742,6 @@ static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
 {
 	struct cgroup *cgrp = memcg->css.cgroup;
 	int node, zid;
-	int ret;
 
 	do {
 		if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
@@ -3749,28 +3749,30 @@ static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
 		/* This is for making all *used* pages to be on LRU. */
 		lru_add_drain_all();
 		drain_all_stock_sync(memcg);
-		ret = 0;
 		mem_cgroup_start_move(memcg);
 		for_each_node_state(node, N_HIGH_MEMORY) {
-			for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
+			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 				enum lru_list lru;
 				for_each_lru(lru) {
-					ret = mem_cgroup_force_empty_list(memcg,
+					mem_cgroup_force_empty_list(memcg,
 							node, zid, lru);
-					if (ret)
-						break;
 				}
 			}
-			if (ret)
-				break;
 		}
 		mem_cgroup_end_move(memcg);
 		memcg_oom_recover(memcg);
 		cond_resched();
-	/* "ret" should also be checked to ensure all lists are empty. */
-	} while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
 
-	return ret;
+		/*
+		 * This is a safety check because mem_cgroup_force_empty_list
+		 * could have raced with mem_cgroup_replace_page_cache callers
+		 * so the lru seemed empty but the page could have been added
+		 * right after the check. RES_USAGE should be safe as we always
+		 * charge before adding to the LRU.
+		 */
+	} while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0);
+
+	return 0;
 }
 
 /*
--
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: [PATCH 5/6] memcg: make mem_cgroup_reparent_charges non failing
Date: Wed, 17 Oct 2012 15:30:47 +0200
Message-ID: <1350480648-10905-6-git-send-email-mhocko@suse.cz>
References: <1350480648-10905-1-git-send-email-mhocko@suse.cz>
Return-path:
In-Reply-To: <1350480648-10905-1-git-send-email-mhocko@suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo, Li Zefan, Johannes Weiner, KAMEZAWA Hiroyuki, Balbir Singh

Now that pre_destroy callbacks are called from within cgroup_lock and
the cgroup has been checked to be empty and without any children, there
is no other way to fail.
mem_cgroup_pre_destroy doesn't have to take a reference to memcg's css
because all css' are marked dead already.

Signed-off-by: Michal Hocko
---
 mm/memcontrol.c |   18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f57ba4c..7c75da3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3738,14 +3738,12 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
  *
  * Caller is responsible for holding css reference on the memcg.
  */
-static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
+static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
 {
 	struct cgroup *cgrp = memcg->css.cgroup;
 	int node, zid;
 
 	do {
-		if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
-			return -EBUSY;
 		/* This is for making all *used* pages to be on LRU. */
 		lru_add_drain_all();
 		drain_all_stock_sync(memcg);
@@ -3771,8 +3769,6 @@ static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
 		 * charge before adding to the LRU.
 		 */
 	} while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0);
-
-	return 0;
 }
 
 /*
@@ -3809,7 +3805,9 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 
 	}
 	lru_add_drain();
-	return mem_cgroup_reparent_charges(memcg);
+	mem_cgroup_reparent_charges(memcg);
+
+	return 0;
 }
 
 static int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
@@ -5013,13 +5011,9 @@ free_out:
 static int mem_cgroup_pre_destroy(struct cgroup *cont)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
-	int ret;
 
-	css_get(&memcg->css);
-	ret = mem_cgroup_reparent_charges(memcg);
-	css_put(&memcg->css);
-
-	return ret;
+	mem_cgroup_reparent_charges(memcg);
+	return 0;
 }
 
 static void mem_cgroup_destroy(struct cgroup *cont)
--
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail
Date: Wed, 17 Oct 2012 15:30:46 +0200
Message-ID: <1350480648-10905-5-git-send-email-mhocko@suse.cz>
References: <1350480648-10905-1-git-send-email-mhocko@suse.cz>
Return-path:
In-Reply-To: <1350480648-10905-1-git-send-email-mhocko@suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo, Li Zefan, Johannes Weiner, KAMEZAWA Hiroyuki, Balbir Singh

Now that the mem_cgroup_pre_destroy callback doesn't fail, we can finally
move on and forbid all the callbacks from failing. The last missing
piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so
that css_tryget fails and no new charges for the memcg can happen.
The callbacks are also called from within cgroup_lock to guarantee that
no new tasks show up.
We could theoretically call them outside of the lock but
then we would have to move the call after the CGRP_REMOVED flag is set.

Signed-off-by: Michal Hocko
---
 kernel/cgroup.c |   30 +++++++++---------------------
 1 file changed, 9 insertions(+), 21 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b7d9606..00729c1 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -855,7 +855,7 @@ static struct inode *cgroup_new_inode(umode_t mode, struct super_block *sb)
  * Call subsys's pre_destroy handler.
  * This is called before css refcnt check.
  */
-static int cgroup_call_pre_destroy(struct cgroup *cgrp)
+static void cgroup_call_pre_destroy(struct cgroup *cgrp)
 {
 	struct cgroup_subsys *ss;
 	int ret = 0;
@@ -864,15 +864,8 @@ static int cgroup_call_pre_destroy(struct cgroup *cgrp)
 		if (!ss->pre_destroy)
 			continue;
 
-		ret = ss->pre_destroy(cgrp);
-		if (ret) {
-			/* ->pre_destroy() failure is being deprecated */
-			WARN_ON_ONCE(!ss->__DEPRECATED_clear_css_refs);
-			break;
-		}
+		BUG_ON(ss->pre_destroy(cgrp));
 	}
-
-	return ret;
 }
 
 static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -4161,7 +4154,6 @@ again:
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
 	}
-	mutex_unlock(&cgroup_mutex);
 
 	/*
 	 * In general, subsystem has no css->refcnt after pre_destroy(). But
@@ -4174,17 +4166,6 @@ again:
 	 */
 	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 
-	/*
-	 * Call pre_destroy handlers of subsys. Notify subsystems
-	 * that rmdir() request comes.
-	 */
-	ret = cgroup_call_pre_destroy(cgrp);
-	if (ret) {
-		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
-		return ret;
-	}
-
-	mutex_lock(&cgroup_mutex);
 	parent = cgrp->parent;
 	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
 		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
@@ -4206,6 +4187,13 @@ again:
 			return -EINTR;
 		goto again;
 	}
+
+	/*
+	 * Call pre_destroy handlers of subsys. Notify subsystems
+	 * that rmdir() request comes.
+	 */
+	cgroup_call_pre_destroy(cgrp);
+
 	/* NO css_tryget() can success after here. */
 	finish_wait(&cgroup_rmdir_waitq, &wait);
 	clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
--
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: [PATCH 6/6] hugetlb: do not fail in hugetlb_cgroup_pre_destroy
Date: Wed, 17 Oct 2012 15:30:48 +0200
Message-ID: <1350480648-10905-7-git-send-email-mhocko@suse.cz>
References: <1350480648-10905-1-git-send-email-mhocko@suse.cz>
Return-path:
In-Reply-To: <1350480648-10905-1-git-send-email-mhocko@suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo, Li Zefan, Johannes Weiner, KAMEZAWA Hiroyuki, Balbir Singh

Now that pre_destroy callbacks are called from within cgroup_lock and
the cgroup has been checked to be empty and without any children, there
is no other way to fail.

Signed-off-by: Michal Hocko
---
 mm/hugetlb_cgroup.c |   11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index a3f358f..dc595c6 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -159,14 +159,9 @@ static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
 {
 	struct hstate *h;
 	struct page *page;
-	int ret = 0, idx = 0;
+	int idx = 0;
 
 	do {
-		if (cgroup_task_count(cgroup) ||
-		    !list_empty(&cgroup->children)) {
-			ret = -EBUSY;
-			goto out;
-		}
 		for_each_hstate(h) {
 			spin_lock(&hugetlb_lock);
 			list_for_each_entry(page, &h->hugepage_activelist, lru)
@@ -177,8 +172,8 @@ static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
 		}
 		cond_resched();
 	} while (hugetlb_cgroup_have_usage(cgroup));
-out:
-	return ret;
+
+	return 0;
 }
 
 int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
--
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Glauber Costa
Subject: Re: [RFC] memcg/cgroup: do not fail fail on pre_destroy callbacks
Date: Wed, 17 Oct 2012 19:30:48 +0400
Message-ID: <507ECF28.1060602@parallels.com>
References: <1350480648-10905-1-git-send-email-mhocko@suse.cz>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <1350480648-10905-1-git-send-email-mhocko@suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
To: Michal Hocko
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo, Li Zefan, Johannes Weiner, KAMEZAWA Hiroyuki, Balbir Singh

On 10/17/2012 05:30 PM, Michal Hocko wrote:
> Hi,
> memcg is the only controller which might fail in its pre_destroy
> callback which makes the cgroup core more complicated for no good
> reason. This is an attempt to change this unfortunate state.
>
> I am sending this as an RFC because I would like to hear back whether the
> approach is correct. I thought that the changes would be more invasive
> but it seems that the current code was mostly prepared for this and it
> needs just some small tweaks (so I might be missing something important
> here).
>
> The first two patches are just cleanups. They could be merged even
> without the rest.
>
> The real change, although the code is not changed that much, is the 3rd
> patch. It changes the way we handle mem_cgroup_move_parent failures.
> We have to realize that all those failures are *temporary*, because we
> are either racing with the page removal or the page is temporarily off
> the LRU because of migration or global reclaim. As a result we do
> not fail mem_cgroup_force_empty_list if the page cannot be moved to the
> parent and rather retry until the LRU is empty.
>
> The 4th patch is for cgroup core. I have moved cgroup_call_pre_destroy
> inside the cgroup_lock which is not very nice because the callbacks
> can take some time. Maybe we can move this call at the very end of the
> function?
> All I need for memcg is that cgroup_call_pre_destroy has been called and
> that no new cgroups can be attached to the group. The cgroup_lock is
> necessary for the latter condition but if we move the call after the
> CGRP_REMOVED flag is set then we are safe as well.
>
> The last two patches are trivial follow-ups for the cgroups core change
> because now we know that nobody will interfere with us so we can drop
> those empty && no child conditions.
>
> Comments, thoughts?
>

I personally don't see anything fundamentally wrong with this.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kamezawa Hiroyuki
Subject: Re: [RFC] memcg/cgroup: do not fail fail on pre_destroy callbacks
Date: Thu, 18 Oct 2012 09:29:58 +0900
Message-ID: <507F4D86.106@jp.fujitsu.com>
References: <1350480648-10905-1-git-send-email-mhocko@suse.cz>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <1350480648-10905-1-git-send-email-mhocko-AlSwsSmVLrQ@public.gmane.org>
Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
To: Michal Hocko
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Tejun Heo, Li Zefan, Johannes Weiner, Balbir Singh

(2012/10/17 22:30), Michal Hocko wrote:
> Hi,
> memcg is the only controller which might fail in its pre_destroy
> callback which makes the cgroup core more complicated for no good
> reason. This is an attempt to change this unfortunate state.
>
> I am sending this as an RFC because I would like to hear back whether the
> approach is correct. I thought that the changes would be more invasive
> but it seems that the current code was mostly prepared for this and it
> needs just some small tweaks (so I might be missing something important
> here).
>
> The first two patches are just cleanups. They could be merged even
> without the rest.
>
> The real change, although the code is not changed that much, is the 3rd
> patch. It changes the way we handle mem_cgroup_move_parent failures.
> We have to realize that all those failures are *temporary*, because we
> are either racing with the page removal or the page is temporarily off
> the LRU because of migration or global reclaim. As a result we do
> not fail mem_cgroup_force_empty_list if the page cannot be moved to the
> parent and rather retry until the LRU is empty.
>
> The 4th patch is for cgroup core.
> I have moved cgroup_call_pre_destroy
> inside the cgroup_lock which is not very nice because the callbacks
> can take some time. Maybe we can move this call at the very end of the
> function?
> All I need for memcg is that cgroup_call_pre_destroy has been called and
> that no new cgroups can be attached to the group. The cgroup_lock is
> necessary for the latter condition but if we move the call after the
> CGRP_REMOVED flag is set then we are safe as well.
>
> The last two patches are trivial follow-ups for the cgroups core change
> because now we know that nobody will interfere with us so we can drop
> those empty && no child conditions.
>
> Comments, thoughts?
>
> Michal Hocko (6):
>       memcg: split mem_cgroup_force_empty into reclaiming and reparenting parts
>       memcg: root_cgroup cannot reach mem_cgroup_move_parent
>       memcg: Simplify mem_cgroup_force_empty_list error handling
>       cgroups: forbid pre_destroy callback to fail
>       memcg: make mem_cgroup_reparent_charges non failing
>       hugetlb: do not fail in hugetlb_cgroup_pre_destroy
>
> Cumulative diffstat:
>  kernel/cgroup.c     |   30 ++++---------
>  mm/hugetlb_cgroup.c |   11 ++---
>  mm/memcontrol.c     |  124 +++++++++++++++++++++++++++------------------------
>  3 files changed, 78 insertions(+), 87 deletions(-)

Thank you very much!
The whole patch seems good to me and I like this approach.

Thanks,
-Kame

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Li Zefan
Subject: Re: [PATCH 5/6] memcg: make mem_cgroup_reparent_charges non failing
Date: Thu, 18 Oct 2012 16:30:19 +0800
Message-ID: <507FBE1B.4080906@huawei.com>
References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-6-git-send-email-mhocko@suse.cz>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <1350480648-10905-6-git-send-email-mhocko@suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
To: Michal Hocko
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo, Johannes Weiner, KAMEZAWA Hiroyuki, Balbir Singh

>  static int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
> @@ -5013,13 +5011,9 @@ free_out:
>  static int mem_cgroup_pre_destroy(struct cgroup *cont)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> -	int ret;
>
> -	css_get(&memcg->css);
> -	ret = mem_cgroup_reparent_charges(memcg);
> -	css_put(&memcg->css);
> -
> -	return ret;
> +	mem_cgroup_reparent_charges(memcg);
> +	return 0;
>  }
>

Why don't you make pre_destroy() return void?

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: [PATCH 5/6] memcg: make mem_cgroup_reparent_charges non failing
Date: Thu, 18 Oct 2012 10:42:12 +0200
Message-ID: <20121018084212.GA24295@dhcp22.suse.cz>
References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-6-git-send-email-mhocko@suse.cz> <507FBE1B.4080906@huawei.com>
Mime-Version: 1.0
Return-path:
Content-Disposition: inline
In-Reply-To: <507FBE1B.4080906@huawei.com>
Sender: owner-linux-mm@kvack.org
List-ID:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Li Zefan
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo, Johannes Weiner, KAMEZAWA Hiroyuki, Balbir Singh

On Thu 18-10-12 16:30:19, Li Zefan wrote:
> >  static int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
> > @@ -5013,13 +5011,9 @@ free_out:
> >  static int mem_cgroup_pre_destroy(struct cgroup *cont)
> >  {
> >  	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> > -	int ret;
> >
> > -	css_get(&memcg->css);
> > -	ret = mem_cgroup_reparent_charges(memcg);
> > -	css_put(&memcg->css);
> > -
> > -	return ret;
> > +	mem_cgroup_reparent_charges(memcg);
> > +	return 0;
> >  }
> >
>
> Why don't you make pre_destroy() return void?

Yes, I plan to do that later, once I have feedback on this RFC. I am
especially interested in whether the cgroup core patch is OK or has to
be reworked to pull pre_destroy outside of cgroup_lock.

Thanks
--
Michal Hocko
SUSE Labs
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 1/6] memcg: split mem_cgroup_force_empty into reclaiming and reparenting parts Date: Thu, 18 Oct 2012 14:56:07 -0700 Message-ID: <20121018215607.GN13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-2-git-send-email-mhocko@suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=KVzf7wBisjE7iRHoRaXKH5tr4hkk9PEtf3LpDPF9UDA=; b=sU+nHDpu3OzGiCxfV6gW1nLEqOcALkcblXNWsvDrjIdWnR5bP0BMGGI5EoKwOzy2Ok BqAJ1O8zzhgRAe1EiN2V3euaWt+L7Rxvp58BAEOyWX3m3Pe3tOZBN+8Im/SmPLWMoGFE CmMazxGCzD5Ek4dwuHCg6oCPIVKhVOWzhM7k4OTO2QZA4c1Reotm1AiNve6zxtY82QB3 LMohX6tYl4v3ftKAgfPEcjp7O+H6VgSLT46Ljc88qM5dggKS4rh+H1wqBHXry2gcFVhY o/cvJcem1N5giQGu461QqCXAjXgnz3vcVTbL+nncmcsfMijSLFv+nsXU9Fm+TAcLHiNf BD9A== Content-Disposition: inline In-Reply-To: <1350480648-10905-2-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Wed, Oct 17, 2012 at 03:30:43PM +0200, Michal Hocko wrote: > mem_cgroup_force_empty did two separate things depending on free_all > parameter from the very beginning. It either reclaimed as many pages as > possible and moved the rest to the parent or just moved charges to the > parent. The first variant is used as memory.force_empty callback while > the later is used from the mem_cgroup_pre_destroy. > > The whole games around gotos are far from being nice and there is no > reason to keep those two functions inside one. 
Let's split them and > also move the responsibility for css reference counting to their callers > to make the code easier. > > This patch doesn't have any functional changes. > > Signed-off-by: Michal Hocko Reviewed-by: Tejun Heo Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 2/6] memcg: root_cgroup cannot reach mem_cgroup_move_parent Date: Thu, 18 Oct 2012 14:58:07 -0700 Message-ID: <20121018215807.GO13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-3-git-send-email-mhocko@suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=wRE42eXMU2vU/1HGj9MX8ZaxSBTMqbochMf2P65Q924=; b=B6Qkonn6BfevtEbRZPsuN230cYbr3SgKrHsPjUlzBayj5FW+/3HmXb7lGuEC/3O7OP G61YKiIZ2jF05Uh5clNSy4PGYUPuqiS/uBR+2WYTJUx2W76PzQuhkkKUeYMpvuaRVTmf wslqYPSe/GVzlY6JWw9hcStO4wX4QpEHiYQS7jgf3jRjW7vl036RrwWp2v81AaxvrlGC B7mNASD+UMVmw1glPDn95NI6bonKbfUfMhS2FvKEWjRJSVxiMuxAU6kXaksQQwbf2eFa 6bWvZgLnUkPLQUV9EExs82HXW7EuOH/a3mr9uuDo0oCD+jkIfi1Nsx6xUTXINtfM4Imk 336A== Content-Disposition: inline In-Reply-To: <1350480648-10905-3-git-send-email-mhocko-AlSwsSmVLrQ@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Wed, Oct 17, 2012 at 03:30:44PM +0200, Michal Hocko wrote: > The root cgroup cannot be destroyed
so we never hit it down the > mem_cgroup_pre_destroy path and mem_cgroup_force_empty_write shouldn't > even try to do anything if called for the root. > > This means that mem_cgroup_move_parent doesn't have to bother with the > root cgroup and it can assume it can always move charges upwards. > > Signed-off-by: Michal Hocko Reviewed-by: Tejun Heo Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling Date: Thu, 18 Oct 2012 15:16:54 -0700 Message-ID: <20121018221654.GP13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-4-git-send-email-mhocko@suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=KCD82SeodUsaNZWIl2VrLsV48kcw0pZ2nulrlKCjQWQ=; b=ehhG50pX1Ph0zF2yMPEzGLsiyKbWUpSFldTPlXPRH/tinK/49X4sg/+uGcpLbbvoC+ Bi/pO7npu4Yhx+/GRadCEno3TJOmHb77LfFzI6P35zaE3XSCldCxExMqTjJoLJ20slE/ V6sNZRvEh6/JE/RGiiiMb1DY6QrRw0X814f9DehZOHOOAWPCeioKG+nMggaqNLT7CJyz LeOiTpQH2FhT6l905RCF2Yv30izkgV8XV6Hy4xv1BjiYQPouSfbsrjN2a2vzrzlubMHQ zH10SPAVNOQmE+Nj+9nAo7eL3+iVJfwhj9Ot3DBDDGL2oycUNuOldmBrjfRaEo6+HuG/ C5eg== Content-Disposition: inline In-Reply-To: <1350480648-10905-4-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Hello, Michal. On Wed, Oct 17, 2012 at 03:30:45PM +0200, Michal Hocko wrote: > mem_cgroup_force_empty_list currently tries to remove all pages from > the given LRU.
To guard against temporary failures (EBUSY returned by > mem_cgroup_move_parent) it uses a margin to the current LRU pages and > returns true if there are still some pages left on the list. > > If we consider that mem_cgroup_move_parent fails only when we are racing > with somebody else removing the page (resp. uncharging it) or when the > page is migrated then it is obvious that all those failures are only > temporary and so we can safely retry later. > Let's get rid of the safety margin and make the loop really wait for the > empty LRU. The caller should still make sure that all charges have been > removed from the res_counter because mem_cgroup_replace_page_cache might > add a page to the LRU after the check (it doesn't touch res_counter > though). > This catches most of the cases except for shmem which might call > mem_cgroup_replace_page_cache with a page which is not charged and on > the LRU yet but this was the case also without this patch. In order to > fix this we need a guarantee that try_get_mem_cgroup_from_page falls > back to the current mm's cgroup so it needs css_tryget to fail. This > will be fixed up in a later patch because it needs help from cgroup > core. > > Signed-off-by: Michal Hocko In the sense that "I looked at it and nothing seemed too scary". Reviewed-by: Tejun Heo Some nitpicks below. > /* > - * move charges to its parent. > + * move charges to its parent or the root cgroup if the group > + * has no parent (aka use_hierarchy==0). > + * Although this might fail the failure is always temporary and it > + * signals a race with a page removal/uncharge or migration. In the > + * first case the page will vanish from the LRU on the next attempt > + * and the call should be retried later. > */ > - Maybe convert to proper /** function comment while at it? I also think it would be helpful to actually comment on each possible failure case explaining why the failure condition is temporary.
> /* > * Traverse a specified page_cgroup list and try to drop them all. This doesn't > - * reclaim the pages page themselves - it just removes the page_cgroups. > - * Returns true if some page_cgroups were not freed, indicating that the caller > - * must retry this operation. > + * reclaim the pages page themselves - pages are moved to the parent (or root) > + * group. > */ Ditto. > -static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > +static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > int node, int zid, enum lru_list lru) > { > struct mem_cgroup_per_zone *mz; > - unsigned long flags, loop; > + unsigned long flags; > struct list_head *list; > struct page *busy; > struct zone *zone; > @@ -3696,11 +3701,8 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > mz = mem_cgroup_zoneinfo(memcg, node, zid); > list = &mz->lruvec.lists[lru]; > > - loop = mz->lru_size[lru]; > - /* give some margin against EBUSY etc...*/ > - loop += 256; > busy = NULL; > - while (loop--) { > + do { > struct page_cgroup *pc; > struct page *page; > > @@ -3726,8 +3728,7 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > cond_resched(); > } else > busy = NULL; > - } > - return !list_empty(list); > + } while (!list_empty(list)); > } Is there anything which can keep failing until migration to another cgroup is complete? I think there is, e.g., if mmap_sem is busy or memcg is co-mounted with other controllers and another controller's ->attach() is blocking on something. If so, busy-looping blindly probably isn't a good idea and we would want at least msleep between retries (e.g. have two lists, throw failed ones to the other and sleep shortly when switching the front and back lists). > + /* > + * This is a safety check because mem_cgroup_force_empty_list > + * could have raced with mem_cgroup_replace_page_cache callers > + * so the lru seemed empty but the page could have been added > + * right after the check. 
RES_USAGE should be safe as we always > + * charge before adding to the LRU. > + */ > + } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0); Maybe we want to trigger some warning if retry count gets too high? At least for now? Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Date: Thu, 18 Oct 2012 15:41:48 -0700 Message-ID: <20121018224148.GR13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=D0oCIRKDf2hwW94fy+BaDQlE8fblehxGUVF65yflQzw=; b=rA1ln5NN+M6FOmdcj6btEW3jemNcm+Cu3TU0vINBzuTl8oqwkQkBdrNNgw6sL7zcIi lF39IrXhkhrvwG3dQKh7pUX21FDOLtlUDfCKMsrNFLa/+nc2uscvbLLDV2YtH9EFA43b 7IqQVz1KYu0bC9XwQa0f7yThHmZyOf7aF/D/f9xN6HQHSUXkxOWHzL4auK8StmYRqN8s CSJWqdgmFi7NeO+281Tkoc3KBzTcy+6avjCu4Nd/kNjp0I9dN1ebDESwRkkuvxeu97xE DZ7ZHbdKPrV7c60326iPLTlmRYFrlN/5cjdoj9wG4lHgBStBz6K+iX44rFbSXFEy8KqE 4MAA== Content-Disposition: inline In-Reply-To: <1350480648-10905-5-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Hello, Michal. On Wed, Oct 17, 2012 at 03:30:46PM +0200, Michal Hocko wrote: > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > safely move on and forbid all the callbacks to fail.
The last missing > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > that css_tryget fails so no new charges for the memcg can happen. > The callbacks are also called from within cgroup_lock to guarantee that > no new tasks show up. We could theoretically call them outside of the > lock but then we have to move after CGRP_REMOVED flag is set. > > Signed-off-by: Michal Hocko So, the plan is to do something like the following once memcg is ready. http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251 Note that the patch is broken in a couple places but it does show the general direction. I'd prefer if patch #3 simply makes pre_destroy() return 0 and drop __DEPRECATED_clear_css_refs from mem_cgroup_subsys. Then, I can pull the branch in and drop all the unnecessary cruft. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Date: Thu, 18 Oct 2012 15:46:06 -0700 Message-ID: <20121018224606.GS13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=wj4ZITQnRGYI2tMHIu6BCScsfPAt61viaDgHspKtLYg=; b=nkF2qHMnwQgYuKtMIgkg5cDW51IlF5y/C3fvPyXdDev1WPFxxwohmCu1551qqIVIhw Cf6Rue/bMnncxkcJw9hxmBjGTK4YMmfnIV/JX0fvD4pDYTx8PbCJ4unrh/1zx3AMOSql Ar1j+oWGkRQnKy0a7M9RyuGRP9BPs1o7HYGlJgJuHCtXZgiwB/ZkgncejKmn23k0vLiV 7ocsoDgB7flaPEg/atm4XQnr8cL5kVAkPM2noBka5ja4pdS0LZAU1G6gNEIIOXyAR/r3
mJ68d1Ap8HTS71FHpMzLu/7kS/Vn4/ZUPsFQXGkJPskSCN5oDbCDRbChFfmELyLoa7s5 JhCw== Content-Disposition: inline In-Reply-To: <20121018224148.GR13370-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Thu, Oct 18, 2012 at 03:41:48PM -0700, Tejun Heo wrote: > Note that the patch is broken in a couple places but it does show the > general direction. I'd prefer if patch #3 simply makes pre_destroy() > return 0 and drop __DEPRECATED_clear_css_refs from mem_cgroup_subsys. > Then, I can pull the branch in and drop all the unnecessary cruft. But you need the locking change for further memcg cleanup. To avoid interlocked pulls from both sides, I think it's okay to push this one with the rest of memcg changes. I can do the cleanup on top of this whole series, but please do drop .__DEPRECATED_clear_css_refs from memcg. Acked-by: Tejun Heo Thanks. 
-- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 5/6] memcg: make mem_cgroup_reparent_charges non failing Date: Thu, 18 Oct 2012 15:48:07 -0700 Message-ID: <20121018224807.GT13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-6-git-send-email-mhocko@suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=6Vf78MAEPy5TvvcvhAlOuaCfXNz/qnWGeqJ1rYpm61w=; b=ZHIzYjS6QJnZXUsbiSdSmMWJYbrMKhpl41brqP7KQSQa/Qq1u+Z5fk2tP7kX1ehxGD Xj06WasGMen0VYv0SiA6CcZUWS80GW2fcReoFp1V6srSGQNaShDXNlD+GQRuIn9u9fgy G/GjPA6TAoFg4y8cEgwM3Pl9Quowj5PT3OEKK3Gd3NMpUb3mZPr9vAEGQkfIc8vaQCEo h01fH68w7n6oL6oI2ggGCDO4Pccy0urmrYHFglvlwQPeRjW0WSNJaGnC+zKgGFsJvu1l oFIa7vk451tFpCcg6teNLU5SZpeUp4jsqzj/U/Fj2sYx4mbXuEvbDEursxcur+RQoZ6M X5Zg== Content-Disposition: inline In-Reply-To: <1350480648-10905-6-git-send-email-mhocko@suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Wed, Oct 17, 2012 at 03:30:47PM +0200, Michal Hocko wrote: > Now that pre_destroy callbacks are called from within cgroup_lock and > the cgroup has been checked to be empty without any children then there > is no other way to fail. > mem_cgroup_pre_destroy doesn't have to take a reference to memcg's css > because all css' are marked dead already. > > Signed-off-by: Michal Hocko Reviewed-by: Tejun Heo Thanks. 
-- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 6/6] hugetlb: do not fail in hugetlb_cgroup_pre_destroy Date: Thu, 18 Oct 2012 15:48:25 -0700 Message-ID: <20121018224825.GU13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-7-git-send-email-mhocko@suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=jQ8MW9YSqKxMODGccc1dpuDLt3Y/qb5VC9jRocG2/fU=; b=HzOAy4LbB6DxBvIVPmxUeQ+9uLfiBF0/8CWOPPN2y54uyP20dBAmsp5Ror1Iu1B1mP Jx4CCcm0TPaP3qqmGTaLeH7qOZjsh9l/ke5zqRIEAEkEmS/I4ZaaCR9yn5Jw4mtbreFe ZfGv5/gbWm+Ynch6AIEmJ9uwCNZLJ6ybAPqFYMaL5JnI2saHR1lKt7eZzS8OV/D6fOcl hD4PXbiHLjXlC88zu+bq/38z1FKRHIZCMPzSVKRNtGPlQIsWjEFFy+VjWuBGitDVuqz3 iOeh7/b+fp/xSnM+KsCG0tRRm9q00h4UpzFDwfBSe8BSxe0XJyBQLJw/xwBE/SjypOFu 6vvA== Content-Disposition: inline In-Reply-To: <1350480648-10905-7-git-send-email-mhocko-AlSwsSmVLrQ@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Wed, Oct 17, 2012 at 03:30:48PM +0200, Michal Hocko wrote: > Now that pre_destroy callbacks are called from within cgroup_lock and > the cgroup has been checked to be empty without any children then there > is no other way to fail. > > Signed-off-by: Michal Hocko Reviewed-by: Tejun Heo Thanks. 
-- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 From: Li Zefan Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Date: Fri, 19 Oct 2012 17:33:18 +0800 Message-ID: <50811E5E.1090205@huawei.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <1350480648-10905-5-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On 2012/10/17 21:30, Michal Hocko wrote: > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > safely move on and forbid all the callbacks to fail. The last missing > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > that css_tryget fails so no new charges for the memcg can happen. > The callbacks are also called from within cgroup_lock to guarantee that > no new tasks show up. I'm afraid this won't work. See commit 3fa59dfbc3b223f02c26593be69ce6fc9a940405 ("cgroup: fix potential deadlock in pre_destroy") > We could theoretically call them outside of the > lock but then we have to move after CGRP_REMOVED flag is set. > > Signed-off-by: Michal Hocko > --- > kernel/cgroup.c | 30 +++++++++--------------------- > 1 file changed, 9 insertions(+), 21 deletions(-)
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Date: Fri, 19 Oct 2012 13:09:49 +0200 Message-ID: <20121019110949.GC799@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <50811E5E.1090205@huawei.com> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <50811E5E.1090205-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Li Zefan Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Tejun Heo , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Fri 19-10-12 17:33:18, Li Zefan wrote: > On 2012/10/17 21:30, Michal Hocko wrote: > > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > > safely move on and forbid all the callbacks to fail. The last missing > > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > > that css_tryget fails so no new charges for the memcg can happen. > > > The callbacks are also called from within cgroup_lock to guarantee that > > no new tasks show up. > > I'm afraid this won't work. See commit 3fa59dfbc3b223f02c26593be69ce6fc9a940405 > ("cgroup: fix potential deadlock in pre_destroy") Very good point. Thanks for pointing this out. So we should call pre_destroy at the very end? What about the following? Or should we rather drop the lock after check_for_release(parent) or sooner but after CGRP_REMOVED is set?
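To make the ordering question concrete, here is a minimal user-space sketch of the variant in which pre_destroy runs only after the removed flag is set and the lock has been dropped. It is an illustration under stated assumptions, not the kernel code: toy_group, toy_rmdir, toy_pre_destroy and toy_lock_held are hypothetical names, a pthread mutex stands in for cgroup_mutex, and a plain bool stands in for CGRP_REMOVED.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup; none of these names exist in the kernel. */
struct toy_group {
	int nr_tasks;
	int nr_children;
	bool removed;                /* plays the role of CGRP_REMOVED */
	bool callback_saw_lock_free; /* recorded by the callback */
	bool callback_saw_removed;   /* recorded by the callback */
};

/* Plays the role of cgroup_mutex. */
static pthread_mutex_t toy_mutex = PTHREAD_MUTEX_INITIALIZER;
/* Bookkeeping for this single-threaded sketch only. */
static bool toy_lock_held;

static void toy_lock(void)
{
	pthread_mutex_lock(&toy_mutex);
	toy_lock_held = true;
}

static void toy_unlock(void)
{
	toy_lock_held = false;
	pthread_mutex_unlock(&toy_mutex);
}

/* Stand-in for a subsystem's pre_destroy callback; it cannot fail. */
static void toy_pre_destroy(struct toy_group *grp)
{
	assert(grp != NULL);
	grp->callback_saw_lock_free = !toy_lock_held;
	grp->callback_saw_removed = grp->removed;
}

/* Returns 0 on success or -EBUSY if the group still has tasks or children. */
static int toy_rmdir(struct toy_group *grp)
{
	toy_lock();
	if (grp->nr_tasks || grp->nr_children) {
		toy_unlock();
		return -EBUSY;
	}
	/* From here on nothing new may attach to the group. */
	grp->removed = true;
	toy_unlock();

	/* The callback runs last, without the lock held. */
	toy_pre_destroy(grp);
	return 0;
}
```

Calling toy_rmdir() on an empty group returns 0 and the callback observes the removed flag with the lock free; a group that still has tasks or children gets -EBUSY and the callback never runs.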
---
>From 70ea8718aba1c1784b94bfb26aa2307195c07c0b Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Wed, 17 Oct 2012 13:42:06 +0200
Subject: [PATCH] cgroups: forbid pre_destroy callback to fail

Now that mem_cgroup_pre_destroy callback doesn't fail finally we can
safely move on and forbid all the callbacks to fail. The last missing
piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so
that css_tryget fails so no new charges for the memcg can happen.
We cannot, however, move cgroup_call_pre_destroy right after because we
cannot call mem_cgroup_pre_destroy with the cgroup_lock held (see
3fa59dfb cgroup: fix potential deadlock in pre_destroy) so we have to
move it after the lock is released.

Changes since v1
- Li Zefan pointed out that mem_cgroup_pre_destroy cannot be called with
  cgroup_lock held

Signed-off-by: Michal Hocko
---
 kernel/cgroup.c | 30 +++++++++---------------------
 1 file changed, 9 insertions(+), 21 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b7d9606..4c6adbd 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -855,7 +855,7 @@ static struct inode *cgroup_new_inode(umode_t mode, struct super_block *sb)
  * Call subsys's pre_destroy handler.
  * This is called before css refcnt check.
  */
-static int cgroup_call_pre_destroy(struct cgroup *cgrp)
+static void cgroup_call_pre_destroy(struct cgroup *cgrp)
 {
 	struct cgroup_subsys *ss;
 	int ret = 0;
@@ -864,15 +864,8 @@ static int cgroup_call_pre_destroy(struct cgroup *cgrp)
 		if (!ss->pre_destroy)
 			continue;

-		ret = ss->pre_destroy(cgrp);
-		if (ret) {
-			/* ->pre_destroy() failure is being deprecated */
-			WARN_ON_ONCE(!ss->__DEPRECATED_clear_css_refs);
-			break;
-		}
+		BUG_ON(ss->pre_destroy(cgrp));
 	}
-
-	return ret;
 }

 static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -4161,7 +4154,6 @@ again:
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
 	}
-	mutex_unlock(&cgroup_mutex);

 	/*
 	 * In general, subsystem has no css->refcnt after pre_destroy(). But
@@ -4174,17 +4166,6 @@ again:
 	 */
 	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);

-	/*
-	 * Call pre_destroy handlers of subsys. Notify subsystems
-	 * that rmdir() request comes.
-	 */
-	ret = cgroup_call_pre_destroy(cgrp);
-	if (ret) {
-		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
-		return ret;
-	}
-
-	mutex_lock(&cgroup_mutex);
 	parent = cgrp->parent;
 	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
 		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
@@ -4206,6 +4187,7 @@ again:
 			return -EINTR;
 		goto again;
 	}
+
 	/* NO css_tryget() can success after here. */
 	finish_wait(&cgroup_rmdir_waitq, &wait);
 	clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
@@ -4244,6 +4226,12 @@ again:
 	spin_unlock(&cgrp->event_list_lock);

 	mutex_unlock(&cgroup_mutex);
+
+	/*
+	 * Call pre_destroy handlers of subsys. Notify subsystems
+	 * that rmdir() request comes.
+	 */
+	cgroup_call_pre_destroy(cgrp);
 	return 0;
 }

--
1.7.10.4

--
Michal Hocko
SUSE Labs
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling Date: Fri, 19 Oct 2012 15:24:38 +0200 Message-ID: <20121019132438.GD799@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-4-git-send-email-mhocko@suse.cz> <20121018221654.GP13370@google.com> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121018221654.GP13370-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Tejun Heo Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Thu 18-10-12 15:16:54, Tejun Heo wrote: > Hello, Michal.
> > On Wed, Oct 17, 2012 at 03:30:45PM +0200, Michal Hocko wrote: > > mem_cgroup_force_empty_list currently tries to remove all pages from > > the given LRU. To guard against temporary failures (EBUSY returned by > > mem_cgroup_move_parent) it uses a margin to the current LRU pages and > > returns true if there are still some pages left on the list. > > > > If we consider that mem_cgroup_move_parent fails only when we are racing > > with somebody else removing the page (resp. uncharging it) or when the > > page is migrated then it is obvious that all those failures are only > > temporary and so we can safely retry later. > > Let's get rid of the safety margin and make the loop really wait for the > > empty LRU. The caller should still make sure that all charges have been > > removed from the res_counter because mem_cgroup_replace_page_cache might > > add a page to the LRU after the check (it doesn't touch res_counter > > though). > > This catches most of the cases except for shmem which might call > > mem_cgroup_replace_page_cache with a page which is not charged and on > > the LRU yet but this was the case also without this patch. In order to > > fix this we need a guarantee that try_get_mem_cgroup_from_page falls > > back to the current mm's cgroup so it needs css_tryget to fail. This > > will be fixed up in a later patch because it needs help from cgroup > > core. > > > > Signed-off-by: Michal Hocko > > In the sense that "I looked at it and nothing seemed too scary". > > Reviewed-by: Tejun Heo Thanks > > Some nitpicks below. > > > /* > > - * move charges to its parent. > > + * move charges to its parent or the root cgroup if the group > > + * has no parent (aka use_hierarchy==0). > > + * Although this might fail the failure is always temporary and it > > + * signals a race with a page removal/uncharge or migration. In the > > + * first case the page will vanish from the LRU on the next attempt > > + * and the call should be retried later.
> > */ > > - > > Maybe convert to proper /** function comment while at it? These are internal functions and we usually do not create kerneldoc for them. But I can surely change it - it would deserve a bigger clean up then. > I also think it would be helpful to actually comment on each possible > failure case explaining why the failure condition is temporary. What about: " * Although this might fail (get_page_unless_zero, isolate_lru_page or * mem_cgroup_move_account fails) the failure is always temporary and * it signals a race with a page removal/uncharge or migration. In the * first case the page is on the way out and it will vanish from the LRU * on the next attempt and the call should be retried later. * Isolation from the LRU fails only if the page has been isolated from * the LRU since we looked at it and that usually means either global * reclaim or migration going on. The page will either get back to the * LRU or vanish. * Finally mem_cgroup_move_account fails only if the page got uncharged * (!PageCgroupUsed) or moved to a different group. The page will * disappear in the next attempt. " Better? Or should it rather be in the changelog? > > > /* > > * Traverse a specified page_cgroup list and try to drop them all. This doesn't > > - * reclaim the pages page themselves - it just removes the page_cgroups. > > - * Returns true if some page_cgroups were not freed, indicating that the caller > > - * must retry this operation. > > + * reclaim the pages page themselves - pages are moved to the parent (or root) > > + * group. > > */ > > Ditto.
> > > -static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > > +static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > > int node, int zid, enum lru_list lru) > > { > > struct mem_cgroup_per_zone *mz; > > - unsigned long flags, loop; > > + unsigned long flags; > > struct list_head *list; > > struct page *busy; > > struct zone *zone; > > @@ -3696,11 +3701,8 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > > mz = mem_cgroup_zoneinfo(memcg, node, zid); > > list = &mz->lruvec.lists[lru]; > > > > - loop = mz->lru_size[lru]; > > - /* give some margin against EBUSY etc...*/ > > - loop += 256; > > busy = NULL; > > - while (loop--) { > > + do { > > struct page_cgroup *pc; > > struct page *page; > > > > @@ -3726,8 +3728,7 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > > cond_resched(); > > } else > > busy = NULL; > > - } > > - return !list_empty(list); > > + } while (!list_empty(list)); > > } > > Is there anything which can keep failing until migration to another > cgroup is complete? This is not about migration to another cgroup. Remember there are no tasks in the group so we have no origin for the migration. I was talking about migrate_pages. > I think there is, e.g., if mmap_sem is busy or memcg is co-mounted > with other controllers and another controller's ->attach() is blocking > on something. I am not sure I understand your concern. There are no tasks and we will break out of the loop if some appear. And yes, we can retry a lot in pathological cases. But this is a group removal path which is not hot. > If so, busy-looping blindly probably isn't a good idea and we would > want at least msleep between retries (e.g. have two lists, throw > failed ones to the other and sleep shortly when switching the front > and back lists). We do cond_resched if we fail.
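The retry behaviour discussed above can be sketched in user space as follows. This is only an illustration under stated assumptions: toy_item, toy_move_parent and toy_force_empty are hypothetical names, temporary failures are simulated by a per-item counter, and sched_yield() is used as a rough stand-in for cond_resched(). Unlike the kernel loop, which tracks a single busy page, the sketch simply rotates a failed item to the tail of a circular queue. It returns the number of failed attempts so that a caller could warn when the count grows large, as suggested earlier in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical item standing in for a page on an LRU list. */
struct toy_item {
	int busy_left; /* how many more attempts will fail with -EBUSY */
	int moved;     /* set once the item has been "reparented" */
};

/* Fails with -EBUSY a bounded number of times, then succeeds. */
static int toy_move_parent(struct toy_item *it)
{
	if (it->busy_left > 0) {
		it->busy_left--;
		return -EBUSY;
	}
	it->moved = 1;
	return 0;
}

/*
 * Drain a queue of n items and loop until it is empty.
 * Every failure is assumed to be temporary, so the loop terminates.
 * Returns the number of failed attempts.
 */
static unsigned long toy_force_empty(struct toy_item **queue, size_t n)
{
	unsigned long retries = 0;
	size_t head = 0, len = n; /* circular queue over queue[0..n-1] */

	assert(queue != NULL || n == 0);

	while (len > 0) {
		struct toy_item *it = queue[head];

		if (toy_move_parent(it) == 0) {
			/* Success: drop the item from the queue. */
			head = (head + 1) % n;
			len--;
			continue;
		}
		/* Temporary failure: rotate to the tail and yield the CPU. */
		queue[(head + len) % n] = it;
		head = (head + 1) % n;
		retries++;
		sched_yield();
	}
	return retries;
}
```

A caller that wanted the warning could compare the returned count against a threshold of its own choosing; the sketch deliberately leaves that policy out.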
> > + /* > > + * This is a safety check because mem_cgroup_force_empty_list > > + * could have raced with mem_cgroup_replace_page_cache callers > > + * so the lru seemed empty but the page could have been added > > + * right after the check. RES_USAGE should be safe as we always > > + * charge before adding to the LRU. > > + */ > > + } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0); > > Maybe we want to trigger some warning if retry count gets too high? > At least for now? We can, but is this really worth it? -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Date: Fri, 19 Oct 2012 15:32:45 +0200 Message-ID: <20121019133244.GE799@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121018224148.GR13370-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Tejun Heo Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Thu 18-10-12 15:41:48, Tejun Heo wrote: > Hello, Michal. > > On Wed, Oct 17, 2012 at 03:30:46PM +0200, Michal Hocko wrote: > > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > > safely move on and forbid all the callbacks to fail. The last missing > > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > > that css_tryget fails so no new charges for the memcg can happen. > > The callbacks are also called from within cgroup_lock to guarantee that > > no new tasks show up.
We could theoretically call them outside of the > > lock but then we have to move after CGRP_REMOVED flag is set. > > > > Signed-off-by: Michal Hocko > > So, the plan is to do something like the following once memcg is > ready. > > http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251 > > Note that the patch is broken in a couple places but it does show the > general direction. I'd prefer if patch #3 simply makes pre_destroy() > return 0 and drop __DEPRECATED_clear_css_refs from mem_cgroup_subsys. We can still fail in #3 without this patch because there is no guarantee that no new task is attached to the group. And I wanted to keep memcg and generic cgroup parts separated. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Date: Fri, 19 Oct 2012 15:34:15 +0200 Message-ID: <20121019133415.GF799@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121018224606.GS13370@google.com> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121018224606.GS13370-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Tejun Heo Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Thu 18-10-12 15:46:06, Tejun Heo wrote: > On Thu, Oct 18, 2012 at 03:41:48PM -0700, Tejun Heo wrote: > > Note that the patch is broken in a couple places but it does show the > > general direction. I'd prefer if patch #3 simply makes pre_destroy() > > return 0 and drop __DEPRECATED_clear_css_refs from mem_cgroup_subsys.
> > Then, I can pull the branch in and drop all the unnecessary cruft. > > But you need the locking change for further memcg cleanup. To avoid > interlocked pulls from both sides, I think it's okay to push this one > with the rest of memcg changes. I can do the cleanup on top of this > whole series, but please do drop .__DEPRECATED_clear_css_refs from > memcg. OK I will drop that one. > Acked-by: Tejun Heo Do you still agree with the v2 based on Li's feedback? Thanks -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 5/6] memcg: make mem_cgroup_reparent_charges non failing Date: Fri, 19 Oct 2012 15:49:34 +0200 Message-ID: <20121019134934.GG799@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-6-git-send-email-mhocko@suse.cz> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <1350480648-10905-6-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo , Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh This is an updated version of the patch. I have dropped .__DEPRECATED_clear_css_refs in this one as it makes the best sense to me. I didn't add Tejun's Reviewed-by because of this change. Could you recheck, please? --- >From 6c1f2e76e254e7638ad8cc87f319e3492ac80c5b Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 17 Oct 2012 14:15:09 +0200 Subject: [PATCH] memcg: make mem_cgroup_reparent_charges non failing Now that pre_destroy callbacks are called from within cgroup_lock and the cgroup has been checked to be empty without any children then there is no other way to fail. mem_cgroup_pre_destroy doesn't have to take a reference to memcg's css because all css' are marked dead already. 
mem_cgroup_subsys.__DEPRECATED_clear_css_refs can be dropped as mem_cgroup_pre_destroy cannot fail now. Changes since v1 - drop __DEPRECATED_clear_css_refs Signed-off-by: Michal Hocko --- mm/memcontrol.c | 19 ++++++------------- 1 file changed, 6 insertions(+), 13 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f57ba4c..b4d854e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3738,14 +3738,12 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg, * * Caller is responsible for holding css reference on the memcg. */ -static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg) +static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg) { struct cgroup *cgrp = memcg->css.cgroup; int node, zid; do { - if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children)) - return -EBUSY; /* This is for making all *used* pages to be on LRU. */ lru_add_drain_all(); drain_all_stock_sync(memcg); @@ -3771,8 +3769,6 @@ static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg) * charge before adding to the LRU. 
*/ } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0); - - return 0; } /* @@ -3809,7 +3805,9 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) } lru_add_drain(); - return mem_cgroup_reparent_charges(memcg); + mem_cgroup_reparent_charges(memcg); + + return 0; } static int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event) @@ -5013,13 +5011,9 @@ free_out: static int mem_cgroup_pre_destroy(struct cgroup *cont) { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); - int ret; - css_get(&memcg->css); - ret = mem_cgroup_reparent_charges(memcg); - css_put(&memcg->css); - - return ret; + mem_cgroup_reparent_charges(memcg); + return 0; } static void mem_cgroup_destroy(struct cgroup *cont) @@ -5621,7 +5615,6 @@ struct cgroup_subsys mem_cgroup_subsys = { .base_cftypes = mem_cgroup_files, .early_init = 0, .use_id = 1, - .__DEPRECATED_clear_css_refs = true, }; #ifdef CONFIG_MEMCG_SWAP -- 1.7.10.4 -- Michal Hocko SUSE Labs
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling Date: Fri, 19 Oct 2012 12:49:46 -0700 Message-ID: <20121019194946.GM13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-4-git-send-email-mhocko@suse.cz> <20121018221654.GP13370@google.com> <20121019132438.GD799@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=5gjJJOgE6HzKDi3GyHdESwMX5CPXhk1tRVCSLZGVsTE=; b=GdgojG0Ftlo/NYNr6NZ8u92wkFMTKR2ecwrPzy9z9foQzuBFKIJSDRnLGUIFE4fZMq 4jTwuaviIzLlmxM0uZpEPHZp4r6khAFdkQP84Ipfy/fvAJ216J7xX0ncHR+L4U8eKetc UdqkGJklSoYAKbanbd9y/y8Fi6fUhSKV5zvF4mxQK278T9AJA5Ob+Q71w33JKrHTmfyp tigN82ZNjt6z2w53PwtBswyCfhHGb7bzgyW/dUBwZ68QbXLrfyZu/YEJaGegPNv8XVFG dTRIewi1A0NpERjE7m+YcJREM8HlyaWDE/fYqbMVMNIkXLRuPiFUmEOamyXnS55fHkwb 1ASg== Content-Disposition: inline In-Reply-To: <20121019132438.GD799-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Hello, Michal. On Fri, Oct 19, 2012 at 03:24:38PM +0200, Michal Hocko wrote: > > Maybe convert to proper /** function comment while at it? > > these are internal functions and we usually do not create kerneldoc for > them. But I can surely change it - it would deserve a bigger clean up > then. Yeah, I got into the habit of making function comments kerneldoc if the function is important / scary enough.
It's up to you but I think that would be an improvement here. > What about: > " > * Although this might fail (get_page_unless_zero, isolate_lru_page or > * mem_cgroup_move_account fails) the failure is always temporary and > * it signals a race with a page removal/uncharge or migration. In the > * first case the page is on the way out and it will vanish from the LRU > * on the next attempt and the call should be retried later. > * Isolation from the LRU fails only if page has been isolated from > * the LRU since we looked at it and that usually means either global > * reclaim or migration going on. The page will either get back to the > * LRU or vanish. > * Finally mem_cgroup_move_account fails only if the page got uncharged > * (!PageCgroupUsed) or moved to a different group. The page will > * disappear in the next attempt. > " > > Better? Or should it rather be in the changelog? Looks good to me and I personally think it deserves to be a comment. > > Is there anything which can keep failing until migration to another > > cgroup is complete? > > This is not about migration to another cgroup. Remember there are no > tasks in the group so we have no origin for the migration. I was talking > about migrate_pages. > > > I think there is, e.g., if mmap_sem is busy or memcg is co-mounted > > with other controllers and another controller's ->attach() is blocking > > on something. > > I am not sure I understand your concern. There are no tasks and we will > break out of the loop if some appear. And yes we can retry a lot in > pathological cases. But this is a group removal path which is not hot. Ah, okay, I misunderstood that it could wait for task cgroup migration. > > If so, busy-looping blindly probably isn't a good idea and we would > > want at least msleep between retries (e.g. have two lists, throw > > failed ones to the other and sleep shortly when switching the front > > and back lists). > > we do cond_resched if we fail.
If it won't ever spin for someone else sleeping, I think it should be fine. > > Maybe we want to trigger some warning if retry count gets too high? > > At least for now? > > We can but is this really worth it? I don't know. My sense of danger here is likely to be way off compared to yours so if you think it's a fairly safe loop, it probably is. It just reminds me of the busy looping we had in freezer. It was correct but actually manifested as a problem - when a system was going down for emergency hibernation from low battery, that busy loop not too rarely drained the small reserve making the machine lose power before completing hibernation. So, it could be that I'm a bit paranoid here. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Date: Fri, 19 Oct 2012 13:17:36 -0700 Message-ID: <20121019201736.GQ13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <50811E5E.1090205@huawei.com> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=RXGwCGy0eZuC3XKzo4P3ScIX5bwbLADIEXYpFsh8Bmk=; b=WN039gaeoyNY5u5oVqO8ueDzkALjrf6PSZrvPnvNoiu2ZYRZvzfVV4UKiS5HcW8hn5 I9bGl4BsWzZzqSLpVH/MT9/UYY+r2HACvwjEfogn4Qr3WjWBDPac//YYmt2OTGOkfME2 vCU8eWTu4pMDIgZWGj9FE6T3MQitt3n2YDFT3QH/yaU6lxpKB/n+aNQsrZuqx2yBDrZ0 wF47rRFfEFs6eb/ChYJzV78Q/OJWf3L4uX7luroLBJfnEurfXlBRlfy7e7gfFLSSeO95 KJm4LhqIQ9SBNuwPoZUe7PgOJCobvCOUoaYecQ3YH6nBGE0RK/G+f9STLiA4X3c/srk9 JlpQ== Content-Disposition: inline In-Reply-To: <50811E5E.1090205-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Li Zefan Cc: Michal Hocko , 
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Fri, Oct 19, 2012 at 05:33:18PM +0800, Li Zefan wrote: > On 2012/10/17 21:30, Michal Hocko wrote: > > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > > safely move on and forbid all the callbacks to fail. The last missing > > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > > that css_tryget fails so no new charges for the memcg can happen. > > > The callbacks are also called from within cgroup_lock to guarantee that > > no new tasks show up. > > I'm afraid this won't work. See commit 3fa59dfbc3b223f02c26593be69ce6fc9a940405 > ("cgroup: fix potential deadlock in pre_destroy") Yeah, you're right. Argh... we really should unexport cgroup_lock soon. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Date: Fri, 19 Oct 2012 13:24:05 -0700 Message-ID: <20121019202405.GR13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121019133244.GE799@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=WPfSbMXvC9+CNy9mwwBPtw0WQMcRYfhvzVgg9GKl8aE=; b=v3WnGqho/d0j1LUFYv9HK4rskpWy8ocnDbKZRJVQcCdDuVrjfRgwveQTEaa3mslXIw YksqvJQLuZaQJh13zQhVuGi5WHzkEccmAKfTZ8waRXTPMdG6FMUKXR0YUnMWKGNIh7Ws Pb2cuicNd2qJL1bF8cxQnp/o7silh2Aac3CUJXPn9a1zUyQiXs4kt7dITAz9pXb15dEU st2KrdFH1beNuu7bzqs0JvCbY4i+kOzThwj301WvmBQbkLCpHFgqpfqeaKQYYGkKTJ94 n1tQVLd+EF5xbEvDOOlFPCk7fCGjFLKnSfMhezOxq4Nb32Q6iR/AZh0F8qn14sJZlReP zDtw== Content-Disposition: inline
In-Reply-To: <20121019133244.GE799-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Hello, Michal. On Fri, Oct 19, 2012 at 03:32:45PM +0200, Michal Hocko wrote: > On Thu 18-10-12 15:41:48, Tejun Heo wrote: > > Hello, Michal. > > > > On Wed, Oct 17, 2012 at 03:30:46PM +0200, Michal Hocko wrote: > > > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > > > safely move on and forbid all the callbacks to fail. The last missing > > > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > > > that css_tryget fails so no new charges for the memcg can happen. > > > The callbacks are also called from within cgroup_lock to guarantee that > > > no new tasks show up. We could theoretically call them outside of the > > > lock but then we have to move after CGRP_REMOVED flag is set. > > > > > > Signed-off-by: Michal Hocko > > > > So, the plan is to do something like the following once memcg is > > ready. > > > > http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251 > > > > Note that the patch is broken in a couple places but it does show the > > general direction. I'd prefer if patch #3 simply makes pre_destroy() > > return 0 and drop __DEPRECATED_clear_css_refs from mem_cgroup_subsys. > > We can still fail in #3 without this patch because there is no > guarantee that no new task is attached to the group. And I wanted to keep > memcg and generic cgroup parts separated. Yes, but all other controllers are broken that way too and the worst thing which will happen is triggering WARN_ON_ONCE().
Let's note the failure in the commit and remove __DEPRECATED_clear_css_refs in the previous patch. Then, I can pull from you, clean up pre_destroy mess and then you can pull back for further cleanups. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Date: Mon, 22 Oct 2012 12:30:21 +0200 Message-ID: <20121022103021.GA6367@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121019133244.GE799@dhcp22.suse.cz> <20121019202405.GR13370@google.com> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121019202405.GR13370-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Tejun Heo Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Fri 19-10-12 13:24:05, Tejun Heo wrote: > Hello, Michal. > > On Fri, Oct 19, 2012 at 03:32:45PM +0200, Michal Hocko wrote: > > On Thu 18-10-12 15:41:48, Tejun Heo wrote: > > > Hello, Michal. > > > > > > On Wed, Oct 17, 2012 at 03:30:46PM +0200, Michal Hocko wrote: > > > > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > > > > safely move on and forbid all the callbacks to fail. The last missing > > > > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > > > > that css_tryget fails so no new charges for the memcg can happen. > > > > The callbacks are also called from within cgroup_lock to guarantee that > > > > no new tasks show up. We could theoretically call them outside of the > > > > lock but then we have to move after CGRP_REMOVED flag is set.
> > > > > > > > Signed-off-by: Michal Hocko > > > > > > So, the plan is to do something like the following once memcg is > > > ready. > > > > > > http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251 > > > > > > Note that the patch is broken in a couple places but it does show the > > > general direction. I'd prefer if patch #3 simply makes pre_destroy() > > > return 0 and drop __DEPRECATED_clear_css_refs from mem_cgroup_subsys. > > > > We can still fail in #3 without this patch because there is no > > guarantee that no new task is attached to the group. And I wanted to keep > > memcg and generic cgroup parts separated. > > Yes, but all other controllers are broken that way too It's just hugetlb and memcg that have pre_destroy. > and the worst thing which will happen is triggering WARN_ON_ONCE(). The patch does BUG_ON(ss->pre_destroy(cgrp)). I am not sure WARN_ON_ONCE is appropriate here because we would like to have it at least per controller warning. I do not see any reason why to make this more complicated but I am open to suggestions. > Let's note the failure in the commit and remove > __DEPRECATED_clear_css_refs in the previous patch. Then, I can pull > from you, clean up pre_destroy mess and then you can pull back for > further cleanups. Well this will get complicated as there are dependencies between memcg parts (based on Andrew's tree) and your tree. My tree is not pullable as all the patches go via Andrew. I am not sure how to get out of this. There is only one cgroup patch so what about pushing all of this via Andrew and do the follow up cleanups once they get merged? We are not in hurry, are we? Anyway does it really make sense to drop __DEPRECATED_clear_css_refs already in the previous patch when it is _not_ guaranteed that pre_destroy succeeds?
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Date: Wed, 24 Oct 2012 12:25:35 -0700 Message-ID: <20121024192535.GG12182@atj.dyndns.org> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121019133244.GE799@dhcp22.suse.cz> <20121019202405.GR13370@google.com> <20121022103021.GA6367@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=MLIagzEoqT5/U8AAdMNS6Mf9WGdS+4BxlZi2eaaiyUc=; b=FujGUU8whqMjxobD1Y++b+uxG+c8PWK2MY63lxkXT4rrSEuSQnTMBkU1hFthW/U94C a3dH4vzNZ7oHBpZokgnuXqOnJAakONurOj8Yrp7s6GOr8NF0YpTPkALn6wrbXdvE+/R2 MjKhYGaZqDodE0qe4RDvHth4AQWQZwLSyo0iHfX0y2+uHszwHC60cljnkPHciUEVyor1 bNDtO36EaqYRJHx0jDajzIMae1XQUrvY/ayQ3+pzStZrdGWlHhyKdQsgp4/33hZ4femw duAJlj2BA0SGkRFE1M2dna3I8FhW1vxRB5EeT91KlhRIKrzFxJaqVYAFRM9cMHfdzzzI 1Ngw== Content-Disposition: inline In-Reply-To: <20121022103021.GA6367@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Hello, Michal. On Mon, Oct 22, 2012 at 12:30:21PM +0200, Michal Hocko wrote: > > > We can still fail in #3 without this patch because there is no > > > guarantee that no new task is attached to the group. And I wanted to keep > > > memcg and generic cgroup parts separated. > > > > Yes, but all other controllers are broken that way too > > It's just hugetlb and memcg that have pre_destroy. > > > and the worst thing which will happen is triggering WARN_ON_ONCE().
> > The patch does BUG_ON(ss->pre_destroy(cgrp)). I am not sure WARN_ON_ONCE is > appropriate here because we would like to have it at least per > controller warning. I do not see any reason why to make this more > complicated but I am open to suggestions. Once it's dropped from memcg, the next patch can update cgroup core accordingly and the bug will exist for a single commit and the failure mode would be triggering of WARN_ON_ONCE(). Seems pretty simple to me. > > Let's note the failure in the commit and remove > > __DEPRECATED_clear_css_refs in the previous patch. Then, I can pull > > from you, clean up pre_destroy mess and then you can pull back for > > further cleanups. > > Well this will get complicated as there are dependencies between memcg > parts (based on Andrew's tree) and your tree. My tree is not pullable as > all the patches go via Andrew. I am not sure how to get out of this. > There is only one cgroup patch so what about pushing all of this via > Andrew and do the follow up cleanups once they get merged? We are not in > hurry, are we? Let's create a cgroup branch and build things there. I don't think cgroup changes are gonna be a single patch and expect to see at least some bug fixes afterwards and don't wanna keep them floating separate from other cgroup changes. mm being based on top of -next, that should work, right? > Anyway does it really make sense to drop __DEPRECATED_clear_css_refs > already in the previous patch when it is _not_ guaranteed that > pre_destroy succeeds? It makes things simpler here by decoupling memcg change with core cgroup changes and the introduced bug isn't too easy to trigger and even when triggered the failure mode isn't critical. It's not gonna break normal common operations or bisection. As long as the issue is clearly documented, I think it should be fine. Just note that this opens up a race window from deficient cgroup API and the following commits will address it. Thanks.
-- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Date: Thu, 25 Oct 2012 16:37:56 +0200 Message-ID: <20121025143756.GI11105@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121019133244.GE799@dhcp22.suse.cz> <20121019202405.GR13370@google.com> <20121022103021.GA6367@dhcp22.suse.cz> <20121024192535.GG12182@atj.dyndns.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121024192535.GG12182-OlzNCW9NnSVy/B6EtB590w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Tejun Heo Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Wed 24-10-12 12:25:35, Tejun Heo wrote: > Hello, Michal. > > On Mon, Oct 22, 2012 at 12:30:21PM +0200, Michal Hocko wrote: > > > > We can still fail in #3 without this patch because there is no > > > > guarantee that no new task is attached to the group. And I wanted to keep > > > > memcg and generic cgroup parts separated. > > > > > > Yes, but all other controllers are broken that way too > > > > It's just hugetlb and memcg that have pre_destroy. > > > > > and the worst thing which will happen is triggering WARN_ON_ONCE(). > > > > The patch does BUG_ON(ss->pre_destroy(cgrp)). I am not sure WARN_ON_ONCE is > > appropriate here because we would like to have it at least per > > controller warning.
I do not see any reason why to make this more > > complicated but I am open to suggestions. > > Once it's dropped from memcg, the next patch can update cgroup core > accordingly and the bug will exist for a single commit and the failure > mode would be triggering of WARN_ON_ONCE(). Seems pretty simple to > me. I am not sure I understand you here. So are you suggesting s/BUG_ON/WARN_ON_ONCE/ in this patch? It is true that this will not break bisectability but it is still not correct (strictly speaking because any load that can race group removal with new tasks addition would hit BUG/WARN and we will remove a group with a task inside). The patchset as posted makes sure that none of the stages adds a regression and I would like to stick with that as much as possible if it doesn't cause too much of a hassle. > > > Let's note the failure in the commit and remove > > > __DEPRECATED_clear_css_refs in the previous patch. Then, I can pull > > > from you, clean up pre_destroy mess and then you can pull back for > > > further cleanups. > > > > Well this will get complicated as there are dependencies between memcg > > parts (based on Andrew's tree) and your tree. My tree is not pullable as > > all the patches go via Andrew. I am not sure how to get out of this. > > There is only one cgroup patch so what about pushing all of this via > > Andrew and do the follow up cleanups once they get merged? We are not in > > hurry, are we? > > Let's create a cgroup branch and build things there. I don't think > cgroup changes are gonna be a single patch and expect to see at least > some bug fixes afterwards and don't wanna keep them floating separate > from other cgroup changes. > mm being based on top of -next, that should work, right? Well, a tree based on -next is, ehm, impractical. I can create a branch on top of my -mm git branch (where I merge your cgroup common changes) for development and then when we are ready we can send it as a series and push it via Andrew. Would that work for you?
Or we can push the core part via Andrew, wait for the merge and work on the follow up cleanups later? It is not like the follow up part is really urgent, is it? I would just like the memcg part settled first because this can potentially conflict with other memcg work. [...] -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Date: Thu, 25 Oct 2012 10:42:20 -0700 Message-ID: <20121025174220.GJ11442@htj.dyndns.org> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121019133244.GE799@dhcp22.suse.cz> <20121019202405.GR13370@google.com> <20121022103021.GA6367@dhcp22.suse.cz> <20121024192535.GG12182@atj.dyndns.org> <20121025143756.GI11105@dhcp22.suse.cz> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=4avZoNj2AydfQBjJ2YevFNb7ek2IBuZdTTHImXgQteA=; b=YnqIT9ltBETWVo5v8t/nviqo4oKQDMam+v46z2tvWPK4Bws+bpUGhMGAa6xPEpmrud M0RSQbpnQAQFrCaxrYsMyw+X9ed1AyozUKvwWke6guQBSeZZy2Wj9a6teytTxqB3+qNb lO0b3bA0bEplWifKssolXX0HuSjPnTl/pn2rg5OzrTAt4GVroxlSSRGqOxLNISS/A6gp t2UA6t4kmr8H6ag1Ijt79NjWSlKPCUKMrquOVoNrHc78hWHuWXABRUhxjotw+fHLgfr5 oYws0K/6cWxoY5QIgQ6nguj5eJFkVoL3UzgiYlzIU9eLDRWviiZ+kYcyVXN7gTybovsD UjIA== Content-Disposition: inline In-Reply-To: <20121025143756.GI11105-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Michal Hocko Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Hey,
Michal. On Thu, Oct 25, 2012 at 04:37:56PM +0200, Michal Hocko wrote: > I am not sure I understand you here. So are you suggesting > s/BUG_ON/WARN_ON_ONCE/ in this patch? Oh, no, I meant that we can do up to patch 3 of this series and then follow up with proper cgroup core update and then stack further memcg cleanups on top. > > Let's create a cgroup branch and build things there. I don't think > > cgroup changes are gonna be a single patch and expect to see at least > > some bug fixes afterwards and don't wanna keep them floating separate > > from other cgroup changes. > > > mm being based on top of -next, that should work, right? > > Well, a tree based on -next is, ehm, impractical. I can create a branch on > top of my -mm git branch (where I merge your cgroup common changes) for > development and then when we are ready we can send it as a series and > push it via Andrew. Would that work for you? > Or we can push the core part via Andrew, wait for the merge and work on > the follow up cleanups later? > It is not like the follow up part is really urgent, is it? I would > just like the memcg part settled first because this can potentially > conflict with other memcg work. Argh... can we pretty *please* just do a plain git branch? I don't care where it is but I want to be able to pull it into cgroup core and yes I do wanna make this happen in this devel cycle. We've been sitting on it far too long waiting for memcg. Thanks.
-- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Date: Thu, 25 Oct 2012 20:48:34 +0200 Message-ID: <20121025184834.GB20618@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121019133244.GE799@dhcp22.suse.cz> <20121019202405.GR13370@google.com> <20121022103021.GA6367@dhcp22.suse.cz> <20121024192535.GG12182@atj.dyndns.org> <20121025143756.GI11105@dhcp22.suse.cz> <20121025174220.GJ11442@htj.dyndns.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <20121025174220.GJ11442@htj.dyndns.org> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Tejun Heo Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Thu 25-10-12 10:42:20, Tejun Heo wrote: > Hey, Michal. > > On Thu, Oct 25, 2012 at 04:37:56PM +0200, Michal Hocko wrote: > > I am not sure I understand you here. So are you suggesting > > s/BUG_ON/WARN_ON_ONCE/ in this patch? > > Oh, no, I meant that we can do up to patch 3 of this series and then > follow up with proper cgroup core update and then stack further > memcg cleanups on top. I thought the later cleanups would be on top of the series. > > > Let's create a cgroup branch and build things there. I don't think > > > cgroup changes are gonna be a single patch and expect to see at least > > > some bug fixes afterwards and don't wanna keep them floating separate > > > from other cgroup changes. > > > > > mm being based on top of -next, that should work, right? > > > > Well, a tree based on -next is, ehm, impractical.
I can create a branch on > > top of my -mm git branch (where I merge your cgroup common changes) for > > development and then when we are ready we can send it as a series and > > push it via Andrew. Would that work for you? > > Or we can push the core part via Andrew, wait for the merge and work on > > the follow up cleanups later? > > The follow up part is not really urgent, is it? I would > > just like the memcg part settled first because this can potentially > > conflict with other memcg work. > > Argh... can we pretty *please* just do a plain git branch? I don't > care where it is but I want to be able to pull it into cgroup core and Hohumm, I have tried to apply the series on top of Linus' 3.6 and there were no conflicts so I can create a branch which you can pull into your cgroup branch (which I can then merge into -mm git tree). This would however mean that those patches wouldn't fly through Andrew's tree. Is this really what we want and what does it give us? > yes I do wanna make this happen in this devel cycle. We've been > sitting on it far too long waiting for memcg. I can surely imagine that (for the memcg part) but it needs a thorough review. -- Michal Hocko SUSE Labs
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx197.postini.com [74.125.245.197]) by kanga.kvack.org (Postfix) with SMTP id B1F886B002B for ; Wed, 17 Oct 2012 20:30:41 -0400 (EDT) Received: from m3.gw.fujitsu.co.jp (unknown [10.0.50.73]) by fgwmail5.fujitsu.co.jp (Postfix) with ESMTP id D74E73EE0BD for ; Thu, 18 Oct 2012 09:30:39 +0900 (JST) Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id BE39645DEB5 for ; Thu, 18 Oct 2012 09:30:39 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 9CE2845DEBA for ; Thu, 18 Oct 2012 09:30:39 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 9114F1DB803E for ; Thu, 18 Oct 2012 09:30:39 +0900 (JST) Received: from ml14.s.css.fujitsu.com (ml14.s.css.fujitsu.com [10.240.81.134]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 47BB61DB803B for ; Thu, 18 Oct 2012 09:30:39 +0900 (JST) Message-ID: <507F4D86.106@jp.fujitsu.com> Date: Thu, 18 Oct 2012 09:29:58 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: [RFC] memcg/cgroup: do not fail fail on pre_destroy callbacks References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> In-Reply-To: <1350480648-10905-1-git-send-email-mhocko@suse.cz> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo , Li Zefan , Johannes Weiner , Balbir Singh (2012/10/17 22:30), Michal Hocko wrote: > Hi, > memcg is the only controller which might fail in its pre_destroy > callback which makes the cgroup core more complicated for no good > reason. This is an attempt to change this unfortunate state.
> > I am sending this as an RFC because I would like to hear back whether the > approach is correct. I thought that the changes would be more invasive > but it seems that the current code was mostly prepared for this and it > needs just some small tweaks (so I might be missing something important > here). > > The first two patches are just clean ups. They could be merged even > without the rest. > > The real change, although the code is not changed that much, is the 3rd > patch. It changes the way we handle mem_cgroup_move_parent failures. > We have to realize that all those failures are *temporary*, because we > are either racing with the page removal or the page is temporarily off > the LRU because of migration or global reclaim. As a result we do > not fail mem_cgroup_force_empty_list if the page cannot be moved to the > parent and rather retry until the LRU is empty. > > The 4th patch is for cgroup core. I have moved cgroup_call_pre_destroy > inside the cgroup_lock which is not very nice because the callbacks > can take some time. Maybe we can move this call to the very end of the > function? > All I need for memcg is that cgroup_call_pre_destroy has been called and > that no new cgroups can be attached to the group. The cgroup_lock is > necessary for the latter condition but if we move it after the CGRP_REMOVED flag > is set then we are safe as well. > > The last two patches are trivial follow ups for the cgroups core change > because now we know that nobody will interfere with us so we can drop > those empty && no child conditions. > > Comments, thoughts?
> > Michal Hocko (6): > memcg: split mem_cgroup_force_empty into reclaiming and reparenting parts > memcg: root_cgroup cannot reach mem_cgroup_move_parent > memcg: Simplify mem_cgroup_force_empty_list error handling > cgroups: forbid pre_destroy callback to fail > memcg: make mem_cgroup_reparent_charges non failing > hugetlb: do not fail in hugetlb_cgroup_pre_destroy > > Cumulative diffstat: > kernel/cgroup.c | 30 ++++--------- > mm/hugetlb_cgroup.c | 11 ++--- > mm/memcontrol.c | 124 +++++++++++++++++++++++++++------------------------ > 3 files changed, 78 insertions(+), 87 deletions(-) Thank you very much! The whole patch set seems good to me and I like this approach. Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx192.postini.com [74.125.245.192]) by kanga.kvack.org (Postfix) with SMTP id 3471C6B0044 for ; Thu, 18 Oct 2012 17:58:12 -0400 (EDT) Received: by mail-da0-f41.google.com with SMTP id i14so4302020dad.14 for ; Thu, 18 Oct 2012 14:58:11 -0700 (PDT) Date: Thu, 18 Oct 2012 14:58:07 -0700 From: Tejun Heo Subject: Re: [PATCH 2/6] memcg: root_cgroup cannot reach mem_cgroup_move_parent Message-ID: <20121018215807.GO13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-3-git-send-email-mhocko@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1350480648-10905-3-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Wed, Oct 17, 2012 at 03:30:44PM +0200, Michal Hocko wrote: > The root cgroup cannot be destroyed so we never hit it down the >
mem_cgroup_pre_destroy path and mem_cgroup_force_empty_write shouldn't > even try to do anything if called for the root. > > This means that mem_cgroup_move_parent doesn't have to bother with the > root cgroup and it can assume it can always move charges upwards. > > Signed-off-by: Michal Hocko Reviewed-by: Tejun Heo Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx128.postini.com [74.125.245.128]) by kanga.kvack.org (Postfix) with SMTP id DC1CF6B0062 for ; Thu, 18 Oct 2012 18:46:10 -0400 (EDT) Received: by mail-pa0-f41.google.com with SMTP id fa10so9883265pad.14 for ; Thu, 18 Oct 2012 15:46:10 -0700 (PDT) Date: Thu, 18 Oct 2012 15:46:06 -0700 From: Tejun Heo Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121018224606.GS13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121018224148.GR13370@google.com> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Thu, Oct 18, 2012 at 03:41:48PM -0700, Tejun Heo wrote: > Note that the patch is broken in a couple places but it does show the > general direction. I'd prefer if patch #3 simply makes pre_destroy() > return 0 and drop __DEPRECATED_clear_css_refs from mem_cgroup_subsys. > Then, I can pull the branch in and drop all the unnecessary cruft. But you need the locking change for further memcg cleanup.
To avoid interlocked pulls from both sides, I think it's okay to push this one with the rest of memcg changes. I can do the cleanup on top of this whole series, but please do drop .__DEPRECATED_clear_css_refs from memcg. Acked-by: Tejun Heo Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx161.postini.com [74.125.245.161]) by kanga.kvack.org (Postfix) with SMTP id EDF226B0068 for ; Thu, 18 Oct 2012 18:48:11 -0400 (EDT) Received: by mail-pa0-f41.google.com with SMTP id fa10so9884456pad.14 for ; Thu, 18 Oct 2012 15:48:11 -0700 (PDT) Date: Thu, 18 Oct 2012 15:48:07 -0700 From: Tejun Heo Subject: Re: [PATCH 5/6] memcg: make mem_cgroup_reparent_charges non failing Message-ID: <20121018224807.GT13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-6-git-send-email-mhocko@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1350480648-10905-6-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Wed, Oct 17, 2012 at 03:30:47PM +0200, Michal Hocko wrote: > Now that pre_destroy callbacks are called from within cgroup_lock and > the cgroup has been checked to be empty without any children then there > is no other way to fail. > mem_cgroup_pre_destroy doesn't have to take a reference to memcg's css > because all css' are marked dead already. > > Signed-off-by: Michal Hocko Reviewed-by: Tejun Heo Thanks. -- tejun
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx115.postini.com [74.125.245.115]) by kanga.kvack.org (Postfix) with SMTP id 1B2736B005D for ; Thu, 18 Oct 2012 18:48:30 -0400 (EDT) Received: by mail-da0-f41.google.com with SMTP id i14so4320161dad.14 for ; Thu, 18 Oct 2012 15:48:29 -0700 (PDT) Date: Thu, 18 Oct 2012 15:48:25 -0700 From: Tejun Heo Subject: Re: [PATCH 6/6] hugetlb: do not fail in hugetlb_cgroup_pre_destroy Message-ID: <20121018224825.GU13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-7-git-send-email-mhocko@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1350480648-10905-7-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Wed, Oct 17, 2012 at 03:30:48PM +0200, Michal Hocko wrote: > Now that pre_destroy callbacks are called from within cgroup_lock and > the cgroup has been checked to be empty without any children then there > is no other way to fail. > > Signed-off-by: Michal Hocko Reviewed-by: Tejun Heo Thanks. -- tejun
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx113.postini.com [74.125.245.113]) by kanga.kvack.org (Postfix) with SMTP id 637F46B005D for ; Fri, 19 Oct 2012 07:09:53 -0400 (EDT) Date: Fri, 19 Oct 2012 13:09:49 +0200 From: Michal Hocko Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121019110949.GC799@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <50811E5E.1090205@huawei.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50811E5E.1090205@huawei.com> Sender: owner-linux-mm@kvack.org List-ID: To: Li Zefan Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Fri 19-10-12 17:33:18, Li Zefan wrote: > On 2012/10/17 21:30, Michal Hocko wrote: > > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > > safely move on and forbid all the callbacks to fail. The last missing > > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > > that css_tryget fails so no new charges for the memcg can happen. > > > The callbacks are also called from within cgroup_lock to guarantee that > > no new tasks show up. > > I'm afraid this won't work. See commit 3fa59dfbc3b223f02c26593be69ce6fc9a940405 > ("cgroup: fix potential deadlock in pre_destroy") Very good point. Thanks for pointing this out. So we should call pre_destroy at the very end? What about the following? Or should we rather drop the lock after check_for_release(parent) or sooner but after CGRP_REMOVED is set?
--- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx151.postini.com [74.125.245.151]) by kanga.kvack.org (Postfix) with SMTP id 5139C6B005D for ; Fri, 19 Oct 2012 09:24:42 -0400 (EDT) Date: Fri, 19 Oct 2012 15:24:38 +0200 From: Michal Hocko Subject: Re: [PATCH 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling Message-ID: <20121019132438.GD799@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-4-git-send-email-mhocko@suse.cz> <20121018221654.GP13370@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121018221654.GP13370@google.com> Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Thu 18-10-12 15:16:54, Tejun Heo wrote: > Hello, Michal. > > On Wed, Oct 17, 2012 at 03:30:45PM +0200, Michal Hocko wrote: > > mem_cgroup_force_empty_list currently tries to remove all pages from > > the given LRU. To guard against temporary failures (EBUSY returned by > > mem_cgroup_move_parent) it uses a margin on top of the current number of LRU pages and > > returns true if there are still some pages left on the list. > > > > If we consider that mem_cgroup_move_parent fails only when we are racing > > with somebody else removing the page (or uncharging it) or when the > > page is migrated then it is obvious that all those failures are only > > temporary and so we can safely retry later. > > Let's get rid of the safety margin and make the loop really wait for the > > empty LRU. The caller should still make sure that all charges have been > > removed from the res_counter because mem_cgroup_replace_page_cache might > > add a page to the LRU after the check (it doesn't touch res_counter > > though).
> > This catches most of the cases except for shmem which might call > > mem_cgroup_replace_page_cache with a page which is not charged and on > > the LRU yet but this was the case also without this patch. In order to > > fix this we need a guarantee that try_get_mem_cgroup_from_page falls > > back to the current mm's cgroup so it needs css_tryget to fail. This > > will be fixed up in a later patch because it needs help from cgroup > > core. > > > > Signed-off-by: Michal Hocko > > In the sense that "I looked at it and nothing seemed too scary". > > Reviewed-by: Tejun Heo Thanks > > Some nitpicks below. > > > /* > > - * move charges to its parent. > > + * move charges to its parent or the root cgroup if the group > > + * has no parent (aka use_hierarchy==0). > > + * Although this might fail the failure is always temporary and it > > + * signals a race with a page removal/uncharge or migration. In the > > + * first case the page will vanish from the LRU on the next attempt > > + * and the call should be retried later. > > */ > > - > > Maybe convert to proper /** function comment while at it? These are internal functions and we usually do not create kerneldoc for them. But I can surely change it - it would deserve a bigger clean up then. > I also think it would be helpful to actually comment on each possible > failure case explaining why the failure condition is temporary. What about: " * Although this might fail (get_page_unless_zero, isolate_lru_page or * mem_cgroup_move_account fails) the failure is always temporary and * it signals a race with a page removal/uncharge or migration. In the * first case the page is on the way out and it will vanish from the LRU * on the next attempt and the call should be retried later. * Isolation from the LRU fails only if page has been isolated from * the LRU since we looked at it and that usually means either global * reclaim or migration going on. The page will either get back to the * LRU or vanish.
* Finally mem_cgroup_move_account fails only if the page got uncharged * (!PageCgroupUsed) or moved to a different group. The page will * disappear in the next attempt. " Better? Or should it rather be in the changelog? > > > /* > > * Traverse a specified page_cgroup list and try to drop them all. This doesn't > > - * reclaim the pages page themselves - it just removes the page_cgroups. > > - * Returns true if some page_cgroups were not freed, indicating that the caller > > - * must retry this operation. > > + * reclaim the pages page themselves - pages are moved to the parent (or root) > > + * group. > > */ > > Ditto. > > > -static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > > +static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > > int node, int zid, enum lru_list lru) > > { > > struct mem_cgroup_per_zone *mz; > > - unsigned long flags, loop; > > + unsigned long flags; > > struct list_head *list; > > struct page *busy; > > struct zone *zone; > > @@ -3696,11 +3701,8 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > > mz = mem_cgroup_zoneinfo(memcg, node, zid); > > list = &mz->lruvec.lists[lru]; > > > > - loop = mz->lru_size[lru]; > > - /* give some margin against EBUSY etc...*/ > > - loop += 256; > > busy = NULL; > > - while (loop--) { > > + do { > > struct page_cgroup *pc; > > struct page *page; > > > > @@ -3726,8 +3728,7 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > > cond_resched(); > > } else > > busy = NULL; > > - } > > - return !list_empty(list); > > + } while (!list_empty(list)); > > } > > Is there anything which can keep failing until migration to another > cgroup is complete? This is not about migration to another cgroup. Remember there are no tasks in the group so we have no origin for the migration. I was talking about migrate_pages.
> I think there is, e.g., if mmap_sem is busy or memcg is co-mounted > with other controllers and another controller's ->attach() is blocking > on something. I am not sure I understand your concern. There are no tasks and we will break out of the loop if some appear. And yes we can retry a lot in pathological cases. But this is a group removal path which is not hot. > If so, busy-looping blindly probably isn't a good idea and we would > want at least msleep between retries (e.g. have two lists, throw > failed ones to the other and sleep shortly when switching the front > and back lists). We do cond_resched if we fail. > > + /* > > + * This is a safety check because mem_cgroup_force_empty_list > > + * could have raced with mem_cgroup_replace_page_cache callers > > + * so the lru seemed empty but the page could have been added > > + * right after the check. RES_USAGE should be safe as we always > > + * charge before adding to the LRU. > > + */ > > + } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0); > > Maybe we want to trigger some warning if retry count gets too high? > At least for now? We can but is this really worth it? -- Michal Hocko SUSE Labs
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx183.postini.com [74.125.245.183]) by kanga.kvack.org (Postfix) with SMTP id 99A956B005D for ; Fri, 19 Oct 2012 09:32:47 -0400 (EDT) Date: Fri, 19 Oct 2012 15:32:45 +0200 From: Michal Hocko Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121019133244.GE799@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121018224148.GR13370@google.com> Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Thu 18-10-12 15:41:48, Tejun Heo wrote: > Hello, Michal. > > On Wed, Oct 17, 2012 at 03:30:46PM +0200, Michal Hocko wrote: > > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > > safely move on and forbid all the callbacks to fail. The last missing > > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > > that css_tryget fails so no new charges for the memcg can happen. > > The callbacks are also called from within cgroup_lock to guarantee that > > no new tasks show up. We could theoretically call them outside of the > > lock but then we have to move after CGRP_REMOVED flag is set. > > > > Signed-off-by: Michal Hocko > > So, the plan is to do something like the following once memcg is > ready. > > http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251 > > Note that the patch is broken in a couple places but it does show the > general direction. I'd prefer if patch #3 simply makes pre_destroy() > return 0 and drop __DEPRECATED_clear_css_refs from mem_cgroup_subsys.
We can still fail in #3 without this patch because there is no guarantee that a new task is not attached to the group. And I wanted to keep memcg and generic cgroup parts separated. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx180.postini.com [74.125.245.180]) by kanga.kvack.org (Postfix) with SMTP id 5FCB06B005D for ; Fri, 19 Oct 2012 09:34:17 -0400 (EDT) Date: Fri, 19 Oct 2012 15:34:15 +0200 From: Michal Hocko Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121019133415.GF799@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121018224606.GS13370@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121018224606.GS13370@google.com> Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Thu 18-10-12 15:46:06, Tejun Heo wrote: > On Thu, Oct 18, 2012 at 03:41:48PM -0700, Tejun Heo wrote: > > Note that the patch is broken in a couple places but it does show the > > general direction. I'd prefer if patch #3 simply makes pre_destroy() > > return 0 and drop __DEPRECATED_clear_css_refs from mem_cgroup_subsys. > > Then, I can pull the branch in and drop all the unnecessary cruft. > > But you need the locking change for further memcg cleanup. To avoid > interlocked pulls from both sides, I think it's okay to push this one > with the rest of memcg changes.
I can do the cleanup on top of this > whole series, but please do drop .__DEPRECATED_clear_css_refs from > memcg. OK I will drop that one. > Acked-by: Tejun Heo Do you still agree with the v2 based on Li's feedback? Thanks -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx112.postini.com [74.125.245.112]) by kanga.kvack.org (Postfix) with SMTP id 69A3C6B005D for ; Fri, 19 Oct 2012 09:49:37 -0400 (EDT) Date: Fri, 19 Oct 2012 15:49:34 +0200 From: Michal Hocko Subject: Re: [PATCH 5/6] memcg: make mem_cgroup_reparent_charges non failing Message-ID: <20121019134934.GG799@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-6-git-send-email-mhocko@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1350480648-10905-6-git-send-email-mhocko@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo , Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh This is an updated version of the patch. I have dropped .__DEPRECATED_clear_css_refs in this one as it makes the most sense to me. I didn't add Tejun's Reviewed-by because of this change. Could you recheck, please?
--- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx177.postini.com [74.125.245.177]) by kanga.kvack.org (Postfix) with SMTP id D1A126B0080 for ; Fri, 19 Oct 2012 15:49:51 -0400 (EDT) Received: by mail-pb0-f41.google.com with SMTP id rq2so739829pbb.14 for ; Fri, 19 Oct 2012 12:49:51 -0700 (PDT) Date: Fri, 19 Oct 2012 12:49:46 -0700 From: Tejun Heo Subject: Re: [PATCH 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling Message-ID: <20121019194946.GM13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-4-git-send-email-mhocko@suse.cz> <20121018221654.GP13370@google.com> <20121019132438.GD799@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121019132438.GD799@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Hello, Michal. On Fri, Oct 19, 2012 at 03:24:38PM +0200, Michal Hocko wrote: > > Maybe convert to proper /** function comment while at it? > > these are internal functions and we usually do not create kerneldoc for > them. But I can surely change it - it would deserve a bigger clean up > then. Yeah, I got into the habit of making function comments kerneldoc if the function is important / scary enough. It's up to you but I think that would be an improvement here. > What about: > " > * Although this might fail (get_page_unless_zero, isolate_lru_page or > * mem_cgroup_move_account fails) the failure is always temporary and > * it signals a race with a page removal/uncharge or migration. In the > * first case the page is on the way out and it will vanish from the LRU > * on the next attempt and the call should be retried later.
> * Isolation from the LRU fails only if page has been isolated from > * the LRU since we looked at it and that usually means either global > * reclaim or migration going on. The page will either get back to the > * LRU or vanish. > * Finally mem_cgroup_move_account fails only if the page got uncharged > * (!PageCgroupUsed) or moved to a different group. The page will > * disappear in the next attempt. > " > > Better? Or should it rather be in the changelog? Looks good to me and I personally think it deserves to be a comment. > > Is there anything which can keep failing until migration to another > > cgroup is complete? > > This is not about migration to another cgroup. Remember there are no > tasks in the group so we have no origin for the migration. I was talking > about migrate_pages. > > > I think there is, e.g., if mmap_sem is busy or memcg is co-mounted > > with other controllers and another controller's ->attach() is blocking > > on something. > > I am not sure I understand your concern. There are no tasks and we will > break out the loop if some appear. And yes we can retry a lot in > pathological cases. But this is a group removal path which is not hot. Ah, okay, I misunderstood that it could wait for task cgroup migration. > > If so, busy-looping blindly probably isn't a good idea and we would > > want at least msleep between retries (e.g. have two lists, throw > > failed ones to the other and sleep shortly when switching the front > > and back lists). > > we do cond_resched if we fail. If it won't ever spin for someone else sleeping, I think it should be fine. > > Maybe we want to trigger some warning if retry count gets too high? > > At least for now? > > We can but is this really worth it? I don't know. My sense of danger here is likely to be way off compared to yours so if you think it's a fairly safe loop, it probably is. It just reminds me of the busy looping we had in freezer.
It was correct but actually manifested as a problem - when a system was going down for emergency hibernation from low battery, that busy loop fairly often drained the small reserve, making the machine lose power before completing hibernation. So, it could be that I'm a bit paranoid here. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx193.postini.com [74.125.245.193]) by kanga.kvack.org (Postfix) with SMTP id 841686B008A for ; Fri, 19 Oct 2012 16:17:41 -0400 (EDT) Received: by mail-pb0-f41.google.com with SMTP id rq2so754762pbb.14 for ; Fri, 19 Oct 2012 13:17:40 -0700 (PDT) Date: Fri, 19 Oct 2012 13:17:36 -0700 From: Tejun Heo Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121019201736.GQ13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <50811E5E.1090205@huawei.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50811E5E.1090205@huawei.com> Sender: owner-linux-mm@kvack.org List-ID: To: Li Zefan Cc: Michal Hocko , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Fri, Oct 19, 2012 at 05:33:18PM +0800, Li Zefan wrote: > On 2012/10/17 21:30, Michal Hocko wrote: > > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > > safely move on and forbid all the callbacks to fail. The last missing > > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > > that css_tryget fails so no new charges for the memcg can happen. > > > The callbacks are also called from within cgroup_lock to guarantee that > > no new tasks show up.
> > I'm afraid this won't work. See commit 3fa59dfbc3b223f02c26593be69ce6fc9a940405 > ("cgroup: fix potential deadlock in pre_destroy") Yeah, you're right. Argh... we really should unexport cgroup_lock soon. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx139.postini.com [74.125.245.139]) by kanga.kvack.org (Postfix) with SMTP id 1C0D66B0092 for ; Fri, 19 Oct 2012 16:24:10 -0400 (EDT) Received: by mail-pa0-f41.google.com with SMTP id fa10so688161pad.14 for ; Fri, 19 Oct 2012 13:24:09 -0700 (PDT) Date: Fri, 19 Oct 2012 13:24:05 -0700 From: Tejun Heo Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121019202405.GR13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121019133244.GE799@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121019133244.GE799@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Hello, Michal. On Fri, Oct 19, 2012 at 03:32:45PM +0200, Michal Hocko wrote: > On Thu 18-10-12 15:41:48, Tejun Heo wrote: > > Hello, Michal. > > > > On Wed, Oct 17, 2012 at 03:30:46PM +0200, Michal Hocko wrote: > > > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > > > safely move on and forbid all the callbacks to fail. The last missing > > > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > > > that css_tryget fails so no new charges for the memcg can happen.
> > > The callbacks are also called from within cgroup_lock to guarantee that > > > no new tasks show up. We could theoretically call them outside of the > > > lock but then we have to move after CGRP_REMOVED flag is set. > > > > > > Signed-off-by: Michal Hocko > > > > So, the plan is to do something like the following once memcg is > > ready. > > > > http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251 > > > > Note that the patch is broken in a couple places but it does show the > > general direction. I'd prefer if patch #3 simply makes pre_destroy() > > return 0 and drop __DEPRECATED_clear_css_refs from mem_cgroup_subsys. > > We can still fail in #3 without this patch because there is no > guarantee that a new task is attached to the group. And I wanted to keep > memcg and generic cgroup parts separated. Yes, but all other controllers are broken that way too and the worst thing which will happen is triggering WARN_ON_ONCE(). Let's note the failure in the commit and remove __DEPRECATED_clear_css_refs in the previous patch. Then, I can pull from you, clean up pre_destroy mess and then you can pull back for further cleanups. Thanks. -- tejun
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx143.postini.com [74.125.245.143]) by kanga.kvack.org (Postfix) with SMTP id 2BD2F6B0062 for ; Mon, 22 Oct 2012 06:30:26 -0400 (EDT) Date: Mon, 22 Oct 2012 12:30:21 +0200 From: Michal Hocko Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121022103021.GA6367@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121019133244.GE799@dhcp22.suse.cz> <20121019202405.GR13370@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121019202405.GR13370@google.com> Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Fri 19-10-12 13:24:05, Tejun Heo wrote: > Hello, Michal. > > On Fri, Oct 19, 2012 at 03:32:45PM +0200, Michal Hocko wrote: > > On Thu 18-10-12 15:41:48, Tejun Heo wrote: > > > Hello, Michal. > > > > > > On Wed, Oct 17, 2012 at 03:30:46PM +0200, Michal Hocko wrote: > > > > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > > > > safely move on and forbid all the callbacks to fail. The last missing > > > > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > > > > that css_tryget fails so no new charges for the memcg can happen. > > > > The callbacks are also called from within cgroup_lock to guarantee that > > > > no new tasks show up. We could theoretically call them outside of the > > > > lock but then we have to move after CGRP_REMOVED flag is set. > > > > > > > > Signed-off-by: Michal Hocko > > > > > > So, the plan is to do something like the following once memcg is > > > ready.
> > > > > > http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251 > > > > > > Note that the patch is broken in a couple places but it does show the > > > general direction. I'd prefer if patch #3 simply makes pre_destroy() > > > return 0 and drop __DEPRECATED_clear_css_refs from mem_cgroup_subsys. > > > > We can still fail in #3 without this patch because there is no > > guarantee that a new task is attached to the group. And I wanted to keep > > memcg and generic cgroup parts separated. > > Yes, but all other controllers are broken that way too It's just hugetlb and memcg that have pre_destroy. > and the worst thing which will happen is triggering WARN_ON_ONCE(). The patch does BUG_ON(ss->pre_destroy(cgrp)). I am not sure WARN_ON_ONCE is appropriate here because we would like to have at least a per-controller warning. I do not see any reason to make this more complicated but I am open to suggestions. > Let's note the failure in the commit and remove > __DEPRECATED_clear_css_refs in the previous patch. Then, I can pull > from you, clean up pre_destroy mess and then you can pull back for > further cleanups. Well this will get complicated as there are dependencies between memcg parts (based on Andrew's tree) and your tree. My tree is not pullable as all the patches go via Andrew. I am not sure how to get out of this. There is only one cgroup patch so what about pushing all of this via Andrew and do the follow up cleanups once they get merged? We are not in a hurry, are we? Anyway does it really make sense to drop __DEPRECATED_clear_css_refs already in the previous patch when it is _not_ guaranteed that pre_destroy succeeds? -- Michal Hocko SUSE Labs
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx136.postini.com [74.125.245.136]) by kanga.kvack.org (Postfix) with SMTP id 925A66B0071 for ; Thu, 25 Oct 2012 10:38:27 -0400 (EDT) Date: Thu, 25 Oct 2012 16:37:56 +0200 From: Michal Hocko Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121025143756.GI11105@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121019133244.GE799@dhcp22.suse.cz> <20121019202405.GR13370@google.com> <20121022103021.GA6367@dhcp22.suse.cz> <20121024192535.GG12182@atj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121024192535.GG12182@atj.dyndns.org> Sender: owner-linux-mm@kvack.org List-ID: To: Tejun Heo Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh On Wed 24-10-12 12:25:35, Tejun Heo wrote: > Hello, Michal. > > On Mon, Oct 22, 2012 at 12:30:21PM +0200, Michal Hocko wrote: > > > > We can still fail in #3 without this patch because there is no > > > > guarantee that a new task is attached to the group. And I wanted to keep > > > > memcg and generic cgroup parts separated. > > > > > > Yes, but all other controllers are broken that way too > > > > It's just hugetlb and memcg that have pre_destroy. > > > > > and the worst thing which will happen is triggering WARN_ON_ONCE(). > > > > The patch does BUG_ON(ss->pre_destroy(cgrp)). I am not sure WARN_ON_ONCE is > > appropriate here because we would like to have at least a per-controller > > warning. I do not see any reason to make this more > > complicated but I am open to suggestions.
> > Once it's dropped from memcg, the next patch can update cgroup core > accordingly and the bug will exist for a single commit and the failure > mode would be triggering of WARN_ON_ONCE(). Seems pretty simple to > me. I am not sure I understand you here. So are you suggesting s/BUG_ON/WARN_ON_ONCE/ in this patch? It is true that this will not break bisectability but it is still not correct (strictly speaking, because any load that can race group removal with new tasks addition would hit BUG/WARN and we will remove a group with a task inside). The patchset as posted makes sure that none of the stages adds a regression and I would like to stick with that as much as possible if it doesn't cause too much of a hassle. > > > Let's note the failure in the commit and remove > > > __DEPRECATED_clear_css_refs in the previous patch. Then, I can pull > > > from you, clean up pre_destroy mess and then you can pull back for > > > further cleanups. > > > > Well this will get complicated as there are dependencies between memcg > > parts (based on Andrew's tree) and your tree. My tree is not pullable as > > all the patches go via Andrew. I am not sure how to get out of this. > > There is only one cgroup patch so what about pushing all of this via > > Andrew and do the follow up cleanups once they get merged? We are not in a > > hurry, are we? > > Let's create a cgroup branch and build things there. I don't think > cgroup changes are gonna be a single patch and expect to see at least > some bug fixes afterwards and don't wanna keep them floating separate > from other cgroup changes. > mm being based on top of -next, that should work, right? Well, a tree based on -next is, ehm, impractical. I can create a bug on top of my -mm git branch (where I merge your cgroup common changes) for development and then when we are ready we can send it as a series and push it via Andrew. Would that work for you?
Or we can push the core part via Andrew, wait for the merge and work on the follow up cleanups later? It is not like the follow up part is really urgent, is it? I would just like the memcg part settled first because this can potentially conflict with other memcg work. [...] -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx121.postini.com [74.125.245.121]) by kanga.kvack.org (Postfix) with SMTP id EDE076B0073 for ; Thu, 25 Oct 2012 13:42:25 -0400 (EDT) Received: by mail-da0-f41.google.com with SMTP id i14so984293dad.14 for ; Thu, 25 Oct 2012 10:42:25 -0700 (PDT) Date: Thu, 25 Oct 2012 10:42:20 -0700 From: Tejun Heo Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121025174220.GJ11442@htj.dyndns.org> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121019133244.GE799@dhcp22.suse.cz> <20121019202405.GR13370@google.com> <20121022103021.GA6367@dhcp22.suse.cz> <20121024192535.GG12182@atj.dyndns.org> <20121025143756.GI11105@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121025143756.GI11105@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Hey, Michal. On Thu, Oct 25, 2012 at 04:37:56PM +0200, Michal Hocko wrote: > I am not sure I understand you here. So are you suggesting > s/BUG_ON/WARN_ON_ONCE/ in this patch?
Oh, no, I meant that we can do up to patch 3 of this series and then follow up with proper cgroup core update and then stack further memcg cleanups on top. > > Let's create a cgroup branch and build things there. I don't think > > cgroup changes are gonna be a single patch and expect to see at least > > some bug fixes afterwards and don't wanna keep them floating separate > > from other cgroup changes. > > > mm being based on top of -next, that should work, right? > > Well, a tree based on -next is, ehm, impractical. I can create a bug on > top of my -mm git branch (where I merge your cgroup common changes) for > development and then when we are ready we can send it as a series and > push it via Andrew. Would that work for you? > Or we can push the core part via Andrew, wait for the merge and work on > the follow up cleanups later? > It is not like the follow up part is really urgent, is it? I would > just like the memcg part settled first because this can potentially > conflict with other memcg work. Argh... can we pretty *please* just do a plain git branch? I don't care where it is but I want to be able to pull it into cgroup core and yes I do wanna make this happen in this devel cycle. We've been sitting on it far too long waiting for memcg. Thanks. -- tejun
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932239Ab2JQNbU (ORCPT ); Wed, 17 Oct 2012 09:31:20 -0400 Received: from cantor2.suse.de ([195.135.220.15]:46390 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932206Ab2JQNbS (ORCPT ); Wed, 17 Oct 2012 09:31:18 -0400 From: Michal Hocko To: linux-mm@kvack.org Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo , Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: [PATCH 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling Date: Wed, 17 Oct 2012 15:30:45 +0200 Message-Id: <1350480648-10905-4-git-send-email-mhocko@suse.cz> X-Mailer: git-send-email 1.7.10.4 In-Reply-To:
<1350480648-10905-1-git-send-email-mhocko@suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org mem_cgroup_force_empty_list currently tries to remove all pages from the given LRU. To guard against temporary failures (EBUSY returned by mem_cgroup_move_parent) it uses a margin to the current LRU pages and returns true if there are still some pages left on the list. If we consider that mem_cgroup_move_parent fails only when we are racing with somebody else removing the page (resp. uncharging it) or when the page is migrated then it is obvious that all those failures are only temporary and so we can safely retry later. Let's get rid of the safety margin and make the loop really wait for the empty LRU. The caller should still make sure that all charges have been removed from the res_counter because mem_cgroup_replace_page_cache might add a page to the LRU after the check (it doesn't touch res_counter though). This catches most of the cases except for shmem which might call mem_cgroup_replace_page_cache with a page which is not charged and on the LRU yet but this was the case also without this patch. In order to fix this we need a guarantee that try_get_mem_cgroup_from_page falls back to the current mm's cgroup so it needs css_tryget to fail. This will be fixed up in a later patch because it needs help from cgroup core. Signed-off-by: Michal Hocko --- mm/memcontrol.c | 52 +++++++++++++++++++++++++++------------------------- 1 file changed, 27 insertions(+), 25 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9ce24b7..f57ba4c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2697,9 +2697,13 @@ out: } /* - * move charges to its parent. + * move charges to its parent or the root cgroup if the group + * has no parent (aka use_hierarchy==0).
+ * Although this might fail the failure is always temporary and it + * signals a race with a page removal/uncharge or migration. In the + * first case the page will vanish from the LRU on the next attempt + * and the call should be retried later. */ - static int mem_cgroup_move_parent(struct page *page, struct page_cgroup *pc, struct mem_cgroup *child) @@ -2726,8 +2730,10 @@ static int mem_cgroup_move_parent(struct page *page, if (!parent) parent = root_mem_cgroup; - if (nr_pages > 1) + if (nr_pages > 1) { + VM_BUG_ON(!PageTransHuge(page)); flags = compound_lock_irqsave(page); + } ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent); @@ -3679,15 +3685,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, /* * Traverse a specified page_cgroup list and try to drop them all. This doesn't - * reclaim the pages page themselves - it just removes the page_cgroups. - * Returns true if some page_cgroups were not freed, indicating that the caller - * must retry this operation. + * reclaim the pages page themselves - pages are moved to the parent (or root) + * group. 
*/ -static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, +static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg, int node, int zid, enum lru_list lru) { struct mem_cgroup_per_zone *mz; - unsigned long flags, loop; + unsigned long flags; struct list_head *list; struct page *busy; struct zone *zone; @@ -3696,11 +3701,8 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, mz = mem_cgroup_zoneinfo(memcg, node, zid); list = &mz->lruvec.lists[lru]; - loop = mz->lru_size[lru]; - /* give some margin against EBUSY etc...*/ - loop += 256; busy = NULL; - while (loop--) { + do { struct page_cgroup *pc; struct page *page; @@ -3726,8 +3728,7 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, cond_resched(); } else busy = NULL; - } - return !list_empty(list); + } while (!list_empty(list)); } /* @@ -3741,7 +3742,6 @@ static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg) { struct cgroup *cgrp = memcg->css.cgroup; int node, zid; - int ret; do { if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children)) @@ -3749,28 +3749,30 @@ static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg) /* This is for making all *used* pages to be on LRU. */ lru_add_drain_all(); drain_all_stock_sync(memcg); - ret = 0; mem_cgroup_start_move(memcg); for_each_node_state(node, N_HIGH_MEMORY) { - for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) { + for (zid = 0; zid < MAX_NR_ZONES; zid++) { enum lru_list lru; for_each_lru(lru) { - ret = mem_cgroup_force_empty_list(memcg, + mem_cgroup_force_empty_list(memcg, node, zid, lru); - if (ret) - break; } } - if (ret) - break; } mem_cgroup_end_move(memcg); memcg_oom_recover(memcg); cond_resched(); - /* "ret" should also be checked to ensure all lists are empty. 
*/ - } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret); - return ret; + /* + * This is a safety check because mem_cgroup_force_empty_list + * could have raced with mem_cgroup_replace_page_cache callers + * so the lru seemed empty but the page could have been added + * right after the check. RES_USAGE should be safe as we always + * charge before adding to the LRU. + */ + } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0); + + return 0; } /* -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932282Ab2JQNb1 (ORCPT ); Wed, 17 Oct 2012 09:31:27 -0400 Received: from cantor2.suse.de ([195.135.220.15]:46408 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932228Ab2JQNbU (ORCPT ); Wed, 17 Oct 2012 09:31:20 -0400 From: Michal Hocko To: linux-mm@kvack.org Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo , Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: [PATCH 5/6] memcg: make mem_cgroup_reparent_charges non failing Date: Wed, 17 Oct 2012 15:30:47 +0200 Message-Id: <1350480648-10905-6-git-send-email-mhocko@suse.cz> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1350480648-10905-1-git-send-email-mhocko@suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Now that pre_destroy callbacks are called from within cgroup_lock and the cgroup has been checked to be empty without any children then there is no other way to fail. mem_cgroup_pre_destroy doesn't have to take a reference to memcg's css because all css' are marked dead already. 
Signed-off-by: Michal Hocko --- mm/memcontrol.c | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f57ba4c..7c75da3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3738,14 +3738,12 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg, * * Caller is responsible for holding css reference on the memcg. */ -static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg) +static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg) { struct cgroup *cgrp = memcg->css.cgroup; int node, zid; do { - if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children)) - return -EBUSY; /* This is for making all *used* pages to be on LRU. */ lru_add_drain_all(); drain_all_stock_sync(memcg); @@ -3771,8 +3769,6 @@ static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg) * charge before adding to the LRU. */ } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0); - - return 0; } /* @@ -3809,7 +3805,9 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) } lru_add_drain(); - return mem_cgroup_reparent_charges(memcg); + mem_cgroup_reparent_charges(memcg); + + return 0; } static int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event) @@ -5013,13 +5011,9 @@ free_out: static int mem_cgroup_pre_destroy(struct cgroup *cont) { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); - int ret; - css_get(&memcg->css); - ret = mem_cgroup_reparent_charges(memcg); - css_put(&memcg->css); - - return ret; + mem_cgroup_reparent_charges(memcg); + return 0; } static void mem_cgroup_destroy(struct cgroup *cont) -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932260Ab2JQNbZ (ORCPT ); Wed, 17 Oct 2012 09:31:25 -0400 Received: from cantor2.suse.de ([195.135.220.15]:46411 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932206Ab2JQNbV (ORCPT ); Wed, 17 
Oct 2012 09:31:21 -0400 From: Michal Hocko To: linux-mm@kvack.org Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo , Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: [PATCH 6/6] hugetlb: do not fail in hugetlb_cgroup_pre_destroy Date: Wed, 17 Oct 2012 15:30:48 +0200 Message-Id: <1350480648-10905-7-git-send-email-mhocko@suse.cz> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1350480648-10905-1-git-send-email-mhocko@suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Now that pre_destroy callbacks are called from within cgroup_lock and the cgroup has been checked to be empty without any children then there is no other way to fail. Signed-off-by: Michal Hocko --- mm/hugetlb_cgroup.c | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-) diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index a3f358f..dc595c6 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -159,14 +159,9 @@ static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup) { struct hstate *h; struct page *page; - int ret = 0, idx = 0; + int idx = 0; do { - if (cgroup_task_count(cgroup) || - !list_empty(&cgroup->children)) { - ret = -EBUSY; - goto out; - } for_each_hstate(h) { spin_lock(&hugetlb_lock); list_for_each_entry(page, &h->hugepage_activelist, lru) @@ -177,8 +172,8 @@ static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup) } cond_resched(); } while (hugetlb_cgroup_have_usage(cgroup)); -out: - return ret; + + return 0; } int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932209Ab2JQNbS (ORCPT ); Wed, 17 Oct 2012 09:31:18 -0400 Received: from cantor2.suse.de ([195.135.220.15]:46377 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id 
S932179Ab2JQNbQ (ORCPT ); Wed, 17 Oct 2012 09:31:16 -0400 From: Michal Hocko To: linux-mm@kvack.org Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo , Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: [PATCH 1/6] memcg: split mem_cgroup_force_empty into reclaiming and reparenting parts Date: Wed, 17 Oct 2012 15:30:43 +0200 Message-Id: <1350480648-10905-2-git-send-email-mhocko@suse.cz> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1350480648-10905-1-git-send-email-mhocko@suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org mem_cgroup_force_empty did two separate things depending on free_all parameter from the very beginning. It either reclaimed as many pages as possible and moved the rest to the parent or just moved charges to the parent. The first variant is used as memory.force_empty callback while the latter is used from the mem_cgroup_pre_destroy. The games around gotos are far from nice and there is no reason to keep those two functions inside one. Let's split them and also move the responsibility for css reference counting to their callers to make the code easier. This patch doesn't have any functional changes. Signed-off-by: Michal Hocko --- mm/memcontrol.c | 72 ++++++++++++++++++++++++++++++++----------------------- 1 file changed, 42 insertions(+), 30 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e4e9b18..f25e9c0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3733,27 +3733,21 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, } /* - * make mem_cgroup's charge to be 0 if there is no task. + * make mem_cgroup's charge to be 0 if there is no task by moving + * all the charges and pages to the parent. * This enables deleting this mem_cgroup. + * + * Caller is responsible for holding css reference on the memcg.
*/ -static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all) +static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg) { - int ret; - int node, zid, shrink; - int nr_retries = MEM_CGROUP_RECLAIM_RETRIES; struct cgroup *cgrp = memcg->css.cgroup; + int node, zid; + int ret; - css_get(&memcg->css); - - shrink = 0; - /* should free all ? */ - if (free_all) - goto try_to_free; -move_account: do { - ret = -EBUSY; if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children)) - goto out; + return -EBUSY; /* This is for making all *used* pages to be on LRU. */ lru_add_drain_all(); drain_all_stock_sync(memcg); @@ -3777,27 +3771,34 @@ move_account: cond_resched(); /* "ret" should also be checked to ensure all lists are empty. */ } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret); -out: - css_put(&memcg->css); + return ret; +} + +/* + * Reclaims as many pages from the given memcg as possible and moves + * the rest to the parent. + * + * Caller is responsible for holding css reference for memcg. + */ +static int mem_cgroup_force_empty(struct mem_cgroup *memcg) +{ + int nr_retries = MEM_CGROUP_RECLAIM_RETRIES; + struct cgroup *cgrp = memcg->css.cgroup; -try_to_free: /* returns EBUSY if there is a task or if we come here twice. 
*/ - if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children) || shrink) { - ret = -EBUSY; - goto out; - } + if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children)) + return -EBUSY; + /* we call try-to-free pages for make this cgroup empty */ lru_add_drain_all(); /* try to free all pages in this cgroup */ - shrink = 1; while (nr_retries && res_counter_read_u64(&memcg->res, RES_USAGE) > 0) { int progress; - if (signal_pending(current)) { - ret = -EINTR; - goto out; - } + if (signal_pending(current)) + return -EINTR; + progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL, false); if (!progress) { @@ -3808,13 +3809,19 @@ try_to_free: } lru_add_drain(); - /* try move_account...there may be some *locked* pages. */ - goto move_account; + return mem_cgroup_reparent_charges(memcg); } static int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event) { - return mem_cgroup_force_empty(mem_cgroup_from_cont(cont), true); + struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); + int ret; + + css_get(&memcg->css); + ret = mem_cgroup_force_empty(memcg); + css_put(&memcg->css); + + return ret; } @@ -5004,8 +5011,13 @@ free_out: static int mem_cgroup_pre_destroy(struct cgroup *cont) { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); + int ret; - return mem_cgroup_force_empty(memcg, false); + css_get(&memcg->css); + ret = mem_cgroup_reparent_charges(memcg); + css_put(&memcg->css); + + return ret; } static void mem_cgroup_destroy(struct cgroup *cont) -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932294Ab2JQNb7 (ORCPT ); Wed, 17 Oct 2012 09:31:59 -0400 Received: from cantor2.suse.de ([195.135.220.15]:46397 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932179Ab2JQNbT (ORCPT ); Wed, 17 Oct 2012 09:31:19 -0400 From: Michal Hocko To: linux-mm@kvack.org Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo , Li Zefan , 
Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Date: Wed, 17 Oct 2012 15:30:46 +0200 Message-Id: <1350480648-10905-5-git-send-email-mhocko@suse.cz> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1350480648-10905-1-git-send-email-mhocko@suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Now that the mem_cgroup_pre_destroy callback no longer fails, we can finally move on and forbid all the callbacks from failing. The last missing piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so that css_tryget fails and no new charges for the memcg can happen. The callbacks are also called from within cgroup_lock to guarantee that no new tasks show up. We could theoretically call them outside of the lock, but then the call has to move after the CGRP_REMOVED flag is set. Signed-off-by: Michal Hocko --- kernel/cgroup.c | 30 +++++++++--------------------- 1 file changed, 9 insertions(+), 21 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index b7d9606..00729c1 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -855,7 +855,7 @@ static struct inode *cgroup_new_inode(umode_t mode, struct super_block *sb) * Call subsys's pre_destroy handler. * This is called before css refcnt check. 
*/ -static int cgroup_call_pre_destroy(struct cgroup *cgrp) +static void cgroup_call_pre_destroy(struct cgroup *cgrp) { struct cgroup_subsys *ss; int ret = 0; @@ -864,15 +864,8 @@ static int cgroup_call_pre_destroy(struct cgroup *cgrp) if (!ss->pre_destroy) continue; - ret = ss->pre_destroy(cgrp); - if (ret) { - /* ->pre_destroy() failure is being deprecated */ - WARN_ON_ONCE(!ss->__DEPRECATED_clear_css_refs); - break; - } + BUG_ON(ss->pre_destroy(cgrp)); } - - return ret; } static void cgroup_diput(struct dentry *dentry, struct inode *inode) @@ -4161,7 +4154,6 @@ again: mutex_unlock(&cgroup_mutex); return -EBUSY; } - mutex_unlock(&cgroup_mutex); /* * In general, subsystem has no css->refcnt after pre_destroy(). But @@ -4174,17 +4166,6 @@ again: */ set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags); - /* - * Call pre_destroy handlers of subsys. Notify subsystems - * that rmdir() request comes. - */ - ret = cgroup_call_pre_destroy(cgrp); - if (ret) { - clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags); - return ret; - } - - mutex_lock(&cgroup_mutex); parent = cgrp->parent; if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) { clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags); @@ -4206,6 +4187,13 @@ again: return -EINTR; goto again; } + + /* + * Call pre_destroy handlers of subsys. Notify subsystems + * that rmdir() request comes. + */ + cgroup_call_pre_destroy(cgrp); + /* NO css_tryget() can success after here. 
*/ finish_wait(&cgroup_rmdir_waitq, &wait); clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags); -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757032Ab2JQNcl (ORCPT ); Wed, 17 Oct 2012 09:32:41 -0400 Received: from cantor2.suse.de ([195.135.220.15]:46384 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932181Ab2JQNbR (ORCPT ); Wed, 17 Oct 2012 09:31:17 -0400 From: Michal Hocko To: linux-mm@kvack.org Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo , Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: [PATCH 2/6] memcg: root_cgroup cannot reach mem_cgroup_move_parent Date: Wed, 17 Oct 2012 15:30:44 +0200 Message-Id: <1350480648-10905-3-git-send-email-mhocko@suse.cz> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1350480648-10905-1-git-send-email-mhocko@suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The root cgroup cannot be destroyed, so we never hit it down the mem_cgroup_pre_destroy path, and mem_cgroup_force_empty_write shouldn't even try to do anything if called for the root. This means that mem_cgroup_move_parent doesn't have to bother with the root cgroup and can assume it can always move charges upwards. Signed-off-by: Michal Hocko --- mm/memcontrol.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f25e9c0..9ce24b7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2709,9 +2709,7 @@ static int mem_cgroup_move_parent(struct page *page, unsigned long uninitialized_var(flags); int ret; - /* Is ROOT ? 
*/ - if (mem_cgroup_is_root(child)) - return -EINVAL; + VM_BUG_ON(mem_cgroup_is_root(child)); ret = -EBUSY; if (!get_page_unless_zero(page)) @@ -3817,6 +3815,8 @@ static int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event) struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); int ret; + if (mem_cgroup_is_root(memcg)) + return -EINVAL; css_get(&memcg->css); ret = mem_cgroup_force_empty(memcg); css_put(&memcg->css); -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932276Ab2JQPbL (ORCPT ); Wed, 17 Oct 2012 11:31:11 -0400 Received: from mx2.parallels.com ([64.131.90.16]:60819 "EHLO mx2.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932220Ab2JQPbH (ORCPT ); Wed, 17 Oct 2012 11:31:07 -0400 Message-ID: <507ECF28.1060602@parallels.com> Date: Wed, 17 Oct 2012 19:30:48 +0400 From: Glauber Costa User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/20120911 Thunderbird/15.0.1 MIME-Version: 1.0 To: Michal Hocko CC: , , , Tejun Heo , Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [RFC] memcg/cgroup: do not fail fail on pre_destroy callbacks References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> In-Reply-To: <1350480648-10905-1-git-send-email-mhocko@suse.cz> Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 10/17/2012 05:30 PM, Michal Hocko wrote: > Hi, > memcg is the only controller which might fail in its pre_destroy > callback which makes the cgroup core more complicated for no good > reason. This is an attempt to change this unfortunate state. > > I am sending this a RFC because I would like to hear back whether the > approach is correct. 
I thought that the changes would be more invasive > but it seems that the current code was mostly prepared for this and it > needs just some small tweaks (so I might be missing something important > here). > > The first two patches are just clean ups. They could be merged even > without the rest. > > The real change, although the code is not changed that much, is the 3rd > patch. It changes the way how we handle mem_cgroup_move_parent failures. > We have to realize that all those failures are *temporal*. Because we > are either racing with the page removal or the page is temporarily off > the LRU because of migration resp. global reclaim. As a result we do > not fail mem_cgroup_force_empty_list if the page cannot be moved to the > parent and rather retry until the LRU is empty. > > The 4th patch is for cgroup core. I have moved cgroup_call_pre_destroy > inside the cgroup_lock which is not very nice because the callbacks > can take some time. Maybe we can move this call at the very end of the > function? > All I need for memcg is that cgroup_call_pre_destroy has been called and > that no new cgroups can be attached to the group. The cgroup_lock is > necessary for the later condition but if we move after CGRP_REMOVED flag > is set then we are safe as well. > > The last two patches are trivial follow ups for the cgroups core change > because now we know that nobody will interfere with us so we can drop > those empty && no child condition. > > Comments, thoughts? > I personally don't see anything fundamentally wrong with this. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753134Ab2JRAan (ORCPT ); Wed, 17 Oct 2012 20:30:43 -0400 Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:36016 "EHLO fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752961Ab2JRAal (ORCPT ); Wed, 17 Oct 2012 20:30:41 -0400 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4 Message-ID: <507F4D86.106@jp.fujitsu.com> Date: Thu, 18 Oct 2012 09:29:58 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20121010 Thunderbird/16.0.1 MIME-Version: 1.0 To: Michal Hocko CC: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo , Li Zefan , Johannes Weiner , Balbir Singh Subject: Re: [RFC] memcg/cgroup: do not fail fail on pre_destroy callbacks References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> In-Reply-To: <1350480648-10905-1-git-send-email-mhocko@suse.cz> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2012/10/17 22:30), Michal Hocko wrote: > Hi, > memcg is the only controller which might fail in its pre_destroy > callback which makes the cgroup core more complicated for no good > reason. This is an attempt to change this unfortunate state. > > I am sending this a RFC because I would like to hear back whether the > approach is correct. I thought that the changes would be more invasive > but it seems that the current code was mostly prepared for this and it > needs just some small tweaks (so I might be missing something important > here). > > The first two patches are just clean ups. They could be merged even > without the rest. > > The real change, although the code is not changed that much, is the 3rd > patch. It changes the way how we handle mem_cgroup_move_parent failures. 
> We have to realize that all those failures are *temporal*. Because we > are either racing with the page removal or the page is temporarily off > the LRU because of migration resp. global reclaim. As a result we do > not fail mem_cgroup_force_empty_list if the page cannot be moved to the > parent and rather retry until the LRU is empty. > > The 4th patch is for cgroup core. I have moved cgroup_call_pre_destroy > inside the cgroup_lock which is not very nice because the callbacks > can take some time. Maybe we can move this call at the very end of the > function? > All I need for memcg is that cgroup_call_pre_destroy has been called and > that no new cgroups can be attached to the group. The cgroup_lock is > necessary for the later condition but if we move after CGRP_REMOVED flag > is set then we are safe as well. > > The last two patches are trivial follow ups for the cgroups core change > because now we know that nobody will interfere with us so we can drop > those empty && no child condition. > > Comments, thoughts? > > Michal Hocko (6): > memcg: split mem_cgroup_force_empty into reclaiming and reparenting parts > memcg: root_cgroup cannot reach mem_cgroup_move_parent > memcg: Simplify mem_cgroup_force_empty_list error handling > cgroups: forbid pre_destroy callback to fail > memcg: make mem_cgroup_reparent_charges non failing > hugetlb: do not fail in hugetlb_cgroup_pre_destroy > > Cumulative diffstat: > kernel/cgroup.c | 30 ++++--------- > mm/hugetlb_cgroup.c | 11 ++--- > mm/memcontrol.c | 124 +++++++++++++++++++++++++++------------------------ > 3 files changed, 78 insertions(+), 87 deletions(-) Thank you very much ! The whole patch seems good to me and I like this approach. 
Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754414Ab2JRIah (ORCPT ); Thu, 18 Oct 2012 04:30:37 -0400 Received: from szxga01-in.huawei.com ([119.145.14.64]:8494 "EHLO szxga01-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752619Ab2JRIad (ORCPT ); Thu, 18 Oct 2012 04:30:33 -0400 Message-ID: <507FBE1B.4080906@huawei.com> Date: Thu, 18 Oct 2012 16:30:19 +0800 From: Li Zefan User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120907 Thunderbird/15.0.1 MIME-Version: 1.0 To: Michal Hocko CC: , , , Tejun Heo , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 5/6] memcg: make mem_cgroup_reparent_charges non failing References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-6-git-send-email-mhocko@suse.cz> In-Reply-To: <1350480648-10905-6-git-send-email-mhocko@suse.cz> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.135.68.215] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > static int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event) > @@ -5013,13 +5011,9 @@ free_out: > static int mem_cgroup_pre_destroy(struct cgroup *cont) > { > struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); > - int ret; > > - css_get(&memcg->css); > - ret = mem_cgroup_reparent_charges(memcg); > - css_put(&memcg->css); > - > - return ret; > + mem_cgroup_reparent_charges(memcg); > + return 0; > } > Why don't you make pre_destroy() return void? 
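For illustration only, a minimal userspace sketch of what a void-returning pre_destroy() callback could look like from the caller's side. This is not the kernel code: toy_cgroup, toy_subsys_ops, toy_memcg_pre_destroy and toy_call_pre_destroy are hypothetical names, and the "reparenting" is reduced to moving a counter. The point is only that once the callback cannot fail, the calling loop has no error path left.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy group: charges held locally and by the parent. */
struct toy_cgroup {
	int charges;
	int parent_charges;
};

/* Hypothetical per-subsystem ops; pre_destroy may be NULL and cannot fail. */
struct toy_subsys_ops {
	const char *name;
	void (*pre_destroy)(struct toy_cgroup *cgrp);
};

/* Stand-in for reparenting: hand every local charge to the parent. */
static void toy_memcg_pre_destroy(struct toy_cgroup *cgrp)
{
	cgrp->parent_charges += cgrp->charges;
	cgrp->charges = 0;
}

/*
 * Call every registered pre_destroy handler. There is no return value
 * to check and no early exit; the count is returned only so the sketch
 * can be tested.
 */
static unsigned int toy_call_pre_destroy(const struct toy_subsys_ops *ops,
					 size_t nr_ops,
					 struct toy_cgroup *cgrp)
{
	unsigned int called = 0;
	size_t i;

	assert(cgrp != NULL);
	for (i = 0; i < nr_ops; i++) {
		if (!ops[i].pre_destroy)
			continue;
		ops[i].pre_destroy(cgrp);
		called++;
	}
	return called;
}
```

Compared with an int-returning callback, the caller no longer needs a BUG_ON() or an unwind path around the call.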
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754637Ab2JRImQ (ORCPT ); Thu, 18 Oct 2012 04:42:16 -0400 Received: from cantor2.suse.de ([195.135.220.15]:52661 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752337Ab2JRImP (ORCPT ); Thu, 18 Oct 2012 04:42:15 -0400 Date: Thu, 18 Oct 2012 10:42:12 +0200 From: Michal Hocko To: Li Zefan Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 5/6] memcg: make mem_cgroup_reparent_charges non failing Message-ID: <20121018084212.GA24295@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-6-git-send-email-mhocko@suse.cz> <507FBE1B.4080906@huawei.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <507FBE1B.4080906@huawei.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 18-10-12 16:30:19, Li Zefan wrote: > > static int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event) > > @@ -5013,13 +5011,9 @@ free_out: > > static int mem_cgroup_pre_destroy(struct cgroup *cont) > > { > > struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); > > - int ret; > > > > - css_get(&memcg->css); > > - ret = mem_cgroup_reparent_charges(memcg); > > - css_put(&memcg->css); > > - > > - return ret; > > + mem_cgroup_reparent_charges(memcg); > > + return 0; > > } > > > > Why don't you make pre_destroy() return void? Yes I plan to do that later after I have feedback for this RFC. I am especially interested whether the cgroup core patch is OK, resp. 
has to be reworked to pull pre_destroy outside of cgroup_lock Thanks -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756337Ab2JRV4N (ORCPT ); Thu, 18 Oct 2012 17:56:13 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:62240 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755361Ab2JRV4M (ORCPT ); Thu, 18 Oct 2012 17:56:12 -0400 Date: Thu, 18 Oct 2012 14:56:07 -0700 From: Tejun Heo To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 1/6] memcg: split mem_cgroup_force_empty into reclaiming and reparenting parts Message-ID: <20121018215607.GN13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-2-git-send-email-mhocko@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1350480648-10905-2-git-send-email-mhocko@suse.cz> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Oct 17, 2012 at 03:30:43PM +0200, Michal Hocko wrote: > mem_cgroup_force_empty did two separate things depending on free_all > parameter from the very beginning. It either reclaimed as many pages as > possible and moved the rest to the parent or just moved charges to the > parent. The first variant is used as memory.force_empty callback while > the later is used from the mem_cgroup_pre_destroy. > > The whole games around gotos are far from being nice and there is no > reason to keep those two functions inside one. Let's split them and > also move the responsibility for css reference counting to their callers > to make to code easier. > > This patch doesn't have any functional changes. 
> > Signed-off-by: Michal Hocko Reviewed-by: Tejun Heo Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755595Ab2JRV6N (ORCPT ); Thu, 18 Oct 2012 17:58:13 -0400 Received: from mail-pa0-f46.google.com ([209.85.220.46]:44600 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753268Ab2JRV6M (ORCPT ); Thu, 18 Oct 2012 17:58:12 -0400 Date: Thu, 18 Oct 2012 14:58:07 -0700 From: Tejun Heo To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 2/6] memcg: root_cgroup cannot reach mem_cgroup_move_parent Message-ID: <20121018215807.GO13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-3-git-send-email-mhocko@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1350480648-10905-3-git-send-email-mhocko@suse.cz> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Oct 17, 2012 at 03:30:44PM +0200, Michal Hocko wrote: > The root cgroup cannot be destroyed so we never hit it idown the > mem_cgroup_pre_destroy path and mem_cgroup_force_empty_write shouldn't > even try to do anything if called for the root. > > This means that mem_cgroup_move_parent doesn't have to bother with the > root cgroup and it can assume it can always move charges upwards. > > Signed-off-by: Michal Hocko Reviewed-by: Tejun Heo Thanks. 
-- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756335Ab2JRWRA (ORCPT ); Thu, 18 Oct 2012 18:17:00 -0400 Received: from mail-pa0-f46.google.com ([209.85.220.46]:64121 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752080Ab2JRWQ6 (ORCPT ); Thu, 18 Oct 2012 18:16:58 -0400 Date: Thu, 18 Oct 2012 15:16:54 -0700 From: Tejun Heo To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling Message-ID: <20121018221654.GP13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-4-git-send-email-mhocko@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1350480648-10905-4-git-send-email-mhocko@suse.cz> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, Michal. On Wed, Oct 17, 2012 at 03:30:45PM +0200, Michal Hocko wrote: > mem_cgroup_force_empty_list currently tries to remove all pages from > the given LRU. To prevent from temoporary failures (EBUSY returned by > mem_cgroup_move_parent) it uses a margin to the current LRU pages and > returns the true if there are still some pages left on the list. > > If we consider that mem_cgroup_move_parent fails only when we are racing > with somebody else removing the page (resp. uncharging it) or when the > page is migrated then it is obvious that all those failures are only > temporal and so we can safely retry later. > Let's get rid of the safety margin and make the loop really wait for the > empty LRU. 
The caller should still make sure that all charges have been > removed from the res_counter because mem_cgroup_replace_page_cache might > add a page to the LRU after the check (it doesn't touch res_counter > though). > This catches most of the cases except for shmem which might call > mem_cgroup_replace_page_cache with a page which is not charged and on > the LRU yet but this was the case also without this patch. In order to > fix this we need a guarantee that try_get_mem_cgroup_from_page falls > back to the current mm's cgroup so it needs css_tryget to fail. This > will be fixed up in a later patch because it nees a help from cgroup > core. > > Signed-off-by: Michal Hocko In the sense that "I looked at it and nothing seemed too scary". Reviewed-by: Tejun Heo Some nitpicks below. > /* > - * move charges to its parent. > + * move charges to its parent or the root cgroup if the group > + * has no parent (aka use_hierarchy==0). > + * Although this might fail the failure is always temporary and it > + * signals a race with a page removal/uncharge or migration. In the > + * first case the page will vanish from the LRU on the next attempt > + * and the call should be retried later. > */ > - Maybe convert to proper /** function comment while at it? I also think it would be helpful to actually comment on each possible failure case explaining why the failure condition is temporal. > /* > * Traverse a specified page_cgroup list and try to drop them all. This doesn't > - * reclaim the pages page themselves - it just removes the page_cgroups. > - * Returns true if some page_cgroups were not freed, indicating that the caller > - * must retry this operation. > + * reclaim the pages page themselves - pages are moved to the parent (or root) > + * group. > */ Ditto. 
> -static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > +static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > int node, int zid, enum lru_list lru) > { > struct mem_cgroup_per_zone *mz; > - unsigned long flags, loop; > + unsigned long flags; > struct list_head *list; > struct page *busy; > struct zone *zone; > @@ -3696,11 +3701,8 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > mz = mem_cgroup_zoneinfo(memcg, node, zid); > list = &mz->lruvec.lists[lru]; > > - loop = mz->lru_size[lru]; > - /* give some margin against EBUSY etc...*/ > - loop += 256; > busy = NULL; > - while (loop--) { > + do { > struct page_cgroup *pc; > struct page *page; > > @@ -3726,8 +3728,7 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > cond_resched(); > } else > busy = NULL; > - } > - return !list_empty(list); > + } while (!list_empty(list)); > } Is there anything which can keep failing until migration to another cgroup is complete? I think there is, e.g., if mmap_sem is busy or memcg is co-mounted with other controllers and another controller's ->attach() is blocking on something. If so, busy-looping blindly probably isn't a good idea and we would want at least msleep between retries (e.g. have two lists, throw failed ones to the other and sleep shortly when switching the front and back lists). > + /* > + * This is a safety check because mem_cgroup_force_empty_list > + * could have raced with mem_cgroup_replace_page_cache callers > + * so the lru seemed empty but the page could have been added > + * right after the check. RES_USAGE should be safe as we always > + * charge before adding to the LRU. > + */ > + } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0); Maybe we want to trigger some warning if retry count gets too high? At least for now? Thanks. 
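To illustrate the two-list retry idea and the retry-count warning suggested above, here is a minimal userspace sketch, not kernel code. All names are hypothetical stand-ins: fake_page for a page on the LRU, try_move_to_parent for mem_cgroup_move_parent, count_backoff for a short msleep, and WARN_AFTER_PASSES for whatever threshold would be considered "too high". Pages that fail to move are thrown onto the back list, and the loop backs off before swapping the front and back lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_PAGES 16
#define WARN_AFTER_PASSES 100u	/* hypothetical warning threshold */

/* Hypothetical page: number of move attempts that will still fail. */
struct fake_page {
	int busy_for;
};

static unsigned int backoff_calls;

/* Stand-in for a short sleep between passes; only counts the calls. */
static void count_backoff(void)
{
	backoff_calls++;
}

/* Stand-in for the move to the parent: fails while the page is busy. */
static bool try_move_to_parent(struct fake_page *page)
{
	if (page->busy_for > 0) {
		page->busy_for--;
		return false;
	}
	return true;
}

/*
 * Drain the front list. Failed pages are collected on the back list;
 * when the front list is exhausted, back off once and swap the lists.
 * Returns the number of passes needed to empty both lists.
 */
static unsigned int drain_with_backoff(struct fake_page **pages, size_t n)
{
	struct fake_page *list_a[MAX_PAGES], *list_b[MAX_PAGES];
	struct fake_page **front = list_a, **back = list_b, **tmp;
	size_t nfront = n, nback, i;
	unsigned int passes = 0;

	assert(n <= MAX_PAGES);
	for (i = 0; i < n; i++)
		front[i] = pages[i];

	while (nfront > 0) {
		nback = 0;
		for (i = 0; i < nfront; i++) {
			if (!try_move_to_parent(front[i]))
				back[nback++] = front[i];
		}
		passes++;
		if (passes == WARN_AFTER_PASSES)
			fprintf(stderr, "still retrying after %u passes\n", passes);
		if (nback > 0)
			count_backoff();
		tmp = front;
		front = back;
		back = tmp;
		nfront = nback;
	}
	return passes;
}
```

The sketch backs off once per pass rather than once per failed page, so a single stubborn page costs one sleep per pass instead of a busy loop.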
-- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756168Ab2JRWmu (ORCPT ); Thu, 18 Oct 2012 18:42:50 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:54158 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932504Ab2JRWlw (ORCPT ); Thu, 18 Oct 2012 18:41:52 -0400 Date: Thu, 18 Oct 2012 15:41:48 -0700 From: Tejun Heo To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121018224148.GR13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1350480648-10905-5-git-send-email-mhocko@suse.cz> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, Michal. On Wed, Oct 17, 2012 at 03:30:46PM +0200, Michal Hocko wrote: > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > safely move on and forbit all the callbacks to fail. The last missing > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > that css_tryget fails so no new charges for the memcg can happen. > The callbacks are also called from within cgroup_lock to guarantee that > no new tasks show up. We could theoretically call them outside of the > lock but then we have to move after CGRP_REMOVED flag is set. > > Signed-off-by: Michal Hocko So, the plan is to do something like the following once memcg is ready. http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251 Note that the patch is broken in a couple places but it does show the general direction. 
I'd prefer if patch #3 simply makes pre_destroy() return 0 and drop __DEPRECATED_clear_css_refs from mem_cgroup_subsys. Then, I can pull the branch in and drop all the unnecessary cruft. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932410Ab2JRWqM (ORCPT ); Thu, 18 Oct 2012 18:46:12 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:55679 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755008Ab2JRWqK (ORCPT ); Thu, 18 Oct 2012 18:46:10 -0400 Date: Thu, 18 Oct 2012 15:46:06 -0700 From: Tejun Heo To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121018224606.GS13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121018224148.GR13370@google.com> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 18, 2012 at 03:41:48PM -0700, Tejun Heo wrote: > Note that the patch is broken in a couple places but it does show the > general direction. I'd prefer if patch #3 simply makes pre_destroy() > return 0 and drop __DEPRECATED_clear_css_refs from mem_cgroup_subsys. > Then, I can pull the branch in and drop all the unnecessary cruft. But you need the locking change for further memcg cleanup. To avoid interlocked pulls from both sides, I think it's okay to push this one with the rest of memcg changes. I can do the cleanup on top of this whole series, but please do drop .__DEPRECATED_clear_css_refs from memcg. 
Acked-by: Tejun Heo Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932655Ab2JRWsb (ORCPT ); Thu, 18 Oct 2012 18:48:31 -0400 Received: from mail-pa0-f46.google.com ([209.85.220.46]:63204 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750724Ab2JRWs3 (ORCPT ); Thu, 18 Oct 2012 18:48:29 -0400 Date: Thu, 18 Oct 2012 15:48:25 -0700 From: Tejun Heo To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 6/6] hugetlb: do not fail in hugetlb_cgroup_pre_destroy Message-ID: <20121018224825.GU13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-7-git-send-email-mhocko@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1350480648-10905-7-git-send-email-mhocko@suse.cz> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Oct 17, 2012 at 03:30:48PM +0200, Michal Hocko wrote: > Now that pre_destroy callbacks are called from within cgroup_lock and > the cgroup has been checked to be empty without any children then there > is no other way to fail. > > Signed-off-by: Michal Hocko Reviewed-by: Tejun Heo Thanks. 
-- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757336Ab2JSJd4 (ORCPT ); Fri, 19 Oct 2012 05:33:56 -0400 Received: from szxga01-in.huawei.com ([119.145.14.64]:4372 "EHLO szxga01-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753908Ab2JSJdy (ORCPT ); Fri, 19 Oct 2012 05:33:54 -0400 Message-ID: <50811E5E.1090205@huawei.com> Date: Fri, 19 Oct 2012 17:33:18 +0800 From: Li Zefan User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120907 Thunderbird/15.0.1 MIME-Version: 1.0 To: Michal Hocko CC: , , , Tejun Heo , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> In-Reply-To: <1350480648-10905-5-git-send-email-mhocko@suse.cz> Content-Type: text/plain; charset="GB2312" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.135.68.215] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 2012/10/17 21:30, Michal Hocko wrote: > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > safely move on and forbit all the callbacks to fail. The last missing > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > that css_tryget fails so no new charges for the memcg can happen. > The callbacks are also called from within cgroup_lock to guarantee that > no new tasks show up. I'm afraid this won't work. See commit 3fa59dfbc3b223f02c26593be69ce6fc9a940405 ("cgroup: fix potential deadlock in pre_destroy") > We could theoretically call them outside of the > lock but then we have to move after CGRP_REMOVED flag is set. 
> > Signed-off-by: Michal Hocko > --- > kernel/cgroup.c | 30 +++++++++--------------------- > 1 file changed, 9 insertions(+), 21 deletions(-) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758305Ab2JSLJy (ORCPT ); Fri, 19 Oct 2012 07:09:54 -0400 Received: from cantor2.suse.de ([195.135.220.15]:48947 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754062Ab2JSLJx (ORCPT ); Fri, 19 Oct 2012 07:09:53 -0400 Date: Fri, 19 Oct 2012 13:09:49 +0200 From: Michal Hocko To: Li Zefan Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121019110949.GC799@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <50811E5E.1090205@huawei.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50811E5E.1090205@huawei.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 19-10-12 17:33:18, Li Zefan wrote: > On 2012/10/17 21:30, Michal Hocko wrote: > > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > > safely move on and forbit all the callbacks to fail. The last missing > > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > > that css_tryget fails so no new charges for the memcg can happen. > > > The callbacks are also called from within cgroup_lock to guarantee that > > no new tasks show up. > > I'm afraid this won't work. See commit 3fa59dfbc3b223f02c26593be69ce6fc9a940405 > ("cgroup: fix potential deadlock in pre_destroy") Very good point. Thanks for pointing this out. So we should call pre_destroy at the very end? What about the following? 
Or should we rather drop the lock after check_for_release(parent) or sooner but after CGRP_REMOVED is set? --- >>From 70ea8718aba1c1784b94bfb26aa2307195c07c0b Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 17 Oct 2012 13:42:06 +0200 Subject: [PATCH] cgroups: forbid pre_destroy callback to fail Now that mem_cgroup_pre_destroy callback doesn't fail finally we can safely move on and forbid all the callbacks to fail. The last missing piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so that css_tryget fails so no new charges for the memcg can happen. We cannot, however, move cgroup_call_pre_destroy right after because we cannot call mem_cgroup_pre_destroy with the cgroup_lock held (see 3fa59dfb cgroup: fix potential deadlock in pre_destroy) so we have to move it after the lock is released. Changes since v1 - Li Zefan pointed out that mem_cgroup_pre_destroy cannot be called with cgroup_lock held Signed-off-by: Michal Hocko --- kernel/cgroup.c | 30 +++++++++--------------------- 1 file changed, 9 insertions(+), 21 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index b7d9606..4c6adbd 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -855,7 +855,7 @@ static struct inode *cgroup_new_inode(umode_t mode, struct super_block *sb) * Call subsys's pre_destroy handler. * This is called before css refcnt check.
*/ -static int cgroup_call_pre_destroy(struct cgroup *cgrp) +static void cgroup_call_pre_destroy(struct cgroup *cgrp) { struct cgroup_subsys *ss; int ret = 0; @@ -864,15 +864,8 @@ static int cgroup_call_pre_destroy(struct cgroup *cgrp) if (!ss->pre_destroy) continue; - ret = ss->pre_destroy(cgrp); - if (ret) { - /* ->pre_destroy() failure is being deprecated */ - WARN_ON_ONCE(!ss->__DEPRECATED_clear_css_refs); - break; - } + BUG_ON(ss->pre_destroy(cgrp)); } - - return ret; } static void cgroup_diput(struct dentry *dentry, struct inode *inode) @@ -4161,7 +4154,6 @@ again: mutex_unlock(&cgroup_mutex); return -EBUSY; } - mutex_unlock(&cgroup_mutex); /* * In general, subsystem has no css->refcnt after pre_destroy(). But @@ -4174,17 +4166,6 @@ again: */ set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags); - /* - * Call pre_destroy handlers of subsys. Notify subsystems - * that rmdir() request comes. - */ - ret = cgroup_call_pre_destroy(cgrp); - if (ret) { - clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags); - return ret; - } - - mutex_lock(&cgroup_mutex); parent = cgrp->parent; if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) { clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags); @@ -4206,6 +4187,7 @@ again: return -EINTR; goto again; } + /* NO css_tryget() can success after here. */ finish_wait(&cgroup_rmdir_waitq, &wait); clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags); @@ -4244,6 +4226,12 @@ again: spin_unlock(&cgrp->event_list_lock); mutex_unlock(&cgroup_mutex); + + /* + * Call pre_destroy handlers of subsys. Notify subsystems + * that rmdir() request comes. 
+ */ + cgroup_call_pre_destroy(cgrp); return 0; } -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758227Ab2JSNYn (ORCPT ); Fri, 19 Oct 2012 09:24:43 -0400 Received: from cantor2.suse.de ([195.135.220.15]:53786 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755544Ab2JSNYm (ORCPT ); Fri, 19 Oct 2012 09:24:42 -0400 Date: Fri, 19 Oct 2012 15:24:38 +0200 From: Michal Hocko To: Tejun Heo Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling Message-ID: <20121019132438.GD799@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-4-git-send-email-mhocko@suse.cz> <20121018221654.GP13370@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121018221654.GP13370@google.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 18-10-12 15:16:54, Tejun Heo wrote: > Hello, Michal. > > On Wed, Oct 17, 2012 at 03:30:45PM +0200, Michal Hocko wrote: > > mem_cgroup_force_empty_list currently tries to remove all pages from > > the given LRU. To prevent from temoporary failures (EBUSY returned by > > mem_cgroup_move_parent) it uses a margin to the current LRU pages and > > returns the true if there are still some pages left on the list. > > > > If we consider that mem_cgroup_move_parent fails only when we are racing > > with somebody else removing the page (resp. uncharging it) or when the > > page is migrated then it is obvious that all those failures are only > > temporal and so we can safely retry later. 
> > Let's get rid of the safety margin and make the loop really wait for the > > empty LRU. The caller should still make sure that all charges have been > > removed from the res_counter because mem_cgroup_replace_page_cache might > > add a page to the LRU after the check (it doesn't touch res_counter > > though). > > This catches most of the cases except for shmem which might call > > mem_cgroup_replace_page_cache with a page which is not charged and on > > the LRU yet but this was the case also without this patch. In order to > > fix this we need a guarantee that try_get_mem_cgroup_from_page falls > > back to the current mm's cgroup so it needs css_tryget to fail. This > > will be fixed up in a later patch because it nees a help from cgroup > > core. > > > > Signed-off-by: Michal Hocko > > In the sense that "I looked at it and nothing seemed too scary". > > Reviewed-by: Tejun Heo Thanks > > Some nitpicks below. > > > /* > > - * move charges to its parent. > > + * move charges to its parent or the root cgroup if the group > > + * has no parent (aka use_hierarchy==0). > > + * Although this might fail the failure is always temporary and it > > + * signals a race with a page removal/uncharge or migration. In the > > + * first case the page will vanish from the LRU on the next attempt > > + * and the call should be retried later. > > */ > > - > > Maybe convert to proper /** function comment while at it? these are internal functions and we usually do not create kerneldoc for them. But I can surely change it - it would deserve a bigger clean up then. > I also think it would be helpful to actually comment on each possible > failure case explaining why the failure condition is temporal. What about: " * Although this might fail (get_page_unless_zero, isolate_lru_page or * mem_cgroup_move_account fails) the failure is always temporary and * it signals a race with a page removal/uncharge or migration. 
In the * first case the page is on the way out and it will vanish from the LRU * on the next attempt and the call should be retried later. * Isolation from the LRU fails only if the page has been isolated from * the LRU since we looked at it and that usually means either global * reclaim or migration going on. The page will either get back to the * LRU or vanish. * Finally, mem_cgroup_move_account fails only if the page got uncharged * (!PageCgroupUsed) or moved to a different group. The page will * disappear in the next attempt. " Better? Or should it rather be in the changelog? > > > /* > > * Traverse a specified page_cgroup list and try to drop them all. This doesn't > > - * reclaim the pages page themselves - it just removes the page_cgroups. > > - * Returns true if some page_cgroups were not freed, indicating that the caller > > - * must retry this operation. > > + * reclaim the pages page themselves - pages are moved to the parent (or root) > > + * group. > > */ > > Ditto. > > > -static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > > +static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > > int node, int zid, enum lru_list lru) > > { > > struct mem_cgroup_per_zone *mz; > > - unsigned long flags, loop; > > + unsigned long flags; > > struct list_head *list; > > struct page *busy; > > struct zone *zone; > > @@ -3696,11 +3701,8 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > > mz = mem_cgroup_zoneinfo(memcg, node, zid); > > list = &mz->lruvec.lists[lru]; > > > > - loop = mz->lru_size[lru]; > > - /* give some margin against EBUSY etc...*/ > > - loop += 256; > > busy = NULL; > > - while (loop--) { > > + do { > > struct page_cgroup *pc; > > struct page *page; > > > > @@ -3726,8 +3728,7 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, > > cond_resched(); > > } else > > busy = NULL; > > - } > > - return !list_empty(list); > > + } while (!list_empty(list)); > > } > > Is there anything which can keep
failing until migration to another > cgroup is complete? This is not about migration to another cgroup. Remember there are no tasks in the group so we have no origin for the migration. I was talking about migrate_pages. > I think there is, e.g., if mmap_sem is busy or memcg is co-mounted > with other controllers and another controller's ->attach() is blocking > on something. I am not sure I understand your concern. There are no tasks and we will break out the loop if some appear. And yes we can retry a lot in pathological cases. But this is a group removal path which is not hot. > If so, busy-looping blindly probably isn't a good idea and we would > want at least msleep between retries (e.g. have two lists, throw > failed ones to the other and sleep shortly when switching the front > and back lists). we do cond_resched if we fail. > > + /* > > + * This is a safety check because mem_cgroup_force_empty_list > > + * could have raced with mem_cgroup_replace_page_cache callers > > + * so the lru seemed empty but the page could have been added > > + * right after the check. RES_USAGE should be safe as we always > > + * charge before adding to the LRU. > > + */ > > + } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0); > > Maybe we want to trigger some warning if retry count gets too high? > At least for now? We can but is this really worth it? 
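For illustration only (this is not code from the series): a minimal userspace C sketch of the retry-until-empty loop debated above, including the retry-count warning suggested in the review. All names (fake_page, try_move_to_parent, drain_list, RETRY_WARN_THRESHOLD) are hypothetical. try_move_to_parent merely stands in for a mem_cgroup_move_parent call that fails transiently, and a comment marks the spot where the kernel loop would call cond_resched().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical threshold; the thread does not propose a concrete value. */
#define RETRY_WARN_THRESHOLD 1000UL

/* Stand-in for a page sitting on a memcg LRU list. */
struct fake_page {
    int busy_for;  /* number of move attempts that will still fail */
    bool on_list;  /* true while the page is still on the "LRU" */
};

/* Stand-in for mem_cgroup_move_parent(): fails only temporarily. */
bool try_move_to_parent(struct fake_page *p)
{
    assert(p->on_list);
    if (p->busy_for > 0) {
        p->busy_for--;       /* the race resolves itself over time */
        return false;
    }
    p->on_list = false;      /* page moved, so it leaves the list */
    return true;
}

/* Count the pages that are still on the list. */
size_t pages_left(const struct fake_page *pages, size_t n)
{
    size_t left = 0;

    for (size_t i = 0; i < n; i++)
        if (pages[i].on_list)
            left++;
    return left;
}

/*
 * Loop until the list is empty instead of giving up after a fixed
 * margin. Returns the number of failed attempts and warns once if
 * that number exceeds RETRY_WARN_THRESHOLD.
 */
unsigned long drain_list(struct fake_page *pages, size_t n)
{
    unsigned long retries = 0;
    bool warned = false;

    while (pages_left(pages, n) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (!pages[i].on_list)
                continue;
            if (try_move_to_parent(&pages[i]))
                continue;
            retries++;
            if (!warned && retries > RETRY_WARN_THRESHOLD) {
                fprintf(stderr, "drain_list: %lu retries so far\n", retries);
                warned = true;
            }
            /* the kernel loop would call cond_resched() here */
        }
    }
    return retries;
}
```

With three pages that stay busy for 0, 2 and 5 attempts, drain_list() returns 7, the total number of failed attempts, and leaves the list empty. The loop terminates only because every failure is temporary, which is exactly the assumption the patch relies on.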
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932325Ab2JSNct (ORCPT ); Fri, 19 Oct 2012 09:32:49 -0400 Received: from cantor2.suse.de ([195.135.220.15]:54114 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752732Ab2JSNcq (ORCPT ); Fri, 19 Oct 2012 09:32:46 -0400 Date: Fri, 19 Oct 2012 15:32:45 +0200 From: Michal Hocko To: Tejun Heo Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121019133244.GE799@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121018224148.GR13370@google.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 18-10-12 15:41:48, Tejun Heo wrote: > Hello, Michal. > > On Wed, Oct 17, 2012 at 03:30:46PM +0200, Michal Hocko wrote: > > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > > safely move on and forbit all the callbacks to fail. The last missing > > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > > that css_tryget fails so no new charges for the memcg can happen. > > The callbacks are also called from within cgroup_lock to guarantee that > > no new tasks show up. We could theoretically call them outside of the > > lock but then we have to move after CGRP_REMOVED flag is set. > > > > Signed-off-by: Michal Hocko > > So, the plan is to do something like the following once memcg is > ready. 
> > http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251 > > Note that the patch is broken in a couple places but it does show the > general direction. I'd prefer if patch #3 simply makes pre_destroy() > return 0 and drop __DEPRECATED_clear_css_refs from mem_cgroup_subsys. We can still fail in #3 without this patch because there is no guarantee that no new task gets attached to the group. And I wanted to keep memcg and generic cgroup parts separated. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756285Ab2JSNeS (ORCPT ); Fri, 19 Oct 2012 09:34:18 -0400 Received: from cantor2.suse.de ([195.135.220.15]:54175 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751476Ab2JSNeQ (ORCPT ); Fri, 19 Oct 2012 09:34:16 -0400 Date: Fri, 19 Oct 2012 15:34:15 +0200 From: Michal Hocko To: Tejun Heo Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121019133415.GF799@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121018224606.GS13370@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121018224606.GS13370@google.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 18-10-12 15:46:06, Tejun Heo wrote: > On Thu, Oct 18, 2012 at 03:41:48PM -0700, Tejun Heo wrote: > > Note that the patch is broken in a couple places but it does show the > > general direction. I'd prefer if patch #3 simply makes pre_destroy() > > return 0 and drop __DEPRECATED_clear_css_refs from mem_cgroup_subsys.
> > Then, I can pull the branch in and drop all the unnecessary cruft. > > But you need the locking change for further memcg cleanup. To avoid > interlocked pulls from both sides, I think it's okay to push this one > with the rest of memcg changes. I can do the cleanup on top of this > whole series, but please do drop .__DEPRECATED_clear_css_refs from > memcg. OK I will drop that one. > Acked-by: Tejun Heo Do you still agree with the v2 based on Li's feedback? Thanks -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758815Ab2JSNti (ORCPT ); Fri, 19 Oct 2012 09:49:38 -0400 Received: from cantor2.suse.de ([195.135.220.15]:54768 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758726Ab2JSNth (ORCPT ); Fri, 19 Oct 2012 09:49:37 -0400 Date: Fri, 19 Oct 2012 15:49:34 +0200 From: Michal Hocko To: linux-mm@kvack.org Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo , Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 5/6] memcg: make mem_cgroup_reparent_charges non failing Message-ID: <20121019134934.GG799@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-6-git-send-email-mhocko@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1350480648-10905-6-git-send-email-mhocko@suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This is an updated version of the patch. I have dropped .__DEPRECATED_clear_css_refs in this one as it makes the best sense to me. I didn't add Tejun's Reviewed-by because of this change. Could you recheck, please? 
--- >>From 6c1f2e76e254e7638ad8cc87f319e3492ac80c5b Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 17 Oct 2012 14:15:09 +0200 Subject: [PATCH] memcg: make mem_cgroup_reparent_charges non failing Now that pre_destroy callbacks are called from within cgroup_lock and the cgroup has been checked to be empty without any children then there is no other way to fail. mem_cgroup_pre_destroy doesn't have to take a reference to memcg's css because all css' are marked dead already. mem_cgroup_subsys.__DEPRECATED_clear_css_refs can be dropped as mem_cgroup_pre_destroy cannot fail now. Changes since v1 - drop __DEPRECATED_clear_css_refs Signed-off-by: Michal Hocko --- mm/memcontrol.c | 19 ++++++------------- 1 file changed, 6 insertions(+), 13 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f57ba4c..b4d854e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3738,14 +3738,12 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg, * * Caller is responsible for holding css reference on the memcg. */ -static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg) +static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg) { struct cgroup *cgrp = memcg->css.cgroup; int node, zid; do { - if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children)) - return -EBUSY; /* This is for making all *used* pages to be on LRU. */ lru_add_drain_all(); drain_all_stock_sync(memcg); @@ -3771,8 +3769,6 @@ static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg) * charge before adding to the LRU. 
*/ } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0); - - return 0; } /* @@ -3809,7 +3805,9 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) } lru_add_drain(); - return mem_cgroup_reparent_charges(memcg); + mem_cgroup_reparent_charges(memcg); + + return 0; } static int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event) @@ -5013,13 +5011,9 @@ free_out: static int mem_cgroup_pre_destroy(struct cgroup *cont) { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); - int ret; - css_get(&memcg->css); - ret = mem_cgroup_reparent_charges(memcg); - css_put(&memcg->css); - - return ret; + mem_cgroup_reparent_charges(memcg); + return 0; } static void mem_cgroup_destroy(struct cgroup *cont) @@ -5621,7 +5615,6 @@ struct cgroup_subsys mem_cgroup_subsys = { .base_cftypes = mem_cgroup_files, .early_init = 0, .use_id = 1, - .__DEPRECATED_clear_css_refs = true, }; #ifdef CONFIG_MEMCG_SWAP -- 1.7.10.4 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758360Ab2JSTty (ORCPT ); Fri, 19 Oct 2012 15:49:54 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:33702 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758157Ab2JSTtv (ORCPT ); Fri, 19 Oct 2012 15:49:51 -0400 Date: Fri, 19 Oct 2012 12:49:46 -0700 From: Tejun Heo To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling Message-ID: <20121019194946.GM13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-4-git-send-email-mhocko@suse.cz> <20121018221654.GP13370@google.com> <20121019132438.GD799@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: 
<20121019132438.GD799@dhcp22.suse.cz> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, Michal. On Fri, Oct 19, 2012 at 03:24:38PM +0200, Michal Hocko wrote: > > Maybe convert to proper /** function comment while at it? > > these are internal functions and we usually do not create kerneldoc for > them. But I can surely change it - it would deserve a bigger clean up > then. Yeah, I got into the habit of making function comments kerneldoc if the function is important / scary enough. It's up to you but I think that would be an improvement here. > What about: > " > * Although this might fail (get_page_unless_zero, isolate_lru_page or > * mem_cgroup_move_account fails) the failure is always temporary and > * it signals a race with a page removal/uncharge or migration. In the > * first case the page is on the way out and it will vanish from the LRU > * on the next attempt and the call should be retried later. > * Isolation from the LRU fails only if page has been isolated from > * the LRU since we looked at it and that usually means either global > * reclaim or migration going on. The page will either get back to the > * LRU or vanish. > * Finally mem_cgroup_move_account fails only if the page got uncharged > * (!PageCgroupUsed) or moved to a different group. The page will > * disappear in the next attempt. > " > > Better? Or should it rather be in the changelog? Looks good to me and I personally think it deserves to be a comment. > > Is there anything which can keep failing until migration to another > > cgroup is complete? > > This is not about migration to another cgroup. Remember there are no > tasks in the group so we have no origin for the migration. I was talking > about migrate_pages. > > > I think there is, e.g., if mmap_sem is busy or memcg is co-mounted > > with other controllers and another controller's ->attach() is blocking > > on something.
> > I am not sure I understand your concern. There are no tasks and we will > break out the loop if some appear. And yes we can retry a lot in > pathological cases. But this is a group removal path which is not hot. Ah, okay, I misunderstood that it could wait for task cgroup migration. > > If so, busy-looping blindly probably isn't a good idea and we would > > want at least msleep between retries (e.g. have two lists, throw > > failed ones to the other and sleep shortly when switching the front > > and back lists). > > we do cond_resched if we fail. If it won't ever spin for someone else sleeping, I think it should be fine. > > Maybe we want to trigger some warning if retry count gets too high? > > At least for now? > > We can but is this really worth it? I don't know. My sense of danger here is likely to be way off compared to yours so if you think it's a fairly safe loop, it probably is. It just reminds me of the busy looping we had in freezer. It was correct but actually manifested as a problem - when a system was going down for emergency hibernation from low battery, that busy loop not too rarely drained the small reserve making the machine lose power before completing hibernation. So, it could be that I'm a bit paranoid here. Thanks. 
-- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759288Ab2JSURn (ORCPT ); Fri, 19 Oct 2012 16:17:43 -0400 Received: from mail-pa0-f46.google.com ([209.85.220.46]:53644 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755086Ab2JSURl (ORCPT ); Fri, 19 Oct 2012 16:17:41 -0400 Date: Fri, 19 Oct 2012 13:17:36 -0700 From: Tejun Heo To: Li Zefan Cc: Michal Hocko , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121019201736.GQ13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <50811E5E.1090205@huawei.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50811E5E.1090205@huawei.com> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 19, 2012 at 05:33:18PM +0800, Li Zefan wrote: > On 2012/10/17 21:30, Michal Hocko wrote: > > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > > safely move on and forbit all the callbacks to fail. The last missing > > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > > that css_tryget fails so no new charges for the memcg can happen. > > > The callbacks are also called from within cgroup_lock to guarantee that > > no new tasks show up. > > I'm afraid this won't work. See commit 3fa59dfbc3b223f02c26593be69ce6fc9a940405 > ("cgroup: fix potential deadlock in pre_destroy") Yeah, you're right. Argh... we really should unexport cgroup_lock soon. Thanks. 
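For illustration only (not part of the patch): a single-threaded C model of the ordering the v2 patch aims for, where the group is marked removed while the lock is held and the pre_destroy callback runs only after the lock has been dropped. Every name here (fake_group, fake_tryget, fake_charge, fake_pre_destroy, fake_rmdir) is hypothetical. The lock is modeled by a plain flag, so this sketch shows the ordering constraint only and says nothing about real concurrency.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a cgroup together with its memcg state. */
struct fake_group {
    bool lock_held;  /* models cgroup_mutex being held */
    bool removed;    /* models CGRP_REMOVED being set and css refs cleared */
    int charges;     /* models charges still accounted to the group */
};

/* Models css_tryget(): it fails once the group is marked removed. */
bool fake_tryget(const struct fake_group *g)
{
    return !g->removed;
}

/* A new charge needs a reference, so it fails after removal. */
bool fake_charge(struct fake_group *g)
{
    if (!fake_tryget(g))
        return false;
    g->charges++;
    return true;
}

/*
 * Models a pre_destroy callback that cannot fail. It must not run
 * with the lock held, which the assertion checks.
 */
void fake_pre_destroy(struct fake_group *g)
{
    assert(!g->lock_held);
    g->charges = 0;  /* everything is reparented */
}

/* Models the tail of rmdir: mark removed under the lock, then call out. */
void fake_rmdir(struct fake_group *g)
{
    g->lock_held = true;
    g->removed = true;     /* from here on no new reference can be taken */
    g->lock_held = false;  /* drop the lock before calling the callback */
    fake_pre_destroy(g);
}
```

Because the removed flag is set before the callback runs, no new charge can appear while or after the callback reparents the existing ones, which is why the callback no longer needs a way to report failure in this model.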
-- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1030200Ab2JSUYL (ORCPT ); Fri, 19 Oct 2012 16:24:11 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:55926 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754399Ab2JSUYJ (ORCPT ); Fri, 19 Oct 2012 16:24:09 -0400 Date: Fri, 19 Oct 2012 13:24:05 -0700 From: Tejun Heo To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121019202405.GR13370@google.com> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121019133244.GE799@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121019133244.GE799@dhcp22.suse.cz> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, Michal. On Fri, Oct 19, 2012 at 03:32:45PM +0200, Michal Hocko wrote: > On Thu 18-10-12 15:41:48, Tejun Heo wrote: > > Hello, Michal. > > > > On Wed, Oct 17, 2012 at 03:30:46PM +0200, Michal Hocko wrote: > > > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > > > safely move on and forbit all the callbacks to fail. The last missing > > > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > > > that css_tryget fails so no new charges for the memcg can happen. > > > The callbacks are also called from within cgroup_lock to guarantee that > > > no new tasks show up. We could theoretically call them outside of the > > > lock but then we have to move after CGRP_REMOVED flag is set. 
> > > > > > Signed-off-by: Michal Hocko > > > > So, the plan is to do something like the following once memcg is > > ready. > > > > http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251 > > > > Note that the patch is broken in a couple places but it does show the > > general direction. I'd prefer if patch #3 simply makes pre_destroy() > > return 0 and drop __DEPRECATED_clear_css_refs from mem_cgroup_subsys. > > We can still fail in #3 without this patch because there is no > guarantee that no new task gets attached to the group. And I wanted to keep > memcg and generic cgroup parts separated. Yes, but all other controllers are broken that way too and the worst thing which will happen is triggering WARN_ON_ONCE(). Let's note the failure in the commit and remove __DEPRECATED_clear_css_refs in the previous patch. Then, I can pull from you, clean up pre_destroy mess and then you can pull back for further cleanups. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752775Ab2JVKa1 (ORCPT ); Mon, 22 Oct 2012 06:30:27 -0400 Received: from cantor2.suse.de ([195.135.220.15]:34198 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751863Ab2JVKaZ (ORCPT ); Mon, 22 Oct 2012 06:30:25 -0400 Date: Mon, 22 Oct 2012 12:30:21 +0200 From: Michal Hocko To: Tejun Heo Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121022103021.GA6367@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121019133244.GE799@dhcp22.suse.cz> <20121019202405.GR13370@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To:
<20121019202405.GR13370@google.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 19-10-12 13:24:05, Tejun Heo wrote: > Hello, Michal. > > On Fri, Oct 19, 2012 at 03:32:45PM +0200, Michal Hocko wrote: > > On Thu 18-10-12 15:41:48, Tejun Heo wrote: > > > Hello, Michal. > > > > > > On Wed, Oct 17, 2012 at 03:30:46PM +0200, Michal Hocko wrote: > > > > Now that mem_cgroup_pre_destroy callback doesn't fail finally we can > > > > safely move on and forbit all the callbacks to fail. The last missing > > > > piece is moving cgroup_call_pre_destroy after cgroup_clear_css_refs so > > > > that css_tryget fails so no new charges for the memcg can happen. > > > > The callbacks are also called from within cgroup_lock to guarantee that > > > > no new tasks show up. We could theoretically call them outside of the > > > > lock but then we have to move after CGRP_REMOVED flag is set. > > > > > > > > Signed-off-by: Michal Hocko > > > > > > So, the plan is to do something like the following once memcg is > > > ready. > > > > > > http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251 > > > > > > Note that the patch is broken in a couple places but it does show the > > > general direction. I'd prefer if patch #3 simply makes pre_destroy() > > > return 0 and drop __DEPRECATED_clear_css_refs from mem_cgroup_subsys. > > > > We can still fail inn #3 without this patch becasuse there are is no > > guarantee that a new task is attached to the group. And I wanted to keep > > memcg and generic cgroup parts separated. > > Yes, but all other controllers are broken that way too It's just hugetlb and memcg that have pre_destroy. > and the worst thing which will hapen is triggering WARN_ON_ONCE(). The patch does BUG_ON(ss->pre_destroy(cgrp)). I am not sure WARN_ON_ONCE is appropriate here because we would like to have it at least per controller warning. 
I do not see any reason to make this more complicated but I am open to suggestions. > Let's note the failure in the commit and remove > __DEPRECATED_clear_css_refs in the previous patch. Then, I can pull > from you, clean up pre_destroy mess and then you can pull back for > further cleanups. Well this will get complicated as there are dependencies between memcg parts (based on Andrew's tree) and your tree. My tree is not pullable as all the patches go via Andrew. I am not sure how to get out of this. There is only one cgroup patch so what about pushing all of this via Andrew and doing the follow-up cleanups once they get merged? We are not in a hurry, are we? Anyway does it really make sense to drop __DEPRECATED_clear_css_refs already in the previous patch when it is _not_ guaranteed that pre_destroy succeeds? -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935298Ab2JXTZt (ORCPT ); Wed, 24 Oct 2012 15:25:49 -0400 Received: from mail-ie0-f174.google.com ([209.85.223.174]:45041 "EHLO mail-ie0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933046Ab2JXTZr (ORCPT ); Wed, 24 Oct 2012 15:25:47 -0400 Date: Wed, 24 Oct 2012 12:25:35 -0700 From: Tejun Heo To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121024192535.GG12182@atj.dyndns.org> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121019133244.GE799@dhcp22.suse.cz> <20121019202405.GR13370@google.com> <20121022103021.GA6367@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121022103021.GA6367@dhcp22.suse.cz> User-Agent:
Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, Michal. On Mon, Oct 22, 2012 at 12:30:21PM +0200, Michal Hocko wrote: > > > We can still fail in #3 without this patch because there is no > > > guarantee that a new task is not attached to the group. And I wanted to keep > > > memcg and generic cgroup parts separated. > > > > Yes, but all other controllers are broken that way too > > It's just hugetlb and memcg that have pre_destroy. > > > and the worst thing which will happen is triggering WARN_ON_ONCE(). > > The patch does BUG_ON(ss->pre_destroy(cgrp)). I am not sure WARN_ON_ONCE is > appropriate here because we would like to have it at least per > controller warning. I do not see any reason to make this more > complicated but I am open to suggestions. Once it's dropped from memcg, the next patch can update cgroup core accordingly and the bug will exist for a single commit and the failure mode would be triggering of WARN_ON_ONCE(). Seems pretty simple to me. > > Let's note the failure in the commit and remove > > __DEPRECATED_clear_css_refs in the previous patch. Then, I can pull > > from you, clean up pre_destroy mess and then you can pull back for > > further cleanups. > > Well this will get complicated as there are dependencies between memcg > parts (based on Andrew's tree) and your tree. My tree is not pullable as > all the patches go via Andrew. I am not sure how to get out of this. > There is only one cgroup patch so what about pushing all of this via > Andrew and do the follow up cleanups once they get merged? We are not in > a hurry, are we? Let's create a cgroup branch and build things there. I don't think cgroup changes are gonna be a single patch and expect to see at least some bug fixes afterwards and don't wanna keep them floating separate from other cgroup changes. mm being based on top of -next, that should work, right?
> Anyway does it really make sense to drop __DEPRECATED_clear_css_refs > already in the previous patch when it is _not_ guaranteed that > pre_destroy succeeds? It makes things simpler here by decoupling the memcg change from the core cgroup changes and the introduced bug isn't too easy to trigger and even when triggered the failure mode isn't critical. It's not gonna break normal common operations or bisection. As long as the issue is clearly documented, I think it should be fine. Just note that this opens up a race window from the deficient cgroup API and the following commits will address it. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933780Ab2JYOi3 (ORCPT ); Thu, 25 Oct 2012 10:38:29 -0400 Received: from cantor2.suse.de ([195.135.220.15]:47027 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932947Ab2JYOi1 (ORCPT ); Thu, 25 Oct 2012 10:38:27 -0400 Date: Thu, 25 Oct 2012 16:37:56 +0200 From: Michal Hocko To: Tejun Heo Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121025143756.GI11105@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121019133244.GE799@dhcp22.suse.cz> <20121019202405.GR13370@google.com> <20121022103021.GA6367@dhcp22.suse.cz> <20121024192535.GG12182@atj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121024192535.GG12182@atj.dyndns.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 24-10-12 12:25:35, Tejun Heo wrote: > Hello, Michal.
> > On Mon, Oct 22, 2012 at 12:30:21PM +0200, Michal Hocko wrote: > > > > We can still fail in #3 without this patch because there is no > > > > guarantee that a new task is not attached to the group. And I wanted to keep > > > > memcg and generic cgroup parts separated. > > > > > > Yes, but all other controllers are broken that way too > > > > It's just hugetlb and memcg that have pre_destroy. > > > > > and the worst thing which will happen is triggering WARN_ON_ONCE(). > > > > The patch does BUG_ON(ss->pre_destroy(cgrp)). I am not sure WARN_ON_ONCE is > > appropriate here because we would like to have it at least per > > controller warning. I do not see any reason to make this more > > complicated but I am open to suggestions. > > Once it's dropped from memcg, the next patch can update cgroup core > accordingly and the bug will exist for a single commit and the failure > mode would be triggering of WARN_ON_ONCE(). Seems pretty simple to > me. I am not sure I understand you here. So are you suggesting s/BUG_ON/WARN_ON_ONCE/ in this patch? It is true that this will not break bisectability but it is still not correct (strictly speaking, because any load that can race group removal with the addition of new tasks would hit BUG/WARN and we will remove a group with a task inside). The patchset as posted makes sure that none of the stages adds a regression and I would like to stick with that as much as possible if it doesn't cause too much of a hassle. > > > Let's note the failure in the commit and remove > > > __DEPRECATED_clear_css_refs in the previous patch. Then, I can pull > > > from you, clean up pre_destroy mess and then you can pull back for > > > further cleanups. > > > > Well this will get complicated as there are dependencies between memcg > > parts (based on Andrew's tree) and your tree. My tree is not pullable as > > all the patches go via Andrew. I am not sure how to get out of this.
> > There is only one cgroup patch so what about pushing all of this via > > Andrew and do the follow up cleanups once they get merged? We are not in > > a hurry, are we? > > Let's create a cgroup branch and build things there. I don't think > cgroup changes are gonna be a single patch and expect to see at least > some bug fixes afterwards and don't wanna keep them floating separate > from other cgroup changes. > mm being based on top of -next, that should work, right? Well, a tree based on -next is, ehm, impractical. I can create a branch on top of my -mm git branch (where I merge your cgroup common changes) for development and then when we are ready we can send it as a series and push it via Andrew. Would that work for you? Or we can push the core part via Andrew, wait for the merge and work on the follow up cleanups later? It is not like the follow up part is really urgent, is it? I would just like the memcg part settled first because this can potentially conflict with other memcg work. [...]
-- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S936040Ab2JYRm1 (ORCPT ); Thu, 25 Oct 2012 13:42:27 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:48734 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757050Ab2JYRmZ (ORCPT ); Thu, 25 Oct 2012 13:42:25 -0400 Date: Thu, 25 Oct 2012 10:42:20 -0700 From: Tejun Heo To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121025174220.GJ11442@htj.dyndns.org> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121019133244.GE799@dhcp22.suse.cz> <20121019202405.GR13370@google.com> <20121022103021.GA6367@dhcp22.suse.cz> <20121024192535.GG12182@atj.dyndns.org> <20121025143756.GI11105@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121025143756.GI11105@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hey, Michal. On Thu, Oct 25, 2012 at 04:37:56PM +0200, Michal Hocko wrote: > I am not sure I understand you here. So are you suggesting > s/BUG_ON/WARN_ON_ONCE/ in this patch? Oh, no, I meant that we can do up to patch 3 of this series and then follow up with proper cgroup core update and then stack further memcg cleanups on top. > > Let's create a cgroup branch and build things there. I don't think > > cgroup changes are gonna be a single patch and expect to see at least > > some bug fixes afterwards and don't wanna keep them floating separate > > from other cgroup changes.
> > > mm being based on top of -next, that should work, right? > > Well, a tree based on -next is, ehm, impractical. I can create a branch on > top of my -mm git branch (where I merge your cgroup common changes) for > development and then when we are ready we can send it as a series and > push it via Andrew. Would that work for you? > Or we can push the core part via Andrew, wait for the merge and work on > the follow up cleanups later? > It is not like the follow up part is really urgent, is it? I would > just like the memcg part settled first because this can potentially > conflict with other memcg work. Argh... can we pretty *please* just do a plain git branch? I don't care where it is but I want to be able to pull it into cgroup core and yes I do wanna make this happen in this devel cycle. We've been sitting on it far too long waiting for memcg. Thanks. -- tejun From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2992623Ab2JYSs4 (ORCPT ); Thu, 25 Oct 2012 14:48:56 -0400 Received: from cantor2.suse.de ([195.135.220.15]:58900 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2992597Ab2JYSsl (ORCPT ); Thu, 25 Oct 2012 14:48:41 -0400 Date: Thu, 25 Oct 2012 20:48:34 +0200 From: Michal Hocko To: Tejun Heo Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Li Zefan , Johannes Weiner , KAMEZAWA Hiroyuki , Balbir Singh Subject: Re: [PATCH 4/6] cgroups: forbid pre_destroy callback to fail Message-ID: <20121025184834.GB20618@dhcp22.suse.cz> References: <1350480648-10905-1-git-send-email-mhocko@suse.cz> <1350480648-10905-5-git-send-email-mhocko@suse.cz> <20121018224148.GR13370@google.com> <20121019133244.GE799@dhcp22.suse.cz> <20121019202405.GR13370@google.com> <20121022103021.GA6367@dhcp22.suse.cz> <20121024192535.GG12182@atj.dyndns.org> <20121025143756.GI11105@dhcp22.suse.cz> <20121025174220.GJ11442@htj.dyndns.org> MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121025174220.GJ11442@htj.dyndns.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 25-10-12 10:42:20, Tejun Heo wrote: > Hey, Michal. > > On Thu, Oct 25, 2012 at 04:37:56PM +0200, Michal Hocko wrote: > > I am not sure I understand you here. So are you suggesting > > s/BUG_ON/WARN_ON_ONCE/ in this patch? > > Oh, no, I meant that we can do up to patch 3 of this series and then > follow up with proper cgroup core update and then stack further > memcg cleanups on top. I thought the later cleanups would be on top of the series. > > > Let's create a cgroup branch and build things there. I don't think > > > cgroup changes are gonna be a single patch and expect to see at least > > > some bug fixes afterwards and don't wanna keep them floating separate > > > from other cgroup changes. > > > > > mm being based on top of -next, that should work, right? > > > > Well, a tree based on -next is, ehm, impractical. I can create a branch on > > top of my -mm git branch (where I merge your cgroup common changes) for > > development and then when we are ready we can send it as a series and > > push it via Andrew. Would that work for you? > > Or we can push the core part via Andrew, wait for the merge and work on > > the follow up cleanups later? > > It is not like the follow up part is really urgent, is it? I would > > just like the memcg part settled first because this can potentially > > conflict with other memcg work. > > Argh... can we pretty *please* just do a plain git branch? I don't > care where it is but I want to be able to pull it into cgroup core and Hohumm, I have tried to apply the series on top of Linus' 3.6 and there were no conflicts so I can create a branch which you can pull into your cgroup branch (which I can then merge into -mm git tree).
This would however mean that those patches wouldn't fly through Andrew's tree. Is this really what we want and what does it give us? > yes I do wanna make this happen in this devel cycle. We've been > sitting on it far too long waiting for memcg. I can surely imagine that (for the memcg part) but it needs a thorough review. -- Michal Hocko SUSE Labs