From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sha Zhengju Subject: [PATCH V2 0/2] Provide more precise dump info for memcg-oom Date: Wed, 7 Nov 2012 16:40:02 +0800 Message-ID: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:to:cc:subject:date:message-id:x-mailer; bh=cD4tm8bBNy7Z/ZWzHQ2XWLhQJhP9wtPHv7Rvp7ZRFZc=; b=hJP8LTlbEmHcmMwDsw1AoNFvyRfCLWZePtHixpRHQ1+5rIWfF0LStfQ5SguHUmZpi2 laybpwjbe3rJ0LY7wL1ZRPeVZ6Re33muHP3Q1Dt7/Ow3QqyFXlnWu1RILJkMUsSs44Dv uxG/ja6HC1I22S/b/b0jFIRQrrH5k5up9VJnQbQeFLYwcJzWN7HbA8tguL1p9TVAWFRA Y6eDmAM9IQtYz3KwGC0pCSyrRhJ2LBxJkujGk5aW51QZfmm6Ld5tWLTcw23AGyMS9drm KMkOhcrnrrPl+z/z0Q3FhB77BCBE2UfcFrt8/ttNM6IzFvoaKpwtdgd5vPxKWVPfu65E UmSw== Sender: owner-linux-mm@kvack.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, rientjes@google.com Cc: linux-kernel@vger.kernel.org, Sha Zhengju From: Sha Zhengju When memcg oom is happening the current memcg related dump information is limited for debugging. The patches provide more detailed memcg page statistics and also take hierarchy into consideration. The previous primitive version can be reached here: https://lkml.org/lkml/2012/7/30/179. Change log: 1. some modification towards hierarchy 2. rework dump_tasks 3. rebased on Michal's mm tree since-3.6 Any comments are welcomed. : ) Sha Zhengju (2): memcg-oom-provide-more-precise-dump-info-while-memcg.patch oom-rework-dump_tasks-to-optimize-memcg-oom-situatio.patch include/linux/memcontrol.h | 7 ++++ include/linux/oom.h | 2 + mm/memcontrol.c | 85 +++++++++++++++++++++++++++++++++++++++----- mm/oom_kill.c | 61 +++++++++++++++++++------------ 4 files changed, 122 insertions(+), 33 deletions(-) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sha Zhengju Subject: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Date: Wed, 7 Nov 2012 16:41:36 +0800 Message-ID: <1352277696-21724-1-git-send-email-handai.szj@taobao.com> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:to:cc:subject:date:message-id:x-mailer:in-reply-to:references; bh=i+dAZMuwje0qWQG21N5eNvlEpx2VIIVo6MheegpXQMU=; b=HkTveTqawmV4VcKXGHhgDCb6LkC6w6NQqeb0ibP8yrmYfxFKdNeSd1bUZxAf0AEK3S WA2YuE6LMD9R86RpddNn0VhkDmUOBLpPs2DC77edWkopJdWr9NlMqvqGIWh2rnNqWT62 Ss/Si8VTdQ0JausJO6nJC8gPXd03BtdYHwS9EfWQ/tf4rOrXdoIQ4kQUC7QSSuOP+W8a TyNAABXcsj00rb8HLAzyD0tEZAkfV29gF7sWxjjP0FYDrZSuRhadt+yWfG+HgJlJ4lZv 2uHBpgqjcMWdIOilPfJxp0XC2jxnugpf50Edf0+yzZaYwkKovENYP0qe8m0LtKIkDgmZ f5nA== In-Reply-To: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> Sender: owner-linux-mm@kvack.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, rientjes@google.com Cc: linux-kernel@vger.kernel.org, Sha Zhengju From: Sha Zhengju Current, when a memcg oom is happening the oom dump messages is still global state and provides few useful info for users. This patch prints more pointed memcg page statistics for memcg-oom. Signed-off-by: Sha Zhengju Cc: Michal Hocko Cc: KAMEZAWA Hiroyuki Cc: David Rientjes Cc: Andrew Morton --- mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++------- mm/oom_kill.c | 6 +++- 2 files changed, 66 insertions(+), 11 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0eab7d5..2df5e72 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = { "pgmajfault", }; +static const char * const mem_cgroup_lru_names[] = { + "inactive_anon", + "active_anon", + "inactive_file", + "active_file", + "unevictable", +}; + /* * Per memcg event counter is incremented at every pagein/pageout. With THP, * it will be incremated by the number of pages. This counter is used for @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, spin_unlock_irqrestore(&memcg->move_lock, *flags); } +#define K(x) ((x) << (PAGE_SHIFT-10)) +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) +{ + struct mem_cgroup *mi; + unsigned int i; + + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) { + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) + continue; + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], + K(mem_cgroup_read_stat(memcg, i))); + } + + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], + mem_cgroup_read_events(memcg, i)); + + for (i = 0; i < NR_LRU_LISTS; i++) + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); + } else { + + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { + long long val = 0; + + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) + continue; + for_each_mem_cgroup_tree(mi, memcg) + val += mem_cgroup_read_stat(mi, i); + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); + } + + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) { + unsigned long long val = 0; + + for_each_mem_cgroup_tree(mi, memcg) + val += mem_cgroup_read_events(mi, i); + printk(KERN_CONT "%s:%llu ", + mem_cgroup_events_names[i], val); + } + + for (i = 0; i < NR_LRU_LISTS; i++) { + unsigned long long val = 0; + + for_each_mem_cgroup_tree(mi, memcg) + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); + } + } + printk(KERN_CONT "\n"); +} /** - * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode. * @memcg: The memory cgroup that went over limit * @p: Task that is going to be killed * @@ -1569,6 +1628,8 @@ done: res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10, res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10, res_counter_read_u64(&memcg->kmem, RES_FAILCNT)); + + mem_cgroup_print_oom_stat(memcg); } /* @@ -5195,14 +5256,6 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, } #endif /* CONFIG_NUMA */ -static const char * const mem_cgroup_lru_names[] = { - "inactive_anon", - "active_anon", - "inactive_file", - "active_file", - "unevictable", -}; - static inline void mem_cgroup_lru_names_not_uptodate(void) { BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 7e9e911..4b8a6dd 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -421,8 +421,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, cpuset_print_task_mems_allowed(current); task_unlock(current); dump_stack(); - mem_cgroup_print_oom_info(memcg, p); - show_mem(SHOW_MEM_FILTER_NODES); + if (memcg) + mem_cgroup_print_oom_info(memcg, p); + else + show_mem(SHOW_MEM_FILTER_NODES); if (sysctl_oom_dump_tasks) dump_tasks(memcg, nodemask); } -- 1.7.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sha Zhengju Subject: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation Date: Wed, 7 Nov 2012 16:41:59 +0800 Message-ID: <1352277719-21760-1-git-send-email-handai.szj@taobao.com> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:to:cc:subject:date:message-id:x-mailer:in-reply-to:references; bh=WvymLuwk+RsijU65KLTRjS/0oeYizuTo+2TARwQx3J0=; b=w/jxvFrs8t5SaObu3msU9df80XpZrkwQ7oUCQh1m+fR4GVTO1QRLiEK0wJHg1quIvt FtZmnu449HGmgFyIjoM8gEBvzRv2cfdgvwqXZ9P+Yp09S9yJS0LmgPHnfke8qpOj7C4o ZHfQ6USRaLaWUoo2JmtimOJ6Bh9s1gC1spV5949sTLAqqUEjpbnVJ4izq4ckljS4rby4 EeBuMC7fJtoMavltKwaH8DnzDdi//LYAWe6Ch30dSAnqKT3KYWZsDVAXRb9fxQIdp7Cr F292GIuja1OnRiuJe7RdXzqu/eP3W3lFOZm3wg8M74npDga4cOpS2/Apa+0qI2TilG2x cgJA== In-Reply-To: <1352277602-21687-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, mhocko-AlSwsSmVLrQ@public.gmane.org, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sha Zhengju From: Sha Zhengju If memcg oom happening, don't scan all system tasks to dump memory state of eligible tasks, instead we iterates only over the process attached to the oom memcg and avoid the rcu lock. Signed-off-by: Sha Zhengju Cc: Michal Hocko Cc: KAMEZAWA Hiroyuki Cc: David Rientjes Cc: Andrew Morton --- include/linux/memcontrol.h | 7 +++++ include/linux/oom.h | 2 + mm/memcontrol.c | 14 +++++++++++ mm/oom_kill.c | 55 ++++++++++++++++++++++++++----------------- 4 files changed, 56 insertions(+), 22 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index c91e3c1..4322ca8 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -122,6 +122,8 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list); void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +extern void dump_tasks_memcg(const struct mem_cgroup *memcg, + const nodemask_t *nodemask); extern void mem_cgroup_replace_page_cache(struct page *oldpage, struct page *newpage); @@ -337,6 +339,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } +static inline void +dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) +{ +} + static inline void mem_cgroup_begin_update_page_stat(struct page *page, bool *locked, unsigned long *flags) { diff --git a/include/linux/oom.h b/include/linux/oom.h index 20b5c46..9ba3344 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -57,6 +57,8 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, unsigned long totalpages, const nodemask_t *nodemask, bool force_kill); +extern inline void dump_per_task(struct task_struct *p, + const nodemask_t *nodemask); extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *mask, bool force_kill); extern int register_oom_notifier(struct notifier_block *nb); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2df5e72..fe648f8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1665,6 +1665,20 @@ static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) return min(limit, memsw); } +void dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) +{ + struct cgroup_iter it; + struct task_struct *task; + struct cgroup *cgroup = memcg->css.cgroup; + + cgroup_iter_start(cgroup, &it); + while ((task = cgroup_iter_next(cgroup, &it))) { + dump_per_task(task, nodemask); + } + + cgroup_iter_end(cgroup, &it); +} + static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, int order) { diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 4b8a6dd..aaf6237 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -367,6 +367,32 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, return chosen; } +inline void dump_per_task(struct task_struct *p, const nodemask_t *nodemask) +{ + struct task_struct *task; + + if (oom_unkillable_task(p, NULL, nodemask)) + return; + + task = find_lock_task_mm(p); + if (!task) { + /* + * This is a kthread or all of p's threads have already + * detached their mm's. There's no need to report + * them; they can't be oom killed anyway. + */ + return; + } + + pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", + task->pid, from_kuid(&init_user_ns, task_uid(task)), + task->tgid, task->mm->total_vm, get_mm_rss(task->mm), + task->mm->nr_ptes, + get_mm_counter(task->mm, MM_SWAPENTS), + task->signal->oom_score_adj, task->comm); + task_unlock(task); +} + /** * dump_tasks - dump current memory state of all system tasks * @memcg: current's memory controller, if constrained @@ -381,32 +407,17 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask) { struct task_struct *p; - struct task_struct *task; pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n"); - rcu_read_lock(); - for_each_process(p) { - if (oom_unkillable_task(p, memcg, nodemask)) - continue; - - task = find_lock_task_mm(p); - if (!task) { - /* - * This is a kthread or all of p's threads have already - * detached their mm's. There's no need to report - * them; they can't be oom killed anyway. - */ - continue; - } - pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", - task->pid, from_kuid(&init_user_ns, task_uid(task)), - task->tgid, task->mm->total_vm, get_mm_rss(task->mm), - task->mm->nr_ptes, - get_mm_counter(task->mm, MM_SWAPENTS), - task->signal->oom_score_adj, task->comm); - task_unlock(task); + if (memcg) { + dump_tasks_memcg(memcg, nodemask); + return; } + + rcu_read_lock(); + for_each_process(p) + dump_per_task(p, nodemask); rcu_read_unlock(); } -- 1.7.6.1 From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Rientjes Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Date: Wed, 7 Nov 2012 10:02:09 -0800 (PST) Message-ID: References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=date:from:x-x-sender:to:cc:subject:in-reply-to:message-id :references:user-agent:mime-version:content-type; bh=pCjKb/aoPJ5w3G3CwH4lYvXWUIkGaO2CxgNxbFPN4MY=; b=mjMRnYhfpkidkpIK4e2plnPujBQ1gTsrwJ4o9+k52KeRjS/B3Hy49JOcvPWUipMcm5 igr9w6AEXvH1Ju9ceKw83CmLXEqbF+YBQzlHMidBakL26U9pSlkGz+erbzlf/3WwRiad wQ3acL/ncB8GWOf1v+PlOVLXi/iUXw2WxVkVB4ha11EuFKW6jBoyZ+9Ah6H/Aj0CTAG0 AD0DJ91Ab4aiXBoNj4EOdz0Gkpm/yrIljb6iACuPndAJvTAVSXXGEs/8YGnmmbJ0+vxM Fuczzd3FGZcPGggrCTkKBnEFkGCAFGtVBhdKwjnUP/2LJe2YL/Ryk8NsaYXrwIC96zTz PttA== In-Reply-To: <1352277696-21724-1-git-send-email-handai.szj@taobao.com> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: TEXT/PLAIN; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Sha Zhengju Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, Sha Zhengju On Wed, 7 Nov 2012, Sha Zhengju wrote: > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 0eab7d5..2df5e72 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = { > "pgmajfault", > }; > > +static const char * const mem_cgroup_lru_names[] = { > + "inactive_anon", > + "active_anon", > + "inactive_file", > + "active_file", > + "unevictable", > +}; > + > /* > * Per memcg event counter is incremented at every pagein/pageout. With THP, > * it will be incremated by the number of pages. This counter is used for > @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, > spin_unlock_irqrestore(&memcg->move_lock, *flags); > } > > +#define K(x) ((x) << (PAGE_SHIFT-10)) > +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *mi; > + unsigned int i; > + > + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) { > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], This printk isn't continuing any previous printk, so using KERN_CONT here will require a short header to be printed first ("Memcg: "?) with KERN_INFO before the iterations. > + K(mem_cgroup_read_stat(memcg, i))); > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) > + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], > + mem_cgroup_read_events(memcg, i)); > + > + for (i = 0; i < NR_LRU_LISTS; i++) > + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], > + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); > + } else { > + Spurious newline. Eek, is there really no way to avoid this if-conditional and just use for_each_mem_cgroup_tree() for everything and use mem_cgroup_iter_break(memcg, iter); break; for !memcg->use_hierarchy? > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + long long val = 0; > + > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_stat(mi, i); > + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_events(mi, i); > + printk(KERN_CONT "%s:%llu ", > + mem_cgroup_events_names[i], val); > + } > + > + for (i = 0; i < NR_LRU_LISTS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); > + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); > + } > + } > + printk(KERN_CONT "\n"); > +} > /** > - * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode. > * @memcg: The memory cgroup that went over limit > * @p: Task that is going to be killed > * > @@ -1569,6 +1628,8 @@ done: > res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10, > res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10, > res_counter_read_u64(&memcg->kmem, RES_FAILCNT)); > + > + mem_cgroup_print_oom_stat(memcg); I think this should be folded into mem_cgroup_print_oom_info(), I don't see a need for a new function. > } > > /* > @@ -5195,14 +5256,6 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, > } > #endif /* CONFIG_NUMA */ > > -static const char * const mem_cgroup_lru_names[] = { > - "inactive_anon", > - "active_anon", > - "inactive_file", > - "active_file", > - "unevictable", > -}; > - > static inline void mem_cgroup_lru_names_not_uptodate(void) > { > BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS); > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 7e9e911..4b8a6dd 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -421,8 +421,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, > cpuset_print_task_mems_allowed(current); > task_unlock(current); > dump_stack(); > - mem_cgroup_print_oom_info(memcg, p); > - show_mem(SHOW_MEM_FILTER_NODES); > + if (memcg) > + mem_cgroup_print_oom_info(memcg, p); mem_cgroup_print_oom_info() already returns immediately for !memcg, so I'm not sure why this change is made. > + else > + show_mem(SHOW_MEM_FILTER_NODES); Well that's disappointing if memcg == root_mem_cgroup, we'd probably like to know the global memory state to determine what the problem is. > if (sysctl_oom_dump_tasks) > dump_tasks(memcg, nodemask); > } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Rientjes Subject: Re: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation Date: Wed, 7 Nov 2012 10:07:14 -0800 (PST) Message-ID: References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277719-21760-1-git-send-email-handai.szj@taobao.com> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=date:from:x-x-sender:to:cc:subject:in-reply-to:message-id :references:user-agent:mime-version:content-type; bh=AM3Xco+uRTT0dg3XvUhBaIs7HgnHb/W/yxBIYo0LDcE=; b=T0S4fxPeUEVnzE8JQe1vobe80RDkez+aDZK4gqgS0jOwcpsjj7BFFxzDP6enl1vXqV Hzyyt6SI1hPyxkufw8sZun2bUsRC5SnbxguxiNDUC0GweUQ8rqlGAUqxmWfIi4KMerRX Lj7jfLj4DUQj1OPMim5gzpB+XlYaWeRu6I4I+sa4v5nBZXG9RcXDw3OGb1VxesUwiOdb Fie2dODffoPkOkXkUVGO9gorhIRpzDROtsZPh1Bj/ktGg5zZ6Qoc/Yqb7cKRgUdugly4 Xft4Xa/qp7cy1/nL2UdYInaf1bTwxw48CuSRgN3+DQ6lmUEmjkpUwzB+Ug6XpoC5bgME 5GHQ== In-Reply-To: <1352277719-21760-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: TEXT/PLAIN; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Sha Zhengju Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, mhocko-AlSwsSmVLrQ@public.gmane.org, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sha Zhengju On Wed, 7 Nov 2012, Sha Zhengju wrote: > From: Sha Zhengju > > If memcg oom happening, don't scan all system tasks to dump memory state of > eligible tasks, instead we iterates only over the process attached to the oom > memcg and avoid the rcu lock. > Avoiding the rcu lock isn't actually that impressive here, the cgroup iterator will use it's own lock for that memcg. > Signed-off-by: Sha Zhengju > Cc: Michal Hocko > Cc: KAMEZAWA Hiroyuki > Cc: David Rientjes > Cc: Andrew Morton > --- > include/linux/memcontrol.h | 7 +++++ > include/linux/oom.h | 2 + > mm/memcontrol.c | 14 +++++++++++ > mm/oom_kill.c | 55 ++++++++++++++++++++++++++----------------- > 4 files changed, 56 insertions(+), 22 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index c91e3c1..4322ca8 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -122,6 +122,8 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list); > void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int); > extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, > struct task_struct *p); > +extern void dump_tasks_memcg(const struct mem_cgroup *memcg, > + const nodemask_t *nodemask); Shouldn't need the nodemask parameter, just have dump_tasks_memcg() pass NULL to dump_per_task(), we won't be isolating to tasks with mempolicies restricted to a particular set of nodes since we're in the memcg oom path here, not the global page allocator oom path. > extern void mem_cgroup_replace_page_cache(struct page *oldpage, > struct page *newpage); > > @@ -337,6 +339,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) > { > } > > +static inline void > +dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > +{ > +} > + > static inline void mem_cgroup_begin_update_page_stat(struct page *page, > bool *locked, unsigned long *flags) > { > diff --git a/include/linux/oom.h b/include/linux/oom.h > index 20b5c46..9ba3344 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -57,6 +57,8 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, > unsigned long totalpages, const nodemask_t *nodemask, > bool force_kill); > > +extern inline void dump_per_task(struct task_struct *p, > + const nodemask_t *nodemask); This is a global symbol, so dump_per_task() doesn't make a lot of sense: it would need to be prefixed with "oom_" so perhaps oom_dump_task() is better? > extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > int order, nodemask_t *mask, bool force_kill); > extern int register_oom_notifier(struct notifier_block *nb); > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 2df5e72..fe648f8 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1665,6 +1665,20 @@ static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) > return min(limit, memsw); > } > > +void dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > +{ > + struct cgroup_iter it; > + struct task_struct *task; > + struct cgroup *cgroup = memcg->css.cgroup; > + > + cgroup_iter_start(cgroup, &it); > + while ((task = cgroup_iter_next(cgroup, &it))) { > + dump_per_task(task, nodemask); > + } > + > + cgroup_iter_end(cgroup, &it); > +} > + > static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > int order) > { > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 4b8a6dd..aaf6237 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -367,6 +367,32 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > return chosen; > } > > +inline void dump_per_task(struct task_struct *p, const nodemask_t *nodemask) No inline. > +{ > + struct task_struct *task; > + > + if (oom_unkillable_task(p, NULL, nodemask)) > + return; > + > + task = find_lock_task_mm(p); > + if (!task) { > + /* > + * This is a kthread or all of p's threads have already > + * detached their mm's. There's no need to report > + * them; they can't be oom killed anyway. > + */ > + return; > + } > + > + pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", > + task->pid, from_kuid(&init_user_ns, task_uid(task)), > + task->tgid, task->mm->total_vm, get_mm_rss(task->mm), > + task->mm->nr_ptes, > + get_mm_counter(task->mm, MM_SWAPENTS), > + task->signal->oom_score_adj, task->comm); > + task_unlock(task); > +} > + > /** > * dump_tasks - dump current memory state of all system tasks > * @memcg: current's memory controller, if constrained > @@ -381,32 +407,17 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > { > struct task_struct *p; > - struct task_struct *task; > > pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n"); > - rcu_read_lock(); > - for_each_process(p) { > - if (oom_unkillable_task(p, memcg, nodemask)) > - continue; > - > - task = find_lock_task_mm(p); > - if (!task) { > - /* > - * This is a kthread or all of p's threads have already > - * detached their mm's. There's no need to report > - * them; they can't be oom killed anyway. > - */ > - continue; > - } > > - pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", > - task->pid, from_kuid(&init_user_ns, task_uid(task)), > - task->tgid, task->mm->total_vm, get_mm_rss(task->mm), > - task->mm->nr_ptes, > - get_mm_counter(task->mm, MM_SWAPENTS), > - task->signal->oom_score_adj, task->comm); > - task_unlock(task); > + if (memcg) { > + dump_tasks_memcg(memcg, nodemask); > + return; > } > + > + rcu_read_lock(); > + for_each_process(p) > + dump_per_task(p, nodemask); > rcu_read_unlock(); > } > From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andrew Morton Subject: Re: [PATCH V2 0/2] Provide more precise dump info for memcg-oom Date: Wed, 7 Nov 2012 11:31:07 -0800 Message-ID: <20121107113107.d91eba43.akpm@linux-foundation.org> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Sha Zhengju Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, kamezawa.hiroyu@jp.fujitsu.com, rientjes@google.com, linux-kernel@vger.kernel.org, Sha Zhengju On Wed, 7 Nov 2012 16:40:02 +0800 Sha Zhengju wrote: > When memcg oom is happening the current memcg related dump information > is limited for debugging. The patches provide more detailed memcg page statistics > and also take hierarchy into consideration. Within the changelogs, please include a sample of the proposed output so we can properly review the proposal. Also it would be useful to provide some justification for the decisions in this patch: which data is displayed and, particularly, which is not? Why is the displayed information useful to developers, etc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Date: Wed, 7 Nov 2012 23:17:09 +0100 Message-ID: <20121107221709.GB26382@dhcp22.suse.cz> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=Rc7bk1S6jCbJ9ihb1oTa7T11w/GbwbcgeyFbjceRfoY=; b=Bwvz27lEhQT1S1rcD6prKnwqdD+wo3+77vxF9dW2Mdp1QrJtsz11+CyhhTJJ2bdko7 7Ik5eq8dept7tsHXfulU0r320b4LFCTGMK5zjpdf00RDBsvVITQzMmUy9wF1RYG9vAeL CP1FF0bJxmxEdHv0ytHnp3KMSW+cZMt6OjcPiXFLFKo3mTHoV75DteetBwIZ0JtI88yb OkJgGj9iObitqP4rxNTchhaaLGRMvCojh7nTOQuIGx2iZB/KJe2+kQKC/38paSmKA3Ok FKP2n4FMtrxxjlG2rUQgun1pAWjDwZmjlTRJysSDpNLHqRL2mQLgEbUXpnwFYZpuqNJD h46w== Content-Disposition: inline In-Reply-To: <1352277696-21724-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Sha Zhengju Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sha Zhengju On Wed 07-11-12 16:41:36, Sha Zhengju wrote: > From: Sha Zhengju > > Current, when a memcg oom is happening the oom dump messages is still global > state and provides few useful info for users. This patch prints more pointed > memcg page statistics for memcg-oom. > > Signed-off-by: Sha Zhengju > Cc: Michal Hocko > Cc: KAMEZAWA Hiroyuki > Cc: David Rientjes > Cc: Andrew Morton > --- > mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++------- > mm/oom_kill.c | 6 +++- > 2 files changed, 66 insertions(+), 11 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 0eab7d5..2df5e72 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c [...] > @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, > spin_unlock_irqrestore(&memcg->move_lock, *flags); > } > > +#define K(x) ((x) << (PAGE_SHIFT-10)) > +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *mi; > + unsigned int i; > + > + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) { > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], > + K(mem_cgroup_read_stat(memcg, i))); > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) > + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], > + mem_cgroup_read_events(memcg, i)); > + > + for (i = 0; i < NR_LRU_LISTS; i++) > + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], > + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); > + } else { > + > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + long long val = 0; > + > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_stat(mi, i); > + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_events(mi, i); > + printk(KERN_CONT "%s:%llu ", > + mem_cgroup_events_names[i], val); > + } > + > + for (i = 0; i < NR_LRU_LISTS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); > + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); > + } > + } This is just plain ugly. for_each_mem_cgroup_tree is use_hierarchy aware and there is no need for if (use_hierarchy) part. memcg != root_mem_cgroup test doesn't make much sense as well because we call that a global oom killer ;) -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation Date: Wed, 7 Nov 2012 23:34:37 +0100 Message-ID: <20121107223437.GC26382@dhcp22.suse.cz> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277719-21760-1-git-send-email-handai.szj@taobao.com> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=kk4NXYuzJwjIk2lQXnAS1l2iURvm8ryRHThJA652j3w=; b=ShEMTDtPIbnUWMZ+uMaN4ipoq3jUVEfGUa4VjkfalGXMzBwrtXN7lR34JriySGN69j HEKmOfdr/UTxymwkgy6aF1X4yC20qECg9rnSwNEjSRusfvmOBHCyZQbJp1BF9eQtltRz FVbOjqD5OXvHoqB7KsG2JWuRS9yqbb1LgvBTXM1TqcfetPNQZTm+YFVsEvLftrtxc98/ SluyrsvzNh8ZRNiMSUL5Bm7Wz1XCeEco2+B8sq0uQvgs8fPS5Zn/LFt9AlopsbUA0Y9v 6uDKxZu5O3av9C8PeVfcWT3FlPlAw4+ZplS0gGZhC6a80YVZY95W/p5Yc8mkytVAmK3L 08uQ== Content-Disposition: inline In-Reply-To: <1352277719-21760-1-git-send-email-handai.szj@taobao.com> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Sha Zhengju Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, rientjes@google.com, linux-kernel@vger.kernel.org, Sha Zhengju On Wed 07-11-12 16:41:59, Sha Zhengju wrote: > From: Sha Zhengju > > If memcg oom happening, don't scan all system tasks to dump memory state of > eligible tasks, instead we iterates only over the process attached to the oom > memcg and avoid the rcu lock. you have replaced rcu lock by css_set_lock which is, well, heavier than rcu. Besides that the patch is not correct because you have excluded all tasks that are from subgroups because you iterate only through the top level one. I am not sure the whole optimization would be a win even if implemented correctly. Well, we scan through more tasks currently and most of them are not relevant but then you would need to exclude task_in_mem_cgroup from oom_unkillable_task and that would be more code churn than the win. > Signed-off-by: Sha Zhengju > Cc: Michal Hocko > Cc: KAMEZAWA Hiroyuki > Cc: David Rientjes > Cc: Andrew Morton > --- > include/linux/memcontrol.h | 7 +++++ > include/linux/oom.h | 2 + > mm/memcontrol.c | 14 +++++++++++ > mm/oom_kill.c | 55 ++++++++++++++++++++++++++----------------- > 4 files changed, 56 insertions(+), 22 deletions(-) > [...] > diff --git a/include/linux/oom.h b/include/linux/oom.h > index 20b5c46..9ba3344 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -57,6 +57,8 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, > unsigned long totalpages, const nodemask_t *nodemask, > bool force_kill); > > +extern inline void dump_per_task(struct task_struct *p, > + const nodemask_t *nodemask); > extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > int order, nodemask_t *mask, bool force_kill); > extern int register_oom_notifier(struct notifier_block *nb); > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 2df5e72..fe648f8 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1665,6 +1665,20 @@ static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) > return min(limit, memsw); > } > > +void dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > +{ > + struct cgroup_iter it; > + struct task_struct *task; > + struct cgroup *cgroup = memcg->css.cgroup; > + > + cgroup_iter_start(cgroup, &it); > + while ((task = cgroup_iter_next(cgroup, &it))) { > + dump_per_task(task, nodemask); > + } > + > + cgroup_iter_end(cgroup, &it); > +} > + > static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > int order) > { > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 4b8a6dd..aaf6237 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -367,6 +367,32 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > return chosen; > } > > +inline void dump_per_task(struct task_struct *p, const nodemask_t *nodemask) > +{ > + struct task_struct *task; > + > + if (oom_unkillable_task(p, NULL, nodemask)) > + return; > + > + task = find_lock_task_mm(p); > + if (!task) { > + /* > + * This is a kthread or all of p's threads have already > + * detached their mm's. There's no need to report > + * them; they can't be oom killed anyway. > + */ > + return; > + } > + > + pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", > + task->pid, from_kuid(&init_user_ns, task_uid(task)), > + task->tgid, task->mm->total_vm, get_mm_rss(task->mm), > + task->mm->nr_ptes, > + get_mm_counter(task->mm, MM_SWAPENTS), > + task->signal->oom_score_adj, task->comm); > + task_unlock(task); > +} > + > /** > * dump_tasks - dump current memory state of all system tasks > * @memcg: current's memory controller, if constrained > @@ -381,32 +407,17 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > { > struct task_struct *p; > - struct task_struct *task; > > pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n"); > - rcu_read_lock(); > - for_each_process(p) { > - if (oom_unkillable_task(p, memcg, nodemask)) > - continue; > - > - task = find_lock_task_mm(p); > - if (!task) { > - /* > - * This is a kthread or all of p's threads have already > - * detached their mm's. There's no need to report > - * them; they can't be oom killed anyway. > - */ > - continue; > - } > > - pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", > - task->pid, from_kuid(&init_user_ns, task_uid(task)), > - task->tgid, task->mm->total_vm, get_mm_rss(task->mm), > - task->mm->nr_ptes, > - get_mm_counter(task->mm, MM_SWAPENTS), > - task->signal->oom_score_adj, task->comm); > - task_unlock(task); > + if (memcg) { > + dump_tasks_memcg(memcg, nodemask); > + return; > } > + > + rcu_read_lock(); > + for_each_process(p) > + dump_per_task(p, nodemask); > rcu_read_unlock(); > } > > -- > 1.7.6.1 > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Kamezawa Hiroyuki Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Date: Thu, 08 Nov 2012 18:07:36 +0900 Message-ID: <509B7658.1020807@jp.fujitsu.com> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <1352277696-21724-1-git-send-email-handai.szj@taobao.com> Sender: owner-linux-mm@kvack.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Sha Zhengju Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, akpm@linux-foundation.org, rientjes@google.com, linux-kernel@vger.kernel.org, Sha Zhengju (2012/11/07 17:41), Sha Zhengju wrote: > From: Sha Zhengju > > Current, when a memcg oom is happening the oom dump messages is still global > state and provides few useful info for users. This patch prints more pointed > memcg page statistics for memcg-oom. > > Signed-off-by: Sha Zhengju > Cc: Michal Hocko > Cc: KAMEZAWA Hiroyuki > Cc: David Rientjes > Cc: Andrew Morton > --- > mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++------- > mm/oom_kill.c | 6 +++- > 2 files changed, 66 insertions(+), 11 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 0eab7d5..2df5e72 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = { > "pgmajfault", > }; > > +static const char * const mem_cgroup_lru_names[] = { > + "inactive_anon", > + "active_anon", > + "inactive_file", > + "active_file", > + "unevictable", > +}; > + Is this for the same strings with show_free_areas() ? > /* > * Per memcg event counter is incremented at every pagein/pageout. With THP, > * it will be incremated by the number of pages. This counter is used for > @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, > spin_unlock_irqrestore(&memcg->move_lock, *flags); > } > > +#define K(x) ((x) << (PAGE_SHIFT-10)) > +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *mi; > + unsigned int i; > + > + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) { Why do you need to have this condition check ? > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], > + K(mem_cgroup_read_stat(memcg, i))); Hm, how about using the same style with show_free_areas() ? > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) > + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], > + mem_cgroup_read_events(memcg, i)); > + I don't think EVENTS info is useful for oom. > + for (i = 0; i < NR_LRU_LISTS; i++) > + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], > + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); How far does your new information has different format than usual oom ? Could you show a sample and difference in changelog ? Of course, I prefer both of them has similar format. > + } else { > + > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + long long val = 0; > + > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_stat(mi, i); > + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_events(mi, i); > + printk(KERN_CONT "%s:%llu ", > + mem_cgroup_events_names[i], val); > + } > + > + for (i = 0; i < NR_LRU_LISTS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); > + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); > + } > + } > + printk(KERN_CONT "\n"); > +} > /** > - * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode. > * @memcg: The memory cgroup that went over limit > * @p: Task that is going to be killed > * > @@ -1569,6 +1628,8 @@ done: > res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10, > res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10, > res_counter_read_u64(&memcg->kmem, RES_FAILCNT)); > + > + mem_cgroup_print_oom_stat(memcg); > } please put directly in print_oom_info() Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sha Zhengju Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Date: Thu, 08 Nov 2012 20:37:45 +0800 Message-ID: <509BA799.505@gmail.com> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> Mime-Version: 1.0 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=1+9tUMYlvkzTu+p5aPVTaWD60MXvzMLlET37uPHEt6E=; b=H2IsbyhGVwTL4uH2ZKxi8aT2S4vNSk8q+xq48sAroBjD3MjByMud2C8QvSlCtY0TPF l0HmWdRMyWPeqrm+OtPsjxzezKBL0EQwJgLZ3ShQXu5zFIc9szgdJxX0WPL1M/mwLKi6 53iyaKv/kBOYPA61Egpx7ilzAfOxs+lGYRmyx+nrNg4DlkdCrlkrpuWe9itJc3TnOg1F WqPwlcqh+rmhTfdku84uX8T738CtTZ1g9NR8psVgTiv+OR99PJHkd1PkoJJcR16/6VEG OWiOVAIyCXaNL1et+AY33m/hF4fbwyR+E61lxZHbZsvxbMp+B9gSwGtQZMFhYnOiFvC5 GczA== In-Reply-To: Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="windows-1252"; format="flowed" To: David Rientjes Cc: Sha Zhengju , linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, mhocko-AlSwsSmVLrQ@public.gmane.org, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On 11/08/2012 02:02 AM, David Rientjes wrote: > On Wed, 7 Nov 2012, Sha Zhengju wrote: > >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 0eab7d5..2df5e72 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_nam= es[] =3D { >> "pgmajfault", >> }; >> >> +static const char * const mem_cgroup_lru_names[] =3D { >> + "inactive_anon", >> + "active_anon", >> + "inactive_file", >> + "active_file", >> + "unevictable", >> +}; >> + >> /* >> * Per memcg event counter is incremented at every pagein/pageout.= With THP, >> * it will be incremated by the number of pages. This counter is u= sed for >> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem= _cgroup *memcg, >> spin_unlock_irqrestore(&memcg->move_lock, *flags); >> } >> >> +#define K(x) ((x)<< (PAGE_SHIFT-10)) >> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) >> +{ >> + struct mem_cgroup *mi; >> + unsigned int i; >> + >> + if (!memcg->use_hierarchy&& memcg !=3D root_mem_cgroup) { >> + for (i =3D 0; i< MEM_CGROUP_STAT_NSTATS; i++) { >> + if (i =3D=3D MEM_CGROUP_STAT_SWAP&& !do_swap_account) >> + continue; >> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], > This printk isn't continuing any previous printk, so using KERN_CONT = here > will require a short header to be printed first ("Memcg: "?) with > KERN_INFO before the iterations. > Yep...I think I lost it while rebasing... sorry for the stupid mistake. >> + K(mem_cgroup_read_stat(memcg, i))); >> + } >> + >> + for (i =3D 0; i< MEM_CGROUP_EVENTS_NSTATS; i++) >> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], >> + mem_cgroup_read_events(memcg, i)); >> + >> + for (i =3D 0; i< NR_LRU_LISTS; i++) >> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], >> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); >> + } else { >> + > Spurious newline. > > Eek, is there really no way to avoid this if-conditional and just use > for_each_mem_cgroup_tree() for everything and use > > mem_cgroup_iter_break(memcg, iter); > break; > > for !memcg->use_hierarchy? > Now I'm shamed at my bad brain of yesterday by sending this chunk out..= =2E Yes, the if-part code above is obviously unwanted, and the=20 for_each_mem_cgroup_tree can handle hierarchy already. >> + for (i =3D 0; i< MEM_CGROUP_STAT_NSTATS; i++) { >> + long long val =3D 0; >> + >> + if (i =3D=3D MEM_CGROUP_STAT_SWAP&& !do_swap_account) >> + continue; >> + for_each_mem_cgroup_tree(mi, memcg) >> + val +=3D mem_cgroup_read_stat(mi, i); >> + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val))= ; >> + } >> + >> + for (i =3D 0; i< MEM_CGROUP_EVENTS_NSTATS; i++) { >> + unsigned long long val =3D 0; >> + >> + for_each_mem_cgroup_tree(mi, memcg) >> + val +=3D mem_cgroup_read_events(mi, i); >> + printk(KERN_CONT "%s:%llu ", >> + mem_cgroup_events_names[i], val); >> + } >> + >> + for (i =3D 0; i< NR_LRU_LISTS; i++) { >> + unsigned long long val =3D 0; >> + >> + for_each_mem_cgroup_tree(mi, memcg) >> + val +=3D mem_cgroup_nr_lru_pages(mi, BIT(i)); >> + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); >> + } >> + } >> + printk(KERN_CONT "\n"); >> +} >> /** >> - * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock he= ld in read mode. >> * @memcg: The memory cgroup that went over limit >> * @p: Task that is going to be killed >> * >> @@ -1569,6 +1628,8 @@ done: >> res_counter_read_u64(&memcg->kmem, RES_USAGE)>> 10, >> res_counter_read_u64(&memcg->kmem, RES_LIMIT)>> 10, >> res_counter_read_u64(&memcg->kmem, RES_FAILCNT)); >> + >> + mem_cgroup_print_oom_stat(memcg); > I think this should be folded into mem_cgroup_print_oom_info(), I don= 't > see a need for a new function. > >> } >> >> /* >> @@ -5195,14 +5256,6 @@ static int memcg_numa_stat_show(struct cgroup= *cont, struct cftype *cft, >> } >> #endif /* CONFIG_NUMA */ >> >> -static const char * const mem_cgroup_lru_names[] =3D { >> - "inactive_anon", >> - "active_anon", >> - "inactive_file", >> - "active_file", >> - "unevictable", >> -}; >> - >> static inline void mem_cgroup_lru_names_not_uptodate(void) >> { >> BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) !=3D NR_LRU_LISTS); >> diff --git a/mm/oom_kill.c b/mm/oom_kill.c >> index 7e9e911..4b8a6dd 100644 >> --- a/mm/oom_kill.c >> +++ b/mm/oom_kill.c >> @@ -421,8 +421,10 @@ static void dump_header(struct task_struct *p, = gfp_t gfp_mask, int order, >> cpuset_print_task_mems_allowed(current); >> task_unlock(current); >> dump_stack(); >> - mem_cgroup_print_oom_info(memcg, p); >> - show_mem(SHOW_MEM_FILTER_NODES); >> + if (memcg) >> + mem_cgroup_print_oom_info(memcg, p); > mem_cgroup_print_oom_info() already returns immediately for !memcg, s= o I'm > not sure why this change is made. > Here the if-else checking is aiming at printing distinct messages for=20 memcg & non-memcg. IMHO, global state has little actual use for memcg-oom and why not we=20 wipe off it=EF=BC=9F Though mem_cgroup_print_oom_info already checking for !memcg, the=20 if-statement can avoid one function call and save the deep-enough oom call stack a=20 little. >> + else >> + show_mem(SHOW_MEM_FILTER_NODES); > Well that's disappointing if memcg =3D=3D root_mem_cgroup, we'd proba= bly like > to know the global memory state to determine what the problem is. > I really wondering if there is any case that can pass root_mem_cgroup=20 down here. It's called by global or memcg oom killer and the global oom will set=20 memcg=3DNULL directly instead of root_mem_cgroup. Besides, root memcg will not go=20 through charging and there is no chance to call mem_cgroup_out_of_memory for root cgroup= =20 tasks. Thanks, Sha >> if (sysctl_oom_dump_tasks) >> dump_tasks(memcg, nodemask); >> } > . > From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Date: Thu, 8 Nov 2012 13:44:26 +0100 Message-ID: <20121108124426.GF31821@dhcp22.suse.cz> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> <509BA799.505@gmail.com> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <509BA799.505-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Sha Zhengju Cc: David Rientjes , linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu 08-11-12 20:37:45, Sha Zhengju wrote: > On 11/08/2012 02:02 AM, David Rientjes wrote: > >On Wed, 7 Nov 2012, Sha Zhengju wrote: [..] > >>+ else > >>+ show_mem(SHOW_MEM_FILTER_NODES); > >Well that's disappointing if memcg == root_mem_cgroup, we'd probably like > >to know the global memory state to determine what the problem is. > > > > I really wondering if there is any case that can pass > root_mem_cgroup down here. No it cannot because the root cgroup doesn't have any limit so we cannot trigger memcg oom killer. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sha Zhengju Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Date: Thu, 08 Nov 2012 20:45:30 +0800 Message-ID: <509BA96A.2070701@gmail.com> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> <20121107221709.GB26382@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=KtYdqF01zVEXq75YwvSOdJg9WLmftFK2KYtUyrEZtd8=; b=aieS0j6FoG9VjBH+z/Del5gBu3rzAwJNC9TeoCSIXKDg2Ws/dcTEevQ0I2dYa8r7+v Cg+KH0OFnyKzXF9RNFpZcJGvgyQcm7um+TLNEAMPVKlDjpK25l3x1kbTKImave1M9Rmz vp/V2xesVYu9RIpH6BuSzCW+C5LdXdIABkQSif/mnPC/MtNWBfRSk5sU2VaTHJKZK8Gi IYZYEy+pv8zmWKqczFbnq93RUJmIiQHPMGLm3Ubir1liAGv6KKxBaAxeDi7Jfx5P1WsL tRZDfl2kan8OI7Xm3JHZjZxsLF/2gquZzG5m4gn0rOPV1RjG7jgmXsRbUN7ErCdfBis8 +O6g== In-Reply-To: <20121107221709.GB26382-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii"; format="flowed" To: Michal Hocko Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sha Zhengju On 11/08/2012 06:17 AM, Michal Hocko wrote: > On Wed 07-11-12 16:41:36, Sha Zhengju wrote: >> From: Sha Zhengju >> >> Current, when a memcg oom is happening the oom dump messages is still global >> state and provides few useful info for users. This patch prints more pointed >> memcg page statistics for memcg-oom. >> >> Signed-off-by: Sha Zhengju >> Cc: Michal Hocko >> Cc: KAMEZAWA Hiroyuki >> Cc: David Rientjes >> Cc: Andrew Morton >> --- >> mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++------- >> mm/oom_kill.c | 6 +++- >> 2 files changed, 66 insertions(+), 11 deletions(-) >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 0eab7d5..2df5e72 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c > [...] >> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, >> spin_unlock_irqrestore(&memcg->move_lock, *flags); >> } >> >> +#define K(x) ((x)<< (PAGE_SHIFT-10)) >> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) >> +{ >> + struct mem_cgroup *mi; >> + unsigned int i; >> + >> + if (!memcg->use_hierarchy&& memcg != root_mem_cgroup) { >> + for (i = 0; i< MEM_CGROUP_STAT_NSTATS; i++) { >> + if (i == MEM_CGROUP_STAT_SWAP&& !do_swap_account) >> + continue; >> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], >> + K(mem_cgroup_read_stat(memcg, i))); >> + } >> + >> + for (i = 0; i< MEM_CGROUP_EVENTS_NSTATS; i++) >> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], >> + mem_cgroup_read_events(memcg, i)); >> + >> + for (i = 0; i< NR_LRU_LISTS; i++) >> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], >> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); >> + } else { >> + >> + for (i = 0; i< MEM_CGROUP_STAT_NSTATS; i++) { >> + long long val = 0; >> + >> + if (i == MEM_CGROUP_STAT_SWAP&& !do_swap_account) >> + continue; >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_read_stat(mi, i); >> + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); >> + } >> + >> + for (i = 0; i< MEM_CGROUP_EVENTS_NSTATS; i++) { >> + unsigned long long val = 0; >> + >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_read_events(mi, i); >> + printk(KERN_CONT "%s:%llu ", >> + mem_cgroup_events_names[i], val); >> + } >> + >> + for (i = 0; i< NR_LRU_LISTS; i++) { >> + unsigned long long val = 0; >> + >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); >> + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); >> + } >> + } > This is just plain ugly. for_each_mem_cgroup_tree is use_hierarchy aware > and there is no need for if (use_hierarchy) part. > memcg != root_mem_cgroup test doesn't make much sense as well because we > call that a global oom killer ;) Yes... bitterly did I repent the patch... The else-part of for_each_mem_cgroup_tree is enough for hierarchy. I'll send a update one later. Sorry for the noise. : ( Thanks, Sha From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sha Zhengju Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Date: Thu, 08 Nov 2012 21:12:00 +0800 Message-ID: <509BAFA0.4010604@gmail.com> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> <509B7658.1020807@jp.fujitsu.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=o3sovKB/3dZN9LoVgKBxiKGrbrvjX1wvYss2U921ObI=; b=aqSq4cOGs5mEjAfayTg8C4t/9ra8Jf7TiVb/ZHpRV1mxoxz8/ZShdiaLaBGeOScvzq Vz7sb8RmWwPYLQNUZZBeg6WzhVJscs+kp/zpQ1b40M/J6l6b0yaUR1LS0eqMBLTFFfsU lKizbJJbsC7uKE/L5+iS686Sh33FYVu0Wj++flGrcD2oXEsZP/yjnpz1zN9bqYUdCX4Q 63YryifPGjF/dJ1TI5MXxXzOzCyS5VLOr9sq1WefYDjiE51MR6HWBMnRH46WQQbyKQMj L9ZNPAnIFeTYgDASmLZwB+W6MCj2pYI3dB7ANKjq1XlEppv9fCIN32BKzxb4TfXdiiWk E1PA== In-Reply-To: <509B7658.1020807-+CUm20s59erQFUHtdCDX3A@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Kamezawa Hiroyuki Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, mhocko-AlSwsSmVLrQ@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sha Zhengju On 11/08/2012 05:07 PM, Kamezawa Hiroyuki wrote: > (2012/11/07 17:41), Sha Zhengju wrote: >> From: Sha Zhengju >> >> Current, when a memcg oom is happening the oom dump messages is still global >> state and provides few useful info for users. This patch prints more pointed >> memcg page statistics for memcg-oom. >> >> Signed-off-by: Sha Zhengju >> Cc: Michal Hocko >> Cc: KAMEZAWA Hiroyuki >> Cc: David Rientjes >> Cc: Andrew Morton >> --- >> mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++------- >> mm/oom_kill.c | 6 +++- >> 2 files changed, 66 insertions(+), 11 deletions(-) >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 0eab7d5..2df5e72 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = { >> "pgmajfault", >> }; >> >> +static const char * const mem_cgroup_lru_names[] = { >> + "inactive_anon", >> + "active_anon", >> + "inactive_file", >> + "active_file", >> + "unevictable", >> +}; >> + > Is this for the same strings with show_free_areas() ? > I just move the declaration here from the bottom of source file to make the following use error-free. >> /* >> * Per memcg event counter is incremented at every pagein/pageout. With THP, >> * it will be incremated by the number of pages. This counter is used for >> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, >> spin_unlock_irqrestore(&memcg->move_lock, *flags); >> } >> >> +#define K(x) ((x) << (PAGE_SHIFT-10)) >> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) >> +{ >> + struct mem_cgroup *mi; >> + unsigned int i; >> + >> + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) { > Why do you need to have this condition check ? > Yes, the check is unnecessary... I'll remove it next version. >> + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { >> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) >> + continue; >> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], >> + K(mem_cgroup_read_stat(memcg, i))); > Hm, how about using the same style with show_free_areas() ? > I'm also trying do so. show_free_areas() prints the memory related info in two style: one is in page unit and the oher is in KB (I've no idea why we distinct them), but I think the KB format is more readable. >> + } >> + >> + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) >> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], >> + mem_cgroup_read_events(memcg, i)); >> + > I don't think EVENTS info is useful for oom. > It seems you're right. : ) >> + for (i = 0; i < NR_LRU_LISTS; i++) >> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], >> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); > How far does your new information has different format than usual oom ? > Could you show a sample and difference in changelog ? > > Of course, I prefer both of them has similar format. > > > The new memcg-oom info excludes global state out and prints the memcg statistics instead which seems more brevity. I'll add a sample next time. Thanks for reminding me! Thanks, Sha From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sha Zhengju Subject: Re: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation Date: Thu, 08 Nov 2012 21:58:52 +0800 Message-ID: <509BBA9C.7050007@gmail.com> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277719-21760-1-git-send-email-handai.szj@taobao.com> <20121107223437.GC26382@dhcp22.suse.cz> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=UguCFhAsL3Vjc3lrI4VK4LbuC5L6yrIWIGb4jSI3BEY=; b=MApKHgYksUmRlK7ni83pqlYgp699U1UouXrPq+VQPqdki+UeolrI+0zhhWF7vFWWWd N3A8Rps3h9bNf3OQxZK93DK+Qas94Wf7U0dVjxs9P+4WwYxlWJcsOj2c65sZvR3jZFJf cAqECIuT1auZQW91ZL9zm2uDGpkkPyY8sYA3xmNKqyUaFsnnXRekE0ZfC9nXxezIx95t 7IMi48Pmldu+rnEQ0AXQk/sm01etTUnpAvq4lykcDVYQrdZ1VACbr7fsxDSt+mxjC03I vBp+KuF0OZ9AzaDRZo7+HQe4dsCv0C5AZ/hZoH9rl3biSYIRUoCQ+xeTUvSjRX+kiB1i oGOw== In-Reply-To: <20121107223437.GC26382-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii"; format="flowed" To: Michal Hocko Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sha Zhengju On 11/08/2012 06:34 AM, Michal Hocko wrote: > On Wed 07-11-12 16:41:59, Sha Zhengju wrote: >> From: Sha Zhengju >> >> If memcg oom happening, don't scan all system tasks to dump memory state of >> eligible tasks, instead we iterates only over the process attached to the oom >> memcg and avoid the rcu lock. > you have replaced rcu lock by css_set_lock which is, well, heavier than > rcu. Besides that the patch is not correct because you have excluded > all tasks that are from subgroups because you iterate only through the > top level one. > I am not sure the whole optimization would be a win even if implemented > correctly. Well, we scan through more tasks currently and most of them > are not relevant but then you would need to exclude task_in_mem_cgroup > from oom_unkillable_task and that would be more code churn than the > win. Thanks for your and David's advice. This piece is trying to save some expense while dumping memcg tasks, but failed to scanning subgroups by iterating the cgroup. I'm agreed with your cost&win opinion, so I decide to give up this one. : ) Thanks, Sha From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx134.postini.com [74.125.245.134]) by kanga.kvack.org (Postfix) with SMTP id 6D5606B004D for ; Wed, 7 Nov 2012 03:42:00 -0500 (EST) Received: by mail-da0-f41.google.com with SMTP id i14so648014dad.14 for ; Wed, 07 Nov 2012 00:42:00 -0800 (PST) From: Sha Zhengju Subject: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation Date: Wed, 7 Nov 2012 16:41:59 +0800 Message-Id: <1352277719-21760-1-git-send-email-handai.szj@taobao.com> In-Reply-To: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, rientjes@google.com Cc: linux-kernel@vger.kernel.org, Sha Zhengju From: Sha Zhengju If memcg oom happening, don't scan all system tasks to dump memory state of eligible tasks, instead we iterates only over the process attached to the oom memcg and avoid the rcu lock. Signed-off-by: Sha Zhengju Cc: Michal Hocko Cc: KAMEZAWA Hiroyuki Cc: David Rientjes Cc: Andrew Morton --- include/linux/memcontrol.h | 7 +++++ include/linux/oom.h | 2 + mm/memcontrol.c | 14 +++++++++++ mm/oom_kill.c | 55 ++++++++++++++++++++++++++----------------- 4 files changed, 56 insertions(+), 22 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index c91e3c1..4322ca8 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -122,6 +122,8 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list); void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +extern void dump_tasks_memcg(const struct mem_cgroup *memcg, + const nodemask_t *nodemask); extern void mem_cgroup_replace_page_cache(struct page *oldpage, struct page *newpage); @@ -337,6 +339,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } +static inline void +dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) +{ +} + static inline void mem_cgroup_begin_update_page_stat(struct page *page, bool *locked, unsigned long *flags) { diff --git a/include/linux/oom.h b/include/linux/oom.h index 20b5c46..9ba3344 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -57,6 +57,8 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, unsigned long totalpages, const nodemask_t *nodemask, bool force_kill); +extern inline void dump_per_task(struct task_struct *p, + const nodemask_t *nodemask); extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *mask, bool force_kill); extern int register_oom_notifier(struct notifier_block *nb); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2df5e72..fe648f8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1665,6 +1665,20 @@ static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) return min(limit, memsw); } +void dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) +{ + struct cgroup_iter it; + struct task_struct *task; + struct cgroup *cgroup = memcg->css.cgroup; + + cgroup_iter_start(cgroup, &it); + while ((task = cgroup_iter_next(cgroup, &it))) { + dump_per_task(task, nodemask); + } + + cgroup_iter_end(cgroup, &it); +} + static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, int order) { diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 4b8a6dd..aaf6237 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -367,6 +367,32 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, return chosen; } +inline void dump_per_task(struct task_struct *p, const nodemask_t *nodemask) +{ + struct task_struct *task; + + if (oom_unkillable_task(p, NULL, nodemask)) + return; + + task = find_lock_task_mm(p); + if (!task) { + /* + * This is a kthread or all of p's threads have already + * detached their mm's. There's no need to report + * them; they can't be oom killed anyway. + */ + return; + } + + pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", + task->pid, from_kuid(&init_user_ns, task_uid(task)), + task->tgid, task->mm->total_vm, get_mm_rss(task->mm), + task->mm->nr_ptes, + get_mm_counter(task->mm, MM_SWAPENTS), + task->signal->oom_score_adj, task->comm); + task_unlock(task); +} + /** * dump_tasks - dump current memory state of all system tasks * @memcg: current's memory controller, if constrained @@ -381,32 +407,17 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask) { struct task_struct *p; - struct task_struct *task; pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n"); - rcu_read_lock(); - for_each_process(p) { - if (oom_unkillable_task(p, memcg, nodemask)) - continue; - - task = find_lock_task_mm(p); - if (!task) { - /* - * This is a kthread or all of p's threads have already - * detached their mm's. There's no need to report - * them; they can't be oom killed anyway. - */ - continue; - } - pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", - task->pid, from_kuid(&init_user_ns, task_uid(task)), - task->tgid, task->mm->total_vm, get_mm_rss(task->mm), - task->mm->nr_ptes, - get_mm_counter(task->mm, MM_SWAPENTS), - task->signal->oom_score_adj, task->comm); - task_unlock(task); + if (memcg) { + dump_tasks_memcg(memcg, nodemask); + return; } + + rcu_read_lock(); + for_each_process(p) + dump_per_task(p, nodemask); rcu_read_unlock(); } -- 1.7.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx147.postini.com [74.125.245.147]) by kanga.kvack.org (Postfix) with SMTP id 158516B005A for ; Wed, 7 Nov 2012 13:07:17 -0500 (EST) Received: by mail-pa0-f41.google.com with SMTP id fa10so1466084pad.14 for ; Wed, 07 Nov 2012 10:07:16 -0800 (PST) Date: Wed, 7 Nov 2012 10:07:14 -0800 (PST) From: David Rientjes Subject: Re: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation In-Reply-To: <1352277719-21760-1-git-send-email-handai.szj@taobao.com> Message-ID: References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277719-21760-1-git-send-email-handai.szj@taobao.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Sha Zhengju Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, Sha Zhengju On Wed, 7 Nov 2012, Sha Zhengju wrote: > From: Sha Zhengju > > If memcg oom happening, don't scan all system tasks to dump memory state of > eligible tasks, instead we iterates only over the process attached to the oom > memcg and avoid the rcu lock. > Avoiding the rcu lock isn't actually that impressive here, the cgroup iterator will use it's own lock for that memcg. > Signed-off-by: Sha Zhengju > Cc: Michal Hocko > Cc: KAMEZAWA Hiroyuki > Cc: David Rientjes > Cc: Andrew Morton > --- > include/linux/memcontrol.h | 7 +++++ > include/linux/oom.h | 2 + > mm/memcontrol.c | 14 +++++++++++ > mm/oom_kill.c | 55 ++++++++++++++++++++++++++----------------- > 4 files changed, 56 insertions(+), 22 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index c91e3c1..4322ca8 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -122,6 +122,8 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list); > void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int); > extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, > struct task_struct *p); > +extern void dump_tasks_memcg(const struct mem_cgroup *memcg, > + const nodemask_t *nodemask); Shouldn't need the nodemask parameter, just have dump_tasks_memcg() pass NULL to dump_per_task(), we won't be isolating to tasks with mempolicies restricted to a particular set of nodes since we're in the memcg oom path here, not the global page allocator oom path. > extern void mem_cgroup_replace_page_cache(struct page *oldpage, > struct page *newpage); > > @@ -337,6 +339,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) > { > } > > +static inline void > +dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > +{ > +} > + > static inline void mem_cgroup_begin_update_page_stat(struct page *page, > bool *locked, unsigned long *flags) > { > diff --git a/include/linux/oom.h b/include/linux/oom.h > index 20b5c46..9ba3344 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -57,6 +57,8 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, > unsigned long totalpages, const nodemask_t *nodemask, > bool force_kill); > > +extern inline void dump_per_task(struct task_struct *p, > + const nodemask_t *nodemask); This is a global symbol, so dump_per_task() doesn't make a lot of sense: it would need to be prefixed with "oom_" so perhaps oom_dump_task() is better? > extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > int order, nodemask_t *mask, bool force_kill); > extern int register_oom_notifier(struct notifier_block *nb); > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 2df5e72..fe648f8 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1665,6 +1665,20 @@ static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) > return min(limit, memsw); > } > > +void dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > +{ > + struct cgroup_iter it; > + struct task_struct *task; > + struct cgroup *cgroup = memcg->css.cgroup; > + > + cgroup_iter_start(cgroup, &it); > + while ((task = cgroup_iter_next(cgroup, &it))) { > + dump_per_task(task, nodemask); > + } > + > + cgroup_iter_end(cgroup, &it); > +} > + > static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > int order) > { > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 4b8a6dd..aaf6237 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -367,6 +367,32 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > return chosen; > } > > +inline void dump_per_task(struct task_struct *p, const nodemask_t *nodemask) No inline. > +{ > + struct task_struct *task; > + > + if (oom_unkillable_task(p, NULL, nodemask)) > + return; > + > + task = find_lock_task_mm(p); > + if (!task) { > + /* > + * This is a kthread or all of p's threads have already > + * detached their mm's. There's no need to report > + * them; they can't be oom killed anyway. > + */ > + return; > + } > + > + pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", > + task->pid, from_kuid(&init_user_ns, task_uid(task)), > + task->tgid, task->mm->total_vm, get_mm_rss(task->mm), > + task->mm->nr_ptes, > + get_mm_counter(task->mm, MM_SWAPENTS), > + task->signal->oom_score_adj, task->comm); > + task_unlock(task); > +} > + > /** > * dump_tasks - dump current memory state of all system tasks > * @memcg: current's memory controller, if constrained > @@ -381,32 +407,17 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > { > struct task_struct *p; > - struct task_struct *task; > > pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n"); > - rcu_read_lock(); > - for_each_process(p) { > - if (oom_unkillable_task(p, memcg, nodemask)) > - continue; > - > - task = find_lock_task_mm(p); > - if (!task) { > - /* > - * This is a kthread or all of p's threads have already > - * detached their mm's. There's no need to report > - * them; they can't be oom killed anyway. > - */ > - continue; > - } > > - pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", > - task->pid, from_kuid(&init_user_ns, task_uid(task)), > - task->tgid, task->mm->total_vm, get_mm_rss(task->mm), > - task->mm->nr_ptes, > - get_mm_counter(task->mm, MM_SWAPENTS), > - task->signal->oom_score_adj, task->comm); > - task_unlock(task); > + if (memcg) { > + dump_tasks_memcg(memcg, nodemask); > + return; > } > + > + rcu_read_lock(); > + for_each_process(p) > + dump_per_task(p, nodemask); > rcu_read_unlock(); > } > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx112.postini.com [74.125.245.112]) by kanga.kvack.org (Postfix) with SMTP id AA3546B0044 for ; Wed, 7 Nov 2012 17:17:12 -0500 (EST) Received: by mail-ee0-f41.google.com with SMTP id c4so1490779eek.14 for ; Wed, 07 Nov 2012 14:17:11 -0800 (PST) Date: Wed, 7 Nov 2012 23:17:09 +0100 From: Michal Hocko Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Message-ID: <20121107221709.GB26382@dhcp22.suse.cz> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1352277696-21724-1-git-send-email-handai.szj@taobao.com> Sender: owner-linux-mm@kvack.org List-ID: To: Sha Zhengju Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, rientjes@google.com, linux-kernel@vger.kernel.org, Sha Zhengju On Wed 07-11-12 16:41:36, Sha Zhengju wrote: > From: Sha Zhengju > > Current, when a memcg oom is happening the oom dump messages is still global > state and provides few useful info for users. This patch prints more pointed > memcg page statistics for memcg-oom. > > Signed-off-by: Sha Zhengju > Cc: Michal Hocko > Cc: KAMEZAWA Hiroyuki > Cc: David Rientjes > Cc: Andrew Morton > --- > mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++------- > mm/oom_kill.c | 6 +++- > 2 files changed, 66 insertions(+), 11 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 0eab7d5..2df5e72 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c [...] > @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, > spin_unlock_irqrestore(&memcg->move_lock, *flags); > } > > +#define K(x) ((x) << (PAGE_SHIFT-10)) > +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *mi; > + unsigned int i; > + > + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) { > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], > + K(mem_cgroup_read_stat(memcg, i))); > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) > + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], > + mem_cgroup_read_events(memcg, i)); > + > + for (i = 0; i < NR_LRU_LISTS; i++) > + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], > + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); > + } else { > + > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + long long val = 0; > + > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_stat(mi, i); > + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_events(mi, i); > + printk(KERN_CONT "%s:%llu ", > + mem_cgroup_events_names[i], val); > + } > + > + for (i = 0; i < NR_LRU_LISTS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); > + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); > + } > + } This is just plain ugly. for_each_mem_cgroup_tree is use_hierarchy aware and there is no need for if (use_hierarchy) part. memcg != root_mem_cgroup test doesn't make much sense as well because we call that a global oom killer ;) -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx173.postini.com [74.125.245.173]) by kanga.kvack.org (Postfix) with SMTP id C19046B0044 for ; Thu, 8 Nov 2012 07:37:43 -0500 (EST) Received: by mail-da0-f41.google.com with SMTP id i14so1279943dad.14 for ; Thu, 08 Nov 2012 04:37:43 -0800 (PST) Message-ID: <509BA799.505@gmail.com> Date: Thu, 08 Nov 2012 20:37:45 +0800 From: Sha Zhengju MIME-Version: 1.0 Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Sha Zhengju , linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org On 11/08/2012 02:02 AM, David Rientjes wrote: > On Wed, 7 Nov 2012, Sha Zhengju wrote: > >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 0eab7d5..2df5e72 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = { >> "pgmajfault", >> }; >> >> +static const char * const mem_cgroup_lru_names[] = { >> + "inactive_anon", >> + "active_anon", >> + "inactive_file", >> + "active_file", >> + "unevictable", >> +}; >> + >> /* >> * Per memcg event counter is incremented at every pagein/pageout. With THP, >> * it will be incremated by the number of pages. This counter is used for >> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, >> spin_unlock_irqrestore(&memcg->move_lock, *flags); >> } >> >> +#define K(x) ((x)<< (PAGE_SHIFT-10)) >> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) >> +{ >> + struct mem_cgroup *mi; >> + unsigned int i; >> + >> + if (!memcg->use_hierarchy&& memcg != root_mem_cgroup) { >> + for (i = 0; i< MEM_CGROUP_STAT_NSTATS; i++) { >> + if (i == MEM_CGROUP_STAT_SWAP&& !do_swap_account) >> + continue; >> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], > This printk isn't continuing any previous printk, so using KERN_CONT here > will require a short header to be printed first ("Memcg: "?) with > KERN_INFO before the iterations. > Yep...I think I lost it while rebasing... sorry for the stupid mistake. >> + K(mem_cgroup_read_stat(memcg, i))); >> + } >> + >> + for (i = 0; i< MEM_CGROUP_EVENTS_NSTATS; i++) >> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], >> + mem_cgroup_read_events(memcg, i)); >> + >> + for (i = 0; i< NR_LRU_LISTS; i++) >> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], >> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); >> + } else { >> + > Spurious newline. > > Eek, is there really no way to avoid this if-conditional and just use > for_each_mem_cgroup_tree() for everything and use > > mem_cgroup_iter_break(memcg, iter); > break; > > for !memcg->use_hierarchy? > Now I'm shamed at my bad brain of yesterday by sending this chunk out... Yes, the if-part code above is obviously unwanted, and the for_each_mem_cgroup_tree can handle hierarchy already. >> + for (i = 0; i< MEM_CGROUP_STAT_NSTATS; i++) { >> + long long val = 0; >> + >> + if (i == MEM_CGROUP_STAT_SWAP&& !do_swap_account) >> + continue; >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_read_stat(mi, i); >> + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); >> + } >> + >> + for (i = 0; i< MEM_CGROUP_EVENTS_NSTATS; i++) { >> + unsigned long long val = 0; >> + >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_read_events(mi, i); >> + printk(KERN_CONT "%s:%llu ", >> + mem_cgroup_events_names[i], val); >> + } >> + >> + for (i = 0; i< NR_LRU_LISTS; i++) { >> + unsigned long long val = 0; >> + >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); >> + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); >> + } >> + } >> + printk(KERN_CONT "\n"); >> +} >> /** >> - * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode. >> * @memcg: The memory cgroup that went over limit >> * @p: Task that is going to be killed >> * >> @@ -1569,6 +1628,8 @@ done: >> res_counter_read_u64(&memcg->kmem, RES_USAGE)>> 10, >> res_counter_read_u64(&memcg->kmem, RES_LIMIT)>> 10, >> res_counter_read_u64(&memcg->kmem, RES_FAILCNT)); >> + >> + mem_cgroup_print_oom_stat(memcg); > I think this should be folded into mem_cgroup_print_oom_info(), I don't > see a need for a new function. > >> } >> >> /* >> @@ -5195,14 +5256,6 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, >> } >> #endif /* CONFIG_NUMA */ >> >> -static const char * const mem_cgroup_lru_names[] = { >> - "inactive_anon", >> - "active_anon", >> - "inactive_file", >> - "active_file", >> - "unevictable", >> -}; >> - >> static inline void mem_cgroup_lru_names_not_uptodate(void) >> { >> BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS); >> diff --git a/mm/oom_kill.c b/mm/oom_kill.c >> index 7e9e911..4b8a6dd 100644 >> --- a/mm/oom_kill.c >> +++ b/mm/oom_kill.c >> @@ -421,8 +421,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, >> cpuset_print_task_mems_allowed(current); >> task_unlock(current); >> dump_stack(); >> - mem_cgroup_print_oom_info(memcg, p); >> - show_mem(SHOW_MEM_FILTER_NODES); >> + if (memcg) >> + mem_cgroup_print_oom_info(memcg, p); > mem_cgroup_print_oom_info() already returns immediately for !memcg, so I'm > not sure why this change is made. > Here the if-else checking is aiming at printing distinct messages for memcg & non-memcg. IMHO, global state has little actual use for memcg-oom and why not we wipe off iti 1/4 ? Though mem_cgroup_print_oom_info already checking for !memcg, the if-statement can avoid one function call and save the deep-enough oom call stack a little. >> + else >> + show_mem(SHOW_MEM_FILTER_NODES); > Well that's disappointing if memcg == root_mem_cgroup, we'd probably like > to know the global memory state to determine what the problem is. > I really wondering if there is any case that can pass root_mem_cgroup down here. It's called by global or memcg oom killer and the global oom will set memcg=NULL directly instead of root_mem_cgroup. Besides, root memcg will not go through charging and there is no chance to call mem_cgroup_out_of_memory for root cgroup tasks. Thanks, Sha >> if (sysctl_oom_dump_tasks) >> dump_tasks(memcg, nodemask); >> } > . > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx195.postini.com [74.125.245.195]) by kanga.kvack.org (Postfix) with SMTP id C4D656B0044 for ; Thu, 8 Nov 2012 07:44:29 -0500 (EST) Date: Thu, 8 Nov 2012 13:44:26 +0100 From: Michal Hocko Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Message-ID: <20121108124426.GF31821@dhcp22.suse.cz> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> <509BA799.505@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <509BA799.505@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Sha Zhengju Cc: David Rientjes , linux-mm@kvack.org, cgroups@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org On Thu 08-11-12 20:37:45, Sha Zhengju wrote: > On 11/08/2012 02:02 AM, David Rientjes wrote: > >On Wed, 7 Nov 2012, Sha Zhengju wrote: [..] > >>+ else > >>+ show_mem(SHOW_MEM_FILTER_NODES); > >Well that's disappointing if memcg == root_mem_cgroup, we'd probably like > >to know the global memory state to determine what the problem is. > > > > I really wondering if there is any case that can pass > root_mem_cgroup down here. No it cannot because the root cgroup doesn't have any limit so we cannot trigger memcg oom killer. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx196.postini.com [74.125.245.196]) by kanga.kvack.org (Postfix) with SMTP id 4F1076B004D for ; Thu, 8 Nov 2012 07:45:27 -0500 (EST) Received: by mail-pa0-f41.google.com with SMTP id fa10so2155992pad.14 for ; Thu, 08 Nov 2012 04:45:26 -0800 (PST) Message-ID: <509BA96A.2070701@gmail.com> Date: Thu, 08 Nov 2012 20:45:30 +0800 From: Sha Zhengju MIME-Version: 1.0 Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> <20121107221709.GB26382@dhcp22.suse.cz> In-Reply-To: <20121107221709.GB26382@dhcp22.suse.cz> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, rientjes@google.com, linux-kernel@vger.kernel.org, Sha Zhengju On 11/08/2012 06:17 AM, Michal Hocko wrote: > On Wed 07-11-12 16:41:36, Sha Zhengju wrote: >> From: Sha Zhengju >> >> Current, when a memcg oom is happening the oom dump messages is still global >> state and provides few useful info for users. This patch prints more pointed >> memcg page statistics for memcg-oom. >> >> Signed-off-by: Sha Zhengju >> Cc: Michal Hocko >> Cc: KAMEZAWA Hiroyuki >> Cc: David Rientjes >> Cc: Andrew Morton >> --- >> mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++------- >> mm/oom_kill.c | 6 +++- >> 2 files changed, 66 insertions(+), 11 deletions(-) >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 0eab7d5..2df5e72 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c > [...] >> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, >> spin_unlock_irqrestore(&memcg->move_lock, *flags); >> } >> >> +#define K(x) ((x)<< (PAGE_SHIFT-10)) >> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) >> +{ >> + struct mem_cgroup *mi; >> + unsigned int i; >> + >> + if (!memcg->use_hierarchy&& memcg != root_mem_cgroup) { >> + for (i = 0; i< MEM_CGROUP_STAT_NSTATS; i++) { >> + if (i == MEM_CGROUP_STAT_SWAP&& !do_swap_account) >> + continue; >> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], >> + K(mem_cgroup_read_stat(memcg, i))); >> + } >> + >> + for (i = 0; i< MEM_CGROUP_EVENTS_NSTATS; i++) >> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], >> + mem_cgroup_read_events(memcg, i)); >> + >> + for (i = 0; i< NR_LRU_LISTS; i++) >> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], >> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); >> + } else { >> + >> + for (i = 0; i< MEM_CGROUP_STAT_NSTATS; i++) { >> + long long val = 0; >> + >> + if (i == MEM_CGROUP_STAT_SWAP&& !do_swap_account) >> + continue; >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_read_stat(mi, i); >> + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); >> + } >> + >> + for (i = 0; i< MEM_CGROUP_EVENTS_NSTATS; i++) { >> + unsigned long long val = 0; >> + >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_read_events(mi, i); >> + printk(KERN_CONT "%s:%llu ", >> + mem_cgroup_events_names[i], val); >> + } >> + >> + for (i = 0; i< NR_LRU_LISTS; i++) { >> + unsigned long long val = 0; >> + >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); >> + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); >> + } >> + } > This is just plain ugly. for_each_mem_cgroup_tree is use_hierarchy aware > and there is no need for if (use_hierarchy) part. > memcg != root_mem_cgroup test doesn't make much sense as well because we > call that a global oom killer ;) Yes... bitterly did I repent the patch... The else-part of for_each_mem_cgroup_tree is enough for hierarchy. I'll send a update one later. Sorry for the noise. : ( Thanks, Sha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx201.postini.com [74.125.245.201]) by kanga.kvack.org (Postfix) with SMTP id C60446B0044 for ; Thu, 8 Nov 2012 08:11:57 -0500 (EST) Received: by mail-da0-f41.google.com with SMTP id i14so1292565dad.14 for ; Thu, 08 Nov 2012 05:11:57 -0800 (PST) Message-ID: <509BAFA0.4010604@gmail.com> Date: Thu, 08 Nov 2012 21:12:00 +0800 From: Sha Zhengju MIME-Version: 1.0 Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> <509B7658.1020807@jp.fujitsu.com> In-Reply-To: <509B7658.1020807@jp.fujitsu.com> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, akpm@linux-foundation.org, rientjes@google.com, linux-kernel@vger.kernel.org, Sha Zhengju On 11/08/2012 05:07 PM, Kamezawa Hiroyuki wrote: > (2012/11/07 17:41), Sha Zhengju wrote: >> From: Sha Zhengju >> >> Current, when a memcg oom is happening the oom dump messages is still global >> state and provides few useful info for users. This patch prints more pointed >> memcg page statistics for memcg-oom. >> >> Signed-off-by: Sha Zhengju >> Cc: Michal Hocko >> Cc: KAMEZAWA Hiroyuki >> Cc: David Rientjes >> Cc: Andrew Morton >> --- >> mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++------- >> mm/oom_kill.c | 6 +++- >> 2 files changed, 66 insertions(+), 11 deletions(-) >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 0eab7d5..2df5e72 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = { >> "pgmajfault", >> }; >> >> +static const char * const mem_cgroup_lru_names[] = { >> + "inactive_anon", >> + "active_anon", >> + "inactive_file", >> + "active_file", >> + "unevictable", >> +}; >> + > Is this for the same strings with show_free_areas() ? > I just move the declaration here from the bottom of source file to make the following use error-free. >> /* >> * Per memcg event counter is incremented at every pagein/pageout. With THP, >> * it will be incremated by the number of pages. This counter is used for >> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, >> spin_unlock_irqrestore(&memcg->move_lock, *flags); >> } >> >> +#define K(x) ((x) << (PAGE_SHIFT-10)) >> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) >> +{ >> + struct mem_cgroup *mi; >> + unsigned int i; >> + >> + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) { > Why do you need to have this condition check ? > Yes, the check is unnecessary... I'll remove it next version. >> + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { >> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) >> + continue; >> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], >> + K(mem_cgroup_read_stat(memcg, i))); > Hm, how about using the same style with show_free_areas() ? > I'm also trying do so. show_free_areas() prints the memory related info in two style: one is in page unit and the oher is in KB (I've no idea why we distinct them), but I think the KB format is more readable. >> + } >> + >> + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) >> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], >> + mem_cgroup_read_events(memcg, i)); >> + > I don't think EVENTS info is useful for oom. > It seems you're right. : ) >> + for (i = 0; i < NR_LRU_LISTS; i++) >> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], >> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); > How far does your new information has different format than usual oom ? > Could you show a sample and difference in changelog ? > > Of course, I prefer both of them has similar format. > > > The new memcg-oom info excludes global state out and prints the memcg statistics instead which seems more brevity. I'll add a sample next time. Thanks for reminding me! Thanks, Sha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx125.postini.com [74.125.245.125]) by kanga.kvack.org (Postfix) with SMTP id 5A6E76B0044 for ; Thu, 8 Nov 2012 08:58:49 -0500 (EST) Received: by mail-pa0-f41.google.com with SMTP id fa10so2201638pad.14 for ; Thu, 08 Nov 2012 05:58:48 -0800 (PST) Message-ID: <509BBA9C.7050007@gmail.com> Date: Thu, 08 Nov 2012 21:58:52 +0800 From: Sha Zhengju MIME-Version: 1.0 Subject: Re: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277719-21760-1-git-send-email-handai.szj@taobao.com> <20121107223437.GC26382@dhcp22.suse.cz> In-Reply-To: <20121107223437.GC26382@dhcp22.suse.cz> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, rientjes@google.com, linux-kernel@vger.kernel.org, Sha Zhengju On 11/08/2012 06:34 AM, Michal Hocko wrote: > On Wed 07-11-12 16:41:59, Sha Zhengju wrote: >> From: Sha Zhengju >> >> If memcg oom happening, don't scan all system tasks to dump memory state of >> eligible tasks, instead we iterates only over the process attached to the oom >> memcg and avoid the rcu lock. > you have replaced rcu lock by css_set_lock which is, well, heavier than > rcu. Besides that the patch is not correct because you have excluded > all tasks that are from subgroups because you iterate only through the > top level one. > I am not sure the whole optimization would be a win even if implemented > correctly. Well, we scan through more tasks currently and most of them > are not relevant but then you would need to exclude task_in_mem_cgroup > from oom_unkillable_task and that would be more code churn than the > win. Thanks for your and David's advice. This piece is trying to save some expense while dumping memcg tasks, but failed to scanning subgroups by iterating the cgroup. I'm agreed with your cost&win opinion, so I decide to give up this one. : ) Thanks, Sha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752816Ab2KGIkX (ORCPT ); Wed, 7 Nov 2012 03:40:23 -0500 Received: from mail-da0-f46.google.com ([209.85.210.46]:65471 "EHLO mail-da0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750973Ab2KGIkV (ORCPT ); Wed, 7 Nov 2012 03:40:21 -0500 From: Sha Zhengju To: linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, rientjes@google.com Cc: linux-kernel@vger.kernel.org, Sha Zhengju Subject: [PATCH V2 0/2] Provide more precise dump info for memcg-oom Date: Wed, 7 Nov 2012 16:40:02 +0800 Message-Id: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> X-Mailer: git-send-email 1.7.4.1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Sha Zhengju When memcg oom is happening the current memcg related dump information is limited for debugging. The patches provide more detailed memcg page statistics and also take hierarchy into consideration. The previous primitive version can be reached here: https://lkml.org/lkml/2012/7/30/179. Change log: 1. some modification towards hierarchy 2. rework dump_tasks 3. rebased on Michal's mm tree since-3.6 Any comments are welcomed. : ) Sha Zhengju (2): memcg-oom-provide-more-precise-dump-info-while-memcg.patch oom-rework-dump_tasks-to-optimize-memcg-oom-situatio.patch include/linux/memcontrol.h | 7 ++++ include/linux/oom.h | 2 + mm/memcontrol.c | 85 +++++++++++++++++++++++++++++++++++++++----- mm/oom_kill.c | 61 +++++++++++++++++++------------ 4 files changed, 122 insertions(+), 33 deletions(-) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752842Ab2KGIln (ORCPT ); Wed, 7 Nov 2012 03:41:43 -0500 Received: from mail-pb0-f46.google.com ([209.85.160.46]:45791 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750996Ab2KGIll (ORCPT ); Wed, 7 Nov 2012 03:41:41 -0500 From: Sha Zhengju To: linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, rientjes@google.com Cc: linux-kernel@vger.kernel.org, Sha Zhengju Subject: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Date: Wed, 7 Nov 2012 16:41:36 +0800 Message-Id: <1352277696-21724-1-git-send-email-handai.szj@taobao.com> X-Mailer: git-send-email 1.7.4.1 In-Reply-To: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Sha Zhengju Current, when a memcg oom is happening the oom dump messages is still global state and provides few useful info for users. This patch prints more pointed memcg page statistics for memcg-oom. Signed-off-by: Sha Zhengju Cc: Michal Hocko Cc: KAMEZAWA Hiroyuki Cc: David Rientjes Cc: Andrew Morton --- mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++------- mm/oom_kill.c | 6 +++- 2 files changed, 66 insertions(+), 11 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0eab7d5..2df5e72 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = { "pgmajfault", }; +static const char * const mem_cgroup_lru_names[] = { + "inactive_anon", + "active_anon", + "inactive_file", + "active_file", + "unevictable", +}; + /* * Per memcg event counter is incremented at every pagein/pageout. With THP, * it will be incremated by the number of pages. This counter is used for @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, spin_unlock_irqrestore(&memcg->move_lock, *flags); } +#define K(x) ((x) << (PAGE_SHIFT-10)) +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) +{ + struct mem_cgroup *mi; + unsigned int i; + + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) { + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) + continue; + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], + K(mem_cgroup_read_stat(memcg, i))); + } + + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], + mem_cgroup_read_events(memcg, i)); + + for (i = 0; i < NR_LRU_LISTS; i++) + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); + } else { + + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { + long long val = 0; + + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) + continue; + for_each_mem_cgroup_tree(mi, memcg) + val += mem_cgroup_read_stat(mi, i); + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); + } + + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) { + unsigned long long val = 0; + + for_each_mem_cgroup_tree(mi, memcg) + val += mem_cgroup_read_events(mi, i); + printk(KERN_CONT "%s:%llu ", + mem_cgroup_events_names[i], val); + } + + for (i = 0; i < NR_LRU_LISTS; i++) { + unsigned long long val = 0; + + for_each_mem_cgroup_tree(mi, memcg) + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); + } + } + printk(KERN_CONT "\n"); +} /** - * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode. * @memcg: The memory cgroup that went over limit * @p: Task that is going to be killed * @@ -1569,6 +1628,8 @@ done: res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10, res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10, res_counter_read_u64(&memcg->kmem, RES_FAILCNT)); + + mem_cgroup_print_oom_stat(memcg); } /* @@ -5195,14 +5256,6 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, } #endif /* CONFIG_NUMA */ -static const char * const mem_cgroup_lru_names[] = { - "inactive_anon", - "active_anon", - "inactive_file", - "active_file", - "unevictable", -}; - static inline void mem_cgroup_lru_names_not_uptodate(void) { BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 7e9e911..4b8a6dd 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -421,8 +421,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, cpuset_print_task_mems_allowed(current); task_unlock(current); dump_stack(); - mem_cgroup_print_oom_info(memcg, p); - show_mem(SHOW_MEM_FILTER_NODES); + if (memcg) + mem_cgroup_print_oom_info(memcg, p); + else + show_mem(SHOW_MEM_FILTER_NODES); if (sysctl_oom_dump_tasks) dump_tasks(memcg, nodemask); } -- 1.7.6.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753262Ab2KGImD (ORCPT ); Wed, 7 Nov 2012 03:42:03 -0500 Received: from mail-pa0-f46.google.com ([209.85.220.46]:50352 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751530Ab2KGImA (ORCPT ); Wed, 7 Nov 2012 03:42:00 -0500 From: Sha Zhengju To: linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, rientjes@google.com Cc: linux-kernel@vger.kernel.org, Sha Zhengju Subject: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation Date: Wed, 7 Nov 2012 16:41:59 +0800 Message-Id: <1352277719-21760-1-git-send-email-handai.szj@taobao.com> X-Mailer: git-send-email 1.7.4.1 In-Reply-To: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Sha Zhengju If memcg oom happening, don't scan all system tasks to dump memory state of eligible tasks, instead we iterates only over the process attached to the oom memcg and avoid the rcu lock. Signed-off-by: Sha Zhengju Cc: Michal Hocko Cc: KAMEZAWA Hiroyuki Cc: David Rientjes Cc: Andrew Morton --- include/linux/memcontrol.h | 7 +++++ include/linux/oom.h | 2 + mm/memcontrol.c | 14 +++++++++++ mm/oom_kill.c | 55 ++++++++++++++++++++++++++----------------- 4 files changed, 56 insertions(+), 22 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index c91e3c1..4322ca8 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -122,6 +122,8 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list); void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +extern void dump_tasks_memcg(const struct mem_cgroup *memcg, + const nodemask_t *nodemask); extern void mem_cgroup_replace_page_cache(struct page *oldpage, struct page *newpage); @@ -337,6 +339,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } +static inline void +dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) +{ +} + static inline void mem_cgroup_begin_update_page_stat(struct page *page, bool *locked, unsigned long *flags) { diff --git a/include/linux/oom.h b/include/linux/oom.h index 20b5c46..9ba3344 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -57,6 +57,8 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, unsigned long totalpages, const nodemask_t *nodemask, bool force_kill); +extern inline void dump_per_task(struct task_struct *p, + const nodemask_t *nodemask); extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *mask, bool force_kill); extern int register_oom_notifier(struct notifier_block *nb); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2df5e72..fe648f8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1665,6 +1665,20 @@ static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) return min(limit, memsw); } +void dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) +{ + struct cgroup_iter it; + struct task_struct *task; + struct cgroup *cgroup = memcg->css.cgroup; + + cgroup_iter_start(cgroup, &it); + while ((task = cgroup_iter_next(cgroup, &it))) { + dump_per_task(task, nodemask); + } + + cgroup_iter_end(cgroup, &it); +} + static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, int order) { diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 4b8a6dd..aaf6237 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -367,6 +367,32 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, return chosen; } +inline void dump_per_task(struct task_struct *p, const nodemask_t *nodemask) +{ + struct task_struct *task; + + if (oom_unkillable_task(p, NULL, nodemask)) + return; + + task = find_lock_task_mm(p); + if (!task) { + /* + * This is a kthread or all of p's threads have already + * detached their mm's. There's no need to report + * them; they can't be oom killed anyway. + */ + return; + } + + pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", + task->pid, from_kuid(&init_user_ns, task_uid(task)), + task->tgid, task->mm->total_vm, get_mm_rss(task->mm), + task->mm->nr_ptes, + get_mm_counter(task->mm, MM_SWAPENTS), + task->signal->oom_score_adj, task->comm); + task_unlock(task); +} + /** * dump_tasks - dump current memory state of all system tasks * @memcg: current's memory controller, if constrained @@ -381,32 +407,17 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask) { struct task_struct *p; - struct task_struct *task; pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n"); - rcu_read_lock(); - for_each_process(p) { - if (oom_unkillable_task(p, memcg, nodemask)) - continue; - - task = find_lock_task_mm(p); - if (!task) { - /* - * This is a kthread or all of p's threads have already - * detached their mm's. There's no need to report - * them; they can't be oom killed anyway. - */ - continue; - } - pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", - task->pid, from_kuid(&init_user_ns, task_uid(task)), - task->tgid, task->mm->total_vm, get_mm_rss(task->mm), - task->mm->nr_ptes, - get_mm_counter(task->mm, MM_SWAPENTS), - task->signal->oom_score_adj, task->comm); - task_unlock(task); + if (memcg) { + dump_tasks_memcg(memcg, nodemask); + return; } + + rcu_read_lock(); + for_each_process(p) + dump_per_task(p, nodemask); rcu_read_unlock(); } -- 1.7.6.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753079Ab2KGSCO (ORCPT ); Wed, 7 Nov 2012 13:02:14 -0500 Received: from mail-da0-f46.google.com ([209.85.210.46]:55918 "EHLO mail-da0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751346Ab2KGSCM (ORCPT ); Wed, 7 Nov 2012 13:02:12 -0500 Date: Wed, 7 Nov 2012 10:02:09 -0800 (PST) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Sha Zhengju cc: linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, Sha Zhengju Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening In-Reply-To: <1352277696-21724-1-git-send-email-handai.szj@taobao.com> Message-ID: References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 7 Nov 2012, Sha Zhengju wrote: > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 0eab7d5..2df5e72 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = { > "pgmajfault", > }; > > +static const char * const mem_cgroup_lru_names[] = { > + "inactive_anon", > + "active_anon", > + "inactive_file", > + "active_file", > + "unevictable", > +}; > + > /* > * Per memcg event counter is incremented at every pagein/pageout. With THP, > * it will be incremated by the number of pages. This counter is used for > @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, > spin_unlock_irqrestore(&memcg->move_lock, *flags); > } > > +#define K(x) ((x) << (PAGE_SHIFT-10)) > +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *mi; > + unsigned int i; > + > + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) { > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], This printk isn't continuing any previous printk, so using KERN_CONT here will require a short header to be printed first ("Memcg: "?) with KERN_INFO before the iterations. > + K(mem_cgroup_read_stat(memcg, i))); > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) > + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], > + mem_cgroup_read_events(memcg, i)); > + > + for (i = 0; i < NR_LRU_LISTS; i++) > + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], > + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); > + } else { > + Spurious newline. Eek, is there really no way to avoid this if-conditional and just use for_each_mem_cgroup_tree() for everything and use mem_cgroup_iter_break(memcg, iter); break; for !memcg->use_hierarchy? > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + long long val = 0; > + > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_stat(mi, i); > + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_events(mi, i); > + printk(KERN_CONT "%s:%llu ", > + mem_cgroup_events_names[i], val); > + } > + > + for (i = 0; i < NR_LRU_LISTS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); > + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); > + } > + } > + printk(KERN_CONT "\n"); > +} > /** > - * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode. > * @memcg: The memory cgroup that went over limit > * @p: Task that is going to be killed > * > @@ -1569,6 +1628,8 @@ done: > res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10, > res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10, > res_counter_read_u64(&memcg->kmem, RES_FAILCNT)); > + > + mem_cgroup_print_oom_stat(memcg); I think this should be folded into mem_cgroup_print_oom_info(), I don't see a need for a new function. > } > > /* > @@ -5195,14 +5256,6 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, > } > #endif /* CONFIG_NUMA */ > > -static const char * const mem_cgroup_lru_names[] = { > - "inactive_anon", > - "active_anon", > - "inactive_file", > - "active_file", > - "unevictable", > -}; > - > static inline void mem_cgroup_lru_names_not_uptodate(void) > { > BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS); > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 7e9e911..4b8a6dd 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -421,8 +421,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, > cpuset_print_task_mems_allowed(current); > task_unlock(current); > dump_stack(); > - mem_cgroup_print_oom_info(memcg, p); > - show_mem(SHOW_MEM_FILTER_NODES); > + if (memcg) > + mem_cgroup_print_oom_info(memcg, p); mem_cgroup_print_oom_info() already returns immediately for !memcg, so I'm not sure why this change is made. > + else > + show_mem(SHOW_MEM_FILTER_NODES); Well that's disappointing if memcg == root_mem_cgroup, we'd probably like to know the global memory state to determine what the problem is. > if (sysctl_oom_dump_tasks) > dump_tasks(memcg, nodemask); > } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752539Ab2KGSHT (ORCPT ); Wed, 7 Nov 2012 13:07:19 -0500 Received: from mail-pa0-f46.google.com ([209.85.220.46]:33582 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751747Ab2KGSHQ (ORCPT ); Wed, 7 Nov 2012 13:07:16 -0500 Date: Wed, 7 Nov 2012 10:07:14 -0800 (PST) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Sha Zhengju cc: linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, Sha Zhengju Subject: Re: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation In-Reply-To: <1352277719-21760-1-git-send-email-handai.szj@taobao.com> Message-ID: References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277719-21760-1-git-send-email-handai.szj@taobao.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 7 Nov 2012, Sha Zhengju wrote: > From: Sha Zhengju > > If memcg oom happening, don't scan all system tasks to dump memory state of > eligible tasks, instead we iterates only over the process attached to the oom > memcg and avoid the rcu lock. > Avoiding the rcu lock isn't actually that impressive here, the cgroup iterator will use it's own lock for that memcg. > Signed-off-by: Sha Zhengju > Cc: Michal Hocko > Cc: KAMEZAWA Hiroyuki > Cc: David Rientjes > Cc: Andrew Morton > --- > include/linux/memcontrol.h | 7 +++++ > include/linux/oom.h | 2 + > mm/memcontrol.c | 14 +++++++++++ > mm/oom_kill.c | 55 ++++++++++++++++++++++++++----------------- > 4 files changed, 56 insertions(+), 22 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index c91e3c1..4322ca8 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -122,6 +122,8 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list); > void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int); > extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, > struct task_struct *p); > +extern void dump_tasks_memcg(const struct mem_cgroup *memcg, > + const nodemask_t *nodemask); Shouldn't need the nodemask parameter, just have dump_tasks_memcg() pass NULL to dump_per_task(), we won't be isolating to tasks with mempolicies restricted to a particular set of nodes since we're in the memcg oom path here, not the global page allocator oom path. > extern void mem_cgroup_replace_page_cache(struct page *oldpage, > struct page *newpage); > > @@ -337,6 +339,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) > { > } > > +static inline void > +dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > +{ > +} > + > static inline void mem_cgroup_begin_update_page_stat(struct page *page, > bool *locked, unsigned long *flags) > { > diff --git a/include/linux/oom.h b/include/linux/oom.h > index 20b5c46..9ba3344 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -57,6 +57,8 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, > unsigned long totalpages, const nodemask_t *nodemask, > bool force_kill); > > +extern inline void dump_per_task(struct task_struct *p, > + const nodemask_t *nodemask); This is a global symbol, so dump_per_task() doesn't make a lot of sense: it would need to be prefixed with "oom_" so perhaps oom_dump_task() is better? > extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > int order, nodemask_t *mask, bool force_kill); > extern int register_oom_notifier(struct notifier_block *nb); > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 2df5e72..fe648f8 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1665,6 +1665,20 @@ static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) > return min(limit, memsw); > } > > +void dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > +{ > + struct cgroup_iter it; > + struct task_struct *task; > + struct cgroup *cgroup = memcg->css.cgroup; > + > + cgroup_iter_start(cgroup, &it); > + while ((task = cgroup_iter_next(cgroup, &it))) { > + dump_per_task(task, nodemask); > + } > + > + cgroup_iter_end(cgroup, &it); > +} > + > static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > int order) > { > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 4b8a6dd..aaf6237 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -367,6 +367,32 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > return chosen; > } > > +inline void dump_per_task(struct task_struct *p, const nodemask_t *nodemask) No inline. > +{ > + struct task_struct *task; > + > + if (oom_unkillable_task(p, NULL, nodemask)) > + return; > + > + task = find_lock_task_mm(p); > + if (!task) { > + /* > + * This is a kthread or all of p's threads have already > + * detached their mm's. There's no need to report > + * them; they can't be oom killed anyway. > + */ > + return; > + } > + > + pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", > + task->pid, from_kuid(&init_user_ns, task_uid(task)), > + task->tgid, task->mm->total_vm, get_mm_rss(task->mm), > + task->mm->nr_ptes, > + get_mm_counter(task->mm, MM_SWAPENTS), > + task->signal->oom_score_adj, task->comm); > + task_unlock(task); > +} > + > /** > * dump_tasks - dump current memory state of all system tasks > * @memcg: current's memory controller, if constrained > @@ -381,32 +407,17 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > { > struct task_struct *p; > - struct task_struct *task; > > pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n"); > - rcu_read_lock(); > - for_each_process(p) { > - if (oom_unkillable_task(p, memcg, nodemask)) > - continue; > - > - task = find_lock_task_mm(p); > - if (!task) { > - /* > - * This is a kthread or all of p's threads have already > - * detached their mm's. There's no need to report > - * them; they can't be oom killed anyway. > - */ > - continue; > - } > > - pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", > - task->pid, from_kuid(&init_user_ns, task_uid(task)), > - task->tgid, task->mm->total_vm, get_mm_rss(task->mm), > - task->mm->nr_ptes, > - get_mm_counter(task->mm, MM_SWAPENTS), > - task->signal->oom_score_adj, task->comm); > - task_unlock(task); > + if (memcg) { > + dump_tasks_memcg(memcg, nodemask); > + return; > } > + > + rcu_read_lock(); > + for_each_process(p) > + dump_per_task(p, nodemask); > rcu_read_unlock(); > } > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751701Ab2KGTbK (ORCPT ); Wed, 7 Nov 2012 14:31:10 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:53037 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751081Ab2KGTbI (ORCPT ); Wed, 7 Nov 2012 14:31:08 -0500 Date: Wed, 7 Nov 2012 11:31:07 -0800 From: Andrew Morton To: Sha Zhengju Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, kamezawa.hiroyu@jp.fujitsu.com, rientjes@google.com, linux-kernel@vger.kernel.org, Sha Zhengju Subject: Re: [PATCH V2 0/2] Provide more precise dump info for memcg-oom Message-Id: <20121107113107.d91eba43.akpm@linux-foundation.org> In-Reply-To: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> X-Mailer: Sylpheed 3.0.2 (GTK+ 2.20.1; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 7 Nov 2012 16:40:02 +0800 Sha Zhengju wrote: > When memcg oom is happening the current memcg related dump information > is limited for debugging. The patches provide more detailed memcg page statistics > and also take hierarchy into consideration. Within the changelogs, please include a sample of the proposed output so we can properly review the proposal. Also it would be useful to provide some justification for the decisions in this patch: which data is displayed and, particularly, which is not? Why is the displayed information useful to developers, etc. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754134Ab2KGWRP (ORCPT ); Wed, 7 Nov 2012 17:17:15 -0500 Received: from mail-ea0-f174.google.com ([209.85.215.174]:51689 "EHLO mail-ea0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753515Ab2KGWRN (ORCPT ); Wed, 7 Nov 2012 17:17:13 -0500 Date: Wed, 7 Nov 2012 23:17:09 +0100 From: Michal Hocko To: Sha Zhengju Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, rientjes@google.com, linux-kernel@vger.kernel.org, Sha Zhengju Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Message-ID: <20121107221709.GB26382@dhcp22.suse.cz> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1352277696-21724-1-git-send-email-handai.szj@taobao.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 07-11-12 16:41:36, Sha Zhengju wrote: > From: Sha Zhengju > > Current, when a memcg oom is happening the oom dump messages is still global > state and provides few useful info for users. This patch prints more pointed > memcg page statistics for memcg-oom. > > Signed-off-by: Sha Zhengju > Cc: Michal Hocko > Cc: KAMEZAWA Hiroyuki > Cc: David Rientjes > Cc: Andrew Morton > --- > mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++------- > mm/oom_kill.c | 6 +++- > 2 files changed, 66 insertions(+), 11 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 0eab7d5..2df5e72 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c [...] > @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, > spin_unlock_irqrestore(&memcg->move_lock, *flags); > } > > +#define K(x) ((x) << (PAGE_SHIFT-10)) > +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *mi; > + unsigned int i; > + > + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) { > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], > + K(mem_cgroup_read_stat(memcg, i))); > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) > + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], > + mem_cgroup_read_events(memcg, i)); > + > + for (i = 0; i < NR_LRU_LISTS; i++) > + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], > + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); > + } else { > + > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + long long val = 0; > + > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_stat(mi, i); > + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_events(mi, i); > + printk(KERN_CONT "%s:%llu ", > + mem_cgroup_events_names[i], val); > + } > + > + for (i = 0; i < NR_LRU_LISTS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); > + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); > + } > + } This is just plain ugly. for_each_mem_cgroup_tree is use_hierarchy aware and there is no need for if (use_hierarchy) part. memcg != root_mem_cgroup test doesn't make much sense as well because we call that a global oom killer ;) -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754150Ab2KGWen (ORCPT ); Wed, 7 Nov 2012 17:34:43 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:58594 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752088Ab2KGWel (ORCPT ); Wed, 7 Nov 2012 17:34:41 -0500 Date: Wed, 7 Nov 2012 23:34:37 +0100 From: Michal Hocko To: Sha Zhengju Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, rientjes@google.com, linux-kernel@vger.kernel.org, Sha Zhengju Subject: Re: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation Message-ID: <20121107223437.GC26382@dhcp22.suse.cz> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277719-21760-1-git-send-email-handai.szj@taobao.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1352277719-21760-1-git-send-email-handai.szj@taobao.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 07-11-12 16:41:59, Sha Zhengju wrote: > From: Sha Zhengju > > If memcg oom happening, don't scan all system tasks to dump memory state of > eligible tasks, instead we iterates only over the process attached to the oom > memcg and avoid the rcu lock. you have replaced rcu lock by css_set_lock which is, well, heavier than rcu. Besides that the patch is not correct because you have excluded all tasks that are from subgroups because you iterate only through the top level one. I am not sure the whole optimization would be a win even if implemented correctly. Well, we scan through more tasks currently and most of them are not relevant but then you would need to exclude task_in_mem_cgroup from oom_unkillable_task and that would be more code churn than the win. > Signed-off-by: Sha Zhengju > Cc: Michal Hocko > Cc: KAMEZAWA Hiroyuki > Cc: David Rientjes > Cc: Andrew Morton > --- > include/linux/memcontrol.h | 7 +++++ > include/linux/oom.h | 2 + > mm/memcontrol.c | 14 +++++++++++ > mm/oom_kill.c | 55 ++++++++++++++++++++++++++----------------- > 4 files changed, 56 insertions(+), 22 deletions(-) > [...] > diff --git a/include/linux/oom.h b/include/linux/oom.h > index 20b5c46..9ba3344 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -57,6 +57,8 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, > unsigned long totalpages, const nodemask_t *nodemask, > bool force_kill); > > +extern inline void dump_per_task(struct task_struct *p, > + const nodemask_t *nodemask); > extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > int order, nodemask_t *mask, bool force_kill); > extern int register_oom_notifier(struct notifier_block *nb); > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 2df5e72..fe648f8 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1665,6 +1665,20 @@ static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) > return min(limit, memsw); > } > > +void dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > +{ > + struct cgroup_iter it; > + struct task_struct *task; > + struct cgroup *cgroup = memcg->css.cgroup; > + > + cgroup_iter_start(cgroup, &it); > + while ((task = cgroup_iter_next(cgroup, &it))) { > + dump_per_task(task, nodemask); > + } > + > + cgroup_iter_end(cgroup, &it); > +} > + > static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > int order) > { > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 4b8a6dd..aaf6237 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -367,6 +367,32 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > return chosen; > } > > +inline void dump_per_task(struct task_struct *p, const nodemask_t *nodemask) > +{ > + struct task_struct *task; > + > + if (oom_unkillable_task(p, NULL, nodemask)) > + return; > + > + task = find_lock_task_mm(p); > + if (!task) { > + /* > + * This is a kthread or all of p's threads have already > + * detached their mm's. There's no need to report > + * them; they can't be oom killed anyway. > + */ > + return; > + } > + > + pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", > + task->pid, from_kuid(&init_user_ns, task_uid(task)), > + task->tgid, task->mm->total_vm, get_mm_rss(task->mm), > + task->mm->nr_ptes, > + get_mm_counter(task->mm, MM_SWAPENTS), > + task->signal->oom_score_adj, task->comm); > + task_unlock(task); > +} > + > /** > * dump_tasks - dump current memory state of all system tasks > * @memcg: current's memory controller, if constrained > @@ -381,32 +407,17 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > { > struct task_struct *p; > - struct task_struct *task; > > pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n"); > - rcu_read_lock(); > - for_each_process(p) { > - if (oom_unkillable_task(p, memcg, nodemask)) > - continue; > - > - task = find_lock_task_mm(p); > - if (!task) { > - /* > - * This is a kthread or all of p's threads have already > - * detached their mm's. There's no need to report > - * them; they can't be oom killed anyway. > - */ > - continue; > - } > > - pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", > - task->pid, from_kuid(&init_user_ns, task_uid(task)), > - task->tgid, task->mm->total_vm, get_mm_rss(task->mm), > - task->mm->nr_ptes, > - get_mm_counter(task->mm, MM_SWAPENTS), > - task->signal->oom_score_adj, task->comm); > - task_unlock(task); > + if (memcg) { > + dump_tasks_memcg(memcg, nodemask); > + return; > } > + > + rcu_read_lock(); > + for_each_process(p) > + dump_per_task(p, nodemask); > rcu_read_unlock(); > } > > -- > 1.7.6.1 > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755245Ab2KHJIE (ORCPT ); Thu, 8 Nov 2012 04:08:04 -0500 Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:50777 "EHLO fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751393Ab2KHJHz (ORCPT ); Thu, 8 Nov 2012 04:07:55 -0500 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4 Message-ID: <509B7658.1020807@jp.fujitsu.com> Date: Thu, 08 Nov 2012 18:07:36 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20121026 Thunderbird/16.0.2 MIME-Version: 1.0 To: Sha Zhengju CC: linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, akpm@linux-foundation.org, rientjes@google.com, linux-kernel@vger.kernel.org, Sha Zhengju Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> In-Reply-To: <1352277696-21724-1-git-send-email-handai.szj@taobao.com> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2012/11/07 17:41), Sha Zhengju wrote: > From: Sha Zhengju > > Current, when a memcg oom is happening the oom dump messages is still global > state and provides few useful info for users. This patch prints more pointed > memcg page statistics for memcg-oom. > > Signed-off-by: Sha Zhengju > Cc: Michal Hocko > Cc: KAMEZAWA Hiroyuki > Cc: David Rientjes > Cc: Andrew Morton > --- > mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++------- > mm/oom_kill.c | 6 +++- > 2 files changed, 66 insertions(+), 11 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 0eab7d5..2df5e72 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = { > "pgmajfault", > }; > > +static const char * const mem_cgroup_lru_names[] = { > + "inactive_anon", > + "active_anon", > + "inactive_file", > + "active_file", > + "unevictable", > +}; > + Is this for the same strings with show_free_areas() ? > /* > * Per memcg event counter is incremented at every pagein/pageout. With THP, > * it will be incremated by the number of pages. This counter is used for > @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, > spin_unlock_irqrestore(&memcg->move_lock, *flags); > } > > +#define K(x) ((x) << (PAGE_SHIFT-10)) > +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *mi; > + unsigned int i; > + > + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) { Why do you need to have this condition check ? > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], > + K(mem_cgroup_read_stat(memcg, i))); Hm, how about using the same style with show_free_areas() ? > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) > + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], > + mem_cgroup_read_events(memcg, i)); > + I don't think EVENTS info is useful for oom. > + for (i = 0; i < NR_LRU_LISTS; i++) > + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], > + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); How far does your new information has different format than usual oom ? Could you show a sample and difference in changelog ? Of course, I prefer both of them has similar format. > + } else { > + > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + long long val = 0; > + > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_stat(mi, i); > + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_events(mi, i); > + printk(KERN_CONT "%s:%llu ", > + mem_cgroup_events_names[i], val); > + } > + > + for (i = 0; i < NR_LRU_LISTS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); > + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); > + } > + } > + printk(KERN_CONT "\n"); > +} > /** > - * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode. > * @memcg: The memory cgroup that went over limit > * @p: Task that is going to be killed > * > @@ -1569,6 +1628,8 @@ done: > res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10, > res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10, > res_counter_read_u64(&memcg->kmem, RES_FAILCNT)); > + > + mem_cgroup_print_oom_stat(memcg); > } please put directly in print_oom_info() Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752041Ab2KHMhp (ORCPT ); Thu, 8 Nov 2012 07:37:45 -0500 Received: from mail-da0-f46.google.com ([209.85.210.46]:49227 "EHLO mail-da0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751477Ab2KHMhn (ORCPT ); Thu, 8 Nov 2012 07:37:43 -0500 Message-ID: <509BA799.505@gmail.com> Date: Thu, 08 Nov 2012 20:37:45 +0800 From: Sha Zhengju User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.23) Gecko/20110922 Thunderbird/3.1.15 MIME-Version: 1.0 To: David Rientjes CC: Sha Zhengju , linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/08/2012 02:02 AM, David Rientjes wrote: > On Wed, 7 Nov 2012, Sha Zhengju wrote: > >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 0eab7d5..2df5e72 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = { >> "pgmajfault", >> }; >> >> +static const char * const mem_cgroup_lru_names[] = { >> + "inactive_anon", >> + "active_anon", >> + "inactive_file", >> + "active_file", >> + "unevictable", >> +}; >> + >> /* >> * Per memcg event counter is incremented at every pagein/pageout. With THP, >> * it will be incremated by the number of pages. This counter is used for >> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, >> spin_unlock_irqrestore(&memcg->move_lock, *flags); >> } >> >> +#define K(x) ((x)<< (PAGE_SHIFT-10)) >> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) >> +{ >> + struct mem_cgroup *mi; >> + unsigned int i; >> + >> + if (!memcg->use_hierarchy&& memcg != root_mem_cgroup) { >> + for (i = 0; i< MEM_CGROUP_STAT_NSTATS; i++) { >> + if (i == MEM_CGROUP_STAT_SWAP&& !do_swap_account) >> + continue; >> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], > This printk isn't continuing any previous printk, so using KERN_CONT here > will require a short header to be printed first ("Memcg: "?) with > KERN_INFO before the iterations. > Yep...I think I lost it while rebasing... sorry for the stupid mistake. >> + K(mem_cgroup_read_stat(memcg, i))); >> + } >> + >> + for (i = 0; i< MEM_CGROUP_EVENTS_NSTATS; i++) >> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], >> + mem_cgroup_read_events(memcg, i)); >> + >> + for (i = 0; i< NR_LRU_LISTS; i++) >> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], >> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); >> + } else { >> + > Spurious newline. > > Eek, is there really no way to avoid this if-conditional and just use > for_each_mem_cgroup_tree() for everything and use > > mem_cgroup_iter_break(memcg, iter); > break; > > for !memcg->use_hierarchy? > Now I'm shamed at my bad brain of yesterday by sending this chunk out... Yes, the if-part code above is obviously unwanted, and the for_each_mem_cgroup_tree can handle hierarchy already. >> + for (i = 0; i< MEM_CGROUP_STAT_NSTATS; i++) { >> + long long val = 0; >> + >> + if (i == MEM_CGROUP_STAT_SWAP&& !do_swap_account) >> + continue; >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_read_stat(mi, i); >> + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); >> + } >> + >> + for (i = 0; i< MEM_CGROUP_EVENTS_NSTATS; i++) { >> + unsigned long long val = 0; >> + >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_read_events(mi, i); >> + printk(KERN_CONT "%s:%llu ", >> + mem_cgroup_events_names[i], val); >> + } >> + >> + for (i = 0; i< NR_LRU_LISTS; i++) { >> + unsigned long long val = 0; >> + >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); >> + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); >> + } >> + } >> + printk(KERN_CONT "\n"); >> +} >> /** >> - * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode. >> * @memcg: The memory cgroup that went over limit >> * @p: Task that is going to be killed >> * >> @@ -1569,6 +1628,8 @@ done: >> res_counter_read_u64(&memcg->kmem, RES_USAGE)>> 10, >> res_counter_read_u64(&memcg->kmem, RES_LIMIT)>> 10, >> res_counter_read_u64(&memcg->kmem, RES_FAILCNT)); >> + >> + mem_cgroup_print_oom_stat(memcg); > I think this should be folded into mem_cgroup_print_oom_info(), I don't > see a need for a new function. > >> } >> >> /* >> @@ -5195,14 +5256,6 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, >> } >> #endif /* CONFIG_NUMA */ >> >> -static const char * const mem_cgroup_lru_names[] = { >> - "inactive_anon", >> - "active_anon", >> - "inactive_file", >> - "active_file", >> - "unevictable", >> -}; >> - >> static inline void mem_cgroup_lru_names_not_uptodate(void) >> { >> BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS); >> diff --git a/mm/oom_kill.c b/mm/oom_kill.c >> index 7e9e911..4b8a6dd 100644 >> --- a/mm/oom_kill.c >> +++ b/mm/oom_kill.c >> @@ -421,8 +421,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, >> cpuset_print_task_mems_allowed(current); >> task_unlock(current); >> dump_stack(); >> - mem_cgroup_print_oom_info(memcg, p); >> - show_mem(SHOW_MEM_FILTER_NODES); >> + if (memcg) >> + mem_cgroup_print_oom_info(memcg, p); > mem_cgroup_print_oom_info() already returns immediately for !memcg, so I'm > not sure why this change is made. > Here the if-else checking is aiming at printing distinct messages for memcg & non-memcg. IMHO, global state has little actual use for memcg-oom and why not we wipe off it? Though mem_cgroup_print_oom_info already checking for !memcg, the if-statement can avoid one function call and save the deep-enough oom call stack a little. >> + else >> + show_mem(SHOW_MEM_FILTER_NODES); > Well that's disappointing if memcg == root_mem_cgroup, we'd probably like > to know the global memory state to determine what the problem is. > I really wondering if there is any case that can pass root_mem_cgroup down here. It's called by global or memcg oom killer and the global oom will set memcg=NULL directly instead of root_mem_cgroup. Besides, root memcg will not go through charging and there is no chance to call mem_cgroup_out_of_memory for root cgroup tasks. Thanks, Sha >> if (sysctl_oom_dump_tasks) >> dump_tasks(memcg, nodemask); >> } > . > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755445Ab2KHMoa (ORCPT ); Thu, 8 Nov 2012 07:44:30 -0500 Received: from cantor2.suse.de ([195.135.220.15]:52457 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751916Ab2KHMo2 (ORCPT ); Thu, 8 Nov 2012 07:44:28 -0500 Date: Thu, 8 Nov 2012 13:44:26 +0100 From: Michal Hocko To: Sha Zhengju Cc: David Rientjes , linux-mm@kvack.org, cgroups@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Message-ID: <20121108124426.GF31821@dhcp22.suse.cz> References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> <509BA799.505@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <509BA799.505@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 08-11-12 20:37:45, Sha Zhengju wrote: > On 11/08/2012 02:02 AM, David Rientjes wrote: > >On Wed, 7 Nov 2012, Sha Zhengju wrote: [..] > >>+ else > >>+ show_mem(SHOW_MEM_FILTER_NODES); > >Well that's disappointing if memcg == root_mem_cgroup, we'd probably like > >to know the global memory state to determine what the problem is. > > > > I really wondering if there is any case that can pass > root_mem_cgroup down here. No it cannot because the root cgroup doesn't have any limit so we cannot trigger memcg oom killer. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755456Ab2KHMp3 (ORCPT ); Thu, 8 Nov 2012 07:45:29 -0500 Received: from mail-da0-f46.google.com ([209.85.210.46]:47820 "EHLO mail-da0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751916Ab2KHMp1 (ORCPT ); Thu, 8 Nov 2012 07:45:27 -0500 Message-ID: <509BA96A.2070701@gmail.com> Date: Thu, 08 Nov 2012 20:45:30 +0800 From: Sha Zhengju User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.23) Gecko/20110922 Thunderbird/3.1.15 MIME-Version: 1.0 To: Michal Hocko CC: linux-mm@kvack.org, cgroups@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, rientjes@google.com, linux-kernel@vger.kernel.org, Sha Zhengju Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> <20121107221709.GB26382@dhcp22.suse.cz> In-Reply-To: <20121107221709.GB26382@dhcp22.suse.cz> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/08/2012 06:17 AM, Michal Hocko wrote: > On Wed 07-11-12 16:41:36, Sha Zhengju wrote: >> From: Sha Zhengju >> >> Current, when a memcg oom is happening the oom dump messages is still global >> state and provides few useful info for users. This patch prints more pointed >> memcg page statistics for memcg-oom. >> >> Signed-off-by: Sha Zhengju >> Cc: Michal Hocko >> Cc: KAMEZAWA Hiroyuki >> Cc: David Rientjes >> Cc: Andrew Morton >> --- >> mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++------- >> mm/oom_kill.c | 6 +++- >> 2 files changed, 66 insertions(+), 11 deletions(-) >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 0eab7d5..2df5e72 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c > [...] >> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, >> spin_unlock_irqrestore(&memcg->move_lock, *flags); >> } >> >> +#define K(x) ((x)<< (PAGE_SHIFT-10)) >> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) >> +{ >> + struct mem_cgroup *mi; >> + unsigned int i; >> + >> + if (!memcg->use_hierarchy&& memcg != root_mem_cgroup) { >> + for (i = 0; i< MEM_CGROUP_STAT_NSTATS; i++) { >> + if (i == MEM_CGROUP_STAT_SWAP&& !do_swap_account) >> + continue; >> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], >> + K(mem_cgroup_read_stat(memcg, i))); >> + } >> + >> + for (i = 0; i< MEM_CGROUP_EVENTS_NSTATS; i++) >> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], >> + mem_cgroup_read_events(memcg, i)); >> + >> + for (i = 0; i< NR_LRU_LISTS; i++) >> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], >> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); >> + } else { >> + >> + for (i = 0; i< MEM_CGROUP_STAT_NSTATS; i++) { >> + long long val = 0; >> + >> + if (i == MEM_CGROUP_STAT_SWAP&& !do_swap_account) >> + continue; >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_read_stat(mi, i); >> + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); >> + } >> + >> + for (i = 0; i< MEM_CGROUP_EVENTS_NSTATS; i++) { >> + unsigned long long val = 0; >> + >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_read_events(mi, i); >> + printk(KERN_CONT "%s:%llu ", >> + mem_cgroup_events_names[i], val); >> + } >> + >> + for (i = 0; i< NR_LRU_LISTS; i++) { >> + unsigned long long val = 0; >> + >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); >> + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); >> + } >> + } > This is just plain ugly. for_each_mem_cgroup_tree is use_hierarchy aware > and there is no need for if (use_hierarchy) part. > memcg != root_mem_cgroup test doesn't make much sense as well because we > call that a global oom killer ;) Yes... bitterly did I repent the patch... The else-part of for_each_mem_cgroup_tree is enough for hierarchy. I'll send a update one later. Sorry for the noise. : ( Thanks, Sha From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755565Ab2KHNMA (ORCPT ); Thu, 8 Nov 2012 08:12:00 -0500 Received: from mail-pb0-f46.google.com ([209.85.160.46]:48903 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751274Ab2KHNL5 (ORCPT ); Thu, 8 Nov 2012 08:11:57 -0500 Message-ID: <509BAFA0.4010604@gmail.com> Date: Thu, 08 Nov 2012 21:12:00 +0800 From: Sha Zhengju User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.23) Gecko/20110922 Thunderbird/3.1.15 MIME-Version: 1.0 To: Kamezawa Hiroyuki CC: linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@suse.cz, akpm@linux-foundation.org, rientjes@google.com, linux-kernel@vger.kernel.org, Sha Zhengju Subject: Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277696-21724-1-git-send-email-handai.szj@taobao.com> <509B7658.1020807@jp.fujitsu.com> In-Reply-To: <509B7658.1020807@jp.fujitsu.com> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/08/2012 05:07 PM, Kamezawa Hiroyuki wrote: > (2012/11/07 17:41), Sha Zhengju wrote: >> From: Sha Zhengju >> >> Current, when a memcg oom is happening the oom dump messages is still global >> state and provides few useful info for users. This patch prints more pointed >> memcg page statistics for memcg-oom. >> >> Signed-off-by: Sha Zhengju >> Cc: Michal Hocko >> Cc: KAMEZAWA Hiroyuki >> Cc: David Rientjes >> Cc: Andrew Morton >> --- >> mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++------- >> mm/oom_kill.c | 6 +++- >> 2 files changed, 66 insertions(+), 11 deletions(-) >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 0eab7d5..2df5e72 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = { >> "pgmajfault", >> }; >> >> +static const char * const mem_cgroup_lru_names[] = { >> + "inactive_anon", >> + "active_anon", >> + "inactive_file", >> + "active_file", >> + "unevictable", >> +}; >> + > Is this for the same strings with show_free_areas() ? > I just move the declaration here from the bottom of source file to make the following use error-free. >> /* >> * Per memcg event counter is incremented at every pagein/pageout. With THP, >> * it will be incremated by the number of pages. This counter is used for >> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, >> spin_unlock_irqrestore(&memcg->move_lock, *flags); >> } >> >> +#define K(x) ((x) << (PAGE_SHIFT-10)) >> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) >> +{ >> + struct mem_cgroup *mi; >> + unsigned int i; >> + >> + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) { > Why do you need to have this condition check ? > Yes, the check is unnecessary... I'll remove it next version. >> + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { >> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) >> + continue; >> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], >> + K(mem_cgroup_read_stat(memcg, i))); > Hm, how about using the same style with show_free_areas() ? > I'm also trying do so. show_free_areas() prints the memory related info in two style: one is in page unit and the oher is in KB (I've no idea why we distinct them), but I think the KB format is more readable. >> + } >> + >> + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) >> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], >> + mem_cgroup_read_events(memcg, i)); >> + > I don't think EVENTS info is useful for oom. > It seems you're right. : ) >> + for (i = 0; i < NR_LRU_LISTS; i++) >> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], >> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); > How far does your new information has different format than usual oom ? > Could you show a sample and difference in changelog ? > > Of course, I prefer both of them has similar format. > > > The new memcg-oom info excludes global state out and prints the memcg statistics instead which seems more brevity. I'll add a sample next time. Thanks for reminding me! Thanks, Sha From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755707Ab2KHN6u (ORCPT ); Thu, 8 Nov 2012 08:58:50 -0500 Received: from mail-da0-f46.google.com ([209.85.210.46]:49179 "EHLO mail-da0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751069Ab2KHN6t (ORCPT ); Thu, 8 Nov 2012 08:58:49 -0500 Message-ID: <509BBA9C.7050007@gmail.com> Date: Thu, 08 Nov 2012 21:58:52 +0800 From: Sha Zhengju User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.23) Gecko/20110922 Thunderbird/3.1.15 MIME-Version: 1.0 To: Michal Hocko CC: linux-mm@kvack.org, cgroups@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org, rientjes@google.com, linux-kernel@vger.kernel.org, Sha Zhengju Subject: Re: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation References: <1352277602-21687-1-git-send-email-handai.szj@taobao.com> <1352277719-21760-1-git-send-email-handai.szj@taobao.com> <20121107223437.GC26382@dhcp22.suse.cz> In-Reply-To: <20121107223437.GC26382@dhcp22.suse.cz> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/08/2012 06:34 AM, Michal Hocko wrote: > On Wed 07-11-12 16:41:59, Sha Zhengju wrote: >> From: Sha Zhengju >> >> If memcg oom happening, don't scan all system tasks to dump memory state of >> eligible tasks, instead we iterates only over the process attached to the oom >> memcg and avoid the rcu lock. > you have replaced rcu lock by css_set_lock which is, well, heavier than > rcu. Besides that the patch is not correct because you have excluded > all tasks that are from subgroups because you iterate only through the > top level one. > I am not sure the whole optimization would be a win even if implemented > correctly. Well, we scan through more tasks currently and most of them > are not relevant but then you would need to exclude task_in_mem_cgroup > from oom_unkillable_task and that would be more code churn than the > win. Thanks for your and David's advice. This piece is trying to save some expense while dumping memcg tasks, but failed to scanning subgroups by iterating the cgroup. I'm agreed with your cost&win opinion, so I decide to give up this one. : ) Thanks, Sha