* [PATCH V2 0/2] Provide more precise dump info for memcg-oom

From: Sha Zhengju <handai.szj@taobao.com>
Date: 2012-11-07 8:40 UTC
To: linux-mm, cgroups, mhocko, kamezawa.hiroyu, akpm, rientjes
Cc: linux-kernel, Sha Zhengju

When a memcg oom happens, the current memcg-related dump information is too
limited to be useful for debugging. These patches provide more detailed memcg
page statistics and also take the hierarchy into consideration. The previous,
primitive version can be found here: https://lkml.org/lkml/2012/7/30/179.

Change log:
1. some modifications towards hierarchy
2. rework dump_tasks
3. rebased on Michal's mm tree since-3.6

Any comments are welcome. :)

Sha Zhengju (2):
  memcg-oom-provide-more-precise-dump-info-while-memcg.patch
  oom-rework-dump_tasks-to-optimize-memcg-oom-situatio.patch

 include/linux/memcontrol.h |    7 ++++
 include/linux/oom.h        |    2 +
 mm/memcontrol.c            |   85 +++++++++++++++++++++++++++++++++++++++-----
 mm/oom_kill.c              |   61 +++++++++++++++++++------------
 4 files changed, 122 insertions(+), 33 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/
Don't email: email@kvack.org
* [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening

From: Sha Zhengju <handai.szj@taobao.com>
Date: 2012-11-07 8:41 UTC
To: linux-mm, cgroups, mhocko, kamezawa.hiroyu, akpm, rientjes
Cc: linux-kernel, Sha Zhengju

Currently, when a memcg oom happens, the oom dump messages still show only
global state and provide little useful info for users. This patch prints
more targeted memcg page statistics for a memcg oom.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/memcontrol.c |   71 ++++++++++++++++++++++++++++++++++++++++++++++++-------
 mm/oom_kill.c   |    6 +++-
 2 files changed, 66 insertions(+), 11 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0eab7d5..2df5e72 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = {
 	"pgmajfault",
 };
 
+static const char * const mem_cgroup_lru_names[] = {
+	"inactive_anon",
+	"active_anon",
+	"inactive_file",
+	"active_file",
+	"unevictable",
+};
+
 /*
  * Per memcg event counter is incremented at every pagein/pageout. With THP,
  * it will be incremated by the number of pages. This counter is used for
@@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
 	spin_unlock_irqrestore(&memcg->move_lock, *flags);
 }
 
+#define K(x) ((x) << (PAGE_SHIFT-10))
+static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *mi;
+	unsigned int i;
+
+	if (!memcg->use_hierarchy && memcg != root_mem_cgroup) {
+		for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
+			if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
+				continue;
+			printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i],
+				K(mem_cgroup_read_stat(memcg, i)));
+		}
+
+		for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
+			printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i],
+				mem_cgroup_read_events(memcg, i));
+
+		for (i = 0; i < NR_LRU_LISTS; i++)
+			printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i],
+				K(mem_cgroup_nr_lru_pages(memcg, BIT(i))));
+	} else {
+
+		for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
+			long long val = 0;
+
+			if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
+				continue;
+			for_each_mem_cgroup_tree(mi, memcg)
+				val += mem_cgroup_read_stat(mi, i);
+			printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val));
+		}
+
+		for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
+			unsigned long long val = 0;
+
+			for_each_mem_cgroup_tree(mi, memcg)
+				val += mem_cgroup_read_events(mi, i);
+			printk(KERN_CONT "%s:%llu ",
+				mem_cgroup_events_names[i], val);
+		}
+
+		for (i = 0; i < NR_LRU_LISTS; i++) {
+			unsigned long long val = 0;
+
+			for_each_mem_cgroup_tree(mi, memcg)
+				val += mem_cgroup_nr_lru_pages(mi, BIT(i));
+			printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val));
+		}
+	}
+	printk(KERN_CONT "\n");
+}
 /**
- * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
  * @memcg: The memory cgroup that went over limit
  * @p: Task that is going to be killed
  *
@@ -1569,6 +1628,8 @@ done:
 		res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
 		res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
 		res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
+
+	mem_cgroup_print_oom_stat(memcg);
 }
 
 /*
@@ -5195,14 +5256,6 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft,
 }
 #endif /* CONFIG_NUMA */
 
-static const char * const mem_cgroup_lru_names[] = {
-	"inactive_anon",
-	"active_anon",
-	"inactive_file",
-	"active_file",
-	"unevictable",
-};
-
 static inline void mem_cgroup_lru_names_not_uptodate(void)
 {
 	BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7e9e911..4b8a6dd 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -421,8 +421,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 	cpuset_print_task_mems_allowed(current);
 	task_unlock(current);
 	dump_stack();
-	mem_cgroup_print_oom_info(memcg, p);
-	show_mem(SHOW_MEM_FILTER_NODES);
+	if (memcg)
+		mem_cgroup_print_oom_info(memcg, p);
+	else
+		show_mem(SHOW_MEM_FILTER_NODES);
 	if (sysctl_oom_dump_tasks)
 		dump_tasks(memcg, nodemask);
 }
-- 
1.7.6.1
* Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening 2012-11-07 8:41 ` [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Sha Zhengju @ 2012-11-07 18:02 ` David Rientjes 2012-11-08 12:37 ` Sha Zhengju 2012-11-07 22:17 ` Michal Hocko 2012-11-08 9:07 ` Kamezawa Hiroyuki 2 siblings, 1 reply; 14+ messages in thread From: David Rientjes @ 2012-11-07 18:02 UTC (permalink / raw) To: Sha Zhengju Cc: linux-mm, cgroups, mhocko, kamezawa.hiroyu, akpm, linux-kernel, Sha Zhengju On Wed, 7 Nov 2012, Sha Zhengju wrote: > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 0eab7d5..2df5e72 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = { > "pgmajfault", > }; > > +static const char * const mem_cgroup_lru_names[] = { > + "inactive_anon", > + "active_anon", > + "inactive_file", > + "active_file", > + "unevictable", > +}; > + > /* > * Per memcg event counter is incremented at every pagein/pageout. With THP, > * it will be incremated by the number of pages. This counter is used for > @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, > spin_unlock_irqrestore(&memcg->move_lock, *flags); > } > > +#define K(x) ((x) << (PAGE_SHIFT-10)) > +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *mi; > + unsigned int i; > + > + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) { > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], This printk isn't continuing any previous printk, so using KERN_CONT here will require a short header to be printed first ("Memcg: "?) with KERN_INFO before the iterations. 
> + K(mem_cgroup_read_stat(memcg, i))); > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) > + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], > + mem_cgroup_read_events(memcg, i)); > + > + for (i = 0; i < NR_LRU_LISTS; i++) > + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], > + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); > + } else { > + Spurious newline. Eek, is there really no way to avoid this if-conditional and just use for_each_mem_cgroup_tree() for everything and use mem_cgroup_iter_break(memcg, iter); break; for !memcg->use_hierarchy? > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + long long val = 0; > + > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_stat(mi, i); > + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_events(mi, i); > + printk(KERN_CONT "%s:%llu ", > + mem_cgroup_events_names[i], val); > + } > + > + for (i = 0; i < NR_LRU_LISTS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); > + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); > + } > + } > + printk(KERN_CONT "\n"); > +} > /** > - * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode. > * @memcg: The memory cgroup that went over limit > * @p: Task that is going to be killed > * > @@ -1569,6 +1628,8 @@ done: > res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10, > res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10, > res_counter_read_u64(&memcg->kmem, RES_FAILCNT)); > + > + mem_cgroup_print_oom_stat(memcg); I think this should be folded into mem_cgroup_print_oom_info(), I don't see a need for a new function. 
> } > > /* > @@ -5195,14 +5256,6 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, > } > #endif /* CONFIG_NUMA */ > > -static const char * const mem_cgroup_lru_names[] = { > - "inactive_anon", > - "active_anon", > - "inactive_file", > - "active_file", > - "unevictable", > -}; > - > static inline void mem_cgroup_lru_names_not_uptodate(void) > { > BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS); > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 7e9e911..4b8a6dd 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -421,8 +421,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, > cpuset_print_task_mems_allowed(current); > task_unlock(current); > dump_stack(); > - mem_cgroup_print_oom_info(memcg, p); > - show_mem(SHOW_MEM_FILTER_NODES); > + if (memcg) > + mem_cgroup_print_oom_info(memcg, p); mem_cgroup_print_oom_info() already returns immediately for !memcg, so I'm not sure why this change is made. > + else > + show_mem(SHOW_MEM_FILTER_NODES); Well that's disappointing if memcg == root_mem_cgroup, we'd probably like to know the global memory state to determine what the problem is. > if (sysctl_oom_dump_tasks) > dump_tasks(memcg, nodemask); > } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
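[Editor's aside: for concreteness, the consolidation David asks about could look roughly like the sketch below. This is untested kernel-style pseudocode, not a compiled implementation: it folds in the "Memcg: " header he suggests, drops the events loop for brevity, and assumes — as Michal confirms elsewhere in the thread — that for_each_mem_cgroup_tree() visits only the passed memcg when use_hierarchy is off, so one hierarchical walk covers both cases. All names are the ones used in the patch.]

```c
/* Sketch only -- not compiled. One hierarchical walk replaces the
 * if/else: when use_hierarchy is off the summed totals simply equal
 * the local counters of @memcg. */
#define K(x) ((x) << (PAGE_SHIFT - 10))
static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg)
{
	struct mem_cgroup *mi;
	unsigned int i;

	printk(KERN_INFO "Memcg: ");
	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
		long long val = 0;

		if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
			continue;
		for_each_mem_cgroup_tree(mi, memcg)
			val += mem_cgroup_read_stat(mi, i);
		printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val));
	}

	for (i = 0; i < NR_LRU_LISTS; i++) {
		unsigned long long val = 0;

		for_each_mem_cgroup_tree(mi, memcg)
			val += mem_cgroup_nr_lru_pages(mi, BIT(i));
		printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val));
	}
	printk(KERN_CONT "\n");
}
```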
* Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening 2012-11-07 18:02 ` David Rientjes @ 2012-11-08 12:37 ` Sha Zhengju 2012-11-08 12:44 ` Michal Hocko 0 siblings, 1 reply; 14+ messages in thread From: Sha Zhengju @ 2012-11-08 12:37 UTC (permalink / raw) To: David Rientjes Cc: Sha Zhengju, linux-mm, cgroups, mhocko, kamezawa.hiroyu, akpm, linux-kernel On 11/08/2012 02:02 AM, David Rientjes wrote: > On Wed, 7 Nov 2012, Sha Zhengju wrote: > >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 0eab7d5..2df5e72 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = { >> "pgmajfault", >> }; >> >> +static const char * const mem_cgroup_lru_names[] = { >> + "inactive_anon", >> + "active_anon", >> + "inactive_file", >> + "active_file", >> + "unevictable", >> +}; >> + >> /* >> * Per memcg event counter is incremented at every pagein/pageout. With THP, >> * it will be incremated by the number of pages. This counter is used for >> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, >> spin_unlock_irqrestore(&memcg->move_lock, *flags); >> } >> >> +#define K(x) ((x)<< (PAGE_SHIFT-10)) >> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) >> +{ >> + struct mem_cgroup *mi; >> + unsigned int i; >> + >> + if (!memcg->use_hierarchy&& memcg != root_mem_cgroup) { >> + for (i = 0; i< MEM_CGROUP_STAT_NSTATS; i++) { >> + if (i == MEM_CGROUP_STAT_SWAP&& !do_swap_account) >> + continue; >> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], > This printk isn't continuing any previous printk, so using KERN_CONT here > will require a short header to be printed first ("Memcg: "?) with > KERN_INFO before the iterations. > Yep...I think I lost it while rebasing... sorry for the stupid mistake. 
>> + K(mem_cgroup_read_stat(memcg, i))); >> + } >> + >> + for (i = 0; i< MEM_CGROUP_EVENTS_NSTATS; i++) >> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], >> + mem_cgroup_read_events(memcg, i)); >> + >> + for (i = 0; i< NR_LRU_LISTS; i++) >> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], >> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); >> + } else { >> + > Spurious newline. > > Eek, is there really no way to avoid this if-conditional and just use > for_each_mem_cgroup_tree() for everything and use > > mem_cgroup_iter_break(memcg, iter); > break; > > for !memcg->use_hierarchy? > Now I'm shamed at my bad brain of yesterday by sending this chunk out... Yes, the if-part code above is obviously unwanted, and the for_each_mem_cgroup_tree can handle hierarchy already. >> + for (i = 0; i< MEM_CGROUP_STAT_NSTATS; i++) { >> + long long val = 0; >> + >> + if (i == MEM_CGROUP_STAT_SWAP&& !do_swap_account) >> + continue; >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_read_stat(mi, i); >> + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); >> + } >> + >> + for (i = 0; i< MEM_CGROUP_EVENTS_NSTATS; i++) { >> + unsigned long long val = 0; >> + >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_read_events(mi, i); >> + printk(KERN_CONT "%s:%llu ", >> + mem_cgroup_events_names[i], val); >> + } >> + >> + for (i = 0; i< NR_LRU_LISTS; i++) { >> + unsigned long long val = 0; >> + >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); >> + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); >> + } >> + } >> + printk(KERN_CONT "\n"); >> +} >> /** >> - * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode. 
>> * @memcg: The memory cgroup that went over limit >> * @p: Task that is going to be killed >> * >> @@ -1569,6 +1628,8 @@ done: >> res_counter_read_u64(&memcg->kmem, RES_USAGE)>> 10, >> res_counter_read_u64(&memcg->kmem, RES_LIMIT)>> 10, >> res_counter_read_u64(&memcg->kmem, RES_FAILCNT)); >> + >> + mem_cgroup_print_oom_stat(memcg); > I think this should be folded into mem_cgroup_print_oom_info(), I don't > see a need for a new function. > >> } >> >> /* >> @@ -5195,14 +5256,6 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, >> } >> #endif /* CONFIG_NUMA */ >> >> -static const char * const mem_cgroup_lru_names[] = { >> - "inactive_anon", >> - "active_anon", >> - "inactive_file", >> - "active_file", >> - "unevictable", >> -}; >> - >> static inline void mem_cgroup_lru_names_not_uptodate(void) >> { >> BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS); >> diff --git a/mm/oom_kill.c b/mm/oom_kill.c >> index 7e9e911..4b8a6dd 100644 >> --- a/mm/oom_kill.c >> +++ b/mm/oom_kill.c >> @@ -421,8 +421,10 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, >> cpuset_print_task_mems_allowed(current); >> task_unlock(current); >> dump_stack(); >> - mem_cgroup_print_oom_info(memcg, p); >> - show_mem(SHOW_MEM_FILTER_NODES); >> + if (memcg) >> + mem_cgroup_print_oom_info(memcg, p); > mem_cgroup_print_oom_info() already returns immediately for !memcg, so I'm > not sure why this change is made. > Here the if-else check aims at printing distinct messages for memcg and non-memcg oom. IMHO, the global state has little actual use for a memcg oom, so why don't we just drop it? Though mem_cgroup_print_oom_info() already checks for !memcg, the if-statement avoids one function call and saves the already deep oom call stack a little. >> + else >> + show_mem(SHOW_MEM_FILTER_NODES); > Well that's disappointing if memcg == root_mem_cgroup, we'd probably like > to know the global memory state to determine what the problem is.
I'm really wondering whether there is any case that can pass root_mem_cgroup down here. It's called by the global or the memcg oom killer, and the global oom path sets memcg=NULL directly instead of root_mem_cgroup. Besides, the root memcg never goes through charging, so there is no chance to call mem_cgroup_out_of_memory for root cgroup tasks. Thanks, Sha >> if (sysctl_oom_dump_tasks) >> dump_tasks(memcg, nodemask); >> }
* Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening 2012-11-08 12:37 ` Sha Zhengju @ 2012-11-08 12:44 ` Michal Hocko 0 siblings, 0 replies; 14+ messages in thread From: Michal Hocko @ 2012-11-08 12:44 UTC To: Sha Zhengju Cc: David Rientjes, linux-mm, cgroups, kamezawa.hiroyu, akpm, linux-kernel On Thu 08-11-12 20:37:45, Sha Zhengju wrote: > On 11/08/2012 02:02 AM, David Rientjes wrote: > >On Wed, 7 Nov 2012, Sha Zhengju wrote: [..] > >>+ else > >>+ show_mem(SHOW_MEM_FILTER_NODES); > >Well that's disappointing if memcg == root_mem_cgroup, we'd probably like > >to know the global memory state to determine what the problem is. > > > > I really wondering if there is any case that can pass > root_mem_cgroup down here. No, it cannot, because the root cgroup doesn't have any limit, so we cannot trigger the memcg oom killer. -- Michal Hocko SUSE Labs
* Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening 2012-11-07 8:41 ` [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Sha Zhengju 2012-11-07 18:02 ` David Rientjes @ 2012-11-07 22:17 ` Michal Hocko 2012-11-08 12:45 ` Sha Zhengju 2012-11-08 9:07 ` Kamezawa Hiroyuki 2 siblings, 1 reply; 14+ messages in thread From: Michal Hocko @ 2012-11-07 22:17 UTC (permalink / raw) To: Sha Zhengju Cc: linux-mm, cgroups, kamezawa.hiroyu, akpm, rientjes, linux-kernel, Sha Zhengju On Wed 07-11-12 16:41:36, Sha Zhengju wrote: > From: Sha Zhengju <handai.szj@taobao.com> > > Current, when a memcg oom is happening the oom dump messages is still global > state and provides few useful info for users. This patch prints more pointed > memcg page statistics for memcg-oom. > > Signed-off-by: Sha Zhengju <handai.szj@taobao.com> > Cc: Michal Hocko <mhocko@suse.cz> > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > Cc: David Rientjes <rientjes@google.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > --- > mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++------- > mm/oom_kill.c | 6 +++- > 2 files changed, 66 insertions(+), 11 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 0eab7d5..2df5e72 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c [...] 
> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, > spin_unlock_irqrestore(&memcg->move_lock, *flags); > } > > +#define K(x) ((x) << (PAGE_SHIFT-10)) > +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *mi; > + unsigned int i; > + > + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) { > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], > + K(mem_cgroup_read_stat(memcg, i))); > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) > + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], > + mem_cgroup_read_events(memcg, i)); > + > + for (i = 0; i < NR_LRU_LISTS; i++) > + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], > + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); > + } else { > + > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + long long val = 0; > + > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_stat(mi, i); > + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_events(mi, i); > + printk(KERN_CONT "%s:%llu ", > + mem_cgroup_events_names[i], val); > + } > + > + for (i = 0; i < NR_LRU_LISTS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); > + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); > + } > + } This is just plain ugly. for_each_mem_cgroup_tree is use_hierarchy aware and there is no need for if (use_hierarchy) part. 
The memcg != root_mem_cgroup test doesn't make much sense either, because in that case we'd call it a global oom killer ;) -- Michal Hocko SUSE Labs
* Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening 2012-11-07 22:17 ` Michal Hocko @ 2012-11-08 12:45 ` Sha Zhengju 0 siblings, 0 replies; 14+ messages in thread From: Sha Zhengju @ 2012-11-08 12:45 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, cgroups, kamezawa.hiroyu, akpm, rientjes, linux-kernel, Sha Zhengju On 11/08/2012 06:17 AM, Michal Hocko wrote: > On Wed 07-11-12 16:41:36, Sha Zhengju wrote: >> From: Sha Zhengju<handai.szj@taobao.com> >> >> Current, when a memcg oom is happening the oom dump messages is still global >> state and provides few useful info for users. This patch prints more pointed >> memcg page statistics for memcg-oom. >> >> Signed-off-by: Sha Zhengju<handai.szj@taobao.com> >> Cc: Michal Hocko<mhocko@suse.cz> >> Cc: KAMEZAWA Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com> >> Cc: David Rientjes<rientjes@google.com> >> Cc: Andrew Morton<akpm@linux-foundation.org> >> --- >> mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++------- >> mm/oom_kill.c | 6 +++- >> 2 files changed, 66 insertions(+), 11 deletions(-) >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 0eab7d5..2df5e72 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c > [...] 
>> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, >> spin_unlock_irqrestore(&memcg->move_lock, *flags); >> } >> >> +#define K(x) ((x)<< (PAGE_SHIFT-10)) >> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) >> +{ >> + struct mem_cgroup *mi; >> + unsigned int i; >> + >> + if (!memcg->use_hierarchy&& memcg != root_mem_cgroup) { >> + for (i = 0; i< MEM_CGROUP_STAT_NSTATS; i++) { >> + if (i == MEM_CGROUP_STAT_SWAP&& !do_swap_account) >> + continue; >> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], >> + K(mem_cgroup_read_stat(memcg, i))); >> + } >> + >> + for (i = 0; i< MEM_CGROUP_EVENTS_NSTATS; i++) >> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], >> + mem_cgroup_read_events(memcg, i)); >> + >> + for (i = 0; i< NR_LRU_LISTS; i++) >> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], >> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); >> + } else { >> + >> + for (i = 0; i< MEM_CGROUP_STAT_NSTATS; i++) { >> + long long val = 0; >> + >> + if (i == MEM_CGROUP_STAT_SWAP&& !do_swap_account) >> + continue; >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_read_stat(mi, i); >> + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); >> + } >> + >> + for (i = 0; i< MEM_CGROUP_EVENTS_NSTATS; i++) { >> + unsigned long long val = 0; >> + >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_read_events(mi, i); >> + printk(KERN_CONT "%s:%llu ", >> + mem_cgroup_events_names[i], val); >> + } >> + >> + for (i = 0; i< NR_LRU_LISTS; i++) { >> + unsigned long long val = 0; >> + >> + for_each_mem_cgroup_tree(mi, memcg) >> + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); >> + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); >> + } >> + } > This is just plain ugly. for_each_mem_cgroup_tree is use_hierarchy aware > and there is no need for if (use_hierarchy) part. 
> memcg != root_mem_cgroup test doesn't make much sense as well because we > call that a global oom killer ;) Yes... how bitterly I regret this patch... The else-part built on for_each_mem_cgroup_tree is already enough for the hierarchy case. I'll send an updated one later. Sorry for the noise. :( Thanks, Sha
* Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening 2012-11-07 8:41 ` [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Sha Zhengju 2012-11-07 18:02 ` David Rientjes 2012-11-07 22:17 ` Michal Hocko @ 2012-11-08 9:07 ` Kamezawa Hiroyuki 2012-11-08 13:12 ` Sha Zhengju 2 siblings, 1 reply; 14+ messages in thread From: Kamezawa Hiroyuki @ 2012-11-08 9:07 UTC (permalink / raw) To: Sha Zhengju Cc: linux-mm, cgroups, mhocko, akpm, rientjes, linux-kernel, Sha Zhengju (2012/11/07 17:41), Sha Zhengju wrote: > From: Sha Zhengju <handai.szj@taobao.com> > > Current, when a memcg oom is happening the oom dump messages is still global > state and provides few useful info for users. This patch prints more pointed > memcg page statistics for memcg-oom. > > Signed-off-by: Sha Zhengju <handai.szj@taobao.com> > Cc: Michal Hocko <mhocko@suse.cz> > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > Cc: David Rientjes <rientjes@google.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > --- > mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++------- > mm/oom_kill.c | 6 +++- > 2 files changed, 66 insertions(+), 11 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 0eab7d5..2df5e72 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = { > "pgmajfault", > }; > > +static const char * const mem_cgroup_lru_names[] = { > + "inactive_anon", > + "active_anon", > + "inactive_file", > + "active_file", > + "unevictable", > +}; > + Is this for the same strings with show_free_areas() ? > /* > * Per memcg event counter is incremented at every pagein/pageout. With THP, > * it will be incremated by the number of pages. 
This counter is used for > @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, > spin_unlock_irqrestore(&memcg->move_lock, *flags); > } > > +#define K(x) ((x) << (PAGE_SHIFT-10)) > +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) > +{ > + struct mem_cgroup *mi; > + unsigned int i; > + > + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) { Why do you need to have this condition check ? > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], > + K(mem_cgroup_read_stat(memcg, i))); Hm, how about using the same style with show_free_areas() ? > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) > + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], > + mem_cgroup_read_events(memcg, i)); > + I don't think EVENTS info is useful for oom. > + for (i = 0; i < NR_LRU_LISTS; i++) > + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], > + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); How far does your new information has different format than usual oom ? Could you show a sample and difference in changelog ? Of course, I prefer both of them has similar format. 
> + } else { > + > + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > + long long val = 0; > + > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > + continue; > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_stat(mi, i); > + printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val)); > + } > + > + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_read_events(mi, i); > + printk(KERN_CONT "%s:%llu ", > + mem_cgroup_events_names[i], val); > + } > + > + for (i = 0; i < NR_LRU_LISTS; i++) { > + unsigned long long val = 0; > + > + for_each_mem_cgroup_tree(mi, memcg) > + val += mem_cgroup_nr_lru_pages(mi, BIT(i)); > + printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val)); > + } > + } > + printk(KERN_CONT "\n"); > +} > /** > - * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode. > * @memcg: The memory cgroup that went over limit > * @p: Task that is going to be killed > * > @@ -1569,6 +1628,8 @@ done: > res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10, > res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10, > res_counter_read_u64(&memcg->kmem, RES_FAILCNT)); > + > + mem_cgroup_print_oom_stat(memcg); > } please put directly in print_oom_info() Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening 2012-11-08 9:07 ` Kamezawa Hiroyuki @ 2012-11-08 13:12 ` Sha Zhengju 0 siblings, 0 replies; 14+ messages in thread From: Sha Zhengju @ 2012-11-08 13:12 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: linux-mm, cgroups, mhocko, akpm, rientjes, linux-kernel, Sha Zhengju On 11/08/2012 05:07 PM, Kamezawa Hiroyuki wrote: > (2012/11/07 17:41), Sha Zhengju wrote: >> From: Sha Zhengju <handai.szj@taobao.com> >> >> Current, when a memcg oom is happening the oom dump messages is still global >> state and provides few useful info for users. This patch prints more pointed >> memcg page statistics for memcg-oom. >> >> Signed-off-by: Sha Zhengju <handai.szj@taobao.com> >> Cc: Michal Hocko <mhocko@suse.cz> >> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> >> Cc: David Rientjes <rientjes@google.com> >> Cc: Andrew Morton <akpm@linux-foundation.org> >> --- >> mm/memcontrol.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++------- >> mm/oom_kill.c | 6 +++- >> 2 files changed, 66 insertions(+), 11 deletions(-) >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 0eab7d5..2df5e72 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -118,6 +118,14 @@ static const char * const mem_cgroup_events_names[] = { >> "pgmajfault", >> }; >> >> +static const char * const mem_cgroup_lru_names[] = { >> + "inactive_anon", >> + "active_anon", >> + "inactive_file", >> + "active_file", >> + "unevictable", >> +}; >> + > Is this for the same strings with show_free_areas() ? > I just move the declaration here from the bottom of source file to make the following use error-free. >> /* >> * Per memcg event counter is incremented at every pagein/pageout. With THP, >> * it will be incremated by the number of pages. 
This counter is used for >> @@ -1501,8 +1509,59 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, >> spin_unlock_irqrestore(&memcg->move_lock, *flags); >> } >> >> +#define K(x) ((x) << (PAGE_SHIFT-10)) >> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg) >> +{ >> + struct mem_cgroup *mi; >> + unsigned int i; >> + >> + if (!memcg->use_hierarchy && memcg != root_mem_cgroup) { > Why do you need to have this condition check ? > Yes, the check is unnecessary... I'll remove it in the next version. >> + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { >> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) >> + continue; >> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i], >> + K(mem_cgroup_read_stat(memcg, i))); > Hm, how about using the same style as show_free_areas() ? > I'm also trying to do so. show_free_areas() prints the memory-related info in two styles: one in page units and the other in KB (I've no idea why they are distinguished), but I think the KB format is more readable. >> + } >> + >> + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) >> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i], >> + mem_cgroup_read_events(memcg, i)); >> + > I don't think EVENTS info is useful for oom. > It seems you're right. : ) >> + for (i = 0; i < NR_LRU_LISTS; i++) >> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i], >> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i)))); > How far does your new information's format differ from the usual oom output ? > Could you show a sample and the difference in the changelog ? > > Of course, I'd prefer both of them to have a similar format. > > > The new memcg-oom info leaves the global state out and prints the memcg statistics instead, which seems more concise. I'll add a sample next time. Thanks for reminding me! Thanks, Sha
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation 2012-11-07 8:40 [PATCH V2 0/2] Provide more precise dump info for memcg-oom Sha Zhengju 2012-11-07 8:41 ` [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Sha Zhengju @ 2012-11-07 8:41 ` Sha Zhengju 2012-11-07 18:07 ` David Rientjes 2012-11-07 22:34 ` Michal Hocko 2012-11-07 19:31 ` [PATCH V2 0/2] Provide more precise dump info for memcg-oom Andrew Morton 2 siblings, 2 replies; 14+ messages in thread From: Sha Zhengju @ 2012-11-07 8:41 UTC (permalink / raw) To: linux-mm, cgroups, mhocko, kamezawa.hiroyu, akpm, rientjes Cc: linux-kernel, Sha Zhengju From: Sha Zhengju <handai.szj@taobao.com> If a memcg oom is happening, don't scan all system tasks to dump the memory state of eligible tasks; instead, iterate only over the processes attached to the oom memcg and avoid the rcu lock. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> --- include/linux/memcontrol.h | 7 +++++ include/linux/oom.h | 2 + mm/memcontrol.c | 14 +++++++++++ mm/oom_kill.c | 55 ++++++++++++++++++++++++++----------------- 4 files changed, 56 insertions(+), 22 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index c91e3c1..4322ca8 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -122,6 +122,8 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list); void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); +extern void dump_tasks_memcg(const struct mem_cgroup *memcg, + const nodemask_t *nodemask); extern void mem_cgroup_replace_page_cache(struct page *oldpage, struct page *newpage); @@ -337,6 +339,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p) { } +static inline void +dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) +{ +} + static inline void mem_cgroup_begin_update_page_stat(struct page *page, bool *locked, unsigned long *flags) { diff --git a/include/linux/oom.h b/include/linux/oom.h index 20b5c46..9ba3344 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -57,6 +57,8 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, unsigned long totalpages, const nodemask_t *nodemask, bool force_kill); +extern inline void dump_per_task(struct task_struct *p, + const nodemask_t *nodemask); extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *mask, bool force_kill); extern int register_oom_notifier(struct notifier_block *nb); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2df5e72..fe648f8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1665,6 +1665,20 @@ static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) return min(limit, memsw); } +void dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) +{ + struct cgroup_iter it; + struct task_struct *task; + struct cgroup *cgroup = memcg->css.cgroup; + + cgroup_iter_start(cgroup, &it); + while ((task = cgroup_iter_next(cgroup, &it))) { + dump_per_task(task, nodemask); + } + + cgroup_iter_end(cgroup, &it); +} + static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, int order) { diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 4b8a6dd..aaf6237 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -367,6 +367,32 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, return chosen; } +inline void dump_per_task(struct task_struct *p, const nodemask_t *nodemask) +{ + struct task_struct *task; + + if (oom_unkillable_task(p, NULL, nodemask)) + return; + + task = find_lock_task_mm(p); + if (!task) { + /* + * This is a kthread or all of p's threads have already + * detached their mm's. 
There's no need to report + * them; they can't be oom killed anyway. + */ + return; + } + + pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", + task->pid, from_kuid(&init_user_ns, task_uid(task)), + task->tgid, task->mm->total_vm, get_mm_rss(task->mm), + task->mm->nr_ptes, + get_mm_counter(task->mm, MM_SWAPENTS), + task->signal->oom_score_adj, task->comm); + task_unlock(task); +} + /** * dump_tasks - dump current memory state of all system tasks * @memcg: current's memory controller, if constrained @@ -381,32 +407,17 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask) { struct task_struct *p; - struct task_struct *task; pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n"); - rcu_read_lock(); - for_each_process(p) { - if (oom_unkillable_task(p, memcg, nodemask)) - continue; - - task = find_lock_task_mm(p); - if (!task) { - /* - * This is a kthread or all of p's threads have already - * detached their mm's. There's no need to report - * them; they can't be oom killed anyway. - */ - continue; - } - pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", - task->pid, from_kuid(&init_user_ns, task_uid(task)), - task->tgid, task->mm->total_vm, get_mm_rss(task->mm), - task->mm->nr_ptes, - get_mm_counter(task->mm, MM_SWAPENTS), - task->signal->oom_score_adj, task->comm); - task_unlock(task); + if (memcg) { + dump_tasks_memcg(memcg, nodemask); + return; } + + rcu_read_lock(); + for_each_process(p) + dump_per_task(p, nodemask); rcu_read_unlock(); } -- 1.7.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation 2012-11-07 8:41 ` [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation Sha Zhengju @ 2012-11-07 18:07 ` David Rientjes 2012-11-07 22:34 ` Michal Hocko 1 sibling, 0 replies; 14+ messages in thread From: David Rientjes @ 2012-11-07 18:07 UTC (permalink / raw) To: Sha Zhengju Cc: linux-mm, cgroups, mhocko, kamezawa.hiroyu, akpm, linux-kernel, Sha Zhengju On Wed, 7 Nov 2012, Sha Zhengju wrote: > From: Sha Zhengju <handai.szj@taobao.com> > > If memcg oom happening, don't scan all system tasks to dump memory state of > eligible tasks, instead we iterates only over the process attached to the oom > memcg and avoid the rcu lock. > Avoiding the rcu lock isn't actually that impressive here, the cgroup iterator will use it's own lock for that memcg. > Signed-off-by: Sha Zhengju <handai.szj@taobao.com> > Cc: Michal Hocko <mhocko@suse.cz> > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > Cc: David Rientjes <rientjes@google.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > --- > include/linux/memcontrol.h | 7 +++++ > include/linux/oom.h | 2 + > mm/memcontrol.c | 14 +++++++++++ > mm/oom_kill.c | 55 ++++++++++++++++++++++++++----------------- > 4 files changed, 56 insertions(+), 22 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index c91e3c1..4322ca8 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -122,6 +122,8 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list); > void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int); > extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, > struct task_struct *p); > +extern void dump_tasks_memcg(const struct mem_cgroup *memcg, > + const nodemask_t *nodemask); Shouldn't need the nodemask parameter, just have dump_tasks_memcg() pass NULL to dump_per_task(), we won't be isolating to tasks with mempolicies restricted to a 
particular set of nodes since we're in the memcg oom path here, not the global page allocator oom path. > extern void mem_cgroup_replace_page_cache(struct page *oldpage, > struct page *newpage); > > @@ -337,6 +339,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) > { > } > > +static inline void > +dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > +{ > +} > + > static inline void mem_cgroup_begin_update_page_stat(struct page *page, > bool *locked, unsigned long *flags) > { > diff --git a/include/linux/oom.h b/include/linux/oom.h > index 20b5c46..9ba3344 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -57,6 +57,8 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, > unsigned long totalpages, const nodemask_t *nodemask, > bool force_kill); > > +extern inline void dump_per_task(struct task_struct *p, > + const nodemask_t *nodemask); This is a global symbol, so dump_per_task() doesn't make a lot of sense: it would need to be prefixed with "oom_" so perhaps oom_dump_task() is better? 
> extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > int order, nodemask_t *mask, bool force_kill); > extern int register_oom_notifier(struct notifier_block *nb); > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 2df5e72..fe648f8 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1665,6 +1665,20 @@ static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) > return min(limit, memsw); > } > > +void dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > +{ > + struct cgroup_iter it; > + struct task_struct *task; > + struct cgroup *cgroup = memcg->css.cgroup; > + > + cgroup_iter_start(cgroup, &it); > + while ((task = cgroup_iter_next(cgroup, &it))) { > + dump_per_task(task, nodemask); > + } > + > + cgroup_iter_end(cgroup, &it); > +} > + > static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > int order) > { > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 4b8a6dd..aaf6237 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -367,6 +367,32 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > return chosen; > } > > +inline void dump_per_task(struct task_struct *p, const nodemask_t *nodemask) No inline. > +{ > + struct task_struct *task; > + > + if (oom_unkillable_task(p, NULL, nodemask)) > + return; > + > + task = find_lock_task_mm(p); > + if (!task) { > + /* > + * This is a kthread or all of p's threads have already > + * detached their mm's. There's no need to report > + * them; they can't be oom killed anyway. 
> + */ > + return; > + } > + > + pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", > + task->pid, from_kuid(&init_user_ns, task_uid(task)), > + task->tgid, task->mm->total_vm, get_mm_rss(task->mm), > + task->mm->nr_ptes, > + get_mm_counter(task->mm, MM_SWAPENTS), > + task->signal->oom_score_adj, task->comm); > + task_unlock(task); > +} > + > /** > * dump_tasks - dump current memory state of all system tasks > * @memcg: current's memory controller, if constrained > @@ -381,32 +407,17 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > { > struct task_struct *p; > - struct task_struct *task; > > pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n"); > - rcu_read_lock(); > - for_each_process(p) { > - if (oom_unkillable_task(p, memcg, nodemask)) > - continue; > - > - task = find_lock_task_mm(p); > - if (!task) { > - /* > - * This is a kthread or all of p's threads have already > - * detached their mm's. There's no need to report > - * them; they can't be oom killed anyway. > - */ > - continue; > - } > > - pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", > - task->pid, from_kuid(&init_user_ns, task_uid(task)), > - task->tgid, task->mm->total_vm, get_mm_rss(task->mm), > - task->mm->nr_ptes, > - get_mm_counter(task->mm, MM_SWAPENTS), > - task->signal->oom_score_adj, task->comm); > - task_unlock(task); > + if (memcg) { > + dump_tasks_memcg(memcg, nodemask); > + return; > } > + > + rcu_read_lock(); > + for_each_process(p) > + dump_per_task(p, nodemask); > rcu_read_unlock(); > } > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation 2012-11-07 8:41 ` [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation Sha Zhengju 2012-11-07 18:07 ` David Rientjes @ 2012-11-07 22:34 ` Michal Hocko 2012-11-08 13:58 ` Sha Zhengju 1 sibling, 1 reply; 14+ messages in thread From: Michal Hocko @ 2012-11-07 22:34 UTC (permalink / raw) To: Sha Zhengju Cc: linux-mm, cgroups, kamezawa.hiroyu, akpm, rientjes, linux-kernel, Sha Zhengju On Wed 07-11-12 16:41:59, Sha Zhengju wrote: > From: Sha Zhengju <handai.szj@taobao.com> > > If memcg oom happening, don't scan all system tasks to dump memory state of > eligible tasks, instead we iterates only over the process attached to the oom > memcg and avoid the rcu lock. you have replaced rcu lock by css_set_lock which is, well, heavier than rcu. Besides that the patch is not correct because you have excluded all tasks that are from subgroups because you iterate only through the top level one. I am not sure the whole optimization would be a win even if implemented correctly. Well, we scan through more tasks currently and most of them are not relevant but then you would need to exclude task_in_mem_cgroup from oom_unkillable_task and that would be more code churn than the win. > Signed-off-by: Sha Zhengju <handai.szj@taobao.com> > Cc: Michal Hocko <mhocko@suse.cz> > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > Cc: David Rientjes <rientjes@google.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > --- > include/linux/memcontrol.h | 7 +++++ > include/linux/oom.h | 2 + > mm/memcontrol.c | 14 +++++++++++ > mm/oom_kill.c | 55 ++++++++++++++++++++++++++----------------- > 4 files changed, 56 insertions(+), 22 deletions(-) > [...] 
> diff --git a/include/linux/oom.h b/include/linux/oom.h > index 20b5c46..9ba3344 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -57,6 +57,8 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, > unsigned long totalpages, const nodemask_t *nodemask, > bool force_kill); > > +extern inline void dump_per_task(struct task_struct *p, > + const nodemask_t *nodemask); > extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > int order, nodemask_t *mask, bool force_kill); > extern int register_oom_notifier(struct notifier_block *nb); > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 2df5e72..fe648f8 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1665,6 +1665,20 @@ static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) > return min(limit, memsw); > } > > +void dump_tasks_memcg(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > +{ > + struct cgroup_iter it; > + struct task_struct *task; > + struct cgroup *cgroup = memcg->css.cgroup; > + > + cgroup_iter_start(cgroup, &it); > + while ((task = cgroup_iter_next(cgroup, &it))) { > + dump_per_task(task, nodemask); > + } > + > + cgroup_iter_end(cgroup, &it); > +} > + > static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > int order) > { > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 4b8a6dd..aaf6237 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -367,6 +367,32 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > return chosen; > } > > +inline void dump_per_task(struct task_struct *p, const nodemask_t *nodemask) > +{ > + struct task_struct *task; > + > + if (oom_unkillable_task(p, NULL, nodemask)) > + return; > + > + task = find_lock_task_mm(p); > + if (!task) { > + /* > + * This is a kthread or all of p's threads have already > + * detached their mm's. There's no need to report > + * them; they can't be oom killed anyway. 
> + */ > + return; > + } > + > + pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", > + task->pid, from_kuid(&init_user_ns, task_uid(task)), > + task->tgid, task->mm->total_vm, get_mm_rss(task->mm), > + task->mm->nr_ptes, > + get_mm_counter(task->mm, MM_SWAPENTS), > + task->signal->oom_score_adj, task->comm); > + task_unlock(task); > +} > + > /** > * dump_tasks - dump current memory state of all system tasks > * @memcg: current's memory controller, if constrained > @@ -381,32 +407,17 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask) > { > struct task_struct *p; > - struct task_struct *task; > > pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n"); > - rcu_read_lock(); > - for_each_process(p) { > - if (oom_unkillable_task(p, memcg, nodemask)) > - continue; > - > - task = find_lock_task_mm(p); > - if (!task) { > - /* > - * This is a kthread or all of p's threads have already > - * detached their mm's. There's no need to report > - * them; they can't be oom killed anyway. > - */ > - continue; > - } > > - pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", > - task->pid, from_kuid(&init_user_ns, task_uid(task)), > - task->tgid, task->mm->total_vm, get_mm_rss(task->mm), > - task->mm->nr_ptes, > - get_mm_counter(task->mm, MM_SWAPENTS), > - task->signal->oom_score_adj, task->comm); > - task_unlock(task); > + if (memcg) { > + dump_tasks_memcg(memcg, nodemask); > + return; > } > + > + rcu_read_lock(); > + for_each_process(p) > + dump_per_task(p, nodemask); > rcu_read_unlock(); > } > > -- > 1.7.6.1 > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation 2012-11-07 22:34 ` Michal Hocko @ 2012-11-08 13:58 ` Sha Zhengju 0 siblings, 0 replies; 14+ messages in thread From: Sha Zhengju @ 2012-11-08 13:58 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, cgroups, kamezawa.hiroyu, akpm, rientjes, linux-kernel, Sha Zhengju On 11/08/2012 06:34 AM, Michal Hocko wrote: > On Wed 07-11-12 16:41:59, Sha Zhengju wrote: >> From: Sha Zhengju<handai.szj@taobao.com> >> >> If memcg oom happening, don't scan all system tasks to dump memory state of >> eligible tasks, instead we iterates only over the process attached to the oom >> memcg and avoid the rcu lock. > you have replaced rcu lock by css_set_lock which is, well, heavier than > rcu. Besides that the patch is not correct because you have excluded > all tasks that are from subgroups because you iterate only through the > top level one. > I am not sure the whole optimization would be a win even if implemented > correctly. Well, we scan through more tasks currently and most of them > are not relevant but then you would need to exclude task_in_mem_cgroup > from oom_unkillable_task and that would be more code churn than the > win. Thanks for your and David's advice. This piece was trying to save some expense while dumping memcg tasks, but it fails to scan subgroups because it iterates only the one cgroup. I agree with your cost-versus-win assessment, so I've decided to give this one up. : ) Thanks, Sha ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH V2 0/2] Provide more precise dump info for memcg-oom 2012-11-07 8:40 [PATCH V2 0/2] Provide more precise dump info for memcg-oom Sha Zhengju 2012-11-07 8:41 ` [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Sha Zhengju 2012-11-07 8:41 ` [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation Sha Zhengju @ 2012-11-07 19:31 ` Andrew Morton 2 siblings, 0 replies; 14+ messages in thread From: Andrew Morton @ 2012-11-07 19:31 UTC (permalink / raw) To: Sha Zhengju Cc: linux-mm, cgroups, mhocko, kamezawa.hiroyu, rientjes, linux-kernel, Sha Zhengju On Wed, 7 Nov 2012 16:40:02 +0800 Sha Zhengju <handai.szj@gmail.com> wrote: > When memcg oom is happening the current memcg related dump information > is limited for debugging. The patches provide more detailed memcg page statistics > and also take hierarchy into consideration. Within the changelogs, please include a sample of the proposed output so we can properly review the proposal. Also it would be useful to provide some justification for the decisions in this patch: which data is displayed and, particularly, which is not? Why is the displayed information useful to developers, etc. ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2012-11-08 13:58 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-07  8:40 [PATCH V2 0/2] Provide more precise dump info for memcg-oom Sha Zhengju
2012-11-07  8:41 ` [PATCH 1/2] memcg, oom: provide more precise dump info while memcg oom happening Sha Zhengju
2012-11-07 18:02   ` David Rientjes
2012-11-08 12:37     ` Sha Zhengju
2012-11-08 12:44       ` Michal Hocko
2012-11-07 22:17   ` Michal Hocko
2012-11-08 12:45     ` Sha Zhengju
2012-11-08  9:07   ` Kamezawa Hiroyuki
2012-11-08 13:12     ` Sha Zhengju
2012-11-07  8:41 ` [PATCH 2/2] oom: rework dump_tasks to optimize memcg-oom situation Sha Zhengju
2012-11-07 18:07   ` David Rientjes
2012-11-07 22:34   ` Michal Hocko
2012-11-08 13:58     ` Sha Zhengju
2012-11-07 19:31 ` [PATCH V2 0/2] Provide more precise dump info for memcg-oom Andrew Morton